Self-Organizing Map Oversampling (SOMO) for imbalanced data set learning

Research output: Contribution to journal, Article

21 Citations (Scopus)

Abstract

Learning from imbalanced datasets is challenging for standard algorithms, as they are designed to work with balanced class distributions. Although there are different strategies to tackle this problem, methods that address it by generating artificial data constitute a more general approach than algorithmic modifications. Specifically, they generate artificial data that can be used by any algorithm, without constraining the options of the user. In this paper, we present a new oversampling method, Self-Organizing Map-based Oversampling (SOMO), which applies a Self-Organizing Map to produce a two-dimensional representation of the input space, allowing for an effective generation of artificial data points. SOMO comprises three major stages: initially, a Self-Organizing Map produces a two-dimensional representation of the original, usually high-dimensional, space. Next, it generates within-cluster synthetic samples and, finally, it generates between-cluster synthetic samples. Additionally, we present empirical results that show the improvement in the performance of algorithms when artificial data generated by SOMO are used, and that also show that our method outperforms various oversampling methods.
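
The last two stages named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the Self-Organizing Map has already been trained and that each minority sample has been assigned the grid coordinates of its map cell (the hypothetical `cells` array). It also assumes that synthetic points are made by linear interpolation between two minority samples, and that "between-cluster" means pairs drawn from cells that are adjacent on the grid. Any cluster filtering or weighting the paper may apply is omitted, and the function names are illustrative.

```python
import numpy as np


def interpolate(a, b, rng):
    """Return a random point on the line segment between a and b."""
    return a + rng.uniform(0.0, 1.0) * (b - a)


def within_cluster_samples(X_min, cells, n_new, rng):
    """Interpolate between pairs of minority points that share a map cell.

    X_min: (n, d) minority samples; cells: (n, 2) integer grid coordinates
    (assumed to come from an already trained SOM).
    """
    d = X_min.shape[1]
    groups = [X_min[(cells == c).all(axis=1)] for c in np.unique(cells, axis=0)]
    # Only cells holding at least two minority points can supply a pair.
    groups = [g for g in groups if len(g) >= 2]
    out = []
    if groups:
        for _ in range(n_new):
            g = groups[rng.integers(len(groups))]
            i, j = rng.choice(len(g), size=2, replace=False)
            out.append(interpolate(g[i], g[j], rng))
    return np.array(out).reshape(-1, d)


def between_cluster_samples(X_min, cells, n_new, rng):
    """Interpolate between minority points from grid-adjacent map cells."""
    d = X_min.shape[1]
    uniq = np.unique(cells, axis=0)
    # Assumption: two cells are neighbours when their Manhattan distance is 1.
    pairs = [(a, b) for i, a in enumerate(uniq) for b in uniq[i + 1:]
             if np.abs(a - b).sum() == 1]
    out = []
    if pairs:
        for _ in range(n_new):
            a, b = pairs[rng.integers(len(pairs))]
            pa = X_min[(cells == a).all(axis=1)]
            pb = X_min[(cells == b).all(axis=1)]
            out.append(interpolate(pa[rng.integers(len(pa))],
                                   pb[rng.integers(len(pb))], rng))
    return np.array(out).reshape(-1, d)


# Toy usage: five minority points spread over three map cells.
rng = np.random.default_rng(0)
X_min = np.array([[0., 0.], [1., 0.], [0., 1.], [5., 5.], [6., 5.]])
cells = np.array([[0, 0], [0, 0], [0, 1], [2, 2], [2, 2]])
w = within_cluster_samples(X_min, cells, 4, rng)
b = between_cluster_samples(X_min, cells, 3, rng)
```

Because every synthetic point is a convex combination of two existing minority samples, it always lies on the segment joining them. In the toy data, cell (2, 2) has no grid neighbour, so it contributes only to the within-cluster samples.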
Original language: English
Pages (from-to): 40-52
Number of pages: 13
Journal: Expert Systems with Applications
Volume: 82
DOIs: https://doi.org/10.1016/j.eswa.2017.03.073
Publication status: Published - 1 Oct 2017

Cite this

@article{1d8b0719fe91433fb2e4369febb8e422,
title = "Self-Organizing Map Oversampling (SOMO) for imbalanced data set learning",
abstract = "Learning from imbalanced datasets is challenging for standard algorithms, as they are designed to work with balanced class distributions. Although there are different strategies to tackle this problem, methods that address it by generating artificial data constitute a more general approach than algorithmic modifications. Specifically, they generate artificial data that can be used by any algorithm, without constraining the options of the user. In this paper, we present a new oversampling method, Self-Organizing Map-based Oversampling (SOMO), which applies a Self-Organizing Map to produce a two-dimensional representation of the input space, allowing for an effective generation of artificial data points. SOMO comprises three major stages: initially, a Self-Organizing Map produces a two-dimensional representation of the original, usually high-dimensional, space. Next, it generates within-cluster synthetic samples and, finally, it generates between-cluster synthetic samples. Additionally, we present empirical results that show the improvement in the performance of algorithms when artificial data generated by SOMO are used, and that also show that our method outperforms various oversampling methods.",
author = "Georgios Douzas and Fernando Ba{\c c}{\~a}o",
note = "Douzas, G., \& Ba{\c c}{\~a}o, F. (2017). Self-Organizing Map Oversampling (SOMO) for imbalanced data set learning. Expert Systems with Applications, 82, 40-52. https://doi.org/10.1016/j.eswa.2017.03.073",
year = "2017",
month = "10",
day = "1",
doi = "10.1016/j.eswa.2017.03.073",
language = "English",
volume = "82",
pages = "40--52",
journal = "Expert Systems with Applications",
issn = "0957-4174",
publisher = "Elsevier Science B.V., Amsterdam.",

}

TY - JOUR

T1 - Self-Organizing Map Oversampling (SOMO) for imbalanced data set learning

AU - Douzas, Georgios

AU - Bação, Fernando

N1 - Douzas, G., & Bação, F. (2017). Self-Organizing Map Oversampling (SOMO) for imbalanced data set learning. Expert Systems with Applications, 82, 40-52. https://doi.org/10.1016/j.eswa.2017.03.073

PY - 2017/10/1

Y1 - 2017/10/1

N2 - Learning from imbalanced datasets is challenging for standard algorithms, as they are designed to work with balanced class distributions. Although there are different strategies to tackle this problem, methods that address it by generating artificial data constitute a more general approach than algorithmic modifications. Specifically, they generate artificial data that can be used by any algorithm, without constraining the options of the user. In this paper, we present a new oversampling method, Self-Organizing Map-based Oversampling (SOMO), which applies a Self-Organizing Map to produce a two-dimensional representation of the input space, allowing for an effective generation of artificial data points. SOMO comprises three major stages: initially, a Self-Organizing Map produces a two-dimensional representation of the original, usually high-dimensional, space. Next, it generates within-cluster synthetic samples and, finally, it generates between-cluster synthetic samples. Additionally, we present empirical results that show the improvement in the performance of algorithms when artificial data generated by SOMO are used, and that also show that our method outperforms various oversampling methods.

AB - Learning from imbalanced datasets is challenging for standard algorithms, as they are designed to work with balanced class distributions. Although there are different strategies to tackle this problem, methods that address it by generating artificial data constitute a more general approach than algorithmic modifications. Specifically, they generate artificial data that can be used by any algorithm, without constraining the options of the user. In this paper, we present a new oversampling method, Self-Organizing Map-based Oversampling (SOMO), which applies a Self-Organizing Map to produce a two-dimensional representation of the input space, allowing for an effective generation of artificial data points. SOMO comprises three major stages: initially, a Self-Organizing Map produces a two-dimensional representation of the original, usually high-dimensional, space. Next, it generates within-cluster synthetic samples and, finally, it generates between-cluster synthetic samples. Additionally, we present empirical results that show the improvement in the performance of algorithms when artificial data generated by SOMO are used, and that also show that our method outperforms various oversampling methods.

UR - http://www.scopus.com/inward/record.url?scp=85017142343&partnerID=8YFLogxK

U2 - 10.1016/j.eswa.2017.03.073

DO - 10.1016/j.eswa.2017.03.073

M3 - Article

VL - 82

SP - 40

EP - 52

JO - Expert Systems with Applications

JF - Expert Systems with Applications

SN - 0957-4174

ER -