Deakin University

Sample subset optimization for classifying imbalanced biological data

conference contribution
posted on 2011-01-01, 00:00 authored by P Yang, Zili Zhang, B Zhou, A Zomaya
Data in many biological problems are often complicated by an imbalanced class distribution. That is, the positive examples may be largely outnumbered by the negative examples. Many classification algorithms, such as the support vector machine (SVM), are sensitive to data with an imbalanced class distribution and produce suboptimal classification as a result. It is desirable to compensate for the imbalance effect in model training to achieve more accurate classification. In this study, we propose a sample subset optimization technique for classifying biological data with moderately and extremely imbalanced class distributions. By using this optimization technique with an ensemble of SVMs, we build multiple roughly balanced SVM base classifiers, each trained on an optimized sample subset. The experimental results demonstrate that the ensemble of SVMs created by our sample subset optimization technique can achieve a higher area under the ROC curve (AUC) value than popular sampling approaches such as random over-/under-sampling and SMOTE sampling, and than the sampling used in widely used ensemble approaches such as bagging and boosting.
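The abstract does not describe how the optimized subsets are selected, so the following is only a minimal sketch of the surrounding ideas, not the authors' method. It shows random under-sampling (one of the baselines named above) used to form roughly balanced subsets, averaging of base-classifier scores, and AUC computed as the fraction of positive/negative pairs ranked correctly. The function names `balanced_subsets`, `average_scores`, and `auc` are hypothetical, and training of the SVM base classifiers is left out.

```python
import random
from typing import List, Sequence


def balanced_subsets(labels: Sequence[int], n_subsets: int, seed: int = 0) -> List[List[int]]:
    """Build index subsets by random under-sampling (a baseline, NOT the
    paper's optimized selection). Each subset keeps every minority-class
    sample and an equally sized random draw from the majority class."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return [sorted(minority + rng.sample(majority, len(minority)))
            for _ in range(n_subsets)]


def average_scores(score_lists: Sequence[Sequence[float]]) -> List[float]:
    """Combine base classifiers by averaging their per-sample scores
    (an assumed combination rule; the abstract does not specify one)."""
    n = len(score_lists)
    return [sum(column) / n for column in zip(*score_lists)]


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve: the share of (positive, negative) pairs in
    which the positive sample scores higher; ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

In use, each subset returned by `balanced_subsets` would train one base classifier, the classifiers' scores on held-out data would be passed to `average_scores`, and the combined scores would be evaluated with `auc`.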

History

Pagination

333-344

Location

Shenzhen, China

Start date

2011-05-24

End date

2011-05-27

ISSN

0302-9743

ISBN-13

9783642208461

ISBN-10

3642208460

Language

eng

Publication classification

E1 Full written paper - refereed

Copyright notice

2011, Springer-Verlag Berlin Heidelberg

Extent

45

Editor/Contributor(s)

Huang J, Cao L, Srivastava J

Title of proceedings

PAKDD 2011 : advances in knowledge discovery and data mining : 15th Pacific-Asia Conference, PAKDD 2011, Shenzhen, China, May 24-27, 2011, proceedings, part II

Event

Knowledge Discovery and Data Mining. Pacific-Asia Conference (15th : 2011 : Shenzhen, China)

Publisher

Springer-Verlag

Place of publication

Berlin, Germany

Series

Lecture notes in artificial intelligence : 6635
