File(s) under permanent embargo
Sample subset optimization for classifying imbalanced biological data
conference contribution
posted on 2011-01-01, 00:00 authored by P Yang, Zili Zhang, B Zhou, A Zomaya
Data in many biological problems have an imbalanced class distribution. That is, the positive examples may be greatly outnumbered by the negative examples. Many classification algorithms, such as support vector machines (SVMs), are sensitive to data with an imbalanced class distribution and produce suboptimal classifications as a result. It is desirable to compensate for the imbalance effect during model training to obtain more accurate classification. In this study, we propose a sample subset optimization technique for classifying biological data with moderately and extremely imbalanced class distributions. By using this optimization technique with an ensemble of SVMs, we build multiple roughly balanced SVM base classifiers, each trained on an optimized sample subset. The experimental results demonstrate that the ensemble of SVMs created by our sample subset optimization technique can achieve a higher area under the ROC curve (AUC) value than popular sampling approaches such as random over-/under-sampling and SMOTE sampling, and than the sampling used in widely used ensemble approaches such as bagging and boosting.
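The abstract does not describe how the sample subsets are optimized, so the following is only a minimal sketch of the surrounding idea: build several roughly balanced training subsets, train one base classifier per subset, average their scores, and evaluate with AUC. It assumes plain random under-sampling of the majority class (one of the baselines the abstract compares against, not the proposed optimization) and uses a toy one-dimensional centroid scorer as a stand-in for an SVM. All function names are hypothetical.

```python
import random
from statistics import mean


def balanced_subsets(labels, n_subsets, seed=0):
    """Return index lists holding every minority example plus an equal-sized
    random draw from the majority class (plain random under-sampling)."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = sorted([pos, neg], key=len)
    return [minority + rng.sample(majority, len(minority))
            for _ in range(n_subsets)]


def fit_centroid_scorer(xs, ys):
    """Toy base learner (stand-in for an SVM, an assumption of this sketch):
    a point scores higher the closer it is to the positive centroid."""
    mu_pos = mean(x for x, y in zip(xs, ys) if y == 1)
    mu_neg = mean(x for x, y in zip(xs, ys) if y == 0)
    return lambda x: abs(x - mu_neg) - abs(x - mu_pos)


def ensemble_scores(xs, ys, test_xs, n_subsets=5, seed=0):
    """Train one base scorer per balanced subset and average their scores."""
    scorers = [
        fit_centroid_scorer([xs[i] for i in idx], [ys[i] for i in idx])
        for idx in balanced_subsets(ys, n_subsets, seed)
    ]
    return [mean(s(x) for s in scorers) for x in test_xs]


def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly,
    counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Calling `ensemble_scores(train_x, train_y, test_x)` and passing the result to `auc` together with the test labels gives a single ranking-quality number. Replacing the random draw in `balanced_subsets` with an optimized selection is where a technique like the one described in the abstract would plug in.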
History
Event
Knowledge Discovery and Data Mining. Pacific-Asia Conference (15th : 2011 : Shenzhen, China)
Source
Advances in knowledge discovery and data mining : 15th Pacific-Asia Conference, PAKDD 2011, Shenzhen, China, May 24-27, 2011, proceedings, part II
Series
Lecture notes in artificial intelligence : 6635
Pagination
333 - 344
Publisher
Springer-Verlag
Location
Shenzhen, China
Place of publication
Berlin, Germany
Start date
2011-05-24
End date
2011-05-27
ISSN
0302-9743
ISBN-13
9783642208461
ISBN-10
3642208460
Language
eng
Publication classification
E1 Full written paper - refereed
Copyright notice
2011, Springer-Verlag Berlin Heidelberg
Extent
45
Editor/Contributor(s)
J Huang, L Cao, J Srivastava
Title of proceedings
PAKDD 2011 : Advances in knowledge discovery and data mining : 15th Pacific-Asia Conference, PAKDD 2011, Shenzhen, China, May 24-27, 2011, proceedings, part II