Distributed data augmented support vector machine on spark

Nguyen, Tu Dinh, Nguyen, Tien Vu, Le, Trung Minh and Phung, Quoc-Dinh 2016, Distributed data augmented support vector machine on spark, in 2016 23rd International Conference on Pattern Recognition (ICPR 2016), IEEE, Piscataway, N.J., pp. 498-503, doi: 10.1109/ICPR.2016.7899683.

Title Distributed data augmented support vector machine on spark
Author(s) Nguyen, Tu Dinh
Nguyen, Tien Vu (ORCID: orcid.org/0000-0002-7070-8093)
Le, Trung Minh (ORCID: orcid.org/0000-0002-9977-8247)
Phung, Quoc-Dinh
Conference name Pattern Recognition. Conference (23rd : 2016 : Cancun, Mexico)
Conference location Cancun, Mexico
Conference dates 2016/12/04 - 2016/12/08
Title of proceedings 2016 23rd International Conference on Pattern Recognition (ICPR 2016)
Editor(s) [Unknown]
Publication date 2016
Start page 498
End page 503
Total pages 6
Publisher IEEE
Place of publication Piscataway, N.J.
Keyword(s) Apache Spark
support vector machine
large-scale classification
Science & Technology
Computer Science, Artificial Intelligence
Computer Science
distributed computing
big data
Summary Support vector machines (SVMs) are widely used for classification in machine learning and data mining, but they have traditionally been applied to small and medium-sized datasets. The recent need to scale with data size has drawn research attention to new methods and implementations of SVMs that operate at scale. Distributed SVMs have only recently been studied, and a distributed implementation of SVM with data augmentation has not yet been developed. This paper introduces a distributed data-augmented implementation of SVM on Apache Spark, an advanced and popular platform for distributed computing that is widely used in both research and industry. We term our implementation the sparkling vector machine (SkVM); it supports both classification and regression tasks while scanning the data exactly once. In addition, we develop a framework to handle data with new classes arriving in an online classification setting, where incoming data points may carry labels that have not previously been seen - a problem we term label-drift classification. We demonstrate the scalability of the proposed method on large-scale datasets with more than one hundred million data points. The experimental results show that the predictive performance of our method is comparable to or better than that of the baselines, while the execution time is faster by an order of magnitude.
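The abstract describes single-pass distributed SVM training, but the paper's actual SkVM algorithm (data-augmented SVM on Spark) is not reproduced in this record. As a loose, hypothetical illustration only, the sketch below shows how a one-scan distributed linear SVM could work: each data partition contributes one averaged hinge-loss subgradient step, with plain Python lists standing in for Spark RDD partitions. All function names and parameters are illustrative assumptions, not the authors' code.

```python
import random

def hinge_subgrad(w, b, X, y, lam):
    """Averaged subgradient of lam/2*||w||^2 + mean hinge loss on one partition.
    (Illustrative helper, not part of the paper's SkVM implementation.)"""
    n = len(X)
    gw = [lam * wj for wj in w]  # gradient of the L2 regularizer
    gb = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        if margin < 1.0:  # point violates the margin: hinge term is active
            for j, xj in enumerate(xi):
                gw[j] -= yi * xj / n
            gb -= yi / n
    return gw, gb

def single_pass_svm(partitions, dim, lam=0.01, lr=0.5):
    """One scan over the data: each partition yields one subgradient step,
    mimicking a map-reduce round per partition in a distributed setting."""
    w, b = [0.0] * dim, 0.0
    for X, y in partitions:
        gw, gb = hinge_subgrad(w, b, X, y, lam)
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b
```

In an actual Spark job the per-partition gradient would be computed by the executors (e.g. via an aggregate over the RDD) and combined on the driver; the loop above simulates those rounds sequentially.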
ISBN 9781509048472
ISSN 1051-4651
Language eng
DOI 10.1109/ICPR.2016.7899683
HERDC Research category E1 Full written paper - refereed
ERA Research output type E Conference publication
Copyright notice ©2016, IEEE
Persistent URL http://hdl.handle.net/10536/DRO/DU:30097143

Unless expressly stated otherwise, the copyright for items in DRO is owned by the author, with all rights reserved.

Citation counts: Web of Science: cited 3 times; Scopus: cited 6 times
Access statistics: 246 abstract views, 4 file downloads
Created: Tue, 22 Aug 2017, 18:48:13 EST

Every reasonable effort has been made to ensure that permission has been obtained for items included in DRO. If you believe that your rights have been infringed by this repository, please contact drosupport@deakin.edu.au.