A hybrid approach to clustering in big data

Kumar, Dheeraj, Bezdek, James C., Palaniswami, Marimuthu, Rajasegarar, Sutharshan, Leckie, Christopher and Havens, Timothy Craig 2016, A hybrid approach to clustering in big data, IEEE transactions on cybernetics, vol. 46, no. 10, pp. 2372-2385, doi: 10.1109/TCYB.2015.2477416.


Title A hybrid approach to clustering in big data
Author(s) Kumar, Dheeraj
Bezdek, James C.
Palaniswami, Marimuthu
Rajasegarar, Sutharshan (ORCID iD: orcid.org/0000-0002-6559-6736)
Leckie, Christopher
Havens, Timothy Craig
Journal name IEEE transactions on cybernetics
Volume number 46
Issue number 10
Start page 2372
End page 2385
Total pages 14
Publisher IEEE
Place of publication Piscataway, N.J.
Publication date 2016-10
ISSN 2168-2267
Keyword(s) Big data cluster analysis
cluster tendency assessment
data analytics
Internet of things
single linkage
Summary Clustering of big data has received much attention recently. In this paper, we present a new clusiVAT algorithm and compare it with four other popular data clustering algorithms. Three of the four comparison methods are based on the well-known, classical batch k-means model. Specifically, we use k-means, single pass k-means, online k-means, and clustering using representatives (CURE) for numerical comparisons. clusiVAT is based on sampling the data, imaging the reordered distance matrix to estimate the number of clusters in the data visually, clustering the samples using a relative of single linkage (SL), and then noniteratively extending the labels to the rest of the dataset using the nearest prototype rule. Previous work has established that clusiVAT produces true SL clusters in compact-separated data. We have performed experiments to show that k-means and its variants suffer from initialization issues that cause many failures. On the other hand, clusiVAT needs no initialization, and almost always finds partitions that accurately match ground truth labels in labeled data. CURE also finds SL-type partitions but is much slower than the other four algorithms. In our experiments, clusiVAT proves to be the fastest and most accurate of the five algorithms; e.g., it recovers 97% of the ground truth labels in the real-world KDD-99 cup data (4 292 637 samples in 41 dimensions) in 76 s.
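The sample, cluster, and extend pipeline described in the summary can be illustrated with a minimal Python sketch. This is not the authors' clusiVAT implementation. It rests on assumptions that this record does not confirm: uniform random sampling stands in for the paper's sampling scheme, plain single linkage (cutting the longest edges of a minimum spanning tree) stands in for the "relative of single linkage", and the number of clusters `k` is passed in directly rather than estimated visually from the reordered distance-matrix image. The function names `single_linkage_labels` and `sample_cluster_extend` are hypothetical.

```python
import math
import random


def single_linkage_labels(points, k):
    """Split points into k single-linkage clusters.

    Builds a minimum spanning tree with Prim's algorithm, then drops the
    k-1 longest edges; the connected components that remain are the clusters.
    """
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known edge from the tree to each point
    best_from = [-1] * n         # tree endpoint of that cheapest edge
    best_dist[0] = 0.0
    edges = []                   # MST edges as (length, u, v)
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_dist[u], best_from[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v], best_from[v] = d, u

    # Keep the n-k shortest MST edges and merge their endpoints (union-find).
    edges.sort()
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for _, u, v in edges[: n - k]:
        parent[find(u)] = find(v)

    # Relabel component roots as 0, 1, ..., k-1.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


def sample_cluster_extend(data, k, sample_size, seed=0):
    """Cluster a random sample, then label every point by its nearest sample.

    Assumption: uniform random sampling, not the paper's sampling scheme.
    """
    rng = random.Random(seed)
    sample = [data[i] for i in rng.sample(range(len(data)), sample_size)]
    sample_labels = single_linkage_labels(sample, k)
    labels = []
    for x in data:
        # Nearest prototype rule: copy the label of the closest sampled point.
        nearest = min(range(sample_size), key=lambda j: math.dist(x, sample[j]))
        labels.append(sample_labels[nearest])
    return labels
```

Only the sample is clustered, so the quadratic cost of single linkage depends on `sample_size` rather than on the full dataset; the extension step is a single non-iterative pass over the remaining points.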
Language eng
DOI 10.1109/TCYB.2015.2477416
Field of Research 080109 Pattern Recognition and Data Mining
Socio Economic Objective 899999 Information and Communication Services not elsewhere classified
HERDC Research category C1.1 Refereed article in a scholarly journal
ERA Research output type C Journal article
Copyright notice ©2015, IEEE
Persistent URL http://hdl.handle.net/10536/DRO/DU:30087240

Unless expressly stated otherwise, the copyright for items in DRO is owned by the author, with all rights reserved.

Citation counts: Cited 31 times in TR Web of Science
Cited 36 times in Scopus
Created: Mon, 17 Oct 2016, 07:51:34 EST
