A hybrid approach to clustering in big data

Kumar, D; Bezdek, JC; Palaniswami, M; Rajasegarar, Sutharshan; Leckie, C; Havens, TC

journal contribution

posted on 2024-06-04, 06:14 authored by D Kumar, JC Bezdek, M Palaniswami, Sutharshan Rajasegarar, C Leckie, TC Havens

Clustering of big data has received much attention recently. In this paper, we present a new clusiVAT algorithm and compare it with four other popular data clustering algorithms. Three of the four comparison methods are based on the well-known, classical batch k-means model. Specifically, we use k-means, single pass k-means, online k-means, and clustering using representatives (CURE) for numerical comparisons. clusiVAT is based on sampling the data, imaging the reordered distance matrix to estimate the number of clusters in the data visually, clustering the samples using a relative of single linkage (SL), and then noniteratively extending the labels to the rest of the dataset using the nearest prototype rule. Previous work has established that clusiVAT produces true SL clusters in compact-separated data. We have performed experiments to show that k-means and its variants suffer from initialization issues that cause many failures. On the other hand, clusiVAT needs no initialization, and almost always finds partitions that accurately match ground truth labels in labeled data. CURE also finds SL-type partitions but is much slower than the other four algorithms. In our experiments, clusiVAT proves to be the fastest and most accurate of the five algorithms; e.g., it recovers 97% of the ground truth labels in the real-world KDD-99 cup data (4 292 637 samples in 41 dimensions) in 76 s.
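The sample, cluster, and extend pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' clusiVAT implementation. It assumes plain uniform random sampling, takes the number of clusters k as an input in place of the visual estimate from the reordered distance image, obtains single linkage clusters by cutting the k-1 longest edges of a minimum spanning tree of the sample, and treats every sampled point as a prototype for the nearest prototype rule. The function names and parameters are hypothetical.

```python
import numpy as np


def single_linkage_labels(S, k):
    """Single-linkage partition of the sample S into k clusters.

    Builds a minimum spanning tree with Prim's algorithm, then removes the
    k-1 longest edges; the remaining connected components are the clusters.
    """
    n = len(S)
    D = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=2)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = D[0].copy()               # cheapest known distance to the tree
    parent = np.zeros(n, dtype=int)  # tree vertex achieving that distance
    edges = []
    for _ in range(n - 1):
        masked = np.where(in_tree, np.inf, best)
        v = int(np.argmin(masked))
        edges.append((float(masked[v]), int(parent[v]), v))
        in_tree[v] = True
        closer = D[v] < best
        parent[closer] = v
        best = np.where(closer, D[v], best)

    # Keep the n-k shortest MST edges (equivalently, cut the k-1 longest).
    edges.sort(key=lambda e: e[0])
    root = np.arange(n)

    def find(i):
        while root[i] != i:
            root[i] = root[root[i]]  # path halving
            i = root[i]
        return i

    for _, u, v in edges[: n - k]:
        ru, rv = find(u), find(v)
        if ru != rv:
            root[rv] = ru
    roots = np.array([find(i) for i in range(n)])
    _, labels = np.unique(roots, return_inverse=True)
    return labels


def sample_cluster_extend(X, k, n_sample, seed=0, chunk=1024):
    """Cluster a random sample, then label all of X by its nearest sample point."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=n_sample, replace=False)  # assumed: uniform sampling
    S = X[idx]
    s_labels = single_linkage_labels(S, k)

    # Nearest prototype rule, applied once (noniteratively) in chunks
    # so the full N x n_sample distance matrix is never held in memory.
    labels = np.empty(len(X), dtype=int)
    for start in range(0, len(X), chunk):
        block = X[start:start + chunk]
        d = np.linalg.norm(block[:, None, :] - S[None, :, :], axis=2)
        labels[start:start + chunk] = s_labels[np.argmin(d, axis=1)]
    return labels


# Toy usage on two well-separated synthetic blobs (made-up data).
demo_rng = np.random.default_rng(42)
demo_X = np.vstack([
    demo_rng.normal(0.0, 0.5, size=(500, 2)),
    demo_rng.normal(10.0, 0.5, size=(500, 2)),
])
demo_labels = sample_cluster_extend(demo_X, k=2, n_sample=50)
print(np.bincount(demo_labels))
```

Only the sample is clustered, so the quadratic cost of the distance matrix depends on the sample size rather than on the full dataset, and the extension step is a single pass over the data. This is the general reason a scheme of this kind can scale, under the assumptions stated above.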

History

Journal

IEEE transactions on cybernetics

Volume

46

Pagination

2372-2385

Location

Piscataway, N.J.

Publisher DOI

https://doi.org/10.1109/TCYB.2015.2477416

ISSN

2168-2267

eISSN

2168-2275

Language

eng

Publication classification

C Journal article, C1.1 Refereed article in a scholarly journal

Copyright notice

2015, IEEE

Issue

10

Publisher

IEEE

Keywords

Big data; cluster analysis; cluster tendency assessment; data analytics; Internet of things; single linkage; 080109 Pattern Recognition and Data Mining; 899999 Information and Communication Services not elsewhere classified; School of Information Technology; 970108 Expanding Knowledge in the Information and Computing Sciences; 4603 Computer vision and multimedia computation; 4605 Data management and data science
