Dirichlet process mixture models with pairwise constraints for data clustering

Li, Cheng, Rana, Santu, Phung, Dinh and Venkatesh, Svetha 2016, Dirichlet process mixture models with pairwise constraints for data clustering, Annals of data science, vol. 3, no. 2, pp. 205-223, doi: 10.1007/s40745-016-0082-z.

Title Dirichlet process mixture models with pairwise constraints for data clustering
Author(s) Li, Cheng
Rana, Santu (ORCID iD: orcid.org/0000-0003-2247-850X)
Phung, Dinh (ORCID iD: orcid.org/0000-0002-9977-8247)
Venkatesh, Svetha (ORCID iD: orcid.org/0000-0001-8675-6631)
Journal name Annals of data science
Volume number 3
Issue number 2
Start page 205
End page 223
Total pages 19
Publisher Springer
Place of publication Berlin, Germany
Publication date 2016
ISSN 2198-5804
Keyword(s) Bayesian nonparametric
Dirichlet process
Mixture models
Pairwise constraints
Constrained clustering
Short-text clustering
Summary The Dirichlet process mixture (DPM) model, a typical Bayesian nonparametric model, can infer the number of clusters automatically and therefore performs well in data clustering. This paper investigates the influence of pairwise constraints in the DPM model. A pairwise constraint indicates the relationship between two data points and comes in two types: must-link (ML) and cannot-link (CL). We propose two models that incorporate pairwise constraints: the constrained DPM (C-DPM) and the constrained DPM with selected constraints (SC-DPM). In C-DPM, the concept of a chunklet is introduced: ML constraints are compiled into chunklets, and CL constraints exist between chunklets. We derive Gibbs sampling for the C-DPM based on chunklets. We further propose a principled approach to selecting the most useful constraints, which are then incorporated into the SC-DPM. We evaluate the proposed models on three real datasets: the 20 Newsgroups dataset, the NUS-WIDE image dataset, and a Facebook comments dataset that we collected ourselves. Our SC-DPM shows superior performance in data clustering. In addition, SC-DPM can potentially be used for short-text clustering.
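The chunklet step mentioned in the summary can be illustrated with a small sketch. It is not the authors' implementation and does not cover the Gibbs sampler or the constraint selection; it only shows one plausible way, assuming a union-find structure, to compile ML constraints into chunklets and lift CL constraints to pairs of chunklets. The function name `build_chunklets` and its signature are hypothetical.

```python
def build_chunklets(n_points, must_links, cannot_links):
    """Group points into chunklets (hypothetical helper, not from the paper).

    must_links / cannot_links are iterables of (i, j) index pairs.
    Returns (labels, chunklet_cl) where labels[i] is the chunklet id of
    point i and chunklet_cl is a set of (a, b) chunklet pairs with a < b.
    """
    parent = list(range(n_points))

    def find(i):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Must-link is transitive, so each connected component is one chunklet.
    for a, b in must_links:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Attach the larger root to the smaller so the root is always
            # the smallest point index in its chunklet.
            parent[max(ra, rb)] = min(ra, rb)

    # Number chunklets 0, 1, 2, ... in order of their smallest member.
    roots = sorted({find(i) for i in range(n_points)})
    chunklet_id = {root: k for k, root in enumerate(roots)}
    labels = [chunklet_id[find(i)] for i in range(n_points)]

    # Lift point-level cannot-links to chunklet-level cannot-links.
    chunklet_cl = set()
    for a, b in cannot_links:
        ca, cb = labels[a], labels[b]
        if ca == cb:
            raise ValueError(f"cannot-link ({a}, {b}) contradicts must-links")
        chunklet_cl.add((min(ca, cb), max(ca, cb)))

    return labels, chunklet_cl


labels, chunklet_cl = build_chunklets(
    6,
    must_links=[(0, 1), (1, 2), (3, 4)],
    cannot_links=[(2, 3), (0, 4), (5, 0)],
)
print(labels)               # [0, 0, 0, 1, 1, 2]
print(sorted(chunklet_cl))  # [(0, 1), (0, 2)]
```

In this toy input, points 0 to 2 form one chunklet, points 3 and 4 another, and the unconstrained point 5 is a singleton chunklet. A sampler working at chunklet level would then assign whole chunklets to clusters while keeping cannot-linked chunklets apart.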
Language eng
DOI 10.1007/s40745-016-0082-z
Field of Research 080109 Pattern Recognition and Data Mining
Socio Economic Objective 970108 Expanding Knowledge in the Information and Computing Sciences
HERDC Research category C1 Refereed article in a scholarly journal
ERA Research output type C Journal article
Copyright notice ©2016, Springer
Persistent URL http://hdl.handle.net/10536/DRO/DU:30083829

Unless expressly stated otherwise, the copyright for items in DRO is owned by the author, with all rights reserved.

Citation counts: TR Web of Science: cited 0 times
Scopus: cited 1 time
Access Statistics: 454 Abstract Views, 1 File Download
Created: Tue, 31 May 2016, 17:00:40 EST

Every reasonable effort has been made to ensure that permission has been obtained for items included in DRO. If you believe that your rights have been infringed by this repository, please contact drosupport@deakin.edu.au.