Streaming clustering with Bayesian nonparametric models

Huynh, V; Phung, D

journal contribution

posted on 2024-06-04, 07:43 authored by V Huynh, D Phung

Bayesian nonparametric (BNP) models are theoretically well suited to learning from streaming data because their model complexity can grow with the data observed over time. There is a rich body of literature on efficient approximate methods for posterior inference in BNP models, typically dominated by MCMC. However, very little work has addressed posterior inference in a streaming fashion, which is important for fully realizing the potential of BNP models in real-world tasks. The main challenge lies in developing a one-pass posterior update that is consistent with data streamed over time (i.e., the data is scanned only once), which general MCMC methods fail to address. Dirichlet process-based mixture models, meanwhile, are the most fundamental building blocks in the field of BNP. To this end, we develop in this paper a class of variational methods suitable for posterior inference in Dirichlet process mixture (DPM) models where both the posterior update and the data are presented in a streaming setting. We first propose new methods that advance existing variational inference approaches for BNP by allowing the variational distributions to grow over time, hence overcoming an important limitation of current methods, which impose parametric, truncated restrictions on the variational distributions. This results in two new methods, truncation-free variational Bayes (TFVB) and truncation-free maximization expectation (TFME), where the latter further supports hard clustering. These inference methods form the foundation for our streaming inference algorithm, in which we further adapt the recent Streaming Variational Bayes framework proposed by Broderick et al. (2013) to our task. To demonstrate our framework on real-world tasks, whose datasets are often heterogeneous, we develop one more theoretical extension of our model to handle assorted data, where each observation consists of different data types.
Our experiments on automatically learning the number of clusters demonstrate that the inference capability of our framework is comparable to that of truncated variational inference algorithms on both synthetic and real-world datasets. Moreover, an evaluation of the streaming learning algorithms on text corpora shows both quantitative and qualitative efficacy of the algorithms in clustering documents.
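
To make the idea of one-pass, truncation-free hard clustering concrete, the following is a minimal illustrative sketch only. It is not the TFVB or TFME algorithm from the paper; it swaps in a simple distance-threshold rule (in the spirit of DP-means) so that the number of clusters grows as data arrives and each point is seen exactly once. The function name `stream_dp_means` and the threshold parameter `lam` are assumptions introduced for this illustration.

```python
def stream_dp_means(points, lam):
    """Illustrative one-pass hard clustering (NOT the paper's TFME).

    A point opens a new cluster when its squared Euclidean distance to
    every current centre exceeds `lam` (a hypothetical threshold);
    otherwise it joins the nearest cluster and updates that mean.
    """
    centres = []  # running mean of each cluster
    counts = []   # number of points absorbed by each cluster
    labels = []   # hard assignment of each point, in arrival order
    for x in points:
        # Squared distance from the incoming point to each existing centre.
        d2 = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centres]
        if not d2 or min(d2) > lam:
            # No cluster is close enough: grow the model by one cluster.
            centres.append(list(x))
            counts.append(1)
            labels.append(len(centres) - 1)
        else:
            # Assign to the nearest cluster and update its mean incrementally.
            k = d2.index(min(d2))
            counts[k] += 1
            centres[k] = [c + (a - c) / counts[k] for a, c in zip(x, centres[k])]
            labels.append(k)
    return centres, labels


stream = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
centres, labels = stream_dp_means(stream, lam=1.0)
print(len(centres), labels)
```

No cluster count is fixed in advance here, which mirrors the truncation-free motivation in the abstract; a variational treatment would instead maintain and update posterior distributions over cluster parameters and assignments.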

Journal

Neurocomputing

Volume

258

Pagination

52-62

Location

Amsterdam, The Netherlands

Publisher DOI

https://doi.org/10.1016/j.neucom.2017.02.078

ISSN

0925-2312

eISSN

1872-8286

Language

eng

Publication classification

C Journal article, C1 Refereed article in a scholarly journal

Copyright notice

2017, Elsevier B.V.

Publisher

Elsevier

Keywords

Streaming learning; Bayesian nonparametric; Variational Bayes inference; Dirichlet process; Dirichlet process mixtures; Heterogeneous data sources; Centre for Pattern Recognition and Data Analytics; School of Information Technology; 4605 Data management and data science; 4611 Machine learning
