Unsupervised discretization algorithm based on mixture probabilistic model
Journal contribution
Posted on 2002-01-01. Authored by Gang Li and F. Tong.
A theoretically rigorous algorithm for the discretization of continuous attributes is presented, based on mixture probabilistic models. The algorithm automatically divides the range of a specified attribute into intervals, without prior knowledge or reference to other attributes. All values of the attribute are represented by a mixture probabilistic model in which each mixture component corresponds to a different interval. The parameters of the mixture model are determined by the Expectation-Maximization (EM) algorithm for maximum likelihood estimation. One advantage of the mixture-model approach to discretization is that it allows approximate Bayes factors to be used to compare models. To determine the most suitable number of intervals, maximum likelihood parameters are computed for mixture models with different numbers of components, and the BIC (Bayesian Information Criterion) of these models is compared; the model with the highest BIC is chosen as the resulting generative probabilistic model, and it determines the number of intervals. Choosing the best model therefore solves the problems of determining the number of intervals and of dividing the range simultaneously. Experimental results show that this form of discretization can have distinct advantages over competing non-probabilistic approaches (such as the K-means algorithm): it allows uncertainty in interval membership, gives direct control over the variability within each interval, and permits an objective treatment of the ever-thorny question of how many intervals the data suggest.
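The BIC-guided model-selection loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a one-dimensional attribute, Gaussian mixture components, and uses scikit-learn's `GaussianMixture`; the function name `discretize` and the parameter `max_components` are illustrative choices.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def discretize(values, max_components=6, seed=0):
    """Fit mixtures with 1..max_components components, keep the one
    preferred by BIC, and return it with an interval label per value."""
    X = np.asarray(values, dtype=float).reshape(-1, 1)
    best_model, best_bic = None, np.inf
    for k in range(1, max_components + 1):
        gm = GaussianMixture(n_components=k, random_state=seed).fit(X)
        # Note: sklearn's bic() returns the penalty form, so LOWER is
        # better; this corresponds to the abstract's "highest BIC" under
        # the log-likelihood-minus-penalty convention.
        bic = gm.bic(X)
        if bic < best_bic:
            best_model, best_bic = gm, bic
    # Each mixture component plays the role of one interval; labels give
    # the (hard) interval membership of each attribute value.
    return best_model, best_model.predict(X)

# Example: attribute values drawn from two well-separated clusters
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(10.0, 1.0, 200)])
model, labels = discretize(data)
print(model.n_components)  # number of intervals selected by BIC
```

Because each value gets a posterior probability over components (`model.predict_proba`), interval membership can be treated as uncertain rather than hard, which is one of the advantages the abstract claims over K-means.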