Mp-dissimilarity: a data dependent dissimilarity measure
Version 2 2024-06-05, 02:47
Version 1 2019-03-10, 15:41
conference contribution
posted on 2024-06-05, 02:47, authored by Sunil Aryal, KM Ting, G Haffari, T Washio
Nearest neighbour search is a core process in many data mining algorithms. Finding reliable closest matches of a query in a high-dimensional space is still a challenging task. This is because the effectiveness of many dissimilarity measures that are based on a geometric model, such as the lp-norm, decreases as the number of dimensions increases. In this paper, we examine how the data distribution can be exploited to measure dissimilarity between two instances and propose a new data-dependent dissimilarity measure called 'mp-dissimilarity'. Rather than relying on geometric distance, it measures the dissimilarity between two instances in each dimension as the probability mass in a region that encloses the two instances. It deems two instances in a sparse region to be more similar than two instances in a dense region, even though the two pairs have the same geometric distance. Our empirical results show that the proposed dissimilarity measure indeed provides a reliable nearest neighbour search in high-dimensional spaces, particularly in sparse data. Mp-dissimilarity produced better task-specific performance than the lp-norm and cosine distance in classification and information retrieval tasks.
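The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact formula, so the function names (`region_mass`, `mp_dissimilarity_sketch`), the use of a closed interval between the two values as the enclosing region, the empirical fraction of data points as the probability mass, and the lp-style power mean across dimensions are all assumptions made for illustration.

```python
def region_mass(column, a, b):
    """Fraction of values in `column` lying in the closed interval between a and b.

    Assumption: the enclosing region is simply [min(a, b), max(a, b)] and the
    probability mass is estimated as an empirical fraction of the data.
    """
    lo, hi = min(a, b), max(a, b)
    inside = sum(1 for v in column if lo <= v <= hi)
    return inside / len(column)


def mp_dissimilarity_sketch(data, x, y, p=2.0):
    """Hypothetical data-dependent dissimilarity between instances x and y.

    Assumption: per-dimension masses are combined with a power mean of
    order p, by analogy with the lp-norm.
    """
    n_dims = len(x)
    total = 0.0
    for i in range(n_dims):
        column = [row[i] for row in data]  # all observed values in dimension i
        total += region_mass(column, x[i], y[i]) ** p
    return (total / n_dims) ** (1.0 / p)


# Toy 1-D data set: 11 points packed into [0, 1] (dense), 2 points at 5 and 6 (sparse).
data = [[i / 10] for i in range(11)] + [[5.0], [6.0]]

# Both pairs are exactly 1.0 apart in geometric terms.
dense_pair = mp_dissimilarity_sketch(data, [0.0], [1.0])
sparse_pair = mp_dissimilarity_sketch(data, [5.0], [6.0])

print(sparse_pair < dense_pair)
```

Under these assumptions, the pair in the sparse region receives the smaller dissimilarity even though both pairs are the same geometric distance apart, which is the behaviour the abstract describes.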
History
Pagination
707-712
Location
Shenzhen, China
Start date
2014-12-14
End date
2014-12-17
ISSN
1550-4786
Language
eng
Publication classification
E1.1 Full written paper - refereed
Copyright notice
2014, IEEE
Editor/Contributor(s)
Kumar R, Toivonen H, Pei J, Huang JZ, Wu X
Title of proceedings
ICDM 2014 : Proceedings of the 14th IEEE International Conference on Data Mining 2014