File(s) under permanent embargo
Mp-dissimilarity: a data dependent dissimilarity measure
conference contribution
posted on 2014-01-01, 00:00 authored by Sunil Aryal, K M Ting, G Haffari, T Washio
Nearest neighbour search is a core process in many data mining algorithms. Finding reliable closest matches of a query in a high-dimensional space is still a challenging task. This is because the effectiveness of many dissimilarity measures that are based on a geometric model, such as the lp-norm, decreases as the number of dimensions increases. In this paper, we examine how the data distribution can be exploited to measure dissimilarity between two instances and propose a new data-dependent dissimilarity measure called 'mp-dissimilarity'. Rather than relying on geometric distance, it measures the dissimilarity between two instances in each dimension as the probability mass in a region that encloses the two instances. It deems two instances in a sparse region to be more similar than two instances in a dense region, even when the two pairs are the same geometric distance apart. Our empirical results show that the proposed dissimilarity measure provides a reliable nearest neighbour search in high-dimensional spaces, particularly in sparse data. Mp-dissimilarity produced better task-specific performance than the lp-norm and cosine distance in classification and information retrieval tasks.
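The per-dimension idea described in the abstract can be illustrated with a minimal Python sketch. Several details here are assumptions rather than the paper's definition: the enclosing region in each dimension is taken to be the closed interval between the two values, the probability mass is estimated as the fraction of data points falling in that interval, and the per-dimension masses are combined with a power mean of order `p` by analogy with the lp-norm. The function name `mp_dissimilarity` and the toy data are hypothetical.

```python
from bisect import bisect_left, bisect_right


def mp_dissimilarity(x, y, data, p=2.0):
    """Sketch of a probability-mass-based dissimilarity between x and y.

    Assumptions (not taken from the abstract): the region enclosing x and y
    in dimension i is the closed interval [min(x_i, y_i), max(x_i, y_i)],
    its mass is the fraction of rows of `data` inside it, and the masses
    are aggregated with a power mean of order p.
    """
    n = len(data)
    d = len(x)
    total = 0.0
    for i in range(d):
        # Sorting per call keeps the sketch short; a real implementation
        # would presumably precompute this once per dimension.
        column = sorted(row[i] for row in data)
        lo, hi = min(x[i], y[i]), max(x[i], y[i])
        # Number of data points whose i-th value lies in [lo, hi].
        count = bisect_right(column, hi) - bisect_left(column, lo)
        total += (count / n) ** p
    return (total / d) ** (1.0 / p)


# Toy one-dimensional data: a dense cluster near 0 and a sparse tail.
data = [[0.0], [0.1], [0.2], [0.3], [0.4], [0.5], [5.0], [10.0]]

# Both pairs are 0.5 apart geometrically.
dense = mp_dissimilarity([0.0], [0.5], data)    # pair in the dense region
sparse = mp_dissimilarity([9.5], [10.0], data)  # pair in the sparse region
print(dense, sparse)
```

Under these assumptions the pair in the sparse region receives the smaller dissimilarity, because fewer data points fall between its two values, even though both pairs are the same geometric distance apart.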
History
Event
IEEE Computer Society. Conference (14th 2014 : Shenzhen, China)
Series
IEEE Computer Society Conference
Pagination
707 - 712
Publisher
Institute of Electrical and Electronics Engineers
Location
Shenzhen, China
Place of publication
Piscataway, N.J.
Publisher DOI
Start date
2014-12-14
End date
2014-12-17
ISSN
1550-4786
Language
eng
Publication classification
E1.1 Full written paper - refereed
Copyright notice
2014, IEEE
Editor/Contributor(s)
R Kumar, H Toivonen, J Pei, J Huang, X Wu
Title of proceedings
ICDM 2014 : Proceedings of the 14th IEEE International Conference on Data Mining 2014
Keywords
Distance measure; lp-norm; mp-dissimilarity; Accuracy; Information retrieval; Vectors; Educational institutions; Approximation methods; Data mining; Electronic mail; Science & Technology; Technology; Computer Science, Artificial Intelligence; Computer Science, Information Systems; Computer Science; l(p)-norm; m(p)-dissimilarity; RETRIEVAL; Artificial Intelligence and Image Processing