Deakin University

File(s) under permanent embargo

Mp-dissimilarity: a data dependent dissimilarity measure

conference contribution
posted on 2014-01-01, 00:00 authored by Sunil Aryal, K M Ting, G Haffari, T Washio
Nearest neighbour search is a core process in many data mining algorithms. Finding reliable closest matches of a query in a high dimensional space is still a challenging task. This is because the effectiveness of many dissimilarity measures that are based on a geometric model, such as the lp-norm, decreases as the number of dimensions increases. In this paper, we examine how the data distribution can be exploited to measure dissimilarity between two instances and propose a new data dependent dissimilarity measure called 'mp-dissimilarity'. Rather than relying on geometric distance, it measures the dissimilarity between two instances in each dimension as the probability mass in a region that encloses the two instances. It deems two instances in a sparse region to be more similar than two instances in a dense region, even though both pairs have the same geometric distance. Our empirical results show that the proposed dissimilarity measure indeed provides a reliable nearest neighbour search in high dimensional spaces, particularly in sparse data. Mp-dissimilarity produced better task specific performance than the lp-norm and cosine distance in classification and information retrieval tasks.
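The idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the formula, so several details here are assumptions: the region enclosing the two instances in each dimension is taken to be the closed interval between their values, the probability mass is estimated as the fraction of data points falling in that interval, and the per-dimension masses are combined with a power mean of order `p` by analogy with the lp-norm. The function name `mp_dissimilarity` and its parameters are hypothetical and not taken from the paper.

```python
import numpy as np


def mp_dissimilarity(x, y, data, p=2.0):
    """Sketch of a mass-based dissimilarity between instances x and y.

    Assumptions (not stated in the abstract): the enclosing region in each
    dimension is the closed interval [min(x_i, y_i), max(x_i, y_i)], the mass
    is the fraction of rows of `data` inside it, and dimensions are combined
    with a power mean of order p.
    """
    data = np.asarray(data, dtype=float)  # shape (n, d)
    x = np.asarray(x, dtype=float)        # shape (d,)
    y = np.asarray(y, dtype=float)        # shape (d,)

    lo = np.minimum(x, y)
    hi = np.maximum(x, y)

    # Boolean matrix: entry (j, i) is True when data point j lies in the
    # interval spanned by x and y along dimension i.
    in_region = (data >= lo) & (data <= hi)

    # Estimated probability mass of the region in each dimension.
    mass = in_region.mean(axis=0)

    # Power mean over dimensions (assumed aggregation).
    return float(np.mean(mass ** p) ** (1.0 / p))


if __name__ == "__main__":
    # One-dimensional toy data: a dense cluster and two isolated points.
    data = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [10.0], [14.0]])

    dense_pair = mp_dissimilarity([0.0], [4.0], data)
    sparse_pair = mp_dissimilarity([10.0], [14.0], data)

    # Both pairs are 4 units apart geometrically, but the pair in the
    # sparse region encloses less data mass and so scores as more similar.
    print(dense_pair, sparse_pair)
```

In this toy data set both pairs have the same geometric distance, yet the pair in the sparse region receives the smaller dissimilarity because fewer data points fall between its two instances, which is the behaviour the abstract describes.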

Event

IEEE Computer Society. Conference (14th 2014 : Shenzhen, China)

Series

IEEE Computer Society Conference

Pagination

707 - 712

Publisher

Institute of Electrical and Electronics Engineers

Location

Shenzhen, China

Place of publication

Piscataway, N.J.

Start date

2014-12-14

End date

2014-12-17

ISSN

1550-4786

Language

eng

Publication classification

E1.1 Full written paper - refereed

Copyright notice

2014, IEEE

Editor/Contributor(s)

R Kumar, H Toivonen, J Pei, J Huang, X Wu

Title of proceedings

ICDM 2014 : Proceedings of the 14th IEEE International Conference on Data Mining 2014