Deakin University
Data-dependent dissimilarity measure: an effective alternative to geometric distance measures

journal contribution
posted on 2024-06-05, 02:47 authored by Sunil Aryal, KM Ting, T Washio, G Haffari
Nearest neighbor search is a core process in many data mining algorithms. Finding reliable closest matches of a test instance remains a challenging task because the effectiveness of many general-purpose distance measures such as the $\ell_p$-norm decreases as the number of dimensions increases, and their performance varies significantly across data distributions. This is mainly because they compute the distance between two instances solely from their geometric positions in the feature space; the data distribution has no influence on the measure. This paper presents a simple data-dependent general-purpose dissimilarity measure called '$m_p$-dissimilarity'. Rather than relying on geometric distance, it measures the dissimilarity between two instances as a probability mass in a region that encloses the two instances in every dimension. It deems two instances in a sparse region to be more similar than two instances of equal inter-point geometric distance in a dense region. Our empirical results in k-NN classification and content-based multimedia information retrieval tasks show that the proposed $m_p$-dissimilarity measure produces better task-specific performance than existing widely used general-purpose distance measures such as the $\ell_p$-norm and cosine distance across a wide range of moderate- to high-dimensional data sets with continuous only, discrete only, and mixed attributes.
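To make the idea concrete, here is a minimal sketch of a data-mass dissimilarity in the spirit described by the abstract: for each dimension, count the fraction of reference points falling inside the interval spanned by the two instances, then aggregate the per-dimension masses with a p-th power mean. This is an illustrative assumption-laden sketch (function name `mp_dissimilarity`, direct counting over a `data` matrix, continuous attributes), not the paper's exact estimator.

```python
import numpy as np

def mp_dissimilarity(x, y, data, p=2):
    """Sketch of a data-dependent dissimilarity: per-dimension probability mass
    of the region enclosing x and y, aggregated with a p-th power mean.
    `data` is an (n, d) reference sample; x and y are length-d vectors."""
    n, _ = data.shape
    lo = np.minimum(x, y)
    hi = np.maximum(x, y)
    # Fraction of reference points whose value lies inside [lo_i, hi_i] in each dimension.
    mass = ((data >= lo) & (data <= hi)).sum(axis=0) / n
    # Larger mass => denser region => the pair is deemed more dissimilar.
    return (np.mean(mass ** p)) ** (1.0 / p)

# Toy usage: two pairs with the same geometric gap, but the pair lying in the
# sparser region of the data receives the smaller dissimilarity.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.0, 0.2, size=(90, 2)),    # dense cluster near the origin
                       rng.uniform(2.0, 4.0, size=(10, 2))])  # sparse region
print(mp_dissimilarity(np.array([0.0, 0.0]), np.array([0.3, 0.3]), data))  # dense pair
print(mp_dissimilarity(np.array([3.0, 3.0]), np.array([3.3, 3.3]), data))  # sparse pair
```

Unlike the $\ell_p$-norm, the result depends on where the pair sits relative to the rest of the data, which is the property the abstract highlights.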

History

Journal

Knowledge and Information Systems

Volume

53

Pagination

479-506

Location

New York, N.Y.

ISSN

0219-1377

eISSN

0219-3116

Language

English

Publication classification

C1.1 Refereed article in a scholarly journal

Copyright notice

2017, Springer-Verlag London

Issue

2

Publisher

SPRINGER LONDON LTD