Deakin University

File(s) under permanent embargo

A comparative study of data-dependent approaches without learning in measuring similarities of data objects

journal contribution
posted on 2020-01-01, 00:00, authored by Sunil Aryal, Kai Ming Ting, Takashi Washio, Gholamreza Haffari
Conventional general-purpose distance-based similarity measures, such as Minkowski distance (also known as the ℓₚ-norm with p > 0), are data-independent and sensitive to units or scales of measurement. There are existing general-purpose data-dependent measures, such as rank difference, Lin’s probabilistic measure and mₚ-dissimilarity (p > 0), which are not sensitive to units or scales of measurement. Although they have been shown to be more effective than the traditional distance measures, their characteristics and relative performances have not been investigated. In this paper, we study the characteristics and relationships of different general-purpose data-dependent measures. We generalise mₚ-dissimilarity to p ≥ 0 by introducing m₀-dissimilarity and show that it is a generic data-dependent measure with data-dependent self-similarity, of which rank difference and Lin’s measure are special cases with data-independent self-similarity. We evaluate the effectiveness of a wide range of general-purpose data-dependent and data-independent measures in content-based information retrieval and kNN classification tasks. Our findings show that the fully data-dependent measure of mₚ-dissimilarity is a more effective alternative to other data-dependent and commonly used distance-based similarity measures, as its task-specific performance is more consistent across a wide range of datasets.
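The abstract's point about sensitivity to units can be illustrated with a small sketch. The code below is not the paper's formulation of rank difference or mₚ-dissimilarity; it assumes a simplified rank-based measure (each value is replaced by its rank within its own attribute before an ℓₚ distance is taken), and the toy data and function names are hypothetical. It only shows that changing an attribute's unit alters the Minkowski distance between two objects, while a rank-based measure is unaffected.

```python
from bisect import bisect_left


def minkowski(x, y, p=2.0):
    """l_p distance between two equal-length vectors (p > 0)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def rank_transform(data):
    """Replace each value by its 0-based rank within its own attribute.

    Tied values share the lowest rank. This is a simplified stand-in
    for a rank-based representation, not the paper's definition.
    """
    columns = [sorted(col) for col in zip(*data)]
    return [
        [bisect_left(columns[j], v) for j, v in enumerate(row)]
        for row in data
    ]


def rank_difference(data, i, j, p=2.0):
    """l_p distance between objects i and j after rank transformation."""
    ranked = rank_transform(data)
    return minkowski(ranked[i], ranked[j], p)


# Hypothetical toy data: (height in metres, weight in kilograms).
data_m = [[1.60, 55.0], [1.75, 70.0], [1.82, 90.0], [1.68, 62.0]]
# The same objects with height expressed in centimetres.
data_cm = [[h * 100.0, w] for h, w in data_m]

# The Minkowski distance depends on the chosen unit of height.
print(minkowski(data_m[0], data_m[1]), minkowski(data_cm[0], data_cm[1]))
# The rank-based measure gives the same value under both units.
print(rank_difference(data_m, 0, 1), rank_difference(data_cm, 0, 1))
```

Ranks depend only on the ordering of values within an attribute, so any order-preserving change of unit leaves them untouched; the price is that the result now depends on the dataset from which the ranks are computed, which is what makes such a measure data-dependent.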

History

Journal

Data Mining and Knowledge Discovery

Volume

34

Pagination

124 - 162

Publisher

Springer

Location

Cham, Switzerland

ISSN

1384-5810

eISSN

1573-756X

Language

eng

Publication classification

C1 Refereed article in a scholarly journal