File(s) under permanent embargo
Beyond tf-idf and cosine distance in documents dissimilarity measure
conference contribution
posted on 2015-01-01, 00:00 authored by Sunil Aryal, Kai Ming Ting, Gholamreza Haffari, Takashi Washio
In the vector space model, different term weighting schemes are used to adjust bag-of-words document vectors in order to improve the performance of the widely used cosine distance. Even though the cosine distance with some term weighting schemes results in a more reliable (dis)similarity measure in some data sets, it may not perform well in others because of the underlying assumptions of the term weighting schemes. In this paper, we argue that the explicit adjustment of bag-of-words document vectors using term weighting is not required if a data-dependent dissimilarity measure called $$m_p$$-dissimilarity is used. Our empirical results in a document retrieval task reveal that $$m_p$$ with the simplest binary bag-of-words representation is either better than or competitive with the cosine distance with the best-performing state-of-the-art term weighting scheme in four widely used benchmark document collections.
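To make the baseline in the abstract concrete, the following is a minimal sketch of cosine distance computed over two bag-of-words representations: plain binary vectors and tf-idf weighted vectors. It does not implement $$m_p$$-dissimilarity, which the abstract names but does not define. The function names, the whitespace tokenizer, and the particular tf-idf variant (raw count times log(N / document frequency)) are assumptions chosen for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def binary_vectors(docs):
    """Binary bag-of-words: weight 1.0 for every term that occurs in a document."""
    return [{term: 1.0 for term in set(doc.split())} for doc in docs]


def tfidf_vectors(docs):
    """tf-idf weights (assumed variant): raw term count * log(N / document frequency)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    return [
        {term: count * math.log(n_docs / df[term])
         for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine_distance(u, v):
    """1 - cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # convention: an all-zero vector is maximally distant
    return 1.0 - dot / (norm_u * norm_v)


# Toy collection, purely illustrative.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
binary = binary_vectors(docs)
tfidf = tfidf_vectors(docs)
binary_distance = cosine_distance(binary[0], binary[1])
tfidf_distance = cosine_distance(tfidf[0], tfidf[1])
```

In this toy collection the term "the" appears in every document, so its tf-idf weight is zero and it no longer contributes to similarity. That is the kind of explicit, assumption-driven adjustment the abstract argues can be avoided with a data-dependent measure.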
Event
Information Retrieval Technology. Conference (11th : 2015 : Brisbane, Queensland)
Volume
9460
Series
Lecture Notes in Computer Science
Pagination
400 - 406
Publisher
Springer
Location
Brisbane, Queensland
Place of publication
Cham, Switzerland
Start date
2015-12-02
End date
2015-12-04
ISBN-13
9783319289403
Language
eng
Publication classification
E1.1 Full written paper - refereed
Copyright notice
2015, Springer International Publishing Switzerland
Editor/Contributor(s)
Guido Zuccon, Shlomo Geva, Hideo Joho, Falk Scholer, Aixin Sun, Peng Zhang
Title of proceedings
AIRS 2015 : Information Retrieval Technology: Proceedings of the 11th Asia Information Retrieval Societies Conference
Keywords
Science & Technology; Technology; Computer Science, Artificial Intelligence; Computer Science, Information Systems; Computer Science, Interdisciplinary Applications; Computer Science, Theory & Methods; Robotics; Computer Science; Cosine distance; Term weighting; mp-dissimilarity; Artificial Intelligence and Image Processing