Short Text Similarity Measurement Using Context from Bag of Word Pairs and Word Co-occurrence
Version 2 2024-06-04, 13:55
Version 1 2020-03-19, 09:55
conference contribution
posted on 2024-06-04, 13:55 authored by S Yang, Guangyan Huang, Bahadorreza Ofoghi
© Springer Nature Singapore Pte Ltd 2020.
With the rapid development of social networks, short texts have become a prevalent form of social communication on the Internet. Measuring the similarity between short texts is a fundamental task for many applications, such as social network text querying, short text clustering and geographical event detection for smart cities. However, short texts in social media typically carry limited contextual information and are sparse, noisy and ambiguous. Hence, effectively measuring the distance between short texts is a challenging task. In this paper, we propose a new heuristic word pair distance measurement (WPDM) technique for short texts, which exploits corpus-level word relations and enriches the context of each short text with a bag-of-word-pairs representation. We first adjust Jaccard similarity to measure the distance between words. Then, words are paired up to capture latent semantics in a short text document, which transforms the short text into a bag-of-word-pairs representation. Finally, the similarity between short text documents is calculated by averaging the distances of the word pairs. Experimental results on a real-world dataset demonstrate that the proposed WPDM is effective and achieves much better performance than state-of-the-art methods.
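The three steps named in the abstract (a Jaccard-based word distance, pairing up words, and averaging pair distances) can be illustrated with a minimal sketch. This is not the authors' WPDM formulation: the abstract does not give the adjusted Jaccard formula or the exact pairing scheme, so the sketch assumes plain Jaccard distance over the sets of documents in which each word occurs, and it assumes pairs are formed across the two texts being compared. The names `build_index`, `word_distance` and `text_distance` are hypothetical.

```python
from itertools import product


def build_index(corpus):
    """Map each word to the set of document ids that contain it."""
    index = {}
    for doc_id, text in enumerate(corpus):
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(doc_id)
    return index


def word_distance(w1, w2, index):
    """Plain Jaccard distance between the document sets of two words.

    Assumption: the paper adjusts Jaccard similarity in a way the
    abstract does not specify; unadjusted Jaccard is used here.
    """
    if w1 == w2:
        return 0.0
    docs1 = index.get(w1, set())
    docs2 = index.get(w2, set())
    union = docs1 | docs2
    if not union:
        return 1.0  # neither word was seen in the corpus
    return 1.0 - len(docs1 & docs2) / len(union)


def text_distance(text_a, text_b, index):
    """Average word distance over all cross-text word pairs.

    Assumption: pairs are built across the two texts; the paper's
    bag of word pairs may be constructed differently.
    """
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    pairs = list(product(words_a, words_b))
    if not pairs:
        return 1.0  # at least one text is empty
    return sum(word_distance(a, b, index) for a, b in pairs) / len(pairs)


corpus = [
    "heavy rain floods city",
    "rain floods road",
    "football match tonight",
]
index = build_index(corpus)
print(text_distance(corpus[0], corpus[1], index))
print(text_distance(corpus[0], corpus[2], index))
```

In this toy corpus the two weather-related texts come out closer to each other than either is to the football text, because their words co-occur in the same documents. Under these assumptions a text is not guaranteed to be at distance zero from itself, which is one reason the real method may pair words differently.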
Volume
1179
Pagination
221-231
Location
Ningbo, China
Start date
2019-05-15
End date
2019-05-20
ISSN
1865-0929
eISSN
1865-0937
ISBN-13
9789811528095
Language
eng
Publication classification
E1 Full written paper - refereed
Title of proceedings
ICDS 2019 : Data science : 6th international conference, ICDS 2019, Ningbo, China, May 15-20, 2019, revised selected papers
Event
Data Science. Conference (2019 : 6th : Ningbo, China)
Publisher
Springer
Place of publication
Berlin, Germany
Series
Communications in Computer and Information Science