Short text similarity measurement using context-aware weighted biterms
Version 2 2024-06-04, 13:55
Version 1 2020-05-05, 10:22
journal contribution
posted on 2024-06-04, 13:55 authored by S Yang, Guangyan Huang, Bahadorreza Ofoghi, John Yearwood

© 2020 John Wiley & Sons, Ltd.

With the development of internet technologies, social media and mobile devices, short texts have become an increasingly popular medium for users to communicate with friends, search for information and review products. Measuring the similarity between short texts is a fundamental task because of its importance in many applications, such as text retrieval, topic discovery, and event detection. However, short texts generally comprise sparse, noisy, and ambiguous information, so effectively measuring the distance between them is challenging. In this paper, we incorporate corpus-wide word co-occurrence information into document-level feature enrichment to mitigate the challenges that the sparseness of short texts causes for distance measurement. We propose a novel context-aware weighted Biterm method for short text Distance Measurement (BDM). In BDM, we extract biterms (i.e., word pairs) from a short text corpus and use a biterm topic model to determine the global weights of biterms in the corpus. We then determine the local importance of a biterm in different contexts (i.e., short texts) based on the corpus-level biterm weight. The distance between two short texts is computed using the context-aware weighted biterms. Experimental results on three real-world datasets demonstrate better accuracy and effectiveness of the proposed BDM.
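The pipeline outlined in the abstract (extract biterms, weight them globally, localise the weights per text, compare texts) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the weighting formulas, so the document-frequency weight below is an assumed stand-in for the biterm-topic-model weight, and the within-text normalisation and the cosine distance are likewise assumptions. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def extract_biterms(text):
    """Return the unordered pairs of distinct words (biterms) in one short text."""
    words = sorted(set(text.lower().split()))
    return list(combinations(words, 2))


def global_weights(corpus):
    """ASSUMPTION: weight each biterm by the fraction of texts that contain it.
    The paper derives this corpus-level weight from a biterm topic model instead."""
    counts = Counter(b for text in corpus for b in extract_biterms(text))
    return {b: c / len(corpus) for b, c in counts.items()}


def local_weights(text, weights):
    """ASSUMPTION: a biterm's local weight is its global weight,
    normalised so that the weights within one text sum to 1."""
    biterms = extract_biterms(text)
    total = sum(weights.get(b, 0.0) for b in biterms)
    if total == 0:
        return {}
    return {b: weights.get(b, 0.0) / total for b in biterms}


def biterm_distance(text_a, text_b, weights):
    """ASSUMPTION: cosine distance between the two weighted biterm vectors."""
    u = local_weights(text_a, weights)
    v = local_weights(text_b, weights)
    dot = sum(w * v.get(b, 0.0) for b, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # no usable biterms: treat the texts as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


if __name__ == "__main__":
    corpus = ["apple banana fruit", "banana fruit salad", "stock market crash"]
    weights = global_weights(corpus)
    print(biterm_distance(corpus[0], corpus[1], weights))
    print(biterm_distance(corpus[0], corpus[2], weights))
```

In this toy corpus the first two texts share the biterm (banana, fruit), so they come out closer to each other than either is to the third text, which shares no biterm with them.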
Journal: Concurrency Computation
Volume: 34
Article number: e5765
Pagination: 1-11
Location: London, Eng.
ISSN: 1532-0626
eISSN: 1532-0634
Language: English
Publication classification: C1 Refereed article in a scholarly journal
Issue: 8
Publisher: Wiley
Keywords
clustering
Computer Science
Computer Science, Software Engineering
Computer Science, Theory & Methods
Science & Technology
short text
similarity measurement
Technology
Department of Information Systems and Business Analytics
DE140100387
4605 Data management and data science
4606 Distributed computing and systems software