Short text similarity measurement using context-aware weighted biterms

Yang, S; Huang, Guangyan; Ofoghi, Bahadorreza; Yearwood, John

journal contribution

posted on 2024-06-04, 13:55 authored by S Yang, Guangyan Huang, Bahadorreza Ofoghi, John Yearwood

© 2020 John Wiley & Sons, Ltd.

With the development of internet technologies, social media, and mobile devices, short texts have become an increasingly popular medium for users to communicate with friends, search for information, and review products. Measuring the similarity between short texts is a fundamental task because of its importance in many applications, such as text retrieval, topic discovery, and event detection. However, short texts generally comprise sparse, noisy, and ambiguous information, so effectively measuring the distance between them is challenging. In this paper, we incorporate corpus-wide word co-occurrence information into document-level feature enrichment to mitigate the difficulties that the sparseness of short texts causes for distance measurement. We propose a novel context-aware weighted Biterm method for short text Distance Measurement (BDM). In BDM, we extract biterms (i.e., word pairs) from a short text corpus and use a biterm topic model to determine the global weights of biterms in the corpus. We then determine the local importance of a biterm in different contexts (i.e., short texts) based on its corpus-level weight. The distance between two short texts is computed using the context-aware weighted biterms. Experimental results on three real-world datasets demonstrate that the proposed BDM achieves better accuracy and effectiveness.
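The pipeline outlined in the abstract (extract biterms, assign corpus-level weights, rescale them per text, then compare texts) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the biterm topic model or the weighting formulas, so the sketch substitutes an IDF-style score for the global weight, a simple per-text normalisation for the local weight, and cosine distance for the final comparison. All function names are hypothetical, and each distinct word pair is counted once per text as a simplification.

```python
import math
from collections import Counter
from itertools import combinations


def biterms(text):
    """Return the unordered pairs of distinct words (biterms) in one short text."""
    words = sorted(set(text.lower().split()))
    return list(combinations(words, 2))


def global_weights(corpus):
    """Corpus-level biterm weights.

    ASSUMPTION: an IDF-style score stands in for the biterm topic model
    used in the paper, whose details are not given in the abstract.
    """
    doc_freq = Counter(b for doc in corpus for b in biterms(doc))
    n_docs = len(corpus)
    return {b: math.log((1 + n_docs) / (1 + df)) + 1.0 for b, df in doc_freq.items()}


def local_weights(text, weights):
    """Context-aware weights: global weights normalised within one text.

    ASSUMPTION: plain normalisation, not the paper's local-importance rule.
    """
    raw = {b: weights.get(b, 0.0) for b in biterms(text)}
    total = sum(raw.values())
    return {b: w / total for b, w in raw.items()} if total > 0 else {}


def biterm_distance(text_a, text_b, weights):
    """Cosine distance between the weighted biterm vectors of two texts."""
    u = local_weights(text_a, weights)
    v = local_weights(text_b, weights)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # a text with no weighted biterms is treated as maximally distant
    dot = sum(w * v.get(b, 0.0) for b, w in u.items())
    return 1.0 - dot / (norm_u * norm_v)


if __name__ == "__main__":
    corpus = ["apple banana cherry", "apple banana date", "dog cat"]
    weights = global_weights(corpus)
    print(biterm_distance(corpus[0], corpus[1], weights))
    print(biterm_distance(corpus[0], corpus[2], weights))
```

In this sketch, texts that share word pairs end up closer than texts that share none, which is the intuition behind enriching sparse short texts with co-occurrence features.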

History

Journal

Concurrency and Computation: Practice and Experience

Volume

34

Article number

e5765

Pagination

1-11

Location

London, Eng.

Publisher DOI

https://doi.org/10.1002/cpe.5765

ISSN

1532-0626

eISSN

1532-0634

Language

English

Author URL

http://gateway.webofknowledge.com/gateway/Gateway.cgi?GWVersion=2&SrcApp=PARTNER_APP&SrcAuth=LinksAMR&KeyUT=WOS:000525962400001&DestLinkType=FullRecord&DestApp=ALL_WOS&UsrCustomerID=a045e4b2bb1f2b747c68c720ec8913b7

Publication classification

C1 Refereed article in a scholarly journal

Issue

8

Publisher

Wiley

Keywords

clustering; Computer Science; Computer Science, Software Engineering; Computer Science, Theory & Methods; Science & Technology; short text similarity measurement; Technology; Department of Information Systems and Business Analytics; DE140100387; 4605 Data management and data science; 4606 Distributed computing and systems software
