Deakin University

File(s) under permanent embargo

A longitudinal study of topic classification on Twitter

Version 3 2024-06-18, 18:00
Version 2 2024-06-05, 06:21
Version 1 2022-09-29, 02:11
journal contribution
posted on 2024-06-18, 18:00 authored by Mohamed Reda Bouadjenek, S Sanner, Z Iman, L Xie, DX Shi
Twitter represents a massively distributed information source over topics ranging from social and political events to entertainment and sports news. While recent work has suggested this content can be narrowed down to the personalized interests of individual users by training topic filters using standard classifiers, there remain many open questions about the efficacy of such classification-based filtering approaches. For example, over a year or more after training, how well do such classifiers generalize to future novel topical content, and are such results stable across a range of topics? In addition, how robust is a topic classifier over the time horizon, e.g., can a model trained in 1 year be used for making predictions in the subsequent year? Furthermore, what features, feature classes, and feature attributes are most critical for long-term classifier performance? To answer these questions, we collected a corpus of over 800 million English Tweets via the Twitter streaming API during 2013 and 2014 and learned topic classifiers for 10 diverse themes ranging from social issues to celebrity deaths to the “Iran nuclear deal”. The results of this long-term study of topic classifier performance provide a number of important insights, among them that: (i) such classifiers can indeed generalize to novel topical content with high precision over a year or more after training though performance degrades with time, (ii) the classes of hashtags and simple terms contain the most informative feature instances, (iii) removing tweets containing training hashtags from the validation set allows better generalization, and (iv) the simple volume of tweets by a user correlates more with their informativeness than their follower or friend count. 
In summary, this work provides a long-term study of topic classifiers on Twitter that further justifies classification-based topical filtering approaches while providing detailed insight into the feature properties most critical for topic classifier performance.

History

Journal

PeerJ Computer Science

Volume

8

Article number

e991

Pagination

e991-e991

Location

United States

ISSN

2167-9843

eISSN

2376-5992

Language

en

Publication classification

C1 Refereed article in a scholarly journal

Publisher

PeerJ
