Deakin University

File(s) under permanent embargo

Using spatiotemporal distribution of geocoded Twitter data to predict US county-level health indices

journal contribution
posted on 2020-09-01, 00:00 authored by Thin Nguyen, M Larsen, B O'Dea, Dinh Hung Nguyen, Duc Thanh Nguyen, John Yearwood, Quoc-Dinh Phung, Svetha Venkatesh, H Christensen
For more than three decades, the US has annually conducted Behavioral Risk Factor Surveillance System (BRFSS) surveys to capture the health behavior and health status of its people. Though this kind of population-level information is important for local governments to identify local needs, traditional datasets take several years to collate and to become publicly available. Geocoded social media data can provide an alternative reflection of local health trends. Because of the large scale of the data, approximately two billion tweets in this work, aggregating the tweets at a population level is common practice. While it reduces the computational cost, aggregation loses information on the distribution of the data over the population, and such information may be important for identifying the health behavior and health outcomes of that population. In this work, we propose statistical features constructed on top of primary features to predict county-level health indices. The primary features include topics and linguistic patterns extracted from tweets with county-decoded information. In addition, tweeting behaviors, particularly tweeting time, are used as predictors of the health indices. Apache Spark, a cluster computing framework, was employed to efficiently process the large corpus of tweets, including geo-decoding the geotags, extracting low-level (primary) features, and computing the statistical features. The results show strong correlations between publicly available health indices and the features extracted from geospatially coded Twitter data. The statistical features achieved higher correlation coefficients than the aggregated ones, suggesting the potential and applicability of the proposed features in a wide spectrum of applications in population-level data analytics. Prediction performance also improved when temporal information was used.
This demonstrates that the real-time analysis of social media data can provide timely insights into the health of populations.

History

Journal

Future generation computer systems

Volume

110

Pagination

620 - 628

Publisher

Elsevier

Location

Amsterdam, The Netherlands

ISSN

0167-739X

Language

eng

Publication classification

C1 Refereed article in a scholarly journal

Copyright notice

2018, Elsevier