Query-driven frequent co-occurring term computation over relational data using MapReduce

Li, Jianxin; Liu, Chengfei; Zhou, Rui; Yu, Jeffrey Xu

File(s) under permanent embargo

Query-driven frequent co-occurring term computation over relational data using MapReduce

journal contribution

posted on 2015-01-01, 00:00 authored by Jianxin LiJianxin Li, Chengfei Liu, Rui Zhou, Jeffrey Xu Yu

Given a keyword query q and a large structured, traditional keyword search may return a large number of relevant results to users, which leads to a frustrating procedure for the users to select their interesting results. To help users understand the data to be searched, in this work we investigate the problem of frequent co-occurring terms (FCTs) in large relational data. By returning a set of most FCTs with the given keywords, we can provide a chance for users to see a big picture of relevant data information. The investigation of FCT problem is also one of the fundamental building blocks of data mining because the discovered FCTs can be employed to analyze the topics or contexts of user interest. Although the problem of FCTs computation was proposed and investigated in Tao and Yu [(2009) Finding Frequent Co-Occurring Terms in Relational Keyword Search. 12th Int. Conf. Extending Database Technology EDBT, Saint-Petersburg, Russia, March 23–26, pp. 839–850. ACM, New York, USA], further investigation is needed to improve the performance because FCT computation is very expensive. Especially for the increasing volume of data, the centralized approach in Tao and Yu [(2009) Finding Frequent Co-Occurring Terms in Relational Keyword Search. 12th Int. Conf. Extending Database Technology EDBT, Saint-Petersburg, Russia, March 23–26, pp. 839–850. ACM, New York, USA] may incur a big challenge on the efficiency of performing an FCT computation. To do this, we investigate how to perform parallel FCT computation using MapReduce which is a well-accepted framework for data-intensive applications over clusters of computers. We design an effective mapping mechanism that exploits the approximately maximal workload of FCT computation for balancing the computational cost of each processor, while reducing the shuffling cost and avoiding the data-skewness. Analytical and experimental evaluations demonstrate the efficiency and scalability of our proposed approach using TPC-H benchmark datasets with different sizes.

History

Journal

Computer journal

Volume

58

Issue

3

Pagination

497 - 513

Publisher

Oxford University Press

Location

Oxford, Eng.

Publisher DOI

https://doi.org/10.1093/comjnl/bxu090

Language

eng

Publication classification

C1.1 Refereed article in a scholarly journal

Copyright notice

2014, The British Computer Society

Usage metrics

Keywords

data analytics term co-occurrence structured data MapReduce Science & Technology Technology Computer Science, Hardware & Architecture Computer Science, Information Systems Computer Science, Software Engineering Computer Science, Theory & Methods Computer Science

Licence

Exports

RefWorks

BibTeX

Ref. manager

Endnote

DataCite

NLM

DC

File(s) under permanent embargo

Query-driven frequent co-occurring term computation over relational data using MapReduce

History

Journal

Volume

Issue

Pagination

Publisher

Location

Publisher DOI

Language

Publication classification

Copyright notice

Usage metrics

Categories

Keywords

Licence

Exports