Deakin University
Browse

File(s) under permanent embargo

Query-driven frequent co-occurring term computation over relational data using MapReduce

journal contribution
posted on 2015-01-01, 00:00 authored by Jianxin LiJianxin Li, Chengfei Liu, Rui Zhou, Jeffrey Xu Yu
Given a keyword query q and a large structured, traditional keyword search may return a large number of relevant results to users, which leads to a frustrating procedure for the users to select their interesting results. To help users understand the data to be searched, in this work we investigate the problem of frequent co-occurring terms (FCTs) in large relational data. By returning a set of most FCTs with the given keywords, we can provide a chance for users to see a big picture of relevant data information. The investigation of FCT problem is also one of the fundamental building blocks of data mining because the discovered FCTs can be employed to analyze the topics or contexts of user interest. Although the problem of FCTs computation was proposed and investigated in Tao and Yu [(2009) Finding Frequent Co-Occurring Terms in Relational Keyword Search. 12th Int. Conf. Extending Database Technology EDBT, Saint-Petersburg, Russia, March 23–26, pp. 839–850. ACM, New York, USA], further investigation is needed to improve the performance because FCT computation is very expensive. Especially for the increasing volume of data, the centralized approach in Tao and Yu [(2009) Finding Frequent Co-Occurring Terms in Relational Keyword Search. 12th Int. Conf. Extending Database Technology EDBT, Saint-Petersburg, Russia, March 23–26, pp. 839–850. ACM, New York, USA] may incur a big challenge on the efficiency of performing an FCT computation. To do this, we investigate how to perform parallel FCT computation using MapReduce which is a well-accepted framework for data-intensive applications over clusters of computers. We design an effective mapping mechanism that exploits the approximately maximal workload of FCT computation for balancing the computational cost of each processor, while reducing the shuffling cost and avoiding the data-skewness. Analytical and experimental evaluations demonstrate the efficiency and scalability of our proposed approach using TPC-H benchmark datasets with different sizes.

History

Journal

Computer journal

Volume

58

Issue

3

Pagination

497 - 513

Publisher

Oxford University Press

Location

Oxford, Eng.

Language

eng

Publication classification

C1.1 Refereed article in a scholarly journal

Copyright notice

2014, The British Computer Society