A novel parallel algorithm for frequent itemsets mining in massive small files datasets

Xia, D; Rong, Z; Zhou, Y; Li, Y; Shen, Y; Zhang, Zili

journal contribution

posted on 2014-01-01, 00:00 authored by D Xia, Z Rong, Y Zhou, Y Li, Y Shen, Zili Zhang

In big data analysis, frequent itemsets mining plays a key role in mining associations, correlations and causality. Some traditional frequent itemsets mining algorithms cannot handle massive small files datasets effectively: they suffer from high memory cost, high I/O overhead, and low computing performance. In this paper, we therefore propose a novel parallel frequent itemsets mining algorithm based on the FP-Growth algorithm and discuss its applications. First, we introduce a small files processing strategy for massive small files datasets to compensate for the low read-write speed and low processing efficiency in Hadoop. Second, we use MapReduce to redesign the FP-Growth algorithm for parallel computing, thereby improving the overall performance of frequent itemsets mining. Finally, we apply the proposed algorithm to the association analysis of data from the national college entrance examination and admission of China. The experimental results show that the proposed algorithm is feasible and valid, achieving a good speedup and a higher mining efficiency, and that it can meet the practical requirements of frequent itemsets mining for massive small files datasets. © 2014 ISSN 2185-2766.
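The two ideas named in the abstract, grouping many small files into larger input splits and counting item support in a map/reduce style, can be illustrated with a minimal single-process sketch. This is not the authors' implementation, whose details are not given in the abstract. The function names (`batch_small_files`, `map_count`, `reduce_counts`, `frequent_items`), the `batch_size` parameter, and the toy data are all assumptions made for illustration. Only the first counting pass that parallel FP-Growth variants typically begin with is shown; FP-tree construction and mining are omitted.

```python
from collections import Counter
from functools import reduce
from itertools import islice


def batch_small_files(files, batch_size):
    """Group many small 'files' (lists of transactions) into larger splits.

    Hypothetical stand-in for a small-files strategy: each yielded split
    holds the transactions of up to `batch_size` input files.
    """
    it = iter(files)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield [txn for f in chunk for txn in f]


def map_count(split):
    """Map step: count each item at most once per transaction in one split."""
    counts = Counter()
    for txn in split:
        counts.update(set(txn))
    return counts


def reduce_counts(a, b):
    """Reduce step: merge two partial support counts."""
    return a + b


def frequent_items(files, min_support, batch_size=2):
    """Return items whose global support count is at least `min_support`."""
    splits = batch_small_files(files, batch_size)
    total = reduce(reduce_counts, map(map_count, splits), Counter())
    return {item: n for item, n in total.items() if n >= min_support}


# Toy data (an assumption): three small files holding five transactions in total.
small_files = [
    [["a", "b"], ["b", "c"]],
    [["a", "b", "c"]],
    [["b"], ["a", "c"]],
]
result = frequent_items(small_files, min_support=3)
print(result)
```

In a real Hadoop job the map and reduce steps would run on separate workers. The sketch runs them sequentially only to show how partial counts from batched splits combine into global support counts.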

History

Journal

ICIC express letters, part B: applications

Volume

5

Issue

2

Pagination

459 - 466

Publisher

ICIC International

Location

China

ISSN

2185-2766

Language

eng

Publication classification

C Journal article; C1 Refereed article in a scholarly journal

Copyright notice

2014, ICIC International

Keywords

Big data analysis; Frequent itemsets mining; Hadoop; MapReduce; Parallel FP-growth; Small files problem
