A parallel random forest algorithm for big data in a spark cloud computing environment

Chen, Jianguo, Li, Kenli, Tang, Zhuo, Bilal, Kashif, Yu, Shui, Weng, Chuliang and Li, Keqin 2017, A parallel random forest algorithm for big data in a spark cloud computing environment, IEEE transactions on parallel and distributed systems, vol. 28, no. 4, pp. 919-933, doi: 10.1109/TPDS.2016.2603511.


Title A parallel random forest algorithm for big data in a spark cloud computing environment
Author(s) Chen, Jianguo
Li, Kenli
Tang, Zhuo
Bilal, Kashif
Yu, ShuiORCID iD for Yu, Shui orcid.org/0000-0003-4485-6743
Weng, Chuliang
Li, Keqin
Journal name IEEE transactions on parallel and distributed systems
Volume number 28
Issue number 4
Start page 919
End page 933
Total pages 15
Publisher IEEE
Place of publication Piscataway, NJ
Publication date 2017-04
ISSN 1045-9219
Keyword(s) Apache Spark
Big Data
Cloud Computing
Data Parallel
Random Forest
Task Parallel
Distributed databases
Decision trees
Science & Technology
Computer Science, Theory & Methods
Engineering, Electrical & Electronic
Computer Science
Summary With the emergence of the big data age, the issue of how to obtain valuable knowledge from a dataset efficiently and accurately has attracted increasing attention from both academia and industry. This paper presents a Parallel Random Forest (PRF) algorithm for big data on the Apache Spark platform. The PRF algorithm is optimized based on a hybrid approach combining data-parallel and task-parallel optimization. From the perspective of data-parallel optimization, a vertical data-partitioning method is performed to reduce the data communication cost effectively, and a data-multiplexing method is performed to allow the training dataset to be reused and diminish the volume of data. From the perspective of task-parallel optimization, a dual parallel approach is carried out in the training process of RF, and a task Directed Acyclic Graph (DAG) is created according to the parallel training process of PRF and the dependencies among the Resilient Distributed Dataset (RDD) objects. Then, different task schedulers are invoked for the tasks in the DAG. Moreover, to improve the algorithm's accuracy for large, high-dimensional, and noisy data, we perform a dimension-reduction approach in the training process and a weighted voting approach in the prediction process prior to parallelization. Extensive experimental results indicate the superiority and notable advantages of the PRF algorithm over the relevant algorithms implemented by Spark MLlib and other studies in terms of classification accuracy, performance, and scalability.
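The weighted voting step mentioned in the abstract can be illustrated with a minimal sketch. Note this is an assumption-based illustration, not the paper's exact scheme: here each tree's vote is simply scaled by a per-tree weight (e.g. its validation accuracy), and the function names are hypothetical.

```python
# Hedged sketch of ensemble prediction with weighted voting.
# The per-tree weights (e.g. out-of-bag or validation accuracy) are an
# illustrative assumption, not the weighting defined in the PRF paper.
from collections import defaultdict

def weighted_vote(tree_predictions, tree_weights):
    """Combine per-tree class predictions using per-tree weights.

    tree_predictions: list of class labels, one per tree
    tree_weights: list of floats, one per tree
    """
    scores = defaultdict(float)
    for label, weight in zip(tree_predictions, tree_weights):
        scores[label] += weight
    # The predicted class is the one with the highest accumulated weight.
    return max(scores, key=scores.get)

# Example: two moderately accurate trees outvote one strong tree.
preds = ["A", "B", "B"]
weights = [0.9, 0.6, 0.7]
print(weighted_vote(preds, weights))  # "B" (0.6 + 0.7 > 0.9)
```

The design point is that, unlike majority voting, trees that performed better on held-out data contribute more to the final prediction, which is what makes the approach more robust on noisy, high-dimensional data.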
Language eng
DOI 10.1109/TPDS.2016.2603511
Field of Research 080109 Pattern Recognition and Data Mining
Socio Economic Objective 970108 Expanding Knowledge in the Information and Computing Sciences
HERDC Research category C1 Refereed article in a scholarly journal
ERA Research output type C Journal article
Copyright notice ©2016, IEEE
Persistent URL http://hdl.handle.net/10536/DRO/DU:30086984

Unless expressly stated otherwise, the copyright for items in DRO is owned by the author, with all rights reserved.

Citation counts: TR Web of Science Citation Count: Cited 53 times in TR Web of Science
Scopus Citation Count: Cited 0 times in Scopus
Access Statistics: 264 Abstract Views, 2 File Downloads
Created: Wed, 12 Oct 2016, 08:55:25 EST

Every reasonable effort has been made to ensure that permission has been obtained for items included in DRO. If you believe that your rights have been infringed by this repository, please contact drosupport@deakin.edu.au.