
PoBery: Possibly-complete Big Data Queries with Probabilistic Data Placement and Scanning

journal contribution
posted on 2021-08-01, 00:00 authored by Jie Song, Qiang He, Feifei Chen, Ye Yuan, Ge Yu
In big data query processing, there is a trade-off between query accuracy and query efficiency. For example, sampling-based query approaches trade off query completeness for efficiency. In this article, we argue that query performance can be significantly improved by slightly lowering the possibility of query completeness, that is, the chance that a query is complete. To quantify this possibility, we define a new concept, the Probability of query Completeness (hereinafter referred to as PC). For example, if a query is executed 100 times, PC = 0.95 guarantees that there are no more than 5 incomplete results among the 100 results. Leveraging probabilistic data placement and scanning, we trade off PC for query performance. We propose PoBery (POssibly-complete Big data quERY), a method that supports neither complete queries nor incomplete queries, but possibly-complete queries. Experimental results on HiBench show that PoBery can significantly accelerate queries while ensuring the PC. Specifically, it is guaranteed that the percentage of complete queries is larger than the given PC confidence. Through comparison with state-of-the-art key-value stores, we show that while Drill-based PoBery performs as fast as Drill on complete queries, it is on average 1.7×, 1.1×, and 1.5× faster than Drill, Impala, and Hive, respectively, on possibly-complete queries.
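
The PC example in the abstract (100 executions, at most 5 incomplete results for PC = 0.95) can be illustrated with a minimal sketch. This is not PoBery's implementation; the function names `empirical_pc` and `meets_target` and the sample data are hypothetical, and the sketch only checks an observed completeness rate against a target PC.

```python
from typing import Sequence


def empirical_pc(complete_flags: Sequence[bool]) -> float:
    """Fraction of query executions whose result was complete.

    Hypothetical helper: each flag records whether one execution
    of the same query returned a complete result.
    """
    if not complete_flags:
        raise ValueError("need at least one query execution")
    return sum(complete_flags) / len(complete_flags)


def meets_target(complete_flags: Sequence[bool], target_pc: float) -> bool:
    """True if the observed completeness rate is at least the target PC."""
    return empirical_pc(complete_flags) >= target_pc


# Assumed sample data: 100 executions, 4 of them incomplete.
flags = [True] * 96 + [False] * 4
print(empirical_pc(flags))        # 0.96
print(meets_target(flags, 0.95))  # True
```

With 6 incomplete results out of 100, the same check would fail for a target PC of 0.95.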

History

Journal

ACM/IMS Transactions on Data Science

Volume

2

Article number

23

Pagination

1-28

Location

New York, N.Y.

Open access

  • Yes

ISSN

2691-1922

Language

eng

Publication classification

C1 Refereed article in a scholarly journal

Issue

3

Publisher

Association for Computing Machinery
