Deakin University

Toward High‐Availability Distributed Stream Computing Systems via Checkpoint Adaptation

Version 2 2025-07-11, 00:35
Version 1 2025-07-09, 04:19
journal contribution
posted on 2025-07-11, 00:35 authored by Dawei Sun, Jia Peng, Ting Zhu, Jonathan Kua, Shang Gao, Rajkumar Buyya
ABSTRACT: Fault tolerance strategies for distributed stream computing systems are becoming more important as the diversity of failures increases. Checkpointing is considered a general and efficient method for ensuring fault tolerance. However, choosing the checkpoint interval is a challenge: shorter intervals lead to higher overhead, while longer intervals extend fault recovery time. Optimizing the checkpoint interval is therefore crucial for the efficient operation of streaming applications. Optimal checkpoint interval settings have received relatively limited exploration and analysis in the context of stream computing, and many existing works adjust the interval based on a single factor. This article proposes a high-availability checkpoint adaptation strategy, named Ca‐Stream, which considers multiple factors when adjusting checkpoint intervals. Specifically, it addresses the following aspects: (1) Using linear regression to predict the system's fault rate and dynamically adjusting the checkpoint interval based on these predictions. (2) Monitoring CPU time and memory consumption per task to dynamically trigger checkpoints, achieving high reliability, especially in resource‐constrained scenarios. (3) Measuring task execution times on nodes and the volume of input data per task to identify slow tasks within the cluster. Experiments conducted on a Flink system demonstrate Ca‐Stream's benefits: compared to Flink's approaches, it reduces checkpoint consumption time by over 38%, system recovery latency by 33%, CPU occupancy by up to 47%, and memory occupancy by 37%.
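
As a rough illustration of aspect (1), the sketch below fits a straight line to a short history of observed fault rates, extrapolates one window ahead, and turns the prediction into a checkpoint interval. It is not the Ca‐Stream algorithm, which this record does not specify. The mapping from fault rate to interval uses Young's classic approximation, interval = sqrt(2 × checkpoint cost / fault rate), swapped in here as an assumption. All function names, the clamping bounds, and the sample numbers are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_fault_rate(history):
    """Extrapolate the fault rate (faults per second) for the next window.

    `history` holds one observed rate per past monitoring window, oldest first,
    and needs at least two entries. The result is floored at a tiny positive
    value so the interval formula below never divides by zero.
    """
    xs = list(range(len(history)))
    slope, intercept = fit_line(xs, history)
    return max(slope * len(history) + intercept, 1e-9)


def checkpoint_interval(fault_rate, checkpoint_cost_s, min_s=1.0, max_s=600.0):
    """Young's approximation (an assumption here, not taken from the article).

    The result is clamped to the hypothetical bounds [min_s, max_s] seconds.
    """
    interval = math.sqrt(2.0 * checkpoint_cost_s / fault_rate)
    return min(max(interval, min_s), max_s)


# Hypothetical history: the fault rate rises steadily over four windows.
history = [0.001, 0.002, 0.003, 0.004]
predicted = predict_fault_rate(history)
interval = checkpoint_interval(predicted, checkpoint_cost_s=2.5)
print(f"predicted fault rate: {predicted:.4f}/s, interval: {interval:.1f} s")
```

With this mapping, a rising predicted fault rate shortens the interval and a falling one lengthens it, which is the overhead versus recovery-time trade-off the abstract describes.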

Location

London, Eng.

Open access

  • No

Language

eng

Publication classification

C1 Refereed article in a scholarly journal

Journal

Concurrency and Computation: Practice and Experience

Volume

37

Article number

e70171

ISSN

1532-0626

eISSN

1532-0634

Issue

15-17

Publisher

Wiley
