SparkFlow: towards high-performance data analytics for Spark-based genome analysis

Rosa Filgueira, Feras M. Awaysheh, Adam Carter, Darren J. White, Omer Rana

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Recent advances in DNA sequencing technology have triggered next-generation sequencing (NGS) research at full scale. Big Data (BD) is becoming the main driver in analyzing these large-scale bioinformatic data. However, this complicated analysis process has become the system bottleneck, requiring a combination of scalable approaches to deliver the needed performance and hide the deployment complexity. Cutting-edge scientific workflows can robustly address these challenges. This paper presents SparkFlow, a Spark-based alignment workflow for massive NGS analysis over Singularity containers. SparkFlow is highly scalable and reproducible, and it parallelizes computation by using data-level parallelism and load-balancing techniques in HPC and Cloud environments. The proposed workflow builds on benchmarking two state-of-the-art NGS workflows, i.e., BaseRecalibrator and ApplyBQSR. SparkFlow accelerates large-scale cancer genomic analysis by scaling vertically (HyperThreading) and horizontally (on-demand provisioning). Our results demonstrate an inevitable trade-off between the targeted applications and the processor architecture. SparkFlow achieves a decisive improvement in NGS computation performance, throughput, and scalability without increasing deployment complexity. The paper's findings aim to pave the way for a wide range of revolutionary enhancements and future trends in High-Performance Data Analytics (HPDA) for genome analysis.
Original language: English
Title of host publication: 2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid)
ISBN (Electronic): 9781665499569
ISBN (Print): 9781665499576
Publication status: Published - 19 Jul 2022
Event: Workshop on Clusters, Clouds and Grids for Life Sciences (CCGrid Life 2022) - Taormina, Italy
Duration: 16 May 2022 → …


Workshop: Workshop on Clusters, Clouds and Grids for Life Sciences (CCGrid Life 2022)
Abbreviated title: CCGrid Life 2022
Period: 16/05/22 → …

  • Big data
  • Scientific workflow
  • HPC
  • Genome analysis
  • Apache Spark
  • High-performance data analytics

