Abstract
Recent advances in DNA sequencing technology have triggered next-generation sequencing (NGS) research at full scale. Big Data (BD) is becoming the main driver in analyzing these large-scale bioinformatic data. However, this complicated process has become a system bottleneck, requiring an amalgamation of scalable approaches to deliver the needed performance and hide the deployment complexity. Cutting-edge scientific workflows can robustly address these challenges. This paper presents SparkFlow, a Spark-based alignment workflow for massive NGS analysis over Singularity containers. SparkFlow is highly scalable, reproducible, and capable of parallelizing computation by utilizing data-level parallelism and load-balancing techniques in HPC and Cloud environments. The proposed workflow capitalizes on benchmarking two state-of-the-art NGS workflows, i.e., BaseRecalibrator and ApplyBQSR. SparkFlow accelerates large-scale cancer genomic analysis by scaling vertically (HyperThreading) and horizontally (on-demand provisioning). Our results demonstrate an inevitable trade-off between the targeted applications and processor architecture. SparkFlow achieves a decisive improvement in NGS computation performance, throughput, and scalability while maintaining deployment complexity. The paper's findings aim to pave the way for a wide range of revolutionary enhancements and future trends within the High-performance Data Analytics (HPDA) genome analysis realm.
Original language | English |
---|---|
Title of host publication | 2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid) |
Publisher | IEEE |
Pages | 1007-1016 |
ISBN (Electronic) | 9781665499569 |
ISBN (Print) | 9781665499576 |
Publication status | Published - 19 Jul 2022 |
Event | Workshop on Clusters, Clouds and Grids for Life Sciences (CCGrid Life 2022) - Taormina, Italy Duration: 16 May 2022 → … http://lsgc.org/ccgrid-life/ |
Workshop
Workshop | Workshop on Clusters, Clouds and Grids for Life Sciences (CCGrid Life 2022) |
---|---|
Abbreviated title | CCGrid Life 2022 |
Country/Territory | Italy |
City | Taormina |
Period | 16/05/22 → … |
Keywords
- Big data
- Scientific workflow
- HPC
- Genome analysis
- Apache Spark
- High-performance data analytics