Abstract
Recent advances in DNA sequencing technology have triggered next-generation sequencing (NGS) research at full scale. Big Data (BD) is becoming the main driver in analyzing these large-scale bioinformatic data. However, this complicated process has become the system bottleneck, requiring an amalgamation of scalable approaches to deliver the needed performance and hide the deployment complexity. Cutting-edge scientific workflows can robustly address these challenges. This paper presents a Spark-based alignment workflow called SparkFlow for massive NGS analysis over Singularity containers. SparkFlow is highly scalable, reproducible, and capable of parallelizing computation by utilizing data-level parallelism and load-balancing techniques in HPC and Cloud environments. The proposed workflow capitalizes on benchmarking two state-of-the-art NGS workflows, i.e., BaseRecalibrator and ApplyBQSR. SparkFlow realizes the ability to accelerate large-scale cancer genomic analysis by scaling vertically (HyperThreading) and horizontally (on-demand provisioning). Our results demonstrate an inevitable trade-off between the targeted applications and the processor architecture. SparkFlow achieves a decisive improvement in NGS computation performance, throughput, and scalability while maintaining deployment complexity. The paper's findings aim to pave the way for a wide range of revolutionary enhancements and future trends within the High-performance Data Analytics (HPDA) genome analysis realm.
| Original language | English |
|---|---|
| Title of host publication | 2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid) |
| Publisher | IEEE |
| Pages | 1007-1016 |
| ISBN (Electronic) | 9781665499569 |
| ISBN (Print) | 9781665499576 |
| DOIs | |
| Publication status | Published - 19 Jul 2022 |
| Event | Workshop on Clusters, Clouds and Grids for Life Sciences (CCGrid Life 2022), Taormina, Italy; Duration: 16 May 2022 → …; http://lsgc.org/ccgrid-life/ |
Workshop
| Workshop | Workshop on Clusters, Clouds and Grids for Life Sciences (CCGrid Life 2022) |
|---|---|
| Abbreviated title | CCGrid Life 2022 |
| Country/Territory | Italy |
| City | Taormina |
| Period | 16/05/22 → … |
| Internet address | http://lsgc.org/ccgrid-life/ |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
- SDG 3 Good Health and Well-being
Keywords
- Big data
- Scientific workflow
- HPC
- Genome analysis
- Apache Spark
- High-performance data analytics
Fingerprint
Dive into the research topics of 'SparkFlow: towards high-performance data analytics for Spark-based genome analysis'. Together they form a unique fingerprint.