Cluster-Based SNP Calling on Large-Scale Genome Sequencing Data
The volume of available genetic data is increasing rapidly with new high-throughput, low-cost technologies. While this data has enormous potential to drive scientific and medical advances, such volumes cannot be processed without parallelism. Most existing work on analyzing this data has focused on the accuracy of the analyses rather than on performance: the algorithms are either serial or rely on very simple, non-scalable parallelization techniques. In this paper, we address the problem of identifying variants in large-scale genome sequencing data. After examining different possible approaches, we identify one that does not require any communication.
However, achieving load balance is non-trivial because of the data-dependent nature of the processing. We develop three scheduling schemes: a dynamic scheme, which reduces scheduling overheads by using two different chunk sizes; a static scheme, which uses a pre-processing step to estimate workloads; and a combined scheme. In evaluating our schemes, we find that a pre-processing step (histogram computation) to estimate workloads is very effective, and thus our combined scheme gives the best results. With a 32× increase in the number of cores, we see approximately a 24× performance improvement, establishing that scalable processing of genomic data is possible. We also compare against an implementation based on Hadoop and show that, with our combined scheme, our implementation outperforms the Hadoop-based one.
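The two scheduling ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no algorithmic detail, so everything here is an assumption made for illustration. That includes the function names (`read_histogram`, `static_assign`, `two_size_chunks`), the use of read counts per fixed-size genomic region as the workload estimate, the greedy heaviest-first assignment, and the rule that large chunks are handed out first and small chunks cover the tail.

```python
import heapq
from collections import Counter


def read_histogram(read_positions, region_size):
    """Count reads per fixed-size genomic region.

    Assumption: read count is used as a proxy for the processing cost of a region.
    """
    return Counter(pos // region_size for pos in read_positions)


def static_assign(histogram, num_workers):
    """Static scheme (hypothetical): assign regions to workers before processing.

    Regions are placed heaviest first, each on the currently least-loaded worker.
    """
    heap = [(0, w) for w in range(num_workers)]  # (estimated load, worker id)
    heapq.heapify(heap)
    assignment = {w: [] for w in range(num_workers)}
    ordered = sorted(histogram.items(), key=lambda kv: (-kv[1], kv[0]))
    for region, count in ordered:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(region)
        heapq.heappush(heap, (load + count, worker))
    loads = {w: sum(histogram[r] for r in regions)
             for w, regions in assignment.items()}
    return assignment, loads


def two_size_chunks(num_regions, large, small, small_fraction=0.25):
    """Dynamic scheme (hypothetical): yield (start, end) region ranges.

    Large chunks keep scheduling overhead low early on; small chunks near the
    end let idle workers pick up the remaining work more evenly.
    """
    tail_start = int(num_regions * (1 - small_fraction))
    start = 0
    while start < num_regions:
        size = large if start < tail_start else small
        end = min(start + size, num_regions)
        yield (start, end)
        start = end


if __name__ == "__main__":
    positions = [0, 1, 2, 3, 4, 100, 200, 201, 202, 300]  # toy read start positions
    hist = read_histogram(positions, region_size=100)
    assignment, loads = static_assign(hist, num_workers=2)
    print(assignment, loads)
    print(list(two_size_chunks(10, large=4, small=1)))
```

In this sketch, a combined scheme would use the histogram to size or order the chunks that workers then request dynamically. How the paper actually combines the two is not stated in the abstract.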
Similar IEEE Project Titles
- Omni-Kernel: An Operating System Architecture for Pervasive Monitoring and Scheduling
- Analysing Hadoop performance in a multi-user IaaS Cloud
- Design and Evaluation of Network-Levitated Merge for Hadoop Acceleration
- Perldoop: Efficient execution of Perl scripts on Hadoop clusters
- HTSeq-Hadoop: Extending HTSeq for Massively Parallel Sequencing Data Analysis Using Hadoop