HTSeq-Hadoop: Extending HTSeq for Massively Parallel Sequencing Data Analysis Using Hadoop.
Hadoop is a convenient framework in e-Science that enables scalable distributed data analysis. In molecular biology, next-generation sequencing produces vast amounts of data and requires flexible frameworks for constructing analysis pipelines. We extend the popular HTSeq package into the Hadoop realm by introducing massively parallel versions of short-read quality assessment, as well as functionality to count the genes mapped by short reads.
We use the Hadoop Streaming library, which allows the components to run on both Hadoop and regular Linux systems, and evaluate their performance in two different execution environments: a single node on a computational cluster and a Hadoop cluster in a private cloud. We compare the implementations with Apache Pig, showing improved runtime performance for our methods. We also integrate the components into the graphical platform Cloudgene to simplify user interaction.
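To illustrate why Hadoop Streaming components can run unchanged on a regular Linux system, here is a minimal sketch of a gene-counting mapper and reducer that communicate only through standard input and output. It is not the authors' implementation: the input format (one tab-separated `read_id<TAB>gene_id` line per assigned read), the function names `map_reads` and `reduce_counts`, and the `map`/`reduce` command-line arguments are all assumptions made for illustration.

```python
import sys
from itertools import groupby


def map_reads(lines):
    """Mapper: emit 'gene_id<TAB>1' for each assigned read.

    Assumed input: one 'read_id<TAB>gene_id' line per read (hypothetical format).
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[1]:
            continue  # skip malformed lines and reads without a gene
        yield f"{fields[1]}\t1"


def reduce_counts(lines):
    """Reducer: sum the counts for each gene.

    Relies on the input being sorted by key, which Hadoop Streaming
    provides between the map and reduce phases (and `sort` provides locally).
    """
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for gene, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{gene}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical CLI: 'map' runs the mapper, anything else runs the reducer.
    step = map_reads if sys.argv[1] == "map" else reduce_counts
    for record in step(sys.stdin):
        print(record)
```

Saved under a hypothetical name such as `count_genes.py`, the same script could be tested on a single Linux node with `cat reads.tsv | python count_genes.py map | sort | python count_genes.py reduce`, and then passed to the Hadoop Streaming jar through its `-mapper` and `-reducer` options without modification.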
Similar IEEE Project Titles
- A virtual machine based task scheduling approach to improving data locality for virtualized Hadoop.
- Dynamic data rebalancing in Hadoop.
- Performance evaluation of HDD and SSD on 10GigE, IPoIB & RDMA-IB with Hadoop Cluster Performance Benchmarking System.
- Workload Analysis, Implications, and Optimization on a Production Hadoop Cluster: A Case Study on Taobao.
- Job scheduling in Hadoop with Shared Input Policy and RAMDISK.