Accelerating Spark with RDMA for Big Data Processing: Early Experiences.
Apache Hadoop MapReduce has been highly successful in processing large-scale, data-intensive batch applications on commodity clusters. However, for low-latency interactive applications and iterative computations, Apache Spark, an emerging in-memory processing framework, has been stealing the limelight. Recent studies have shown that current-generation Big Data frameworks (like Hadoop) cannot efficiently leverage advanced features (e.g., RDMA) on modern clusters with high-performance networks. One of the major bottlenecks is that this middleware is traditionally written with sockets and does not deliver the best performance on modern HPC systems with RDMA-enabled high-performance interconnects.
In this paper, we first assess the opportunities for bringing the benefits of RDMA into the Spark framework. We then propose a high-performance RDMA-based design for accelerating data shuffle in Spark on high-performance networks. Performance evaluations show that our proposed design achieves a 79-83% performance improvement for GroupBy, compared with default Spark running over IP over InfiniBand (IPoIB) FDR on a 128-256 core cluster. We adopt a plug-in-based approach that allows our design to be easily integrated with newer Spark releases. To the best of our knowledge, this is the first design for accelerating Spark with RDMA for Big Data processing.
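To make the shuffle step concrete, the following is a minimal, single-process Python sketch of what a GroupBy shuffle does logically: every map-side partition routes each key-value pair to a reducer chosen by hashing the key, so that all values for a key end up together. This is an illustration only, not the paper's RDMA design or Spark's implementation; the function name `shuffle_group_by` and its parameters are hypothetical. On a real cluster, each bucket is fetched across the network, and that all-to-all transfer is the traffic a socket-based or RDMA-based transport would carry.

```python
from collections import defaultdict


def shuffle_group_by(map_outputs, num_reducers):
    """Group values by key across reducers (hypothetical helper, illustration only).

    map_outputs: list of map-side partitions, each a list of (key, value) pairs.
    num_reducers: number of reduce-side buckets.
    Returns one dict per reducer, mapping key -> list of values.
    """
    # One bucket per reducer; on a cluster each bucket would live on a
    # different node and be filled by network transfers.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for partition in map_outputs:
        for key, value in partition:
            # Hash partitioning: the same key always maps to the same reducer.
            reducer_id = hash(key) % num_reducers
            buckets[reducer_id][key].append(value)
    return [dict(bucket) for bucket in buckets]


if __name__ == "__main__":
    outputs = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
    for reducer_id, bucket in enumerate(shuffle_group_by(outputs, 2)):
        print(reducer_id, bucket)
```

Which reducer receives a given string key can vary between runs because Python randomizes string hashing per process; the grouping of values per key does not.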
Similar IEEE Project Titles
- Big data technologies in support of real time capturing and understanding of electric vehicle customers dynamics.
- Simulating Big Data Clusters for System Planning, Evaluation, and Optimization.
- MRPrePost: A parallel algorithm adapted for mining big data.
- Bwasw-Cloud: Efficient sequence alignment algorithm for two big data with MapReduce.
- Towards a Collective Layer in the Big Data Stack.