Supporting Queries and Analyses of Large-Scale Social Media Data with Customizable and Scalable Indexing Techniques over NoSQL Databases
Social media data analysis demonstrates two special characteristics in Big Data processing. First, most analyses focus on data subsets related to specific social events or activities instead of the whole dataset. Second, analysis workflows consist of multiple stages, and the algorithms applied in each stage may use different computation and communication patterns depending on the processing framework. This paper presents our efforts in supporting the data storage and processing requirements that arise from these characteristics. To support efficient queries over target data subsets, we propose a general, customizable, and scalable indexing framework that can be built over distributed NoSQL databases.
This framework allows users to define customized index structures suited to their query patterns against social media data, and supports scalable indexing of both historical and streaming data. We implement this framework on HBase and name it IndexedHBase. Building on IndexedHBase, we develop a distributed analysis stack based on YARN to support analysis algorithms that use different processing frameworks, such as Hadoop MapReduce, Harp, and Giraph. This analysis stack is used to host the Truthy social media data observatory, and we have applied the customized index structures to support both query evaluation and sophisticated analysis algorithms. Performance tests show that our solutions outperform implementations using both direct raw data scans and current indexing mechanisms in existing NoSQL databases.
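To make the idea of a customizable index structure concrete, the following is a minimal sketch and not the IndexedHBase implementation. It assumes an in-memory dictionary standing in for an index table, and every name in it (`CustomIndex`, `hashtag_entries`, and the record fields `id`, `hashtags`, and `created_at`) is hypothetical. A user-supplied function decides which index entries a record produces, and the same indexing path serves both a historical batch and records arriving one at a time.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# A tweet-like record; the field names are illustrative assumptions.
Record = Dict[str, object]
# An index entry: (index key, record id, value stored with the entry).
Entry = Tuple[str, str, int]


class CustomIndex:
    """In-memory stand-in for an index table (this is not the HBase API)."""

    def __init__(self, name: str, extract: Callable[[Record], Iterable[Entry]]):
        self.name = name
        # User-defined rule that turns one record into zero or more entries.
        self.extract = extract
        # index key -> {record id -> stored value (here, a timestamp)}
        self.table: Dict[str, Dict[str, int]] = defaultdict(dict)

    def index(self, record: Record) -> None:
        """Index a single record, e.g. one that arrives from a stream."""
        for key, record_id, value in self.extract(record):
            self.table[key][record_id] = value

    def index_batch(self, records: Iterable[Record]) -> None:
        """Index a collection of historical records."""
        for record in records:
            self.index(record)

    def query(self, key: str, start: int, end: int) -> List[str]:
        """Return sorted ids of records under `key` with start <= value <= end."""
        entries = self.table.get(key, {})
        return sorted(rid for rid, ts in entries.items() if start <= ts <= end)


def hashtag_entries(record: Record) -> Iterable[Entry]:
    """Hypothetical customization: one entry per hashtag, storing creation time."""
    for tag in record["hashtags"]:
        yield (tag.lower(), record["id"], record["created_at"])


if __name__ == "__main__":
    hashtag_index = CustomIndex("hashtag-index", hashtag_entries)
    # Historical data is loaded in one batch.
    hashtag_index.index_batch([
        {"id": "t1", "hashtags": ["#Election"], "created_at": 100},
        {"id": "t2", "hashtags": ["#election", "#news"], "created_at": 150},
    ])
    # A streaming record goes through the same indexing rule.
    hashtag_index.index({"id": "t3", "hashtags": ["#election"], "created_at": 300})
    # Select only the subset tied to one event within a time window.
    print(hashtag_index.query("#election", 100, 200))
```

Storing the timestamp alongside each entry lets a query narrow the result to an event's time window without reading the raw records, which is the kind of subset selection the abstract describes. A different query pattern would only require a different extraction function.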