Map Reduce Plus

Simplified Processing of Unstructured Data on Large Scale Computing Clouds
MapReduce and Hadoop, its open-source implementation from Apache, are popularly seen as a clean, convenient framework for parallelizing the processing of large amounts of arbitrary data. However, we argue that while MapReduce affords flexibility for processing unstructured data, it implicitly assumes certain properties of the input data, intermediate results, and application semantics, and these assumptions limit its performance and utility.

Introduction

 

Our analysis and experimentation have led us to believe that MapReduce suffers from two key limitations:

(1) it does not perform well when intermediate results have skew;

(2) its two-staged architecture has no provision for estimating results even when the data is structured.

Our hypothesis is that both of these limitations can be addressed with a few changes to the Master and Worker implementation of MapReduce.

Our improvements over the original architecture will maintain the clean, convenient abstraction of MapReduce. The system will offer an architectural shift, interleaving the map and reduce stages to address data skew. It will also process the input data iteratively to estimate results. MapReduce++ (MR+) is also intended to neutralize the heterogeneity of real-life clusters by allowing resource-aware scheduling.
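The idea of interleaving map and reduce while estimating results iteratively can be illustrated with a small single-process sketch. This is not the MR+ implementation: the word-count workload, the function names `map_words` and `interleaved_word_count`, the `chunk_size` parameter, and the simple linear scaling used for the estimate are all assumptions made for illustration.

```python
from collections import Counter
from typing import Iterator, List


def map_words(record: str) -> Counter:
    """Map step (hypothetical workload): count the words in one input record."""
    return Counter(record.lower().split())


def interleaved_word_count(records: List[str], chunk_size: int) -> Iterator[Counter]:
    """Map and reduce one chunk at a time, yielding an estimate after each chunk.

    The estimate scales the partial counts by (total records / records seen),
    which assumes the records seen so far are representative of the rest.
    """
    partial: Counter = Counter()
    total = len(records)
    for start in range(0, total, chunk_size):
        # Map this chunk and reduce it straight into the running result,
        # rather than waiting for every map task to finish first.
        for record in records[start:start + chunk_size]:
            partial.update(map_words(record))
        seen = min(start + chunk_size, total)
        scale = total / seen
        yield Counter({word: round(count * scale) for word, count in partial.items()})


records = ["a b a", "b c", "a a", "c a b"]
estimates = list(interleaved_word_count(records, chunk_size=2))
final = estimates[-1]  # after the last chunk the scale is 1, so this is exact
```

In this sketch an approximate answer is available after the first chunk, and it converges to the exact counts once all input has been processed. A real cluster would additionally need to distribute the chunks and handle skewed keys, which this example does not attempt.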

Documents

 

Key Milestones and Deliverables Submitted

People

 

  • Ibrahim Ghaznavi
  • Farhan Ul Haq
  • Furqan Baig
  • Momina Azam
  • Talal Ahmed
  • Usama Mahmood

Supported By:
 
 

National ICT RnD Fund (Government of Punjab)