
Enhancing HiveQL Engine Using Map-Join-Reduce

Amruta Kulkarni, Shweta Dharmadhikari, M. Emmanuel

Abstract


Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. HiveQL also allows traditional MapReduce programmers to plug in custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL itself. In MapReduce, the programmer only has to write specialized map and reduce functions as part of the job; the framework takes care of the rest. MapReduce, however, suffers from performance issues, mainly because of its sequential data processing strategy, which frequently checkpoints and shuffles intermediate results. Its scalability and efficiency can therefore be improved, and the solution we propose is Map-Join-Reduce, an enhancement of MapReduce for the HiveQL engine. Map-Join-Reduce removes the burden of presenting complex join algorithms to the system. We first propose a filter-join-aggregate mathematical model, which is an extension of the MapReduce model. To support this model, we present a Map-Join-Reduce architecture design for the HiveQL engine. This design sheds light on the query processing strategy of the Hive and Hadoop systems. The benefit of this approach is reduced checkpointing and shuffling of intermediate results, which in turn improves system performance and opens the way for a detailed study of that improvement.
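As a rough illustration of the filter-join-aggregate idea named above, the following minimal Python sketch runs all three steps in one in-memory pass, with no intermediate result written out between the join and the aggregation. It is not the authors' implementation and does not use Hive or Hadoop APIs; the function name, the `orders` and `customers` relations, and their fields are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def filter_join_aggregate(orders, customers, min_amount):
    """Toy filter -> join -> aggregate pipeline (hypothetical example data)."""
    # Join preparation: hash table on the smaller relation, keyed by join key.
    region_by_id = {c["id"]: c["region"] for c in customers}

    totals = defaultdict(int)
    for order in orders:
        # Filter step: discard rows the query does not need.
        if order["amount"] < min_amount:
            continue
        # Join step: probe the hash table; unmatched rows are dropped
        # (inner-join semantics).
        region = region_by_id.get(order["cust_id"])
        if region is None:
            continue
        # Aggregate step: accumulate directly, so the joined rows are
        # never materialized as an intermediate result.
        totals[region] += order["amount"]
    return dict(totals)


orders = [
    {"cust_id": 1, "amount": 50},
    {"cust_id": 2, "amount": 5},
    {"cust_id": 1, "amount": 20},
    {"cust_id": 3, "amount": 40},
    {"cust_id": 4, "amount": 100},
]
customers = [
    {"id": 1, "region": "east"},
    {"id": 3, "region": "west"},
]

result = filter_join_aggregate(orders, customers, min_amount=10)
print(result)
```

In a plain MapReduce plan the join and the aggregation would typically run as separate jobs, with the joined rows checkpointed and shuffled between them; the sketch only shows why fusing the steps avoids that intermediate result.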

Keywords


CPU and Memory Analysis, Hadoop, HiveQL






Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.