Built-in Big Data Applications Using Restful Web Services
Apache Hive is a widely used data warehousing and analysis tool. Developers write SQL-like Hive queries, which are converted into MapReduce programs that run on a cluster. Despite Hive's popularity, there is little research on comparing and diagnosing the performance of its queries. Part of the reason is that the instrumentation techniques used to monitor execution cannot be applied to the intermediate MapReduce code generated from a Hive query. Because the generated MapReduce code is hidden from developers, runtime logs are the only place where a developer can get a glimpse of the actual execution. An automatic tool that extracts information from the logs and generates reports is therefore essential for understanding query execution behavior.
This paper presents a tool that builds the execution profile of individual Hive queries by extracting information from Hive and Hadoop logs. The profile consists of detailed information about the MapReduce jobs, tasks, and attempts belonging to a query. It is stored as a JSON document in MongoDB and can be retrieved to generate reports as charts or tables. The profiling tool was tested in several experiments on AWS with TPC-H datasets and queries. The results show that it can assist developers in comparing Hive queries written in different formats, run on different datasets, and configured with different parameters. It can also compare tasks and attempts within the same job to diagnose performance issues.
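To make the shape of such a profile concrete, the following is a minimal sketch of a nested query, job, task, and attempt structure serialized to JSON, together with a simple comparison of attempts within one job. It is not the tool's actual schema: the class names, field names, identifiers, durations, and the median-based cutoff in `slow_attempts` are all assumptions chosen for illustration.

```python
import json
from dataclasses import dataclass, field, asdict
from statistics import median
from typing import List


# All class and field names below are hypothetical, not the paper's schema.
@dataclass
class Attempt:
    attempt_id: str
    duration_s: float


@dataclass
class Task:
    task_id: str
    kind: str  # "MAP" or "REDUCE"
    attempts: List[Attempt] = field(default_factory=list)


@dataclass
class Job:
    job_id: str
    tasks: List[Task] = field(default_factory=list)


@dataclass
class QueryProfile:
    query_id: str
    jobs: List[Job] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses through nested dataclasses and lists.
        return json.dumps(asdict(self), indent=2)


def slow_attempts(job: Job, factor: float = 2.0) -> List[str]:
    """Return ids of attempts that ran longer than factor x the job's median.

    The median-based cutoff is an illustrative heuristic, not the paper's method.
    """
    durations = [a.duration_s for t in job.tasks for a in t.attempts]
    if not durations:
        return []
    cutoff = factor * median(durations)
    return [
        a.attempt_id
        for t in job.tasks
        for a in t.attempts
        if a.duration_s > cutoff
    ]


# Made-up example data: three map tasks, one of which is much slower.
job = Job("job_0001", [
    Task("task_0001_m_000000", "MAP", [Attempt("attempt_0001_m_000000_0", 12.0)]),
    Task("task_0001_m_000001", "MAP", [Attempt("attempt_0001_m_000001_0", 14.0)]),
    Task("task_0001_m_000002", "MAP", [Attempt("attempt_0001_m_000002_0", 55.0)]),
])
profile = QueryProfile("query_q1", [job])
document = profile.to_json()   # one JSON document per query
flagged = slow_attempts(job)   # attempts worth a closer look
```

Keeping the whole query in one nested document mirrors the one-document-per-query storage described above; the dictionary produced by `asdict(profile)` is the kind of value a document-store driver would accept.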
This work is licensed under a Creative Commons Attribution 3.0 License.