
Reusing Values in Big Data Frameworks

P. Surya, M. Abivarsha, S. Gopinath


Big Data has been a very active research topic over the past few years. It is becoming hard to execute data analysis tasks efficiently with traditional data warehouse solutions. Parallel processing platforms, and the parallel dataflow systems running on top of them, are increasingly popular. They have greatly improved the throughput of data analysis tasks, but at the cost of consuming more computation resources: tens or hundreds of nodes run together to execute one task, and a task might still take hours or even days to complete. It is therefore very important to improve resource utilization and computation efficiency. According to research conducted by Microsoft, around 40% of the sub-computations in typical workloads are common to more than one query. Such computation redundancy is a waste of time and resources. Apache Pig is a parallel dataflow system that runs on top of Apache Hadoop, which is a parallel processing platform. Pig/Hadoop is one of the most popular combinations for large-scale data processing. This thesis project proposed a framework that materializes and reuses previous computation results to avoid computation redundancy on top of Pig/Hadoop. The idea came from the materialized view technique in relational databases. Computation outputs were selected and, because of their large size, stored in the Hadoop Distributed File System (HDFS). The execution statistics of the outputs were stored in MySQL Cluster. The framework used a plan matcher and rewriter module to look up in MySQL Cluster the largest computation shared with the incoming query, and to rewrite the query to use the materialized outputs. The framework was evaluated with the TPC-H Benchmark. The results showed that execution time was considerably reduced by avoiding redundant computation. When sub-computations were reused, query execution time was reduced by 85% on average; when whole computations were reused, a query took only around 40 to 55 seconds. In addition, the results showed that the overhead was only around 35% on average.
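The plan matching and rewriting step described above can be illustrated with a small sketch. It is not the framework's implementation: the `Op` plan model, the `fingerprint` hashing scheme, the in-memory `repo` dictionary standing in for the MySQL Cluster lookup, and the `/reuse/join_0001` path are all assumptions made for illustration. The sketch only shows the general idea of finding the largest stored sub-plan and replacing it with a load of its materialized output.

```python
from dataclasses import dataclass
from typing import Dict, Tuple
import hashlib


@dataclass(frozen=True)
class Op:
    """One operator in a simplified logical plan (hypothetical model)."""
    name: str                      # e.g. "LOAD", "FILTER", "JOIN"
    args: str = ""                 # operator parameters, kept as a plain string
    inputs: Tuple["Op", ...] = ()  # upstream operators


def fingerprint(op: Op) -> str:
    """Hash an operator together with everything upstream of it."""
    child_prints = ",".join(fingerprint(i) for i in op.inputs)
    text = f"{op.name}|{op.args}|{child_prints}"
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def size(op: Op) -> int:
    """Number of operators in the sub-plan rooted at op."""
    return 1 + sum(size(i) for i in op.inputs)


def rewrite(op: Op, repo: Dict[str, str]) -> Op:
    """Replace stored sub-plans with a LOAD of their materialized output.

    Matching proceeds top-down, so the first hit on any path is the
    largest stored sub-plan on that path.
    """
    if op.name != "LOAD":
        path = repo.get(fingerprint(op))
        if path is not None:
            return Op("LOAD", path)
    return Op(op.name, op.args, tuple(rewrite(i, repo) for i in op.inputs))


# A toy query plan: GROUP over a JOIN of lineitem with filtered orders.
lineitem = Op("LOAD", "lineitem")
orders = Op("LOAD", "orders")
recent = Op("FILTER", "o_orderdate >= '1995-01-01'", (orders,))
joined = Op("JOIN", "l_orderkey = o_orderkey", (lineitem, recent))
query = Op("GROUP", "o_orderpriority", (joined,))

# Assume an earlier query already materialized the JOIN result at this path.
repo = {fingerprint(joined): "/reuse/join_0001"}

rewritten = rewrite(query, repo)
print(size(query), size(rewritten))  # prints "5 2"
```

In this toy run the four-operator join sub-plan collapses into a single load, which corresponds to reusing a sub-computation; if the fingerprint of `query` itself were in `repo`, the whole plan would become one load, which corresponds to reusing a whole computation.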


Hadoop, Pig, MapReduce, HDFS, Cluster.







This work is licensed under a Creative Commons Attribution 3.0 License.