
A Study on Data Mining Techniques and Tools for Big Data

Dr. R. Beena, C. Bhuvaneshwari


Big Data refers to large-volume, growing data sets drawn from heterogeneous, autonomous sources in fields such as engineering, genomics, biology, meteorology, environmental research and many more. New technologies, systems and infrastructure must be developed to handle these data volumes. Deriving useful information from Big Data requires increasingly sophisticated methods of mathematical and statistical analysis and the design of efficient algorithms.

Big Data is constantly changing, and newer algorithms and tools are continuously being developed to handle it. Big Data is all about exploring large volumes of unstructured, invaluable, imperfect, complex data and extracting useful information or knowledge for future use.

Hardware platforms such as GPUs and multicore CPUs can be used to speed up data processing. Tools such as Hadoop, Spark, Dynamo, Pentaho and SAMOA can be used to handle big data. Beyond these, many other platforms with different characteristics are available, and choosing the right one requires in-depth knowledge of the capabilities of each.

This paper provides an in-depth study of the various data mining algorithms and tools available for performing big data analytics.


Big Data Mining, Clustering, Classification, Big Data Tools, Hadoop, Spark, Pentaho, ASTERIX.




Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.