
A Survey on Efficient Big Data Clustering using MapReduce

Avinash Dhanshetti, Tushar Rane


Clustering analysis is a key technique used by data processing algorithms in data mining. The primary aim of clustering is to partition the data into smaller subsets, called clusters, such that data belonging to the same cluster are similar under some similarity metric. Clustering is an essential concept in data analysis and data mining applications. Over the years, K-means has been a popular clustering algorithm because of its simplicity and ease of use. Nowadays, as data sizes continue to grow, researchers have started working in distributed environments such as MapReduce to achieve high performance for big data clustering. In this paper, we survey current work on efficient big data clustering algorithms using the MapReduce framework.
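As an illustration of how K-means can be phrased in MapReduce terms, the following is a minimal single-machine sketch, not the method of any particular surveyed paper. It assumes squared Euclidean distance as the similarity metric and simulates the shuffle phase with an in-memory dictionary; the function names (`map_point`, `reduce_cluster`, `kmeans_iteration`, `kmeans`) are hypothetical and do not belong to Hadoop or any other library.

```python
from collections import defaultdict


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def map_point(point, centroids):
    """Map step: emit (index of nearest centroid, (point, 1))."""
    nearest = min(
        range(len(centroids)),
        key=lambda i: squared_distance(point, centroids[i]),
    )
    return nearest, (point, 1)


def reduce_cluster(values):
    """Reduce step: average all points emitted under one centroid index."""
    total = None
    count = 0
    for point, n in values:
        if total is None:
            total = list(point)
        else:
            total = [t + x for t, x in zip(total, point)]
        count += n
    return tuple(t / count for t in total)


def kmeans_iteration(points, centroids):
    """One K-means iteration as map, simulated shuffle, then reduce."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for point in points:
        key, value = map_point(point, centroids)
        groups[key].append(value)
    # Assumption: a centroid that receives no points keeps its old position.
    return [
        reduce_cluster(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]


def kmeans(points, centroids, max_iterations=20):
    """Repeat iterations until the centroids stop changing."""
    for _ in range(max_iterations):
        updated = kmeans_iteration(points, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids
```

In a real distributed job, each mapper would process one split of the input and the current centroids would be broadcast to all mappers before every iteration; the sketch only mirrors that data flow on one machine.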


Clustering, Map-Reduce, K-Means, Distributed-Environment.







This work is licensed under a Creative Commons Attribution 3.0 License.