Open Access Open Access  Restricted Access Subscription or Fee Access

A Survey on Partitioning and Parallel Partitioning Clustering Algorithms

Sushreeta Tripathy, Sarbeswara Hota

Abstract


Learning is the process of generating useful information from a huge volume of data. Learning can be classified as supervised learning and unsupervised learning. Clustering is a kind of unsupervised learning. A pattern representing a common behaviour or characteristics that exist among each item can be generated. This paper gives an overview of partition clustering and parallel implementation of this clustering algorithm. It describes about the general working behaviour, the methodologies followed on these approaches and the parameters which affects the performance of these algorithms. Clustering is grouping input data sets into subsets, called ‟clusters‟ within which the elements are somewhat similar. In general, clustering is an unsupervised learning task as very little or no prior knowledge is given except the input data sets. The tasks have been used in many fields and therefore various clustering algorithms have been developed. Clustering task is, however, computationally expensive as many of the algorithms require iterative or recursive procedures and most of real-life data is high dimensional. Therefore, the parallelization of clustering algorithms is inevitable, and various parallel clustering algorithms have been implemented and applied to many applications. In this paper, we review a partition clustering algorithms and their parallel versions as well. Although the parallel clustering algorithms have been used for many applications, the clustering tasks are applied as pre processing steps for parallelization of other algorithms too.

Keywords


Clustering, Supervised Learning, Unsupervised Learning, Parallel Clustering.

Full Text:

PDF

References


Jiawei Han, Micheline Kamber, “Data Mining Concepts and Techniques” Elsevier Publication.

Anil K. Jain and Richard C. Dubes. Algorithms for clustering data. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1988

K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACMComput. Surv., 31(3):264–323, 1999.

Rui Xu and Donald Wunsch II. Survey of clustering algorithms. Neural Networks, IEEE Transactions on, 16(3):645–678, May 2005.

Bill Andreopoulos, Aijun An, Xiaogang Wang, and Michael Schroeder. A roadmap of clustering algorithms: finding a match for a biomedical application.Brief Bioinform, pages bbn058+, February 2009.

Rui Xu and Donald Wunsch II. Survey of clustering algorithms. Neural Networks, IEEE Transactions on, 16(3):645–678, May 2005.

C.M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006

B. Andreopoulos, A. An, and X. Wang. Hierarchical density-based clustering of categorical data and a simplification. In Proceedings of the 11thPacific-Asia Conference on Knowledge Discovery and Data Mining, pages 11–22, Nanjing, China, 2007. Springer LNCS.

D. Hochbaum and D. Shmoys. A best possible heuristic for the k-center Problem.

L. Kaufmann and P. Rousseeuw. Clustering by means of medoids. Elsevier Science, pages 405–416, 1997.

L. Kaufmann and P. Rousseeuw. Finding Groups in Data: An introduction to Cluster Analysis. John Wiley, 1990.

R. Ng and J. Hahn. Efficient and Effective Clustering Methods for Spatial Data Mining. 1994.

F. H¨opner, F. Klawonn, and R. Kruse. Fuzzy Cluster Analysis: Methods for Classification, Data Analysis, and Image Recognittion. Wiley, New York, NY., 1999.

P.S. Bradley and U.M. Fayyad. Refining initial points for k-means clustering. Proceedings of the Fifteenth International Conference on Machine Learning, pages 91–99, 1998.

Aristidis Likas, Nikos Vlassis, and Jakob J. Verbeek. The global k-means clustering algorithm. Pattern Recognition, 36(2):451 – 461, 2003

G. Ball and D. Hall. A clustering technique for summarizing multivariate data. Behav. Sci., 12:153–155, 1997

Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. An Efficient k-Means Clustering Algorithm: Analysis and Implementation. IEEE Trans. Pattern Anal. Mach.Intell., 24(7):881–892, 2002.

Khaled Alsabti, Sanjay Ranka, and Vineet Singh. An Efficient k-MeansClustering Algorithm. In Proceedings of IPPS/SPDP Workshop on High Performance Data Mining, 1998.

Ya-Ping Zhang, Ji-Zhao Sun, Yi Zhang, and Xu Zhang. Parallel implementation of CLARANS using PVM. Machine Learning and Cybernetics, 2004.Proceedings of 2004 International Conference on, 3:1646–1649 vol.3, Aug.2004.

Zhexue Huang. A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining. In In Research Issues on Data Mining and Knowledge Discovery, pages 1–8, 1997

Z. Huang and M. Ng. A fuzzy k-modes algorithm for clustering categorical data. In IEEE Trans Fuzzy Syst, pages 446–452, 1999.

Zhexue Huang. Clustering large data sets with mixed numeric and categorical values. In In The First Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 21–34, 1997.

H. Kim and H. Park. Sparse Non-negative Matrix Factorizations via Alternating Non-negativity-constrained Least Squares for Microarray Data Analysis. Bioinformatics, 23(12):1495–1502, 2007.

Chin-Hsiung Wu, Shi-Jinn Horng, and Horng-Ren Tsai. Efficient parallel algorithms for hierarchical clustering on arrays with reconfigurable optical buses. J. Parallel Distrib. Comput., 60(9):1137–1153, 2000.

Ashwani Garg, Ashish Mangla, Neelima Gupta, and Vasudha Bhatnagar. PBIRCH: A Scalable Parallel Clustering algorithm for Incremental Data. Database Engineering and Applications Symposium, International, 0:315– 316, 2006.

H.S. Nagesh, S. Goil, and A. Choudhary. A scalable parallel subspace clustering algorithm for massive data sets. pages 477–484, 2000.

S. Ranka and S. Sahni. Clustering on a Hypercube Multicomputer. IEEE Transactions on Parallel and Distributed Systems, 2(2):129–137, 1991.

F. F. Rivera, M. Ismail, and E. L. Zapata. Parallel squared error clustering on hypercube arrays. J. Parallel Distrib. Comput., 8(3):292–299, 1990.

Fionn Murtagh. Comments on ‟Parallel Algorithms for Hierarchical Clustering and Cluster Validity‟. IEEE Trans. Pattern Anal. Mach. Intell.,(10):1056–1057, 1992.

George Forman and Bin Zhang. Distributed data clustering can be efficient and exact. SIGKDD Explor. Newsl., 2(2):34–38, 2000.

S. Anitha Elavarasi and Dr. J. Akilandeswari. A Survey On Partition Clustering Algorithm . International Journal of Enterprise Computing and Business Systems, Vol. 1 Issue 1 January 2011.

Xiaowei Xu, Jochen Jager, and Hans-Peter Kriegel. A Fast Parallel Clustering Algorithm for Large Spatial Databases. Data Mining and Knowledge Discovery, 3:263–290(28), 1999.

Domenico Talia. Parallelism in Knowledge Discovery Techniques. In PARA ‟02: Proceedings of the 6th International Conference on Applied Parallel Computing Advanced Scientific Computing, pages 127–138, London, UK, 2002. Springer-Verlag.

X. Li and Z. Fang. Parallel clustering algorithms. Parallel Comput.,11(3):275–290, 1989.

Dan Judd, Philip K. McKinley, and Anil K. Jain. Large-Scale Parallel Data Clustering. IEEE Trans. Pattern Anal. Mach. Intell., 20(8):871–876, 1998.

S. Kantabutra and A.L. Couch. Parallel K-Means Clustering Algorithm on NOWs. NECTEC Technical Journal, 1(1), 1999.

Inderjit S. Dhillon and Dharmendra S. Modha. A Data-Clustering Algorithm on Distributed Memory Multiprocessors. In Revised Papers from Large- Scale Parallel Data Mining, Workshop on Large-Scale Parallel KDD Systems, SIGKDD, pages 245–260, London, UK, 2000.


Refbacks

  • There are currently no refbacks.


Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.