
Algorithms for Clustering of Documents

Neeti Arora, Mahesh Motwani


There is a great need to organize large sets of documents into categories. Document clustering techniques are widely recognized as useful tools for information retrieval and for organizing web documents, and they help users direct their searches appropriately. Researchers have developed a large variety of clustering techniques. The purpose of this paper is to present a novel survey of these techniques, which can also be used to group web and other documents into meaningful clusters. A categorization of the different clustering algorithms is also proposed.


Keywords: Clustering, Document Clustering, K-Means, K-Medoids, Web Document.







Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.