An Alternative Extension of the K-Means Algorithm for Clustering Medical Data

Dr. R. Nedunchezhian, V. Pattabiraman


Data clustering is a powerful technique in many application areas. The clusters may be meaningful in themselves, and clustering also supports efficient data management, because data grouped together will usually be accessed together. Access to one item in a cluster suggests that other items in that cluster will be accessed soon, which can lead to optimized storage strategies that perform much better than random placement of the data.
Most earlier work on clustering has focused on numerical data, whose inherent geometric properties can be exploited to define distance functions between data points naturally. Recently, the problem of clustering categorical data has started to draw interest. However, the computational cost of most previous algorithms makes them unacceptable for clustering very large databases. The k-means algorithm is well known for its efficiency in this respect, but it works only on numerical data and therefore cannot be applied directly to categorical data. The main contribution of this paper is to show how to apply the notion of “cluster centers” to a dataset of categorical objects and how to use this notion to formulate the clustering of categorical objects as a partitioning problem. Finally, a k-means-like algorithm for clustering categorical data is introduced. The clustering performance of the algorithm is demonstrated on well-known medical data sets.
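
The idea of a "cluster center" for categorical objects can be illustrated with a minimal sketch. It is not the algorithm proposed in the paper, whose details are not given in this abstract. It is a generic k-modes-style loop that assumes a simple matching dissimilarity (the number of attributes on which two objects differ) and takes the per-attribute mode of a cluster as its center. The function names, the toy patient records, and the choice of initial centers are all hypothetical.

```python
from collections import Counter


def matching_distance(a, b):
    """Simple matching dissimilarity: count of attributes where a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Most frequent value of each attribute; serves as a categorical 'center'."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes_sketch(data, initial_centers, max_iter=20):
    """Alternate assignment and center update until the labels stop changing."""
    centers = [tuple(c) for c in initial_centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each object joins its nearest center.
        new_labels = [
            min(range(len(centers)),
                key=lambda j: matching_distance(row, centers[j]))
            for row in data
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each center becomes the per-attribute mode of its members.
        for j in range(len(centers)):
            members = [row for row, lab in zip(data, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = column_modes(members)
    return labels, centers


# Hypothetical categorical records: (symptom, secondary symptom, age group).
records = [
    ("fever", "cough", "young"),
    ("fever", "cough", "old"),
    ("fever", "none", "young"),
    ("rash", "none", "old"),
    ("rash", "itch", "old"),
    ("rash", "none", "young"),
]
labels, centers = k_modes_sketch(records, [records[0], records[3]])
print(labels)
print(centers)
```

As with k-means, the result of such a loop depends on the initial centers and is in general only a local optimum of the partitioning objective.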


Clustering, K-Means Clustering, Proximity


This work is licensed under a Creative Commons Attribution 3.0 License.