Enhancing the Performance of Hybrid Clustering of Documents Using Artificial Neural Network Based Approach

M. Deepa, P. Tamijeselvy

Abstract


Clustering and classification are active areas of machine learning research that promise to help us cope with information overload on the Internet. BIRCH is a clustering algorithm designed to operate under the assumption that "the amount of memory available is limited, whereas the dataset can be arbitrary large". The algorithm generates "a compact dataset summary" while minimizing the I/O cost involved. Applying K-Means requires an initial partition to be supplied as input. To generate a "good" initial partition of the "summaries", the clustering algorithm PDDP can be used. We also compare the performance of the traditional K-Means algorithm with a new artificial neural network based clustering method. Experimental results show that the new method is more accurate than K-Means.
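
The three-stage pipeline described in the abstract (summarise, seed with PDDP, refine with K-Means) can be illustrated with a minimal sketch. It is not the authors' implementation. All function names, the `radius` threshold, and the toy data are assumptions made for illustration. The summary step keeps a flat list of (count, linear sum) features instead of the CF-tree that BIRCH actually builds, the principal direction is found by power iteration where PDDP is usually described in terms of an SVD, and weighting the summaries by their counts is likewise an assumption.

```python
import math

def summarise(points, radius):
    """One-pass, BIRCH-like summary (simplified: flat list, no CF-tree).
    Returns (weight, centroid) pairs; `radius` is a hypothetical threshold."""
    feats = []  # each feature is [count, linear_sum]
    for p in points:
        best, best_d = None, radius
        for f in feats:
            centroid = [s / f[0] for s in f[1]]
            d = math.dist(p, centroid)
            if d <= best_d:
                best, best_d = f, d
        if best is None:
            feats.append([1, list(p)])  # start a new summary
        else:
            best[0] += 1                # absorb the point into the nearest summary
            best[1] = [s + x for s, x in zip(best[1], p)]
    return [(n, [s / n for s in ls]) for n, ls in feats]

def weighted_mean(items):
    total = sum(w for w, _ in items)
    dim = len(items[0][1])
    return [sum(w * c[i] for w, c in items) / total for i in range(dim)]

def pddp_split(items, iters=50):
    """Single PDDP-style split: label each summary by the sign of its
    projection onto the principal direction of the centred summaries."""
    mean = weighted_mean(items)
    centred = [(w, [x - m for x, m in zip(c, mean)]) for w, c in items]
    v = [1.0] * len(mean)  # assumed not orthogonal to the principal direction
    for _ in range(iters):
        # Apply the weighted scatter matrix to v without forming the matrix.
        proj = [w * sum(a * b for a, b in zip(x, v)) for w, x in centred]
        v = [sum(p * x[i] for p, (_, x) in zip(proj, centred))
             for i in range(len(mean))]
        norm = math.sqrt(sum(a * a for a in v)) or 1.0
        v = [a / norm for a in v]
    return [0 if sum(a * b for a, b in zip(x, v)) <= 0 else 1
            for _, x in centred]

def kmeans(items, labels, k, iters=20):
    """Weighted Lloyd iterations started from a given partition."""
    for _ in range(iters):
        centres = []
        for j in range(k):
            members = [it for it, lab in zip(items, labels) if lab == j]
            centres.append(weighted_mean(members) if members else None)
        live = [j for j in range(k) if centres[j] is not None]
        new = [min(live, key=lambda j: math.dist(c, centres[j]))
               for _, c in items]
        if new == labels:
            break
        labels = new
    return labels

# Toy data: two well-separated groups of 2-D points (illustrative only).
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
summaries = summarise(points, radius=1.0)
initial = pddp_split(summaries)
final = kmeans(summaries, initial, k=2)
```

Because K-Means runs on the summaries and not on the raw points, its cost depends on the number of summaries, which is the point of the hybrid scheme. Obtaining more than two clusters would require applying the split repeatedly, which this sketch omits.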


Keywords


BIRCH, PDDP, K-Means, ANN based Clustering, Rand Index
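
The Rand Index listed among the keywords is a standard measure of agreement between two partitions: the fraction of item pairs that both partitions treat the same way (together in both, or apart in both). The sketch below shows the standard definition only. The function name and the toy label vectors are assumptions, and how the paper itself applies the index is not stated on this page.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two partitions agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two items")
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical reference labels and a clustering result for four documents.
truth = [0, 0, 1, 1]
predicted = [1, 1, 0, 1]
score = rand_index(truth, predicted)
```

The index depends only on which items are grouped together, so renaming the cluster labels does not change the score.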



Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.