A New Content Based Text Clustering Using Spherical Gaussian EM Algorithm

S.C. Punitha; R. Jayasree

Abstract

Extracting the relations between verbs and their arguments within a sentence makes it possible to analyze the terms of that sentence. This paper proposes a novel concept-based mining model that captures the semantic structure of each term within a sentence and a document, rather than only the frequency of the term within a document. The model has four components: sentence-based concept analysis, document-based concept analysis, corpus-based concept analysis, and a concept-based similarity measure. Clustering with the Spherical Gaussian EM algorithm is applied to improve the text mining results. Extensive experiments with the proposed concept-based mining model are conducted on different data sets for text clustering. These experiments use the concept-based term analysis and similarity measure to assess how effectively concept matching yields an accurate measure of similarity between documents. The experiments are run in MATLAB on three datasets: Reuters, TDT and 20 Newsgroups. Clustering performance is evaluated using F-measure and execution time.
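The abstract names the Spherical Gaussian EM algorithm without describing its steps. As a rough illustration only, the following Python sketch fits a mixture of spherical Gaussians (one variance per cluster, shared across all dimensions) with standard EM updates. It is not the authors' MATLAB implementation: the function names, the farthest-first initialization, the fixed iteration count, and the use of plain numeric lists as document vectors are all assumptions made for this example.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_means(points, k):
    """Pick k starting means: the first point, then repeatedly the point
    farthest from all means chosen so far (an assumed, deterministic init)."""
    means = [list(points[0])]
    while len(means) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, m) for m in means))
        means.append(list(farthest))
    return means


def spherical_gaussian_em(points, k, iterations=50):
    """Fit k spherical Gaussians (covariance = variance * identity) with EM.

    Returns (weights, means, variances, responsibilities).
    """
    n, d = len(points), len(points[0])
    means = farthest_first_means(points, k)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = []

    for _ in range(iterations):
        # E-step: posterior probability of each component for each point,
        # computed from log densities to avoid numerical underflow.
        resp = []
        for x in points:
            logs = [
                math.log(weights[j])
                - 0.5 * d * math.log(2.0 * math.pi * variances[j])
                - sq_dist(x, means[j]) / (2.0 * variances[j])
                for j in range(k)
            ]
            top = max(logs)
            exps = [math.exp(v - top) for v in logs]
            total = sum(exps)
            resp.append([e / total for e in exps])

        # M-step: re-estimate weight, mean, and the single variance
        # of each component from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = nj / n
            means[j] = [
                sum(r[j] * x[i] for r, x in zip(resp, points)) / nj
                for i in range(d)
            ]
            spread = sum(r[j] * sq_dist(x, means[j]) for r, x in zip(resp, points))
            variances[j] = max(spread / (d * nj), 1e-6)  # floor avoids collapse

    return weights, means, variances, resp
```

A hard cluster label for each document can then be read off as the component with the largest responsibility in its row of the returned list. In the setting of the abstract, the input vectors would come from the concept-based analysis rather than from raw term frequencies.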

Keywords

Data Preprocessing, Web Usage Mining, Path Completion Algorithm, Data Cleaning, User Session Identification, Modified Expectation Maximization.

This work is licensed under a Creative Commons Attribution 3.0 License.
