Analysis of Performance of Classifier Algorithms for Different Text Representations

Shweta Dharmadhikari, Maya Ingale, Parag Kulkarni

Abstract

Text representation has a strong impact on the performance of a text classification system. A representation with a large number of redundant, noisy, or irrelevant features often increases the training and classification time of the system and reduces its accuracy. An appropriate text representation with properly extracted or selected features may lead to high accuracy. This paper provides a brief overview of popular text representation techniques and analyzes the performance of three major text classifiers against three popular text representations, namely the vector space model, the graph-based model, and the NMF-based model, in the multi-label setting. We also propose mltcNMF, a feature extraction algorithm based on non-negative matrix factorization for high-dimensional data spaces. We conducted a set of experiments to comprehensively evaluate the effects of these text representation approaches on multi-label datasets, and we also measured the classification performance of our new algorithm. Our empirical study shows that using an appropriate feature selection strategy in text representation can significantly improve the performance of a text classification system.

Keywords

Text Classification, Vector Space Model, NMF, Multi Label Text Classification
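
To make the NMF-based representation mentioned in the abstract more concrete, the following is a minimal sketch of generic non-negative matrix factorization applied to a small term-frequency matrix. It is not the mltcNMF algorithm, whose details are not given on this page; it uses the standard multiplicative update rules, and the toy documents, the function names (`term_document_matrix`, `nmf`, `frobenius_error`), the rank `k`, and the iteration count are all assumptions chosen for illustration.

```python
import random

def term_document_matrix(docs):
    """Build a raw term-frequency matrix (rows = documents, columns = terms)."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    V = [[0.0] * len(vocab) for _ in docs]
    for i, d in enumerate(docs):
        for w in d.lower().split():
            V[i][index[w]] += 1.0
    return V, vocab

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(V, W, H):
    """Frobenius norm of V - WH, the reconstruction error."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx)) ** 0.5

def nmf(V, k, iters=200, seed=0, eps=1e-9):
    """Factor V (n x m) into non-negative W (n x k) and H (k x m)
    using multiplicative updates; eps guards against division by zero."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return W, H

# Hypothetical toy corpus with two rough topics (finance and football).
docs = [
    "stock market trading price",
    "market price stock",
    "football match goal",
    "goal match team football",
]
V, vocab = term_document_matrix(docs)   # vector space model: 4 docs x 8 terms
W, H = nmf(V, k=2)                      # W: 4 docs x 2 extracted features
```

Each row of `W` is a low-dimensional feature vector for one document and could be passed to any multi-label classifier in place of the original term vector, while each row of `H` weights the vocabulary terms for one extracted feature.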

This work is licensed under a Creative Commons Attribution 3.0 License.
