Open Access

Feature Selection for Text Clustering and Classification

Kamlesh Dhayal, Sudesh Kumar, Shalini Batra

Abstract


The quality of the data is one of the most important factors influencing the performance of any classification or clustering algorithm. The attributes defining the feature space of a given data set can often be inadequate, which makes it difficult to discover useful information or the desired output. However, even when the original attributes are individually inadequate, it is often possible to combine them to construct new attributes with greater predictive power. Feature selection, as a preprocessing step to machine learning, has been very effective in reducing dimensionality and removing irrelevant data and noise, thereby improving result comprehensibility. This paper addresses the task of feature selection for clustering and classification. We give a comparative study of a variety of classification methods, including Naïve Bayes and J48.
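The filter-style feature selection the abstract describes can be sketched in a few lines. The snippet below is a minimal illustration in plain Python, using the chi-square statistic as the term-scoring criterion; the toy corpus and the cutoff k are invented here for demonstration and are not data from the paper's experiments.

```python
# Filter-style feature selection for text: score each term by a
# chi-square statistic against the class label and keep the top k.
# Toy two-class corpus, invented purely for illustration.
docs = [
    ("cheap pills buy now", "spam"),
    ("limited offer buy cheap", "spam"),
    ("meeting agenda attached", "ham"),
    ("project meeting notes attached", "ham"),
]

def chi2_scores(docs):
    """Chi-square score of each term's presence vs. the class label."""
    n = len(docs)
    pos = sum(1 for _, lab in docs if lab == "spam")
    terms = {w for text, _ in docs for w in text.split()}
    scores = {}
    for t in terms:
        a = sum(1 for text, lab in docs if t in text.split() and lab == "spam")
        b = sum(1 for text, lab in docs if t in text.split() and lab == "ham")
        c = pos - a            # spam docs without the term
        d = (n - pos) - b      # ham docs without the term
        num = n * (a * d - b * c) ** 2
        den = (a + b) * (c + d) * (a + c) * (b + d)
        scores[t] = num / den if den else 0.0
    return scores

scores = chi2_scores(docs)
top4 = sorted(scores, key=scores.get, reverse=True)[:4]  # keep k = 4 terms
print(sorted(top4))
```

In practice the retained terms would define the reduced feature space fed to a classifier such as Naïve Bayes or J48; toolkits like WEKA provide this kind of filter as a built-in attribute evaluator.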


Keywords


Classification, Clustering, Feature selection, Machine learning.


References


M. A. Hall, "Correlation-Based Feature Selection for Machine Learning", PhD thesis, Department of Computer Science, University of Waikato, New Zealand, 1999.

F. Sebastiani, "Machine Learning in Automated Text Categorization", ACM Computing Surveys, Vol. 34, No. 1, pp. 1–47, 2002.

C. J. van Rijsbergen, "Information Retrieval", 2nd edition, Butterworth, London, 1979.

M. Dash and H. Liu, “Feature Selection for Classification”, Intelligent Data Analysis, vol.1, no. 3, pp. 131-156, 1997.

Y. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization", In Proceedings of the 14th International Conference on Machine Learning, pp. 534–547, 1997.

B. A. Draper, "Feature Selection from Huge Feature Sets", Computer Science Department, Colorado State University, Fort Collins, CO 80523, USA.

M. Dash and H. Liu, "Feature Selection for Clustering", In Proceedings of PAKDD-00, pp. 110–121, 2000.

M. H. Law and A. K. Jain, "Feature Selection in Mixture-Based Clustering", Dept. of Computer Science and Eng., Michigan State University, East Lansing, MI 48824, U.S.A.

S. Das, “Filters, wrappers and a boosting-based hybrid for feature selection”, International Conference on Machine Learning, 2001.

http://www.cs.waikato.ac.nz/ml/weka/

R. Kohavi and G. John, "Wrappers for feature subset selection", Artificial Intelligence, Vol. 97, No. 1–2, pp. 273–324, 1997.

P. Langley and H. A. Simon, "Applications of machine learning and rule induction", Communications of the ACM, pp. 55–64, 1995.

R. Kohavi, “Wrappers for Performance Enhancement and Oblivious Decision Graphs”, PhD thesis, Stanford University, 1995.

L. Yu and H. Liu, “Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution”, In ICML, pp. 856–863, 2003.

J. W. Wilbur and K. Sirotkin, "The automatic identification of stop words", Journal of Information Science, Vol. 18, pp. 45–55, 1992.




Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.