Optimizing Classification of High Dimensional Data by Hybrid Approach of Feature Selection with Wrapper Evaluators

Sanjay Garg; Mahesh Panchal

Abstract

High dimensional data contains a large number of features (predictor attributes) relative to the number of samples. Because many of these features are irrelevant to the class label, a classification algorithm applied directly to such a dataset produces a less accurate model that takes longer to build, test, and apply to unseen data. Feature selection methods retain only those features that are relevant to the class label. During feature selection, candidate sets of features are generated and evaluated for their relevance to the class. Several methods for generating and evaluating features have been proposed in the literature, each with its own characteristics. In this paper, experiments are carried out on three types of cancer gene expression datasets using different feature selection methods. Features are generated by ranker, heuristic, and random search methods, and they are evaluated by information gain, attreval, and wrapper methods. A hybrid approach that combines ranker-based and subset-based feature generation is also proposed. The experiments show that the hybrid approach with a wrapper evaluator gives the best classification accuracy.
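
The hybrid idea described above (rank features first, then search subsets of the top-ranked ones with a wrapper evaluator) can be illustrated with a minimal sketch. It is not the authors' implementation: the median-split information gain ranker, the greedy forward search, the leave-one-out 1-nearest-neighbour wrapper, the `top_k` cutoff, and the synthetic data with two artificially informative features are all assumptions made here for illustration.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(column, labels):
    """Information gain of a numeric feature after a binary split at its median."""
    threshold = sorted(column)[len(column) // 2]
    left = [y for x, y in zip(column, labels) if x < threshold]
    right = [y for x, y in zip(column, labels) if x >= threshold]
    gain = entropy(labels)
    for part in (left, right):
        if part:
            gain -= len(part) / len(labels) * entropy(part)
    return gain


def loo_accuracy(X, y, features):
    """Wrapper evaluator (assumed): leave-one-out accuracy of 1-NN on `features`."""
    correct = 0
    for i in range(len(X)):
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((X[i][f] - X[j][f]) ** 2 for f in features),
        )
        correct += y[nearest] == y[i]
    return correct / len(X)


def hybrid_select(X, y, top_k=10):
    """Stage 1: rank all features by information gain and keep the top_k.
    Stage 2: greedy forward subset search over those, scored by the wrapper."""
    n_features = len(X[0])
    gains = [info_gain([row[f] for row in X], y) for f in range(n_features)]
    candidates = sorted(range(n_features), key=lambda f: gains[f], reverse=True)[:top_k]

    selected, best = [], 0.0
    improved = True
    while improved:
        improved = False
        for f in candidates:
            if f in selected:
                continue
            score = loo_accuracy(X, y, selected + [f])
            if score > best:  # keep the single best addition of this pass
                best, best_f, improved = score, f, True
        if improved:
            selected.append(best_f)
    return selected, best


# Synthetic stand-in for gene expression data: 40 samples, 200 features,
# of which features 3 and 7 (hypothetical) are shifted by class.
rng = random.Random(0)
y = [i % 2 for i in range(40)]
X = [[rng.gauss(0, 1) for _ in range(200)] for _ in y]
for row, label in zip(X, y):
    row[3] += 10.0 * label
    row[7] += 10.0 * label

selected, acc = hybrid_select(X, y, top_k=10)
print("selected features:", selected, "wrapper accuracy:", acc)
```

Restricting the wrapper search to the top-ranked candidates is what keeps the cost manageable here: the wrapper retrains and re-evaluates a classifier for every candidate subset, which would be expensive over all features of a high dimensional dataset.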

Keywords

Data Mining, Classification, Feature Selection, Wrapper Evaluators

This work is licensed under a Creative Commons Attribution 3.0 License.
