Leukemia Classification using Cloud based Map-Reduce with K-Nearest Neighbor Classifier Framework

A. Kernytskyy; Krzysztof J Cios

Abstract

Microarray-based gene expression profiling has emerged as an efficient technique for the classification, prognosis, diagnosis, and treatment of cancer, and cancer diagnosis is one of the fastest-emerging clinical applications of microarray data. Frequent changes in the behavior of the disease generate an enormous volume of data. Because it keeps changing over time, microarray data exhibits both the accuracy and the velocity characteristics of big data. Analysis of microarray datasets involves a large volume of expression data, but only a fraction of it comprises genes that are significantly expressed. Exact identification of the genes responsible for causing cancer is therefore essential in microarray data analysis. Most existing schemes are two-phase processes: feature selection or extraction, followed by classification. Various statistical methods based on MapReduce are proposed for selecting relevant features. After feature selection, a MapReduce-based K-Nearest Neighbor (MRKNN) classifier is employed to classify the microarray data. The algorithms are successfully implemented in a Hadoop framework.
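The abstract does not spell out how the MRKNN classifier is organized, so the following is only a minimal single-machine sketch of one common way to phrase K-Nearest Neighbor in map and reduce steps, not the authors' implementation. It assumes Euclidean distance, majority voting, and training data already reduced to the selected features. The function names (map_knn, reduce_knn, mr_knn_predict), the toy expression values, and the ALL/AML labels are hypothetical and chosen for illustration; on a Hadoop cluster the map step would run once per data split in parallel.

```python
import heapq
import math
from collections import Counter
from functools import reduce


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def map_knn(split, query, k):
    """Map step: the k nearest (distance, label) pairs within one training split."""
    return heapq.nsmallest(k, ((euclidean(x, query), label) for x, label in split))


def reduce_knn(left, right, k):
    """Reduce step: merge two candidate lists and keep the k nearest overall."""
    return heapq.nsmallest(k, left + right)


def mr_knn_predict(splits, query, k=3):
    """Classify one query sample by majority vote over its k nearest neighbors."""
    # Each split is processed independently (sequentially here, for illustration).
    candidates = [map_knn(split, query, k) for split in splits]
    nearest = reduce(lambda a, b: reduce_knn(a, b, k), candidates)
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two splits of (selected-gene expression vector, label).
TOY_SPLITS = [
    [([1.0, 1.0], "ALL"), ([1.2, 0.9], "ALL"), ([5.0, 5.1], "AML")],
    [([0.9, 1.1], "ALL"), ([5.2, 4.9], "AML"), ([4.8, 5.0], "AML")],
]

if __name__ == "__main__":
    print(mr_knn_predict(TOY_SPLITS, [1.1, 1.0], k=3))
```

Because each mapper returns at most k candidates, the reducer only has to merge short lists, which is what makes this formulation attractive when the training set is too large for one machine.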

Keywords

Microarray Gene Expression, Leukemia Classification, Feature Selection, MapReduce-based K-Nearest Neighbor.

This work is licensed under a Creative Commons Attribution 3.0 License.
