
Cancer Prognosis Prediction Model using Data Mining Techniques

J. S. Saleema, P. Deepa Shenoy, K. R. Venugopal, L. M. Patnaik


Cancer prognosis prediction improves the quality of treatment and increases patient survivability. Disease prognosis is identified at the treatment stage and at the recurrence stage. Conventional cancer prediction methods deal only with the survival or mortality of patients, and not with other labels such as the severity of the disease through metastasis or multiple primaries, stage, grade, etc. The SEER Public Use cancer database contains additional prominent variables that support a better prediction approach. The objective of this paper is twofold: first, to build a prediction model that finds the prominent variables using standard classifiers, and second, to improve prediction accuracy through various sampling techniques. The proposed prediction model consists of three phases, namely basic-level pre-processing, problem-specific processing, and modeling classifiers. The problem-specific processing phase deals with feature extraction, sampling, and response variable selection. Well-known classification algorithms (Decision Tree, Naïve Bayes, and KNN) have been used to model the classifiers for prediction analysis. Apart from the available incidence data from SEER (Breast, Colorectal, and Respiratory Cancer data), a new mixed combination of the three in equal proportion has been generated for the experiments. Feature selection through correlation and information gain reduced the attributes from the raw size of 118 to 37. Patient survival, age at diagnosis, stage, and multiple primaries, in that order, were identified as the prominent response variables, whereas grade performed very poorly in the experiments. The performance of various sampling techniques has been studied with data set sizes ranging from 500 to 30000 samples for the four prominent labels identified in the previous step. The results show that the balanced stratified sampling technique consistently maintains its performance. In addition, the classifier model built with the decision tree algorithm outperforms those built with the other algorithms. All the results of the models are tabulated in this paper.
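As an illustration of the balanced stratified sampling mentioned above, the following is a minimal sketch and not the authors' implementation. It assumes that "balanced" means drawing an equal number of records from each class of the chosen response variable, without replacement. The function name `balanced_stratified_sample`, the `survived` field, and the toy records are hypothetical and are not taken from SEER.

```python
import random
from collections import defaultdict


def balanced_stratified_sample(records, label_of, n_total, seed=0):
    """Draw up to n_total records with the same count from every class.

    Assumption: sampling is without replacement, so the per-class count
    is capped by the size of the smallest class.
    """
    if not records:
        return []
    rng = random.Random(seed)

    # Group the records into strata, one stratum per class label.
    strata = defaultdict(list)
    for rec in records:
        strata[label_of(rec)].append(rec)

    # Equal share per class, limited by the rarest class.
    per_class = n_total // len(strata)
    per_class = min(per_class, min(len(group) for group in strata.values()))

    sample = []
    for label in sorted(strata):
        sample.extend(rng.sample(strata[label], per_class))
    rng.shuffle(sample)  # avoid class-ordered output
    return sample


# Hypothetical, imbalanced toy data: 900 survivors and 100 non-survivors.
records = [{"id": i, "survived": i % 10 != 0} for i in range(1000)]
sample = balanced_stratified_sample(records, lambda r: r["survived"], 100)
survivors = sum(1 for r in sample if r["survived"])
```

In this toy setting the returned sample holds the same number of records from each class even though the input is heavily skewed toward survivors, which is the property that makes such a sample less sensitive to class imbalance when training a classifier.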


Classifier, Pre-processing, Prognosis Prediction, SEER.







Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.