
A Framework for Class Imbalance Problem Using Hybrid Sampling

S. Jahangeer Sidiq, Majid Zaman, Muheet Butt

Abstract


Skewness in the underlying data distribution is natural in most datasets generated by real-world applications; such datasets are commonly known as class-imbalanced datasets, in which the examples of one class are far fewer in number than the examples of the other class(es). Multi-class classification with imbalanced datasets has attracted considerable attention from the data mining and machine learning research communities in recent years. The main aim of this paper is to improve the classification performance on the minority class without reducing the classification performance on the majority class(es). Unlike most prior work, which treats multi-class classification and class imbalance as two separate problems, we study them as a single, combined problem.
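
The degree of skew is often summarized by the imbalance ratio, the size of the largest class divided by the size of the smallest. As a quick illustration (the label counts below are hypothetical, not from any dataset used in the paper):

```python
from collections import Counter

# Hypothetical labels for a three-class imbalanced dataset
# (class sizes are made up purely for illustration).
y = ["A"] * 950 + ["B"] * 40 + ["C"] * 10

counts = Counter(y)
imbalance_ratio = max(counts.values()) / min(counts.values())
print(counts)                                # Counter({'A': 950, 'B': 40, 'C': 10})
print("Imbalance ratio:", imbalance_ratio)   # 95.0
```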

This paper addresses the problem by devising a novel framework that combines a data-level solution (Random Hybrid Sampling) with a well-known binarization scheme (OVO binarization). The performance improvement of our framework is demonstrated using several performance measures, namely Precision, Recall, F1-score and G-Mean, on benchmark datasets imported from the UCI machine learning repository.
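
As a rough sketch of how such a framework can be assembled with off-the-shelf tools: the snippet below approximates the data-level step with random over-sampling of the smaller classes and random under-sampling of the larger ones (via imbalanced-learn), then wraps a base learner in scikit-learn's one-vs-one binarizer. The resampling target (the mean class size), the wine dataset, and the decision-tree base classifier are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch, assuming "Random Hybrid Sampling" means pulling every
# class toward a common target size by over-sampling small classes and
# under-sampling large ones; this is NOT the authors' exact method.
from collections import Counter

import numpy as np
from sklearn.datasets import load_wine             # a UCI benchmark dataset
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier  # OVO binarization
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, recall_score, f1_score
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.metrics import geometric_mean_score

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Hybrid sampling: move every class toward the mean class size (assumed target).
counts = Counter(y_tr)
target = int(np.mean(list(counts.values())))
over = RandomOverSampler(
    sampling_strategy={c: target for c, n in counts.items() if n < target},
    random_state=0)
X_bal, y_bal = over.fit_resample(X_tr, y_tr)
under = RandomUnderSampler(
    sampling_strategy={c: target for c, n in Counter(y_bal).items() if n > target},
    random_state=0)
X_bal, y_bal = under.fit_resample(X_bal, y_bal)

# OVO binarization: one binary classifier is trained per pair of classes.
model = OneVsOneClassifier(DecisionTreeClassifier(random_state=0))
model.fit(X_bal, y_bal)
y_pred = model.predict(X_te)

# The four performance measures named in the abstract (macro-averaged).
print("Precision:", precision_score(y_te, y_pred, average="macro"))
print("Recall:   ", recall_score(y_te, y_pred, average="macro"))
print("F1-score: ", f1_score(y_te, y_pred, average="macro"))
print("G-Mean:   ", geometric_mean_score(y_te, y_pred, average="macro"))
```

Resampling toward the mean class size is one common way to split the balancing work between over- and under-sampling, so that neither the duplication of minority examples nor the loss of majority examples is extreme.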


Keywords


Class Imbalance, Classification, Over-Sampling, Under-Sampling, OVO-Binarization, Performance Metrics.


References


T. W. Liao, “Classification of weld flaws with imbalanced class data,” Expert Syst. Appl., vol. 35, no. 3, pp. 1041–1052, Oct. 2008.

X.-M. Zhao, X. Li, L. Chen, and K. Aihara, “Protein classification with imbalanced data,” Proteins: Structure, Function, and Bioinformatics, vol. 70, no. 4, pp. 1125–1132, Mar. 2008.

A. C. Tan, D. Gilbert, and Y. Deville, “Multi-class protein fold classification using a new ensemble machine learning approach,” Genome Inf., vol. 14, pp. 206–217, 2003.

N. V. Chawla, N. Japkowicz, and A. Kotcz, “Editorial: Special issue on learning from imbalanced data sets,” ACM SIGKDD Explorations Newslett., vol. 6, no. 1, pp. 1–6, 2004.

L. Breiman, “Bagging predictors,” Mach. Learn., vol. 24, no. 2, pp. 123–140, 1996.

Y. Freund and R. E. Schapire, “Experiments with a new boosting algorithm,” in Proc. Int. Conf. Mach. Learn., 1996, vol. 96, pp. 148–156.

N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer, “SMOTEBoost: Improving prediction of the minority class in boosting,” in Proc. 7th Eur. Conf. Principles Practice Knowl. Discovery Databases, 2003, pp. 107–119.

M. V. Joshi, V. Kumar, and R. C. Agarwal, “Evaluating boosting algorithms to classify rare classes: Comparison and improvements,” in Proc. IEEE Int. Conf. Data Mining, 2001, pp. 257–264.

Z.-H. Zhou and X.-Y. Liu, “Training cost-sensitive neural networks with methods addressing the class imbalance problem,” IEEE Trans. Knowl. Data Eng., vol. 18, no. 1, pp. 63–77, Jan. 2006.

G. Ou and Y. L. Murphey, “Multi-class pattern classification using neural networks,” Pattern Recognit., vol. 40, no. 1, pp. 4–18, Jan. 2007.

R. Rifkin and A. Klautau, “In defense of one-vs-all classification,” J. Mach. Learn. Res., vol. 5, pp. 101–141, Dec. 2004.

R. Jin and J. Zhang, “Multi-class learning by smoothed boosting,” Mach. Learn., vol. 67, no. 3, pp. 207–227, Jun. 2007.

H. Valizadegan, R. Jin, and A. K. Jain, “Semi-supervised boosting for multi-class classification,” Mach. Learn. Knowl. Discovery Databases, vol. 5212, pp. 522–537, 2008.

T. Hastie and R. Tibshirani, “Classification by pairwise coupling,” The Ann. Statist., vol. 26, no. 2, pp. 451–471, 1998.

A. Frank and A. Asuncion. (2010). UCI machine learning repository [Online]. Available: http://archive.ics.uci.edu/ml

J. Alcalá, A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, and F. Herrera, “KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework,” J. Multiple-Valued Logic Soft Comput., vol. 17, pp. 255–287, 2010.

A. Fernández, V. López, M. Galar, M. J. del Jesus, and F. Herrera, “Analysing the classification of imbalanced data-sets with multiple classes: Binarization techniques and ad-hoc approaches,” Knowl.-Based Syst., vol. 42, pp. 97–110, 2013.

H. He and E. A. Garcia, “Learning from imbalanced data,” IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1263–1284, Sep. 2009.

G. M. Weiss and F. Provost, “The effect of class distribution on classifier learning: An empirical study,” Department of Computer Science, Rutgers University, New Jersey, Tech. Rep. ML-TR-44, 2001.

I. Tomek, “Two modifications of CNN,” IEEE Trans. Syst., Man Cybern., vol. 6, no. 11, pp. 769–772, Nov. 1976.

D. Wilson, “Asymptotic properties of nearest neighbor rules using edited data,” IEEE Trans. Syst., Man Cybern., vol. SMC-2, no. 3, pp. 408–421, Jul. 1972.

P. E. Hart, “The condensed nearest neighbor rule (corresp.),” IEEE Trans. Inf. Theory, vol. 14, no. 3, pp. 515–516, May 1968.

M. Kubat and S. Matwin, “Addressing the curse of imbalanced training sets: One-sided selection,” in Proc. 14th Int. Conf. Mach. Learn., 1997, vol. 97, pp. 179–186.

G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard, “A study of the behavior of several methods for balancing machine learning training data,” ACM SIGKDD Explorations Newslett., vol. 6, no. 1, pp. 20–29, 2004.

R. C. Prati, G. E. A. P. A. Batista, and M. C. Monard, “Class imbalances versus class overlapping: An analysis of a learning system behavior,” in Proc. Adv. Artif. Intell., 2004, pp. 312–321.

N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” J. Artif. Intell. Res., vol. 16, pp. 321–357, 2002.

B. X. Wang and N. Japkowicz, “Imbalanced data set learning with synthetic samples,” in Proc. IRIS Mach. Learn. Workshop, Ottawa, Canada, Jun. 2004.

C. Bunkhumpornpat, K. Sinapiromsaran, and C. Lursinsap, “Safe-Level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem,” in Proc. 13th Pacific-Asia Conf. Adv. Knowl. Discovery Data Mining, 2009, pp. 475–482.

X. Fan, K. Tang, and T. Weise, “Margin-based over-sampling method for learning from imbalanced datasets,” in Proc. 15th Pacific-Asia Conf. Adv. Knowl. Discovery Data Mining, 2011, pp. 309–320.

H. Han, W.-Y. Wang, and B.-H. Mao, “Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning,” in Proc. Int. Conf. Adv. Intell. Comput., 2005, pp. 878–887.

H. He, Y. Bai, E. A. Garcia, and S. Li, “ADASYN: Adaptive synthetic sampling approach for imbalanced learning,” in Proc. IEEE Int. Joint Conf. Neural Netw. (IEEE World Congress Comput. Intell.), 2008, pp. 1322–1328.

K. Puntumapon and K. Waiyamai, “A pruning-based approach for searching precise and generalized region for synthetic minority over-sampling,” in Proc. 16th Pacific-Asia Conf. Adv. Knowl. Discovery Data Mining, 2012, pp. 371–382.

F. Fernández-Navarro, C. Hervás-Martínez, and P. A. Gutiérrez, “A dynamic over-sampling procedure based on sensitivity for multi-class problems,” Pattern Recognit., vol. 44, no. 8, pp. 1821–1833, 2011.

M. Lin, K. Tang, and X. Yao, “Dynamic sampling approach to training neural networks for multiclass imbalance classification,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 4, pp. 647–660, 2013.

R. C. Holte, L. Acker, and B. W. Porter, “Concept learning and the problem of small disjuncts,” in Proc. Int. Joint Conf. Artificial Intelligence, 1989, pp. 813–818.

D. Mease, A. J. Wyner, and A. Buja, “Boosted classification trees and class probability/quantile estimation,” J. Mach. Learn. Res., vol. 8, pp. 409–439, 2007.

C. Drummond and R. C. Holte, “C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling,” in Proc. Int. Conf. Mach. Learn., Workshop Learning from Imbalanced Data Sets II, 2003.

A. Orriols-Puig and E. Bernadó-Mansilla, “Evolutionary rule-based systems for imbalanced data sets,” Soft Comput., vol. 13, no. 3, pp. 213–225, 2009.

B. E. Boser, I. M. Guyon, and V. N. Vapnik, “A training algorithm for optimal margin classifiers,” in Proc. 5th Annu. Workshop Comput. Learning Theory, 1992.

F. Pedregosa et al., “Scikit-learn: Machine learning in Python,” J. Mach. Learn. Res., vol. 12, pp. 2825–2830, 2011.

C. J. van Rijsbergen, Information Retrieval. London, U.K.: Butterworths, 1979.

M. Kubat, R. Holte, and S. Matwin, “Learning when negative examples abound,” in Machine Learning: ECML-97, Springer, 1997, pp. 146–153.

Y. Sun, M. S. Kamel, and Y. Wang, “Boosting for learning multiple classes with imbalanced class distribution,” in Proc. 6th IEEE Int. Conf. Data Mining (ICDM’06), 2006, pp. 592–602.

S. J. Sidiq, M. Zaman, and M. Butt, “An experimental comparison of extensible algorithms for multi-class imbalance problem,” 2017.

S. J. Sidiq, M. Ahmed, and M. Ashraf, “An empirical comparison of supervised classifiers for diabetic diagnosis,” International Journal, vol. 8, no. 1, 2017.

