
Automatic Selection of Decision Tree Algorithm Based on Training Set Size

Dr. K. Vivekanandan, T. Sathyabama, M. Prabhavathi

Abstract


In data mining applications, very large training sets with several million records are common. Decision trees are a powerful and popular technique for both classification and prediction. Many decision tree construction algorithms have been proposed to handle large or small training sets, and each is best suited to a particular scale: C4.5 classifies categorical and continuous attributes very well but handles only smaller data sets efficiently, while SLIQ (Supervised Learning In Quest) and SPRINT (Scalable Parallelizable Induction of Decision Trees) are designed for very large data sets. This paper deals with the automatic selection of a decision tree algorithm based on training set size. The proposed system first estimates the training set size using a mathematical measure and checks the result against the available memory. If memory is sufficient, tree construction proceeds with one of the algorithms C4.5, SLIQ, or SPRINT. After the data set is classified, the accuracy of the classifier is estimated. The major advantages of the proposed approach are that the system takes less time and avoids memory problems.
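The selection step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual procedure: the size measure (records times bytes per record), the `10x` threshold separating SLIQ from SPRINT, and the function name are all assumptions, since the exact mathematical measure is not given on this page.

```python
def select_algorithm(n_records, record_bytes, available_bytes):
    """Choose a decision tree algorithm from the estimated in-memory
    size of the training set (illustrative thresholds)."""
    # Simple size measure: number of records times bytes per record.
    estimated_size = n_records * record_bytes

    if estimated_size <= available_bytes:
        # Entire training set fits in memory: C4.5 is adequate.
        return "C4.5"
    elif estimated_size <= 10 * available_bytes:
        # Larger than memory but not extreme: SLIQ keeps only a
        # class list resident and reads attribute lists from disk.
        return "SLIQ"
    else:
        # Far larger than memory: SPRINT removes SLIQ's resident
        # class list and can be parallelized across processors.
        return "SPRINT"
```

For example, a training set of 1,000 records at 100 bytes each easily fits in 1 MB of memory, so the sketch selects C4.5; the same memory budget with 10 million records selects SPRINT.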


Keywords


Data Mining, Decision Trees, Classification, Machine Learning, Training Data.


References


Rakesh Agrawal, Tomasz Imielinski, and Arun Swami, "Database mining: A performance perspective", IEEE Transactions on Knowledge and Data Engineering, vol. 5, no. 6, 1993, pp. 914-925.

M. Mehta, R. Agrawal, and J. Rissanen, "SLIQ: A fast scalable classifier for data mining", in Proc. of EDBT, 1996.

Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", San Diego: Academic Press, 2001.

S. Rasoul Safavian and David Landgrebe, "A Survey of Decision Tree Classifier Methodology", IEEE Transactions on Systems, Man, and Cybernetics, vol. 21, no. 3, pp. 660-674, May 1991.

J. Ross Quinlan, "C4.5: Programs for Machine Learning", Morgan Kaufmann, 1993.

Amir Bar-Or, Daniel Keren, Assaf Schuster, and Ran Wolff, "Hierarchical Decision Tree Induction in Distributed Genomic Databases", IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 8, August 2005.

Tjen-Sien Lim, Wei-Yin Loh, and Yu-Shan Shih, "A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-Three Old and New Classification Algorithms", Machine Learning, vol. 40, pp. 203-229, 2000.

Tjen-Sien Lim, Wei-Yin Loh, and Yu-Shan Shih, "An Empirical Comparison of Decision Trees and Other Classification Methods", University of Wisconsin, Madison, Technical Report 979, Jan. 1998.

J. Shafer, R. Agrawal, and M. Mehta, "SPRINT: A scalable parallel classifier for data mining", in Proc. of VLDB, 1996.

Arun K. Pujari, "Data Mining Techniques", Universities Press, 2001.

Alex Berson, Stephen Smith, and Kurt Thearling, "Building Data Mining Applications for CRM", Tata McGraw-Hill Publishers, 2000.

M. Banerjee and M.K. Chakraborty, "Rough Logics: A survey with further directions", Rough Sets Analysis, Physica-Verlag, Heidelberg, 1997.

L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, "Classification and Regression Trees", Wadsworth, Belmont, 1984.

J. Bala, J. Huang, H. Vafaie, K. DeJong, and H. Wechsler, "Hybrid Learning Using Genetic Algorithms and Decision Trees for Pattern Classification", 2003.

Andrew B. Nobel, "Analysis of a complexity based pruning scheme for classification trees", IEEE Transactions on Information Theory, vol. 48, pp. 2362-2368, 2002.

Carla E. Brodley and Paul E. Utgoff, "Multivariate versus Univariate Decision Trees", COINS Technical Report 92-8, Jan. 1992.

B. Cremilleux and C. Robert, "Use of Attribute Selection Criteria in Decision Trees in Uncertain Domains", Universite de Caen, 1998.

David Bowser-Chao and Debra L. Dzialo, "A Comparison of the Use of Binary Decision Trees and Neural Networks in Top Quark Detection", Center for Particle Physics, 4 Sep. 1992.

Deborah R. Carvalho and Alex A. Freitas, "A hybrid decision tree/genetic algorithm for coping with the problem of small disjuncts in data mining", 2004.

V. Corruble, D.E. Brown, and C.L. Pittard, "A comparison of decision classifiers with backpropagation neural networks for multimodal classification problems", Pattern Recognition, vol. 26, pp. 953-961, 1993.

Donato Malerba, Floriana Esposito, and Giovanni Semeraro, "A Further Comparison of Simplification Methods for Decision-Tree Induction", Springer-Verlag, 1996.

Eibe Frank and Ian H. Witten, "Selecting Multiway Splits in Decision Trees", Department of Computer Science, University of Waikato, 1996.

D. Fournier and B. Cremilleux, "Using Impurity and Depth for Decision Trees Pruning", Universite de Caen, 2000.

Floriana Esposito, Donato Malerba, and Giovanni Semeraro, "A Comparative Analysis of Methods for Pruning Decision Trees", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 5, May 1997.

W. Frawley, G. Piatetsky-Shapiro, and C. Matheus, "Knowledge Discovery in Databases: An Overview", AI Magazine, Fall 1996, pp. 213-228.

Goharian and Grossman, "Introduction to Data Mining", Spring 2003.

Johannes Gehrke, Raghu Ramakrishnan, and Venkatesh Ganti, "RainForest - A Framework for Fast Decision Tree Construction of Large Datasets", in Proc. of the 24th VLDB Conference, New York, USA, 1998.

Johannes Gehrke, Venkatesh Ganti, Raghu Ramakrishnan, and Wei-Yin Loh, "BOAT—Optimistic Decision Tree Construction", in Proc. of SIGMOD '99, Philadelphia, 1999.

Haixun Wang and Carlo Zaniolo, "CMP: A Fast Decision Tree Classifier Using Multivariate Predictions", University of California at Los Angeles, Los Angeles, CA 90095, 1997.

D. Hand, H. Mannila, and P. Smyth, "Principles of Data Mining", MIT Press, Cambridge, MA, 2001.

Igor Kononenko and Edvard Simec, "Induction of decision trees using RELIEFF", University of Ljubljana, Slovenia, 1994.

I. Kononenko, "Estimating attributes: Analysis and Extensions of RELIEF", Proc. European Conf. on Machine Learning, Catania, April 1994.

M. James, "Classification Algorithms", Wiley, 1985.

John Robust, "The Effects of Training Set Size on Decision Tree Complexity", 1995.

C.Z. Janikow, "Fuzzy decision trees: Issues and methods", IEEE Transactions on Systems, Man and Cybernetics, vol. 28, pp. 1-14, 1998.

J. Kent Martin, "An Exact Probability Metric for Decision Tree Splitting", Learning from Data: AI and Statistics, Springer-Verlag, 1996.

Mahesh V. Joshi, George Karypis, and Vipin Kumar, "ScalParC: A New Scalable and Efficient Parallel Classification Algorithm for Mining Large Datasets", NSF CCR 9423082 and Army Research Office, 1999.

J.L. Kolodner, "Case Based Reasoning", San Mateo: Morgan Kaufmann, 1993.

Jeffrey W. Seifert, "Data Mining: An Overview", June 7, 2005.

Ravi Kothari and Ming Dong, "Decision Trees for Classification - A review and some new results", World Scientific, June 30, 2000.

K. Kira and L. Rendell, "The Feature Selection Problem: Traditional Methods and a New Algorithm", Proc. AAAI'92, San Jose, CA, July 1992.

Khaled Alsabti, Sanjay Ranka, and Vineet Singh, "CLOUDS: A Decision Tree Classifier for Large Datasets", Department of CISE, University of Florida, October 27, 1998.

Micheline Kamber, Lara Winstone, Wan Gong, Shan Cheng, and Jiawei Han, "Generalization and Decision Tree Induction: Efficient Classification in Data Mining", Database Systems Research Laboratory, 1996.




This work is licensed under a Creative Commons Attribution 3.0 License.