Open Access Open Access  Restricted Access Subscription or Fee Access

Quad Tree-based K-Means Clustering Algorithm for Software Fault Prediction

S. Samson Dinakaran, B. Meena Preethi

Abstract


Clustering Techniques may be used for fault prediction in software modules, more so in those cases where fault labels are not available. In this paper a Quad Tree-based K-Means algorithm has been applied for predicting faults in program modules. The aims of this paper are twofold. First, Quad Trees are applied for finding the initial cluster centers to be input to the K-Means Algorithm. An input threshold parameter § governs the number of initial cluster centers and by varying § the user can generate desired initial cluster centers. The concept of clustering gain has been used to determine the quality of clusters for evaluation of the Quad Tree-based initialization algorithm as compared to other initialization techniques. The clusters obtained by Quad Tree-based algorithm were found to have maximum gain values. Second, the Quad Tree based algorithm is applied for predicting faults in program modules. The overall error rates of this prediction approach are compared to other existing algorithms and are found to be better in most of the cases.

Keywords


K-Means Clustering, Quad Tree, Software Fault Prediction

Full Text:

PDF

References


C. Catal, U. Sevim, and B. Diri, ―Clustering and Metrics Threshold Based Software Fault Prediction of Unlabeled Program Modules,‖ Proc Sixth Int‘l Conf. Information Technology: New Generations, pp. 199-204, 2009.

S. Zhong, T.M. Khoshgoftaar, and N. Seliya, ―Unsupervised Learning for Expert-Based Software Quality Estimation,‖ Proc. IEEE Eighth Int‘l Symp. High Assurance Systems Eng., pp. 149-155, 2004.

S. Zhong, T.M. Khoshgoftaar, and N. Seliya, ―Analyzing Software Measurement Data with Clustering Techniques,‖ IEEE Intelligent Systems,vol. 19, no. 2, pp. 20-27, Mar./Apr. 2004.

N. Seliya and T.M. Khoshgoftaar, ―Software Quality Classification Modeling Using the PRINT Decision Algorithm,‖ Proc. IEEE 14th Int‘l Conf. Tools with Artificial Intelligence, pp. 365-374, 2002.

J. Han and M. Kamber, Data Mining Concepts and Techniques, second ed, pp. 401-404. Morgan Kaufmann Publishers, 2007.

M. Ester, H.P. Kriegel, J. Sander, and X.Xu., ―A Density-Based Algorithmfor Discovering Clusters in Large Spatial Databases with Noise,‖ Proc.Second Int‘l Conf. Knowledge Discovery and Data Mining (KDD ‘96), pp. 226- 231, 1996.

http://archive.ics.uci.edu/ml/datasets/Iris, 2012.

P.S. Bishnu and V. Bhattacharjee, ―A New Initialization Method for KMeans Algorithm Using Quad Tree,‖ Proc. Nat‘l Conf. Methods and Models in Computing (NCM2.

M. Laszlo and S. Mukherjee, ―A Genetic Algorithm Using Hyper-Quadtrees for Low-Dimensional K-Means Clustering,‖ IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 4, pp. 533-543, Apr. 2006.

D. Pelleg and A. Moore, ―X-Means: Extending K-Means with Efficient Estimation of the Number of Cluster,‖ Proc. 17th Int‘l Conf. Machine Learning, pp. 727-734, 2000.

R. Marinescu, ―Detection Strategies: Metrics-Based Rules for Detecting Design Flaws,‖ Proc. 20th Int‘l Conf. Software Maintenance, pp. 350-359, 2004.

N. Seliya, T.M. Khoshgoftaar, and S. Zhong, ―Analyzing Software Quality with Limited Fault- Proneness Defect Data,‖ Proc. IEEE Ninth Int‘l Symp.High-Assurance Systems Eng., pp. 89-98, 2005.

K.E. Emam, S. Benlarbi, and N. Goel, ―Comparing Case Based Reasoning Classifiers for Predicting High Risk Software Component,‖ J. Systems and Software, vol. 55, no. 3, pp. 301-320, 2001.

V. Bhattacherjee, P.K. Mohanti, and S. Kumar, ―Complexity Metrics for Analogy Based Effort Estimation,‖ J. Theoretical and Applied Information Technology, vol. 6, no. 1, pp. 001-008, 2009.

S. Vicinanza, M.J. Prietulla, and T. Mukhopadhyay, ―Case Based Reasoning in Software Effort Estimation,‖ Proc. 11th Int‘l Conf. Information Systems. pp. 149-158, 1990.

V. Bhattacherjee and P.S. Bishnu, ―Unsupervised Learning Approach to Fault Prediction in Software Module,‖ Proc. Nat‘l Conf. Computing and Systems, pp. 101-108, 2010.

S. Wang and M.P. Armstrong, ―A Quad Tree Approach to Domain Decomposition for Spatial Interpolation in Grid Computing Environment,‖ J. Parallel Computing: High Performance Computing with Geographical Data: vol. 29, no. 10, pp. 1481-1504, 2003.

M.D. Berg, O. Cheong, M. Kreveld, and M. Overmars, Computational Geometry Algorithms and Applications, third ed., pp. 309-315. Springer, 2008.

R.A. Finkel and J.L. Bentley, ―Quad Trees: A Data Structure for Retrieval on Composite Key,‖ Acta information, vol. 4, no. 1, pp. 1-9, 1974.

R. Tibshirani, G. Walther, and T. Hastie, ―Estimating the Number of Clusters in a Dataset via the Gap Statistic,‖ J. Statistical Soc., vol. 63, no. 2, pp. 411-423, 2001.

Y. Jung, H. Park, and D.Z. Du, ―A Decision Criterion for the Optimal Number of Clusters in Hierarchical Clustering,‖ J. Global Optimization, vol. 25, pp. 91-111, 2003.

http://promisedata.org/, 2012.

D. Steinley and M.J. Brusco, ―Initializing K-Means Batch Clustering: A Critical Evaluation of Several Techniques,‖ J. Classification, vol. 24, pp. 99-21, 2007.

A. Likas, N. Vlassis, and J. Verbeek, ―The Global K-means Clustering Algorithm,‖ Pattern Recognition, vol. 36, pp. 451-461, 2003.

SAS, ―The FASTCLUS Procedure,‖ in SAS/STAT 9.1 User‘s Guide vol. 2, Cary, NC, SAS Inst., Inc., 2004.

P.S. Bishnu and V. Bhattacherjee, ―Outlier Detection Technique Using Quad Tree,‖ Proc Int‘l Conf. Computer Comm. Control and Information Technology, pp. 143-148, Feb. 2009.

P.S. Bishnu and V. Bhattacherjee, ―Application of K-Medoids with kd-Tree for Software Fault Prediction,‖ ACM Software Eng. Notes, vol. 36, pp. 1-6,Mar. 2011.

V. Bhattacherjee and P.S. Bishnu, ―Software Fault Prediction Using KMedoids Algorithm,‖ Proc. Int‘l Conf. Productivity, Quality, Reliability, Optimization and Modeling (ICPQROM ‘11), p. 191, Feb. 2011


Refbacks

  • There are currently no refbacks.


Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.