
A Comparative Study of Software Bug Clustering Using Lingo and STC Web Clustering Algorithms

Naresh Kumar Nagwani, Dr. Shreesh Verma


Software bug classification is an important and popular problem in software engineering. Recently, a number of algorithms and techniques have been presented to automate this process. Software bug data contains a number of attributes, such as bug-id, summary (title), description, comments, status and version. Most of the important attributes hold text data. Lingo and STC (Suffix Tree Clustering) are both popular text clustering algorithms used in web mining. In this paper, the Lingo and STC algorithms are used to classify software bugs. A classification-using-clustering methodology is used to create software bug classes from software bug clusters: clusters are created first, and then an appropriate label is assigned to each cluster, which serves as the class label for that cluster. Both algorithms are implemented as part of the Carrot2 framework. The software bug repository data is integrated and passed to the Carrot2 framework, where the Lingo and STC algorithms are applied. The two algorithms are then compared on the software bug classification task using several clustering parameters, including the number of clusters generated and the purity and entropy of the clusters created.
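The purity and entropy comparison can be illustrated with a minimal sketch. The abstract does not state the formulas used, so the code below assumes the commonly used definitions: purity is the fraction of items that belong to the majority class of their cluster, and entropy is the size-weighted average of the per-cluster class entropy (base-2 logarithm, not normalized). The function name `purity_and_entropy`, the sample cluster ids and the class labels are hypothetical and are not taken from the paper or from Carrot2.

```python
import math
from collections import Counter


def purity_and_entropy(cluster_ids, class_labels):
    """Return (purity, entropy) of a clustering against reference classes.

    Assumed definitions (not taken from the paper):
      purity  = (1/N) * sum over clusters of the majority-class count
      entropy = sum over clusters of (n_j/N) * H_j, where H_j is the
                base-2 entropy of the class distribution in cluster j
    """
    if not cluster_ids or len(cluster_ids) != len(class_labels):
        raise ValueError("inputs must be non-empty and of equal length")

    total = len(cluster_ids)

    # Group the reference class labels by the cluster each item landed in.
    members = {}
    for cid, label in zip(cluster_ids, class_labels):
        members.setdefault(cid, []).append(label)

    majority_sum = 0
    weighted_entropy = 0.0
    for labels in members.values():
        counts = Counter(labels)
        size = len(labels)
        majority_sum += max(counts.values())
        cluster_entropy = sum(
            -(c / size) * math.log2(c / size) for c in counts.values()
        )
        weighted_entropy += (size / total) * cluster_entropy

    return majority_sum / total, weighted_entropy


# Hypothetical example: six bug reports, two clusters, two reference classes.
clusters = [0, 0, 0, 1, 1, 1]
classes = ["ui", "ui", "crash", "crash", "crash", "crash"]
purity, entropy = purity_and_entropy(clusters, classes)
print(f"purity={purity:.3f} entropy={entropy:.3f}")
```

Under these definitions, higher purity and lower entropy both indicate clusters that align more closely with the reference classes, which is how the two measures would typically be read when comparing two clustering algorithms on the same data.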


Software Bug Classification, Lingo Clustering, STC Clustering, Software Bug Clustering, Software Bug Repository.







This work is licensed under a Creative Commons Attribution 3.0 License.