
Exploring Capabilities of Canopy Clustering Algorithm

Srinivas Sivarathri, A. Govardhan

Abstract


Clustering is a widely used data mining technique for extracting business intelligence that helps enterprises make expert decisions. It is an unsupervised learning technique that identifies natural groups among given objects. Many types of clustering algorithms have been developed, including hierarchical, partitioning, density-based, model-based, grid-based, and soft-computing approaches. The quality of clustering and the computational overhead are two important concerns when using clustering techniques. Canopy clustering is a technique best used as a pre-processing step for a main clustering algorithm such as K-Means. With canopies, it becomes feasible to cluster very large data sets that would otherwise be impractical to process. Because canopy clustering uses a cheap distance metric, it can reduce clustering overhead without losing cluster accuracy. However, there is doubt in industry about whether canopy clustering will still be needed in the future, since streaming K-Means can serve the same purpose. In this paper, we explore the canopy clustering algorithm and provide insights into its capabilities. We built a prototype that demonstrates the usefulness of canopy clustering. The empirical results show that canopy clustering substantially reduces computational overhead compared with clustering algorithms that do not use the canopy approach.
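
The pre-clustering idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' prototype: the function names `canopy_clusters` and `manhattan`, the thresholds `t1` (loose) and `t2` (tight), the choice of Manhattan distance as the cheap metric, and the deterministic choice of the first remaining point as each canopy center (the usual formulation picks a point at random) are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]
Canopy = Tuple[Point, List[Point]]  # (center, members)


def manhattan(a: Point, b: Point) -> float:
    """Cheap distance used only to form canopies (an illustrative choice)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def canopy_clusters(
    points: Sequence[Point],
    t1: float,
    t2: float,
    distance: Callable[[Point, Point], float] = manhattan,
) -> List[Canopy]:
    """Group points into possibly overlapping canopies.

    t1 is the loose threshold (canopy membership) and t2 the tight
    threshold (removal from the candidate list); requires t1 > t2 > 0.
    """
    if not t1 > t2 > 0:
        raise ValueError("thresholds must satisfy t1 > t2 > 0")

    remaining = list(points)
    canopies: List[Canopy] = []
    while remaining:
        # Assumption: take the first remaining point as the center so the
        # result is deterministic; a random pick is the more common choice.
        center = remaining[0]
        # Every remaining point within the loose threshold joins the canopy.
        members = [p for p in remaining if distance(center, p) < t1]
        canopies.append((center, members))
        # Points within the tight threshold can no longer start or join
        # new canopies; the center itself is at distance 0 and is removed.
        remaining = [p for p in remaining if distance(center, p) >= t2]
    return canopies


if __name__ == "__main__":
    data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
            (10.0, 10.0), (11.0, 10.0), (5.0, 5.0)]
    for center, members in canopy_clusters(data, t1=4.0, t2=2.0):
        print(center, members)
```

In a pipeline of the kind the abstract describes, the canopy centers could serve as initial centroids for K-Means, and the expensive distance would then be computed only between points that share a canopy. How much overhead this saves depends on the data and on the thresholds chosen.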


Keywords


Clustering, Pre-Clustering, Canopy Clustering, Distance Metric


