An Analysis of Various Record Matching Approaches and Similarity Computations

Cyju Varghese, Naveen Sundar

Abstract


Linking or matching databases is becoming increasingly important in many data mining projects, as linked data can contain information that is not available otherwise or that would be too expensive to collect manually. Record matching refers to the task of identifying, among two or more records, those that describe the same entity. Because record matching solves the duplication detection problem, a suitable record matching technique needs to be identified. This paper presents a survey of record matching techniques, highlighting the approaches they use, the number of classifiers they employ, and the stages of duplication detection they perform, and compares the techniques with one another. It also presents the various matching metrics available. Further, we point out potential pitfalls as well as challenging issues that a record matching technique needs to address. We then present an unsupervised method for performing record matching in a Web database scenario. We believe that the results of this evaluation will help analysts devise easier and more feasible methods for record matching, which is a challenging task, particularly in the Web scenario.

Keywords


Duplication Detection, Record Matching, Similarity Calculation, Unsupervised

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.