Detecting and Removing Duplicate Records from Multiple Web Databases

Tapashi Paul; V. Ulagamuthalvi

Abstract

Users submit queries to multiple web databases, and the matching results are retrieved from each of them. The difficulty is that the combined results contain duplicate and redundant records. To address this, an unsupervised online record matching method, Unsupervised Duplicate Detection (UDD), is used to identify duplicates in the query results of multiple web databases. The duplicate records are identified and discarded, and only the original records are displayed to the user. For this purpose, two classifiers, the Weighted Component Similarity Summing (WCSS) classifier and the Support Vector Machine (SVM) classifier, are applied iteratively to find duplicates and filter them out of the results from multiple web databases. A second goal is to prevent the duplication of websites. URLs are generally assigned static weightage; here, dynamic weightage is introduced instead and allocated to the respective URLs so that unauthorized users cannot create duplicate sites. UDD is found to work much better than the existing methods.
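The weighted similarity summing idea behind the WCSS classifier can be illustrated with a minimal sketch. This is not the authors' implementation: the field names, the weights, the threshold, and the use of Python's `difflib.SequenceMatcher` as the per-field similarity are all assumptions made for illustration, and the iterative SVM step is omitted.

```python
from difflib import SequenceMatcher

# Hypothetical field weights; the abstract does not give the actual values.
WEIGHTS = {"title": 0.5, "author": 0.3, "year": 0.2}
# Assumed cut-off above which two records are treated as duplicates.
THRESHOLD = 0.85


def field_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] for one field."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def weighted_similarity(r1: dict, r2: dict, weights: dict = WEIGHTS) -> float:
    """Sum the per-field similarities, each scaled by its field weight."""
    return sum(
        w * field_similarity(r1.get(field, ""), r2.get(field, ""))
        for field, w in weights.items()
    )


def remove_duplicates(records: list, threshold: float = THRESHOLD) -> list:
    """Keep the first record of each duplicate group, preserving input order."""
    kept = []
    for rec in records:
        # A record is kept only if it is not too similar to any kept record.
        if all(weighted_similarity(rec, k) < threshold for k in kept):
            kept.append(rec)
    return kept


if __name__ == "__main__":
    results = [
        {"title": "Data Mining Concepts", "author": "J. Han", "year": "2006"},
        {"title": "data mining concepts", "author": "J. Han", "year": "2006"},
        {"title": "Operating Systems", "author": "A. Tanenbaum", "year": "2001"},
    ]
    for rec in remove_duplicates(results):
        print(rec["title"])
```

In this sketch the second record differs from the first only in letter case, so it is filtered out, while the third record is kept. A fuller version would learn the weights from the data and hand uncertain pairs to an SVM, as the abstract describes.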

Keywords

Duplicate, Record Matching, Weightage, Database and Multiple Result.

This work is licensed under a Creative Commons Attribution 3.0 License.
