Identification of Duplicate Records over Query Results from Real Time Web Databases

J. Aruna; J. Jeysree

Abstract

Detecting database records that are approximate duplicates is an important task. When a database is assembled from millions of records drawn from other sources, unintentional duplication can hardly be avoided. Databases may contain duplicate records that represent the same real-world entity because of data entry errors, abbreviations, or differences in the detailed schemas of records from multiple databases. Supervised methods are the current techniques used for duplicate detection, and they require training data. These methods are not applicable to the real-time database scenario, where the records to match are query results generated dynamically on the fly. To address record matching in this scenario, we present Unsupervised Duplication Detection (UDD), an algorithm that, for a given query, can effectively identify duplicates among the query result records of multiple databases. Starting from a non-duplicate set, the proposed algorithm uses a weighted component similarity summing classifier and an OSVM classifier to iteratively identify duplicates in the query results from multiple databases.
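As a rough illustration of the first classifier named in the abstract, the sketch below scores a record pair by a weighted sum of per-field similarities and compares the score to a threshold. It is not the authors' implementation: the field names, the weights, the 0.85 threshold, and the use of `difflib.SequenceMatcher` as the string similarity are all assumptions made for the example, and the OSVM step and the iterative refinement are omitted.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] for one field (SequenceMatcher is an assumed stand-in)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def weighted_similarity(r1: dict, r2: dict, weights: dict) -> float:
    """Weighted sum of per-field similarities; weights are assumed to sum to 1."""
    return sum(w * field_similarity(r1[f], r2[f]) for f, w in weights.items())


def is_duplicate(r1: dict, r2: dict, weights: dict, threshold: float = 0.85) -> bool:
    """Label a pair as duplicate when its weighted score reaches the threshold."""
    return weighted_similarity(r1, r2, weights) >= threshold


# Hypothetical records and weights, chosen only for illustration.
weights = {"title": 0.5, "author": 0.3, "year": 0.2}
a = {"title": "Data Cleaning Basics", "author": "J. Smith", "year": "2004"}
b = {"title": "Data cleaning basics", "author": "J. Smith", "year": "2004"}
c = {"title": "Graph Theory", "author": "R. Diestel", "year": "1997"}

print(is_duplicate(a, b, weights))  # same record up to letter case
print(is_duplicate(a, c, weights))  # unrelated records
```

In this sketch the weights are fixed by hand; the abstract indicates that the algorithm instead works iteratively from a non-duplicate set, so fixed weights should be read only as a simplification.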

Keywords

Record Matching; Duplication Detection; SVM; UDD

This work is licensed under a Creative Commons Attribution 3.0 License.
