Improving Effectiveness in Large Scale Data by Concentrating Deduplication with Adaboost Algorithm

A. Srilekha

Abstract

Deduplication is the process of finding duplicate or copied records by comparing one or more databases or data sets. Matching records across multiple sources is known as record linkage. The output of deduplication is matched data that contains important, usable information. Because this information is costly to obtain, deduplication is receiving growing attention. Within deduplication, the cleaning step that removes duplicate records from a single database is difficult, and it matters because duplicates can strongly distort subsequent data processing or data mining. As databases grow, the complexity of the matching process has become one of the major challenges for both record linkage and deduplication. To mitigate this problem, we propose T3S, a Two-Stage Sampling Selection model. The first stage produces a balanced subset of candidate pairs that need to be labeled. The second stage incrementally invokes an active selection step that removes the redundant pairs created in the first stage, yielding training sets that are smaller and more informative than the first-stage subset. Using mnemonic names in the second stage, the duplicate files are identified. In the classification phase, we extend our work with the AdaBoost algorithm, an effective classification approach. A number of studies have reported that AdaBoost gives better accuracy than an SVM classifier. The outcome of the proposed approach on a real-world dataset provides a comparative analysis of both methods, showing that the proposed approach gives better results than SVM.
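To make the classification phase concrete, the following is a minimal sketch of AdaBoost applied to labeled candidate pairs. It is not the authors' implementation. It assumes that each candidate pair has already been reduced to a vector of per-field similarity scores (here a hypothetical name similarity and address similarity), that the weak learners are single-feature threshold stumps, and that the toy data and all function names are illustrative only.

```python
import math

def stump(x, feature, threshold, polarity):
    """Weak learner: vote `polarity` if the similarity reaches the threshold."""
    return polarity if x[feature] >= threshold else -polarity

def train_adaboost(X, y, rounds=3):
    """Train AdaBoost with threshold stumps; labels are +1 (duplicate) or -1."""
    n = len(X)
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        # Exhaustively pick the stump with the lowest weighted error.
        best = None
        for feature in range(len(X[0])):
            for threshold in sorted({x[feature] for x in X}):
                for polarity in (1, -1):
                    err = sum(w for w, xi, yi in zip(weights, X, y)
                              if stump(xi, feature, threshold, polarity) != yi)
                    if best is None or err < best[0]:
                        best = (err, feature, threshold, polarity)
        err, feature, threshold, polarity = best
        if err >= 0.5:
            break  # no stump is better than chance under the current weights
        err = max(err, 1e-10)  # avoid division by zero for a perfect stump
        alpha = 0.5 * math.log((1.0 - err) / err)
        model.append((alpha, feature, threshold, polarity))
        # Increase the weight of misclassified pairs, then renormalize.
        weights = [w * math.exp(-alpha * yi * stump(xi, feature, threshold, polarity))
                   for w, xi, yi in zip(weights, X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model

def predict(model, x):
    """Weighted vote of all stumps: +1 means duplicate, -1 means distinct."""
    score = sum(alpha * stump(x, f, t, p) for alpha, f, t, p in model)
    return 1 if score >= 0 else -1

# Hypothetical labeled pairs: (name_similarity, address_similarity).
X = [(0.95, 0.90), (0.85, 0.80), (0.90, 0.70), (0.75, 0.95),
     (0.20, 0.10), (0.90, 0.15), (0.30, 0.85), (0.40, 0.30)]
y = [1, 1, 1, 1, -1, -1, -1, -1]

model = train_adaboost(X, y, rounds=3)
predictions = [predict(model, x) for x in X]
```

On this toy set no single threshold separates duplicates from non-duplicates, since one non-duplicate has a high name similarity and another has a high address similarity. Boosting several stumps lets the weighted vote require agreement across both fields, which is the reason for trying an ensemble classifier here.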

Keywords

AdaBoost, Deduplication, Hashing Algorithm, SVM Classifier, Tokenization, T3S.

This work is licensed under a Creative Commons Attribution 3.0 License.
