Detecting and Removing Duplicate Records from Multiple Web Databases

Tapashi Paul, V. Ulagamuthalvi


In the basic scenario, a user's query is matched against multiple web databases and the results are retrieved from each of them. The difficulty is that the combined results contain duplicate and redundant records. To address this, an unsupervised online record matching method, UDD, is used to identify duplicates in the query results returned by multiple web databases. The duplicate records are identified and discarded, so that only the original records are displayed to the user. For this purpose, two classifiers, the Weighted Component Similarity Summing (WCSS) classifier and the Support Vector Machine (SVM) classifier, are applied iteratively to find the duplicates and filter them out of the results. A second contribution is to prevent duplication of websites. URLs are generally assigned static weightage; here, dynamic weightage is introduced instead. Dynamic weightage is allocated to the respective URLs to prevent unauthorized users from creating duplicate sites. The results indicate that UDD performs better than existing methods.
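As a rough illustration of the weighted component similarity idea behind WCSS, the following Python sketch scores a pair of records as a weighted sum of per-field similarities and drops any record whose score against an already kept record reaches a threshold. It is a minimal sketch under stated assumptions, not the UDD implementation: the field names, the fixed weights, the token-overlap (Jaccard) field similarity, the threshold of 0.8, and the function names are all hypothetical, and the iterative weight adjustment and the SVM step are omitted.

```python
from typing import Dict, List

Record = Dict[str, str]


def field_similarity(a: str, b: str) -> float:
    """Token-overlap (Jaccard) similarity of two field values (assumed measure)."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty fields are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def wcss_similarity(r1: Record, r2: Record, weights: Dict[str, float]) -> float:
    """Weighted sum of per-field similarities; weights are assumed to sum to 1."""
    return sum(
        w * field_similarity(r1.get(field, ""), r2.get(field, ""))
        for field, w in weights.items()
    )


def remove_duplicates(
    records: List[Record], weights: Dict[str, float], threshold: float = 0.8
) -> List[Record]:
    """Keep a record only if it is below the threshold against every kept record."""
    kept: List[Record] = []
    for rec in records:
        if all(wcss_similarity(rec, k, weights) < threshold for k in kept):
            kept.append(rec)
    return kept


# Hypothetical query results merged from several web databases.
records = [
    {"title": "Data Mining Concepts and Techniques", "author": "Jiawei Han"},
    {"title": "data mining concepts and techniques", "author": "Jiawei Han"},
    {"title": "Pattern Recognition", "author": "Christopher Bishop"},
]
weights = {"title": 0.6, "author": 0.4}  # assumed static weights

unique = remove_duplicates(records, weights)
for rec in unique:
    print(rec["title"], "|", rec["author"])
```

In this sketch the second record differs from the first only in letter case, so it is treated as a duplicate and only the first and third records are printed.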


Duplicate, Record Matching, Weightage, Database and Multiple Result.

