An Optimized Approach to Record Deduplication

V. Nirmala, B. Rosiline Jeetha


Record deduplication is a specialized technique for eliminating duplicate copies of repeated records. Duplicate record detection is an important part of data preprocessing and cleaning. The growing volume of information available in digital media poses a challenge for data administrators, and that growth also introduces redundant data into databases. A system or method for controlling redundancy and duplication has therefore become essential. Databases are growing in size at an exponential rate, and they play an important role in every industry. In the IT industry, detecting duplicate records is necessary to obtain precise search results and to reduce storage requirements. This paper presents the problem of duplicate records and their detection. In the proposed approach, we use the BAT algorithm to generate an optimal similarity measure for deciding whether records are duplicates. The optimal similarity measure is generated by running the BAT algorithm on the training datasets. The algorithm is initialized with a population of random solutions and searches for optima by updating successive generations of bats. We used synthetic datasets to analyze the proposed algorithm, and we compared its performance against the genetic programming technique using evaluation metrics. Our approach frees the user from the burden of having to choose and tune this parameter.


BAT Algorithm, Data Preprocessing, Duplicate Detection, Data Duplication, Genetic Programming
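
To make the idea in the abstract concrete, the following is a minimal sketch of a bat-style population search that tunes a similarity measure on labeled training pairs. It is not the authors' implementation. Several things here are assumptions made only for illustration: the similarity measure is encoded as per-field weights plus a decision threshold, fitness is plain classification accuracy, the training pairs in TRAIN are made up, and the names fitness, clip, and bat_search as well as all parameter values are hypothetical. The velocity update pulls each bat toward the current best solution; sign conventions for this step vary between published descriptions of the algorithm.

```python
import math
import random

# Hypothetical training pairs: per-field similarity scores (name, address, phone)
# and a label saying whether the pair is a true duplicate. Illustrative data only.
TRAIN = [
    ((0.95, 0.90, 1.00), True),
    ((0.90, 0.40, 1.00), True),
    ((0.85, 0.80, 0.00), True),
    ((0.30, 0.90, 0.00), False),
    ((0.20, 0.10, 0.00), False),
    ((0.60, 0.20, 0.00), False),
]


def fitness(x):
    """Accuracy of the rule 'weighted similarity >= threshold' on TRAIN.

    x holds one weight per field followed by the threshold (an assumed encoding).
    """
    *weights, threshold = x
    total = sum(weights) or 1.0  # avoid dividing by zero when all weights are 0
    correct = 0
    for sims, is_dup in TRAIN:
        score = sum(w * s for w, s in zip(weights, sims)) / total
        correct += (score >= threshold) == is_dup
    return correct / len(TRAIN)


def clip(v):
    """Keep a coordinate inside [0, 1]."""
    return min(1.0, max(0.0, v))


def bat_search(n_bats=20, n_iter=100, f_min=0.0, f_max=2.0,
               alpha=0.9, gamma=0.9, r0=0.5, seed=0):
    rng = random.Random(seed)
    dim = len(TRAIN[0][0]) + 1  # field weights plus one threshold

    # Population of random solutions, zero velocities, full loudness, zero pulse rate.
    pos = [[rng.random() for _ in range(dim)] for _ in range(n_bats)]
    vel = [[0.0] * dim for _ in range(n_bats)]
    loud = [1.0] * n_bats
    rate = [0.0] * n_bats
    fit = [fitness(p) for p in pos]

    best_i = max(range(n_bats), key=fit.__getitem__)
    best, best_fit = pos[best_i][:], fit[best_i]

    for t in range(1, n_iter + 1):
        mean_loud = sum(loud) / n_bats
        for i in range(n_bats):
            # Global move: random frequency scales the pull toward the best bat.
            freq = f_min + (f_max - f_min) * rng.random()
            vel[i] = [v + (b - x) * freq for v, x, b in zip(vel[i], pos[i], best)]
            cand = [clip(x + v) for x, v in zip(pos[i], vel[i])]

            # Local move: with probability (1 - pulse rate), walk around the best bat.
            if rng.random() > rate[i]:
                cand = [clip(b + rng.uniform(-1.0, 1.0) * mean_loud) for b in best]

            # Accept a candidate that is no worse, with probability equal to loudness;
            # then lower the loudness and raise the pulse rate.
            cand_fit = fitness(cand)
            if cand_fit >= fit[i] and rng.random() < loud[i]:
                pos[i], fit[i] = cand, cand_fit
                loud[i] *= alpha
                rate[i] = r0 * (1.0 - math.exp(-gamma * t))

            if fit[i] > best_fit:
                best, best_fit = pos[i][:], fit[i]

    return best, best_fit


if __name__ == "__main__":
    solution, score = bat_search()
    print("weights:", solution[:-1], "threshold:", solution[-1], "accuracy:", score)
```

In this sketch the user supplies only labeled pairs; the weights and threshold come out of the search rather than being set by hand. A real evaluation would use held-out data and the metrics chosen in the paper rather than training accuracy.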

