
A Survey of Data Cleansing Algorithms for Detecting Duplicate Records

R. Muthunagai, A. Benaseer


In today’s competitive environment, there is a need for more precise information to support better decision making. Yet inconsistencies in submitted data make it difficult to aggregate data and analyze results, which may cause delays or compromise the data when results are reported. The purpose of this article is to survey the algorithms available for cleaning data, in order to meet industry's growing demand for more standardised data. Data cleaning algorithms can increase data quality while reducing the overall effort of data collection.


Record Matching, Duplicate Detection, Data Cleaning, Data Integration, Data Deduplication, Entity Matching.







This work is licensed under a Creative Commons Attribution 3.0 License.