
A Survey on Duplicate Detection Approaches in Hierarchical Data

Kiran Lokhande, Tushar Rane, S. T. Patil

Abstract


Duplicate detection is the process of finding duplicate objects in data, and it is an important part of the data cleansing step in data mining. A significant amount of work has been done on duplicate detection in relational data, but only recently have researchers shifted their focus toward duplicate detection in hierarchical and semi-structured data, e.g., XML. In this paper, we provide an overview of different methods for duplicate detection in hierarchical and semi-structured data.


Keywords


Data Cleansing, Duplicate Detection, XML, Data Mining, Hierarchical Data





Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.