Data Extraction in Web Databases by Combining Tag and Value Similarity
Abstract
In real-time applications, identifying records that represent the same real-world entity is a major challenge. Detecting and removing duplicate records that refer to the same entity within a single dataset is an important data preprocessing task. This work enhances CTVS, a data extraction and alignment method that combines tag and value similarity, with an unsupervised duplicate detection (UDD) algorithm that eliminates duplicate records in web databases. CTVS automatically extracts data from query result pages by first identifying and segmenting the query result records (QRRs) in those pages and then aligning the segmented QRRs into a table in which the data values of the same attribute are placed in the same column. Specifically, new techniques are proposed to handle QRRs that are not contiguous, which may be caused by auxiliary information such as a comment, recommendation, or advertisement, and to handle any nested structure that may exist in the QRRs. In addition, a new record alignment algorithm is designed that aligns the attributes in a record, first pairwise and then holistically, by combining tag and data value similarity information.
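The pairwise alignment step described in the abstract can be illustrated with a minimal Python sketch. It is not the CTVS implementation: the field representation (a tag path plus a text value), the use of `difflib.SequenceMatcher` for both similarities, the equal weighting, the 0.5 threshold, and the function names `tag_similarity`, `value_similarity`, `combined_similarity`, and `align_pair` are all assumptions made for illustration. The holistic alignment step and the UDD duplicate detection step are not covered.

```python
from difflib import SequenceMatcher

# A field is a (tag_path, value) pair, e.g. (["tr", "td", "b"], "Data Mining").
# This representation is an assumption for illustration only.

def tag_similarity(path_a, path_b):
    """Similarity of two tag paths (lists of tag names), in [0, 1]."""
    return SequenceMatcher(None, path_a, path_b).ratio()

def value_similarity(value_a, value_b):
    """Case-insensitive string similarity of two data values, in [0, 1]."""
    return SequenceMatcher(None, value_a.lower(), value_b.lower()).ratio()

def combined_similarity(field_a, field_b, tag_weight=0.5):
    """Weighted mix of tag and value similarity (the weight is hypothetical)."""
    (path_a, value_a), (path_b, value_b) = field_a, field_b
    return (tag_weight * tag_similarity(path_a, path_b)
            + (1 - tag_weight) * value_similarity(value_a, value_b))

def align_pair(record_a, record_b, threshold=0.5):
    """Greedy, order-preserving pairwise alignment of two records.

    Returns a list of (index_in_a, index_in_b) pairs whose combined
    similarity exceeds the threshold; unmatched fields are skipped.
    """
    pairs = []
    next_b = 0  # only look at fields of record_b after the last match
    for i, field_a in enumerate(record_a):
        best_j, best_score = None, threshold
        for j in range(next_b, len(record_b)):
            score = combined_similarity(field_a, record_b[j])
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None:
            pairs.append((i, best_j))
            next_b = best_j + 1
    return pairs

record_a = [(["tr", "td", "b"], "Data Mining Concepts"),
            (["tr", "td", "span"], "$45.00")]
record_b = [(["tr", "td", "b"], "Data Mining Techniques"),
            (["tr", "td", "i"], "Used"),
            (["tr", "td", "span"], "$39.50")]

print(align_pair(record_a, record_b))
```

In this toy input the second record carries an extra field ("Used") that has no counterpart in the first record. The sketch leaves it unmatched, so in an aligned table it would occupy its own column with an empty cell for the first record.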
This work is licensed under a Creative Commons Attribution 3.0 License.