A Survey on Visual Cue Based Data Area Identification in Unsupervised Web Data Extraction

M. Priya, S. Jamuna Rani

Abstract


Structured data in Web pages usually contain important information. Such data are often retrieved from underlying databases and displayed in Web pages using fixed templates. In this paper, we call these structured data objects data records. There are two main approaches to data extraction: wrapper induction and automatic extraction. In wrapper induction, a set of data extraction rules is learnt from a set of manually labeled pages. However, manual labeling is labor-intensive and time-consuming, and it must be repeated for different sites, or even for different pages within the same site, because they may follow different templates.

Keywords


Data Extraction, Data Record Alignment, Visual Cues

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.