A Survey on Data Extraction from Web Pages

Deepa John; G. Naveen Sundar

Abstract

The Internet provides a huge amount of information, and the amount of information on the Web is growing at an astonishing rate. The Web can be considered the largest knowledge base. Although Web pages contain a great deal of information, extracting data from them is very difficult, mainly because their structure is complex and varies from page to page. Because Web information sources lack any uniform structure, access to this huge collection of information has been limited to browsing and searching. Many applications, however, need the data extracted from Web pages, and extracting only the relevant data is a tedious task. Robust, flexible extraction methods that transform Web pages into program-friendly structures such as a relational database have therefore become a necessity. Although many approaches for data extraction from Web pages have been developed, there has been limited effort to compare such tools. This survey paper reviews some of the techniques for Web data extraction.

Keywords

Semi-Structured Data, Data Extraction, Web Database, Web Mining, Wrapper Generation.

References

Wei Liu, Xiaofeng Meng, and Weiyi Meng, “ViDE: A Vision-Based Approach for Deep Web Data Extraction,” IEEE Transactions on Knowledge and Data Engineering, pp. 447-460, March 2010.

G.O. Arocena and A.O. Mendelzon, “WebOQL: Restructuring Documents, Databases, and Webs,” Proc. Int’l Conf. Data Eng. (ICDE), pp. 24-33, 1998.

V. Crescenzi, G. Mecca, and P. Merialdo, “RoadRunner: Towards Automatic Data Extraction from Large Web Sites,” Proc. Int’l Conf. Very Large Data Bases (VLDB), pp. 109-118, 2001.

D. Cai, S. Yu, J. Wen, and W. Ma, “Extracting Content Structure for Web Pages Based on Visual Representation,” Proc. Asia Pacific Web Conf. (APWeb), pp. 406-417, 2003.

L. Liu, C. Pu, and W. Han, “XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources,” Proc. Int’l Conf. Data Eng. (ICDE), pp. 611-621, 2000.

Y. Lu, H. He, H. Zhao, W. Meng, and C.T. Yu, “Annotating Structured Data of the Deep Web,” Proc. Int’l Conf. Data Eng. (ICDE), pp. 376-385, 2007.

W. Liu, X. Meng, and W. Meng, “Vision-Based Web Data Records Extraction,” Proc. Int’l Workshop Web and Databases (WebDB ’06), pp. 20-25, June 2006.

Y. Zhai and B. Liu, “Web Data Extraction Based on Partial Tree Alignment,” Proc. Int’l World Wide Web Conf. (WWW), pp. 76-85, 2005.

H. Zhao, W. Meng, Z. Wu, V. Raghavan, and C.T. Yu, “Fully Automatic Wrapper Generation for Search Engines,” Proc. Int’l World Wide Web Conf. (WWW), pp. 66-75, 2005.

J. Zhu, Z. Nie, J. Wen, B. Zhang, and W. Ma, “Simultaneous Record Detection and Attribute Labeling in Web Data Extraction,” Proc. Int’l Conf. Knowledge Discovery and Data Mining (KDD), pp. 494-503, 2006.

This work is licensed under a Creative Commons Attribution 3.0 License.
