
Min Hash Clustering Algorithm for Extraction of HTML Tags from Social Media

Dr. D. Napoleon, M. Praneesh


The World Wide Web is a huge and fast-growing source of information. Most of this information is in the form of free text, which makes it hard to query. However, many websites have large collections of pages containing structured data, i.e., data that follows a structure or template. Here we present an extended MinHash algorithm for extracting templates from a large number of web documents that are generated from heterogeneous templates.
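As a rough illustration of the underlying idea, the following is a minimal sketch of plain MinHash applied to the HTML tags of a page. It is not the extended algorithm the paper proposes, whose details are not given here. The function names (`tag_set`, `minhash_signature`, `estimated_jaccard`), the use of a page's tag-name set as its feature set, and the choice of 64 MD5-based hash functions are all assumptions made for the example.

```python
import hashlib
import re


def tag_set(html):
    """Return the set of opening tag names in an HTML string (assumed feature set)."""
    return {t.lower() for t in re.findall(r"<\s*([a-zA-Z][a-zA-Z0-9]*)", html)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token, salted with a seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(tokens, num_hashes=64):
    """Signature: for each seeded hash function, the minimum hash over all tokens."""
    if not tokens:
        raise ValueError("cannot build a MinHash signature from an empty set")
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions, which estimates the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

Pages whose signatures agree in most positions are likely to share a template, so signatures could serve as a cheap similarity measure for grouping pages before any template is extracted. Using tag names alone ignores tag order and nesting; a real system would probably need richer features.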


HTML Tags, Template Extraction, Web Pages.







Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.