Min Hash Clustering Algorithm for Extraction of HTML Tags from Social Media

Dr. D. Napoleon, M. Praneesh


The World Wide Web is a huge and fast-growing source of information. Most of this information is in the form of free text, which makes it hard to query. However, many websites have large collections of pages containing structured data, i.e., data that follow a common structure or template. We therefore present an extended Min Hash algorithm for extracting templates from a large number of web documents that are generated from heterogeneous templates.


HTML Tags, Template Extraction, Web Pages.

