Extracting Template Properties using Agglomerative Clustering

R. Devika; T. Mohanraj

Abstract

The World Wide Web is widely used to publish and access information on the Internet. Most web pages on a website are published by filling common templates with content. Templates are ready-made containers that give readers easy access to the content through a consistent structure, and they give the web pages a common look and feel. However, irrelevant terms in the templates degrade the accuracy and performance of web applications. Template detection techniques have therefore received considerable attention recently as a way to improve the performance of search engines and of the clustering and classification of web documents. The proposed system presents a new clustering algorithm for grouping web pages that use similar templates. Web pages in the same cluster are homogeneous and have equal priority, so not all of them are displayed. To prioritize a homogeneous web page, the properties of its website are extracted and modified. Changing these properties converts the homogeneous web page into a heterogeneous one.
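
The grouping step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which page features, similarity measure, or linkage rule the proposed algorithm uses, so everything below is an assumption made for illustration: pages are represented by their sets of root-to-element tag paths, similarity is the Jaccard index, and clusters are merged bottom-up by average linkage until no pair reaches a hypothetical `threshold`. The names `tag_paths`, `jaccard`, and `cluster_by_template` are likewise invented here and are not taken from the paper.

```python
from html.parser import HTMLParser

# HTML void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths found in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Assumed template fingerprint: the set of tag paths in the page."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return frozenset(collector.paths)


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_by_template(pages, threshold=0.8):
    """Average-linkage agglomerative clustering of pages (assumed variant).

    Returns a list of clusters, each a list of indices into `pages`.
    """
    features = [tag_paths(page) for page in pages]
    clusters = [[i] for i in range(len(pages))]

    def linkage(c1, c2):
        total = sum(jaccard(features[i], features[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        best, a, b = max(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if best < threshold:
            break  # no remaining pair looks like the same template
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

Under these assumptions, two pages that share the same markup skeleton but differ in text end up in one cluster, while a page with a different skeleton stays in its own. The exhaustive pair search is quadratic per merge and is meant only to make the bottom-up merging explicit, not to scale to large sites.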

Keywords

Clustering, Heterogeneous, Homogeneous, Prioritize, Template Extraction.

This work is licensed under a Creative Commons Attribution 3.0 License.
