Extracting Template Properties using Agglomerative Clustering
Abstract
The World Wide Web is widely used to publish and access information on the Internet. Most pages on a web site are published by filling a common template with content. Templates are ready-made holders that give readers easy access to the content through a consistent structure, and they give the pages a common look and feel. However, the irrelevant terms contained in templates degrade the accuracy and performance of web applications. Template detection techniques have therefore received a lot of attention recently as a way to improve the performance of search engines and of the clustering and classification of web documents. The proposed system presents a new clustering algorithm for grouping web pages that use similar templates. Web pages in the same cluster are homogeneous and have equal priority, so not all of them are displayed. To prioritize a homogeneous web page, the properties of that particular website are extracted and modified. Changing these properties converts a homogeneous web page into a heterogeneous one.
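The grouping step described above can be illustrated with a minimal sketch. The abstract does not state which page features, distance measure, or linkage rule the proposed algorithm uses, so everything below is an assumption made for illustration: each page is represented by its set of root-to-element tag paths, pages are compared with Jaccard distance, and clusters are merged by average linkage until the closest pair is farther apart than a hypothetical threshold of 0.5. The names tag_paths, jaccard_distance, and agglomerative_clusters are invented here and do not come from the paper.

```python
from html.parser import HTMLParser
from itertools import combinations

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths seen in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Assumed structural fingerprint of a page: its set of tag paths."""
    collector = TagPathCollector()
    collector.feed(html)
    return frozenset(collector.paths)


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def agglomerative_clusters(pages, threshold=0.5):
    """Average-linkage agglomerative clustering of pages by tag-path sets.

    Starts with one cluster per page and repeatedly merges the closest
    pair of clusters until their distance exceeds `threshold` (assumed).
    Returns a list of clusters, each a sorted list of page indices.
    """
    features = [tag_paths(page) for page in pages]
    clusters = [[i] for i in range(len(pages))]

    def linkage(c1, c2):
        total = sum(
            jaccard_distance(features[i], features[j]) for i in c1 for j in c2
        )
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        (x, y), dist = min(
            (
                ((x, y), linkage(clusters[x], clusters[y]))
                for x, y in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if dist > threshold:
            break
        clusters[x] = clusters[x] + clusters[y]  # x < y, so deleting y is safe
        del clusters[y]
    return [sorted(cluster) for cluster in clusters]
```

Given two pages that share the same div/h1/p skeleton and a third built around a table, this sketch places the first two in one cluster and leaves the third on its own. A real template detector would likely need richer features and a more scalable merge strategy than this quadratic search.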
This work is licensed under a Creative Commons Attribution 3.0 License.