Optimized Web Page Generation Using Web Content Mining

M. Karthikeyan, P. Aruna

Abstract


In the past few years, the amount of information available on the World Wide Web has increased exponentially. Web pages are a rich source for information retrieval and data mining, but most HTML documents on the Internet are cluttered with less informative and typically unrelated material such as banner ads, navigation bars, and copyright notices. Such irrelevant information is not part of the main content of a page, and it seriously harms Web mining and searching. In this paper we develop an automatic HTML generator that produces optimized web pages from existing web pages using Web content mining. The input to the HTML generator is one or more HTML web pages, which are downloaded either manually by the user or by the download manager built into the generator. The downloaded pages are mined, and the useful information, including keywords, is extracted and stored in a specified location. Using the keywords, the web pages are clustered with the DBSCAN clustering algorithm to identify the website category. A new optimized web page is then created from these mined resources. The resulting page is user friendly and noise free, and it may contain text, images, audio, video, structured lists, and hyperlink structures. Although only sample web pages from five different categories are considered, the proposed method can be applied to any web pages that can be mined for knowledge extraction.
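
The mining and clustering steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that noise is whatever sits inside a fixed set of tags, that keywords are the most frequent non-stop-words, and that pages are compared by Jaccard distance between keyword sets. The names `NOISE_TAGS`, `STOP_WORDS`, `extract_keywords`, and `dbscan`, as well as the `eps` and `min_pts` values, are hypothetical choices made for illustration only.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumptions for illustration (not from the paper): which tags count as
# noise, and a very small stop-word list.
NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
STOP_WORDS = {"the", "and", "for", "with", "that", "this", "are", "from"}


class TextExtractor(HTMLParser):
    """Collects text that is not nested inside a noise tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of noise tags currently open
        self.chunks = []  # text fragments kept as main content

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_keywords(html, top_n=5):
    """Return the top_n most frequent content words of a page as a set."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]{3,}", " ".join(parser.chunks).lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return {w for w, _ in counts.most_common(top_n)}


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def dbscan(items, eps, min_pts):
    """Plain DBSCAN over keyword sets; returns one label per item, -1 = noise."""
    labels = [None] * len(items)

    def neighbors(i):
        # Includes i itself, as in the usual DBSCAN definition.
        return [j for j in range(len(items))
                if jaccard_distance(items[i], items[j]) <= eps]

    cluster = -1
    for i in range(len(items)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is a core point, keep expanding
                queue.extend(nbrs)
    return labels


# Toy pages standing in for downloaded web pages.
pages = [
    "<html><nav>Home About</nav><p>cricket match score cricket</p>"
    "<footer>Copyright</footer></html>",
    "<html><p>cricket score team match</p><script>var x = 1;</script></html>",
    "<html><p>recipe pasta sauce recipe</p></html>",
    "<html><p>pasta recipe sauce cheese</p></html>",
]
keywords = [extract_keywords(p) for p in pages]
labels = dbscan(keywords, eps=0.5, min_pts=2)
print(labels)  # [0, 0, 1, 1]
```

In this toy run the two sports pages and the two cooking pages fall into separate clusters, and the text in the navigation bar, footer, and script never reaches the keyword sets. A real system would need a more robust noise detector and a tuned distance threshold.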

Keywords


Web Content Mining, Text Mining, Web Structure Mining, Link Mining, HTML Generator

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.