
Clustering of Web Page for Different Domains using Data Extraction and Self Organizing Map

Chhaya Varade, Dr. Bhupesh Gour, Dr. Asif Ullah Khan, Shailendra Jain


Given the rapid growth and success of public information sources on the World Wide Web, it is increasingly attractive to extract data from these sources and make it available for further processing by end users and application programs. Data extracted from Web sites can serve as the springboard for a variety of tasks, including information retrieval (e.g. business intelligence), event monitoring (news and stock market), and electronic commerce (shopping comparison). Extracting structured data from Web sites is not a trivial task. Most of the information on the Web today is in the form of Hypertext Markup Language (HTML) documents, which are viewed by humans with a browser. A sophisticated method to organize the layout of the information and assist user navigation is therefore particularly important. Data extraction is the process of retrieving data from data sources for further data processing. Online data exists in the form of web records: in response to an end-user query, web databases generate query result pages, and the records are taken from these pages. The main objective of this paper is to extract and align important data from different domains with the help of HTML tags and their values. After the data is extracted, a Self Organizing Map (SOM) groups the extracted data from the different domains into clusters. Clustering is the process of grouping physical or abstract objects into classes of similar objects.


Data Extraction, Data Record Alignment, Clustering, QRR, SOM.
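
The pipeline outlined in the abstract (describe each extracted record by its HTML tags, then let a SOM group the records) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the tag vocabulary `TAGS`, the helper names `tag_vector`, `best_matching_unit` and `train_som`, the 2x2 grid, and the linear decay schedule are all hypothetical choices, and only tag frequencies (not tag values) are used as features.

```python
import math
import random
from collections import Counter
from html.parser import HTMLParser

# Hypothetical feature vocabulary; the paper does not list its features here.
TAGS = ["table", "tr", "td", "a", "p", "img"]


class TagCounter(HTMLParser):
    """Counts the opening tags seen in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def tag_vector(html):
    """Map an HTML record to a normalized tag-frequency vector over TAGS."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    total = sum(parser.counts[t] for t in TAGS) or 1  # avoid division by zero
    return [parser.counts[t] / total for t in TAGS]


def best_matching_unit(weights, vector):
    """Return the grid position whose weight vector is closest to `vector`."""
    return min(
        weights,
        key=lambda node: sum((w - x) ** 2 for w, x in zip(weights[node], vector)),
    )


def train_som(vectors, rows=2, cols=2, epochs=200, lr0=0.5, seed=0):
    """Train a small rectangular SOM; returns {(row, col): weight_vector}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    sigma0 = max(rows, cols) / 2
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                   # assumed linear learning-rate decay
        sigma = max(sigma0 * (1 - frac), 0.1)   # assumed shrinking neighbourhood
        for v in rng.sample(vectors, len(vectors)):
            bmu = best_matching_unit(weights, v)
            for node, w in weights.items():
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2 * sigma * sigma))
                for i in range(dim):
                    w[i] += lr * h * (v[i] - w[i])
    return weights


if __name__ == "__main__":
    # Made-up records standing in for two different domains.
    records = [
        "<table><tr><td>Title</td><td>Price</td></tr></table>",
        "<table><tr><td>Item</td><td>Cost</td></tr></table>",
        "<p>Headline <a href='#'>more</a> <img src='x.png'></p>",
        "<p>Story <a href='#'>read</a> <img src='y.png'></p>",
    ]
    vectors = [tag_vector(r) for r in records]
    som = train_som(vectors)
    for record, vector in zip(records, vectors):
        print(best_matching_unit(som, vector), record[:30])
```

Records that land on the same or neighbouring grid nodes are treated as one cluster. A real system would need richer features and a larger map; the small grid here only keeps the sketch short.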







This work is licensed under a Creative Commons Attribution 3.0 License.