
Component based Effective Web Crawler and Indexer using Web Services

A. Vadivel, S.G. Shaila, R. Devi Mahalakshmi, J. Karthika

Abstract


Designing and developing an effective web crawler is a challenging task in a large search engine. This paper proposes a component-based web crawler together with an indexer. The system consists of crawler services and indexer services, both realized as web services. The services communicate using XML, SOAP and WSDL. In the crawler service, web pages are fetched and parsed to retrieve all their hyperlinks. This process is carried out recursively using a breadth-first strategy. The pages at the extracted URLs are downloaded and sent to the indexer service by message passing. In the indexer service, HTML pages are parsed, stop words are removed and keywords are stemmed as pre-processing steps, and the result is stored as an inverted index. We have evaluated the performance of the proposed design of the crawler with indexer and found that the number of pages retrieved is notably on the higher side.
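The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a small in-memory page table (`PAGES`, with hypothetical `example.test` URLs) in place of real HTTP fetches, plain function calls in place of SOAP messages between the services, a tiny illustrative stop-word list, and a toy suffix stripper standing in for a real stemmer. All function and variable names are invented for illustration.

```python
from collections import defaultdict, deque
from html.parser import HTMLParser
import re

# Hypothetical in-memory "web" (URL -> HTML); stands in for real HTTP fetches.
PAGES = {
    "http://example.test/a": '<a href="http://example.test/b">B</a> '
                             '<a href="http://example.test/c">C</a> '
                             'crawlers are crawling',
    "http://example.test/b": '<a href="http://example.test/a">A</a> '
                             'indexing the pages',
    "http://example.test/c": '<p>the indexed page</p>',
}

# Illustrative stop-word list (assumption; the paper does not give one).
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to"}


class PageParser(HTMLParser):
    """Collects hyperlink targets and text content from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text.append(data)


def stem(word):
    """Toy suffix stripper; a real indexer would use a proper stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def crawl_bfs(seed, fetch, max_pages=100):
    """Crawler side: visit pages breadth-first, yielding (url, html)."""
    queue, seen = deque([seed]), {seed}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be fetched; skip it
            continue
        fetched += 1
        yield url, html
        parser = PageParser()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # enqueue each URL at most once
                seen.add(link)
                queue.append(link)


def index_page(index, url, html):
    """Indexer side: tokenize, drop stop words, stem, update postings."""
    parser = PageParser()
    parser.feed(html)
    for token in re.findall(r"[a-z]+", " ".join(parser.text).lower()):
        if token not in STOP_WORDS:
            index[stem(token)].add(url)


def build_index(seed):
    """Run the crawl and hand each page to the indexer as it arrives."""
    index = defaultdict(set)  # inverted index: term -> set of URLs
    order = []
    for url, html in crawl_bfs(seed, PAGES.get):
        order.append(url)
        index_page(index, url, html)
    return order, index


order, index = build_index("http://example.test/a")
```

Here `order` records the breadth-first visiting sequence and `index` maps each stemmed term to the set of URLs containing it. In a service-based design such as the one the abstract describes, the call from `build_index` to `index_page` would presumably be replaced by a message sent from the crawler service to the indexer service.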

Keywords


Inverted Index, Tokenization, URL, Web Crawler, Web Service.


Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.