Development and Evolution of Web Crawlers: Current Status and Future Perspectives

Sayali A. Sapkal, Prachi M. Joshi, Mousami V. Munot

Abstract


The Internet provides access to a huge repository of data, making the search and retrieval of required information trivial. At the same time, the dynamic increase in the complexity and volume of that information makes effective search both necessary and challenging. Web crawling is the mechanism that search engines use to collect pages from the World Wide Web. Crawlers need to be intelligent and adaptive with respect to the environment in which they operate. Although several strategies exist for ranking pages, achieving the required precision and recall values remains a challenge. Other factors that add to the complexity are memory constraints and the hit rate. Given recent trends and developments, building efficient web crawlers is still an open issue that attracts many researchers. This article presents a survey of the origin of web crawlers, the need for them, and the current advances in their evolution.


Keywords


Search engine, Web crawler, World Wide Web

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.