
Crawling the Web at Desktop Balance Inspection

N. Prasanna Balaji, Pingili Madhavi

Abstract


A focused crawler is a hypertext resource discovery system whose goal is to selectively seek out pages relevant to a pre-defined set of topics. The inherent characteristics of focused crawling, personalization and low resource needs, naturally lend it to use by individuals. Current focused crawlers depend on a classifier that scores each crawled document with respect to the predefined set of topics.

Today, finding information on the web is an arduous task that will only get worse with the exponential growth of content. To deal with this issue, search engines, web sites that let users query for web pages of interest, have become the utility of choice. A simple search against any given engine can yield thousands of results, and a user could spend the majority of her time just paging through them. More often than not, she must visit the referenced page to see whether the result is actually what she was looking for. This can be attributed to two facts: the result's link metadata, i.e., the description of the page returned by the search engine, is not very descriptive, and the result page itself is not guaranteed to be what the user is looking for. Once a page matching the user's expectations is found, the search engine moves out of the picture, and it is up to the user to continue mining the needed information. This can be a cumbersome task, as pages on the web have a plethora of links and an abundance of content. Filtering through thousands of references is not a rewarding search process, given that there is no way to be certain the search will be complete or will yield the page the user was originally interested in. These problems must be addressed in order to increase the effectiveness of search engines.

We believe the aforementioned problems could be addressed if the number of links the user had to visit to find the desired web page could be reduced. We first define a term used throughout the rest of the paper and in our assertion below: a concept is one overarching idea or topic present in a web page. Put simply, our assertion is that if the concepts underlying the web page results being searched can be presented to the user automatically, then the number of links the user needs to visit to find her desired result will be reduced. Essentially, this means automatically discovering the set of concepts that describe a web page.

This paper describes and evaluates a method for automatically extracting concepts from web pages returned by heterogeneous search engines, including Google, MSN Search, Yahoo! Search, AltaVista, and Ask Jeeves. Along with regular concepts, our method also extracts complex concepts.
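The abstract touches on two mechanisms: a classifier that scores each crawled document against predefined topics, and automatic extraction of concepts from result pages. As an illustration of the first, below is a minimal best-first crawler sketch in Python. The keyword-overlap scorer, the topic terms, and the seed URL are hypothetical stand-ins for the trained classifier and configuration a real focused crawler would use; none of them come from the paper.

# A minimal best-first focused crawler. The frontier is a priority queue
# ordered by a relevance score, so pages judged on-topic are expanded
# first. score() is a hypothetical keyword-overlap stand-in for the
# trained classifier a real focused crawler would depend on.
import heapq
import re
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

TOPIC_TERMS = {"crawler", "crawling", "search", "retrieval", "web"}  # assumed topic

class LinkExtractor(HTMLParser):
    """Collects href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def score(text):
    """Toy relevance score: fraction of topic terms present in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return len(TOPIC_TERMS & words) / len(TOPIC_TERMS)

def focused_crawl(seed, max_pages=20, threshold=0.2):
    frontier = [(-1.0, seed)]      # heapq is a min-heap, so scores are negated
    seen, relevant, fetched = {seed}, [], 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue
        fetched += 1
        relevance = score(html)
        if relevance >= threshold:
            relevant.append((url, relevance))
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            child = urljoin(url, href)
            if child.startswith("http") and child not in seen:
                seen.add(child)
                # A child's priority is inherited from its parent's score.
                heapq.heappush(frontier, (-relevance, child))
    return relevant

if __name__ == "__main__":
    for url, rel in focused_crawl("https://example.com"):
        print(f"{rel:.2f}  {url}")

For the second mechanism, here is a toy sketch of concept extraction, assuming frequent single terms as proxies for regular concepts and frequent two-word phrases as proxies for complex concepts; the paper's actual extraction method is not detailed in this abstract, so every choice below is an assumption.

# A toy sketch of extracting "concepts" (overarching topics) from a
# result page's text. Frequent unigrams stand in for regular concepts
# and frequent bigrams for "complex" concepts; a real system would use
# noun-phrase detection rather than raw n-gram counts.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}

def extract_concepts(text, top_k=5):
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS]
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    concepts = [w for w, _ in unigrams.most_common(top_k)]
    concepts += [" ".join(pair) for pair, _ in bigrams.most_common(top_k)]
    return concepts

print(extract_concepts("Focused crawling seeks topical pages; focused "
                       "crawling relies on a topic classifier."))

The priority-queue design in the first sketch, where child links inherit the parent page's relevance, mirrors the best-first strategy common to early focused crawlers; a stronger variant would score the link's anchor text and surrounding context instead.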


Keywords


Crawling, crawlers, focused crawler, query, track, user queries.

Full Text:

PDF



Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.