Abstract
We offer an overview of current Web search engine design. After introducing a generic search engine architecture, we examine each engine component in turn. We cover crawling, local Web page storage, indexing, and the use of link analysis for boosting search performance. The most common design and implementation techniques for each of these components are presented. For this presentation we draw from the literature and from our own experimental search engine testbed. Emphasis is on introducing the fundamental concepts and the results of several performance analyses we conducted to compare different designs.
- AHO, A., HOPCROFT, J., AND ULLMAN, J. 1983. Data Structures and Algorithms. Addison-Wesley, Reading, MA.]] Google ScholarDigital Library
- ALBERT, R., BARABASI, A.-L., AND JEONG, H. 1999. Diameter of the World Wide Web. Nature 401, 6749 (Sept.).]]Google ScholarCross Ref
- AMENTO, B., TERVEEN, L., AND HILL, W. 2000. Does authority mean quality? Predicting expert quality ratings of web documents. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY.]] Google ScholarDigital Library
- ANH,V.N.AND MOFFAT, A. 1998. Compressed inverted files with reduced decoding overheads. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98, Melbourne, Australia, Aug. 24-28), W. B. Croft, A. Moffat, C. J. van Rijsbergen, R. Wilkinson, and J. Zobel, Chairs. ACM Press, New York, NY, 290-297.]] Google ScholarDigital Library
- BAR-YOSSEF, Z., BERG, A., CHIEN, S., AND WEITZ, J. F. D. 2000. Approximating aggregate queries about web pages via random walks. In Proceedings of the 26th International Conference on Very Large Data Bases.]] Google ScholarDigital Library
- BARABASI, A.-L. AND ALBERT, R. 1999. Emergence of scaling in random networks. Science 286, 5439 (Oct.), 509-512.]]Google ScholarCross Ref
- BHARAT,K.AND BRODER, A. 1999. Mirror, mirror on the web: A study of host pairs with replicated content. In Proceedings of the Eighth International Conference on The World-Wide Web.]] Google ScholarDigital Library
- BHARAT, K., BRODER, A., HENZINGER, M., KUMAR, P., AND VENKATASUBRAMANIAN, S. 1998. The connectivity server: fast access to linkage information on the Web. Comput. Netw. ISDN Syst. 30, 1-7, 469-477.]] Google ScholarDigital Library
- BRIN, S. 1998. Extracting patterns and relations from the world wide web. In Proceedings of the Sixth International Conference on Extending Database Technology (Valencia, Spain, Mar.), H. -J. Schek, F. Saltor, I. Ramos, and G. Alonso, Eds.]] Google ScholarDigital Library
- BRIN,S.AND PAGE, L. 1998. The anatomy of a large-scale hypertextual Web search engine. Comput. Netw. ISDN Syst. 30, 1-7, 107-117.]] Google ScholarDigital Library
- BRODER, A., KUMAR, R., MAGHOUL, F., RAGHAVAN, P., RAJAGOPALAN, S., STATA, R., TOMKINS, A., AND WIENER, J. 2000. Graph structure in the web: experiments and models. In Proceedings of the Ninth International Conference on The World Wide Web.]] Google ScholarDigital Library
- CHAKRABARTI, S., DOM, B., GIBSON, D., KUMAR,S.R.,RAGHAVAN, P., RAJAGOPALAN, S., AND TOMKINS, A. 1998a. Spectral filtering for resource discovery. In Proceedings of the ACM SIGIR Workshop on Hypertext Information Retrieval on the Web (Melbourne, Australia). ACM Press, New York, NY.]]Google Scholar
- CHAKRABARTI, S., DOM, B., AND INDYK, P. 1998b. Enhanced hypertext categorization using hyperlinks. SIGMOD Rec. 27, 2, 307-318.]] Google ScholarDigital Library
- CHAKRABARTI, S., DOM, B., RAGHAVAN, P., RAJAGOPALAN, S., GIBSON, D., AND KLEINBERG,J. 1998c. Automatic resource compilation by analyzing hyperlink structure and associated text. In Proceedings of the Seventh International Conference on The World Wide Web (WWW7, Brisbane, Australia, Apr. 14-18), P. H. Enslow and A. Ellis, Eds. Elsevier Sci. Pub. B. V., Amsterdam, The Netherlands, 65-74.]] Google ScholarDigital Library
- CHAKRABARTI,S.AND MUTHUKRISHNAN, S. 1996. Resource scheduling for parallel database and scientific applications. In Proceedings of the 8th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '96, Padua, Italy, June 24-26), G. E. Blelloch, Chair. ACM Press, New York, NY, 329-335.]] Google ScholarDigital Library
- CHAKRABARTI, S., VAN DEN BERG, M., AND DOM, B. 1999. Focused crawling: A new approach to topic-specific web resource discovery. In Proceedings of the Eighth International Conference on The World-Wide Web.]] Google ScholarDigital Library
- CHO,J.AND GARCIA-MOLINA, H. 2000a. Estimating frequency of change. Submitted for publication.]]Google Scholar
- CHO,J.AND GARCIA-MOLINA, H. 2000b. The evolution of the web and implications for an incremental crawler. In Proceedings of the 26th International Conference on Very Large Data Bases.]] Google ScholarDigital Library
- CHO,J.AND GARCIA-MOLINA, H. 2000c. Synchronizing a database to improve freshness. In Proceedings of the ACM SIGMOD Conference on Management of Data (SIGMOD '2000, Dallas, TX, May). ACM Press, New York, NY.]] Google ScholarDigital Library
- CHO, J., GARCIA-MOLINA, H., AND PAGE, L. 1998. Efficient crawling through URL ordering. Comput. Netw. ISDN Syst. 30, 1-7, 161-172.]] Google ScholarDigital Library
- COFFMAN,E.J.,LIU, Z., AND WEBER, R. R. 1997. Optimal robot scheduling for web search engines. Tech. Rep. INRIA, Rennes, France.]]Google Scholar
- DEAN,J.AND HENZINGER, M. R. 1999. Finding related pages in the world wide web. In Proceedings of the Eighth International Conference on The World-Wide Web.]] Google ScholarDigital Library
- DILIGENTI, M., COETZEE,F.M.,LAWRENCE, S., GILES,C.L.,AND GORI, M. 2000. Focused crawling using context graphs. In Proceedings of the 26th International Conference on Very Large Data Bases.]] Google ScholarDigital Library
- DOUGLIS, F., FELDMANN, A., AND KRISHNAMURTHY,, B. 1999. Rate of change and other metrics: a live study of the world wide web. In Proceedings of the USENIX Symposium on Internetworking Technologies and Systems. USENIX Assoc., Berkeley, CA.]] Google ScholarDigital Library
- DUMAIS,S.T.,FURNAS,G.W.,LANDAUER,T.K.,DEERWESTER, S., AND HARSHMAN, R. 1988. Using latent semantic analysis to improve access to textual information. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI '88, Washington, DC, May 15-19), J. J. O'Hare, Ed. ACM Press, New York, NY, 281-285.]] Google ScholarDigital Library
- EGGHE,L.AND ROUSSEAU, R. 1990. Introduction to Informetrics. Elsevier Science Inc., New York, NY.]]Google Scholar
- FALOUTSOS, C. 1985. Access methods for text. ACM Comput. Surv. 17, 1 (Mar.), 49-74.]] Google ScholarDigital Library
- FALOUTSOS,C.AND CHRISTODOULAKIS, S. 1984. Signature files: An access method for documents and its analytical performance evaluation. ACM Trans. Inf. Syst. 2, 4 (Oct.), 267-288.]] Google ScholarDigital Library
- GARFIELD, E. 1972. Citation analysis as a tool in journal evaluation. Science 178, 471-479.]]Google ScholarCross Ref
- GIBSON, D., KLEINBERG, J., AND RAGHAVAN, P. 1998. Inferring Web communities from link topology. In Proceedings of the 9th ACM Conference on Hypertext and Hypermedia: Links, Objects, Time and Space-Structure in Hypermedia Systems (HYPERTEXT '98, Pittsburgh, PA, June 20-24), R. Akscyn, Chair. ACM Press, New York, NY, 225-234.]] Google ScholarDigital Library
- GOLUB,G.AND VAN LOAN, C. F. 1989. Matrix Computations. 2nd ed. Johns Hopkins University Press, Baltimore, MD.]] Google ScholarDigital Library
- HAVELIWALA, T. 1999. Efficient computation of pagerank. Tech. Rep. 1999-31. Computer Systems Laboratory, Stanford University, Stanford, CA. http://dbpubs.stanford.edu/ pub/1999-31.]]Google Scholar
- HAWKING, D., CRASWELL, N., AND THISTLEWAITE, P. 1998. Overview of TREC-7 very large collection track. In Proceedings of the 7th Conference on Text Retrieval (TREC-7).]]Google Scholar
- HIRAI, J., RAGHAVAN, S., GARCIA-MOLINA, H., AND PAEPCKE, A. 2000. Webbase: A repository of web pages. In Proceedings of the Ninth International Conference on The World Wide Web. 277-293.]] Google ScholarDigital Library
- HUBERMAN,B.A.AND ADAMIC, L. A. 1999. Growth dynamics of the world wide web. Nature 401, 6749 (Sept.).]]Google ScholarCross Ref
- KLEINBERG, J. 1999. Authoritative sources in a hyperlinked environment. J. ACM 46,6 (Nov.).]] Google ScholarDigital Library
- KOSTER, M. 1995. Robots in the web: trick or treat? ConneXions 9, 4 (Apr.).]]Google Scholar
- KUMAR, R., RAGHAVAN, P., RAJAGOPALAN, S., AND TOMKINS, A. 1999. Trawling the web for emerging cyber-communities. In Proceedings of the Eighth International Conference on The World-Wide Web.]] Google ScholarDigital Library
- LAWRENCE,S.AND GILES, C. 1998. Searching the world wide web. Science 280, 98-100.]]Google ScholarCross Ref
- LAWRENCE,S.AND GILES, C. 1999. Accessibility of information on the web. Nature 400, 107-109.]]Google ScholarCross Ref
- MANBER,U.AND MYERS, G. 1993. Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22, 5 (Oct.), 935-948.]] Google ScholarDigital Library
- MACLEOD,I.A.,MARTIN, P., AND NORDIN, B. 1986. A design of a distributed full text retrieval system. In Proceedings of 1986 ACM Conference on Research and Development in Informa-tion Retrieval (SIGIR '86, Palazzo dei Congressi, Pisa, Italy, Sept. 8-10), F. Rabitti, Ed. ACM Press, New York, NY, 131-137.]] Google ScholarDigital Library
- MELNIK, S., RAGHAVAN, S., YANG, B., AND GARCIA-MOLINA, H. 2000. Building a distributed full-text index for the web. Tech. Rep. SIDL-WP-2000-0140, Stanford Digital Library Project. Computer Systems Laboratory, Stanford University, Stanford, CA. http://www-diglib.stanford.edu/cgi-bin/get/SIDL-WP-2000-0140.]]Google Scholar
- MELNIK, S., RAGHAVAN, S., YANG, B., AND GARCIA-MOLINA, H. 2001. Building a distributed full-text index for the web. In Proceedings of the Tenth International Conference on The World-Wide Web.]] Google ScholarDigital Library
- MOFFAT,A.AND BELL, T. A. H. 1995. In situ generation of compressed inverted files. J. Am. Soc. Inf. Sci. 46, 7 (Aug.), 537-550.]] Google ScholarDigital Library
- MOTWANI,R.AND RAGHAVAN, P. 1995. Randomized Algorithms. Cambridge University Press, New York, NY.]] Google ScholarDigital Library
- PAGE, L., BRIN, S., MOTWANI, R., AND WINOGRAD, T. 1998. The pagerank citation ranking: Bringing order to the web. Tech. Rep.. Computer Systems Laboratory, Stanford University, Stanford, CA.]]Google Scholar
- PINSKI,G.AND NARIN, F. 1976. Citation influence for journal aggregates of scientific publications: Theory, with application to the literature of physics. Inf. Process. Manage. 12.]]Google Scholar
- PITKOW,J.AND PIROLLI, P. 1997. Life, death, and lawfulness on the electronic frontier. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI '97, Atlanta, GA, Mar. 22-27), S. Pemberton, Ed. ACM Press, New York, NY, 383-390.]] Google ScholarDigital Library
- RIBEIRO-NETO,B.A.AND BARBOSA, R. A. 1998. Query performance for tightly coupled distributed digital libraries. In Proceedings of the Third ACM Conference on Digital Libraries (DL '98, Pittsburgh, PA, June 23-26), I. Witten, R. Akscyn, and F. M. Shipman, Eds. ACM Press, New York, NY, 182-190.]] Google ScholarDigital Library
- ROBOTS EXCLUSION PROTOCOL. 2000. Robots Exclusion Protocol. http://info.webcrawler.com/ mak/projects/robots/exclusion.html.]]Google Scholar
- SALTON, G., ED. 1988. Automatic Text Processing. Addison-Wesley Series in Computer Science. Addison-Wesley Longman Publ. Co., Inc., Reading, MA.]] Google ScholarDigital Library
- TOMASIC,A.AND GARCIA-MOLINA, H. 1993. Performance of inverted indices in distributed text document retrieval systems. In Proceedings of the 2nd International Conference on Parallel and Distributed Systems (Dec.). 8-17.]] Google ScholarDigital Library
- VILES,C.L.AND FRENCH, J. C. 1995. Dissemination of collection wide information in a distributed information retrieval system. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '95, Seattle, WA, July 9-13), E. A. Fox, P. Ingwersen, and R. Fidel, Eds. ACM Press, New York, NY, 12-20.]] Google ScholarDigital Library
- WILLS,C.E.AND MIKHAILOV, M. 1999. Towards a better understanding of web resources and server responses for improved caching. In Proceedings of the Eighth International Conference on The World-Wide Web.]] Google ScholarDigital Library
- WITTEN, I., MOFFAT, A., AND BELL, T. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images. 2nd ed. Morgan Kaufmann Publishers Inc., San Francisco, CA.]] Google ScholarDigital Library
Index Terms
- Searching the Web
Recommendations
Content and link-structure perspective of ranking webpages: A review
AbstractThe delivery of ranked relevant results is probably the most important factor in making a web search engine acceptable to its users. This inspiration has led the search engine engineers and researchers to conceive ranking algorithms ...
A Googol of Information about Google
Timothy P. Chartier reviews Google's PageRank and Beyond: The Science of Search Engine Rankings by Amy Langville and Carl Meyer.
Querying the web graph
SPIRE'10: Proceedings of the 17th international conference on String processing and information retrievalThis paper focuses on using hyperlinks in the ranking of web search results. We give a brief overview of the vast body of work in the area; we provide a quantitative comparison of the different features; we sketch how link-based ranking features can be ...
Comments