skip to main content
article

Searching the Web

Authors Info & Claims
Published:01 August 2001Publication History
Skip Abstract Section

Abstract

We offer an overview of current Web search engine design. After introducing a generic search engine architecture, we examine each engine component in turn. We cover crawling, local Web page storage, indexing, and the use of link analysis for boosting search performance. The most common design and implementation techniques for each of these components are presented. For this presentation we draw from the literature and from our own experimental search engine testbed. Emphasis is on introducing the fundamental concepts and the results of several performance analyses we conducted to compare different designs.

References

  1. AHO, A., HOPCROFT, J., AND ULLMAN, J. 1983. Data Structures and Algorithms. Addison-Wesley, Reading, MA.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. ALBERT, R., BARABASI, A.-L., AND JEONG, H. 1999. Diameter of the World Wide Web. Nature 401, 6749 (Sept.).]]Google ScholarGoogle ScholarCross RefCross Ref
  3. AMENTO, B., TERVEEN, L., AND HILL, W. 2000. Does authority mean quality? Predicting expert quality ratings of web documents. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. ANH,V.N.AND MOFFAT, A. 1998. Compressed inverted files with reduced decoding overheads. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98, Melbourne, Australia, Aug. 24-28), W. B. Croft, A. Moffat, C. J. van Rijsbergen, R. Wilkinson, and J. Zobel, Chairs. ACM Press, New York, NY, 290-297.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. BAR-YOSSEF, Z., BERG, A., CHIEN, S., AND WEITZ, J. F. D. 2000. Approximating aggregate queries about web pages via random walks. In Proceedings of the 26th International Conference on Very Large Data Bases.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. BARABASI, A.-L. AND ALBERT, R. 1999. Emergence of scaling in random networks. Science 286, 5439 (Oct.), 509-512.]]Google ScholarGoogle ScholarCross RefCross Ref
  7. BHARAT,K.AND BRODER, A. 1999. Mirror, mirror on the web: A study of host pairs with replicated content. In Proceedings of the Eighth International Conference on The World-Wide Web.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. BHARAT, K., BRODER, A., HENZINGER, M., KUMAR, P., AND VENKATASUBRAMANIAN, S. 1998. The connectivity server: fast access to linkage information on the Web. Comput. Netw. ISDN Syst. 30, 1-7, 469-477.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. BRIN, S. 1998. Extracting patterns and relations from the world wide web. In Proceedings of the Sixth International Conference on Extending Database Technology (Valencia, Spain, Mar.), H. -J. Schek, F. Saltor, I. Ramos, and G. Alonso, Eds.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. BRIN,S.AND PAGE, L. 1998. The anatomy of a large-scale hypertextual Web search engine. Comput. Netw. ISDN Syst. 30, 1-7, 107-117.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. BRODER, A., KUMAR, R., MAGHOUL, F., RAGHAVAN, P., RAJAGOPALAN, S., STATA, R., TOMKINS, A., AND WIENER, J. 2000. Graph structure in the web: experiments and models. In Proceedings of the Ninth International Conference on The World Wide Web.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. CHAKRABARTI, S., DOM, B., GIBSON, D., KUMAR,S.R.,RAGHAVAN, P., RAJAGOPALAN, S., AND TOMKINS, A. 1998a. Spectral filtering for resource discovery. In Proceedings of the ACM SIGIR Workshop on Hypertext Information Retrieval on the Web (Melbourne, Australia). ACM Press, New York, NY.]]Google ScholarGoogle Scholar
  13. CHAKRABARTI, S., DOM, B., AND INDYK, P. 1998b. Enhanced hypertext categorization using hyperlinks. SIGMOD Rec. 27, 2, 307-318.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. CHAKRABARTI, S., DOM, B., RAGHAVAN, P., RAJAGOPALAN, S., GIBSON, D., AND KLEINBERG,J. 1998c. Automatic resource compilation by analyzing hyperlink structure and associated text. In Proceedings of the Seventh International Conference on The World Wide Web (WWW7, Brisbane, Australia, Apr. 14-18), P. H. Enslow and A. Ellis, Eds. Elsevier Sci. Pub. B. V., Amsterdam, The Netherlands, 65-74.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. CHAKRABARTI,S.AND MUTHUKRISHNAN, S. 1996. Resource scheduling for parallel database and scientific applications. In Proceedings of the 8th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '96, Padua, Italy, June 24-26), G. E. Blelloch, Chair. ACM Press, New York, NY, 329-335.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. CHAKRABARTI, S., VAN DEN BERG, M., AND DOM, B. 1999. Focused crawling: A new approach to topic-specific web resource discovery. In Proceedings of the Eighth International Conference on The World-Wide Web.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. CHO,J.AND GARCIA-MOLINA, H. 2000a. Estimating frequency of change. Submitted for publication.]]Google ScholarGoogle Scholar
  18. CHO,J.AND GARCIA-MOLINA, H. 2000b. The evolution of the web and implications for an incremental crawler. In Proceedings of the 26th International Conference on Very Large Data Bases.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. CHO,J.AND GARCIA-MOLINA, H. 2000c. Synchronizing a database to improve freshness. In Proceedings of the ACM SIGMOD Conference on Management of Data (SIGMOD '2000, Dallas, TX, May). ACM Press, New York, NY.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. CHO, J., GARCIA-MOLINA, H., AND PAGE, L. 1998. Efficient crawling through URL ordering. Comput. Netw. ISDN Syst. 30, 1-7, 161-172.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. COFFMAN,E.J.,LIU, Z., AND WEBER, R. R. 1997. Optimal robot scheduling for web search engines. Tech. Rep. INRIA, Rennes, France.]]Google ScholarGoogle Scholar
  22. DEAN,J.AND HENZINGER, M. R. 1999. Finding related pages in the world wide web. In Proceedings of the Eighth International Conference on The World-Wide Web.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. DILIGENTI, M., COETZEE,F.M.,LAWRENCE, S., GILES,C.L.,AND GORI, M. 2000. Focused crawling using context graphs. In Proceedings of the 26th International Conference on Very Large Data Bases.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. DOUGLIS, F., FELDMANN, A., AND KRISHNAMURTHY,, B. 1999. Rate of change and other metrics: a live study of the world wide web. In Proceedings of the USENIX Symposium on Internetworking Technologies and Systems. USENIX Assoc., Berkeley, CA.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. DUMAIS,S.T.,FURNAS,G.W.,LANDAUER,T.K.,DEERWESTER, S., AND HARSHMAN, R. 1988. Using latent semantic analysis to improve access to textual information. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI '88, Washington, DC, May 15-19), J. J. O'Hare, Ed. ACM Press, New York, NY, 281-285.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. EGGHE,L.AND ROUSSEAU, R. 1990. Introduction to Informetrics. Elsevier Science Inc., New York, NY.]]Google ScholarGoogle Scholar
  27. FALOUTSOS, C. 1985. Access methods for text. ACM Comput. Surv. 17, 1 (Mar.), 49-74.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. FALOUTSOS,C.AND CHRISTODOULAKIS, S. 1984. Signature files: An access method for documents and its analytical performance evaluation. ACM Trans. Inf. Syst. 2, 4 (Oct.), 267-288.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. GARFIELD, E. 1972. Citation analysis as a tool in journal evaluation. Science 178, 471-479.]]Google ScholarGoogle ScholarCross RefCross Ref
  30. GIBSON, D., KLEINBERG, J., AND RAGHAVAN, P. 1998. Inferring Web communities from link topology. In Proceedings of the 9th ACM Conference on Hypertext and Hypermedia: Links, Objects, Time and Space-Structure in Hypermedia Systems (HYPERTEXT '98, Pittsburgh, PA, June 20-24), R. Akscyn, Chair. ACM Press, New York, NY, 225-234.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. GOLUB,G.AND VAN LOAN, C. F. 1989. Matrix Computations. 2nd ed. Johns Hopkins University Press, Baltimore, MD.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. HAVELIWALA, T. 1999. Efficient computation of pagerank. Tech. Rep. 1999-31. Computer Systems Laboratory, Stanford University, Stanford, CA. http://dbpubs.stanford.edu/ pub/1999-31.]]Google ScholarGoogle Scholar
  33. HAWKING, D., CRASWELL, N., AND THISTLEWAITE, P. 1998. Overview of TREC-7 very large collection track. In Proceedings of the 7th Conference on Text Retrieval (TREC-7).]]Google ScholarGoogle Scholar
  34. HIRAI, J., RAGHAVAN, S., GARCIA-MOLINA, H., AND PAEPCKE, A. 2000. Webbase: A repository of web pages. In Proceedings of the Ninth International Conference on The World Wide Web. 277-293.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. HUBERMAN,B.A.AND ADAMIC, L. A. 1999. Growth dynamics of the world wide web. Nature 401, 6749 (Sept.).]]Google ScholarGoogle ScholarCross RefCross Ref
  36. KLEINBERG, J. 1999. Authoritative sources in a hyperlinked environment. J. ACM 46,6 (Nov.).]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. KOSTER, M. 1995. Robots in the web: trick or treat? ConneXions 9, 4 (Apr.).]]Google ScholarGoogle Scholar
  38. KUMAR, R., RAGHAVAN, P., RAJAGOPALAN, S., AND TOMKINS, A. 1999. Trawling the web for emerging cyber-communities. In Proceedings of the Eighth International Conference on The World-Wide Web.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. LAWRENCE,S.AND GILES, C. 1998. Searching the world wide web. Science 280, 98-100.]]Google ScholarGoogle ScholarCross RefCross Ref
  40. LAWRENCE,S.AND GILES, C. 1999. Accessibility of information on the web. Nature 400, 107-109.]]Google ScholarGoogle ScholarCross RefCross Ref
  41. MANBER,U.AND MYERS, G. 1993. Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22, 5 (Oct.), 935-948.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. MACLEOD,I.A.,MARTIN, P., AND NORDIN, B. 1986. A design of a distributed full text retrieval system. In Proceedings of 1986 ACM Conference on Research and Development in Informa-tion Retrieval (SIGIR '86, Palazzo dei Congressi, Pisa, Italy, Sept. 8-10), F. Rabitti, Ed. ACM Press, New York, NY, 131-137.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. MELNIK, S., RAGHAVAN, S., YANG, B., AND GARCIA-MOLINA, H. 2000. Building a distributed full-text index for the web. Tech. Rep. SIDL-WP-2000-0140, Stanford Digital Library Project. Computer Systems Laboratory, Stanford University, Stanford, CA. http://www-diglib.stanford.edu/cgi-bin/get/SIDL-WP-2000-0140.]]Google ScholarGoogle Scholar
  44. MELNIK, S., RAGHAVAN, S., YANG, B., AND GARCIA-MOLINA, H. 2001. Building a distributed full-text index for the web. In Proceedings of the Tenth International Conference on The World-Wide Web.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. MOFFAT,A.AND BELL, T. A. H. 1995. In situ generation of compressed inverted files. J. Am. Soc. Inf. Sci. 46, 7 (Aug.), 537-550.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. MOTWANI,R.AND RAGHAVAN, P. 1995. Randomized Algorithms. Cambridge University Press, New York, NY.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. PAGE, L., BRIN, S., MOTWANI, R., AND WINOGRAD, T. 1998. The pagerank citation ranking: Bringing order to the web. Tech. Rep.. Computer Systems Laboratory, Stanford University, Stanford, CA.]]Google ScholarGoogle Scholar
  48. PINSKI,G.AND NARIN, F. 1976. Citation influence for journal aggregates of scientific publications: Theory, with application to the literature of physics. Inf. Process. Manage. 12.]]Google ScholarGoogle Scholar
  49. PITKOW,J.AND PIROLLI, P. 1997. Life, death, and lawfulness on the electronic frontier. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI '97, Atlanta, GA, Mar. 22-27), S. Pemberton, Ed. ACM Press, New York, NY, 383-390.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. RIBEIRO-NETO,B.A.AND BARBOSA, R. A. 1998. Query performance for tightly coupled distributed digital libraries. In Proceedings of the Third ACM Conference on Digital Libraries (DL '98, Pittsburgh, PA, June 23-26), I. Witten, R. Akscyn, and F. M. Shipman, Eds. ACM Press, New York, NY, 182-190.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. ROBOTS EXCLUSION PROTOCOL. 2000. Robots Exclusion Protocol. http://info.webcrawler.com/ mak/projects/robots/exclusion.html.]]Google ScholarGoogle Scholar
  52. SALTON, G., ED. 1988. Automatic Text Processing. Addison-Wesley Series in Computer Science. Addison-Wesley Longman Publ. Co., Inc., Reading, MA.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. TOMASIC,A.AND GARCIA-MOLINA, H. 1993. Performance of inverted indices in distributed text document retrieval systems. In Proceedings of the 2nd International Conference on Parallel and Distributed Systems (Dec.). 8-17.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. VILES,C.L.AND FRENCH, J. C. 1995. Dissemination of collection wide information in a distributed information retrieval system. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '95, Seattle, WA, July 9-13), E. A. Fox, P. Ingwersen, and R. Fidel, Eds. ACM Press, New York, NY, 12-20.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. WILLS,C.E.AND MIKHAILOV, M. 1999. Towards a better understanding of web resources and server responses for improved caching. In Proceedings of the Eighth International Conference on The World-Wide Web.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. WITTEN, I., MOFFAT, A., AND BELL, T. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images. 2nd ed. Morgan Kaufmann Publishers Inc., San Francisco, CA.]] Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Searching the Web

          Recommendations

          Reviews

          Jun Lin

          Are you curious about the way Web search engines provide users with a list of URLs after just a few keywords are entered__?__ This article gives an overview on the core engine that makes this possible. The authors start by discussing the challenges of the Internet and the architecture of Web search engines. The major challenges—the size, rapid change, lack of coherence, and interlinked nature of the Internet—introduce the rest of the article. Chapter Two discusses the discovery of information from the Internet, with a detailed explanation of the challenges of crawler models in respect to page selection and page refresh. Chapter Three introduces the storage and distributed Web repository. In chapters Four and Five, the authors present the most popular indexing architectures used in Web search services and ranking and link analysis. The article covers in detail the fundamental components of Web search engines and the most common design and implementation. For each component discussed, a conclusion section is provided to summarize the concepts and give further alert on challenges to be addressed in the future. Theoretical analysis and arguments are supported by the authors’ own experiments in which data and statistics are provided. The article is most suitable for readers who are interested in the design of Web search engines. In addition, readers who want to submit their Web pages to search engines can also be benefit from reading this article to increase the matching-rate and ranking of their Web pages among the search engines. For other readers who search the Internet to find certain information, this article is also a good source to understand the technology involved, though it does not contain direct guidelines towards the most popular Web search engines on the market. Online Computing Reviews Service

          Access critical reviews of Computing literature here

          Become a reviewer for Computing Reviews.

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in

          Full Access

          • Published in

            cover image ACM Transactions on Internet Technology
            ACM Transactions on Internet Technology  Volume 1, Issue 1
            Aug. 2001
            140 pages
            ISSN:1533-5399
            EISSN:1557-6051
            DOI:10.1145/383034
            Issue’s Table of Contents

            Copyright © 2001 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 1 August 2001
            Published in toit Volume 1, Issue 1

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • article

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader