article

Searching the Web

Authors:
Arvind Arasu

Stanford University, Computer Science Dept. Stanford, CA

Stanford University, Computer Science Dept. Stanford, CA
View Profile

,
Junghoo Cho

Stanford University, Computer Science Dept. Stanford, CA

Stanford University, Computer Science Dept. Stanford, CA
View Profile

,
Hector Garcia-Molina

Stanford University, Computer Science Dept. Stanford, CA

Stanford University, Computer Science Dept. Stanford, CA
View Profile

,
Andreas Paepcke

Stanford University, Computer Science Dept. Stanford, CA

Stanford University, Computer Science Dept. Stanford, CA
View Profile

,
Sriram Raghavan

Stanford University, Computer Science Dept. Stanford, CA

Stanford University, Computer Science Dept. Stanford, CA
View Profile

Authors Info & Claims

ACM Transactions on Internet Technology Volume 1 Issue 1pp 2–43https://doi.org/10.1145/383034.383035

Published:01 August 2001Publication History

ACM Transactions on Internet Technology

Abstract

We offer an overview of current Web search engine design. After introducing a generic search engine architecture, we examine each engine component in turn. We cover crawling, local Web page storage, indexing, and the use of link analysis for boosting search performance. The most common design and implementation techniques for each of these components are presented. For this presentation we draw from the literature and from our own experimental search engine testbed. Emphasis is on introducing the fundamental concepts and the results of several performance analyses we conducted to compare different designs.

References

AHO, A., HOPCROFT, J., AND ULLMAN, J. 1983. Data Structures and Algorithms. Addison-Wesley, Reading, MA.]] Google ScholarDigital Library
ALBERT, R., BARABASI, A.-L., AND JEONG, H. 1999. Diameter of the World Wide Web. Nature 401, 6749 (Sept.).]]Google ScholarCross Ref
AMENTO, B., TERVEEN, L., AND HILL, W. 2000. Does authority mean quality? Predicting expert quality ratings of web documents. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY.]] Google ScholarDigital Library
ANH,V.N.AND MOFFAT, A. 1998. Compressed inverted files with reduced decoding overheads. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98, Melbourne, Australia, Aug. 24-28), W. B. Croft, A. Moffat, C. J. van Rijsbergen, R. Wilkinson, and J. Zobel, Chairs. ACM Press, New York, NY, 290-297.]] Google ScholarDigital Library
BAR-YOSSEF, Z., BERG, A., CHIEN, S., AND WEITZ, J. F. D. 2000. Approximating aggregate queries about web pages via random walks. In Proceedings of the 26th International Conference on Very Large Data Bases.]] Google ScholarDigital Library
BARABASI, A.-L. AND ALBERT, R. 1999. Emergence of scaling in random networks. Science 286, 5439 (Oct.), 509-512.]]Google ScholarCross Ref
BHARAT,K.AND BRODER, A. 1999. Mirror, mirror on the web: A study of host pairs with replicated content. In Proceedings of the Eighth International Conference on The World-Wide Web.]] Google ScholarDigital Library
BHARAT, K., BRODER, A., HENZINGER, M., KUMAR, P., AND VENKATASUBRAMANIAN, S. 1998. The connectivity server: fast access to linkage information on the Web. Comput. Netw. ISDN Syst. 30, 1-7, 469-477.]] Google ScholarDigital Library
BRIN, S. 1998. Extracting patterns and relations from the world wide web. In Proceedings of the Sixth International Conference on Extending Database Technology (Valencia, Spain, Mar.), H. -J. Schek, F. Saltor, I. Ramos, and G. Alonso, Eds.]] Google ScholarDigital Library
BRIN,S.AND PAGE, L. 1998. The anatomy of a large-scale hypertextual Web search engine. Comput. Netw. ISDN Syst. 30, 1-7, 107-117.]] Google ScholarDigital Library
BRODER, A., KUMAR, R., MAGHOUL, F., RAGHAVAN, P., RAJAGOPALAN, S., STATA, R., TOMKINS, A., AND WIENER, J. 2000. Graph structure in the web: experiments and models. In Proceedings of the Ninth International Conference on The World Wide Web.]] Google ScholarDigital Library
CHAKRABARTI, S., DOM, B., GIBSON, D., KUMAR,S.R.,RAGHAVAN, P., RAJAGOPALAN, S., AND TOMKINS, A. 1998a. Spectral filtering for resource discovery. In Proceedings of the ACM SIGIR Workshop on Hypertext Information Retrieval on the Web (Melbourne, Australia). ACM Press, New York, NY.]]Google Scholar
CHAKRABARTI, S., DOM, B., AND INDYK, P. 1998b. Enhanced hypertext categorization using hyperlinks. SIGMOD Rec. 27, 2, 307-318.]] Google ScholarDigital Library
CHAKRABARTI, S., DOM, B., RAGHAVAN, P., RAJAGOPALAN, S., GIBSON, D., AND KLEINBERG,J. 1998c. Automatic resource compilation by analyzing hyperlink structure and associated text. In Proceedings of the Seventh International Conference on The World Wide Web (WWW7, Brisbane, Australia, Apr. 14-18), P. H. Enslow and A. Ellis, Eds. Elsevier Sci. Pub. B. V., Amsterdam, The Netherlands, 65-74.]] Google ScholarDigital Library
CHAKRABARTI,S.AND MUTHUKRISHNAN, S. 1996. Resource scheduling for parallel database and scientific applications. In Proceedings of the 8th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '96, Padua, Italy, June 24-26), G. E. Blelloch, Chair. ACM Press, New York, NY, 329-335.]] Google ScholarDigital Library
CHAKRABARTI, S., VAN DEN BERG, M., AND DOM, B. 1999. Focused crawling: A new approach to topic-specific web resource discovery. In Proceedings of the Eighth International Conference on The World-Wide Web.]] Google ScholarDigital Library
CHO,J.AND GARCIA-MOLINA, H. 2000a. Estimating frequency of change. Submitted for publication.]]Google Scholar
CHO,J.AND GARCIA-MOLINA, H. 2000b. The evolution of the web and implications for an incremental crawler. In Proceedings of the 26th International Conference on Very Large Data Bases.]] Google ScholarDigital Library
CHO,J.AND GARCIA-MOLINA, H. 2000c. Synchronizing a database to improve freshness. In Proceedings of the ACM SIGMOD Conference on Management of Data (SIGMOD '2000, Dallas, TX, May). ACM Press, New York, NY.]] Google ScholarDigital Library
CHO, J., GARCIA-MOLINA, H., AND PAGE, L. 1998. Efficient crawling through URL ordering. Comput. Netw. ISDN Syst. 30, 1-7, 161-172.]] Google ScholarDigital Library
COFFMAN,E.J.,LIU, Z., AND WEBER, R. R. 1997. Optimal robot scheduling for web search engines. Tech. Rep. INRIA, Rennes, France.]]Google Scholar
DEAN,J.AND HENZINGER, M. R. 1999. Finding related pages in the world wide web. In Proceedings of the Eighth International Conference on The World-Wide Web.]] Google ScholarDigital Library
DILIGENTI, M., COETZEE,F.M.,LAWRENCE, S., GILES,C.L.,AND GORI, M. 2000. Focused crawling using context graphs. In Proceedings of the 26th International Conference on Very Large Data Bases.]] Google ScholarDigital Library
DOUGLIS, F., FELDMANN, A., AND KRISHNAMURTHY,, B. 1999. Rate of change and other metrics: a live study of the world wide web. In Proceedings of the USENIX Symposium on Internetworking Technologies and Systems. USENIX Assoc., Berkeley, CA.]] Google ScholarDigital Library
DUMAIS,S.T.,FURNAS,G.W.,LANDAUER,T.K.,DEERWESTER, S., AND HARSHMAN, R. 1988. Using latent semantic analysis to improve access to textual information. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI '88, Washington, DC, May 15-19), J. J. O'Hare, Ed. ACM Press, New York, NY, 281-285.]] Google ScholarDigital Library
EGGHE,L.AND ROUSSEAU, R. 1990. Introduction to Informetrics. Elsevier Science Inc., New York, NY.]]Google Scholar
FALOUTSOS, C. 1985. Access methods for text. ACM Comput. Surv. 17, 1 (Mar.), 49-74.]] Google ScholarDigital Library
FALOUTSOS,C.AND CHRISTODOULAKIS, S. 1984. Signature files: An access method for documents and its analytical performance evaluation. ACM Trans. Inf. Syst. 2, 4 (Oct.), 267-288.]] Google ScholarDigital Library
GARFIELD, E. 1972. Citation analysis as a tool in journal evaluation. Science 178, 471-479.]]Google ScholarCross Ref
GIBSON, D., KLEINBERG, J., AND RAGHAVAN, P. 1998. Inferring Web communities from link topology. In Proceedings of the 9th ACM Conference on Hypertext and Hypermedia: Links, Objects, Time and Space-Structure in Hypermedia Systems (HYPERTEXT '98, Pittsburgh, PA, June 20-24), R. Akscyn, Chair. ACM Press, New York, NY, 225-234.]] Google ScholarDigital Library
GOLUB,G.AND VAN LOAN, C. F. 1989. Matrix Computations. 2nd ed. Johns Hopkins University Press, Baltimore, MD.]] Google ScholarDigital Library
HAVELIWALA, T. 1999. Efficient computation of pagerank. Tech. Rep. 1999-31. Computer Systems Laboratory, Stanford University, Stanford, CA. http://dbpubs.stanford.edu/ pub/1999-31.]]Google Scholar
HAWKING, D., CRASWELL, N., AND THISTLEWAITE, P. 1998. Overview of TREC-7 very large collection track. In Proceedings of the 7th Conference on Text Retrieval (TREC-7).]]Google Scholar
HIRAI, J., RAGHAVAN, S., GARCIA-MOLINA, H., AND PAEPCKE, A. 2000. Webbase: A repository of web pages. In Proceedings of the Ninth International Conference on The World Wide Web. 277-293.]] Google ScholarDigital Library
HUBERMAN,B.A.AND ADAMIC, L. A. 1999. Growth dynamics of the world wide web. Nature 401, 6749 (Sept.).]]Google ScholarCross Ref
KLEINBERG, J. 1999. Authoritative sources in a hyperlinked environment. J. ACM 46,6 (Nov.).]] Google ScholarDigital Library
KOSTER, M. 1995. Robots in the web: trick or treat? ConneXions 9, 4 (Apr.).]]Google Scholar
KUMAR, R., RAGHAVAN, P., RAJAGOPALAN, S., AND TOMKINS, A. 1999. Trawling the web for emerging cyber-communities. In Proceedings of the Eighth International Conference on The World-Wide Web.]] Google ScholarDigital Library
LAWRENCE,S.AND GILES, C. 1998. Searching the world wide web. Science 280, 98-100.]]Google ScholarCross Ref
LAWRENCE,S.AND GILES, C. 1999. Accessibility of information on the web. Nature 400, 107-109.]]Google ScholarCross Ref
MANBER,U.AND MYERS, G. 1993. Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22, 5 (Oct.), 935-948.]] Google ScholarDigital Library
MACLEOD,I.A.,MARTIN, P., AND NORDIN, B. 1986. A design of a distributed full text retrieval system. In Proceedings of 1986 ACM Conference on Research and Development in Informa-tion Retrieval (SIGIR '86, Palazzo dei Congressi, Pisa, Italy, Sept. 8-10), F. Rabitti, Ed. ACM Press, New York, NY, 131-137.]] Google ScholarDigital Library
MELNIK, S., RAGHAVAN, S., YANG, B., AND GARCIA-MOLINA, H. 2000. Building a distributed full-text index for the web. Tech. Rep. SIDL-WP-2000-0140, Stanford Digital Library Project. Computer Systems Laboratory, Stanford University, Stanford, CA. http://www-diglib.stanford.edu/cgi-bin/get/SIDL-WP-2000-0140.]]Google Scholar
MELNIK, S., RAGHAVAN, S., YANG, B., AND GARCIA-MOLINA, H. 2001. Building a distributed full-text index for the web. In Proceedings of the Tenth International Conference on The World-Wide Web.]] Google ScholarDigital Library
MOFFAT,A.AND BELL, T. A. H. 1995. In situ generation of compressed inverted files. J. Am. Soc. Inf. Sci. 46, 7 (Aug.), 537-550.]] Google ScholarDigital Library
MOTWANI,R.AND RAGHAVAN, P. 1995. Randomized Algorithms. Cambridge University Press, New York, NY.]] Google ScholarDigital Library
PAGE, L., BRIN, S., MOTWANI, R., AND WINOGRAD, T. 1998. The pagerank citation ranking: Bringing order to the web. Tech. Rep.. Computer Systems Laboratory, Stanford University, Stanford, CA.]]Google Scholar
PINSKI,G.AND NARIN, F. 1976. Citation influence for journal aggregates of scientific publications: Theory, with application to the literature of physics. Inf. Process. Manage. 12.]]Google Scholar
PITKOW,J.AND PIROLLI, P. 1997. Life, death, and lawfulness on the electronic frontier. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI '97, Atlanta, GA, Mar. 22-27), S. Pemberton, Ed. ACM Press, New York, NY, 383-390.]] Google ScholarDigital Library
RIBEIRO-NETO,B.A.AND BARBOSA, R. A. 1998. Query performance for tightly coupled distributed digital libraries. In Proceedings of the Third ACM Conference on Digital Libraries (DL '98, Pittsburgh, PA, June 23-26), I. Witten, R. Akscyn, and F. M. Shipman, Eds. ACM Press, New York, NY, 182-190.]] Google ScholarDigital Library
ROBOTS EXCLUSION PROTOCOL. 2000. Robots Exclusion Protocol. http://info.webcrawler.com/ mak/projects/robots/exclusion.html.]]Google Scholar
SALTON, G., ED. 1988. Automatic Text Processing. Addison-Wesley Series in Computer Science. Addison-Wesley Longman Publ. Co., Inc., Reading, MA.]] Google ScholarDigital Library
TOMASIC,A.AND GARCIA-MOLINA, H. 1993. Performance of inverted indices in distributed text document retrieval systems. In Proceedings of the 2nd International Conference on Parallel and Distributed Systems (Dec.). 8-17.]] Google ScholarDigital Library
VILES,C.L.AND FRENCH, J. C. 1995. Dissemination of collection wide information in a distributed information retrieval system. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '95, Seattle, WA, July 9-13), E. A. Fox, P. Ingwersen, and R. Fidel, Eds. ACM Press, New York, NY, 12-20.]] Google ScholarDigital Library
WILLS,C.E.AND MIKHAILOV, M. 1999. Towards a better understanding of web resources and server responses for improved caching. In Proceedings of the Eighth International Conference on The World-Wide Web.]] Google ScholarDigital Library
WITTEN, I., MOFFAT, A., AND BELL, T. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images. 2nd ed. Morgan Kaufmann Publishers Inc., San Francisco, CA.]] Google ScholarDigital Library

Index Terms

Searching the Web
1. Information systems

Recommendations

Content and link-structure perspective of ranking webpages: A review
Abstract
The delivery of ranked relevant results is probably the most important factor in making a web search engine acceptable to its users. This inspiration has led the search engine engineers and researchers to conceive ranking algorithms ...
Read More
A Googol of Information about Google

Timothy P. Chartier reviews Google's PageRank and Beyond: The Science of Search Engine Rankings by Amy Langville and Carl Meyer.

Read More
Querying the web graph
SPIRE'10: Proceedings of the 17th international conference on String processing and information retrieval

This paper focuses on using hyperlinks in the ranking of web search results. We give a brief overview of the vast body of work in the area; we provide a quantitative comparison of the different features; we sketch how link-based ranking features can be ...
Read More

Reviews

Reviewer: Jun Lin

Are you curious about the way Web search engines provide users with a list of URLs after just a few keywords are entered__?__ This article gives an overview on the core engine that makes this possible. The authors start by discussing the challenges of the Internet and the architecture of Web search engines. The major challenges—the size, rapid change, lack of coherence, and interlinked nature of the Internet—introduce the rest of the article. Chapter Two discusses the discovery of information from the Internet, with a detailed explanation of the challenges of crawler models in respect to page selection and page refresh. Chapter Three introduces the storage and distributed Web repository. In chapters Four and Five, the authors present the most popular indexing architectures used in Web search services and ranking and link analysis. The article covers in detail the fundamental components of Web search engines and the most common design and implementation. For each component discussed, a conclusion section is provided to summarize the concepts and give further alert on challenges to be addressed in the future. Theoretical analysis and arguments are supported by the authors’ own experiments in which data and statistics are provided. The article is most suitable for readers who are interested in the design of Web search engines. In addition, readers who want to submit their Web pages to search engines can also be benefit from reading this article to increase the matching-rate and ranking of their Web pages among the search engines. For other readers who search the Internet to find certain information, this article is also a good source to understand the technology involved, though it does not contain direct guidelines towards the most popular Web search engines on the market. Online Computing Reviews Service

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM Transactions on Internet Technology Volume 1, Issue 1
Aug. 2001
140 pages
ISSN:1533-5399
EISSN:1557-6051
DOI:10.1145/383034
Editor:
Won Kim
Issue’s Table of Contents
Copyright © 2001 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 August 2001
Published in toit Volume 1, Issue 1

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
HITS
PageRank
authorities
crawling
indexing
information retrieval
link analysis
search engine
Qualifiers
- article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 394
  Total Citations
  View Citations
- 9,591
  Total Downloads
- Downloads (Last 12 months)41
- Downloads (Last 6 weeks)5
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Searching the Web

ACM Transactions on Internet Technology

Abstract

References

Cited By

Index Terms

Recommendations

Content and link-structure perspective of ranking webpages: A review

A Googol of Information about Google

Querying the web graph

Reviews

Access critical reviews of Computing literature here

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Searching the Web

ACM Transactions on Internet Technology

Abstract

References

Cited By

Index Terms

Recommendations

Content and link-structure perspective of ranking webpages: A review

A Googol of Information about Google

Querying the web graph

Reviews

Access critical reviews of Computing literature here

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media