Skip to main content

2011 | OriginalPaper | Buchkapitel

2. Scalability Challenges in Web Search Engines

verfasst von : Berkant Barla Cambazoglu, Ricardo Baeza-Yates

Erschienen in: Advanced Topics in Information Retrieval

Verlag: Springer Berlin Heidelberg

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Continuous growth of the Web and user bases forces web search engine companies to make costly investments on very large compute infrastructures. The scalability of these infrastructures requires careful performance optimizations in every major component of the search engine. Herein, we try to provide a fairly comprehensive coverage of the literature on scalability challenges in large-scale web search engines. We present the identified challenges through an architectural classification, starting from a simple single-node search system and moving towards a hypothetical multi-site web search architecture. We also discuss a number of open research problems and provide recommendations to researchers in the field.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Fußnoten
1
The size of the indexed Web (visited on February 1, 2010), http://​www.​worldwidewebsize​.​com/​.
 
Literatur
Zurück zum Zitat Agrawal R, Gollapudi S, Halverson A, Ieong S (2009) Diversifying search results. In: Proceedings of the ACM Conference on Web Search and Data Mining. ACM Press, New York, NY, pp 5–14 CrossRef Agrawal R, Gollapudi S, Halverson A, Ieong S (2009) Diversifying search results. In: Proceedings of the ACM Conference on Web Search and Data Mining. ACM Press, New York, NY, pp 5–14 CrossRef
Zurück zum Zitat Altingovde I, Ozcan R, Ulusoy O (2009) A cost-aware strategy for query result caching in web search engines. In: Boughanem M, Berrut C, Mothe J, Soule-Dupuy C (eds) Proceedings of the European Conference on Information Retrieval. Lecture Notes in Computer Science, vol 5478. Springer, Berlin/Heidelberg, pp 628–636 Altingovde I, Ozcan R, Ulusoy O (2009) A cost-aware strategy for query result caching in web search engines. In: Boughanem M, Berrut C, Mothe J, Soule-Dupuy C (eds) Proceedings of the European Conference on Information Retrieval. Lecture Notes in Computer Science, vol 5478. Springer, Berlin/Heidelberg, pp 628–636
Zurück zum Zitat Anh V, Moffat A (2004) Index compression using fixed binary codewords. In: Proceedings of the Australasian Database Conference. Australian Computer Society and Inc, Darlinghurst, pp 61–67 Anh V, Moffat A (2004) Index compression using fixed binary codewords. In: Proceedings of the Australasian Database Conference. Australian Computer Society and Inc, Darlinghurst, pp 61–67
Zurück zum Zitat Anh V, Moffat A (2006a) Improved word-aligned binary compression for text indexing. IEEE Transactions on Knowledge and Data Engineering 18(6):857–861 CrossRef Anh V, Moffat A (2006a) Improved word-aligned binary compression for text indexing. IEEE Transactions on Knowledge and Data Engineering 18(6):857–861 CrossRef
Zurück zum Zitat Anh V, Moffat A (2006b) Pruned query evaluation using pre-computed impacts. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 372–379 Anh V, Moffat A (2006b) Pruned query evaluation using pre-computed impacts. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 372–379
Zurück zum Zitat Anh V, de Kretser O, Moffat A (2001) Vector-space ranking with effective early termination. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 35–42 Anh V, de Kretser O, Moffat A (2001) Vector-space ranking with effective early termination. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 35–42
Zurück zum Zitat Arasu A, Cho J, Garcia-Molina H, Paepcke A, Raghavan S (2001) Searching the Web. ACM Transactions on Internet Technology 1(1):2–43 CrossRef Arasu A, Cho J, Garcia-Molina H, Paepcke A, Raghavan S (2001) Searching the Web. ACM Transactions on Internet Technology 1(1):2–43 CrossRef
Zurück zum Zitat Badue C, Baeza-Yates R, Ribeiro-Neto BA, Ziviani N (2001) Distributed query processing using partitioned inverted files. In: Proceedings of the International Symposium on String Processing and Information Retrieval, pp 10–20 CrossRef Badue C, Baeza-Yates R, Ribeiro-Neto BA, Ziviani N (2001) Distributed query processing using partitioned inverted files. In: Proceedings of the International Symposium on String Processing and Information Retrieval, pp 10–20 CrossRef
Zurück zum Zitat Badue C, Baeza-Yates R, Ribeiro-Neto BA, Ziviani A, Ziviani N (2007) Analyzing imbalance among homogeneous index servers in a web search system. Information Processing and Management 43(3):592–608 CrossRef Badue C, Baeza-Yates R, Ribeiro-Neto BA, Ziviani A, Ziviani N (2007) Analyzing imbalance among homogeneous index servers in a web search system. Information Processing and Management 43(3):592–608 CrossRef
Zurück zum Zitat Baeza-Yates R, Ribeiro-Neto B (2010) Modern Information Retrieval, 2nd edn. Addison-Wesley, Reading, MA Baeza-Yates R, Ribeiro-Neto B (2010) Modern Information Retrieval, 2nd edn. Addison-Wesley, Reading, MA
Zurück zum Zitat Baeza-Yates R, Saint-Jean F (2003) A three level search engine index based in query log distribution. In: Nascimento M, de Moura E, Oliveira A (eds) Proceedings of the International Symposium on String Processing Information Retrieval. Lecture Notes in Computer Science, vol 2857. Springer, Berlin/Heidelberg, pp 56–65 CrossRef Baeza-Yates R, Saint-Jean F (2003) A three level search engine index based in query log distribution. In: Nascimento M, de Moura E, Oliveira A (eds) Proceedings of the International Symposium on String Processing Information Retrieval. Lecture Notes in Computer Science, vol 2857. Springer, Berlin/Heidelberg, pp 56–65 CrossRef
Zurück zum Zitat Baeza-Yates R, Junqueira F, Plachouras V, Witschel H (2007a) Admission policies for caches of search engine results. In: Ziviani N, Baeza-Yates R (eds) Proceedings of the International Symposium on String Processing Information Retrieval. Lecture Notes in Computer Science, vol 4726. Springer, Berlin/Heidelberg, pp 74–85 CrossRef Baeza-Yates R, Junqueira F, Plachouras V, Witschel H (2007a) Admission policies for caches of search engine results. In: Ziviani N, Baeza-Yates R (eds) Proceedings of the International Symposium on String Processing Information Retrieval. Lecture Notes in Computer Science, vol 4726. Springer, Berlin/Heidelberg, pp 74–85 CrossRef
Zurück zum Zitat Baeza-Yates R, Gionis A, Junqueira F, Murdock V, Plachouras V, Silvestri F (2007b) The impact of caching on search engines. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 183–190 Baeza-Yates R, Gionis A, Junqueira F, Murdock V, Plachouras V, Silvestri F (2007b) The impact of caching on search engines. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 183–190
Zurück zum Zitat Baeza-Yates R, Castillo C, Junqueira F, Plachouras V, Silvestri F (2007c) Challenges in distributed information retrieval. In: Proceedings of the International Conference on Data Engineering. IEEE CS, New York, NY, pp 6–20 Baeza-Yates R, Castillo C, Junqueira F, Plachouras V, Silvestri F (2007c) Challenges in distributed information retrieval. In: Proceedings of the International Conference on Data Engineering. IEEE CS, New York, NY, pp 6–20
Zurück zum Zitat Baeza-Yates R, Gionis A, Junqueira F, Plachouras V, Telloli L (2009a) On the feasibility of multi-site web search engines. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 425–434 CrossRef Baeza-Yates R, Gionis A, Junqueira F, Plachouras V, Telloli L (2009a) On the feasibility of multi-site web search engines. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 425–434 CrossRef
Zurück zum Zitat Baeza-Yates R, Murdock V, Hauff C (2009b) Efficiency trade-offs in two-tier web search systems. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 163–170 Baeza-Yates R, Murdock V, Hauff C (2009b) Efficiency trade-offs in two-tier web search systems. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 163–170
Zurück zum Zitat Barroso L, Hölzle U (2009) The Datacenter as a Computer. Synthesis Lectures on Computer Architecture. Morgan & Claypool Barroso L, Hölzle U (2009) The Datacenter as a Computer. Synthesis Lectures on Computer Architecture. Morgan & Claypool
Zurück zum Zitat Barroso L, Dean J, Hölzle U (2003) Web search for a planet: The Google cluster architecture. IEEE Micro 23(2):22–28 CrossRef Barroso L, Dean J, Hölzle U (2003) Web search for a planet: The Google cluster architecture. IEEE Micro 23(2):22–28 CrossRef
Zurück zum Zitat Bharat K, Broder AZ (1999) Mirror, mirror on the Web: A study of host pairs with replicated content. In: Proceedings of the International Conference on the World Wide Web. Elsevier/North-Holland, New York, NY, pp 1579–1590 Bharat K, Broder AZ (1999) Mirror, mirror on the Web: A study of host pairs with replicated content. In: Proceedings of the International Conference on the World Wide Web. Elsevier/North-Holland, New York, NY, pp 1579–1590
Zurück zum Zitat Bharat K, Broder A, Dean J, Henzinger M (2000) A comparison of techniques to find mirrored hosts on the WWW. Journal of the American Society for Information Science 51(12):1114–1122 CrossRef Bharat K, Broder A, Dean J, Henzinger M (2000) A comparison of techniques to find mirrored hosts on the WWW. Journal of the American Society for Information Science 51(12):1114–1122 CrossRef
Zurück zum Zitat Blanco R, Barreiro A (2006) TSP and cluster-based solutions to the reassignment of document identifiers. Journal of Information Retrieval 9(4):499–517 CrossRef Blanco R, Barreiro A (2006) TSP and cluster-based solutions to the reassignment of document identifiers. Journal of Information Retrieval 9(4):499–517 CrossRef
Zurück zum Zitat Blanco R, Bortnikov E, Junqueira F, Lempel R, Telloli L, Zaragoza H (2010) Caching search engine results over incremental indices. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 82–89 Blanco R, Bortnikov E, Junqueira F, Lempel R, Telloli L, Zaragoza H (2010) Caching search engine results over incremental indices. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 82–89
Zurück zum Zitat Blandford D, Blelloch G (2002) Index compression through document reordering. In: Proceedings of the Data Compression Conference. IEEE Computer Society, Washington, DC, pp 342–351 Blandford D, Blelloch G (2002) Index compression through document reordering. In: Proceedings of the Data Compression Conference. IEEE Computer Society, Washington, DC, pp 342–351
Zurück zum Zitat Boldi P, Codenotti B, Santini M, Vigna S (2004) UbiCrawler: a scalable fully distributed web crawler. Software: Practice and Experience 34(8):711–726 CrossRef Boldi P, Codenotti B, Santini M, Vigna S (2004) UbiCrawler: a scalable fully distributed web crawler. Software: Practice and Experience 34(8):711–726 CrossRef
Zurück zum Zitat Boldi P, Bonchi F, Castillo C, Donato D, Gionis A, Vigna S (2008) The query-flow graph: Model and applications. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 609–618 Boldi P, Bonchi F, Castillo C, Donato D, Gionis A, Vigna S (2008) The query-flow graph: Model and applications. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 609–618
Zurück zum Zitat Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems 30(1–7):107–117 CrossRef Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems 30(1–7):107–117 CrossRef
Zurück zum Zitat Broder A, Glassman S, Manasse M, Zweig G (1997) Syntactic clustering of the Web. Computer Networks and ISDN Systems 29:1157–1166 CrossRef Broder A, Glassman S, Manasse M, Zweig G (1997) Syntactic clustering of the Web. Computer Networks and ISDN Systems 29:1157–1166 CrossRef
Zurück zum Zitat Broder A, Carmel D, Herscovici M, Soffer A, Zien J (2003a) Efficient query evaluation using a two-level retrieval process. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 426–434 Broder A, Carmel D, Herscovici M, Soffer A, Zien J (2003a) Efficient query evaluation using a two-level retrieval process. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 426–434
Zurück zum Zitat Broder A, Najork M, Wiener J (2003b) Efficient URL caching for World Wide Web crawling. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 679–689 Broder A, Najork M, Wiener J (2003b) Efficient URL caching for World Wide Web crawling. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 679–689
Zurück zum Zitat Brown E (1995) Fast evaluation of structured queries for information retrieval. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 30–38 Brown E (1995) Fast evaluation of structured queries for information retrieval. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 30–38
Zurück zum Zitat Buckley C, Lewit A (1985) Optimization of inverted vector searches. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 97–110 Buckley C, Lewit A (1985) Optimization of inverted vector searches. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 97–110
Zurück zum Zitat Büttcher S, Clarke C (2005) Indexing time vs query time: trade-offs in dynamic information retrieval systems. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 317–318 Büttcher S, Clarke C (2005) Indexing time vs query time: trade-offs in dynamic information retrieval systems. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 317–318
Zurück zum Zitat Büttcher S, Clarke C, Lushman B (2006a) Hybrid index maintenance for growing text collections. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 356–363 Büttcher S, Clarke C, Lushman B (2006a) Hybrid index maintenance for growing text collections. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 356–363
Zurück zum Zitat Büttcher S, Clarke C, Lushman B (2006b) Term proximity scoring for ad-hoc retrieval on very large text collections. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 621–622 Büttcher S, Clarke C, Lushman B (2006b) Term proximity scoring for ad-hoc retrieval on very large text collections. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 621–622
Zurück zum Zitat Cacheda F, Carneiro V, Plachouras V, Ounis I (2007) Performance analysis of distributed information retrieval architectures using an improved network simulation model. Information Processing and Management 43(1):204–224 CrossRef Cacheda F, Carneiro V, Plachouras V, Ounis I (2007) Performance analysis of distributed information retrieval architectures using an improved network simulation model. Information Processing and Management 43(1):204–224 CrossRef
Zurück zum Zitat Cahoon B, McKinley K, Lu Z (2000) Evaluating the performance of distributed architectures for information retrieval using a variety of workloads. ACM Transactions on Information Systems 18(1):1–43 CrossRef Cahoon B, McKinley K, Lu Z (2000) Evaluating the performance of distributed architectures for information retrieval using a variety of workloads. ACM Transactions on Information Systems 18(1):1–43 CrossRef
Zurück zum Zitat Callan J, Lu Z, Croft W (1995b) Searching distributed collections with inference networks. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 21–28 Callan J, Lu Z, Croft W (1995b) Searching distributed collections with inference networks. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 21–28
Zurück zum Zitat Cambazoglu B, Aykanat C (2006) Performance of query processing implementations in ranking-based text retrieval systems using inverted indices. Information Processing and Management 42(4):875–898 CrossRef Cambazoglu B, Aykanat C (2006) Performance of query processing implementations in ranking-based text retrieval systems using inverted indices. Information Processing and Management 42(4):875–898 CrossRef
Zurück zum Zitat Cambazoglu B, Turk A, Aykanat C (2004) Data-parallel web crawling models. In: Proceedings of the Symposium on Computer and Information Sciences. Lecture Notes in Computer Science. Springer, Berlin/Heidelberg, pp 801–809 Cambazoglu B, Turk A, Aykanat C (2004) Data-parallel web crawling models. In: Proceedings of the Symposium on Computer and Information Sciences. Lecture Notes in Computer Science. Springer, Berlin/Heidelberg, pp 801–809
Zurück zum Zitat Cambazoglu B, Plachouras V, Junqueira F, Telloli L (2008) On the feasibility of geographically distributed web crawling. In: Proceedings of the International Conference on Scalable Information Systems. ICST (Institute for Computer Sciences and Social-Informatics and Telecommunications Engineering), ICST, Brussels, pp 1–10 Cambazoglu B, Plachouras V, Junqueira F, Telloli L (2008) On the feasibility of geographically distributed web crawling. In: Proceedings of the International Conference on Scalable Information Systems. ICST (Institute for Computer Sciences and Social-Informatics and Telecommunications Engineering), ICST, Brussels, pp 1–10
Zurück zum Zitat Cambazoglu B, Plachouras V, Baeza-Yates R (2009) Quantifying performance and quality gains in distributed web search engines. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 411–418 Cambazoglu B, Plachouras V, Baeza-Yates R (2009) Quantifying performance and quality gains in distributed web search engines. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 411–418
Zurück zum Zitat Cambazoglu B, Zaragoza H, Chapelle O, Chen J, Liao C, Zheng Z, Degenhardt J (2010a) Early exit optimizations for additive machine learned ranking systems. In: Proceedings of the ACM Conference on Web Search and Data Mining. ACM Press, New York, NY, pp 411–420 CrossRef Cambazoglu B, Zaragoza H, Chapelle O, Chen J, Liao C, Zheng Z, Degenhardt J (2010a) Early exit optimizations for additive machine learned ranking systems. In: Proceedings of the ACM Conference on Web Search and Data Mining. ACM Press, New York, NY, pp 411–420 CrossRef
Zurück zum Zitat Cambazoglu B, Varol E, Kayaaslan E, Aykanat C, Baeza-Yates R (2010b) Query forwarding in geographically distributed search engines. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 90–97 Cambazoglu B, Varol E, Kayaaslan E, Aykanat C, Baeza-Yates R (2010b) Query forwarding in geographically distributed search engines. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 90–97
Zurück zum Zitat Cambazoglu B, Junqueira F, Plachouras V, Banachowski S, Cui B, Lim S, Bridge B (2010c) A refreshing perspective of search engine caching. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 181–190 CrossRef Cambazoglu B, Junqueira F, Plachouras V, Banachowski S, Cui B, Lim S, Bridge B (2010c) A refreshing perspective of search engine caching. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 181–190 CrossRef
Zurück zum Zitat Carmel D, Cohen D, Fagin R, Farchi E, Herscovici M, Maarek Y, Soffer A (2001) Static index pruning for information retrieval systems. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 43–50 Carmel D, Cohen D, Fagin R, Farchi E, Herscovici M, Maarek Y, Soffer A (2001) Static index pruning for information retrieval systems. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 43–50
Zurück zum Zitat Castillo C (2003) Cooperation schemes between a web server and a web search engine. In: Proceedings of the Latin American Conference on World Wide Web. IEEE CS, New York, NY, pp 212–213 Castillo C (2003) Cooperation schemes between a web server and a web search engine. In: Proceedings of the Latin American Conference on World Wide Web. IEEE CS, New York, NY, pp 212–213
Zurück zum Zitat Chakrabarti S, van den Berg M, Dom B (1999) Focused crawling: A new approach to topic-specific web resource discovery. Computer Networks and ISDN Systems 31(11–16):1623–1640 Chakrabarti S, van den Berg M, Dom B (1999) Focused crawling: A new approach to topic-specific web resource discovery. Computer Networks and ISDN Systems 31(11–16):1623–1640
Zurück zum Zitat Cho J, Garcia-Molina H (2000) The evolution of the Web and implications for an incremental crawler. In: Proceedings of the International Conference on Very Large Data Bases. Morgan Kaufmann, San Francisco, CA, pp 200–209 Cho J, Garcia-Molina H (2000) The evolution of the Web and implications for an incremental crawler. In: Proceedings of the International Conference on Very Large Data Bases. Morgan Kaufmann, San Francisco, CA, pp 200–209
Zurück zum Zitat Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 124–135 Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 124–135
Zurück zum Zitat Cho J, Garcia-Molina H (2003) Effective page refresh policies for web crawlers. ACM Transactions on Database Systems 28(4):390–426 CrossRef Cho J, Garcia-Molina H (2003) Effective page refresh policies for web crawlers. ACM Transactions on Database Systems 28(4):390–426 CrossRef
Zurück zum Zitat Cho J, Garcia-Molina H, Page L (1998) Efficient crawling through URL ordering. Computer Networks and ISDN Systems 30(1–7):161–172 CrossRef Cho J, Garcia-Molina H, Page L (1998) Efficient crawling through URL ordering. Computer Networks and ISDN Systems 30(1–7):161–172 CrossRef
Zurück zum Zitat Cho J, Shivakumar N, Garcia-Molina H (2000) Finding replicated web collections. ACM SIGMOD Record 29(2):355–366 CrossRef Cho J, Shivakumar N, Garcia-Molina H (2000) Finding replicated web collections. ACM SIGMOD Record 29(2):355–366 CrossRef
Zurück zum Zitat Chowdhury A, Pass G (2003) Operational requirements for scalable search systems. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 435–442 Chowdhury A, Pass G (2003) Operational requirements for scalable search systems. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 435–442
Zurück zum Zitat Chowdhury A, Frieder O, Grossman D, McCabe M (2002) Collection statistics for fast duplicate document detection. ACM Transactions on Information Systems 20(2):171–191 CrossRef Chowdhury A, Frieder O, Grossman D, McCabe M (2002) Collection statistics for fast duplicate document detection. ACM Transactions on Information Systems 20(2):171–191 CrossRef
Zurück zum Zitat Chung C, Clarke CA (2002) Topic-oriented collaborative crawling. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 34–42 Chung C, Clarke CA (2002) Topic-oriented collaborative crawling. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 34–42
Zurück zum Zitat Clarke CA, Cormack G, Burkowski F (1994) Fast inverted indexes with on-line update. Tech Rep CS-94-40, University of Waterloo Clarke CA, Cormack G, Burkowski F (1994) Fast inverted indexes with on-line update. Tech Rep CS-94-40, University of Waterloo
Zurück zum Zitat Clarke CA, Agichtein E, Dumais S, White R (2007) The influence of caption features on clickthrough patterns in web search. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 135–142 Clarke CA, Agichtein E, Dumais S, White R (2007) The influence of caption features on clickthrough patterns in web search. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 135–142
Zurück zum Zitat Cooper J, Coden A, Brown E (2002) Detecting similar documents using salient terms. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 245–251 Cooper J, Coden A, Brown E (2002) Detecting similar documents using salient terms. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 245–251
Zurück zum Zitat Cutting D, Pedersen J (1990) Optimization for dynamic inverted index maintenance. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 405–411 Cutting D, Pedersen J (1990) Optimization for dynamic inverted index maintenance. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 405–411
Zurück zum Zitat Dasgupta A, Ghosh A, Kumar R, Olston C, Pandey S, Tomkins A (2007) The discoverability of the Web. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 421–430 CrossRef Dasgupta A, Ghosh A, Kumar R, Olston C, Pandey S, Tomkins A (2007) The discoverability of the Web. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 421–430 CrossRef
Zurück zum Zitat de Kretser O, Moffat A, Shimmin T, Zobel J (1998) Methodologies for distributed information retrieval. In: Proceedings of the International Conference on Distributed Computing Systems. IEEE Computer Society, Washington, DC, p 66 de Kretser O, Moffat A, Shimmin T, Zobel J (1998) Methodologies for distributed information retrieval. In: Proceedings of the International Conference on Distributed Computing Systems. IEEE Computer Society, Washington, DC, p 66
Zurück zum Zitat Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Communications of the ACM 51(1)):107–113 CrossRef Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Communications of the ACM 51(1)):107–113 CrossRef
Zurück zum Zitat Diligenti M, Coetzee F, Lawrence S, Giles C, Gori M (2000) Focused crawling using context graphs. In: Proceedings of the International Conference on Very Large Data Bases. Morgan Kaufmann, San Francisco, CA, pp 527–534 Diligenti M, Coetzee F, Lawrence S, Giles C, Gori M (2000) Focused crawling using context graphs. In: Proceedings of the International Conference on Very Large Data Bases. Morgan Kaufmann, San Francisco, CA, pp 527–534
Zurück zum Zitat Ding S, Attenberg J, Suel T (2010) Scalable techniques for document identifier assignment in inverted indexes. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 311–320 CrossRef Ding S, Attenberg J, Suel T (2010) Scalable techniques for document identifier assignment in inverted indexes. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 311–320 CrossRef
Zurück zum Zitat D’Souza D, Thom J, Zobel J (2004) Collection selection for managed distributed document databases. Information Processing and Management 40(3):527–546 CrossRef D’Souza D, Thom J, Zobel J (2004) Collection selection for managed distributed document databases. Information Processing and Management 40(3):527–546 CrossRef
Zurück zum Zitat Edwards J, McCurley K, Tomlin J (2001) An adaptive model for optimizing performance of an incremental web crawler. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 106–113 Edwards J, McCurley K, Tomlin J (2001) An adaptive model for optimizing performance of an incremental web crawler. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 106–113
Zurück zum Zitat Eichmann D (1995) Ethical web agents. Computer Networks and ISDN Systems 28(1–2):127–136 CrossRef Eichmann D (1995) Ethical web agents. Computer Networks and ISDN Systems 28(1–2):127–136 CrossRef
Zurück zum Zitat Exposto J, Macedo J, Pina A, Alves A, Rufino J (2005) Geographical partition for distributed web crawling. In: Proceedings of the Workshop on Geographic Information Retrieval. ACM Press, New York, NY, pp 55–60 CrossRef Exposto J, Macedo J, Pina A, Alves A, Rufino J (2005) Geographical partition for distributed web crawling. In: Proceedings of the Workshop on Geographic Information Retrieval. ACM Press, New York, NY, pp 55–60 CrossRef
Zurück zum Zitat Exposto J, Macedo J, Pina A, Alves A, Rufino J (2008) Efficient partitioning strategies for distributed web crawling. In: Proceedings of the International Conference on Information Networking: Towards Ubiquitous Networking and Services. Lecture Notes in Computer Science. Springer, Berlin/Heidelberg, pp 544–553 CrossRef Exposto J, Macedo J, Pina A, Alves A, Rufino J (2008) Efficient partitioning strategies for distributed web crawling. In: Proceedings of the International Conference on Information Networking: Towards Ubiquitous Networking and Services. Lecture Notes in Computer Science. Springer, Berlin/Heidelberg, pp 544–553 CrossRef
Zurück zum Zitat Fagni T, Perego R, Silvestri F, Orlando S (2006) Boosting the performance of web search engines: Caching and prefetching query results by exploiting historical usage data. ACM Transactions on Information Systems 24(1):51–78 CrossRef Fagni T, Perego R, Silvestri F, Orlando S (2006) Boosting the performance of web search engines: Caching and prefetching query results by exploiting historical usage data. ACM Transactions on Information Systems 24(1):51–78 CrossRef
Zurück zum Zitat Fetterly D, Manasse M, Najork M, Wiener J (2004) A large-scale study of the evolution of web pages. Software: Practice and Experience 34(2):213–237 CrossRef Fetterly D, Manasse M, Najork M, Wiener J (2004) A large-scale study of the evolution of web pages. Software: Practice and Experience 34(2):213–237 CrossRef
Zurück zum Zitat Fetterly D, Craswell N, Vinay V (2009) The impact of crawl policy on web search effectiveness. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 580–587 Fetterly D, Craswell N, Vinay V (2009) The impact of crawl policy on web search effectiveness. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 580–587
Zurück zum Zitat Fox E, Lee W (1991) FAST-INV: A fast algorithm for building large inverted files. Tech Rep 91–10, Virginia Polytechnic Institute and State University Fox E, Lee W (1991) FAST-INV: A fast algorithm for building large inverted files. Tech Rep 91–10, Virginia Polytechnic Institute and State University
Zurück zum Zitat Gan Q, Suel T (2009) Improved techniques for result caching in web search engines. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 431–440 Gan Q, Suel T (2009) Improved techniques for result caching in web search engines. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 431–440
Zurück zum Zitat Gao W, Lee H, Miao Y (2006) Geographically focused collaborative crawling. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 287–296 CrossRef Gao W, Lee H, Miao Y (2006) Geographically focused collaborative crawling. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 287–296 CrossRef
Zurück zum Zitat Gravano L, Garcia-Molina H (1995) Generalizing GlOSS to vector-space databases and broker hierarchies. In: Proceedings of the International Conference on Very Large Data Bases. Morgan Kaufmann, San Francisco, CA, pp 78–89 Gravano L, Garcia-Molina H (1995) Generalizing GlOSS to vector-space databases and broker hierarchies. In: Proceedings of the International Conference on Very Large Data Bases. Morgan Kaufmann, San Francisco, CA, pp 78–89
Zurück zum Zitat Gyöngyi Z, Garcia-Molina H (2005a) Link spam alliances. In: Proceedings of the International Conference on Very Large Data Bases. VLDB Endowment, pp 517–528 Gyöngyi Z, Garcia-Molina H (2005a) Link spam alliances. In: Proceedings of the International Conference on Very Large Data Bases. VLDB Endowment, pp 517–528
Zurück zum Zitat Gyöngyi Z, Garcia-Molina H, Pedersen J (2004) Combating web spam with TrustRank. In: Proceedings of the International Conference on Very Large Data Bases. VLDB Endowment, pp 576–587 Gyöngyi Z, Garcia-Molina H, Pedersen J (2004) Combating web spam with TrustRank. In: Proceedings of the International Conference on Very Large Data Bases. VLDB Endowment, pp 576–587
Zurück zum Zitat Harman D, Candela G (1990) Retrieving records from a gigabyte of text on a mini-computer using statistical ranking. Journal of the American Society for Information Science 41(8):581–589 CrossRef Harman D, Candela G (1990) Retrieving records from a gigabyte of text on a mini-computer using statistical ranking. Journal of the American Society for Information Science 41(8):581–589 CrossRef
Zurück zum Zitat Harman D, Baeza-Yates R, Fox E, Lee W (1992) Inverted files. In: Baeza-Yates WBFR (ed) Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Upper Saddle River, NJ, pp 28–43 Harman D, Baeza-Yates R, Fox E, Lee W (1992) Inverted files. In: Baeza-Yates WBFR (ed) Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Upper Saddle River, NJ, pp 28–43
Zurück zum Zitat Hawking D (1997) Scalable text retrieval for large digital libraries. In: Proceedings of the European Conference on Digital Libraries. Springer, London, pp 127–145 Hawking D (1997) Scalable text retrieval for large digital libraries. In: Proceedings of the European Conference on Digital Libraries. Springer, London, pp 127–145
Zurück zum Zitat Heinz S, Zobel J (2003) Efficient single-pass index construction for text databases. Journal of the American Society for Information Science 54(8):713–729 Heinz S, Zobel J (2003) Efficient single-pass index construction for text databases. Journal of the American Society for Information Science 54(8):713–729
Zurück zum Zitat Henzinger M (2006) Finding near-duplicate web pages: A large-scale evaluation of algorithms. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 284–291 Henzinger M (2006) Finding near-duplicate web pages: A large-scale evaluation of algorithms. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 284–291
Zurück zum Zitat Heydon A, Najork M (1999) Mercator: a scalable, extensible web crawler. World Wide Web 2(4):219–229 CrossRef Heydon A, Najork M (1999) Mercator: a scalable, extensible web crawler. World Wide Web 2(4):219–229 CrossRef
Zurück zum Zitat Hirai J, Raghavan S, Garcia-Molina H, Paepcke A (2000) WebBase: a repository of web pages. In: Proceedings of the International Conference on the World Wide Web. North-Holland, Amsterdam, pp 277–293 Hirai J, Raghavan S, Garcia-Molina H, Paepcke A (2000) WebBase: a repository of web pages. In: Proceedings of the International Conference on the World Wide Web. North-Holland, Amsterdam, pp 277–293
Zurück zum Zitat Jeh G, Widom J (2003) Scaling personalized web search. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 271–279 Jeh G, Widom J (2003) Scaling personalized web search. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 271–279
Zurück zum Zitat Jeong BS, Omiecinski E (1995) Inverted file partitioning schemes in multiple disk systems. IEEE Transactions on Parallel and Distributed Systems 6(2):142–153 CrossRef Jeong BS, Omiecinski E (1995) Inverted file partitioning schemes in multiple disk systems. IEEE Transactions on Parallel and Distributed Systems 6(2):142–153 CrossRef
Zurück zum Zitat Joachims T (2002) Optimizing search engines using clickthrough data. In: Proceedings of the ACM Conference on Knowledge Discovery and Data Mining. ACM Press, New York, NY, pp 133–142 Joachims T (2002) Optimizing search engines using clickthrough data. In: Proceedings of the ACM Conference on Knowledge Discovery and Data Mining. ACM Press, New York, NY, pp 133–142
Zurück zum Zitat Jónsson B, Franklin M, Srivastava D (1998) Interaction of query evaluation and buffer management for information retrieval. ACM SIGMOD Record 27(2):118–129 CrossRef Jónsson B, Franklin M, Srivastava D (1998) Interaction of query evaluation and buffer management for information retrieval. ACM SIGMOD Record 27(2):118–129 CrossRef
Zurück zum Zitat Kayaaslan E, Cambazoglu B, Aykanat C (2010) Document replication strategies for geographically distributed Web search engines. To be submitted Kayaaslan E, Cambazoglu B, Aykanat C (2010) Document replication strategies for geographically distributed Web search engines. To be submitted
Zurück zum Zitat Larkey L, Connell M, Callan J (2000) Collection selection and results merging with topically organized US patents and TREC data. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 282–289 Larkey L, Connell M, Callan J (2000) Collection selection and results merging with topically organized US patents and TREC data. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 282–289
Zurück zum Zitat Lawrence S, Giles C (2000) Accessibility of information on the Web. Intelligence 11(1):32–39 CrossRef Lawrence S, Giles C (2000) Accessibility of information on the Web. Intelligence 11(1):32–39 CrossRef
Zurück zum Zitat Lee HT, Leonard D, Wang X, Loguinov D (2008) IRLbot: Scaling to 6 billion pages and beyond. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 427–436 Lee HT, Leonard D, Wang X, Loguinov D (2008) IRLbot: Scaling to 6 billion pages and beyond. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 427–436
Zurück zum Zitat Lempel R, Moran S (2003) Predictive caching and prefetching of query results in search engines. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 19–28 Lempel R, Moran S (2003) Predictive caching and prefetching of query results in search engines. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 19–28
Zurück zum Zitat Lester N, Zobel J, Williams H (2004) In-place versus re-build versus re-merge: Index maintenance strategies for text retrieval systems. In: Proceedings of the Australasian Database Conference. Australian Computer Society, Darlinghurst, pp 15–23 Lester N, Zobel J, Williams H (2004) In-place versus re-build versus re-merge: Index maintenance strategies for text retrieval systems. In: Proceedings of the Australasian Database Conference. Australian Computer Society, Darlinghurst, pp 15–23
Zurück zum Zitat Lester N, Moffat A, Zobel J (2008) Efficient online index construction for text databases. ACM Transactions on Database Systems 33(3):1–33 CrossRef Lester N, Moffat A, Zobel J (2008) Efficient online index construction for text databases. ACM Transactions on Database Systems 33(3):1–33 CrossRef
Zurück zum Zitat Lewandowskii D (2008) A three-year study on the freshness of web search engine databases. Journal of Information Science 34(6):817–831 CrossRef Lewandowskii D (2008) A three-year study on the freshness of web search engine databases. Journal of Information Science 34(6):817–831 CrossRef
Zurück zum Zitat Liu X, Croft W (2004) Cluster-based retrieval using language models. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 186–193 Liu X, Croft W (2004) Cluster-based retrieval using language models. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 186–193
Zurück zum Zitat Liu F, Yu C, Meng W (2002) Personalized web search by mapping user queries to categories. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 558–565 Liu F, Yu C, Meng W (2002) Personalized web search by mapping user queries to categories. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 558–565
Zurück zum Zitat Long X, Suel T (2005) Three-level caching for efficient query processing in large web search engines. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 257–266 CrossRef Long X, Suel T (2005) Three-level caching for efficient query processing in large web search engines. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 257–266 CrossRef
Zurück zum Zitat Lu Z, McKinley K (1999) Partial replica selection based on relevance for information retrieval. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 97–104 Lu Z, McKinley K (1999) Partial replica selection based on relevance for information retrieval. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 97–104
Zurück zum Zitat Lu Z, McKinley K (2000) Partial collection replication versus caching for information retrieval systems. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 248–255 Lu Z, McKinley K (2000) Partial collection replication versus caching for information retrieval systems. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 248–255
Zurück zum Zitat Lucchese C, Orlando S, Perego R, Silvestri F (2007) Mining query logs to optimize index partitioning in parallel web search engines. In: Proceedings of the International Conference on Scalable Information Systems. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), Brussels, Belgium, pp 1–9 Lucchese C, Orlando S, Perego R, Silvestri F (2007) Mining query logs to optimize index partitioning in parallel web search engines. In: Proceedings of the International Conference on Scalable Information Systems. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), Brussels, Belgium, pp 1–9
Zurück zum Zitat MacFarlane A, McCann J, Robertson S (2000) Parallel search using partitioned inverted files. In: Proceedings of the International Symposium on String Processing Information Retrieval. IEEE Computer Society, Washington, DC, pp 209–220 CrossRef MacFarlane A, McCann J, Robertson S (2000) Parallel search using partitioned inverted files. In: Proceedings of the International Symposium on String Processing Information Retrieval. IEEE Computer Society, Washington, DC, pp 209–220 CrossRef
Zurück zum Zitat Markatos E (2001) On caching search engine query results. Computer Communications 24(2):137–143 CrossRef Markatos E (2001) On caching search engine query results. Computer Communications 24(2):137–143 CrossRef
Zurück zum Zitat Melnik S, Raghavan S, Yang B, Garcia-Molina H (2001) Building a distributed full-text index for the Web. ACM Transactions on Information Systems 19(3):217–241 CrossRef Melnik S, Raghavan S, Yang B, Garcia-Molina H (2001) Building a distributed full-text index for the Web. ACM Transactions on Information Systems 19(3):217–241 CrossRef
Zurück zum Zitat Moffat A, Bell TH (1995) In situ generation of compressed inverted files. Journal of the American Society for Information Science 46(7):537–550 CrossRef Moffat A, Bell TH (1995) In situ generation of compressed inverted files. Journal of the American Society for Information Science 46(7):537–550 CrossRef
Zurück zum Zitat Moffat A, Stuiver L (2000) Binary interpolative coding for effective index compression. Journal of Information Retrieval 3(1):25–47 CrossRef Moffat A, Stuiver L (2000) Binary interpolative coding for effective index compression. Journal of Information Retrieval 3(1):25–47 CrossRef
Zurück zum Zitat Moffat A, Zobel J (1996) Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems 14(4):349–379 CrossRef Moffat A, Zobel J (1996) Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems 14(4):349–379 CrossRef
Zurück zum Zitat Moffat A, Webber W, Zobel J, Baeza-Yates R (2007) A pipelined architecture for distributed text query evaluation. Journal of Information Retrieval 10(3):205–231 CrossRef Moffat A, Webber W, Zobel J, Baeza-Yates R (2007) A pipelined architecture for distributed text query evaluation. Journal of Information Retrieval 10(3):205–231 CrossRef
Zurück zum Zitat Najork M, Wiener J (2001) Breadth-first crawling yields high-quality pages. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 114–118 Najork M, Wiener J (2001) Breadth-first crawling yields high-quality pages. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 114–118
Zurück zum Zitat Ntoulas A, Cho J (2007) Pruning policies for two-tiered inverted index with correctness guarantee. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 191–198 Ntoulas A, Cho J (2007) Pruning policies for two-tiered inverted index with correctness guarantee. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 191–198
Zurück zum Zitat Ntoulas A, Cho J, Olston C (2004) What’s new on the Web?: The evolution of the Web from a search engine perspective. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 1–12 Ntoulas A, Cho J, Olston C (2004) What’s new on the Web?: The evolution of the Web from a search engine perspective. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 1–12
Zurück zum Zitat Olston C, Pandey S (2008) Recrawl scheduling based on information longevity. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 437–446 Olston C, Pandey S (2008) Recrawl scheduling based on information longevity. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 437–446
Zurück zum Zitat Ozcan R, Altingovde I, Ulusoy O (2008) Static query result caching revisited. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 1169–1170 Ozcan R, Altingovde I, Ulusoy O (2008) Static query result caching revisited. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 1169–1170
Zurück zum Zitat Pandey S, Olston C (2005) User-centric web crawling. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 401–411 CrossRef Pandey S, Olston C (2005) User-centric web crawling. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 401–411 CrossRef
Zurück zum Zitat Pandey S, Olston C (2008) Crawl ordering by search impact. In: Proceedings of the ACM Conference on web Search and Data Mining. ACM Press, New York, NY, pp 3–14 CrossRef Pandey S, Olston C (2008) Crawl ordering by search impact. In: Proceedings of the ACM Conference on web Search and Data Mining. ACM Press, New York, NY, pp 3–14 CrossRef
Zurück zum Zitat Persin M (1994) Document filtering for fast ranking. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 339–348 Persin M (1994) Document filtering for fast ranking. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 339–348
Zurück zum Zitat Pitkow J, Schütze H, Cass T, Cooley R, Turnbull D, Edmonds A, Adar E, Breuel T (2002) Personalized search. Communications of the ACM 45(9):50–55 Pitkow J, Schütze H, Cass T, Cooley R, Turnbull D, Edmonds A, Adar E, Breuel T (2002) Personalized search. Communications of the ACM 45(9):50–55
Zurück zum Zitat Puppin D, Silvestri F, Perego R, Baeza-Yates R (2010) Tuning the capacity of search engines: Load-driven routing and incremental caching to reduce and balance the load. ACM Transactions on Information Systems 28(2):1–36 CrossRef Puppin D, Silvestri F, Perego R, Baeza-Yates R (2010) Tuning the capacity of search engines: Load-driven routing and incremental caching to reduce and balance the load. ACM Transactions on Information Systems 28(2):1–36 CrossRef
Zurück zum Zitat Radoslavov P, Govindan R, Estrin D (2002) Topology-informed Internet replica placement. Computer Communications 25(4):384–392 CrossRef Radoslavov P, Govindan R, Estrin D (2002) Topology-informed Internet replica placement. Computer Communications 25(4):384–392 CrossRef
Zurück zum Zitat Rafiei D, Bharat K, Shukla A (2010) Diversifying web search results. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 781–790 CrossRef Rafiei D, Bharat K, Shukla A (2010) Diversifying web search results. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 781–790 CrossRef
Zurück zum Zitat Raghavan S, Garcia-Molina H (2001) Crawling the hidden Web. In: Proceedings of the International Conference on Very Large Data Bases. Morgan Kaufmann, San Francisco, CA, pp 129–138 Raghavan S, Garcia-Molina H (2001) Crawling the hidden Web. In: Proceedings of the International Conference on Very Large Data Bases. Morgan Kaufmann, San Francisco, CA, pp 129–138
Zurück zum Zitat Rasolofo Y, Savoy J (2003) Term proximity scoring for keyword-based retrieval systems. In: Sebastiani F (ed) Proceedings of the European Conference on Information Retrieval. Lecture Notes in Computer Science, vol 2633. Springer, Berlin/Heidelberg, pp 79. doi:10.1007/3-540-36618-0_15, visited on December, 2010 Rasolofo Y, Savoy J (2003) Term proximity scoring for keyword-based retrieval systems. In: Sebastiani F (ed) Proceedings of the European Conference on Information Retrieval. Lecture Notes in Computer Science, vol 2633. Springer, Berlin/Heidelberg, pp 79. doi:10.​1007/​3-540-36618-0_​15, visited on December, 2010
Zurück zum Zitat Ribeiro-Neto B, Barbosa R (1998) Query performance for tightly coupled distributed digital libraries. In: Proceedings of the ACM Conference on Digital Libraries. ACM Press, New York, NY, pp 182–190 CrossRef Ribeiro-Neto B, Barbosa R (1998) Query performance for tightly coupled distributed digital libraries. In: Proceedings of the ACM Conference on Digital Libraries. ACM Press, New York, NY, pp 182–190 CrossRef
Zurück zum Zitat Ribeiro-Neto B, Kitajima J, Navarro G, Sant’Ana C, Ziviani N (1998) Parallel generation of inverted files for distributed text collections. In: Proceedings of the Conference of the Chilean Computer Science Society. IEEE Computer Society, Washington, DC, pp 149–157 Ribeiro-Neto B, Kitajima J, Navarro G, Sant’Ana C, Ziviani N (1998) Parallel generation of inverted files for distributed text collections. In: Proceedings of the Conference of the Chilean Computer Science Society. IEEE Computer Society, Washington, DC, pp 149–157
Zurück zum Zitat Ribeiro-Neto B, Moura E, Neubert M, Ziviani N (1999) Efficient distributed algorithms to build inverted files. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 105–112 Ribeiro-Neto B, Moura E, Neubert M, Ziviani N (1999) Efficient distributed algorithms to build inverted files. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 105–112
Zurück zum Zitat Risvik K, Aasheim Y, Lidal M (2003) Multi-tier architecture for web search engines. In: Proceedings of the Latin American Conference on World Wide Web. IEEE Computer Society, Washington, DC, p 132 Risvik K, Aasheim Y, Lidal M (2003) Multi-tier architecture for web search engines. In: Proceedings of the Latin American Conference on World Wide Web. IEEE Computer Society, Washington, DC, p 132
Zurück zum Zitat Saraiva P, Silva de Moura E, Ziviani N, Meira W, Fonseca R, Riberio-Neto B (2001) Rank-preserving two-level caching for scalable search engines. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 51–58 Saraiva P, Silva de Moura E, Ziviani N, Meira W, Fonseca R, Riberio-Neto B (2001) Rank-preserving two-level caching for scalable search engines. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 51–58
Zurück zum Zitat Sarigiannis C, Plachouras V, Baeza-Yates R (2009) A study of the impact of index updates on distributed query processing for web search. In: Proceedings of the European Conference on Information Retrieval. Springer, Berlin/Heidelberg, pp 595–602 Sarigiannis C, Plachouras V, Baeza-Yates R (2009) A study of the impact of index updates on distributed query processing for web search. In: Proceedings of the European Conference on Information Retrieval. Springer, Berlin/Heidelberg, pp 595–602
Zurück zum Zitat Schenkel R, Broschart A, Hwang S, Theobald M, Weikum G (2007) Efficient text proximity search. In: Proceedings of the International Symposium on String Processing Information Retrieval. Lecture Notes in Computer Science, vol 4726. Springer, Berlin/Heidelberg, pp 287–299 CrossRef Schenkel R, Broschart A, Hwang S, Theobald M, Weikum G (2007) Efficient text proximity search. In: Proceedings of the International Symposium on String Processing Information Retrieval. Lecture Notes in Computer Science, vol 4726. Springer, Berlin/Heidelberg, pp 287–299 CrossRef
Zurück zum Zitat Scholer F, Williams H, Yiannis J, Zobel J (2002) Compression of inverted indexes for fast query evaluation. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 222–229 Scholer F, Williams H, Yiannis J, Zobel J (2002) Compression of inverted indexes for fast query evaluation. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 222–229
Zurück zum Zitat Shieh WY, Chung CP (2005) A statistics-based approach to incrementally update inverted files. Information Processing and Management 41(2):275–288 Shieh WY, Chung CP (2005) A statistics-based approach to incrementally update inverted files. Information Processing and Management 41(2):275–288
Zurück zum Zitat Shieh WY, Chen TF, Shann J, Chung CP (2003) Inverted file compression through document identifier reassignment. Information Processing and Management 39(1):117–131 MATHCrossRef Shieh WY, Chen TF, Shann J, Chung CP (2003) Inverted file compression through document identifier reassignment. Information Processing and Management 39(1):117–131 MATHCrossRef
Zurück zum Zitat Shkapenyuk V, Suel T (2002) Design and implementation of a high-performance distributed web crawler. In: Proceedings of the International Conference on Data Engineering. IEEE Computer Society, Washington, DC, p 357 Shkapenyuk V, Suel T (2002) Design and implementation of a high-performance distributed web crawler. In: Proceedings of the International Conference on Data Engineering. IEEE Computer Society, Washington, DC, p 357
Zurück zum Zitat Si L, Jin R, Callan J, Ogilvie P (2002a) A language modeling framework for resource selection and results merging. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 391–397 Si L, Jin R, Callan J, Ogilvie P (2002a) A language modeling framework for resource selection and results merging. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 391–397
Zurück zum Zitat Silvestri F (2007) Sorting out the document identifier assignment problem. In: Amati G, Carpineto C, Romano G (eds) Proceedings of the European Conference on Information Retrieval. Lecture Notes in Computer Science, vol 4425. Springer, Berlin/Heidelberg, pp 101–112 Silvestri F (2007) Sorting out the document identifier assignment problem. In: Amati G, Carpineto C, Romano G (eds) Proceedings of the European Conference on Information Retrieval. Lecture Notes in Computer Science, vol 4425. Springer, Berlin/Heidelberg, pp 101–112
Zurück zum Zitat Silvestri F, Orlando S, Perego R (2004) Assigning identifiers to documents to enhance the clustering property of fulltext indexes. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 305–312 Silvestri F, Orlando S, Perego R (2004) Assigning identifiers to documents to enhance the clustering property of fulltext indexes. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 305–312
Zurück zum Zitat Skobeltsyn G, Junqueira F, Plachouras V, Baeza-Yates R (2008) ResIn: a combination of results caching and index pruning for high-performance web search engines. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 131–138 Skobeltsyn G, Junqueira F, Plachouras V, Baeza-Yates R (2008) ResIn: a combination of results caching and index pruning for high-performance web search engines. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 131–138
Zurück zum Zitat Strohman T, Turtle H, Croft W (2005) Optimization strategies for complex queries. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 219–225 Strohman T, Turtle H, Croft W (2005) Optimization strategies for complex queries. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 219–225
Zurück zum Zitat Sun JT, Zeng HJ, Liu H, Lu Y, Chen Z (2005) CubeSVD: a novel approach to personalized web search. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 382–390 CrossRef Sun JT, Zeng HJ, Liu H, Lu Y, Chen Z (2005) CubeSVD: a novel approach to personalized web search. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 382–390 CrossRef
Zurück zum Zitat Tan B, Shen X, Zhai C (2006) Mining long-term search history to improve search accuracy. In: Proceedings of the ACM Conference on Knowledge Discovery and Data Mining. ACM Press, New York, NY, pp 718–723 CrossRef Tan B, Shen X, Zhai C (2006) Mining long-term search history to improve search accuracy. In: Proceedings of the ACM Conference on Knowledge Discovery and Data Mining. ACM Press, New York, NY, pp 718–723 CrossRef
Zurück zum Zitat Teevan J, Dumais S, Horvitz E (2005) Personalizing search via automated analysis of interests and activities. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 449–456 Teevan J, Dumais S, Horvitz E (2005) Personalizing search via automated analysis of interests and activities. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 449–456
Zurück zum Zitat Tomasic A, Garcia-Molina H (1993) Caching and database scaling in distributed shared-nothing information retrieval systems. ACM SIGMOD Record 22(2):129–138 CrossRef Tomasic A, Garcia-Molina H (1993) Caching and database scaling in distributed shared-nothing information retrieval systems. ACM SIGMOD Record 22(2):129–138 CrossRef
Zurück zum Zitat Tomasic A, Garcia-Molina H, Shoens K (1994) Incremental updates of inverted lists for text document retrieval. In: Proceedings of the ACM Conference on Management of Data. ACM Press, New York, NY, pp 289–300 Tomasic A, Garcia-Molina H, Shoens K (1994) Incremental updates of inverted lists for text document retrieval. In: Proceedings of the ACM Conference on Management of Data. ACM Press, New York, NY, pp 289–300
Zurück zum Zitat Tomasic A, Gravano L, Lue C, Schwarz P, Haas L (1997) Data structures for efficient broker implementation. ACM Transactions on Information Systems 15(3):223–253 CrossRef Tomasic A, Gravano L, Lue C, Schwarz P, Haas L (1997) Data structures for efficient broker implementation. ACM Transactions on Information Systems 15(3):223–253 CrossRef
Zurück zum Zitat Turpin A, Tsegay Y, Hawking D, Williams H (2007) Fast generation of result snippets in web search. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 127–134 Turpin A, Tsegay Y, Hawking D, Williams H (2007) Fast generation of result snippets in web search. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 127–134
Zurück zum Zitat Turtle H, Flood J (1995) Query evaluation: Strategies and optimizations. Information Processing and Management 31(6):831–850 CrossRef Turtle H, Flood J (1995) Query evaluation: Strategies and optimizations. Information Processing and Management 31(6):831–850 CrossRef
Zurück zum Zitat Varadarajan R, Hristidis V (2006) A system for query-specific document summarization. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 622–631 CrossRef Varadarajan R, Hristidis V (2006) A system for query-specific document summarization. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 622–631 CrossRef
Zurück zum Zitat Wang L, Lin J, Metzler D (2010) Learning to efficiently rank. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 138–145 Wang L, Lin J, Metzler D (2010) Learning to efficiently rank. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 138–145
Zurück zum Zitat Witten I, Moffat A, Bell T (1999) Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edn. Morgan Kaufmann, San Francisco, CA Witten I, Moffat A, Bell T (1999) Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edn. Morgan Kaufmann, San Francisco, CA
Zurück zum Zitat Wolf J, Squillante M, Yu P, Sethuraman J, Ozsen L (2002) Optimal crawling strategies for web search engines. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 136–147 Wolf J, Squillante M, Yu P, Sethuraman J, Ozsen L (2002) Optimal crawling strategies for web search engines. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 136–147
Zurück zum Zitat Wong WP, Lee D (1993) Implementations of partial document ranking using inverted files. Information Processing and Management 29(5):647–669 CrossRef Wong WP, Lee D (1993) Implementations of partial document ranking using inverted files. Information Processing and Management 29(5):647–669 CrossRef
Zurück zum Zitat Xu J, Callan J (1998) Effective retrieval with distributed collections. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 112–120 Xu J, Callan J (1998) Effective retrieval with distributed collections. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 112–120
Zurück zum Zitat Xu J, Croft W (1999) Cluster-based language models for distributed retrieval. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 254–261 Xu J, Croft W (1999) Cluster-based language models for distributed retrieval. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 254–261
Zurück zum Zitat Yan H, Ding S, Suel T (2009a) Compressing term positions in web indexes. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 147–154 Yan H, Ding S, Suel T (2009a) Compressing term positions in web indexes. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 147–154
Zurück zum Zitat Yan H, Ding S, Suel T (2009b) Inverted index compression and query processing with optimized document ordering. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 401–410 Yan H, Ding S, Suel T (2009b) Inverted index compression and query processing with optimized document ordering. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 401–410
Zurück zum Zitat Yu F, Xie Y, Ke Q (2010) Sbotminer: Large scale search bot detection. In: Proceedings of the ACM Conference on web Search and Data Mining. ACM Press, New York, NY, pp 421–430 CrossRef Yu F, Xie Y, Ke Q (2010) Sbotminer: Large scale search bot detection. In: Proceedings of the ACM Conference on web Search and Data Mining. ACM Press, New York, NY, pp 421–430 CrossRef
Zurück zum Zitat Yuwono B, Lee D (1997) Server ranking for distributed text retrieval systems on the Internet. In: Proceedings of the International Conference on Database Systems for Advanced Applications. World Scientific, Singapore, pp 41–50 CrossRef Yuwono B, Lee D (1997) Server ranking for distributed text retrieval systems on the Internet. In: Proceedings of the International Conference on Database Systems for Advanced Applications. World Scientific, Singapore, pp 41–50 CrossRef
Zurück zum Zitat Zeinalipour-Yazti D, Dikaiakos M (2002) Design and implementation of a distributed crawler and filtering processor. In: Proceedings of the International Workshop on Next Generation Information Technologies and Systems. Springer, London, pp 58–74 CrossRef Zeinalipour-Yazti D, Dikaiakos M (2002) Design and implementation of a distributed crawler and filtering processor. In: Proceedings of the International Workshop on Next Generation Information Technologies and Systems. Springer, London, pp 58–74 CrossRef
Zurück zum Zitat Zhang J, Long X, Suel T (2008) Performance of compressed inverted list caching in search engines. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 387–396 Zhang J, Long X, Suel T (2008) Performance of compressed inverted list caching in search engines. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 387–396
Zurück zum Zitat Zobel J, Moffat A (2006) Inverted files for text search engines. ACM Computing Surveys 38(2):6 CrossRef Zobel J, Moffat A (2006) Inverted files for text search engines. ACM Computing Surveys 38(2):6 CrossRef
Metadaten
Titel
Scalability Challenges in Web Search Engines
verfasst von
Berkant Barla Cambazoglu
Ricardo Baeza-Yates
Copyright-Jahr
2011
Verlag
Springer Berlin Heidelberg
DOI
https://doi.org/10.1007/978-3-642-20946-8_2

Neuer Inhalt