2020 | OriginalPaper | Book chapter

The Effect of Content-Equivalent Near-Duplicates on the Evaluation of Search Engines

Authors: Maik Fröbe, Jan Philipp Bittner, Martin Potthast, Matthias Hagen

Published in: Advances in Information Retrieval

Publisher: Springer International Publishing

Abstract

Current best practices for evaluating search engines do not take duplicate documents into account. Depending on how prevalent duplicates are, failing to discount them during evaluation artificially inflates performance scores and penalizes those whose search systems diligently filter them. Although Bernstein and Zobel [4] demonstrated these negative effects long ago, we find that their result has failed to move the community. In this paper, we reproduce the aforementioned study and extend it to all TREC Terabyte, Web, and Core tracks. In the worst case, a system that filtered duplicates lost between 8 and 53 ranks, depending on the track.
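
The inflation effect described in the abstract can be illustrated with a toy example. The following is a minimal sketch, not the evaluation procedure used in the paper: it assumes precision at rank k as the measure and assumes that a relevant document counts only the first time its duplicate group appears in a ranking. The function name `precision_at_k`, the `duplicate_group` mapping, and all document identifiers are hypothetical.

```python
from typing import Dict, List, Optional, Set


def precision_at_k(
    ranking: List[str],
    relevant: Set[str],
    k: int,
    duplicate_group: Optional[Dict[str, str]] = None,
) -> float:
    """Precision at rank k.

    If duplicate_group is given (document id -> id of its group), a relevant
    document is counted only the first time its group appears in the ranking;
    later members of the same group are treated as non-relevant.
    This discounting rule is an assumption made for illustration.
    """
    seen_groups: Set[str] = set()
    hits = 0
    for doc in ranking[:k]:
        if doc not in relevant:
            continue
        if duplicate_group is not None:
            # Documents without an entry form their own singleton group.
            group = duplicate_group.get(doc, doc)
            if group in seen_groups:
                continue
            seen_groups.add(group)
        hits += 1
    return hits / k


# Hypothetical judgments: d1 has two content-equivalent copies.
relevant = {"d1", "d1_copy", "d1_mirror", "d4"}
groups = {"d1_copy": "d1", "d1_mirror": "d1"}

# System A returns all three copies; system B filters duplicates.
system_a = ["d1", "d1_copy", "d1_mirror", "d2", "d3"]
system_b = ["d1", "d4", "d2", "d3", "d5"]

plain_a = precision_at_k(system_a, relevant, 5)
plain_b = precision_at_k(system_b, relevant, 5)
dedup_a = precision_at_k(system_a, relevant, 5, groups)
dedup_b = precision_at_k(system_b, relevant, 5, groups)

print(f"plain: A={plain_a:.1f} B={plain_b:.1f}")
print(f"dedup: A={dedup_a:.1f} B={dedup_b:.1f}")
```

In this toy setting, the plain measure ranks system A above system B, while the duplicate-aware variant reverses the order, which is the kind of rank change the abstract refers to.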

References
1. Allan, J., Harman, D., Kanoulas, E., Li, D., Gysel, C.V., Voorhees, E.M.: TREC 2017 common core track overview. In: Proceedings of The Twenty-Sixth Text REtrieval Conference, TREC 2017, Gaithersburg, Maryland, USA, 15–17 November 2017 (2017)
2. Allan, J., Harman, D., Kanoulas, E., Voorhees, E.M.: TREC 2018 common core track overview. In: Notebooks of The Twenty-Seventh Text REtrieval Conference (TREC 2018), Gaithersburg, Maryland, USA, 14–16 November 2018 (2018)
4. Bernstein, Y., Zobel, J.: Redundant documents and search effectiveness. In: Proceedings of the 2005 ACM CIKM International Conference on Information and Knowledge Management, Bremen, Germany, 31 October – 5 November 2005, pp. 736–743 (2005)
5. Broder, A.Z.: On the resemblance and containment of documents. In: Proceedings Compression and Complexity of SEQUENCES 1997, Positano, Amalfitan Coast, Salerno, Italy, 11–13 June 1997, pp. 21–29 (1997)
6. Büttcher, S., Clarke, C.L.A., Soboroff, I.: The TREC 2006 terabyte track. In: Proceedings of the Fifteenth Text REtrieval Conference (TREC 2006), Gaithersburg, Maryland, USA, 14–17 November 2006 (2006)
7. Clarke, C.L.A., Craswell, N., Soboroff, I.: Overview of the TREC 2004 terabyte track. In: Proceedings of the Thirteenth Text REtrieval Conference (TREC 2004), Gaithersburg, Maryland, USA, 16–19 November 2004 (2004)
8. Clarke, C.L.A., Craswell, N., Soboroff, I.: Overview of the TREC 2009 web track. In: Proceedings of The Eighteenth Text REtrieval Conference (TREC 2009), Gaithersburg, Maryland, USA, 17–20 November 2009 (2009)
9. Clarke, C.L.A., Craswell, N., Soboroff, I., Cormack, G.V.: Overview of the TREC 2010 web track. In: Proceedings of The Nineteenth Text REtrieval Conference (TREC 2010), Gaithersburg, Maryland, USA, 16–19 November 2010 (2010)
10. Clarke, C.L.A., Craswell, N., Soboroff, I., Voorhees, E.M.: Overview of the TREC 2011 web track. In: Proceedings of The Twentieth Text REtrieval Conference (TREC 2011), Gaithersburg, Maryland, USA, 15–18 November 2011 (2011)
11. Clarke, C.L.A., Craswell, N., Voorhees, E.M.: Overview of the TREC 2012 web track. In: Proceedings of The Twenty-First Text REtrieval Conference (TREC 2012), Gaithersburg, Maryland, USA, 6–9 November 2012 (2012)
12. Clarke, C.L.A., Scholer, F., Soboroff, I.: The TREC 2005 terabyte track. In: Proceedings of the Fourteenth Text REtrieval Conference (TREC 2005), Gaithersburg, Maryland, USA, 15–18 November 2005 (2005)
13. Collins-Thompson, K., Bennett, P.N., Diaz, F., Clarke, C., Voorhees, E.M.: TREC 2013 web track overview. In: Proceedings of The Twenty-Second Text REtrieval Conference (TREC 2013), Gaithersburg, Maryland, USA, 19–22 November 2013 (2013)
14. Collins-Thompson, K., Macdonald, C., Bennett, P.N., Diaz, F., Voorhees, E.M.: TREC 2014 web track overview. In: Proceedings of The Twenty-Third Text REtrieval Conference (TREC 2014), Gaithersburg, Maryland, USA, 19–21 November 2014 (2014)
15. Fetterly, D., Manasse, M.S., Najork, M.: On the evolution of clusters of near-duplicate web pages. In: 1st Latin American Web Congress (LA-WEB2003), Empowering Our Web, Santiago, Chile, 10–12 November 2003, pp. 37–45 (2003)
16. Fuhr, N.: Some common mistakes in IR evaluation, and how they can be avoided. SIGIR Forum 51(3), 32–41 (2017)
17. Yang, P., Fang, H., Lin, J.: Anserini: enabling the use of Lucene for information retrieval research. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Shinjuku, Tokyo, Japan, 7–11 August 2017, pp. 1253–1256 (2017)
Metadata
Title
The Effect of Content-Equivalent Near-Duplicates on the Evaluation of Search Engines
Authors
Maik Fröbe
Jan Philipp Bittner
Martin Potthast
Matthias Hagen
Copyright year
2020
DOI
https://doi.org/10.1007/978-3-030-45442-5_2
