
2017 | OriginalPaper | Book chapter

T\(^2\)K\(^2\): The Twitter Top-K Keywords Benchmark

Authors: Ciprian-Octavian Truică, Jérôme Darmont

Published in: New Trends in Databases and Information Systems

Publisher: Springer International Publishing


Abstract

Information retrieval from textual data focuses on the construction of vocabularies that contain weighted term tuples. Such vocabularies can then be exploited by various text analysis algorithms to extract new knowledge, e.g., top-k keywords or top-k documents. Top-k keywords are commonly used for various purposes and are often computed on the fly, so they must be computed efficiently. Benchmarking is the customary way to compare competing weighting schemes and database implementations, yet, to the best of our knowledge, no benchmark currently addresses these problems. Hence, in this paper, we present a top-k keywords benchmark, T\(^2\)K\(^2\), which features a real tweet dataset and queries of various complexities and selectivities. T\(^2\)K\(^2\) helps evaluate weighting schemes and database implementations in terms of computing performance. To illustrate T\(^2\)K\(^2\)’s relevance and genericity, we show how to implement the TF-IDF and Okapi BM25 weighting schemes, on the one hand, and relational and document-oriented database instantiations, on the other hand.
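As a rough illustration of what computing top-k keywords under the two weighting schemes named in the abstract can look like, here is a minimal Python sketch. It is not the benchmark's implementation: the abstract does not give the exact formula variants, so the sketch assumes the common textbook forms (length-normalized term frequency times log(N/df) for TF-IDF, and Okapi BM25 with the same log(N/df) factor). The function name `top_k_keywords`, the parameter defaults `k1=1.2` and `b=0.75`, and the choice to rank terms by summing their weights over all documents are assumptions made for this example.

```python
import math
from collections import Counter


def top_k_keywords(docs, k, scheme="tfidf", k1=1.2, b=0.75):
    """Rank terms by their summed weight over a list of tokenized documents.

    docs   : list of token lists (hypothetical input format)
    scheme : "tfidf" or "bm25" (textbook variants, assumed here)
    k1, b  : usual BM25 tuning parameters (assumed defaults)
    """
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = Counter()
    for d in docs:
        for term, f in Counter(d).items():
            idf = math.log(n_docs / df[term])
            if scheme == "tfidf":
                # Term frequency normalized by document length.
                weight = (f / len(d)) * idf
            else:
                # Okapi BM25 term weight with document-length normalization.
                norm = k1 * (1 - b + b * len(d) / avg_len)
                weight = idf * f * (k1 + 1) / (f + norm)
            scores[term] += weight

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda ts: (-ts[1], ts[0]))[:k]


if __name__ == "__main__":
    tweets = [["rain", "rain", "sun"], ["sun", "wind"], ["sun", "rain"]]
    print(top_k_keywords(tweets, 2, scheme="tfidf"))
    print(top_k_keywords(tweets, 2, scheme="bm25"))
```

In this toy input, a term that appears in every document gets an inverse document frequency of zero under both schemes, which is why such terms fall to the bottom of the ranking. A database-backed implementation would typically push the aggregation into the query engine rather than loop in application code.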

Metadata
Title
T\(^2\)K\(^2\): The Twitter Top-K Keywords Benchmark
Authors
Ciprian-Octavian Truică
Jérôme Darmont
Copyright year
2017
DOI
https://doi.org/10.1007/978-3-319-67162-8_3