nach oben

Erschienen in:

2017 | OriginalPaper | Buchkapitel

T\(^2\)K\(^2\): The Twitter Top-K Keywords Benchmark

verfasst von : Ciprian-Octavian Truică, Jérôme Darmont

Erschienen in: New Trends in Databases and Information Systems

Verlag: Springer International Publishing

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

Information retrieval from textual data focuses on the construction of vocabularies that contain weighted term tuples. Such vocabularies can then be exploited by various text analysis algorithms to extract new knowledge, e.g., top-k keywords, top-k documents, etc. Top-k keywords are casually used for various purposes, are often computed on-the-fly, and thus must be efficiently computed. To compare competing weighting schemes and database implementations, benchmarking is customary. To the best of our knowledge, no benchmark currently addresses these problems. Hence, in this paper, we present a top-k keywords benchmark, T\(^2\)K\(^2\), which features a real tweet dataset and queries with various complexities and selectivities. T\(^2\)K\(^2\) helps evaluate weighting schemes and database implementations in terms of computing performance. To illustrate T\(^2\)K\(^2\)’s relevance and genericity, we show how to implement the TF-IDF and Okapi BM25 weighting schemes, on one hand, and relational and document-oriented database instantiations, on the other hand.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Vorheriges Kapitel Assessing the Quality of Spatio-Textual Datasets in the Absence of Ground Truth

Nächstes Kapitel Outlier Detection in Data Streams Using OLAP Cubes

Bringay, S., Béchet, N., Bouillot, F., Poncelet, P., Roche, M., Teisseire, M.: Towards an on-line analysis of tweets processing. In: Hameurlain, A., Liddle, S.W., Schewe, K.-D., Zhou, X. (eds.) DEXA 2011. LNCS, vol. 6861, pp. 154–161. Springer, Heidelberg (2011). doi:10.1007/978-3-642-23091-2_15 CrossRef

Cooper, J.D., Robinson, M.D., Slansky, J.A., Kiger, N.D.: Literacy: Helping Students Construct Meaning. Cengage Learning, Boston (2014)

Darmont, J.: Data processing benchmarks. In: Khosrow, M. (ed.) Encyclopedia of Information Science and Technology, 3rd edn, pp. 146–152. IGI Global, Hershey (2014)

Ferrarons, J., Adhana, M., Colmenares, C., Pietrowska, S., Bentayeb, F., Darmont, J.: PRIMEBALL: a parallel processing framework benchmark for big data applications in the cloud. In: Nambiar, R., Poess, M. (eds.) TPCTC 2013. LNCS, vol. 8391, pp. 109–124. Springer, Cham (2014). doi:10.1007/978-3-319-04936-6_8 CrossRef

Gattiker, A.E., Gebara, F.H., Hofstee, H.P., Hayes, J.D., Hylick, A.: Big data text-oriented benchmark creation for Hadoop. IBM J. Res. Dev. 57(3/4), 10: 1–10: 6 (2013)CrossRef

Gray, J.: The Benchmark Handbook for Database and Transaction Systems, 2nd edn. Morgan Kaufmann, Burlington (1993)MATH

Guille, A., Favre, C.: Event detection, tracking, and visualization in twitter: a mention-anomaly-based approach. Soc. Netw. Anal. Min. 5(1), 18 (2015)CrossRef

Kılınç, D., Özçift, A., Bozyigit, F., Yildirim, P., Yücalar, F., Borandag, E.: TTC-3600: A new benchmark dataset for turkish text categorization. J. Inf. Sci. 43(2), 174–185 (2017)CrossRef

Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: RCV1: A new benchmark collection for text categorization research. J. Mach. Learn. Res. 5, 361–397 (2004)

10.

Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, New York (2008)CrossRefMATH

11.

O’Shea, J., Bandar, Z., Crockett, K.A., McLean, D.: Benchmarking short text semantic similarity. Int. J. Intell. Inf. Database Syst. 4(2), 103–120 (2010)

12.

Paltoglou, G., Thelwall, M.: A study of information retrieval weighting schemes for sentiment analysis. In: 48th Annual Meeting of the Association for Computational Linguistics, pp. 1386–1395 (2010)

13.

Partalas, I., Kosmopoulos, A., Baskiotis, N., Artières, T., Paliouras, G., Gaussier, É., Androutsopoulos, I., Amini, M., Gallinari, P.: LSHTC: a benchmark for large-scale text classification. CoRR abs/1503.08581 (2015)

14.

Ravat, F., Teste, O., Tournier, R., Zurfluh, G.: Top_Keyword: an aggregation function for textual document OLAP. In: Song, I.-Y., Eder, J., Nguyen, T.M. (eds.) DaWaK 2008. LNCS, vol. 5182, pp. 55–64. Springer, Heidelberg (2008). doi:10.1007/978-3-540-85836-2_6 CrossRef

15.

Reagan, A.J., Tivnan, B.F., Williams, J.R., Danforth, C.M., Dodds, P.S.: Benchmarking sentiment analysis methods for large-scale texts: a case for using continuum-scored words and word shift graphs. CoRR abs/1512.00531 (2015)

16.

Spärck Jones, K., Walker, S., Robertson, S.E.: A probabilistic model of information retrieval: development and comparative experiments: part 1. Inf. Process. Manage. 36(6), 779–808 (2000)CrossRef

17.

Spärck Jones, K., Walker, S., Robertson, S.E.: A probabilistic model of information retrieval: development and comparative experiments: part 2. Inf. Process. Manage. 36(6), 809–840 (2000)CrossRef

18.

Truică, C.O., Darmont, J., Velcin, J.: A scalable document-based architecture for text analysis. In: International Conference on Advanced Data Mining and Applications (ADMA), pp. 481–494 (2016)

19.

Wang, L., Dong, X., Zhang, X., Wang, Y., Ju, T., Feng, G.: TextGen: a realistic text data content generation method for modern storage system benchmarks. Front. Inf. Technol. Electron. Eng. 17(10), 982–993 (2016)CrossRef

Titel: TK: The Twitter Top-K Keywords Benchmark
verfasst von: Ciprian-Octavian Truică
Jérôme Darmont
Verlag: Springer International Publishing
Buch: New Trends in Databases and Information Systems
Print ISBN: 978-3-319-67161-1

Electronic ISBN: 978-3-319-67162-8

Copyright-Jahr: 2017
DOI: https://doi.org/10.1007/978-3-319-67162-8_3

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"