nach oben

Information Systems Frontiers

Erschienen in:

06.03.2020

TextBenDS: a Generic Textual Data Benchmark for Distributed Systems

verfasst von: Ciprian-Octavian Truică, Elena-Simona Apostol, Jérôme Darmont, Ira Assent

Erschienen in: Information Systems Frontiers | Ausgabe 1/2021

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

Extracting top-k keywords and documents using weighting schemes are popular techniques employed in text mining and machine learning for different analysis and retrieval tasks. The weights are usually computed in the data preprocessing step, as they are costly to update and keep track of all the modifications performed on the dataset. Furthermore, calculation errors are introduced when analyzing only subsets of the dataset, i.e., wrong weighting are computed as weighting schemes use the number of documents for scoring keywords and documents. Therefore, in a Big Data context, it is crucial to lower the runtime of computing weighting schemes, without hindering the analysis process and the accuracy of the machine learning algorithms. To address this requirement for the task of computing top-k keywords and documents (which largely relies on weighting schemes), it is customary to design benchmarks that compare weighting schemes within various configurations of distributedframeworks and database management systems. Thus, we propose TextBenDS - a generic document-oriented benchmark for storing textual data and constructing weighting schemes. Our benchmark offers a generic data model designed with a multidimensional approach for storing text documents. We also propose using aggregation queries with various complexities and selectivities for constructing term weighting schemes, that are utilized in extracting top-k keywords and documents. We evaluate the computing performance of the queries on several distributed environments set within the Apache Hadoop ecosystem. Our experimental results provide interesting insights. As an example, MongoDB shows the best overall performance, while Spark’s execution time remains almost constant regardless of weighting schemes.

Vorheriger Artikel Atypical Sample Regularizer Autoencoder for Cross-Domain Human Activity Recognition

Nächster Artikel GarNLP: A Natural Language Processing Pipeline for Garnishment Documents

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Source code https://github.com/cipriantruica/TextBenDS

Agrawal, D., Butt, A., Doshi, K., Larriba-Pey, J. L., Li, M., Reiss, F. R., Raab, F., Schiefer, B., Suzumura, T., & Xia, Y. (2016). Sparkbench – a spark performance testing suite. In Performance evaluation and benchmarking: Traditional to big data to internet of things (pp. 26–44). Springer International Publishing. https://doi.org/10.1007/978-3-319-31409-9-3.

Armbrust, M., Xin, R. S., Lian, C., Huai, Y., Liu, D., Bradley, J. K., Meng, X., Kaftan, T., Franklin, M. J., Ghodsi, A., & Zaharia, M. (2015). Spark sql: Relational data processing in spark. In ACM SIGMOD International Conference on Management of Data (pp. 1383–1394). ACM Press. https://doi.org/10.1145/2723372.2742797.

Armstrong, T. G., Ponnekanti, V., Borthakur, D., & Callaghan, M. (2013). Linkbench: A database benchmark based on the facebook social graph. In ACM SIGMOD International Conference on Management of Data, SIGMOD ‘13 (pp. 1185–1196). ACM. https://doi.org/10.1145/2463676.2465296.

Bellot, P., Doucet, A., Geva, S., Gurajada, S., Kamps, J., Kazai, G., Koolen, M., Mishra, A., Moriceau, V., Mothe, J., Preminger, M., SanJuan, E., Schenkel, R., Tannier, X., Theobald, M., Trappett, M., Trotman, A., Sanderson, M., Scholer, F., & Wang, Q. (2013). Report on inex 2013. SIGIR Forum, 47(2), 21–32. https://doi.org/10.1145/2568388.2568393.CrossRef

Bifet, A., & Frank, E. (2010). Sentiment knowledge discovery in twitter streaming data. In Discovery Science (pp. 1–15). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-16184-1-1.

Bouakkaz, M., Loudcher, S., & Ouinten, Y. (2016). OLAP textual aggregation approach using the google similarity distance. International Journal of Business Intelligence and Data Mining, 11(1), 31. https://doi.org/10.1504/ijbidm.2016.076425.CrossRef

Bringay, S., Béchet, N., Bouillot, F., Poncelet, P., Roche, M., & Teisseire, M. (2011). Towards an on-line analysis of tweets processing. In International Conference on Database and Expert Systems Applications (pp. 154–161). https://doi.org/10.1007/978-3-642-23091-2_15.CrossRef

Chowdhury, B., Rabl, T., Saadatpanah, P., Du, J., & Jacobsen, H. A. (2014). A bigbench implementation in the hadoop ecosystem. In Advancing big data benchmarks (pp. 3–18). Springer International Publishing. https://doi.org/10.1007/978-3-319-10596-3-1.

Crane, M., Culpepper, J. S., Lin, J., Mackenzie, J., & Trotman, A. (2017). A comparison of document-at-a-time and score-at-a-time query evaluation. In ACM International Conference on Web Search and Data Mining (pp. 201–210). ACM. https://doi.org/10.1145/3018661.3018726.

Dean, J., & Ghemawat, S. (2008). Mapreduce: Simplified data processing on large clusters. Communications of the ACM, 51(1), 107–113. https://doi.org/10.1145/1327452.1327492.CrossRef

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407. https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9.CrossRef

Ferrarons, J., Adhana, M., Colmenares, C., Pietrowska, S., Bentayeb, F., Darmont, J. (2014). Primeball: a parallel processing framework benchmark for big data applications in the cloud. In: TPC Technology Conference on Performance Evaluation and Benchmarking, LNCS1, 839, pp. 109–124. https://doi.org/10.1007/978-3-319-04936-6_8

Gattiker, A. E., Gebara, F. H., Hofstee, H. P., Hayes, J. D., & Hylick, A. (2013). Big data text-oriented benchmark creation for Hadoop. IBM Journal of Research and Development, 57(3/4), 10:1–10:6. https://doi.org/10.1147/JRD.2013.2240732.CrossRef

Ghazal, A., Rabl, T., Hu, M., Raab, F., Poess, M., Crolotte, A., & Jacobsen, H. A. (2013). Bigbench: Towards an industry standard benchmark for big data analytics. In ACM SIGMOD International Conference on Management of Data, SIGMOD ‘13 (pp. 1197–1208). https://doi.org/10.1145/2463676.2463712.CrossRef

Ghazal, A., Ivanov, T., Kostamaa, P., Crolotte, A., Voong, R., Al-Kateb, M., Ghazal, W., & Zicari, R. V. (2017). Bigbench v2: The new and improved bigbench. In 2017 IEEE 33rd International Conference on Data Engineering (pp. 1225–1236). https://doi.org/10.1109/ICDE.2017.167.CrossRef

Gray, J. (1993). The benchmark handbook for database and transaction systems (2nd ed.). Burlington: Morgan Kaufmann Publishers.

Guille, A., & Favre, C. (2015). Event detection, tracking, and visualization in twitter: a mention-anomaly-based approach. Social Network Analysis and Mining, 5(1), 18. https://doi.org/10.1007/s13278-015-0258-0.CrossRef

Hofmann, T. (2017). Probabilistic latent semantic indexing. SIGIR Forum, 51(2), 211–218. https://doi.org/10.1145/3130348.3130370.CrossRef

Huang, S., Huang, J., Dai, J., Xie, T., & Huang, B. (2010). The HiBench benchmark suite: Characterization of the MapReduce-based data analysis. In International Conference on Data Engineering (pp. 41–51). https://doi.org/10.1109/ICDEW.2010.5452747.CrossRef

Jia, Z., Zhan, J., Wang, L., Han, R., McKee, S. A., Yang, Q., Luo, C., & Li, J. (2014). Characterizing and subsetting big data workloads. In 2014 IEEE International Symposium on Workload Characterization (pp. 191–201). https://doi.org/10.1109/IISWC.2014.6983058.CrossRef

Kılıç, D., Özçift, A., Bozyigit, F., Yildirim, P., Yücalar, F., & Borandag, E. (2017). Ttc-3600: A new benchmark dataset for turkish text categorization. Journal of Information Science, 43(2), 174–185. https://doi.org/10.1177/0165551515620551.CrossRef

Krasnashchok, K., Jouili, S. (2018). Improving topic quality by promoting named entities in topic modeling. In: Annual Meeting of the Association for Computational Linguistics, pp. 247–253.

Lavrenko, V., & Croft, W. B. (2017). Relevance-based language models. SIGIR Forum, 51(2), 260–267. https://doi.org/10.1145/3130348.3130376.CrossRef

Lewis, D. D., Yang, Y., Rose, T. G., & Li, F. (2004). Rcv1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5, 361–397 URL http://www.jmlr.org/papers/v5/lewis04a.html.

Li, M., Tan, J., Wang, Y., Zhang, L., & Salapura, V. (2015). Sparkbench: A comprehensive benchmarking suite for in memory data analytic platform spark. In ACM International Conference on Computing Frontiers, CF ‘15 (pp. 53:1–53:8). ACM. https://doi.org/10.1145/2742854.2747283.

Lin, J., Crane, M., Trotman, A., Callan, J., Chattopadhyaya, I., Foley, J., Ingersoll, G., Macdonald, C., & Vigna, S. (2016). Toward reproducible baselines: The open-source ir reproducibility challenge. In Advances in information retrieval (pp. 408–420). Springer International Publishing. https://doi.org/10.1007/978-3-319-30671-1-30.

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge: Cambridge University Press.

Ming, Z., Luo, C., Gao, W., Han, R., Yang, Q., Wang, L., & Zhan, J. (2014). Bdgs: A scalable big data generator suite in big data benchmarking. In Advancing big data benchmarks (pp. 138–154). Springer International Publishing. https://doi.org/10.1007/978-3-319-10596-3-11.

O’Shea, J., Bandar, Z., Crockett, K. A., & McLean, D. (2010). Benchmarking short text semantic similarity. International Journal of Intelligent Information and Database Systems, 4(2), 103–120. https://doi.org/10.1504/IJIIDS.2010.032437.CrossRef

Paltoglou, G., Thelwall, M. (2010). A study of information retrieval weighting schemes for sentiment analysis. In: Annual Meeting of the Association for Computational Linguistics, pp. 1386–1395. URL http://dl.acm.org/citation.cfm?id=1858681.1858822.

Partalas, I., Kosmopoulos, A., Baskiotis, N., Artières, T., Paliouras, G., Gaussier, É., Androutsopoulos, I., Amini, M.R., Gallinari, P. (2015). Lshtc: A benchmark for large-scale text classification. CoRR. URL http://arxiv.org/abs/1503.08581.

Pirzadeh, P., Carey, M. J., & Westmann, T. (2015). Bigfun: A performance study of big data management system functionality. In IEEE International Conference on Big Data (pp. 507–514). https://doi.org/10.1109/BigData.2015.7363793.CrossRef

Raiber, F., & Kurland, O. (2017). Kullback-leibler divergence revisited. In ACM SIGIR International Conference on Theory of Information Retrieval, ICTIR ‘17 (pp. 117–124). ACM. https://doi.org/10.1145/3121050.3121062.

Ravat, F., Teste, O., Tournier, R., & Zurfluh, G. (2008). Top−keyword: an aggregation function for textual document olap. In International Conference on Data Warehousing and Knowledge Discovery (pp. 55–64). https://doi.org/10.1007/978-3-540-85836-2-6.CrossRef

Saha, B., Shah, H., Seth, S., Vijayaraghavan, G., Murthy, A., & Curino, C. (2015). Apache tez: A unifying framework for modeling and building data processing applications. In ACM SIGMOD International Conference on Management of Data (pp. 1357–1369). New York: ACM. https://doi.org/10.1145/2723372.2742790.CrossRef

Sangroya, A., Serrano, D., & Bouchenak, S. (2013). Mrbs: Towards dependability benchmarking for hadoop mapreduce. In Euro-Par 2012: Parallel Processing Workshops (pp. 3–12). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-36949-0-2.

Shu, K., Sliva, A., Wang, S., Tang, J., & Liu, H. (2017). Fake news detection on social media: A data mining perspective. ACM SIGKDD Explorations Newsletter, 19(1), 22–36. https://doi.org/10.1145/3137597.3137600.CrossRef

Shu, K., Mahudeswaran, D., Wang, S., Lee, D., Liu, H. (2018). Fakenewsnet: A data repository with news content, social context and dynamic information for studying fake news on social media. arXiv preprint arXiv:1809.01286.

Shvachko, K., Kuang, H., Radia, S., & Chansler, R. (2010). The hadoop distributed file system. In Symposium on Mass Storage Systems and Technologies (pp. 1–10). https://doi.org/10.1109/MSST.2010.5496972.CrossRef

Spärck Jones, K., Walker, S., & Robertson, S. E. (2000a). A probabilistic model of information retrieval: development and comparative experiments: Part 1. Information Processing & Management, 36(6), 779–808. https://doi.org/10.1016/S0306-4573(00)00015-7.CrossRef

Spärck Jones, K., Walker, S., & Robertson, S. E. (2000b). A probabilistic model of information retrieval: development and comparative experiments: Part 2. Information Processing & Management, 36(6), 809–840. https://doi.org/10.1016/S0306-4573(00)00016-9.CrossRef

Thusoo, A., Sarma, J. S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., & Murthy, R. (2009). Hive: A warehousing solution over a map-reduce framework. VLDB Endowment, 2(2), 1626–1629. https://doi.org/10.14778/1687553.1687609.CrossRef

Transaction Processing Performance Council (TPC) (2016). TPC express benchmark hs standard specification version 1.4.2.http://www.tpc.org Accessed March 2019.

Transaction Processing Performance Council (TPC) (2019). TPC-DS decision support benchmark 2.10.1.http://www.tpc.org Accessed March 2019.

Truică, C. O., & Darmont, J. (2017). T²K²: The twitter top-k keywords benchmark. In European Conference on Advances in Databases and Information Systems (pp. 21–28). Springer International Publishing. https://doi.org/10.1007/978-3-319-67162-8_3.

Truică, C. O., Darmont, J., & Velcine, J. (2016a). A scalable document-based architecture for text analysis. In International Conference on Advanced Data Mining and Applications (pp. 481–494). Springer. https://doi.org/10.1007/978-3-319-49586-6-33.

Truică, C.O., Rădulescu, F., Boicea, A. (2016b). Comparing different term weighting schemas for topic modeling. In: International Symposium on Symbolic and Numeric Algorithms for Scientific Computing. IEEE. https://doi.org/10.1109/synasc.2016.055.

Truică, C. O., Darmont, J., Boicea, A., & Rădulescu, F. (2018). Benchmarking top-k keyword and top-k document processing with T²K² and T²K²D². Future Generation Computer Systems, 85, 60–75. https://doi.org/10.1016/j.future.2018.02.037.CrossRef

Vavilapalli, V. K., Murthy, A. C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., Saha, B., Curino, C., O’Malley, O., Radia, S., Reed, B., & Baldeschwieler, E. (2013). Apache hadoop yarn: Yet another resource negotiator. In Annual Symposium on Cloud Computing (pp. 5:1–5:16). https://doi.org/10.1145/2523616.2523633.CrossRef

Wang, L., Zhan, J., Luo, C., Zhu, Y., Yang, Q., He, Y., Gao, W., Jia, Z., Shi, Y., Zhang, S., Zheng, C., Lu, G., Zhan, K., Li, X., & Qiu, B. (2014). BigDataBench: A big data benchmark suite from internet services. In IEEE International Symposium on High Performance Computer Architecture (pp. 488–499). https://doi.org/10.1109/HPCA.2014.6835958.CrossRef

Wang, L., Dong, X., Zhang, X., Wang, Y., Ju, T., & Feng, G. (2016). Textgen: a realistic text data content generation method for modern storage system benchmarks. Frontiers of Information Technology & Electronic Engineering, 17(10), 982–993. https://doi.org/10.1631/FITEE.1500332.CrossRef

Wang, X., Ah-Pine, J., & Darmont, J. (2017). Shcoclust, a scalable similarity-based hierarchical co-clustering method and its application to textual collections. In 2017 IEEE International Conference on Fuzzy Systems (pp. 1–6). https://doi.org/10.1109/FUZZ-IEEE.2017.8015720.CrossRef

Yin, J., Chao, D., Liu, Z., Zhang, W., Yu, X., & Wang, J. (2018). Model-based clustering of short text streams. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 2634–2642). ACM Press. https://doi.org/10.1145/3219819.3220094.

Zaharia, M., Xin, R. S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J., Venkataraman, S., Franklin, M. J., Ghodsi, A., Gonzalez, J., Shenker, S., & Stoica, I. (2016). Apache spark: A unified engine for big data processing. Communications of the ACM, 59(11), 56–65. https://doi.org/10.1145/2934664.CrossRef

Zhang, D., Zhai, C., Han, J. (2009). Topic cube: Topic modeling for OLAP on multidimensional text databases. In: SIAM International Conference on Data Mining, pp. 1124–1135. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9781611972795.96

Zhang, D., Zhai, C., & Han, J. (2012). MiTexCube: MicroTextCluster cube for online analysis of text cells and its applications. Statistical Analysis and Data Mining, 6(3), 243–259. https://doi.org/10.1002/sam.11159.CrossRef

Titel: TextBenDS: a Generic Textual Data Benchmark for Distributed Systems
verfasst von: Ciprian-Octavian Truică
Elena-Simona Apostol
Jérôme Darmont
Ira Assent
Publikationsdatum: 06.03.2020
Verlag: Springer US
Erschienen in: Information Systems Frontiers / Ausgabe 1/2021
Print ISSN: 1387-3326
Elektronische ISSN: 1572-9419
DOI: https://doi.org/10.1007/s10796-020-09999-y

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"

Weitere Artikel der Ausgabe 1/2021

NetDER: An Architecture for Reasoning About Malicious Behavior

A Twitter-Based Study of the European Internet of Things

An Approach to Extracting Topic-guided Views from the Sources of a Data Lake

Quarry: A User-centered Big Data Integration Platform

Breakthroughs on Cross-Cutting Data Management, Data Analytics, and Applied Data Science

GarNLP: A Natural Language Processing Pipeline for Garnishment Documents