Published in: Information Systems Frontiers 1/2022

4 November 2020

Analyzing the Quality of Twitter Data Streams

Authors: Franco Arolfo, Kevin Cortés Rodriguez, Alejandro Vaisman

Abstract

It is widely believed that the quality of Twitter data streams is low and unpredictable, which makes decisions based on such data unreliable. The work presented here addresses this problem from a Data Quality (DQ) perspective, adapting the traditional methods used in relational databases, which are based on quality dimensions and metrics, to capture the characteristics of Twitter data streams in particular and of Big Data in general. As a first contribution, this paper redefines the classic DQ dimensions and metrics for the scenario under study. Second, the paper introduces a software tool that captures Twitter data streams in real time, computes their DQ, and displays the results through a wide variety of charts. Third, using this tool, the paper performs a thorough analysis of the DQ of Twitter streams based on four dimensions: Readability, Completeness, Usefulness, and Trustworthiness. These dimensions are studied for several cases: unfiltered data streams, data streams filtered using a collection of keywords, and tweets classified by topic, with the DQ computed for each topic. Further, although it is well known that the number of geolocated tweets is very low, the paper studies the DQ of tweets with respect to the place from which they are posted. Finally, the tool allows changing the weight of each quality dimension used to compute the overall data quality of a tweet, so that weights can be defined to fit different analysis contexts and/or user profiles. Interestingly, the study reveals that the quality of Twitter streams is higher than expected.
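
The idea of weighting quality dimensions can be illustrated with a minimal sketch. The four dimension names come from the abstract; everything else here is an assumption made for illustration: the abstract does not give the paper's actual formula, so a normalized weighted average is used as a stand-in, and the scores and weights are invented values in [0, 1].

```python
from typing import Dict

# Dimension names are taken from the abstract. The weighted-average formula,
# the example scores, and the example weights are illustrative assumptions.
DIMENSIONS = ("readability", "completeness", "usefulness", "trustworthiness")


def overall_quality(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-dimension scores in [0, 1] into a normalized weighted average."""
    for d in DIMENSIONS:
        if not 0.0 <= scores[d] <= 1.0:
            raise ValueError(f"score for {d} must lie in [0, 1]")
        if weights[d] < 0:
            raise ValueError(f"weight for {d} must be non-negative")
    total_weight = sum(weights[d] for d in DIMENSIONS)
    if total_weight == 0:
        raise ValueError("at least one weight must be positive")
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total_weight


# Hypothetical scores for a single tweet.
scores = {
    "readability": 0.8,
    "completeness": 0.5,
    "usefulness": 0.6,
    "trustworthiness": 0.9,
}

# Two hypothetical user profiles: one neutral, one that favours trustworthiness.
equal = {d: 1.0 for d in DIMENSIONS}
trust_heavy = {**equal, "trustworthiness": 3.0}

print(overall_quality(scores, equal))
print(overall_quality(scores, trust_heavy))
```

Because this tweet scores highest on Trustworthiness, the profile that emphasizes that dimension yields a higher overall value than the neutral profile, which is the kind of context-dependent result that adjustable weights are meant to produce.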

Footnotes
2
The implementation is available at http://dataquality.it.ita.edu.arb/, and can be used with credentials usertest/usertest.
 
5
Kafka records are organized into topics: a Kafka topic is a named feed to which records are published and in which they are stored. Producer applications write data to topics, and consumer applications read from them.
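
The producer and consumer roles described in this footnote can be sketched with a toy in-memory model. This is not the Kafka client API and it omits partitions, consumer groups, and persistence; the class `InMemoryBroker`, its methods, and the topic names are hypothetical and serve only to show records being appended to a named topic and read back by offset.

```python
from collections import defaultdict
from typing import Dict, List


class InMemoryBroker:
    """Toy stand-in for a broker: each topic is an append-only list of records."""

    def __init__(self) -> None:
        self._topics: Dict[str, List[str]] = defaultdict(list)

    def publish(self, topic: str, record: str) -> int:
        """Producer side: append a record to a topic and return its offset."""
        log = self._topics[topic]
        log.append(record)
        return len(log) - 1

    def read(self, topic: str, offset: int = 0) -> List[str]:
        """Consumer side: return every record stored at or after `offset`."""
        return list(self._topics.get(topic, [])[offset:])


broker = InMemoryBroker()

# Producers write to topics (topic names are hypothetical).
broker.publish("tweets-sports", "tweet A")
broker.publish("tweets-sports", "tweet B")
broker.publish("tweets-politics", "tweet C")

# A consumer reads one topic from the beginning, another resumes from offset 1.
print(broker.read("tweets-sports"))
print(broker.read("tweets-sports", offset=1))
```

Reading does not remove records from the topic, so several consumers can read the same feed independently, each tracking its own offset.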
 
Metadata
Title
Analyzing the Quality of Twitter Data Streams
Authors
Franco Arolfo
Kevin Cortés Rodriguez
Alejandro Vaisman
Publication date
4 November 2020
Publisher
Springer US
Published in
Information Systems Frontiers / Issue 1/2022
Print ISSN: 1387-3326
Electronic ISSN: 1572-9419
DOI
https://doi.org/10.1007/s10796-020-10072-x
