TweetsCOV19 - A Knowledge Base of Semantically Annotated Tweets about the COVID-19 Pandemic

Authors:
Dimitar Dimitrov (GESIS - Leibniz Institute for the Social Sciences, Cologne, Germany)
Erdal Baran (GESIS - Leibniz Institute for the Social Sciences, Cologne, Germany)
Pavlos Fafalios (Institute of Computer Science, FORTH-ICS, Heraklion, Greece)
Ran Yu (GESIS - Leibniz Institute for the Social Sciences, Cologne, Germany)
Xiaofei Zhu (Chongqing University of Technology, Chongqing, China)
Matthäus Zloch (GESIS - Leibniz Institute for the Social Sciences, Cologne, Germany)
Stefan Dietze (GESIS - Leibniz Institute for the Social Sciences, Heinrich-Heine-University, & L3S Research Center, Cologne, Düsseldorf & Hannover, Germany)

CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, October 2020, Pages 2991–2998. https://doi.org/10.1145/3340531.3412765

Published: 19 October 2020

ABSTRACT

Publicly available social media archives facilitate research in the social sciences and provide corpora for training and testing a wide range of machine learning and natural language processing methods. With respect to the recent outbreak of the Coronavirus disease 2019 (COVID-19), online discourse on Twitter reflects public opinion and perception related to the pandemic itself as well as mitigating measures and their societal impact. Understanding such discourse, its evolution, and interdependencies with real-world events or (mis)information can foster valuable insights. On the other hand, such corpora are crucial facilitators for computational methods addressing tasks such as sentiment analysis, event detection, or entity recognition. However, obtaining, archiving, and semantically annotating large amounts of tweets is costly. In this paper, we describe TweetsCOV19, a publicly available knowledge base of currently more than 8 million tweets, spanning October 2019 - April 2020. Metadata about the tweets as well as extracted entities, hashtags, user mentions, sentiments, and URLs are exposed using established RDF/S vocabularies, providing an unprecedented knowledge base for a range of knowledge discovery tasks. Next to a description of the dataset and its extraction and annotation process, we present an initial analysis and use cases of the corpus.
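
To make the idea of exposing tweet metadata as RDF more concrete, the following is a minimal sketch that serializes one tweet record into N-Triples using only the Python standard library. It is not the schema of TweetsCOV19: the abstract does not state which properties are used, so the choice of sioc:Post, dcterms:created, schema:keywords, schema:citation, and schema:mentions is an assumption, as are the example.org identifiers, the sample record, and the helper names. Sentiments and user mentions are omitted for brevity.

```python
# Illustrative sketch only. The namespaces below are real vocabularies, but
# the mapping from tweet fields to properties is an assumption and is not
# taken from the TweetsCOV19 paper.
SIOC = "http://rdfs.org/sioc/ns#"
SCHEMA = "http://schema.org/"
DCT = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD = "http://www.w3.org/2001/XMLSchema#"


def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value, datatype=None):
    """Render a quoted literal, escaping backslashes and double quotes."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    suffix = f"^^<{datatype}>" if datatype else ""
    return f'"{escaped}"{suffix}'


def tweet_to_ntriples(tweet):
    """Return a list of N-Triples lines describing one tweet record."""
    # Hypothetical identifier scheme for the tweet resource.
    subject = iri(f"http://example.org/tweet/{tweet['id']}")
    triples = [
        (subject, iri(RDF_TYPE), iri(SIOC + "Post")),
        (subject, iri(DCT + "created"),
         literal(tweet["created"], XSD + "dateTime")),
    ]
    for tag in tweet["hashtags"]:
        triples.append((subject, iri(SCHEMA + "keywords"), literal(tag)))
    for url in tweet["urls"]:
        triples.append((subject, iri(SCHEMA + "citation"), iri(url)))
    for entity in tweet["entities"]:
        triples.append((subject, iri(SCHEMA + "mentions"), iri(entity)))
    return [f"{s} {p} {o} ." for s, p, o in triples]


# Hypothetical sample record, not drawn from the dataset.
sample = {
    "id": "1",
    "created": "2020-03-15T12:00:00Z",
    "hashtags": ["covid19"],
    "urls": ["https://example.org/news"],
    "entities": ["http://dbpedia.org/resource/Coronavirus"],
}

lines = tweet_to_ntriples(sample)
for line in lines:
    print(line)
```

Each output line is one subject-predicate-object statement, so the result can be loaded into any triple store and queried, for example, for all tweets that mention a given entity.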

Index Terms

Information systems

Published in
CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management
October 2020
3619 pages
ISBN:9781450368599
DOI:10.1145/3340531
General Chairs:
Mathieu d'Aquin (DSI, Insight, NUI Galway, Ireland)
Stefan Dietze (GESIS, Cologne, Germany, Heinrich-Heine-University Düsseldorf, Germany, L3S Research Center, Germany)
Program Chairs:
Claudia Hauff (TU Delft, The Netherlands)
Edward Curry (DSI, Insight, NUI Galway, Ireland)
Philippe Cudre Mauroux (eXascale, University of Fribourg, Switzerland)
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
coronavirus
covid-19
entity linking
rdf
sentiment analysis
social media archives
twitter
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%