ABSTRACT
Publicly available social media archives facilitate research in the social sciences and provide corpora for training and testing a wide range of machine learning and natural language processing methods. In the context of the recent outbreak of coronavirus disease 2019 (COVID-19), online discourse on Twitter reflects public opinion and perception of the pandemic itself as well as of mitigation measures and their societal impact. Understanding such discourse, its evolution, and its interdependencies with real-world events or (mis)information can yield valuable insights. At the same time, such corpora are crucial for computational methods addressing tasks such as sentiment analysis, event detection, or entity recognition. However, obtaining, archiving, and semantically annotating large numbers of tweets is costly. In this paper, we describe TweetsCOV19, a publicly available knowledge base of currently more than 8 million tweets, spanning October 2019 to April 2020. Metadata about the tweets as well as extracted entities, hashtags, user mentions, sentiments, and URLs are exposed using established RDF/S vocabularies, providing an unprecedented knowledge base for a range of knowledge discovery tasks. In addition to describing the dataset and its extraction and annotation process, we present an initial analysis and use cases of the corpus.
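To make the idea of exposing tweet metadata through established RDF/S vocabularies more concrete, the following is a minimal sketch of how a single annotated tweet could be serialized as N-Triples. It is an illustration only and not the dataset's documented schema: the choice of properties (sioc:Post, sioc:id, dcterms:created, schema:mentions, schema:citation), the `http://example.org/tweet/` namespace, the function `tweet_to_ntriples`, and the sample tweet are all assumptions made for this example.

```python
# Hypothetical sketch: serialize one annotated tweet as N-Triples.
# The vocabulary terms chosen here are assumptions for illustration,
# not the mapping actually used by TweetsCOV19.

SIOC = "http://rdfs.org/sioc/ns#"
SCHEMA = "http://schema.org/"
DCTERMS = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"
BASE = "http://example.org/tweet/"  # placeholder namespace (assumption)


def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value, datatype=None):
    """Build a quoted literal, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    if datatype:
        return f'"{escaped}"^^<{datatype}>'
    return f'"{escaped}"'


def tweet_to_ntriples(tweet):
    """Return a list of N-Triples lines describing one tweet dictionary."""
    subject = iri(BASE + tweet["id"])
    triples = [
        (subject, iri(RDF_TYPE), iri(SIOC + "Post")),
        (subject, iri(SIOC + "id"), literal(tweet["id"])),
        (subject, iri(DCTERMS + "created"),
         literal(tweet["created"], XSD_DATETIME)),
    ]
    # One triple per linked entity and per URL found in the tweet.
    for entity in tweet["entities"]:
        triples.append((subject, iri(SCHEMA + "mentions"), iri(entity)))
    for url in tweet["urls"]:
        triples.append((subject, iri(SCHEMA + "citation"), iri(url)))
    return [f"{s} {p} {o} ." for s, p, o in triples]


if __name__ == "__main__":
    # Invented sample record; the values do not come from the dataset.
    example = {
        "id": "1234567890",
        "created": "2020-03-15T12:00:00Z",
        "entities": ["http://dbpedia.org/resource/Coronavirus"],
        "urls": ["https://example.org/article"],
    }
    for line in tweet_to_ntriples(example):
        print(line)
```

Emitting one triple per line keeps the output streamable, which matters for corpora with millions of tweets; a real pipeline would more likely rely on an RDF library than on hand-built strings.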
Index Terms
- TweetsCOV19 - A Knowledge Base of Semantically Annotated Tweets about the COVID-19 Pandemic