2022 | OriginalPaper | Book chapter

Information Extraction from Social Media: A Hands-On Tutorial on Tasks, Data, and Open Source Tools

Written by: Shubhanshu Mishra, Rezvaneh Rezapour, Jana Diesner

Published in: Advances in Information Retrieval

Publisher: Springer International Publishing

Abstract

Information extraction (IE) is a common sub-area of natural language processing that focuses on identifying structured data in unstructured data. The Information Retrieval (IR) community relies on accurate and high-performance IE to retrieve high-quality results from massive datasets. One example of IE is identifying named entities in a text, e.g., “Barack Obama served as the president of the USA”. Here, Barack Obama and USA are named entities of types PERSON and LOCATION, respectively. Another example is identifying the sentiment expressed in a text, e.g., “This movie was awesome”. Here, the sentiment expressed is positive. A final example is identifying various linguistic aspects of a text, e.g., part-of-speech tags, noun phrases, and dependency parses, which can serve as features for additional IE tasks. This tutorial introduces participants to a) the usage of Python-based, open-source tools that support IE from social media data (mainly Twitter), and b) best practices for ensuring the reproducibility of research. Participants will learn and practice various semantic and syntactic IE techniques that are commonly used for analyzing tweets. Additionally, participants will be familiarized with the landscape of publicly available tweet data, and with methods for collecting and preparing them for analysis. Finally, participants will be trained to use a suite of open-source tools (SAIL for active learning, TwitterNER for named entity recognition, and SocialMediaIE for multi-task learning), which use advanced machine learning techniques (e.g., deep learning, active learning with human-in-the-loop, multilingual, and multi-task learning) to perform IE on their own or existing datasets. Participants will also learn how social context can be integrated into IE systems to improve them.
The tools introduced in the tutorial will focus on the three main stages of IE, namely, collection of data (including annotation), data processing and analytics, and visualization of the extracted information. More details can be found at: https://socialmediaie.github.io/tutorials/.
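To make the two examples from the abstract concrete, the following is a minimal Python sketch of what "structured data from unstructured data" can look like. It is not the method of SAIL, TwitterNER, or SocialMediaIE, which rely on machine learning; it swaps in a plain dictionary lookup, and the names `ENTITY_GAZETTEER`, `SENTIMENT_LEXICON`, `extract_entities`, and `classify_sentiment` are hypothetical and introduced only for illustration.

```python
import re

# Hypothetical toy resources, invented for this illustration only.
ENTITY_GAZETTEER = {"Barack Obama": "PERSON", "USA": "LOCATION"}
SENTIMENT_LEXICON = {"awesome": "positive", "terrible": "negative"}


def extract_entities(text):
    """Return (surface form, type, start offset) for each gazetteer match."""
    found = []
    for name, label in ENTITY_GAZETTEER.items():
        # Word boundaries avoid matching a name inside a longer word.
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((name, label, match.start()))
    # Report entities in the order they occur in the text.
    return sorted(found, key=lambda item: item[2])


def classify_sentiment(text):
    """Return the label of the first lexicon word found, else 'neutral'."""
    for token in re.findall(r"[a-z']+", text.lower()):
        if token in SENTIMENT_LEXICON:
            return SENTIMENT_LEXICON[token]
    return "neutral"


if __name__ == "__main__":
    print(extract_entities("Barack Obama served as the president of the USA"))
    print(classify_sentiment("This movie was awesome"))
```

A lookup like this only recognizes strings it already knows, which is one reason learned models are typically preferred for noisy text such as tweets.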

Metadata
Title
Information Extraction from Social Media: A Hands-On Tutorial on Tasks, Data, and Open Source Tools
Written by
Shubhanshu Mishra
Rezvaneh Rezapour
Jana Diesner
Copyright year
2022
DOI
https://doi.org/10.1007/978-3-030-99739-7_74