
2021 | OriginalPaper | Book chapter

Using Event-Based Web-Scraping Methods and Bidirectional Transformers to Characterize COVID-19 Outbreaks in Food Production and Retail Settings

Written by: Joseph Miano, Charity Hilton, Vasu Gangrade, Mary Pomeroy, Jacqueline Siven, Michael Flynn, Frances Tilashalski

Published in: Artificial Intelligence in Medicine

Publisher: Springer International Publishing


Abstract

Current surveillance methods may not capture the full extent of COVID-19 spread in high-risk settings such as food establishments. We therefore propose a new surveillance method that identifies COVID-19 cases among food establishment workers from news reports via web-scraping and natural language processing (NLP). First, we used web-scraping to identify a broad set of articles (n = 67,078) related to COVID-19 based on keyword mentions. From this dataset, we used an open-source NLP platform (ClarityNLP) to automatically extract location, industry, and case and death counts. These articles were vetted and validated by CDC subject matter experts (SMEs) to identify those reporting COVID-19 outbreaks in food establishments. CDC and Georgia Tech Research Institute SMEs provided a human-labeled test dataset of 388 articles to validate our algorithms. Then, to improve quality, we fine-tuned a pretrained RoBERTa instance, a bidirectional transformer language model, to classify articles reporting \(\ge 1\) positive COVID-19 case in food establishments. Applying RoBERTa reduced the number of articles from 67,078 to 1,112 and classified articles reporting \(\ge 1\) positive COVID-19 case in food establishments with 88% accuracy on the human-labeled test dataset. By automating the pipeline of web-scraping and COVID-19 case prediction with RoBERTa, we enable an efficient human-in-the-loop process in which COVID-19 data can be collected manually from the articles flagged by our model, reducing the human labor required. Furthermore, our approach could be used to predict and monitor locations of COVID-19 development by geography, and it could be extended to other industries and news article datasets of interest.
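The two-stage flow described in the abstract (a broad keyword filter, then a classifier that narrows the set for human review, evaluated by accuracy against human labels) can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration only: the keyword lists, the `Article` type, and in particular `rule_based_stand_in`, which is a trivial placeholder and not the fine-tuned RoBERTa model used by the authors.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical keyword lists for illustration; not the paper's actual keywords.
COVID_TERMS = ("covid-19", "coronavirus")
FOOD_TERMS = ("meatpacking", "poultry", "grocery", "restaurant")


@dataclass
class Article:
    url: str
    text: str


def mentions_any(text: str, terms: Sequence[str]) -> bool:
    """Case-insensitive substring match against a list of terms."""
    lowered = text.lower()
    return any(term in lowered for term in terms)


def keyword_filter(articles: List[Article]) -> List[Article]:
    """Stage 1: keep the broad set of articles that mention COVID-19."""
    return [a for a in articles if mentions_any(a.text, COVID_TERMS)]


def rule_based_stand_in(article: Article) -> bool:
    """Placeholder classifier (NOT RoBERTa): flags articles that mention
    a food setting and the word 'positive'."""
    return (mentions_any(article.text, FOOD_TERMS)
            and "positive" in article.text.lower())


def flag_for_review(articles: List[Article],
                    classifier: Callable[[Article], bool]) -> List[Article]:
    """Stage 2: keep only articles the classifier marks as positive,
    so that human reviewers see a much smaller set."""
    return [a for a in articles if classifier(a)]


def accuracy(predictions: Sequence[bool], labels: Sequence[bool]) -> float:
    """Fraction of predictions that agree with the human labels."""
    if len(predictions) != len(labels) or len(labels) == 0:
        raise ValueError("predictions and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


if __name__ == "__main__":
    articles = [
        Article("a", "Twelve workers at a poultry plant tested positive for COVID-19."),
        Article("b", "City council debates new parking rules."),
        Article("c", "COVID-19 vaccine trial begins at university hospital."),
    ]
    broad = keyword_filter(articles)
    flagged = flag_for_review(broad, rule_based_stand_in)
    print([a.url for a in broad], [a.url for a in flagged])
```

Passing the classifier as a callable keeps the pipeline independent of the model, so a fine-tuned transformer could be swapped in for the placeholder without changing the filtering or evaluation code.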


Footnotes
2
At the time we trained the RoBERTa model, we evaluated just over 67,000 articles.
 
3
The US map, available at https://www.google.com/maps/d/u/0/viewer?mid=1ymY4bzI70AOCeFzRYvfe4HPWVvgPBoJh&ll=45.80359787060013%2C-114.35715944999998&z=4, shows the food-setting locations found through manual validation of news articles. The legend on the left of the map shows the different types of food settings based on NAICS codes. Clicking a location on the map reveals more details about that facility, including the title, a link to the article, and descriptors such as case and death counts.
 
4
We developed a public dashboard of news article keywords about COVID-19 in food processing facilities from March 15 to September 30, 2020. It is available at https://public.tableau.com/profile/charity.hilton#!/vizhome/COVID-19NewsReportsaboutFoodSettings/COVID-19NewsMap.
 
Metadata
Title
Using Event-Based Web-Scraping Methods and Bidirectional Transformers to Characterize COVID-19 Outbreaks in Food Production and Retail Settings
Written by
Joseph Miano
Charity Hilton
Vasu Gangrade
Mary Pomeroy
Jacqueline Siven
Michael Flynn
Frances Tilashalski
Copyright year
2021
DOI
https://doi.org/10.1007/978-3-030-77211-6_21