2022 | Original Paper | Book Chapter

The Need of Standardised Metadata to Encode Causal Relationships: Towards Safer Data-Driven Machine Learning Biological Solutions

Written by: Beatriz Garcia Santa Cruz, Carlos Vega, Frank Hertel

Published in: Computational Intelligence Methods for Bioinformatics and Biostatistics

Publisher: Springer International Publishing

Abstract

In this paper, we discuss why causal relations should be considered when developing machine learning solutions, so as to avoid factors that hamper the robustness and generalisation capacity of the models, such as induced biases. These problems often arise when the algorithm's decision is affected by confounding factors. We argue that encoding research assumptions as causal relationships can help identify potential confounders. Combined with metadata, this encoding can enable meta-comparison of data acquisition pipelines. We call for standardised meta-information practices as a crucial step towards the proper development, validation, and data sharing of machine learning solutions. Such practices include detailing the data acquisition process, with the aim of automatically integrating causal relationships and actionable metadata.
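As a rough illustration of the idea, and not the authors' method, the sketch below stores causal assumptions as (cause, effect) pairs inside a dataset's metadata and lists potential confounders of one feature-outcome pair. The record layout, the field name `causal_assumptions`, the variable names, and the helper functions are all hypothetical; a variable is flagged here if it has a directed path to the feature and a directed path to the outcome that does not pass through the feature.

```python
from collections import defaultdict

# Hypothetical metadata record; field names are illustrative, not a standard.
metadata = {
    "dataset": "example-cohort",
    "causal_assumptions": [  # each pair is (cause, effect), as assumed by the researchers
        ("age", "biomarker"),
        ("age", "diagnosis"),
        ("site", "batch"),        # each site processes samples in its own batches
        ("batch", "biomarker"),   # batch effects shift the measured biomarker
        ("site", "diagnosis"),    # sites differ in disease prevalence
        ("biomarker", "diagnosis"),
    ],
}

def reachable(edges, start, blocked=frozenset()):
    """Nodes reachable from `start` along directed edges, never entering `blocked` nodes."""
    children = defaultdict(set)
    for cause, effect in edges:
        children[cause].add(effect)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in children[node]:
            if nxt not in seen and nxt not in blocked:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def potential_confounders(edges, feature, outcome):
    """Variables that reach the feature and also reach the outcome without going through the feature."""
    nodes = {n for edge in edges for n in edge} - {feature, outcome}
    return sorted(
        n for n in nodes
        if feature in reachable(edges, n)
        and outcome in reachable(edges, n, blocked={feature})
    )

print(potential_confounders(metadata["causal_assumptions"], "biomarker", "diagnosis"))
# → ['age', 'site']
```

In this toy graph `batch` is not flagged, because its only route to the outcome runs through the feature itself. Two datasets annotated in this way could be compared by their assumed graphs before their data are pooled.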

Metadata
Title
The Need of Standardised Metadata to Encode Causal Relationships: Towards Safer Data-Driven Machine Learning Biological Solutions
Authors
Beatriz Garcia Santa Cruz
Carlos Vega
Frank Hertel
Copyright year
2022
DOI
https://doi.org/10.1007/978-3-031-20837-9_16
