
2025 | Open Access | Book

AI in Drug Discovery

First International Workshop, AIDD 2024, Held in Conjunction with ICANN 2024, Lugano, Switzerland, September 19, 2024, Proceedings


About this book

This open access book constitutes the refereed proceedings of the First International Workshop on AI in Drug Discovery, AIDD 2024, held as part of the 33rd International Conference on Artificial Neural Networks, ICANN 2024, in Lugano, Switzerland, on September 19, 2024.

The 12 papers presented here were carefully reviewed and selected for these open access proceedings. These papers focus on various aspects of the rapidly evolving field of Artificial Intelligence (AI)-driven drug discovery in chemistry, including big data and advanced machine learning, eXplainable AI (XAI), chemoinformatics, the use of deep learning to predict molecular properties, the modeling and prediction of chemical reaction data, and generative models.

Table of Contents

Frontmatter

Open Access

Enhancing Interpretability in Molecular Property Prediction with Contextual Explanations of Molecular Graphical Depictions
Abstract
The field of explainable AI applied to molecular property prediction models has often been reduced to deriving atomic contributions. This has impaired the interpretability of such models, as chemists rather think in terms of larger, chemically meaningful structures, which often do not simply reduce to the sum of their atomic constituents. We develop an explanatory strategy yielding both local and more complex structural attributions. We derive such contextual explanations in pixel space, exploiting the property that a molecule is not merely encoded through a collection of atoms and bonds, as is the case for string- or graph-based approaches. We provide evidence that the proposed explanation method satisfies desirable properties, namely sparsity and invariance with respect to the molecule’s symmetries, to a larger degree than the SMILES-based counterpart model. Nonetheless, they correlate as expected with these string-based explanations as well as with ground truths, when available. Contextual explanations thus maintain the accuracy of the original explanations while improving their interpretability.
Marco Bertolini, Linlin Zhao, Floriane Montanari, Djork-Arné Clevert

Open Access

Temporal Evaluation of Probability Calibration with Experimental Errors
Abstract
The quantification of uncertainties associated with neural network predictions can facilitate optimal decision-making and accelerate workflows where time and resource efficiency are essential.
Hannah Rosa Friesacher, Emma Svensson, Adam Arany, Lewis Mervin, Ola Engkvist

Open Access

Curating Reagents in Chemical Reaction Data with an Interactive Reagent Space Map
Abstract
The increasing use of machine learning and artificial intelligence in chemical reaction studies demands high-quality reaction data, necessitating specialized tools enabling data understanding and curation. Our work introduces a novel methodology for reaction data examination centered on reagents - essential molecules in reactions that do not contribute atoms to products. We propose an intuitive tool for creating interactive reagent space maps using distributed vector representations, akin to word2vec in Natural Language Processing, capturing the statistics of reagent usage within datasets. Our approach enables swift assessment of reagent action patterns and identification of erroneous reagent entries, which we demonstrate using the USPTO dataset. Our contributions include an open-source web application for visual reagent pattern analysis and a table cataloging around six hundred of the most frequent reagents in USPTO annotated with detailed roles. Our method aims to support organic chemists and cheminformatics experts in routine reaction data curation.
Mikhail Andronov, Natalia Andronova, Michael Wand, Jürgen Schmidhuber, Djork-Arné Clevert

Open Access

Latent-Conditioned Equivariant Diffusion for Structure-Based De Novo Ligand Generation
Abstract
We propose PoLiGenX for de novo ligand design using latent-conditioned, target-aware equivariant diffusion. Our model leverages the conditioning of the generation process on reference molecules within a protein pocket to produce shape-similar de novo ligands that can be used for target-aware hit expansion and hit optimization. The results of our study showcase the efficacy of PoLiGenX in ligand design. Docking scores indicate that the generated ligands exhibit superior binding affinity compared to the reference molecule while preserving the shape. At the same time, our model maintains chemical diversity, ensuring the exploration of diverse chemical space. The evaluation of Lipinski’s rule of five suggests that the sampled molecules possess a higher drug-likeness than the reference data. This constitutes an important step towards the controlled generation of therapeutically relevant de novo ligands tailored to specific protein targets.
Julian Cremer, Tuan Le, Djork-Arné Clevert, Kristof T. Schütt

Open Access

Leveraging Quantum Mechanical Properties to Predict Solvent Effects on Large Drug-Like Molecules
Abstract
Understanding how solvation affects structure-property and property-property relationships of drug-like molecules is crucial for de novo design, as most relevant reactions occur in aqueous environments. We have thus performed an exhaustive analysis of the recently proposed Aquamarine dataset to gain insights into the effect of solvent-molecule interaction on the quantum-mechanical (QM) properties of large drug-like molecules. Our results show that the inclusion of an implicit solvent model of water changes the values of (extensive and intensive) QM properties but does not alter the correlations among them. Moreover, we have found that solvation can limit the identification of unique molecular conformations, with variations in specific properties being rationalized by the extent of structural changes. A \(\varDelta\)-learning approach was used to predict solvent effects on the dipole moment \(\mu\) and the many-body dispersion energy \(E_\textrm{MBD}\), resulting in more accurate and scalable predictive models compared to those directly trained on solvated properties. Hence, our work provides valuable insights into the effect of solvent-molecule interaction on physicochemical properties, which could assist in the development of machine-learning models for designing solvated molecules of pharmaceutical and biological relevance.
Mathias Hilfiker, Leonardo Medrano Sandonas, Marco Klähn, Ola Engkvist, Alexandre Tkatchenko
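
The \(\varDelta\)-learning idea mentioned in the abstract above can be written compactly. This is a generic sketch rather than the authors' exact formulation: the symbols \(P_\textrm{ref}\), \(P_\textrm{solv}\) and \(f_\theta\) are introduced here for illustration, and taking the unsolvated calculation as the reference is an assumption.

\[
\varDelta P(x) = P_\textrm{solv}(x) - P_\textrm{ref}(x), \qquad f_\theta(x) \approx \varDelta P(x), \qquad \hat{P}_\textrm{solv}(x) = P_\textrm{ref}(x) + f_\theta(x)
\]

Here \(x\) is a molecular conformation, \(P\) stands for a property such as \(\mu\) or \(E_\textrm{MBD}\), and the model \(f_\theta\) is trained on the solvent-induced difference instead of on the solvated value itself.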

Open Access

Towards Interpretable Models of Chemist Preferences for Human-in-the-Loop Assisted Drug Discovery
Abstract
In recent years, there has been growing interest in leveraging human preferences for drug discovery to build models that capture chemists’ intuition for de novo molecular design, lead optimization, and prioritization for experimental validation. However, existing models derived from human preferences in chemistry are often black boxes, lacking interpretability regarding how humans form their preferences. Enhancing transparency in human-in-the-loop learning is crucial to ensure that such approaches in drug discovery are not unduly affected by subjective bias, noise or inconsistency. Moreover, interpretability can promote the development and use of multi-user models in drug design projects, integrating multiple expert perspectives and insights into multi-objective optimization frameworks for de novo molecular design. This also allows for assigning more or less weight to experts based on their knowledge of specific properties. In this paper, we present a methodology for decomposing human preferences based on binary responses (like/dislike) to molecules essentially proposed by generative chemistry models, and inferring interpretable preference models that represent human reasoning. Our approach aims to bridge the gap between human-in-the-loop learning and user model interpretability in drug discovery applications, providing a transparent framework that elucidates how human judgments can shape molecular design outcomes.
Yasmine Nahal, Markus Heinonen, Mikhail Kabeshov, Jon Paul Janet, Eva Nittinger, Ola Engkvist, Samuel Kaski

Open Access

Atom-Level Quantum Pretraining Enhances the Spectral Perception of Molecular Graphs in Graphormer
Abstract
This study explores the impact of pretraining Graph Transformers using atom-level quantum-mechanical features for molecular property modeling. We utilize the ADMET Therapeutic Data Commons datasets to evaluate the benefits of this approach. Our results show that pretraining on quantum atomic properties improves the performance of the Graphormer model. We conduct comparisons with two other pretraining strategies: one based on molecular quantum properties (specifically the HOMO-LUMO gap) and another using a self-supervised atom masking technique. Additionally, we employ a spectral analysis of Attention Rollout matrices to understand the underlying reasons for these performance enhancements. Our findings suggest that models pretrained on atom-level quantum mechanics are better at capturing low-frequency Laplacian eigenmodes from the molecular graphs, which correlates with improved outcomes on most evaluated downstream tasks, as measured by our custom metric.
Alessio Fallani, José Arjona-Medina, Konstantin Chernichenko, Ramil Nugmanov, Jörg Kurt Wegner, Alexandre Tkatchenko

Open Access

Balancing Imbalanced Toxicity Models: Using MolBERT with Focal Loss
Abstract
Drug-induced liver injury (DILI) presents a multifaceted challenge, influenced by interconnected biological mechanisms. Current DILI datasets are characterized by small sizes and high imbalance, posing difficulties in learning robust representations and accurate modeling. To address these challenges, we trained a multi-modal multi-task model integrating preclinical histopathologies, biochemistry (blood markers), and clinical DILI-related adverse drug reactions (ADRs). Leveraging pretrained BERT models, we extracted representations covering a broad chemical space, facilitating robust learning in both frozen and fine-tuned settings. To address imbalanced data, we explored weighted Binary Cross-Entropy (w-BCE) and weighted Focal Loss (w-FL). Our results demonstrate that the frozen BERT model consistently enhances performance across all metrics and modalities with weighted loss functions compared to their non-weighted counterparts. However, the efficacy of fine-tuning BERT varies across modalities, yielding inconclusive results. In summary, the incorporation of BERT features with weighted loss functions demonstrates advantages, while the efficacy of fine-tuning remains uncertain.
Muhammad Arslan Masood, Samuel Kaski, Hugo Ceulemans, Dorota Herman, Markus Heinonen
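
To make the two loss functions named in the abstract above concrete, here is a minimal per-example sketch in plain Python. It uses the commonly cited focal-loss form with a class weight; the function names and the `alpha` and `gamma` defaults are illustrative assumptions, and the weighting scheme actually used in the paper may differ.

```python
import math


def weighted_focal_loss(p, y, alpha=0.75, gamma=2.0, eps=1e-12):
    """Weighted focal loss for a single binary example (illustrative sketch).

    p     : predicted probability of the positive class
    y     : true label, 0 or 1
    alpha : weight of the positive class (negatives get 1 - alpha); assumed value
    gamma : focusing parameter; gamma = 0 recovers weighted BCE
    """
    p = min(max(p, eps), 1.0 - eps)           # clip to keep log() finite
    p_t = p if y == 1 else 1.0 - p            # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t)^gamma shrinks the loss of examples that are already well classified
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def weighted_bce(p, y, alpha=0.75, eps=1e-12):
    """Weighted binary cross-entropy, i.e. the focal loss with gamma = 0."""
    return weighted_focal_loss(p, y, alpha=alpha, gamma=0.0, eps=eps)


if __name__ == "__main__":
    # A confidently correct positive is down-weighted far more by the focal term.
    print(weighted_bce(0.9, 1), weighted_focal_loss(0.9, 1))
    # A badly misclassified positive keeps most of its loss.
    print(weighted_bce(0.1, 1), weighted_focal_loss(0.1, 1))
```

With `gamma = 0` the two functions coincide, which makes the focusing term easy to isolate when comparing the two losses on an imbalanced dataset.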

Open Access

Registries in Machine Learning-Based Drug Discovery: A Shortcut to Code Reuse
Abstract
Computer-aided drug discovery gradually builds on previous work and requires reusable code to advance research. Currently, research code is mainly used to provide further insights into the original research whilst code reuse has a lower priority. Modularity, the segmentation of code into independent modules, promotes good coding practices and code reuse. The registry pattern has been proposed as a way to call functionalities dynamically, but it is currently overlooked as a shortcut to promote code reuse. In this work, we expand the registry pattern to better suit computer-aided drug discovery and achieve a unified, reusable, and interchangeable interface with optional meta information. Our reformulated pattern is particularly suitable for collaborative research with standardized frameworks where multiple internal and external modules are used interchangeably and coding is more focused on fast iteration than on code with low technical debt, such as in machine learning-based research for drug discovery. In a workflow, we exemplify the usage of the design patterns. Additionally, we provide two case studies where we 1) showcase the effectiveness of registration in a larger collaborative research group, and 2) overview the potential of registration in currently available open-source tools. Finally, we empirically evaluate the registry pattern through previous implementations and indicate where additional functionality can improve its use.
Peter B. R. Hartog, Emma Svensson, Lewis Mervin, Samuel Genheden, Ola Engkvist, Igor V. Tetko
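
For readers unfamiliar with the registry pattern described in the abstract above, the following is a minimal generic sketch in Python. It is not the authors' implementation: the `Registry` class, the `FEATURIZERS` instance, the toy featurizers and the metadata fields are all hypothetical names chosen for illustration.

```python
from typing import Any, Callable, Dict, List


class Registry:
    """Maps string keys to callables, with optional metadata per entry."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._entries: Dict[str, Callable[..., Any]] = {}
        self._meta: Dict[str, Dict[str, Any]] = {}

    def register(self, key: str, **meta: Any) -> Callable:
        """Return a decorator that stores the decorated object under `key`."""
        def decorator(obj: Callable[..., Any]) -> Callable[..., Any]:
            if key in self._entries:
                raise KeyError(f"{key!r} is already registered in {self.name}")
            self._entries[key] = obj
            self._meta[key] = meta
            return obj  # the object itself is left unchanged
        return decorator

    def get(self, key: str) -> Callable[..., Any]:
        return self._entries[key]

    def meta(self, key: str) -> Dict[str, Any]:
        return dict(self._meta[key])

    def keys(self) -> List[str]:
        return sorted(self._entries)


# Hypothetical registry of molecular featurizers.
FEATURIZERS = Registry("featurizers")


@FEATURIZERS.register("char_count", author="example", version="0.1")
def char_count(smiles: str) -> List[int]:
    """Toy featurizer: length of the SMILES string."""
    return [len(smiles)]


@FEATURIZERS.register("atom_counts")
def atom_counts(smiles: str) -> List[int]:
    """Toy featurizer: naive counts of the characters C, N and O."""
    return [smiles.count("C"), smiles.count("N"), smiles.count("O")]


if __name__ == "__main__":
    # The featurizer is chosen by name at run time, e.g. from a config file.
    for key in FEATURIZERS.keys():
        print(key, FEATURIZERS.get(key)("CCO"), FEATURIZERS.meta(key))
```

Because callers look functions up by key, a module can be swapped by changing a string in a configuration rather than editing imports, which is the interchangeability the abstract refers to.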

Open Access

Artificial Intelligence Methods for Evaluating Mitochondrial Dysfunction: Exploring Various Chemical Notations Suitable for Neural Language Processing Models
Abstract
In recent years, the integration of Artificial Intelligence and Machine Learning methods, such as Neural Language Processing (NLP), with biochemical and biomedical research has revolutionized the field of toxicology, marking a profound advancement in our understanding of the toxicological effects of diverse chemical compounds on biological systems.
Among various toxic effects, mitochondrial dysfunction has emerged as a crucial endpoint due to its role in various diseases related to the liver, heart, and brain and, more generally, to different physiological processes. Indeed, mitochondria are indispensable organelles in cells that serve as the primary hub for energy production, and they are responsible for critical functions in cell metabolism, signaling, and cellular demise. Traditional methods for assessing chemical hazards and their impact on mitochondrial function have relied heavily on experimental assays and animal studies, which are often time-consuming, resource-intensive, and limited in scalability. To overcome these limitations, in silico methods have emerged as indispensable tools in toxicological research, reducing the need for traditional in vivo testing and saving valuable resources in terms of time and money.
This study used NLP models to explore diverse chemical notations for encoding chemical information, such as Simplified Molecular Input Line Entry System (SMILES), DeepSMILES and Self-Referencing Embedded Strings (SELFIES), with the aim of evaluating toxic interactions between chemicals and specific biological targets, achieving high predictive performance.
Edoardo Luca Viganò, Erika Colombo, Davide Ballabio, Alessandra Roncaglioni

Open Access

Temporal Evaluation of Uncertainty Quantification Under Distribution Shift
Abstract
Uncertainty quantification is emerging as a critical tool in high-stakes decision-making processes, where trust in automated predictions that lack accuracy and precision can be time-consuming and costly. In drug discovery, such high-stakes decisions are based on modeling the properties of potential drug compounds on biological assays. So far, existing uncertainty quantification methods have primarily been evaluated using public datasets that lack the temporal context necessary to understand their performance over time. In this work, we address the pressing need for a comprehensive, large-scale temporal evaluation of uncertainty quantification methodologies in the context of assay-based molecular property prediction. Our novel framework benchmarks three ensemble-based approaches to uncertainty quantification and explores the effect of adding lower-quality data during training in the form of censored labels. We investigate the robustness of the predictive performance and the calibration and reliability of predictive uncertainty by the models as time evolves. Moreover, we explore how the predictive uncertainty behaves in response to varying degrees of distribution shift. By doing so, our analysis not only advances the field but also provides practical implications for real-world pharmaceutical applications.
Emma Svensson, Hannah Rosa Friesacher, Adam Arany, Lewis Mervin, Ola Engkvist

Open Access

Deep Bayesian Experimental Design for Drug Discovery
Abstract
In drug discovery, prioritizing compounds for testing is an important task. Active learning can assist in this endeavor by prioritizing molecules for label acquisition based on their estimated potential to enhance in-silico models. However, in specialized cases like toxicity modeling, limited dataset sizes can hinder effective training of modern neural networks for representation learning and for active learning. In this study, we leverage a transformer-based BERT model pretrained on millions of SMILES to perform active learning. Additionally, we explore different acquisition functions to assess their compatibility with the pretrained BERT model. Our results demonstrate that pretrained models enhance active learning outcomes. Furthermore, we observe that active learning selects a higher proportion of positive compounds compared to random acquisition, an important advantage, especially in dealing with imbalanced toxicity datasets. Through a comparative analysis, we find that both BALD and EPIG acquisition functions outperform random acquisition, with EPIG exhibiting slightly superior performance over BALD. In summary, our study highlights the effectiveness of active learning in conjunction with pretrained models to tackle the problem of data scarcity.
Muhammad Arslan Masood, Tianyu Cui, Samuel Kaski
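
As an illustration of one of the acquisition functions named in the abstract above, the sketch below computes the BALD score for a binary classifier from a set of sampled predictions (for example from MC dropout or an ensemble). It is a textbook formulation in plain Python, not the authors' code; the function names and the toy candidate pool are assumptions, and EPIG is not covered here.

```python
import math
from typing import Dict, List


def binary_entropy(p: float) -> float:
    """Entropy (in nats) of a Bernoulli distribution with parameter p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def bald_score(probs: List[float]) -> float:
    """BALD = entropy of the mean prediction minus the mean entropy.

    `probs` holds the positive-class probabilities predicted by different
    posterior samples for one compound. The score is high when the samples
    are individually confident but disagree with each other.
    """
    mean_p = sum(probs) / len(probs)
    mean_entropy = sum(binary_entropy(p) for p in probs) / len(probs)
    return binary_entropy(mean_p) - mean_entropy


def select_batch(candidates: Dict[str, List[float]], k: int) -> List[str]:
    """Return the k candidate names with the highest BALD score."""
    ranked = sorted(candidates, key=lambda name: bald_score(candidates[name]),
                    reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical pool: sampled predictions for three unlabeled compounds.
    pool = {
        "mol_a": [0.5, 0.5],    # uncertain, but all samples agree
        "mol_b": [0.01, 0.99],  # samples disagree strongly
        "mol_c": [0.9, 0.9],    # confident and in agreement
    }
    print(select_batch(pool, k=1))
```

Random acquisition, the baseline in the abstract, would simply draw `k` names from the pool without looking at the predictions.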
Backmatter
Metadata
Title
AI in Drug Discovery
Editors
Djork-Arné Clevert
Michael Wand
Kristína Malinovská
Jürgen Schmidhuber
Igor V. Tetko
Copyright Year
2025
Electronic ISBN
978-3-031-72381-0
Print ISBN
978-3-031-72380-3
DOI
https://doi.org/10.1007/978-3-031-72381-0
