Non-linear QSAR modeling by using multilayer perceptron feedforward neural networks trained by back-propagation

doi:10.1016/S0039-9140(01)00537-9

Talanta

Volume 56, Issue 1, 4 January 2002, Pages 79-90

https://doi.org/10.1016/S0039-9140(01)00537-9

Abstract

The use of multilayer perceptron (MLP) feedforward neural networks trained by back-propagation (BP) for non-linear QSAR model building is presented and explained in detail through a case study. The method is compared with others often used in this field, such as multiple linear regression (MLR), partial least squares (PLS) and quadratic PLS (QPLS). The case study deals with a series of 18 alpha-adrenoreceptor agonists belonging to three different classes (alpha-1, alpha-2 and alpha-1,2) according to their different pharmacological effects. Each compound is described by 15 chemical features (the X block). Six pharmacological responses were also measured for each compound to build the matrix of biological responses (the Y block). The results indicate a slightly better performance of MLP than of the other procedures when the correlation coefficient of the observed versus predicted response plots is used as an indicator of the goodness of fit.

Introduction

The QSAR perspective on the fourth level of pattern recognition (PR) entails the establishment of relationships between the block of chemical structure descriptors (X) and another block of biological response measurements (Y) [1]. The four levels of PR are as follows. At the first level, the goal is to classify a compound of unknown activity into one of a prefixed number of possible classes (hard classification). At the second level, a compound of unknown activity can be classified (i) as belonging to one of the given classes, (ii) as belonging to more than one class (a multiclass object) or (iii) as an outlier (a classless object), leading to soft classification techniques. At the third level of PR, after the unknown has been classified, a single biological response is correlated with its structure. If, instead of a single measurement, a matrix of biological responses (the Y block) is correlated with its structure, we have attained the fourth level of PR.

Accurate models of QSAR are crucial for any rational molecular design and for understanding the structuring of molecular information. The data analysis techniques which can be applied at this level to find such relationships are the stepwise multiple linear regression (MLR) [2], principal component regression (PCR) [3], [4], partial least squares (PLS) [5], [6] and methods based on artificial neural networks (ANNs) [7], [8], [9]. However, to consider the intrinsically non-linear nature of QSAR it is advisable to use flexible methods to model any non-linear relationships. An extension of the PLS called second-order or quadratic PLS (QPLS) was developed by Wold et al. [10] based on a principal component-approach similar to that of the PLS model. Thus, once the score matrices T and U corresponding to the block matrices X and Y, respectively, have been obtained, two scores t_h and u_h are correlated through the relationship: $u_{h} =c_{0} +c_{1} t_{h} +c_{2} t_{h}^{2}$
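The quadratic inner relation between a pair of scores can be illustrated with an ordinary least-squares fit. The following is only a minimal sketch, assuming the score vectors t_h and u_h are already available; the function name `fit_quadratic_inner_relation` is hypothetical, and the complete QPLS algorithm of Wold et al. involves more than this single fit.

```python
def fit_quadratic_inner_relation(t, u):
    """Least-squares estimate of (c0, c1, c2) in u = c0 + c1*t + c2*t**2."""
    # One design-matrix row [1, t, t^2] per object.
    rows = [[1.0, ti, ti * ti] for ti in t]
    # Normal equations (A^T A) c = A^T u, stored as a 3x4 augmented matrix.
    aug = [
        [sum(r[j] * r[k] for r in rows) for k in range(3)]
        + [sum(r[j] * ui for r, ui in zip(rows, u))]
        for j in range(3)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(3):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return tuple(aug[i][3] / aug[i][i] for i in range(3))


# Illustrative scores generated from u = 1 + 2t + 3t^2 (not data from the paper).
t_scores = [-2.0, -1.0, 0.0, 1.0, 2.0]
u_scores = [1.0 + 2.0 * ti + 3.0 * ti * ti for ti in t_scores]
c0, c1, c2 = fit_quadratic_inner_relation(t_scores, u_scores)
```

With noise-free scores generated from a known quadratic, the fit recovers the generating coefficients, which gives a quick check on the implementation.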

This technique represents the non-linear modeling of QSAR as a first approximation [11], but the consideration of the fully non-linear features of these relationships can only be attained by the use of ANNs. These algorithms are able to model the non-linear relationships that usually exist between molecular attributes and their influence on biological/chemical activity. Multilayer perceptrons (MLP) are feedforward multilayer networks that provide flexible frameworks for non-linear function estimation.

A MLP consists of formal neurons or nodes and connections (weights) among them. In a MLP architecture, the neurons are arranged in layers (an input layer, one or more hidden layers and an output layer), and the connections are unidirectional from input to output. Adjacent layers are fully connected but no connections exist between neurons within the same layer. This architecture computes a numerical output value f(x) for a given numerical input vector x, which is the row of the X matrix corresponding to a given object (molecule, species, etc…). A formal neuron sums up incoming signals multiplied by the connection weights, subtracts a threshold value (or bias θ) and calculates an output signal by using the so-called transfer function. Neurons can have different transfer functions. Input neurons simply distribute the descriptor data to the hidden layer neurons without any further computation. Hidden layer neurons typically have a sigmoidal transfer function: $sf(\mathrm{input}) = \frac{1}{1 + e^{-\mathrm{input}}}$ that limits the neuron's output signal to values between 0 and 1. The output layer neurons usually have sigmoidal or linear transfer functions, depending on the application. The whole network represents a non-linear relationship which can be written for each output as: $\hat{y} = f(x) = \sum_{h} sf\left(\sum_{i} x_{i} w_{ih} - \theta_{i}\right) w_{h} - \theta_{h}$ where w_ih is the connection weight between the input node i with the hidden node h and w_h are the connection weights between each hidden node h with the final output considered, y. θ_i and θ_h are the biases corresponding to the input and hidden layers. The difference between $\hat{y}$ and y is the target error, which is subsequently back-propagated to modify the weights in order to attain the best fit.
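The network function described above can be traced numerically with a few lines of code. This is a minimal sketch rather than the authors' implementation: it assumes a single hidden layer, a linear output neuron, one threshold per hidden neuron subtracted inside the sigmoid and one threshold subtracted at the output. All names (`sigmoid`, `mlp_output`, `theta_hidden`, `theta_out`) and the numerical values are hypothetical.

```python
import math


def sigmoid(value):
    """Sigmoidal transfer function; output lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-value))


def mlp_output(x, w_ih, theta_hidden, w_h, theta_out):
    """Single output of a one-hidden-layer MLP with a linear output neuron.

    x            : descriptor values of one object (one row of the X block)
    w_ih[i][h]   : weight between input node i and hidden node h
    theta_hidden : threshold subtracted inside each hidden neuron (assumption)
    w_h[h]       : weight between hidden node h and the output
    theta_out    : threshold subtracted at the output neuron (assumption)
    """
    hidden = []
    for h in range(len(w_h)):
        # Weighted sum of the incoming signals minus the threshold.
        net = sum(x[i] * w_ih[i][h] for i in range(len(x))) - theta_hidden[h]
        hidden.append(sigmoid(net))
    # Linear output neuron: weighted sum of hidden signals minus its threshold.
    return sum(hidden[h] * w_h[h] for h in range(len(w_h))) - theta_out


# Illustrative values only: two descriptors, two hidden neurons.
y_hat = mlp_output(
    x=[0.5, -1.0],
    w_ih=[[0.2, -0.4], [0.1, 0.3]],
    theta_hidden=[0.0, 0.1],
    w_h=[1.5, -0.7],
    theta_out=0.2,
)
```

Back-propagation would then adjust `w_ih`, `w_h` and the thresholds so as to reduce the difference between `y_hat` and the measured response.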

For QSAR purposes and according to Kolmogorov's theorem [12], [13], three network layers are sufficient to approximate arbitrary continuous functions. As Duprat et al. [14] pointed out in their excellent paper, a special feature of ANNs that justifies their impact on QSAR modeling is the so-called ‘parsimony’, i.e. some ANNs give better results than other approximations (MLR, PCR, PLS) with the same number of parameters. An MLP having one layer of hidden neurons with sigmoidal transfer functions behaves as a parsimonious ANN. The number of hidden nodes in an MLP indicates the complexity of the relationship in a way very similar to the degree of a polynomial fitted to a curve. It is always possible to build a parameterized model that perfectly fits the available data by taking a huge number of parameters. However, such overparameterized models give very poor results on data that were not used to estimate the parameters (i.e. data outside the training step). This phenomenon is called overfitting and is easily observed when using an ANN whose size is too large (many connection weights), so that the overall network function is too complicated for a reasonable solution. In such a case, there is a risk of network specialization on the training data, which results in poor predictive ability on separate test sets.
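To make the risk of overparameterization concrete, the number of adjustable parameters of a fully connected one-hidden-layer MLP can be counted and compared with the number of objects. The sketch below assumes one threshold per hidden and per output neuron; the function name `mlp_parameter_count` and the architectures shown are hypothetical and are not the networks used by the authors.

```python
def mlp_parameter_count(n_inputs, n_hidden, n_outputs):
    """Weights plus thresholds of a fully connected one-hidden-layer MLP."""
    # Each hidden neuron has one weight per input plus one threshold.
    hidden_params = (n_inputs + 1) * n_hidden
    # Each output neuron has one weight per hidden neuron plus one threshold.
    output_params = (n_hidden + 1) * n_outputs
    return hidden_params + output_params


# All 15 descriptors, 2 hidden neurons, 1 response: 35 parameters,
# already about twice the 18 compounds of the case study.
full_descriptor_net = mlp_parameter_count(15, 2, 1)

# A reduced set of 3 descriptors with the same hidden layer: 11 parameters.
reduced_net = mlp_parameter_count(3, 2, 1)
```

The comparison suggests why reducing the number of inputs and keeping the hidden layer small matters for a data set of this size.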

The best approach to successful data modeling is to build the smallest model (smallest number of weights) in order to attain the same performance in recalling (estimation of training data) as in predicting (estimation of separate test data), both being sufficiently reliable. The overfitting problem may be minimized by monitoring the performance of the network during training by using a verification data set different from the training set.
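Monitoring with a verification set is commonly implemented as early stopping. The following is a minimal sketch of that idea, not the authors' procedure: it assumes the verification error has been recorded after every training epoch and keeps the weights of the epoch with the lowest error. The `patience` parameter and the function name are hypothetical.

```python
def early_stopping_epoch(verification_errors, patience=3):
    """Return the index of the epoch whose weights should be kept.

    Training is considered finished once the verification error has not
    improved for `patience` consecutive epochs.
    """
    best_epoch = 0
    best_error = float("inf")
    for epoch, error in enumerate(verification_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            # Verification error keeps rising: the network is starting
            # to specialize on the training data.
            break
    return best_epoch


# Illustrative error trace (not data from the paper).
errors = [0.90, 0.60, 0.40, 0.35, 0.37, 0.40, 0.45, 0.30]
keep = early_stopping_epoch(errors, patience=3)
```

In this trace the monitor stops at the seventh epoch and keeps the weights from the fourth (index 3), so the later value of 0.30 is never examined. A larger `patience` tolerates temporary rises in the verification error at the cost of longer training.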

In this work, a comparison is made between QSAR model-building using MLP and other commonly used techniques such as MLR, PCR, PLS and QPLS on the basis of a case study. A discussion is given on the reliability of global QSAR models versus built-in-class ones.

The compounds treated in this work, whose chemical structures are shown in Fig. 1, have been studied previously by Megens et al. [15], from a pharmacological point of view. These compounds fall into the alpha-adrenergic agonist group because they are able to activate the alpha-adrenergic receptors. Adrenoreceptors are membrane-bound receptors located throughout the body's neuronal and non-neuronal tissues where they mediate a diverse range of responses to the endogenous catecholamines, such as adrenaline and noradrenaline. They are less potent than other endogenous agonists, such as epinephrine or norepinephrine. However, because of structural modification they are orally active and have longer plasma half-lives. Thus, they produce an increase in blood pressure and constriction of blood vessels, among other effects.

The studied compounds can be divided into three subtypes of agonist depending on the actuating receptor family: alpha-1 (e.g. methoxamine), alpha-2 (e.g. tetrahydrazoline) and other less specific mixed types (e.g. xylometazoline), denoted by alpha-1,2. For the sake of brevity, alpha-1, alpha-2 and alpha-1,2 agonists will be ascribed as belonging to classes 1, 2 and 3, respectively, as indicated in Fig. 1.

The alpha-1 adrenoreceptors are located in the central and peripheral nervous system. In the first case, they are predominantly located post-synaptically where they play an excitatory role. In the second case, they are located on both vascular and non-vascular smooth muscle where activation of the receptors results in contraction. The alpha-2 adrenoreceptors are found on both pre- and post-synaptic neurons where they mediate an inhibitory role in the central and peripheral nervous system.

The biological activities of these compounds, gathered in Table 1, correspond to in vivo tests (columns A–C) and in vitro tests (columns D–F) taken from Ref. [16]. In that work, biological activity for in vivo tests was defined as the reciprocal of the amount of substance per unit body weight that produces a given effect in half of the animals tested (kg of body weight/mmol compound). For in vitro tests, the activity was defined as the reciprocal of the amount of test substance per unit volume of sample that displaces half of the marker (liter sample/mmol compound). Given that all the biological responses (A–F) are positive quantities, an often-used procedure is to take logarithms of the data prior to autoscaling [17], and accordingly, the corresponding logarithmic values are presented in Table 1.
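The pretreatment described above (logarithms followed by autoscaling) can be sketched for a single response column as follows. The base of the logarithm and the use of the sample standard deviation (n − 1 denominator) are assumptions, since the text does not specify them; the function name `log_autoscale` and the input values are hypothetical.

```python
import math
import statistics


def log_autoscale(column):
    """Log-transform a column of positive activities, then autoscale it."""
    # Base-10 logarithm (assumed); requires strictly positive values.
    logs = [math.log10(value) for value in column]
    mean = statistics.mean(logs)
    # Sample standard deviation, n - 1 in the denominator (assumed).
    sd = statistics.stdev(logs)
    # Autoscaling: zero mean and unit standard deviation.
    return [(value - mean) / sd for value in logs]


# Illustrative activities spanning two orders of magnitude (not Table 1 data).
scaled = log_autoscale([1.0, 10.0, 100.0])
```

After this step every response column has zero mean and unit variance, so responses measured in different units contribute on a comparable scale.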

Columns A–C of Table 1 describe the response of live rats to different tests. The antidiarrheal effect (A) is marked by the absence of diarrhea after the ingestion of castor oil, which is a strongly diarrheal agent. The diuretic effect (B) is defined as an excess production of urine. The antiptotic effect (C) is measured by the degree of eyelid opening after the administration of an eye closure agent (prazosin).

Columns D–F of Table 1 describe the measured binding abilities of the compounds to alpha-receptors during in vitro tests using specific markers. Thus, clonidine (D) is used as a specific marker for alpha-2 receptors, idazoxan (E) is a non-specific marker and WB-4101 (F) is specific for alpha-1 receptors.

Chemical descriptors

In general, chemical descriptors can be divided into two main categories: global types and substituent types. The former are based on chemical properties that can be obtained directly from the molecular structures. The latter require the calculation of properties taking into account atom-based fragmental constants.

In this paper, both kinds of computational descriptors have been employed: quantum chemical descriptors (which fall into the global category) and predicted data for pK_a and

Class structure of the data matrix

Lewi [16] performed a spectral map analysis (SMA) on this data set on the basis of the so-called “contrasts analysis”. Contrasts were defined as the logarithm of the specificity of a compound divided by the mean specificity computed over all compounds, the specificity being the ratio of the activities of a compound in two tests. Their findings showed the capability of SMA for identification and quantification of contrasts and gave some insight into the discrimination among the three types of agonists,

Feature selection

In QSAR studies, it is necessary to select relevant variables that satisfactorily represent the relationships between the biological responses and the descriptors to find an optimal model. This problem is further complicated when correlations exist among input variables. Thus, PR techniques are highly recommended tools to find the most significant variables.

In the literature, an extensive range of methods for feature selection has been applied showing advantages and disadvantages of each one.

Results and discussion

As mentioned in the introduction, the aim of this work is to compare the results obtained when modeling QSAR using MLP against other commonly used procedures. In the present case study, different external dependent variables (the Y block) are modeled by using the descriptor data matrix (X block) that best characterizes the set of agonist molecules under study. The nature of their effects on the receptors gives rise to the classification of the samples into three classes as explained earlier:

Conclusion

The use of MLP for building QSAR models at the fourth level of PR is the most suitable tool to ensure good fits and predictions in cases where non-linear patterns occur. The most critical stages are identified as the establishment of the net architecture and scrutinizing for overfitting effects by means of a monitoring test.

Acknowledgements

The authors are indebted to Professor W.H. Mulder for his valuable comments.

References (31)

W.J. Dunn, Chemom. Intell. Lab. Syst. (1989)
P. Geladi et al., Anal. Chim. Acta (1986)
S. Wold et al., Chemom. Intell. Lab. Syst. (1989)
A.H.P. Megens et al., Eur. J. Pharmacol. (1986)
P.J. Lewi, Chemom. Intell. Lab. Syst. (1989)
W.M. Meylan et al., J. Pharm. Sci. (1995)
S.Z. Langer, Biochem. Pharmacol. (1974)
D. González-Arjona et al., Anal. Chim. Acta (1998)
N.A. Armstrong et al., Pharmaceutical Experimental Design and Interpretation (1996)
S. Wold et al., J. Chemom. (1987)
S. Wold et al., Chemom. Intell. Lab. Syst. (1987)
S. Wold et al., SIAM J. Sci. Stat. Comput. (1984)
C.M. Bishop, Neural Networks for Pattern Recognition (1995)
J. Devillers (Ed.), Neural Networks in QSAR and Drug Design, Academic Press, London,...
J. Zupan et al., Neural Networks for Chemists: an Introduction (1993)
