Non-linear QSAR modeling by using multilayer perceptron feedforward neural networks trained by back-propagation
Introduction
The QSAR perspective on the fourth level of pattern recognition (PR) entails the establishment of relationships between the block of chemical structure descriptors (X) and a block of biological response measurements (Y) [1]. The four levels of PR are as follows. At the first level, the goal is to classify a compound of unknown activity into one of a fixed number of possible classes (hard classification). At the second level, a compound of unknown activity can be classified (i) as belonging to one of the given classes, (ii) as belonging to more than one class (a multiclass object) or (iii) as an outlier (a classless object), which leads to soft classification techniques. At the third level of PR, after the unknown has been classified, a single biological response is correlated with its structure. If, instead of a single measurement, a matrix of biological responses (the Y block) is correlated with its structure, the fourth level of PR is attained.
Accurate QSAR models are crucial for rational molecular design and for understanding how molecular information is structured. The data analysis techniques that can be applied at this level to find such relationships include stepwise multiple linear regression (MLR) [2], principal component regression (PCR) [3], [4], partial least squares (PLS) [5], [6] and methods based on artificial neural networks (ANNs) [7], [8], [9]. However, because QSAR is intrinsically non-linear, it is advisable to use flexible methods that can model non-linear relationships. An extension of PLS called second-order or quadratic PLS (QPLS) was developed by Wold et al. [10], based on a principal component approach similar to that of the PLS model. Thus, once the score matrices T and U corresponding to the block matrices X and Y, respectively, have been obtained, two scores $t_h$ and $u_h$ are correlated through the quadratic relationship:

$$u_h = c_0 + c_1 t_h + c_2 t_h^2 + e_h$$

where $c_0$, $c_1$ and $c_2$ are the coefficients of the inner relation and $e_h$ is the residual.
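As a rough illustration of a quadratic inner relation between one pair of scores, the following sketch fits a second-order polynomial by least squares. The function name, the coefficient names `c0`, `c1`, `c2` and the synthetic scores are assumptions made for illustration only; the full QPLS algorithm also updates the weights iteratively, which is not shown here.

```python
import numpy as np

def fit_quadratic_inner_relation(t, u):
    """Least-squares fit of u ~ c0 + c1*t + c2*t**2 for one pair of score vectors."""
    # np.polyfit returns the coefficients from the highest to the lowest degree
    c2, c1, c0 = np.polyfit(t, u, deg=2)
    return c0, c1, c2

# Synthetic scores (illustrative only, not data from the study)
rng = np.random.default_rng(0)
t_h = np.linspace(-2.0, 2.0, 25)
u_h = 0.5 + 1.0 * t_h - 0.8 * t_h**2 + rng.normal(scale=0.05, size=t_h.size)

c0, c1, c2 = fit_quadratic_inner_relation(t_h, u_h)
u_fit = c0 + c1 * t_h + c2 * t_h**2
residual_rms = float(np.sqrt(np.mean((u_h - u_fit) ** 2)))
print(f"c0={c0:.3f}, c1={c1:.3f}, c2={c2:.3f}, residual RMS={residual_rms:.3f}")
```

A purely linear inner relation corresponds to forcing `c2` to zero, which is why a quadratic term can only be regarded as a first step towards non-linear modeling.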
This technique represents a first approximation to the non-linear modeling of QSAR [11], but the fully non-linear features of these relationships can only be captured by ANNs. These algorithms are able to model the non-linear relationships that usually exist between molecular attributes and biological/chemical activity. Multilayer perceptrons (MLPs) are feedforward multilayer networks that provide flexible frameworks for non-linear function estimation.
An MLP consists of formal neurons (nodes) and connections (weights) among them. In an MLP architecture, the neurons are arranged in layers (an input layer, one or more hidden layers and an output layer), and the connections are unidirectional from input to output. Adjacent layers are fully connected, but no connections exist between neurons within the same layer. This architecture computes a numerical output value f(x) for a given numerical input vector x, which is the row of the X matrix corresponding to a given object (molecule, species, etc.). A formal neuron sums the incoming signals multiplied by the connection weights, subtracts a threshold value (or bias θ) and calculates an output signal by using the so-called transfer function. Neurons can have different transfer functions. Input neurons simply distribute the descriptor data to the hidden layer neurons without any further computation. Hidden layer neurons typically have a sigmoidal transfer function:

$$f_h(z) = \frac{1}{1 + e^{-z}}$$

that limits the neuron's output signal to values between 0 and 1. The output layer neurons usually have sigmoidal or linear transfer functions, depending on the application. The whole network represents a non-linear relationship which can be written for each output as:

$$\hat{y} = f_o\left(\sum_h w_h\, f_h\left(\sum_i w_{ih} x_i - \theta_i\right) - \theta_h\right)$$

where $x_i$ is the $i$-th input, $f_h$ and $f_o$ are the transfer functions of the hidden and output neurons, $w_{ih}$ is the connection weight between the input node $i$ and the hidden node $h$, and $w_h$ is the connection weight between the hidden node $h$ and the output considered, $y$. $\theta_i$ and $\theta_h$ are the biases corresponding to the input and hidden layers. The difference between $\hat{y}$ and $y$ is the target error, which is subsequently back-propagated to modify the weights in order to attain the best fit.
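The computation performed by such a network for a single object can be sketched in a few lines. This is a minimal illustration assuming one hidden layer and one output neuron; the function and parameter names (`mlp_forward`, `theta_hidden`, `theta_out`) and the random weights are hypothetical and do not come from the original work.

```python
import numpy as np

def sigmoid(z):
    """Logistic transfer function; its output lies between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, w_ih, theta_hidden, w_h, theta_out, linear_output=False):
    """Output of a one-hidden-layer MLP for a single input vector x.

    w_ih         : (n_inputs, n_hidden) input-to-hidden weights
    theta_hidden : (n_hidden,) thresholds subtracted in the hidden neurons
    w_h          : (n_hidden,) hidden-to-output weights
    theta_out    : scalar threshold subtracted in the output neuron
    """
    # Each hidden neuron: weighted sum of inputs minus threshold, then sigmoid
    hidden = sigmoid(x @ w_ih - theta_hidden)
    # Output neuron: weighted sum of hidden signals minus threshold
    net_out = hidden @ w_h - theta_out
    return net_out if linear_output else sigmoid(net_out)

# Hypothetical descriptor row and randomly initialised weights
rng = np.random.default_rng(1)
n_inputs, n_hidden = 4, 3
x = rng.normal(size=n_inputs)
w_ih = rng.normal(size=(n_inputs, n_hidden))
theta_hidden = rng.normal(size=n_hidden)
w_h = rng.normal(size=n_hidden)
theta_out = 0.1

y_hat = mlp_forward(x, w_ih, theta_hidden, w_h, theta_out)
print(f"network output: {y_hat:.4f}")
```

With `linear_output=True` the output neuron returns the weighted sum directly, which is the usual choice when the response is not confined to the interval between 0 and 1.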
For QSAR purposes, and according to Kolmogorov's theorem [12], [13], three network layers are sufficient to approximate arbitrary continuous functions. As Duprat et al. [14] pointed out in their excellent paper, a special feature of ANNs that justifies their impact on QSAR modeling is the so-called ‘parsimony’, i.e. some ANNs give better results than other approximations (MLR, PCR, PLS) with the same number of parameters. An MLP with one layer of hidden neurons with sigmoidal transfer functions behaves as a parsimonious ANN. The number of hidden nodes in an MLP indicates the complexity of the relationship, in much the same way as the degree of a polynomial fitted to a curve. It is always possible to build a parameterized model that fits the available data perfectly by using a very large number of parameters. However, such overparameterized models give very poor results for data that were not used to estimate the parameters in the training step. This phenomenon is called overfitting, and it is easily observed when the network is too large (many connection weights), so that the overall network function is more complicated than a reasonable solution requires. In such a case, the network risks specializing on the training data, which results in poor predictive ability on separate test sets.
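The polynomial analogy can be made concrete with a small numerical experiment. The sketch below fits polynomials of increasing degree to noisy synthetic data and compares the error on the training points with the error on separate test points; the curve, noise level and degrees are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(2)

def true_curve(x):
    """Hypothetical underlying relationship (illustration only)."""
    return np.sin(2.0 * np.pi * x)

# Small training set and a separate test set, both with added noise
x_train = np.linspace(0.0, 1.0, 12)
y_train = true_curve(x_train) + rng.normal(scale=0.2, size=x_train.size)
x_test = np.linspace(0.02, 0.98, 50)
y_test = true_curve(x_test) + rng.normal(scale=0.2, size=x_test.size)

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

results = {}
for degree in (1, 3, 9):
    coeffs = P.polyfit(x_train, y_train, degree)  # coefficients, low to high
    train_err = rmse(y_train, P.polyval(x_train, coeffs))
    test_err = rmse(y_test, P.polyval(x_test, coeffs))
    results[degree] = (train_err, test_err)
    print(f"degree {degree}: training RMSE={train_err:.3f}, test RMSE={test_err:.3f}")
```

The training error can only decrease as the degree grows, whereas the test error typically stops improving and then deteriorates once the model starts to follow the noise. Adding hidden nodes to an MLP plays a comparable role.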
The best approach to successful data modeling is to build the smallest model (smallest number of weights) that attains the same performance in recalling (estimation of training data) as in predicting (estimation of separate test data), both being sufficiently reliable. The overfitting problem may be minimized by monitoring the performance of the network during training with a verification data set different from the training set.
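Monitoring with a verification set can be sketched as follows: a small MLP is trained by back-propagation (plain batch gradient descent), the verification error is recorded after every epoch, and the weights that gave the lowest verification error are retained. This is a minimal sketch under stated assumptions, not the training procedure of the original work; the data are synthetic, and the learning rate, number of epochs, number of hidden nodes and all function names are hypothetical choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    """Hidden activations and linear output of a one-hidden-layer MLP."""
    W1, b1, w2, b2 = params
    H = sigmoid(X @ W1 + b1)  # a bias b is the negative of a threshold
    return H, H @ w2 + b2

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

def train_with_monitoring(X_tr, y_tr, X_ver, y_ver,
                          n_hidden=3, lr=0.1, n_epochs=500, seed=0):
    rng = np.random.default_rng(seed)
    params = [rng.normal(scale=0.5, size=(X_tr.shape[1], n_hidden)),
              np.zeros(n_hidden),
              rng.normal(scale=0.5, size=n_hidden),
              np.zeros(())]
    best_err, best_epoch, best_params = np.inf, -1, None
    history = []
    n = len(y_tr)
    for epoch in range(n_epochs):
        W1, b1, w2, b2 = params
        H, y_hat = forward(X_tr, params)
        err = y_hat - y_tr
        # Back-propagate the error: gradients of half the mean squared error
        delta_h = np.outer(err, w2) * H * (1.0 - H)
        grads = [X_tr.T @ delta_h / n, delta_h.mean(axis=0),
                 H.T @ err / n, err.mean()]
        params = [p - lr * g for p, g in zip(params, grads)]
        # Monitor the error on data not used for the weight updates
        ver_err = mse(y_ver, forward(X_ver, params)[1])
        history.append(ver_err)
        if ver_err < best_err:
            best_err, best_epoch = ver_err, epoch
            best_params = [np.copy(p) for p in params]
    return best_params, best_err, best_epoch, history

# Synthetic data (illustration only): 30 training and 15 verification objects
rng = np.random.default_rng(3)
X = rng.normal(size=(45, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=45)
X_tr, y_tr, X_ver, y_ver = X[:30], y[:30], X[30:], y[30:]

best_params, best_err, best_epoch, history = train_with_monitoring(
    X_tr, y_tr, X_ver, y_ver)
print(f"lowest verification MSE {best_err:.4f} at epoch {best_epoch}")
```

If the verification error starts to rise while the training error keeps falling, the network is specializing on the training data, and the weights stored at `best_epoch` are the ones to keep.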
In this work, QSAR model-building using MLP is compared with other commonly used techniques, namely MLR, PCR, PLS and QPLS, on the basis of a case study. The reliability of global QSAR models versus models built within each class is also discussed.
The compounds treated in this work, whose chemical structures are shown in Fig. 1, have been studied previously by Megens et al. [15] from a pharmacological point of view. These compounds fall into the alpha-adrenergic agonist group because they are able to activate the alpha-adrenergic receptors. Adrenoreceptors are membrane-bound receptors located throughout the body's neuronal and non-neuronal tissues, where they mediate a diverse range of responses to the endogenous catecholamines, such as adrenaline and noradrenaline. The compounds studied are less potent than endogenous agonists such as epinephrine (adrenaline) or norepinephrine (noradrenaline). However, because of structural modification they are orally active and have longer plasma half-lives. Thus, they produce an increase in blood pressure and constriction of blood vessels, among other effects.
The studied compounds can be divided into three subtypes of agonist depending on the receptor family on which they act: alpha-1 (e.g. methoxamine), alpha-2 (e.g. tetrahydrozoline) and other less specific mixed types (e.g. xylometazoline), denoted by alpha-1,2. For the sake of brevity, alpha-1, alpha-2 and alpha-1,2 agonists will be assigned to classes 1, 2 and 3, respectively, as indicated in Fig. 1.
The alpha-1 adrenoreceptors are located in the central and peripheral nervous system. In the first case, they are predominantly located post-synaptically where they play an excitatory role. In the second case, they are located on both vascular and non-vascular smooth muscle where activation of the receptors results in contraction. The alpha-2 adrenoreceptors are found on both pre- and post-synaptic neurons where they mediate an inhibitory role in the central and peripheral nervous system.
The biological activities of these compounds, gathered in Table 1, correspond to in vivo tests (columns A–C) and in vitro tests (columns D–F) taken from Ref. [16]. In that work, biological activity for in vivo tests was defined as the reciprocal of the amount of substance per unit body weight that produces a given effect in half of the animals tested (kg of body weight/mmol compound). For in vitro tests, the activity was defined as the reciprocal of the amount of test substance per unit volume of sample that displaces half of the marker (liter sample/mmol compound). Given that all the biological responses (A–F) are positive quantities, an often-used procedure is to take logarithms of the data prior to autoscaling [17]; accordingly, Table 1 presents the corresponding logarithmic values.
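The preprocessing described here, logarithms followed by autoscaling, can be sketched as below. The base of the logarithm is not stated in the text, so base 10 is an assumption, as is the use of the sample standard deviation; the activity values are hypothetical and are not taken from Table 1.

```python
import numpy as np

def log_autoscale(Y):
    """Take base-10 logarithms of strictly positive responses, then autoscale
    each column to zero mean and unit standard deviation."""
    Y = np.asarray(Y, dtype=float)
    if np.any(Y <= 0):
        raise ValueError("all responses must be strictly positive")
    log_Y = np.log10(Y)                # assumption: base-10 logarithm
    mean = log_Y.mean(axis=0)
    std = log_Y.std(axis=0, ddof=1)    # assumption: sample standard deviation
    return (log_Y - mean) / std, mean, std

# Hypothetical activities for 4 compounds in 3 tests (not from Table 1)
Y = np.array([[0.5, 120.0, 3.0],
              [2.0, 15.0, 30.0],
              [10.0, 1.5, 0.3],
              [40.0, 60.0, 9.0]])

Y_scaled, col_mean, col_std = log_autoscale(Y)
print(np.round(Y_scaled, 3))
```

Taking logarithms first compresses responses that span several orders of magnitude, so that autoscaling is not dominated by a few very active compounds.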
Columns A–C of Table 1 describe the responses of live rats to different tests. The antidiarrheal effect (A) is marked by the absence of diarrhea after the ingestion of castor oil, a strong diarrheal agent. The diuretic effect (B) is defined as an excess production of urine. The antiptotic effect (C) is measured by the degree of eyelid opening after the administration of an eye-closure agent (prazosin).
Columns D–F of Table 1 describe the measured binding abilities of the compounds to alpha-receptors in in vitro tests using specific markers. Thus, clonidine (D) is used as a specific marker for alpha-2 receptors, idazoxan (E) is a non-specific marker and WB-4101 (F) is specific for alpha-1 receptors.
Chemical descriptors
In general, chemical descriptors can be divided into two main categories: global types and substituent types. The former are based on chemical properties that can be obtained directly from the molecular structure. The latter require the calculation of properties from atom-based fragmental constants.
In this paper, both kinds of computational descriptors have been employed: quantum chemical descriptors (which fall into the global category) and predicted data for pKa and
Class structure of the data matrix
Lewi [16] performed a spectral map analysis (SMA) on this data set on the basis of the so-called “contrasts analysis”. Contrasts were defined as the logarithm of the specificity of a compound divided by the mean specificity computed over all compounds, the specificity being the ratio of the activities of a compound in two tests. The findings showed the capability of SMA to identify and quantify contrasts and gave some insight into the discrimination among the three types of agonists,
Feature selection
In QSAR studies, it is necessary to select relevant variables that satisfactorily represent the relationships between the biological responses and the descriptors to find an optimal model. This problem is further complicated when correlations exist among input variables. Thus, PR techniques are highly recommended tools to find the most significant variables.
In the literature, an extensive range of feature selection methods has been applied, each with its own advantages and disadvantages.
Results and discussion
As mentioned in the introduction, the aim of this work is to compare the results obtained when modeling QSAR using MLP against other commonly used procedures. In the present case study, different external dependent variables (the Y block) are modeled by using the descriptor data matrix (X block) that best characterizes the set of agonist molecules under study. The nature of their effects on the receptors gives rise to the classification of the samples into three classes as explained earlier:
Conclusion
MLPs are the most suitable tool for building QSAR models at the fourth level of PR when good fits and predictions are needed in cases where non-linear patterns occur. The most critical stages are establishing the network architecture and checking for overfitting effects by means of a monitoring test.
Acknowledgements
The authors are indebted to Professor W.H. Mulder for his valuable comments.
References (31)
Chemom. Intell. Lab. Syst. (1989)
Anal. Chim. Acta (1986)
Chemom. Intell. Lab. Syst. (1989)
Eur. J. Pharmacol. (1986)
Chemom. Intell. Lab. Syst. (1989)
J. Pharm. Sci. (1995)
Biochem. Pharmacol. (1974)
Anal. Chim. Acta (1998)
Pharmaceutical Experimental Design and Interpretation (1996)
J. Chemom. (1987)
Chemom. Intell. Lab. Syst.
SIAM J. Sci. Stat. Comput.
Neural Networks for Pattern Recognition
Neural Networks for Chemists: an Introduction