2005 Speical IssueGraph kernels for chemical informatics
Section snippets
Introduction: processing structured data
Many problems in artificial intelligence, data mining, and machine learning involve variable-size structured data, such as strings and sequences, trees, and directed or undirected graphs. These data are to be contrasted with the standard, fixed-length, vectorial data found also in many applications. Examples of structured data include: (1) text and documents in information retrieval; (2) DNA/RNA/protein sequences and evolutionary trees in bioinformatics; and (3) molecular structures in chemical
Kernel methods
Although the basic idea behind kernel methods is quite old (e.g. Kimeldorf & Wahba, 1971), over the last decade it has regained considerable interest spurred by the work of Boser et al., 1992, Cortes and Vapnik, 1995 on support vector machines. Many new kernel methods have been developed and/or successfully applied to challenging problems in recent years (Cristianini and Shawe-Taylor, 2000, Schölkopf and Smola, 2002).
In essence, kernel methods handle non-linear complex tasks using linear
Convolution kernels for chemistry
In this section, we develop classes of graph kernels that can be viewed as spectral kernels derived by counting and comparing walks in the underlying graphs. To circumvent the limitations posed by other graph kernels and address problems of molecular classification, we propose to use and extend the common technique of molecular fingerprinting (Flower, 1998, Raymond and Willett, 2001) used by chemists to build new feature vector representations of molecules. What differentiates our approach from
Data
We apply the kernels introduced in the previous section together with the Voted Perceptron classifier to the problems of predicting mutagenicity, toxicity, and anti-cancer activity on three different data sets (Mutag, PTC, and NCI).
Results
We conducted several experiments to compare the various classes of kernels with different parameter settings. Here, we report representative subsets of results together with the main findings. In addition to the usual accuracy measure, we also measure the ROC score, which is the normalized area under the ROC curve plotting the proportion of true positive (TP) predictions as a function of the proportion of false positive (FP) predictions, as the classification threshold is varied (Hanley &
Discussion
We have reviewed graph kernels and developed new graph kernels for chemical molecules that yield state-of-the art results on several datasets. These kernels measure the similarity between feature vectors, or molecular fingerprints, consisting of binary vectors or vectors of counts associated with all labeled paths of length less or equal to some value d derived by depth-first searches, starting from each vertex of a molecule. The accuracies obtained, for instance, 72% on the NCI dataset, are
Acknowledgements
Work supported by a Laurel Wilkening Faculty Innovation award, an NIH Biomedical Informatics Training grant (LM-07443-01), an NSF MRI grant (EIA-0321390), a Sun Microsystems award, a grant from the University of California Systemwide Biotechnology Research and Education Program to PB and an MD/PhD Harvey Fellowship to S.J.S. We would also like to acknowledge OpenEye Scientific Software for their free academic software license.
References (78)
- et al.
Some results on Tchebycheffian spline functions
Journal of Mathematical and Analytical Application
(1971) - et al.
Theories for mutagenicity: a study of first-order and feature based induction
Artificial Intelligence
(1996) - et al.
Theoretical foundations of the potential function method in pattern recognition learning
Automation and Remote Control
(1964) - et al.
Kernel independent component analysis
Journal of Machine Learning Research
(2002) - Chen, J., Swamidass, S. J., Dou, Y., Bruand, J., & Baldi, P. (2005). ChemDB: a public, database of small molecules and...
- et al.
Bioinformatics: the machine learning approach
(2001) - et al.
Hybrid modeling, HMM/NN architectures, and protein applications
Neural Computation
(1996) - et al.
The principled design of large-scale recursive neural network architectures-DAG-RNNs and the protein structure prediction problem
Journal of Machine Learning Research
(2003) - Baldi, P., & Rosen-Zvi, M. On the relationship between deterministic and probabilistic directed graphical models: from...
- et al.
A training algorithm for optimal margin classifiers
(1992)