
Validation tools for variable subset regression

Journal of Computer-Aided Molecular Design

Abstract

Variable selection is frequently applied in QSAR research. Because the selection process influences the characteristics of the final model, thorough validation of the selection technique is essential. Here, a validation protocol is outlined briefly, and two of its tools are introduced in more detail. The first tool, based on permutation testing, makes it possible to assess the inflation of internal figures of merit (such as the cross-validated prediction error). The second tool, based on noise addition, can be used to determine the complexity, and with it the stability, of models generated by variable selection. The statistical information obtained is important in deciding whether to trust the predictive ability of a specific model. The graphical output of the validation tools is easy to interpret and gives a reliable impression of model performance. Among other applications, the tools were used to study the influence of leave-one-out and leave-multiple-out cross-validation on model characteristics; this confirmed that leave-multiple-out cross-validation yields more stable models. To study the performance of the entire validation protocol, it was applied to eight different QSAR data sets with default settings. In all cases, internal and external model performance was good, indicating that the protocol serves its purpose well.
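
The idea behind the permutation-based tool can be illustrated with a minimal sketch. This is not the authors' implementation: the greedy forward selection, the ordinary least squares model, the synthetic data, and all function names (`loo_q2`, `forward_select`, `permutation_q2`) are assumptions chosen for illustration. The essential point is that the whole selection procedure is repeated on permuted responses, so the resulting distribution of cross-validated q² shows how far selection alone can inflate the internal figure of merit.

```python
import numpy as np

def loo_q2(X, y):
    """Leave-one-out cross-validated q^2 for OLS with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    # The hat-matrix diagonal yields LOO residuals without n refits.
    H = A @ np.linalg.pinv(A)
    loo_resid = (y - H @ y) / (1.0 - np.diag(H))
    press = np.sum(loo_resid ** 2)
    # Total sum of squares is taken around the full-sample mean (one common convention).
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def forward_select(X, y, n_vars):
    """Greedy forward selection maximizing LOO q^2 (a stand-in selection method)."""
    chosen = []
    for _ in range(n_vars):
        candidates = [j for j in range(X.shape[1]) if j not in chosen]
        scores = [loo_q2(X[:, chosen + [j]], y) for j in candidates]
        chosen.append(candidates[int(np.argmax(scores))])
    return chosen, loo_q2(X[:, chosen], y)

def permutation_q2(X, y, n_vars, n_perm, rng):
    """Rerun the complete selection on permuted responses and collect q^2."""
    return np.array([forward_select(X, rng.permutation(y), n_vars)[1]
                     for _ in range(n_perm)])

# Synthetic data (hypothetical): only descriptors 0 and 1 carry signal.
rng = np.random.default_rng(0)
n, p = 40, 30
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

selected, q2_real = forward_select(X, y, n_vars=3)
q2_perm = permutation_q2(X, y, n_vars=3, n_perm=50, rng=rng)

print("selected descriptors:", selected)
print("q2 on real responses:", round(float(q2_real), 3))
print("mean q2 on permuted responses:", round(float(q2_perm.mean()), 3))
```

If the q² obtained on the real responses lies well above the permuted distribution, the selected model is unlikely to be a chance correlation. The spread of the permuted values also indicates how optimistic the internal figure of merit can become once the selection step is taken into account.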



Author information

Corresponding author

Correspondence to Knut Baumann.


About this article

Cite this article

Baumann, K., Stiefl, N. Validation tools for variable subset regression. J Comput Aided Mol Des 18, 549–562 (2004). https://doi.org/10.1007/s10822-004-4071-5


  • DOI: https://doi.org/10.1007/s10822-004-4071-5
