Abstract
Regularization Networks and Support Vector Machines are techniques for solving certain problems of learning from examples – in particular, the regression problem of approximating a multivariate function from sparse data. Radial Basis Functions, for example, are a special case of both regularization and Support Vector Machines. We review both formulations in the context of Vapnik's theory of statistical learning, which provides a general foundation for the learning problem, combining functional analysis and statistics. The emphasis is on regression: classification is treated as a special case.
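The regression setting the abstract describes — fitting a multivariate function to sparse, noisy samples by penalized least squares in a reproducing kernel Hilbert space — can be sketched in a few lines with a Gaussian radial basis function kernel. This is an illustrative sketch only, not code from the paper: the helper names, the kernel width `sigma`, and the regularization weight `lam` are arbitrary choices made here for demonstration.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0):
    # Gaussian radial basis function kernel K(x, x') = exp(-||x - x'||^2 / 2 sigma^2)
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_regularization_network(X, y, lam=1e-3, sigma=0.5):
    # Minimize (1/n) * sum_i (y_i - f(x_i))^2 + lam * ||f||_K^2 over the RKHS of K.
    # By the representer theorem the minimizer is f(x) = sum_i c_i K(x, x_i),
    # with coefficients c = (K + lam * n * I)^{-1} y.
    n = len(y)
    K = rbf_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def predict(X_train, c, X_new, sigma=0.5):
    # Evaluate f at new points as a kernel expansion around the training data.
    return rbf_kernel(X_new, X_train, sigma) @ c

# Sparse noisy samples of a smooth multivariate function
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(30, 2))
y = np.sin(np.pi * X[:, 0]) * X[:, 1] + 0.05 * rng.normal(size=30)

c = fit_regularization_network(X, y)
pred = predict(X, c, X)
```

With the squared loss used above this is a Regularization Network (equivalently, kernel ridge regression); swapping the loss for Vapnik's epsilon-insensitive loss would yield Support Vector regression, which is the correspondence the paper develops.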
Evgeniou, T., Pontil, M. & Poggio, T. Regularization Networks and Support Vector Machines. Advances in Computational Mathematics 13, 1–50 (2000). https://doi.org/10.1023/A:1018946025316