Abstract
Generalization performance is a central concern of theoretical research in machine learning. Most previous bounds describing the generalization ability of the empirical risk minimization (ERM) algorithm are based on independent and identically distributed (i.i.d.) samples. To study the generalization performance of the ERM algorithm with dependent observations, we first establish an exponential bound on the rate of relative uniform convergence of the ERM algorithm with exponentially strongly mixing observations; we then derive generalization bounds and prove that the ERM algorithm with exponentially strongly mixing observations is consistent. The main results obtained in this paper not only extend the previously known results for i.i.d. observations to the case of exponentially strongly mixing observations, but also improve earlier results for strongly mixing samples. Because the ERM algorithm is usually very time-consuming and may overfit when the hypothesis space is highly complex, as an application of our main results we also explore a new strategy for implementing the ERM algorithm in a hypothesis space of high complexity.
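For readers unfamiliar with the setting, the following is a minimal formal sketch. The abstract itself fixes no notation, so the symbols below (the mixing coefficients \(\alpha(k)\), the constants \(\bar{\alpha}, c, \gamma\), the loss \(\ell\), and the hypothesis space \(\mathcal{H}\)) follow standard usage in the mixing literature rather than this paper's definitions. A stationary sequence of observations \(\{z_t\}_{t \ge 1}\) is strongly mixing (\(\alpha\)-mixing) if its mixing coefficients

\[
\alpha(k) \;=\; \sup_{n \ge 1}\; \sup_{\substack{A \in \sigma(z_1,\dots,z_n) \\ B \in \sigma(z_{n+k}, z_{n+k+1}, \dots)}} \bigl| P(A \cap B) - P(A)\,P(B) \bigr|
\]

tend to zero as \(k \to \infty\), and exponentially strongly mixing if for some constants \(\bar{\alpha}, c, \gamma > 0\),

\[
\alpha(k) \;\le\; \bar{\alpha}\, \exp\!\bigl(-c\, k^{\gamma}\bigr), \qquad k \ge 1.
\]

Given a sample \(\mathbf{z} = (z_1, \dots, z_n)\) drawn from such a process, the ERM algorithm selects

\[
f_{\mathbf{z}} \;=\; \arg\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \ell(f, z_i),
\]

and the bounds discussed above control, uniformly over \(\mathcal{H}\), the relative deviation between this empirical risk and the expected risk \(\mathbb{E}\,\ell(f, z)\); the i.i.d. case is recovered when \(\alpha(k) \equiv 0\).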
Additional information
Editor: Nicolo Cesa-Bianchi.
Supported by the National 973 Project (2007CB311002), NSFC key project (70501030), NSFC project (10771053), and the Foundation of Hubei Educational Committee (Q200710001).
Cite this article
Zou, B., Li, L. & Xu, Z. The generalization performance of ERM algorithm with strongly mixing observations. Mach Learn 75, 275–295 (2009). https://doi.org/10.1007/s10994-009-5104-z