2018 | OriginalPaper | Chapter

Subsampling for Big Data: Some Recent Advances

Authors: P. Bertail, O. Jelassi, J. Tressou, M. Zetlaoui

Published in: Nonparametric Statistics

Publisher: Springer International Publishing

Abstract

The goal of this contribution is to develop subsampling methods in the framework of big data and to show their feasibility in a simulation study. We argue that using different subsampling distributions with different subsampling sizes provides a great deal of information about the behavior of statistical procedures: subsampling makes it possible to estimate the rate of convergence of different procedures and to construct confidence intervals for general parameters, including the generalization error of a machine learning algorithm.
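
To make the two uses of subsampling named in the abstract concrete, here is a minimal Python sketch. It is a generic illustration of the idea and not the procedure developed in the chapter. All function names are hypothetical. The rate exponent is estimated by a simple log-log regression of the interquartile range of centred subsample statistics on the subsample size, which is a simplification chosen for this example. The confidence interval then rescales the subsample quantiles, assuming a convergence rate of the form n^exponent.

```python
import math
import random
import statistics


def subsample_roots(data, stat, b, n_draws, rng):
    """Sorted values of stat(subsample) - stat(data), subsamples drawn without replacement."""
    full = stat(data)
    return sorted(stat(rng.sample(data, b)) - full for _ in range(n_draws))


def quantile(sorted_vals, p):
    """Nearest-rank empirical quantile of an already sorted list."""
    idx = min(len(sorted_vals) - 1, max(0, math.ceil(p * len(sorted_vals)) - 1))
    return sorted_vals[idx]


def estimate_rate_exponent(data, stat, sizes, n_draws, rng):
    """Estimate a in tau_n = n**a from how the subsample spread shrinks with b.

    Assumption for this sketch: the interquartile range of the centred
    subsample statistics behaves like const * b**(-a), so the negated
    least-squares slope of log(spread) on log(b) estimates a.
    """
    xs, ys = [], []
    for b in sizes:
        roots = subsample_roots(data, stat, b, n_draws, rng)
        spread = quantile(roots, 0.75) - quantile(roots, 0.25)
        xs.append(math.log(b))
        ys.append(math.log(spread))
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return -slope


def subsampling_ci(data, stat, b, exponent, n_draws, rng, level=0.95):
    """Two-sided subsampling interval for the parameter estimated by stat."""
    n = len(data)
    # Rescale the centred subsample statistics by the subsample-level rate.
    roots = [(b ** exponent) * r for r in subsample_roots(data, stat, b, n_draws, rng)]
    alpha = 1 - level
    lo_q = quantile(roots, alpha / 2)
    hi_q = quantile(roots, 1 - alpha / 2)
    full = stat(data)
    # Map the quantiles back to the full-sample scale.
    return full - hi_q / n ** exponent, full - lo_q / n ** exponent


# Illustrative run on simulated Gaussian data (not data from the chapter).
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
exponent = estimate_rate_exponent(data, statistics.fmean, [50, 100, 200, 400], 1000, rng)
lo, hi = subsampling_ci(data, statistics.fmean, 400, exponent, 1000, rng)
full_mean = statistics.fmean(data)
```

For the sample mean, the estimated exponent should come out close to 1/2, the usual root-n rate. Only subsamples of a few hundred points are ever passed to the statistic, which is what makes the approach attractive when evaluating the statistic on the full data set is expensive.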

Metadata
Title: Subsampling for Big Data: Some Recent Advances
Authors: P. Bertail, O. Jelassi, J. Tressou, M. Zetlaoui
Copyright Year: 2018
DOI: https://doi.org/10.1007/978-3-319-96941-1_13
