ABSTRACT
A great deal of effort has been devoted to reducing the risk of spurious scientific discoveries, from the use of sophisticated validation techniques to deep statistical methods for controlling the false discovery rate in multiple hypothesis testing. However, there is a fundamental disconnect between the theoretical results and the practice of data analysis: the theory of statistical inference assumes a fixed collection of hypotheses to be tested, or learning algorithms to be applied, selected non-adaptively before the data are gathered, whereas in practice data are shared and reused, with hypotheses and new analyses generated on the basis of data exploration and the outcomes of previous analyses.
In this work we initiate a principled study of how to guarantee the validity of statistical inference in adaptive data analysis. As an instance of this problem, we propose and investigate the question of estimating the expectations of m adaptively chosen functions on an unknown distribution given n random samples.
We show that, surprisingly, it is possible to accurately estimate a number of expectations that is exponential in n, even when the functions are chosen adaptively. This is an exponential improvement over standard empirical estimators, which are limited to a linear number of estimates. Our result follows from a general technique that, counter-intuitively, involves actively perturbing and coordinating the estimates, using techniques developed for privacy preservation. We give several additional applications of this technique.
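To make the "actively perturbing the estimates" idea concrete, the following is a minimal sketch of answering adaptively chosen statistical queries with Laplace-perturbed empirical means (the Laplace mechanism from differential privacy). The function names, the choice of epsilon, and the noise calibration here are illustrative assumptions, not the exact mechanism or parameters analyzed in the paper, which calibrates noise to the total number of queries via composition theorems.

```python
import numpy as np

def make_noisy_mean_oracle(samples, epsilon=0.5, rng=None):
    """Answer statistical queries phi: X -> [0, 1] with Laplace-perturbed
    empirical means. Illustrative sketch only: a single per-query epsilon
    is assumed; the paper's guarantees require calibrating the total
    privacy budget across all adaptively chosen queries."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = len(samples)

    def answer(phi):
        empirical = np.mean([phi(x) for x in samples])
        # Changing one sample moves a mean of [0, 1]-valued functions by
        # at most 1/n, so Laplace noise of scale 1/(epsilon * n) suffices
        # for epsilon-differential privacy of this single answer.
        return empirical + rng.laplace(scale=1.0 / (epsilon * n))

    return answer

# The analyst may choose each next query adaptively, based on prior answers.
data = np.random.default_rng(1).uniform(size=1000)
oracle = make_noisy_mean_oracle(data, epsilon=0.5)
a1 = oracle(lambda x: float(x > 0.5))   # first query
a2 = oracle(lambda x: float(x > a1))    # second query, chosen using a1
```

The point of the perturbation is that the noisy answers reveal little about any individual sample, which in turn limits how much an adaptively chosen follow-up query can overfit to the particular dataset.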
Index Terms
- Preserving Statistical Validity in Adaptive Data Analysis