ABSTRACT
A great deal of effort has been devoted to reducing the risk of spurious scientific discoveries, from the use of sophisticated validation techniques to deep statistical methods for controlling the false discovery rate in multiple hypothesis testing. However, there is a fundamental disconnect between the theoretical results and the practice of data analysis: the theory of statistical inference assumes a fixed collection of hypotheses to be tested, or learning algorithms to be applied, selected non-adaptively before the data are gathered, whereas in practice data are shared and reused, with hypotheses and new analyses generated on the basis of data exploration and the outcomes of previous analyses.
In this work we initiate a principled study of how to guarantee the validity of statistical inference in adaptive data analysis. As an instance of this problem, we propose and investigate the question of estimating the expectations of m adaptively chosen functions on an unknown distribution given n random samples.
We show that, surprisingly, it is possible to accurately estimate a number of expectations that is exponential in n, even when the functions are chosen adaptively. This is an exponential improvement over standard empirical estimators, which are limited to a linear number of estimates. Our result follows from a general technique that, counter-intuitively, involves actively perturbing and coordinating the estimates, using techniques developed for privacy preservation. We give several additional applications of this technique.
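To make the "actively perturbing the estimates" idea concrete, the following is a minimal sketch of answering adaptively chosen statistical queries with Laplace-perturbed empirical means (the Laplace mechanism from differential privacy). The function names, the choice of epsilon, and the noise calibration here are illustrative assumptions, not the exact mechanism or parameters analyzed in the paper, which calibrates noise to the total number of queries via composition theorems.

```python
import numpy as np

def make_noisy_mean_oracle(samples, epsilon=0.5, rng=None):
    """Answer statistical queries phi: X -> [0, 1] with Laplace-perturbed
    empirical means. Illustrative sketch only: a single per-query epsilon
    is assumed; the paper's guarantees require calibrating the total
    privacy budget across all adaptively chosen queries."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = len(samples)

    def answer(phi):
        empirical = np.mean([phi(x) for x in samples])
        # Changing one sample moves a mean of [0, 1]-valued functions by
        # at most 1/n, so Laplace noise of scale 1/(epsilon * n) suffices
        # for epsilon-differential privacy of this single answer.
        return empirical + rng.laplace(scale=1.0 / (epsilon * n))

    return answer

# The analyst may choose each next query adaptively, based on prior answers.
data = np.random.default_rng(1).uniform(size=1000)
oracle = make_noisy_mean_oracle(data, epsilon=0.5)
a1 = oracle(lambda x: float(x > 0.5))   # first query
a2 = oracle(lambda x: float(x > a1))    # second query, chosen using a1
```

The point of the perturbation is that the noisy answers reveal little about any individual sample, which in turn limits how much an adaptively chosen follow-up query can overfit to the particular dataset.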
Index Terms
- Preserving Statistical Validity in Adaptive Data Analysis