Research article
DOI: 10.1145/2746539.2746580

Preserving Statistical Validity in Adaptive Data Analysis

Published: 14 June 2015

ABSTRACT

A great deal of effort has been devoted to reducing the risk of spurious scientific discoveries, from the use of sophisticated validation techniques to deep statistical methods for controlling the false discovery rate in multiple hypothesis testing. However, there is a fundamental disconnect between these theoretical results and the practice of data analysis: the theory of statistical inference assumes a fixed collection of hypotheses to be tested, or learning algorithms to be applied, selected non-adaptively before the data are gathered, whereas in practice data are shared and reused, with hypotheses and new analyses generated on the basis of data exploration and the outcomes of previous analyses.

In this work we initiate a principled study of how to guarantee the validity of statistical inference in adaptive data analysis. As an instance of this problem, we propose and investigate the question of estimating the expectations of m adaptively chosen functions on an unknown distribution given n random samples.
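
To make the setting concrete, the following minimal simulation sketches the gap between non-adaptive and adaptive use of the same n samples; all names and constants here are our own illustrative choices, not the paper's. An analyst repeatedly asks for the expectation of a [0,1]-valued function, and the naive mechanism answers with the empirical mean over a fixed sample.

    # Minimal sketch of the adaptivity problem (illustrative, not from the paper).
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 1000, 200                            # n samples, d binary features
    sample = rng.integers(0, 2, size=(n, d))    # X = {0,1}^d, so E_P[x_i] = 0.5

    def empirical_mean(phi):
        """Naive estimator: average the query function phi over the fixed sample."""
        return phi(sample).mean()

    # Non-adaptive queries are fine: each coordinate mean concentrates near 0.5.
    # Adaptive reuse overfits: select the coordinates that happened to look
    # largest on this sample, then query their average; the answer is biased
    # upward even though every true expectation is exactly 0.5.
    coord_means = sample.mean(axis=0)
    selected = np.argsort(coord_means)[-10:]    # chosen by looking at the data
    print(empirical_mean(lambda s: s[:, selected].mean(axis=1)))  # well above 0.5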

We show that, surprisingly, there is a way to accurately estimate a number of expectations that is exponential in n, even when the functions are chosen adaptively. This gives an exponential improvement over standard empirical estimators, which are limited to a linear number of estimates. Our result follows from a general technique that, counter-intuitively, involves actively perturbing and coordinating the estimates, using techniques developed for privacy preservation. We give additional applications of this technique to our question.
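
As a rough illustration of that technique, the sketch below answers each query with its empirical mean perturbed by Laplace noise, in the style of differentially private query answering; the function name and the noise calibration are our assumptions for illustration, not the paper's exact mechanism.

    # Hedged sketch: perturb each answer with Laplace noise (illustrative).
    import numpy as np

    rng = np.random.default_rng(1)

    def noisy_answer(sample, phi, epsilon):
        """Empirical mean of phi: X -> [0, 1], plus Laplace noise.

        Changing any one of the n samples moves the empirical mean by at most
        1/n, so noise of scale 1/(n * epsilon) makes a single answer
        epsilon-differentially private; that stability is what limits how much
        an adaptively chosen query can overfit the sample.
        """
        n = len(sample)
        return float(phi(sample).mean() + rng.laplace(scale=1.0 / (n * epsilon)))

Routing every query, including the ones used to select "interesting" functions in the sketch above, through noisy_answer rather than the raw empirical mean is the shape of the approach; the paper's analysis shows how such perturbations can be calibrated so that exponentially many adaptive estimates remain accurate.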


Published in

STOC '15: Proceedings of the forty-seventh annual ACM symposium on Theory of Computing
June 2015, 916 pages
ISBN: 9781450335362
DOI: 10.1145/2746539
Copyright © 2015 ACM
Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

STOC '15 paper acceptance rate: 93 of 347 submissions, 27%. Overall acceptance rate: 1,469 of 4,586 submissions, 32%.
