ABSTRACT
In this paper we present learning models for the class ratio estimation problem, which takes as input an unlabeled set of instances and predicts the proportions of instances in the set belonging to the different classes. This problem has applications in social and commercial data analysis. Existing models for class-ratio estimation however require instance-level supervision. Whereas in domains like politics, and demography, set-level supervision is more common. We present a new method for directly estimating class-ratios using set-level supervision. Another serious limitation in applying these techniques to sensitive domains like health is data privacy. We propose a novel label privacy-preserving mechanism that is well-suited for supervised class ratio estimation and has guarantees for achieving efficient differential privacy, provided the per-class counts are large enough. We derive learning bounds for the estimation with and without privacy constraints, which lead to important insights for the data-publisher. Extensive empirical evaluation shows that our model is more accurate than existing methods and that the proposed privacy mechanism and learning model are well-suited for each other.
Supplemental Material
- R. Bassily, A. Smith, and A. Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pages 464--473. IEEE, 2014. Google ScholarDigital Library
- A. Blum, C. Dwork, F. McSherry, and K. Nissim. Practical privacy: The sulq framework. In PODS, 2005. Google ScholarDigital Library
- K. Chaudhuri and D. Hsu. Sample complexity bounds for differentially private learning. In COLT, 2011.Google Scholar
- K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12:1069--1109, 2011. Google ScholarDigital Library
- R. Chen, N. Mohammed, B. C. M. Fung, B. C. Desai, and L. Xiong. Publishing set-valued data via differential privacy. PVLDB, 4(11), 2011.Google Scholar
- R. Chen, Q. Xiao, Y. Zhang, and J. Xu. Differentially private high-dimensional data publication via sampling-based inference. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 129--138. ACM, 2015. Google ScholarDigital Library
- J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Privacy aware learning. In NIPS. 2012.Google Scholar
- C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3--4):211--407, 2014. Google ScholarDigital Library
- A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723--773, 2012. Google ScholarDigital Library
- A. Gretton, A. J. Smola, J. Huang, M. Schmittfull, K. M. Borgwardt, and B. Schölkopf. Covariate shift and local learning by distribution matching, pages 131--160. MIT Press, Cambridge, MA, USA, 2 2009.Google Scholar
- C. Guttmann, X. Sun, C. Rao, C. Queiroz, and B. I. Rubinstein. On the challenges of balancing privacy and utility of open health data. In Joint Proceedings of the Workshop on AI Problems and Approaches for Intelligent Environments and Workshop on Semantic Cities, pages 43--47. ACM, 2013. Google ScholarDigital Library
- J. Hsu, M. Gaboardi, A. Haeberlen, S. Khanna, A. Narayan, B. C. Pierce, and A. Roth. Differential privacy: An economic method for choosing epsilon. In IEEE 27th Computer Security Foundations Symposium, CSF, 2014. Google ScholarDigital Library
- A. Iyer, S. Nath, and S. Sarawagi. Maximum mean discrepancy for class ratio estimation: Convergence bounds and kernel selection. In ICML, pages 530--538, 2014.Google Scholar
- P. Jain and A. Thakurta. Differentially private learning with kernels. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 118--126, 2013.Google Scholar
- Z. Ji, Z. C. Lipton, and C. Elkan. Differential privacy and machine learning: a survey and review. CoRR, 2014.Google Scholar
- D. Kifer, A. Smith, and A. Thakurta. Private convex empirical risk minimization and high-dimensional regression. JMLR, 1:41, 2012.Google Scholar
- S. Le. Learning via Hilbert space embedding of distributions. PhD thesis, University of Sydney, School of Information Technologies, 2008.Google Scholar
- J. Lee, Y. Wang, and D. Kifer. Maximum likelihood postprocessing for differential privacy under consistency constraints. In ACM SIGKDD, 2015. Google ScholarDigital Library
- Y. Li, K. Swersky, and R. S. Zemel. Generative moment matching networks. In ICML, 2015.Google ScholarDigital Library
- A. Machanavajjhala, D. Kifer, J. M. Abowd, J. Gehrke, and L. Vilhuber. Privacy: Theory meets practice on the map. In ICDE, 2008. Google ScholarDigital Library
- R. Nock, G. Patrini, and A. Friedman. Rademacher observations, private data, and boosting. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 948--956, 2015.Google Scholar
- G. Patrini, R. Nock, T. Caetano, and P. Rivera. (almost) no label no cry. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 190--198, 2014. Google ScholarDigital Library
- N. Quadrianto, A. J. Smola, T. S. Caetano, and Q. V. Le. Estimating labels from label proportions. Journal of Machine Learning Research, 10:2349--2374, 2009. Google ScholarDigital Library
- B. I. P. Rubinstein, P. L. Bartlett, L. Huang, and N. Taft. Learning in a large function space: Privacy-preserving mechanisms for svm learning. Journal of Privacy and Confidentiality, 4(1), 2012.Google ScholarCross Ref
- S. Rüping. SVM classifier estimation from group probabilities. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pages 911--918, 2010.Google ScholarDigital Library
- D.-T. Vo and Y. Zhang. Target-dependent twitter sentiment classification with rich automatic features. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI'15, pages 1347--1353. AAAI Press, 2015. Google ScholarDigital Library
- Y. Xin and T. Jaakkola. Controlling privacy in recommender systems. In Advances in Neural Information Processing Systems, pages 2618--2626, 2014. Google ScholarDigital Library
- F. X. Yu, D. Liu, S. Kumar, T. Jebara, and S. Chang. ∝SVM for learning with label proportions. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16--21 June 2013, pages 504--512, 2013.Google Scholar
- K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang. Domain adaptation under target and conditional shift. In ICML, 2013.Google ScholarDigital Library
- S. Zhou, J. D. Lafferty, and L. A. Wasserman. Compressed and privacy-sensitive sparse regression. IEEE Transactions on Information Theory, 55(2):846--866, 2009. Google ScholarDigital Library
Index Terms
- Privacy-preserving Class Ratio Estimation
Recommendations
Multi-instance learning with any hypothesis class
In the supervised learning setting termed Multiple-Instance Learning (MIL), the examples are bags of instances, and the bag label is a function of the labels of its instances. Typically, this function is the Boolean OR. The learner observes a sample of ...
Semi-supervised learning method based on predefined evenly-distributed class centroids
AbstractCompared to supervised learning, semi-supervised learning reduces the dependence of deep learning on a large number of labeled samples. In this work, we use a small number of labeled samples and perform data augmentation on unlabeled samples to ...
Statistical analysis of kernel-based least-squares density-ratio estimation
The ratio of two probability densities can be used for solving various machine learning tasks such as covariate shift adaptation (importance sampling), outlier detection (likelihood-ratio test), feature selection (mutual information), and conditional ...
Comments