research-article

Privacy-preserving Class Ratio Estimation

Authors:
Arun Shankar Iyer

IIT Bombay, Mumbai, India

IIT Bombay, Mumbai, India
View Profile

,
J. Saketha Nath

IIT Bombay, Mumbai, India

IIT Bombay, Mumbai, India
View Profile

,
Sunita Sarawagi

IIT Bombay, Mumbai, India

IIT Bombay, Mumbai, India
View Profile

KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data MiningAugust 2016Pages 925–934https://doi.org/10.1145/2939672.2939806

Published:13 August 2016Publication History

KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Pages 925–934

ABSTRACT

In this paper we present learning models for the class ratio estimation problem, which takes as input an unlabeled set of instances and predicts the proportions of instances in the set belonging to the different classes. This problem has applications in social and commercial data analysis. Existing models for class-ratio estimation however require instance-level supervision. Whereas in domains like politics, and demography, set-level supervision is more common. We present a new method for directly estimating class-ratios using set-level supervision. Another serious limitation in applying these techniques to sensitive domains like health is data privacy. We propose a novel label privacy-preserving mechanism that is well-suited for supervised class ratio estimation and has guarantees for achieving efficient differential privacy, provided the per-class counts are large enough. We derive learning bounds for the estimation with and without privacy constraints, which lead to important insights for the data-publisher. Extensive empirical evaluation shows that our model is more accurate than existing methods and that the proposed privacy mechanism and learning model are well-suited for each other.

Supplemental Material

kdd2016_iyer_ratio_estimation_01-acm.mp4

mp4

360.1 MB

Download

References

R. Bassily, A. Smith, and A. Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pages 464--473. IEEE, 2014. Google ScholarDigital Library
A. Blum, C. Dwork, F. McSherry, and K. Nissim. Practical privacy: The sulq framework. In PODS, 2005. Google ScholarDigital Library
K. Chaudhuri and D. Hsu. Sample complexity bounds for differentially private learning. In COLT, 2011.Google Scholar
K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12:1069--1109, 2011. Google ScholarDigital Library
R. Chen, N. Mohammed, B. C. M. Fung, B. C. Desai, and L. Xiong. Publishing set-valued data via differential privacy. PVLDB, 4(11), 2011.Google Scholar
R. Chen, Q. Xiao, Y. Zhang, and J. Xu. Differentially private high-dimensional data publication via sampling-based inference. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 129--138. ACM, 2015. Google ScholarDigital Library
J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Privacy aware learning. In NIPS. 2012.Google Scholar
C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3--4):211--407, 2014. Google ScholarDigital Library
A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723--773, 2012. Google ScholarDigital Library
A. Gretton, A. J. Smola, J. Huang, M. Schmittfull, K. M. Borgwardt, and B. Schölkopf. Covariate shift and local learning by distribution matching, pages 131--160. MIT Press, Cambridge, MA, USA, 2 2009.Google Scholar
C. Guttmann, X. Sun, C. Rao, C. Queiroz, and B. I. Rubinstein. On the challenges of balancing privacy and utility of open health data. In Joint Proceedings of the Workshop on AI Problems and Approaches for Intelligent Environments and Workshop on Semantic Cities, pages 43--47. ACM, 2013. Google ScholarDigital Library
J. Hsu, M. Gaboardi, A. Haeberlen, S. Khanna, A. Narayan, B. C. Pierce, and A. Roth. Differential privacy: An economic method for choosing epsilon. In IEEE 27th Computer Security Foundations Symposium, CSF, 2014. Google ScholarDigital Library
A. Iyer, S. Nath, and S. Sarawagi. Maximum mean discrepancy for class ratio estimation: Convergence bounds and kernel selection. In ICML, pages 530--538, 2014.Google Scholar
P. Jain and A. Thakurta. Differentially private learning with kernels. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 118--126, 2013.Google Scholar
Z. Ji, Z. C. Lipton, and C. Elkan. Differential privacy and machine learning: a survey and review. CoRR, 2014.Google Scholar
D. Kifer, A. Smith, and A. Thakurta. Private convex empirical risk minimization and high-dimensional regression. JMLR, 1:41, 2012.Google Scholar
S. Le. Learning via Hilbert space embedding of distributions. PhD thesis, University of Sydney, School of Information Technologies, 2008.Google Scholar
J. Lee, Y. Wang, and D. Kifer. Maximum likelihood postprocessing for differential privacy under consistency constraints. In ACM SIGKDD, 2015. Google ScholarDigital Library
Y. Li, K. Swersky, and R. S. Zemel. Generative moment matching networks. In ICML, 2015.Google ScholarDigital Library
A. Machanavajjhala, D. Kifer, J. M. Abowd, J. Gehrke, and L. Vilhuber. Privacy: Theory meets practice on the map. In ICDE, 2008. Google ScholarDigital Library
R. Nock, G. Patrini, and A. Friedman. Rademacher observations, private data, and boosting. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 948--956, 2015.Google Scholar
G. Patrini, R. Nock, T. Caetano, and P. Rivera. (almost) no label no cry. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 190--198, 2014. Google ScholarDigital Library
N. Quadrianto, A. J. Smola, T. S. Caetano, and Q. V. Le. Estimating labels from label proportions. Journal of Machine Learning Research, 10:2349--2374, 2009. Google ScholarDigital Library
B. I. P. Rubinstein, P. L. Bartlett, L. Huang, and N. Taft. Learning in a large function space: Privacy-preserving mechanisms for svm learning. Journal of Privacy and Confidentiality, 4(1), 2012.Google ScholarCross Ref
S. Rüping. SVM classifier estimation from group probabilities. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pages 911--918, 2010.Google ScholarDigital Library
D.-T. Vo and Y. Zhang. Target-dependent twitter sentiment classification with rich automatic features. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI'15, pages 1347--1353. AAAI Press, 2015. Google ScholarDigital Library
Y. Xin and T. Jaakkola. Controlling privacy in recommender systems. In Advances in Neural Information Processing Systems, pages 2618--2626, 2014. Google ScholarDigital Library
F. X. Yu, D. Liu, S. Kumar, T. Jebara, and S. Chang. ∝SVM for learning with label proportions. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16--21 June 2013, pages 504--512, 2013.Google Scholar
K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang. Domain adaptation under target and conditional shift. In ICML, 2013.Google ScholarDigital Library
S. Zhou, J. D. Lafferty, and L. A. Wasserman. Compressed and privacy-sensitive sparse regression. IEEE Transactions on Information Theory, 55(2):846--866, 2009. Google ScholarDigital Library

Index Terms

Privacy-preserving Class Ratio Estimation
1. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Theory of database privacy and security
    2. Machine learning theory
      1. Kernel methods

Recommendations

Multi-instance learning with any hypothesis class

In the supervised learning setting termed Multiple-Instance Learning (MIL), the examples are bags of instances, and the bag label is a function of the labels of its instances. Typically, this function is the Boolean OR. The learner observes a sample of ...
Read More
Semi-supervised learning method based on predefined evenly-distributed class centroids
Abstract
Compared to supervised learning, semi-supervised learning reduces the dependence of deep learning on a large number of labeled samples. In this work, we use a small number of labeled samples and perform data augmentation on unlabeled samples to ...
Read More
Statistical analysis of kernel-based least-squares density-ratio estimation

The ratio of two probability densities can be used for solving various machine learning tasks such as covariate shift adaptation (importance sampling), outlier detection (likelihood-ratio test), feature selection (mutual information), and conditional ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2016
2176 pages
ISBN:9781450342322
DOI:10.1145/2939672
General Chairs:
Balaji Krishnapuram
IBM
,
Mohak Shah
Bosch
,
Program Chairs:
Alex Smola
Amazon
,
Charu Aggarwal
IBM
,
Dou Shen
Baidu
,
Rajeev Rastogi
Amazon
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 13 August 2016
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
differential privacy
kernel method
learning theory
maximum mean discrepancy
statistical estimation
Qualifiers
- research-article
Conference

Acceptance Rates
KDD '16 Paper Acceptance Rate66of1,115submissions,6%Overall Acceptance Rate1,133of8,635submissions,13%
More
Upcoming Conference
KDD '24

Sponsor:

sigkdd

sigkdd

The 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 25 - 29, 2024

Barcelona , Spain
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 6
  Total Citations
  View Citations
- 236
  Total Downloads
- Downloads (Last 12 months)16
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Privacy-preserving Class Ratio Estimation

KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

ABSTRACT

Supplemental Material

References

Cited By

Index Terms

Recommendations

Multi-instance learning with any hypothesis class

Semi-supervised learning method based on predefined evenly-distributed class centroids

Statistical analysis of kernel-based least-squares density-ratio estimation