research-article

The cost of privacy: destruction of data-mining utility in anonymized data publishing

Authors:
Justin Brickell, The University of Texas at Austin, Austin, TX, USA
Vitaly Shmatikov, The University of Texas at Austin, Austin, TX, USA
KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, August 2008, Pages 70–78, https://doi.org/10.1145/1401890.1401904

Published: 24 August 2008

ABSTRACT

Re-identification is a major privacy threat to public datasets containing individual records. Many privacy protection algorithms rely on generalization and suppression of "quasi-identifier" attributes such as ZIP code and birthdate. Their objective is usually syntactic sanitization: for example, k-anonymity requires that each "quasi-identifier" tuple appear in at least k records, while l-diversity requires that the distribution of sensitive attributes for each quasi-identifier have high entropy. The utility of sanitized data is also measured syntactically, by the number of generalization steps applied or the number of records with the same quasi-identifier. In this paper, we ask whether generalization and suppression of quasi-identifiers offer any benefits over trivial sanitization which simply separates quasi-identifiers from sensitive attributes. Previous work showed that k-anonymous databases can be useful for data mining, but k-anonymization does not guarantee any privacy. By contrast, we measure the tradeoff between privacy (how much can the adversary learn from the sanitized records?) and utility, measured as accuracy of data-mining algorithms executed on the same sanitized records.
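The two syntactic conditions named above can be made concrete with a minimal sketch. It assumes records are plain dictionaries and uses the common entropy formulation of l-diversity (entropy of the sensitive attribute in each quasi-identifier group at least log l). The function names, attribute names, and sample records are hypothetical illustrations, not taken from the paper.

```python
from collections import Counter, defaultdict
from math import log


def group_by_quasi_identifier(records, qi_attrs):
    """Map each quasi-identifier tuple to the list of records sharing it."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[a] for a in qi_attrs)].append(rec)
    return groups


def is_k_anonymous(records, qi_attrs, k):
    """True if every quasi-identifier tuple appears in at least k records."""
    groups = group_by_quasi_identifier(records, qi_attrs)
    return all(len(g) >= k for g in groups.values())


def entropy(values):
    """Shannon entropy (natural log) of the empirical distribution of values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log(c / n) for c in counts.values())


def is_entropy_l_diverse(records, qi_attrs, sensitive, l):
    """True if each group's sensitive-attribute entropy is at least log(l)."""
    groups = group_by_quasi_identifier(records, qi_attrs)
    threshold = log(l) - 1e-12  # small tolerance for floating-point rounding
    return all(
        entropy([r[sensitive] for r in g]) >= threshold
        for g in groups.values()
    )


# Hypothetical, already-generalized records (ZIP and birth year coarsened).
records = [
    {"zip": "787**", "birth_year": "197*", "diagnosis": "flu"},
    {"zip": "787**", "birth_year": "197*", "diagnosis": "asthma"},
    {"zip": "100**", "birth_year": "198*", "diagnosis": "flu"},
    {"zip": "100**", "birth_year": "198*", "diagnosis": "flu"},
]
```

On this sample, the table is 2-anonymous but not entropy 2-diverse: both records in the second group share the same diagnosis, so an adversary who knows the quasi-identifier learns the sensitive value despite k-anonymity holding.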

For our experimental evaluation, we use the same datasets from the UCI machine learning repository as were used in previous research on generalization and suppression. Our results demonstrate that even modest privacy gains require almost complete destruction of the data-mining utility. In most cases, trivial sanitization provides equivalent utility and better privacy than k-anonymity, l-diversity, and similar methods based on generalization and suppression.
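As a rough illustration of the baseline, the sketch below shows one possible reading of "trivial sanitization": the quasi-identifier columns and the sensitive columns are released as two separate tables, each shuffled independently so that row order cannot be used to re-link them. This is an assumption made for illustration; the abstract does not specify the exact procedure, and the function name, parameters, and sample data are hypothetical.

```python
import random


def trivially_sanitize(records, qi_attrs, sensitive_attrs, seed=None):
    """Split records into a quasi-identifier table and a sensitive table.

    Each table is shuffled independently, so no released row carries both
    a quasi-identifier and a sensitive value, and row position reveals no link.
    """
    rng = random.Random(seed)
    qi_table = [{a: r[a] for a in qi_attrs} for r in records]
    sensitive_table = [{a: r[a] for a in sensitive_attrs} for r in records]
    rng.shuffle(qi_table)
    rng.shuffle(sensitive_table)
    return qi_table, sensitive_table


# Hypothetical input records with ungeneralized quasi-identifiers.
raw_records = [
    {"zip": "78701", "birthdate": "1975-03-02", "diagnosis": "flu"},
    {"zip": "78705", "birthdate": "1979-11-20", "diagnosis": "asthma"},
    {"zip": "10027", "birthdate": "1984-06-14", "diagnosis": "flu"},
]

qi_table, sensitive_table = trivially_sanitize(
    raw_records, ["zip", "birthdate"], ["diagnosis"], seed=0
)
```

Under this reading, each released table preserves the marginal distribution of its own attributes, while the joint relationship between quasi-identifiers and sensitive values is not published at all.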

Published in
KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2008, 1116 pages
ISBN: 9781605581934
DOI: 10.1145/1401890
General Chair: Ying Li (Microsoft adCenter Labs)
Program Chairs: Bing Liu (University of Illinois at Chicago), Sunita Sarawagi (Indian Institute of Technology, Bombay)
Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
anonymity
data mining
privacy
utility
Acceptance Rates
KDD '08 Paper Acceptance Rate: 118 of 593 submissions, 20%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%