ABSTRACT
Re-identification is a major privacy threat to public datasets containing individual records. Many privacy protection algorithms rely on generalization and suppression of "quasi-identifier" attributes such as ZIP code and birthdate. Their objective is usually syntactic sanitization: for example, k-anonymity requires that each "quasi-identifier" tuple appear in at least k records, while l-diversity requires that the distribution of sensitive attributes for each quasi-identifier have high entropy. The utility of sanitized data is also measured syntactically, by the number of generalization steps applied or the number of records with the same quasi-identifier. In this paper, we ask whether generalization and suppression of quasi-identifiers offer any benefits over trivial sanitization which simply separates quasi-identifiers from sensitive attributes. Previous work showed that k-anonymous databases can be useful for data mining, but k-anonymization does not guarantee any privacy. By contrast, we measure the tradeoff between privacy (how much can the adversary learn from the sanitized records?) and utility, measured as accuracy of data-mining algorithms executed on the same sanitized records.
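The two syntactic conditions named above can be made concrete with a minimal sketch. It assumes records are plain dictionaries and reads "high entropy" as the entropy l-diversity condition (the entropy of the sensitive values in each quasi-identifier group is at least log l). The function names, attribute names, and toy records are illustrative assumptions, not taken from the paper.

```python
from collections import Counter, defaultdict
from math import log

def group_by_qi(records, qi_attrs):
    """Group records (dicts) by their quasi-identifier tuple."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[a] for a in qi_attrs)].append(r)
    return groups

def is_k_anonymous(records, qi_attrs, k):
    """True if every quasi-identifier tuple appears in at least k records."""
    return all(len(g) >= k for g in group_by_qi(records, qi_attrs).values())

def entropy(values):
    """Shannon entropy (natural log) of the empirical distribution of values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log(c / n) for c in counts.values())

def is_entropy_l_diverse(records, qi_attrs, sensitive, l):
    """True if each group's sensitive-value entropy is at least log(l).

    A small tolerance absorbs floating-point rounding at the boundary.
    """
    groups = group_by_qi(records, qi_attrs)
    return all(
        entropy([r[sensitive] for r in g]) >= log(l) - 1e-12
        for g in groups.values()
    )

# Hypothetical, already-generalized toy table.
records = [
    {"zip": "787**", "age": "20-29", "disease": "flu"},
    {"zip": "787**", "age": "20-29", "disease": "cancer"},
    {"zip": "787**", "age": "30-39", "disease": "flu"},
    {"zip": "787**", "age": "30-39", "disease": "flu"},
]
qi = ["zip", "age"]

print(is_k_anonymous(records, qi, 2))                   # each group has 2 records
print(is_entropy_l_diverse(records, qi, "disease", 2))  # second group has one value only
```

In this toy table the second group is 2-anonymous yet every member shares the same sensitive value, which is the kind of gap between syntactic anonymity and actual privacy that the abstract points to.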
For our experimental evaluation, we use the same datasets from the UCI machine learning repository as were used in previous research on generalization and suppression. Our results demonstrate that even modest privacy gains require almost complete destruction of the data-mining utility. In most cases, trivial sanitization provides equivalent utility and better privacy than k-anonymity, l-diversity, and similar methods based on generalization and suppression.
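As a rough illustration of the trivial sanitization baseline, the sketch below follows the abstract's description of separating quasi-identifiers from sensitive attributes. It releases the two column sets as separate tables and shuffles each one independently so that row position no longer links them. The shuffling step, the function name `trivially_sanitize`, and the toy records are assumptions made for this example; the paper's exact procedure may differ.

```python
import random

def trivially_sanitize(records, qi_attrs, sensitive_attrs, seed=None):
    """Split records into a quasi-identifier table and a sensitive table.

    Each table is shuffled independently (an assumption of this sketch)
    so that a row in one table cannot be matched to a row in the other
    by its position.
    """
    rng = random.Random(seed)
    qi_table = [{a: r[a] for a in qi_attrs} for r in records]
    sensitive_table = [{a: r[a] for a in sensitive_attrs} for r in records]
    rng.shuffle(qi_table)
    rng.shuffle(sensitive_table)
    return qi_table, sensitive_table

# Hypothetical raw records, not drawn from the UCI datasets.
raw = [
    {"zip": "78701", "birth_year": 1980, "disease": "flu"},
    {"zip": "78702", "birth_year": 1975, "disease": "cancer"},
    {"zip": "78703", "birth_year": 1990, "disease": "flu"},
]

qi_table, sensitive_table = trivially_sanitize(
    raw, ["zip", "birth_year"], ["disease"], seed=0
)
print(qi_table)         # quasi-identifiers only, exact values kept
print(sensitive_table)  # sensitive values only, overall distribution kept
```

Under this reading, no generalization or suppression is applied: each table keeps its exact values, and only the per-record association between quasi-identifiers and sensitive attributes is withheld.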
Index Terms
- The cost of privacy: destruction of data-mining utility in anonymized data publishing