DOI: 10.1145/1401890.1401904
research-article

The cost of privacy: destruction of data-mining utility in anonymized data publishing

Published: 24 August 2008

ABSTRACT

Re-identification is a major privacy threat to public datasets containing individual records. Many privacy protection algorithms rely on generalization and suppression of "quasi-identifier" attributes such as ZIP code and birthdate. Their objective is usually syntactic sanitization: for example, k-anonymity requires that each "quasi-identifier" tuple appear in at least k records, while l-diversity requires that the distribution of sensitive attributes for each quasi-identifier have high entropy. The utility of sanitized data is also measured syntactically, by the number of generalization steps applied or the number of records with the same quasi-identifier. In this paper, we ask whether generalization and suppression of quasi-identifiers offer any benefits over trivial sanitization which simply separates quasi-identifiers from sensitive attributes. Previous work showed that k-anonymous databases can be useful for data mining, but k-anonymization does not guarantee any privacy. By contrast, we measure the tradeoff between privacy (how much can the adversary learn from the sanitized records?) and utility, measured as accuracy of data-mining algorithms executed on the same sanitized records.
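The two syntactic definitions above can be checked mechanically. The following is a minimal sketch, not the paper's code: the record layout, the field names (`zip`, `birth`, `diagnosis`), and the helper names are all hypothetical. It assumes the common entropy formulation of l-diversity, in which the entropy of the sensitive attribute within each quasi-identifier group must be at least log(l); the abstract itself only says "high entropy".

```python
from collections import Counter, defaultdict
from math import log


def group_by_quasi_identifier(records, qi_fields):
    """Group records by their quasi-identifier tuple."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[f] for f in qi_fields)
        groups[key].append(rec)
    return groups


def is_k_anonymous(records, qi_fields, k):
    """True if every quasi-identifier tuple appears in at least k records."""
    groups = group_by_quasi_identifier(records, qi_fields)
    return all(len(group) >= k for group in groups.values())


def entropy(values):
    """Shannon entropy (natural log) of the empirical distribution of values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log(c / n) for c in counts.values())


def is_entropy_l_diverse(records, qi_fields, sensitive, l):
    """True if each group's sensitive-attribute entropy is at least log(l).

    Assumption: entropy l-diversity; a small tolerance absorbs float rounding.
    """
    groups = group_by_quasi_identifier(records, qi_fields)
    return all(
        entropy([r[sensitive] for r in group]) >= log(l) - 1e-12
        for group in groups.values()
    )


# Hypothetical, already-generalized records (illustrative data only).
records = [
    {"zip": "787**", "birth": "196*", "diagnosis": "flu"},
    {"zip": "787**", "birth": "196*", "diagnosis": "asthma"},
    {"zip": "100**", "birth": "197*", "diagnosis": "flu"},
    {"zip": "100**", "birth": "197*", "diagnosis": "flu"},
]

print(is_k_anonymous(records, ["zip", "birth"], 2))
print(is_entropy_l_diverse(records, ["zip", "birth"], "diagnosis", 2))
```

In this toy table every quasi-identifier tuple appears twice, so the table is 2-anonymous, yet the second group contains a single diagnosis value and therefore fails the entropy test for l = 2. This illustrates why the two criteria are stated separately.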

For our experimental evaluation, we use the same datasets from the UCI machine learning repository as were used in previous research on generalization and suppression. Our results demonstrate that even modest privacy gains require almost complete destruction of the data-mining utility. In most cases, trivial sanitization provides equivalent utility and better privacy than k-anonymity, l-diversity, and similar methods based on generalization and suppression.
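The abstract describes trivial sanitization only as separating quasi-identifiers from sensitive attributes. One possible reading, sketched here under that assumption, is to release the two projections as independent tables with the row order shuffled so that rows can no longer be linked. The function name, field names, and the shuffling step are illustrative choices, not the paper's stated procedure.

```python
import random


def trivial_sanitize(records, qi_fields, sensitive_fields, seed=None):
    """Split records into a quasi-identifier table and a sensitive table.

    Assumption: each projection is shuffled independently so that row
    position does not reveal which sensitive value belonged to which
    quasi-identifier tuple.
    """
    rng = random.Random(seed)
    qi_table = [{f: r[f] for f in qi_fields} for r in records]
    sensitive_table = [{f: r[f] for f in sensitive_fields} for r in records]
    rng.shuffle(qi_table)
    rng.shuffle(sensitive_table)
    return qi_table, sensitive_table


# Hypothetical records (illustrative data only).
records = [
    {"zip": "78705", "birth": "1964-02-11", "diagnosis": "flu"},
    {"zip": "78712", "birth": "1969-07-30", "diagnosis": "asthma"},
    {"zip": "10027", "birth": "1971-01-05", "diagnosis": "flu"},
]

qi_table, sensitive_table = trivial_sanitize(
    records, ["zip", "birth"], ["diagnosis"], seed=0
)
print(qi_table)
print(sensitive_table)
```

Under this reading, the marginal distribution of each table is preserved exactly, while the link between a quasi-identifier tuple and a sensitive value is removed.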


Published in

KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2008
1116 pages
ISBN: 9781605581934
DOI: 10.1145/1401890
General Chair: Ying Li
Program Chairs: Bing Liu, Sunita Sarawagi

Copyright © 2008 ACM

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

KDD '08 Paper Acceptance Rate: 118 of 593 submissions, 20%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
