skip to main content
research-article

Achieving anonymity via clustering

Published:02 July 2010Publication History
Skip Abstract Section

Abstract

Publishing data for analysis from a table containing personal records, while maintaining individual privacy, is a problem of increasing importance today. The traditional approach of deidentifying records is to remove identifying fields such as social security number, name, etc. However, recent research has shown that a large fraction of the U.S. population can be identified using nonkey attributes (called quasi-identifiers) such as date of birth, gender, and zip code. The k-anonymity model protects privacy via requiring that nonkey attributes that leak information are suppressed or generalized so that, for every record in the modified table, there are at least k−1 other records having exactly the same values for quasi-identifiers. We propose a new method for anonymizing data records, where quasi-identifiers of data records are first clustered and then cluster centers are published. To ensure privacy of the data records, we impose the constraint that each cluster must contain no fewer than a prespecified number of data records. This technique is more general since we have a much larger choice for cluster centers than k-anonymity. In many cases, it lets us release a lot more information without compromising privacy. We also provide constant factor approximation algorithms to come up with such a clustering. This is the first set of algorithms for the anonymization problem where the performance is independent of the anonymity parameter k. We further observe that a few outlier points can significantly increase the cost of anonymization. Hence, we extend our algorithms to allow an ϵ fraction of points to remain unclustered, that is, deleted from the anonymized publication. Thus, by not releasing a small fraction of the database records, we can ensure that the data published for analysis has less distortion and hence is more useful. Our approximation algorithms for new clustering objectives are of independent interest and could be applicable in other clustering scenarios as well.

References

  1. Aggarwal, G., Feder, T., Kenthapadi, K., Motwani, R., Panigrahy, R., Thomas, D., and Zhu, A. 2005. Approximation algorithms for k-anonymity. J. Privacy Technol., Number 20051120001.Google ScholarGoogle Scholar
  2. Bar-Ilan, J., Kortsarz, G., and Peleg, D. 1993. How to allocate network centers. J. Algor. 15, 385--415. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Bayardo, R. J. and Agrawal, R. 2005. Data privacy through optimal k-anonymization. In Proceedings of the International Conference on Data Engineering. 217--228. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Charikar, M., Khuller, S., Mount, D., and Narasimhan, G. 2001. Algorithms for facility location with outliers. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms. 642--651. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Chawla, S., Dwork, C., McSherry, F., Smith, A., and Wee, H. 2005. Toward privacy in public databases. In Proceedings of the Theory of Cryptography Conference. 363--385. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Garey, M. R. and Johnson, D. S. 1990. Computers and Intractability, A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Guha, S., Meyerson, A., and Munagala, K. 2000. Hierarchical placement and network design problems. In Proceedings of the IEEE Symposium on Foundations of Computer Science. 603--612. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Hochbaum, D. and Shmoys, D. 1985. A best possible approximation algorithm for the k-center problem. Math. Oper. Res. 10, 180--184.Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Jain, K. and Vazirani, V. V. 1999. Primal-Dual approximation algorithms for metric facility location and k-median problems. In Proceedings of the IEEE Symposium on Foundations of Computer Science. 2--13. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Karger, D. and Minkoff, M. 2000. Building steiner trees with incomplete global knowledge. In Proceedings of the IEEE Symposium on Foundations of Computer Science. 613--623. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Khuller, S. and Sussmann, Y. 2000. The capacitated k-center problem. SIAM J. Discr. Math. 13, 3, 403--418. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. LeFevre, K., DeWitt, D., and Ramakrishnan, R. 2005. Incognito: Efficient full-domain k-anonymity. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 49--60. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Machanavajjhala, A., Kifer, D., Gehrke, J., and Venkitasubramaniam, M. 2006. l-Diversity: Privacy beyond k-anonymity. In Proceedings of the International Conference on Data Engineering. 24. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Meyerson, A. and Williams, R. 2004. On the complexity of optimal k-anonymity. In Proceedings of the Symposium on Principles of Database Systems. 223--228. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Samarati, P. 2001. Protecting respondent's privacy in microdata release. IEEE Trans. Knowl. Data Engin. 13, 6, 1010--1027. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Sweeney, L. 2000. Uniqueness of simple demographics in the u.s. population. LIDAP-WP4. Carnegie Mellon University, Laboratory for International Data Privacy, Pittsburgh, PA.Google ScholarGoogle Scholar
  17. Time. 1997. The death of privacy.Google ScholarGoogle Scholar

Index Terms

  1. Achieving anonymity via clustering

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image ACM Transactions on Algorithms
        ACM Transactions on Algorithms  Volume 6, Issue 3
        June 2010
        304 pages
        ISSN:1549-6325
        EISSN:1549-6333
        DOI:10.1145/1798596
        Issue’s Table of Contents

        Copyright © 2010 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 2 July 2010
        • Accepted: 1 July 2008
        • Revised: 1 May 2008
        • Received: 1 August 2007
        Published in talg Volume 6, Issue 3

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article
        • Research
        • Refereed

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader