Abstract
Publishing data about individuals without revealing sensitive information about them is an important problem. In recent years, a new definition of privacy called k-anonymity has gained popularity. In a k-anonymized dataset, each record is indistinguishable from at least k − 1 other records with respect to certain identifying attributes.
In this article, we show using two simple attacks that a k-anonymized dataset has some subtle but severe privacy problems. First, an attacker can discover the values of sensitive attributes when there is little diversity in those sensitive attributes. This is a known problem. Second, attackers often have background knowledge, and we show that k-anonymity does not guarantee privacy against attackers using background knowledge. We give a detailed analysis of these two attacks, and we propose a novel and powerful privacy criterion called ℓ-diversity that can defend against such attacks. In addition to building a formal foundation for ℓ-diversity, we show in an experimental evaluation that ℓ-diversity is practical and can be implemented efficiently.
- Adam, N. R. and Wortmann, J. C. 1989. Security-control methods for statistical databases: A comparative study. ACM Comput. Surv. 21, 4, 515--556. Google ScholarDigital Library
- Aggarwal, C. C. and Yu, P. S. 2004. A condensation approach to privacy preserving data mining. In Proceedings of the International Conference on Extending Database Technology (EDBT). 183--199.Google Scholar
- Aggarwal, G., Feder, T., Kenthapadi, K., Motwani, R., Panigrahy, R., Thomas, D., and Zhu, A. 2004. k-anonymity: Algorithms and hardness. Tech. rep., Stanford University.Google Scholar
- Agrawal, D. and Aggarwal, C. C. 2001. On the design and quantifiaction of privacy preserving data mining algorithms. In Proceedings of the International Conference on Principles of Database Systems (PODS). Google ScholarDigital Library
- Agrawal, R., Bayardo, R. J., Faloutsos, C., Kiernan, J., Rantzau, R., and Srikant, R. 2004. Auditing compliance with a hippocratic database. In Proceedings of the International Conference on Very Large Databases (VLDB). 516--527. Google ScholarDigital Library
- Agrawal, R., Evfimievski, A. V., and Srikant, R. 2003. Information sharing across private databases. In Proceedings of the SIGMOD Conference. 86--97. Google ScholarDigital Library
- Agrawal, R., Kiernan, J., Srikant, R., and Xu, Y. 2002. Hippocratic databases. In Proceedings of the International Conference on Very Large Databases (VLDB). 143--154. Google ScholarDigital Library
- Agrawal, R. and Srikant, R. 1994. Fast Algorithms for Mining Association Rules in Large Databases. In Proceedings of the International Conference on Very Large Databases (VLDB). Google ScholarDigital Library
- Agrawal, R. and Srikant, R. 2000. Privacy preserving data mining. In Proceedings of the 19th ACM SIGMOD Conference on Management of Data. Google ScholarDigital Library
- Agrawal, R., Srikant, R., and Thomas, D. 2004. Privacy preserving OLAP. In Proceedings of the 23th ACM SIGMOD Conference on Management of Data. Google ScholarDigital Library
- Bacchus, F., Grove, A. J., Halpern, J. Y., and Koller, D. 1996. From statistical knowledge bases to degrees of belief. A.I. 87, 1--2. Google ScholarDigital Library
- Bayardo, R. J. and Agrawal, R. 2005. Data privacy through optimal k-anonymization. In Proceedings of the International Conference on Data Engineering (ICDE'05). Google ScholarDigital Library
- Beaver, D. 1997. Commodity-based cryptography (extended abstract). In Proceedings of the 29th ACM Symposium on Theory of Computing (STOC'97). 446--455. Google ScholarDigital Library
- Beaver, D. 1998. Server-assisted cryptography. In Proceedings of the 1998 Workshop on New Security Paradigms (NSPW'98). 92--106. Google ScholarDigital Library
- Beck, L. 1980. A security mechanism for statistical database. ACM Trans. Datab. Syst. 5, 3, 316--338. Google ScholarDigital Library
- Ben-Or, M., Goldwasser, S., and Wigderson, A. 1988. Completeness theorems for non-cryptographic fault-tolerant distributed computation. In Proceedings of the 20th ACM Symposium on Theory of Computing (STOC'88). 1--10. Google ScholarDigital Library
- Ben-Tal, A., Charnes, A., and Teboulle, M. 1989. Entropic means. J. Mathemat. Anal. Appl. 139, 2, 537--551.Google ScholarCross Ref
- Blum, A., Dwork, C., McSherry, F., and Nissim, K. 2005. Practical privacy: The SuLQ framework. In Proceedings of the International Conference on Principles of Data Systems (PODS). Google ScholarDigital Library
- Chaum, D., Crepeau, C., and Damgard, I. 1988. Multiparty unconditionally secure protocols. In Proceedings of the 20th ACM Symposium on Theory of Computing (STOC'88). 11--19. Google ScholarDigital Library
- Chawla, S., Dwork, C., McSherry, F., Smith, A., and Wee, H. 2005. Toward privacy in public databases. In Proceedings of the Tactical Communications Conference (TCC). Google ScholarDigital Library
- Chin, F. 1986. Security problems on inference control for sum, max, and min queries. J. ACM 33, 3, 451--464. Google ScholarDigital Library
- Chin, F. and Ozsoyoglu, G. 1981. Auditing for secure statistical databases. In Proceedings of the ACM Conference (ACM'81). 53--59. Google ScholarDigital Library
- Clifton, C., Kantarcioglu, M., Vaidya, J., Lin, X., and Zhu, M. Y. 2002. Tools for privacy preserving data mining. SIGKDD Explorations 4, 2, 28--34. Google ScholarDigital Library
- Cox, L. 1995. Network models for complementary cell suppression. J. Amer. Statis. Asso. 90, 1453--1462.Google ScholarCross Ref
- Cox, L. H. 1980. Suppression, methodology and statistical disclosure control. J. Amer. Statis. Asso. 75.Google ScholarCross Ref
- Cox, L. H. 1982. Solving confidentiality protection problems in tabulations using network optimization: A network model for cell suppression in the u.s. economic censuses. In Proceedings of the International Seminar on Statistical Confidentiality. Dublin International Statistical Institute, Dublin, Ireland. 229--245.Google Scholar
- Cox, L. H. 1987. New results in dislosure avoidance for tabulations. In Proceedings of the International Statistical Institute 46th Session. Tokyo, Japan. 83--84.Google Scholar
- Dalenius, T. 1981. A simple procedure for controlled rounding. Statistik Tidskrift.Google Scholar
- Dalenius, T. and Reiss, S. 1982. Data swapping: A technique for disclosure control. J. Statis. Plan. Infer. 6.Google Scholar
- Denning, D. 1980. Secure statistical databases with random sample queries. ACM Trans. Datab. Syst. 5, 3, 291--315. Google ScholarDigital Library
- Denning, D. E., Denning, P. J., and Schwartz, M. D. 1979. The tracker: A threat to statistical database security. ACM Trans. Datab. Syst. 4, 1, 76--96. Google ScholarDigital Library
- Diaconis, P. and Sturmfels, B. 1998. Algebraic algorithms for sampling from conditional distributions. Annals of Statistics 1, 363--397.Google Scholar
- Dinur, I. and Nissim, K. 2003. Revealing information while preserving privacy. In Proceedings of the International Conference on Principles of Data Systems (PODS). 202--210. Google ScholarDigital Library
- Dobkin, D. P., Jones, A. K., and Lipton, R. J. 1979. Secure databases: Protection against user influence. ACM: Trans. Datab. Syst. 4, 1 (March), 76--96. Google ScholarDigital Library
- Dobra, A. 2002. Statistical tools for disclosure limitation in multiway contingency tables. Ph.D. thesis, Carnegie Mellon University.Google Scholar
- Dobra, A. and Feinberg, S. E. 2000. Assessing the risk of disclosure of confidential categorical data. In Bayesian Statistics 7. Oxford University Press, Oxford, UK.Google Scholar
- Dobra, A. and Feinberg, S. E. 2003. Bounding entries in multi-way contingency tables given a set of marginal totals. In Proceedings of the Shoresh Conference 2000: Foundations of Statistical Inference. Springer Verlag.Google Scholar
- Du, W. 2001. A study of several specific secure two-party computation problems. Ph.D. thesis, Purdue University. Google ScholarDigital Library
- Du, W. and Zhan, Z. 2002. A practical approach to solve secure multi-party computation problems. New Security Paradigms Workshop. Google ScholarDigital Library
- Duncan, G. T. and Feinberg, S. E. 1997. Obtaining information while preserving privacy: A markov perturbation method for tabular data. Joint Statistical Meetings. Anaheim, CA.Google Scholar
- Evfimievski, A., Gehrke, J., and Srikant, R. 2003. Limiting privacy breaches in privacy preserving data mining. In Proceedings of the International Conference on Principles of Data Systems (PODS). Google ScholarDigital Library
- Evfimievsky, A., Srikant, R., Gehrke, J., and Agrawal, R. 2002. Privacy preserving data mining of association rules. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery in Databases and Data Mining. 217--228. Google ScholarDigital Library
- Fellegi, I. P. 1972. On the question of statistical confidentiality. J. Amer. Statis. Asso. 67:337, 7--18.Google ScholarCross Ref
- Goldreich, O., Micali, S., and Wigderson, A. 1987. How to play any mental game. In Proceedings of the 19th ACM Conference on Theory of Computing (STOC'87). 218--229. Google ScholarDigital Library
- Huang, Z., Du, W., and Chen, B. 2004. Deriving private information from randomized data. In Proceedings of the 23th ACM SIGMOD Conference on Management of Data. Google ScholarDigital Library
- Kantarcioglu, M. and Clifton, C. 2002. Privacy-preserving distributed mining of association rules on horizontally partitioned data. In Proceedings of the Conference on Data Mining and Knowledge Discovery (DMKD).Google Scholar
- Kargupta, H., Datta, S., Wang, Q., and Sivakumar, K. 2003. On the privacy preserving properties of random data perturbation techniques. In Proceedings of the International Conference on Data Mining (ICDM). 99--106. Google ScholarDigital Library
- Kenthapadi, K., Mishra, N., and Nissim, K. 2005. Simulatable auditing. In PODS. Google ScholarDigital Library
- Kleinberg, J., Papadimitriou, C., and Raghavan, P. 2000. Auditing boolean attributes. In Proceedings of the International Conference on Principles of Data Systems (PODS). Google ScholarDigital Library
- LeFevre, K., Agrawal, R., Ercegovac, V., Ramakrishnan, R., Xu, Y., and DeWitt, D. J. 2004. Limiting disclosure in hippocratic databases. In Proceedings of the International Conference on Very Large Databases (VLDB). 108--119. Google ScholarDigital Library
- LeFevre, K., DeWitt, D., and Ramakrishnan, R. 2005. Incognito: Efficient fulldomain k-anonymity. In SIGMOD. Google ScholarDigital Library
- M. Langheinrich, E. 2001. A P3P preference exchange language 1.0 (appel1.0). W3C Working Draft.Google Scholar
- M. Marchiori, E. 2002. The platform for privacy preferences 1.0 (p3p1.0) specification. W3C Proposed Recommendation.Google Scholar
- Machanavajjhala, A., Gehrke, J., Kifer, D., and Venkitasubramaniam, M. 2006. ℓ-diversity: Privacy beyond k-anonymity. In Proceedings of the International Conference on Data Engineering (ICDE). Google ScholarDigital Library
- Martin, D., Kifer, D., Machanavajjhala, A., Gehrke, J., and Halpern, J. 2006. Worst-case background knowledge in privacy. Tech. rep., Cornell University.Google Scholar
- Matloff, N. S. 1986. Another look at the use of noise addition for database security. In Proceedings of IEEE Symposium on Security and Privacy. 173--180.Google ScholarCross Ref
- Meyerson, A. and Williams, R. 2004. On the complexity of optimal k-anonymity. In PODS. Google ScholarDigital Library
- Miklau, G. and Suciu, D. 2003. Controlling access to published data using cryptography. In Proceedings of the International Conference on Very Large Databases (VLDB). 898--909. Google ScholarDigital Library
- Miklau, G. and Suciu, D. 2004. A formal analysis of information disclosure in data exchange. In SIGMOD. Google ScholarDigital Library
- Ohrn, A. and Ohno-Machado, L. 1999. Using boolean reasoning to anonymize databases. A. I. Medicine 15, 3, 235--254.Google Scholar
- Samarati, P. 2001. Protecting respondents' identities in microdata release. In IEEE Trans. Knowl. Data Eng. Google ScholarDigital Library
- Samarati, P. and Sweeney, L. 1998. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Tech. rep. SRI-CSL-98-04, SRI Computer Science Laboratory, Palo Alto, CA.Google Scholar
- Schlorer, J. 1975. Identification and retrieval of personal records from a statistical bank. Methods Inform. Medicine.Google Scholar
- Slavkovic, A. and Feinberg, S. E. 2004. Bounds for cell entries in two-way tables given conditional relative frequencies. In Lecture Notes in Computer Science, Vol. 3050. J. Domingo-Ferrer and V. Torra Eds. Springer-Verlag, 30--43.Google Scholar
- Snodgrass, R. T., Yao, S., and Collberg, C. S. 2004. Tamper detection in audit logs. In Proceedings of the International Conference on Very Large Databases (VLDB). 504--515. Google ScholarDigital Library
- Sweeney, L. 2000. Uniqueness of simple demographics in the u.s. population. Tech. rep., Carnegie Mellon University.Google Scholar
- Sweeney, L. 2002. k-anonymity: a model for protecting privacy. Int. J. Uncer., Fuz. Knowl-based Syst. 10, 5, 557--570. Google ScholarDigital Library
- Traub, J. F., Yemini, Y., and Wozniakowski, H. 1984. The statistical security of a statistical database. ACM Trans. Datab. Syst. 9, 4, 672--679. Google ScholarDigital Library
- University of California Irvine Machine Learning Repository. http://www.ics.uci.edu/mlearn/mlrepository.html.Google Scholar
- Vaidya, J. and Clifton, C. 2002. Privacy preserving association rule mining in vertically partitioned data. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD). 639--644. Google ScholarDigital Library
- Warner, S. L. 1965. Randomized response: A survey technique for eliminating evasive answer bias. J. Amer. Statis. Ass.Google ScholarCross Ref
- Warner, S. L. 1971. The linear randomized response model. J. Amer. Statis. Ass. 884--888.Google ScholarCross Ref
- Yang, X. and Li, C. 2004. Secure XML publishing without information leakage in the presence of data inference. In Proceedings of the International Conference on Very Large Databases (VLDB). 96--107. Google ScholarDigital Library
- Zhong, S., Yang, Z., and Wright, R. N. 2005. Privacy-enhancing k-anonymization of customer data. In Proceedings of the International Conference on Principles of Data Systems (PODS). Google ScholarDigital Library
Index Terms
- L-diversity: Privacy beyond k-anonymity
Recommendations
t-Closeness through Microaggregation: Strict Privacy with Enhanced Utility Preservation
Microaggregation is a technique for disclosure limitation aimed at protecting the privacy of data subjects in microdata releases. It has been used as an alternative to generalization and suppression to generate k-anonymous data sets, where the identity of ...
Improving accuracy of classification models induced from anonymized datasets
The performance of classifiers and other data mining models can be significantly enhanced using the large repositories of digital data collected nowadays by public and private organizations. However, the original records stored in those repositories ...
Background knowledge attacks in privacy-preserving data publishing models
AbstractMassive volumes of data are being generated at every moment through various sources in the cyber-physical world. While storing as well as facilitating these data for business or individual requirements, data disclosure, sensitive data ...
Comments