
2008 | Book

Privacy-Preserving Data Mining

Models and Algorithms


About this Book

Advances in hardware technology have increased the capability to store and record personal data about consumers and individuals, causing concerns that personal data may be used for a variety of intrusive or malicious purposes.

Privacy-Preserving Data Mining: Models and Algorithms proposes a number of techniques for performing data mining tasks in a privacy-preserving way. These techniques generally fall into the following categories: data modification techniques; cryptographic methods and protocols for data sharing; statistical techniques for disclosure and inference control; query auditing methods; and randomization and perturbation-based techniques.

This edited volume contains surveys by distinguished researchers in the privacy field. Each survey includes the key research content as well as future research directions.

Privacy-Preserving Data Mining: Models and Algorithms is designed for researchers, professors, and advanced-level students in computer science, and is also suitable for industry practitioners.

Table of Contents

Frontmatter
1. An Introduction to Privacy-Preserving Data Mining
The field of privacy has seen rapid advances in recent years because of increases in the ability to store data. In particular, recent advances in the data mining field have led to increased concerns about privacy. While the topic of privacy has traditionally been studied in the context of cryptography and information hiding, the recent emphasis on data mining has led to renewed interest in the field. In this chapter, we introduce the topic of privacy-preserving data mining and provide an overview of the different topics covered in this book.
Charu C. Aggarwal, Philip S. Yu
2. A General Survey of Privacy-Preserving Data Mining Models and Algorithms
In recent years, privacy-preserving data mining has been studied extensively because of the wide proliferation of sensitive information on the internet. A number of algorithmic techniques have been designed for privacy-preserving data mining. In this chapter, we provide a review of the state-of-the-art methods for privacy. We discuss methods for randomization, k-anonymization, and distributed privacy-preserving data mining, as well as cases in which the output of data mining applications needs to be sanitized for privacy-preservation purposes. We also discuss the computational and theoretical limits of privacy preservation over high-dimensional data sets.
Charu C. Aggarwal, Philip S. Yu
3. A Survey of Inference Control Methods for Privacy-Preserving Data Mining
Inference control in databases, also known as Statistical Disclosure Control (SDC), is about protecting data so they can be published without revealing confidential information that can be linked to specific individuals among those to whom the data correspond. SDC has important applications in several areas, such as official statistics, health statistics, and e-commerce (sharing of consumer data). Since data protection ultimately means data modification, the challenge for SDC is to achieve protection with minimum loss of the accuracy sought by database users. In this chapter, we survey the current state of the art in SDC methods for protecting individual data (microdata). We discuss several information loss and disclosure risk measures and analyze several ways of combining them to assess the performance of the various methods. Last but not least, topics that need more research in the area are identified and possible directions are suggested.
Josep Domingo-Ferrer
4. Measures of Anonymity
To design a privacy-preserving data publishing system, we must first quantify the very notion of privacy, or information loss. In the past few years, there has been a proliferation of measures of privacy, some based on statistical considerations, others based on Bayesian or information-theoretic notions of information, and even others designed around the limitations of bounded adversaries. In this chapter, we review the various approaches to capturing privacy. We will find that although one can define privacy from different standpoints, there are many structural similarities in the way different approaches have evolved. It will also become clear that the notions of privacy and utility (the useful information one can extract from published data) are intertwined in ways that are yet to be fully resolved.
Suresh Venkatasubramanian
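As a minimal sketch of an information-theoretic measure in this spirit (a toy illustration, not a specific metric from the chapter), the Shannon entropy of the sensitive values within one anonymized group quantifies an adversary's residual uncertainty:

```python
import math
from collections import Counter

def sensitive_value_entropy(values):
    """Shannon entropy (bits) of the sensitive values in one anonymized
    group: higher entropy means more residual uncertainty for an
    adversary who has located a target inside that group."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A group where everyone shares one disease leaks everything (0 bits);
# a uniform mix of four values yields the maximum log2(4) = 2 bits.
leaky = sensitive_value_entropy(["flu"] * 4)
diverse = sensitive_value_entropy(["flu", "hiv", "cold", "ulcer"])
```

This also hints at the privacy-utility tension the chapter describes: a maximally diverse group protects individuals best but tells analysts least about any subgroup.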
5. k-Anonymous Data Mining: A Survey
Data mining technology has attracted significant interest as a means of identifying patterns and trends from large collections of data. It is, however, evident that the collection and analysis of data that include personal information may violate the privacy of the individuals to whom the information refers. Privacy protection in data mining is thus becoming a crucial issue that has captured the attention of many researchers.
In this chapter, we first describe the concept of k-anonymity and illustrate different approaches for its enforcement. We then discuss how the privacy requirements characterized by k-anonymity can be violated in data mining and introduce possible approaches to ensure the satisfaction of k-anonymity in data mining.
V. Ciriani, S. De Capitani di Vimercati, S. Foresti, P. Samarati
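The definition itself can be sketched directly (a toy table with hypothetical attribute names; the generalization step that produces the age ranges and truncated ZIP codes is assumed to have been done by an enforcement algorithm):

```python
from collections import Counter

def is_k_anonymous(table, quasi_identifiers, k):
    """True if every combination of quasi-identifier values is shared by
    at least k records, so no record can be singled out through those
    attributes alone."""
    groups = Counter(tuple(row[a] for a in quasi_identifiers) for row in table)
    return all(count >= k for count in groups.values())

# Ages generalized to ranges and ZIP codes truncated -- the kind of
# generalization an enforcement algorithm would produce.
table = [
    {"age": "30-39", "zip": "021**", "disease": "flu"},
    {"age": "30-39", "zip": "021**", "disease": "hiv"},
    {"age": "40-49", "zip": "022**", "disease": "cold"},
    {"age": "40-49", "zip": "022**", "disease": "flu"},
]
ok_2 = is_k_anonymous(table, ["age", "zip"], 2)   # each QI group has 2 rows
ok_3 = is_k_anonymous(table, ["age", "zip"], 3)   # no QI group reaches 3 rows
```

Enforcement algorithms search for the least generalization that makes this check pass, which is where the utility trade-offs discussed in the chapter arise.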
6. A Survey of Randomization Methods for Privacy-Preserving Data Mining
A well-known method for privacy-preserving data mining is randomization. In randomization, we add noise to the data so that the behavior of the individual records is masked. However, the aggregate behavior of the data distribution can be reconstructed by subtracting out the noise from the data. The reconstructed distribution is often sufficient for a variety of data mining tasks, such as classification. In this chapter, we provide a survey of the randomization method for privacy-preserving data mining.
Charu C. Aggarwal, Philip S. Yu
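A minimal sketch of the additive-noise idea on synthetic data: any single perturbed value is unreliable, but zero-mean noise cancels in aggregates, so distribution-level statistics remain recoverable.

```python
import random

random.seed(0)
ages = [random.randint(20, 60) for _ in range(10_000)]

# Mask each record with zero-mean Gaussian noise: any single perturbed
# value is now a poor indicator of the original record.
noise_sd = 25.0
perturbed = [a + random.gauss(0.0, noise_sd) for a in ages]

# Aggregates survive, because the noise averages out: the mean (and,
# with an EM-style reconstruction, the full distribution) is recoverable.
true_mean = sum(ages) / len(ages)
est_mean = sum(perturbed) / len(perturbed)
```

The chapter's reconstruction algorithms generalize this beyond the mean, recovering the full data distribution well enough to train classifiers on it.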
7. A Survey of Multiplicative Perturbation for Privacy-Preserving Data Mining
The major challenge of data perturbation is to achieve the desired balance between the level of privacy guarantee and the level of data utility. Data privacy and data utility are commonly considered as a pair of conflicting requirements in privacy-preserving data mining systems and applications. Multiplicative perturbation algorithms aim at improving data privacy while maintaining the desired level of data utility by selectively preserving the mining task and model specific information during the data perturbation process. By preserving the task and model specific information, a set of “transformation-invariant data mining models” can be applied to the perturbed data directly, achieving the required model accuracy. Often a multiplicative perturbation algorithm may find multiple data transformations that preserve the required data utility. Thus the next major challenge is to find a good transformation that provides a satisfactory level of privacy guarantee. In this chapter, we review three representative multiplicative perturbation methods: rotation perturbation, projection perturbation, and geometric perturbation, and discuss the technical issues and research challenges. We first describe the mining task and model specific information for a class of data mining models, and the transformations that can (approximately) preserve the information. Then we discuss the design of appropriate privacy evaluation models for multiplicative perturbations, and give an overview of how we use the privacy evaluation model to measure the level of privacy guarantee in the context of different types of attacks.
Keke Chen, Ling Liu
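The rotation-perturbation idea can be sketched in two dimensions: an orthogonal transform changes every coordinate, yet preserves all pairwise Euclidean distances, so distance-based models (e.g. kNN, or SVMs with certain kernels) are invariant to it. A toy illustration:

```python
import math
import random

def rotation_2d(theta):
    """2-D rotation matrix: an orthogonal transform (R^T R = I)."""
    return [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta),  math.cos(theta)]]

def apply(R, x):
    return [R[0][0] * x[0] + R[0][1] * x[1],
            R[1][0] * x[0] + R[1][1] * x[1]]

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

random.seed(1)
R = rotation_2d(random.uniform(0.0, 2.0 * math.pi))
x, y = [3.0, 4.0], [-1.0, 2.5]

# The perturbed coordinates look nothing like the originals, but the
# pairwise distance -- what distance-based models consume -- is intact.
d_before = dist(x, y)
d_after = dist(apply(R, x), apply(R, y))
```

The attacks the chapter then considers exploit exactly this structure: knowing a few original points can let an adversary estimate the rotation and undo it.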
8. A Survey of Quantification of Privacy Preserving Data Mining Algorithms
The aim of privacy-preserving data mining (PPDM) algorithms is to extract relevant knowledge from large amounts of data while at the same time protecting sensitive information. An important aspect in the design of such algorithms is the identification of suitable evaluation criteria and the development of related benchmarks. Recent research in the area has devoted much effort to determining a trade-off between the right to privacy and the need for knowledge discovery. It is often the case that no privacy-preserving algorithm exists that outperforms all the others on all possible criteria. Therefore, it is crucial to provide a comprehensive view of a set of metrics related to existing privacy-preserving algorithms, so that we can gain insights on how to design more effective measurements and PPDM algorithms. In this chapter, we review and summarize existing criteria and metrics for evaluating privacy-preserving techniques.
Elisa Bertino, Dan Lin, Wei Jiang
9. A Survey of Utility-based Privacy-Preserving Data Transformation Methods
As a serious concern in data publishing and analysis, privacy-preserving data processing has received a lot of attention. Privacy preservation often leads to information loss; consequently, we want to minimize utility loss as long as privacy is preserved. In this chapter, we survey the utility-based privacy-preservation methods systematically. We first briefly discuss the privacy models and utility measures, and then review four recently proposed methods for utility-based privacy preservation.
We first introduce the utility-based anonymization method for maximizing the quality of the anonymized data in query answering and discernability. Then we introduce the top-down specialization (TDS) method and the progressive disclosure algorithm (PDA) for privacy preservation in classification problems. Last, we introduce the anonymized marginal method, which publishes the anonymized projection of a table to increase the utility and satisfy the privacy requirement.
Ming Hua, Jian Pei
10. Mining Association Rules under Privacy Constraints
Data mining services require accurate input data for their results to be meaningful, but privacy concerns may impel users to provide spurious information. In this chapter, we study whether users can be encouraged to provide correct information by ensuring that the mining process cannot, with any reasonable degree of certainty, violate their privacy. Our analysis is in the context of extracting association rules from large historical databases, a popular mining process that identifies interesting correlations between database attributes. We analyze the various schemes that have been proposed for this purpose with regard to a variety of parameters including the degree of trust, privacy metric, model accuracy and mining efficiency.
Jayant R. Haritsa
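One family of schemes analyzed here distorts each user's boolean data by random bit-flips, in the style of the MASK scheme, and the miner then inverts the distortion in aggregate. A minimal sketch on synthetic data (the retention probability and support value are hypothetical):

```python
import random

random.seed(7)
p = 0.9            # probability each reported bit is kept (hypothetical)
n = 100_000
true_support = 0.30

# Each user holds one bit ("bought item A?") and reports it flipped with
# probability 1 - p, so no individual report can be trusted.
bits = [random.random() < true_support for _ in range(n)]
reported = [b if random.random() < p else not b for b in bits]

# The miner inverts the distortion in aggregate:
#   lam = s*p + (1 - s)*(1 - p)   =>   s = (lam - (1 - p)) / (2p - 1)
lam = sum(reported) / n
est_support = (lam - (1.0 - p)) / (2.0 * p - 1.0)
```

Choosing p trades off exactly the parameters the chapter analyzes: values near 1 give accurate support estimates but weak privacy, values near 0.5 the reverse.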
11. A Survey of Association Rule Hiding Methods for Privacy
Data and knowledge hiding are two research directions that investigate how the privacy of raw data, or information, can be maintained either before or after the course of mining the data. By focusing on the knowledge hiding thread, we present a taxonomy and a survey of recent approaches that have been applied to the association rule hiding problem. Association rule hiding refers to the process of modifying the original database in such a way that certain sensitive association rules disappear without seriously affecting the data and the non-sensitive rules. We also provide a thorough comparison of the presented approaches, and we touch upon hiding approaches used for other data mining tasks. A detailed presentation of metrics used to evaluate the performance of those approaches is also given. Finally, we conclude our study by enumerating interesting future directions in this research body.
Vassilios S. Verykios, Aris Gkoulalas-Divanis
12. A Survey of Statistical Approaches to Preserving Confidentiality of Contingency Table Entries
In the statistical literature, there has been considerable development of methods of data releases for multivariate categorical data sets, where the releases come in the form of marginal and conditional tables corresponding to subsets of the categorical variables. In this chapter, we provide an overview of this methodology and relate it to the literature on the release of association rules, which can be viewed as conditional tables. We illustrate this with two examples. A related problem, “association rule hiding”, is often independently studied in the database community.
Stephen E. Fienberg, Aleksandra B. Slavkovic
13. A Survey of Privacy-Preserving Methods Across Horizontally Partitioned Data
Data mining can extract important knowledge from large data collections, but sometimes these collections are split among various parties. Data warehousing, bringing data from multiple sources under a single authority, increases risk of privacy violations. Furthermore, privacy concerns may prevent the parties from directly sharing even some meta-data.
Distributed data mining and processing provide a means to address this issue, particularly if queries are processed in a way that avoids the disclosure of any information beyond the final result. This chapter describes methods to mine horizontally partitioned data without violating privacy and discusses how to use the data mining results in a privacy-preserving way. The methods described here incorporate cryptographic techniques to minimize the information shared, while adding as little overhead as possible to the mining and processing task.
Murat Kantarcioglu
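A classic building block for this setting is the secure-sum protocol: sites jointly compute a total without any site seeing another's local value. A minimal single-process sketch (the "ring" of sites is simulated by a loop; real deployments need actual communication and collusion safeguards):

```python
import random

def secure_sum(local_values, modulus=1_000_000):
    """Ring-based secure sum: the initiating site adds a random mask,
    each site adds its local value to the running total, and the
    initiator removes the mask at the end. Every site sees only a
    masked running total, never another site's raw value."""
    mask = random.randrange(modulus)
    running = mask
    for v in local_values:      # the total is passed around the "ring"
        running = (running + v) % modulus
    return (running - mask) % modulus

# Three hospitals jointly count flu cases without revealing local counts.
total = secure_sum([120, 45, 310])   # -> 475
```

Composing primitives like this into full mining algorithms, while keeping the cryptographic overhead tolerable, is the subject of the chapter.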
14. A Survey of Privacy-Preserving Methods Across Vertically Partitioned Data
The goal of data mining is to extract or “mine” knowledge from large amounts of data. However, data is often collected by several different sites. Privacy, legal and commercial concerns restrict centralized access to this data, thus derailing data mining projects. Recently, there has been growing focus on finding solutions to this problem. Several algorithms have been proposed that do distributed knowledge discovery, while providing guarantees on the non-disclosure of data. Vertical partitioning of data is an important data distribution model often found in real life. Vertical partitioning or heterogeneous distribution implies that different features of the same set of data are collected by different sites. In this chapter we survey some of the methods developed in the literature to mine vertically partitioned data without violating privacy and discuss challenges and complexities specific to vertical partitioning.
Jaideep Vaidya
15. A Survey of Attack Techniques on Privacy-Preserving Data Perturbation Methods
We focus primarily on the use of additive and matrix multiplicative data perturbation techniques in privacy preserving data mining (PPDM). We survey a recent body of research aimed at better understanding the vulnerabilities of these techniques. These researchers assumed the role of an attacker and developed methods for estimating the original data from the perturbed data and any available prior knowledge. Finally, we briefly discuss research aimed at attacking k-anonymization, another data perturbation technique in PPDM.
Kun Liu, Chris Giannella, Hillol Kargupta
16. Private Data Analysis via Output Perturbation
A Rigorous Approach to Constructing Sanitizers and Privacy Preserving Algorithms
We describe output perturbation techniques that allow for a provable, rigorous sense of individual privacy. Examples where the techniques are effective span from basic statistical computations to sophisticated machine learning algorithms.
Kobbi Nissim
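One canonical output-perturbation technique with a rigorous guarantee is the Laplace mechanism from differential privacy. A minimal sketch for a count query (toy data; a count has sensitivity 1, so Laplace noise of scale 1/ε suffices for that single answer):

```python
import math
import random

def laplace_noise(scale):
    # Inverse-CDF sample from Laplace(0, scale).
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon):
    """Answer a count query with Laplace(1/epsilon) noise added to the
    output. A count has sensitivity 1 (adding or removing one person
    changes it by at most 1), giving epsilon-differential privacy for
    this single answer."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

random.seed(42)
ages = [23, 37, 41, 55, 62, 29, 48]
noisy = private_count(ages, lambda a: a >= 40, epsilon=0.5)  # true count is 4
```

Note that perturbing the *output* of the computation, rather than the input data, is what allows the rigorous per-individual guarantee the chapter develops; answering many queries requires tracking the accumulated privacy budget.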
17. A Survey of Query Auditing Techniques for Data Privacy
This chapter is a survey of query auditing techniques for detecting and preventing disclosures in a database containing private data. Informally, auditing is the process of examining past actions to check whether they were in conformance with official policies. In the context of database systems with specific data disclosure policies, auditing is the process of examining queries that were answered in the past to determine whether answers to these queries could have been used by an individual to ascertain confidential information forbidden by the disclosure policies. Techniques used for detecting disclosures could potentially also be used or extended to prevent disclosures, and so in addition to the retroactive auditing mentioned above, researchers have also studied an online variant of the auditing problem wherein the task of an online auditor is to deny queries that could potentially cause a breach of privacy.
Shubha U. Nabar, Krishnaram Kenthapadi, Nina Mishra, Rajeev Motwani
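The online variant can be sketched with SUM queries over sets of record ids: the auditor denies any query that, combined with a previously answered one, would expose a single record. This is a toy pairwise check only; real auditors must reason over all combinations of answered queries, e.g. via linear algebra over the answered sums.

```python
def make_auditor():
    """Minimal online auditor for SUM queries over sets of record ids:
    deny any query that would let the asker subtract two answered sums
    and recover a single record's exact value."""
    answered = []

    def audit(query_ids):
        q = frozenset(query_ids)
        if len(q) == 1:
            return "deny"            # the query isolates one record directly
        for p in answered:
            if len(q ^ p) == 1:      # the two sums differ by exactly one record
                return "deny"
        answered.append(q)
        return "answer"

    return audit

audit = make_auditor()
r1 = audit({1, 2, 3})   # answered
r2 = audit({1, 2})      # denied: sum{1,2,3} - sum{1,2} reveals record 3
r3 = audit({4, 5})      # answered: independent of prior answered queries
```

The chapter also covers a subtlety this sketch ignores: the *pattern* of denials can itself leak information, which motivates simulatable auditors.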
18. Privacy and the Dimensionality Curse
Most privacy-transformation methods such as k-anonymity or randomization use some kind of transformation on the data for privacy-preservation purposes. In many cases, the data can be indirectly identified with the use of a combination of attributes. Such attributes may be available from public records and may be used to link the sensitive records to the target of interest. Thus, the sensitive attributes of the record may be inferred as well with the use of publicly available attributes. In many cases, the target of interest may be known to the adversary, which results in a large number of combinations of attributes being known to the adversary. This is a reasonable assumption, since privacy attacks will often be mounted by an adversary with some knowledge of the target. As a result, the number of attributes available for identification increases, and results in almost unique identification of the target. In this chapter, we examine a number of privacy-preservation methods and show that in each case the privacy-preservation approach becomes either ineffective or infeasible.
Charu C. Aggarwal
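The effect is easy to demonstrate on synthetic data: as the number of known quasi-identifier attributes grows, the fraction of records that are unique (and hence re-identifiable) rises toward one. A toy sketch:

```python
import random
from collections import Counter

random.seed(3)
n_records, n_attrs, n_values = 1000, 10, 5
records = [[random.randrange(n_values) for _ in range(n_attrs)]
           for _ in range(n_records)]

def fraction_unique(records, num_attrs):
    """Fraction of records uniquely identified by their first num_attrs
    attribute values -- a proxy for re-identification risk."""
    keys = [tuple(r[:num_attrs]) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)

# 2 attributes give only 5^2 = 25 combinations for 1000 records, so no
# one is unique; 8 attributes give 5^8 = 390625, so almost everyone is.
low = fraction_unique(records, 2)
high = fraction_unique(records, 8)
```

This is the curse in miniature: any anonymization strong enough to break the high-dimensional uniqueness must distort so many attributes that little utility survives.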
19. Personalized Privacy Preservation
Unlike conventional methods that exert the same amount of privacy control over all the tuples in the microdata, personalized privacy preservation applies various degrees of protection to different tuples, subject to the preferences of the data owners. This chapter explains the formulation of personal preferences, and the computation of anonymized tables that fulfill the privacy requirement of everybody. Several theoretical results regarding privacy guarantees will also be discussed. Finally, we point out the open research issues that may be explored in the future.
Yufei Tao, Xiaokui Xiao
20. Privacy-Preserving Data Stream Classification
In a wide range of applications, multiple data streams need to be examined together in order to discover trends or patterns existing across several data streams. One common practice is to redirect all data streams into a central place for joint analysis. This “centralized” practice is challenged by the fact that data streams often are private in that they come from different owners. In this chapter, we focus on the problem of building a classifier in this context and assume that classification evolves as the current window of streams slides forward. This problem faces two major challenges. First, the many-to-many join relationship of streams will blow up the already fast arrival rate of data streams. Second, the privacy requirement implies that data exchange among owners should be minimal. These considerations rule out all classification methods that require producing the join in the current window. We show that Naive Bayesian Classification (NBC) presents a unique opportunity to address this problem. Our main contribution is to adapt NBC to solve the classification problem for private data streams.
Yabo Xu, Ke Wang, Ada Wai-Chee Fu, Rong She, Jian Pei
Backmatter
Metadata
Title
Privacy-Preserving Data Mining
edited by
Charu C. Aggarwal
Philip S. Yu
Copyright Year
2008
Publisher
Springer US
Electronic ISBN
978-0-387-70992-5
Print ISBN
978-0-387-70991-8
DOI
https://doi.org/10.1007/978-0-387-70992-5
