
2008 | Book

Privacy, Security, and Trust in KDD

First ACM SIGKDD International Workshop, PinKDD 2007, San Jose, CA, USA, August 12, 2007, Revised Selected Papers

Edited by: Francesco Bonchi, Elena Ferrari, Bradley Malin, Yücel Saygin

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


Table of Contents

Frontmatter

Invited Paper

An Ad Omnia Approach to Defining and Achieving Private Data Analysis
Abstract
We briefly survey several privacy compromises in published datasets, some historical and some on paper. An inspection of these suggests that the problem lies with the nature of the privacy-motivated promises in question. These are typically syntactic, rather than semantic. They are also ad hoc, with insufficient argument that fulfilling these syntactic and ad hoc conditions yields anything like what most people would regard as privacy. We examine two comprehensive, or ad omnia, guarantees for privacy in statistical databases discussed in the literature, note that one is unachievable, and describe implementations of the other.
Cynthia Dwork
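
The abstract does not name the two guarantees; the achievable ad omnia guarantee Dwork describes in this line of work is differential privacy, whose standard implementation adds noise calibrated to a query's sensitivity. The following is a minimal illustrative sketch, not the paper's own implementation, of the Laplace mechanism for a counting query with sensitivity 1.

import numpy as np


def private_count(records, predicate, epsilon):
    """Release a counting query with Laplace noise of scale 1/epsilon (sensitivity 1)."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)


# Hypothetical example: a noisy count of ages over 40 under epsilon = 0.5.
ages = [23, 45, 67, 31, 52, 48]
print(private_count(ages, lambda a: a > 40, epsilon=0.5))
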

Contributed Papers

Phoenix: Privacy Preserving Biclustering on Horizontally Partitioned Data
Abstract
Emerging business models require organizations to collaborate with each other. This collaboration usually takes the form of distributed clustering to find optimal customer targets for effective marketing. The process is hampered by two problems: (1) the inability of traditional clustering algorithms to find local (subspace) patterns in distributed data, and (2) the privacy policies of individual organizations, which limit information sharing. In this paper, we propose an efficient privacy-preserving biclustering algorithm on horizontally partitioned data, referred to as Phoenix, which solves both of these problems. It assumes a malicious adversary model, which is more practical than the commonly employed semi-honest adversary model. It is shown to outperform the traditional K-means clustering algorithm in identifying local patterns. The distributed secure implementation of the algorithm is shown to be very efficient in both its computation and communication requirements.
Waseem Ahmad, Ashfaq Khokhar
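
Phoenix itself is a distributed, privacy-preserving protocol and is not reproduced here. As background only, the sketch below illustrates what a bicluster is, using the Cheng-Church mean squared residue (one common coherence score, not necessarily the one Phoenix uses) for a chosen subset of rows and columns; lower means the submatrix forms a more coherent local pattern of the kind row-wise K-means cannot isolate.

def mean_squared_residue(matrix, rows, cols):
    """Cheng-Church coherence score of the submatrix (rows x cols); 0 is perfectly additive."""
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_r, n_c = len(rows), len(cols)
    row_means = [sum(r) / n_c for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_r)) / n_r for j in range(n_c)]
    overall = sum(row_means) / n_r
    return sum((sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
               for i in range(n_r) for j in range(n_c)) / (n_r * n_c)


# Columns 0 and 1 follow an additive pattern across all rows, so the score is 0.
data = [[1, 2, 9], [2, 3, 1], [3, 4, 7]]
print(mean_squared_residue(data, rows=[0, 1, 2], cols=[0, 1]))
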
Allowing Privacy Protection Algorithms to Jump Out of Local Optimums: An Ordered Greed Framework
Abstract
As more and more person-specific data, such as health information, becomes available, increasing attention is paid to confidentiality and privacy protection. One proposed model of privacy protection is k-Anonymity, where a dataset is k-anonymous if each record is identical to at least (k-1) others in the dataset. Our goal is to minimize information loss while transforming a collection of records to satisfy the k-Anonymity model. The downside to current greedy anonymization algorithms is their potential to get stuck at poor local optima. In this paper, we propose an Ordered Greed Framework for k-Anonymity. Using our framework, designers can avoid the poor-local-optimum problem by adding stochastic elements to their greedy algorithms. Our preliminary experimental results indicate improvements in both runtime and solution quality. We also discover a surprising result concerning at least two widely accepted greedy optimization algorithms in the literature. More specifically, for anonymization algorithms that process datasets in column-wise order, we show that a random column ordering can lead to significantly higher-quality solutions than orderings determined by known greedy heuristics.
Rhonda Chaytor
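
As a concrete reading of the k-Anonymity condition quoted in the abstract, the following minimal sketch (hypothetical helper names, not part of the paper's framework) checks whether every record shares its quasi-identifier values with at least k-1 others.

from collections import Counter


def is_k_anonymous(records, k):
    """True if every quasi-identifier tuple occurs at least k times in the table."""
    counts = Counter(records)
    return all(c >= k for c in counts.values())


# Each tuple holds the quasi-identifier attributes of one record.
table = [("M", "1975", "537**"), ("M", "1975", "537**"),
         ("F", "1982", "902**"), ("F", "1982", "902**")]
print(is_k_anonymous(table, k=2))   # True
print(is_k_anonymous(table, k=3))   # False
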
Probabilistic Anonymity
Abstract
In this age of globalization, organizations need to publish their micro-data owing to legal directives or share it with business associates in order to remain competitive. This puts personal privacy at risk. To surmount this risk, attributes that clearly identify individuals, such as Name, Social Security Number, and Driving License Number, are generally removed or replaced by random values. But this may not be enough, because such de-identified databases can sometimes be joined with other public databases on attributes such as Gender, Date of Birth, and Zipcode to re-identify individuals who were supposed to remain anonymous. In the literature, such an identity-leaking attribute combination is called a quasi-identifier. It is therefore critical to be able to recognize quasi-identifiers and to apply appropriate protective measures to them in order to mitigate the identity disclosure risk posed by join attacks.
In this paper, we start by providing the first formal characterization of quasi-identifiers and a practical technique to identify them. We show an interesting connection between whether a set of columns forms a quasi-identifier and the number of distinct values assumed by the combination of those columns. We then use this characterization to develop a probabilistic notion of anonymity. Again, we show an interesting connection between the number of distinct values taken by a combination of columns and the anonymity it can offer. This allows us to find an ideal amount of generalization or suppression to apply to different columns in order to achieve probabilistic anonymity. We work through many examples and show that our analysis can be used to make a published database conform to privacy rules such as HIPAA. In order to achieve probabilistic anonymity, we observe that one needs to solve multiple 1-dimensional k-anonymity problems. We propose many efficient and scalable algorithms for achieving 1-dimensional anonymity. Our algorithms are optimal in the sense that they minimally distort the data and retain much of its utility.
Sachin Lodha, Dilys Thomas
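
The connection between quasi-identifiers and distinct value counts can be made concrete with a rough sketch: a column combination whose values single out many rows is a candidate quasi-identifier. The uniqueness threshold and helper names below are illustrative assumptions, not the paper's formal characterization.

from itertools import combinations


def distinct_fraction(rows, cols):
    """Fraction of rows whose projection onto cols is unique in the table."""
    projections = [tuple(r[c] for c in cols) for r in rows]
    counts = {}
    for p in projections:
        counts[p] = counts.get(p, 0) + 1
    return sum(1 for p in projections if counts[p] == 1) / len(rows)


def likely_quasi_identifiers(rows, columns, size, threshold=0.5):
    """Column combinations of the given size that re-identify many rows."""
    return [c for c in combinations(columns, size)
            if distinct_fraction(rows, c) >= threshold]


rows = [
    {"gender": "F", "dob": "1971-03-02", "zip": "53711"},
    {"gender": "F", "dob": "1968-07-19", "zip": "53711"},
    {"gender": "M", "dob": "1971-03-02", "zip": "53703"},
    {"gender": "M", "dob": "1959-11-30", "zip": "53703"},
]
print(likely_quasi_identifiers(rows, ["gender", "dob", "zip"], size=2))
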
Website Privacy Preservation for Query Log Publishing
Abstract
In this paper we study privacy preservation for the publication of search engine query logs. We introduce a new privacy concern, website privacy, as a special case of business privacy. We define the possible adversaries who could be interested in disclosing website information and the vulnerabilities in the query log that they could exploit. We elaborate on anonymization techniques to protect website information, discuss different types of attacks that an adversary could use, and propose an anonymization strategy for one of these attacks. We then present a graph-based heuristic to validate the effectiveness of our anonymization method and perform an experimental evaluation of this approach. Our experimental results show that the query log can be appropriately anonymized against the specific attack while retaining a significant volume of useful data.
Barbara Poblete, Myra Spiliopoulou, Ricardo Baeza-Yates
Privacy-Preserving Data Mining through Knowledge Model Sharing
Abstract
Privacy-preserving data mining (PPDM) is an important topic for both industry and academia. In general, there are two approaches to PPDM: one is statistics-based and the other is crypto-based. The statistics-based approach has the advantage of being efficient enough to deal with large volumes of data. The basic idea underlying this approach is to let the data owners publish sanitized versions of their data (e.g., via perturbation, generalization, or ℓ-diversification), which are then used for extracting useful knowledge models such as decision trees. In this paper, we present a new method for statistics-based PPDM. Our method differs from existing ones in that it lets the data owners share with each other the knowledge models extracted from their own private datasets, rather than publish any of their own private data (even in sanitized form). The knowledge models derived from the individual datasets are used to generate pseudo-data, which are then used for extracting the desired “global” knowledge models. While this idea is straightforward, there are some technical subtleties that need to be carefully addressed. Specifically, we propose an algorithm for generating pseudo-data according to the paths of a decision tree, a method for adapting anonymity measures of datasets to measure the privacy of decision trees, and an algorithm that prunes a decision tree to satisfy a given anonymity requirement. Through an empirical study, we show that predictive models learned using our method are significantly more accurate than those learned using the existing ℓ-diversity method, in both centralized and distributed environments, with different types of datasets, predictive models, and utility measures.
Patrick Sharkey, Hongwei Tian, Weining Zhang, Shouhuai Xu
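
A hedged sketch of the pseudo-data idea described in the abstract: each root-to-leaf path of a shared decision tree fixes the attribute values tested along the path, and the leaf's class counts say how many pseudo-records to emit per class; attributes not tested on the path are filled in at random. The attribute domains and path encoding below are assumptions for illustration, and the paper's anonymity-aware pruning is not shown.

import random


def generate_pseudo_data(paths, domains):
    """Emit pseudo-records from decision-tree paths.

    Each path is {"conditions": {attr: value}, "class_counts": {label: n}}.
    """
    records = []
    for path in paths:
        for label, count in path["class_counts"].items():
            for _ in range(count):
                rec = {attr: path["conditions"].get(attr, random.choice(vals))
                       for attr, vals in domains.items()}
                rec["class"] = label
                records.append(rec)
    return records


# Hypothetical attribute domains and tree paths.
domains = {"age": ["<=30", "31-50", ">50"], "income": ["low", "high"]}
tree_paths = [
    {"conditions": {"age": "<=30"}, "class_counts": {"no": 3, "yes": 1}},
    {"conditions": {"age": ">50", "income": "high"}, "class_counts": {"yes": 2}},
]
for r in generate_pseudo_data(tree_paths, domains):
    print(r)
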
Privacy-Preserving Sharing of Horizontally-Distributed Private Data for Constructing Accurate Classifiers
Abstract
Data mining tasks such as supervised classification can often benefit from a large training dataset. However, in many application domains, privacy concerns can hinder the construction of an accurate classifier by combining datasets from multiple sites. In this work, we propose a novel privacy-preserving distributed data sanitization algorithm that randomizes the private data at each site independently before the data is pooled to form a classifier at a centralized site. Distance-preserving perturbation approaches have been proposed by other researchers but we show that they can be susceptible to security risks. To enhance security, we require a unique non-distance-preserving approach. We use Kernel Density Estimation (KDE) Resampling, where samples are drawn independently from a distribution that is approximately equal to the original data’s distribution. KDE Resampling provides consistent density estimates with randomized samples that are asymptotically independent of the original samples. This ensures high accuracy, especially when a large number of samples is available, with low privacy loss. We evaluated our approach on five standard datasets in a distributed setting using three different classifiers. The classification errors only deteriorated by 3% (in the worst case) when we used the randomized data instead of the original private data. With a large number of samples, KDE Resampling effectively preserves privacy (due to the asymptotic independence property) and also maintains the necessary data integrity for constructing accurate classifiers (due to consistency).
Vincent Yan Fu Tan, See-Kiong Ng
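
A minimal sketch of KDE resampling in one dimension, assuming a Gaussian kernel with a fixed bandwidth (the paper's kernel, bandwidth selection, and multivariate handling may differ): drawing from a Gaussian KDE amounts to picking an original point uniformly at random and perturbing it with kernel-scaled noise, so the released samples follow the estimated density rather than reproducing the originals.

import random


def kde_resample(data, n_samples, bandwidth):
    """Draw samples from a Gaussian kernel density estimate of `data`.

    Sampling from the KDE is equivalent to choosing a data point uniformly
    and adding N(0, bandwidth^2) noise to it.
    """
    return [random.choice(data) + random.gauss(0.0, bandwidth)
            for _ in range(n_samples)]


original = [1.2, 1.9, 2.1, 2.5, 3.8, 4.0, 4.1]
released = kde_resample(original, n_samples=20, bandwidth=0.5)
print(released)
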
Towards Privacy-Preserving Model Selection
Abstract
Model selection is an important problem in statistics, machine learning, and data mining. In this paper, we investigate the problem of enabling multiple parties to perform model selection on their distributed data in a privacy-preserving fashion without revealing their data to each other. We specifically study cross validation, a standard method of model selection, in the setting in which two parties hold a vertically partitioned database. For a specific kind of vertical partitioning, we show how the participants can carry out privacy-preserving cross validation in order to select among a number of candidate models without revealing their data to each other.
Zhiqiang Yang, Sheng Zhong, Rebecca N. Wright
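
For orientation, ordinary (non-private) k-fold cross validation is the model-selection baseline that the paper makes privacy-preserving over vertically partitioned data. The sketch below is generic and is not the paper's two-party protocol; the trivial majority-class candidate exists only to make it runnable.

def cross_val_accuracy(fit, X, y, k=5):
    """Average held-out accuracy over k contiguous folds.

    `fit(X_train, y_train)` must return a callable `predict(x)`.
    """
    n = len(X)
    scores = []
    for i in range(k):
        lo, hi = i * n // k, (i + 1) * n // k
        X_tr, y_tr = X[:lo] + X[hi:], y[:lo] + y[hi:]
        X_te, y_te = X[lo:hi], y[lo:hi]
        predict = fit(X_tr, y_tr)
        correct = sum(1 for x, t in zip(X_te, y_te) if predict(x) == t)
        scores.append(correct / max(1, len(X_te)))
    return sum(scores) / k


def fit_majority(X_tr, y_tr):
    majority = max(set(y_tr), key=y_tr.count)
    return lambda x: majority


X = [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]]
y = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
print(cross_val_accuracy(fit_majority, X, y, k=5))
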
Preserving the Privacy of Sensitive Relationships in Graph Data
Abstract
In this paper, we focus on the problem of preserving the privacy of sensitive relationships in graph data. We refer to the problem of inferring sensitive relationships from anonymized graph data as link re-identification. We propose five different privacy preservation strategies, which vary in terms of the amount of data removed (and hence their utility) and the amount of privacy preserved. We assume the adversary has an accurate predictive model for links, and we show experimentally the success of different link re-identification strategies under varying structural characteristics of the data.
Elena Zheleva, Lise Getoor
Backmatter
Metadata
Title
Privacy, Security, and Trust in KDD
Edited by
Francesco Bonchi
Elena Ferrari
Bradley Malin
Yücel Saygin
Copyright Year
2008
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-78478-4
Print ISBN
978-3-540-78477-7
DOI
https://doi.org/10.1007/978-3-540-78478-4
