Genetic algorithm-based clustering approach for k-anonymization
Introduction
Privacy protection is an important societal concern. Protection of public released microdata from individual identification becomes increasingly important as the public becomes increasingly concerned with privacy. Most privacy protection techniques work by randomizing (Agrawal and Srikant, 2000, Lin and Cheng, 2009) or generalizing Samarati, 2001 the original data, but can also degrade the quality of the data. Therefore, a dilemma exists between data quality and data privacy.
As a privacy-preserving approach, has been extensively studied in recent years, the k-anonymity model (Samarati, 2001, Sweeney, 2002) works by ensuring that each record of a table is identical to at least other records with respect to a set of privacy-related attributes, called quasi-identifiers, that could be used to identify individuals by linking these attributes to external datasets. For instance, consider the hospital data in Table 1, where the attributes ZipCode, Gender and Age are considered as quasi-identifiers. Table 2 shows a 3-anonymization version of Table 1, where anonymization is achieved via generalization at the attribute level (Ciriani, di Vimercati, Foresti, & Samarati, 2007), i.e., if two records contain the same value at a quasi-identifier before anonymization, then they are generalized to the same value at the quasi-identifier by the anonymization process. Table 3 shows another 3-anonymization version of Table 1, where anonymization is achieved via generalization at the cell-level (Ciriani et al., 2007), i.e., two cells with same value could be generalized to different values (e.g., value “75275” in the ZipCode column and value “Male” in the Gender column). Because anonymization via generalization at the cell-level generates data containing different generalization levels within a column (e.g., values “7527∗” and “75275” in the ZipCode column and values “Male” and “Person” in the Gender column of Table 3), utilizing such data becomes more complex than utilizing the data generated via generalization at the attribute level. However, in terms of data quality, generalization at the cell-level causes less information loss than generalization at the attribute level, and therefore is adopted in this study.
Anonymization via generalization at the cell-level can proceed in two steps, partitioning and anonymizing. In the partitioning step, the dataset to be anonymized is partitioned into several groups such that each group contains at least k records. Then, in the anonymizing step, records in the same group are generalized such that their values at each quasi-identifier are identical. Minimizing the information loss incurred by the anonymizing step requires that the partitioning step places similar records (with respect to the quasi-identifiers) in the same group. In data mining, clustering is an effective means of partitioning records into clusters such that records within a cluster resemble each other, while records in different clusters are clearly distinct from each other. Hence, clustering techniques have been successfully adapted for k-anonymization (Byun et al., 2007, Chiu and Tsai, 2007, Lin and Wei, 2008, Lin et al., 2008, Loukides and Shao, 2007).
Genetic algorithms (GA) are well known for their global search capabilities. The constraint of k-anonymity means that traditional GA-based clustering techniques cannot be easily adapted for the k-anonymization problem without causing much overhead. Section 2.3 provides details. This work proposes a GA-based clustering approach for k-anonymization. This approach starts with a dataset that has been partitioned using a traditional clustering-based k-anonymization technique, called the Hybrid method (Lin et al., 2008). The output of the Hybrid method is encoded as a population of chromosomes. Various heuristics are then adopted to select genes to undergo the crossover operations to reduce the information loss of the dataset. This approach is, to our knowledge, the first GA-based clustering method proposed for k-anonymization at the cell-level. Experimental results demonstrate that the proposed approach further reduces the information loss caused by the Hybrid method.
The rest of this paper is organized as follows. Section 2 reviews basic concepts on k-anonymization with a focus on clustering-based methods for k-anonymization and GA-based clustering techniques. Section 3 describes the proposed GA-based approach for k-anonymization. Section 4 presents a performance analysis of the proposed approach. Conclusions are finally drawn in Section 5, along with recommendations for further research.
Section snippets
Basic concepts
The k-anonymity model has attracted considerable attention in recent years. Many approaches have been proposed for k-anonymization and its variations. Please refer to Ciriani et al. (2007) for a survey of various k-anonymization approaches. This section first describes the concept of information loss, which measures the effectiveness of various k-anonymization approaches. Several recently proposed clustering methods for k-anonymization are then reviewed. Finally, GA-based clustering methods are
A GA-based clustering approach for k-anonymization
To scale up the assignment-oriented methods for large datasets, while also avoid the feasibility check of chromosomes, this section presents an assignment-oriented method for the k-anonymization problem. The proposed approach differs from previous assignment-oriented methods mainly in terms of the encoding scheme of chromosomes, and in the gene selection heuristic for crossover operations.
Experimental results
This section evaluates various parameter settings of the proposed GA-based approach, and then compares its performance against the Hybrid method (Lin et al., 2008). Both methods were implemented in Java, and run on a desktop PC with Intel Core2Duo 2.2 GHz CPU and 2GB of RAM under the MS Window Vista operating system.
This experiment used the Adult dataset from the UC Irvine Machine Learning Repository (Asuncion & Newman, 2007), which is considered to be a standard benchmark for k-anonymization.
Conclusions
Most k-anonymization approaches rely on heuristics to minimize the information loss incurred by the anonymization process. This work formalizes the k-anonymization problem as a constraint optimization problem, and proposes a GA-based clustering approach for this problem. The proposed approach further reduces the total information loss caused by the Hybrid method.
Many variations of the k-anonymity model have recently been proposed to further protect the data from identification, e.g., l
References (30)
- et al.
A genetic algorithm approach to cluster analysis
Computers and Mathematics with Applications
(1999) - et al.
Privacy preserving itemset mining through noisy items
Expert Systems with Applications
(2009) - et al.
Genetic algorithm-based clustering technique
Pattern Recognition
(2000) A genetic c-means clustering algorithm applied to color image quantization
Pattern Recognition
(1997)- Agrawal, R., & Srikant, R. (2000). Privacy-preserving data mining. In Proceedings of the 2000 ACM SIGMOD international...
- Asuncion, A., & Newman, D. J. (2007). UCI Machine Learning Repository. Irvine, CA: University of California, School of...
- et al.
A near-optimal initial seed value selection in k-means algorithm using a genetic algorithm
Pattern Recognition Letters
(1993) - Byun, J.-W., Kamra, A., Bertino, E., & Li, N. (2007). Efficient k-anonymization using clustering techniques. In...
- Chiu, C.-C., & Tsai, C.-Y. (2007). A k-anonymity clustering method for effective data privacy preservation. In Third...
- et al.
K-Anonymity
Top-down specialization for information and privacy preservation
Genetic algorithms in search, optimization and machine learning
Clustering with a genetically optimized approach
IEEE Transactions on Evolutionary Computation
Cited by (36)
Multi-level personalized k-anonymity privacy-preserving model based on sequential three-way decisions
2024, Expert Systems with ApplicationsKAB: A new k-anonymity approach based on black hole algorithm
2022, Journal of King Saud University - Computer and Information SciencesCitation Excerpt :In each case, the new crossover operator converges faster to more effective solutions. Lin and Wei (Lin and Wei, 2009) proposed a Genetic Algorithm (GA)-based clustering approach for achieving k-anonymity. In this approach, the initial population of GA is created based on Hybrid Method proposed in (Lin and Wei, 2008).
Combined fuzzy clustering and firefly algorithm for privacy preserving in social networks
2020, Expert Systems with ApplicationsCitation Excerpt :Generally, in the anonymizing techniques, some edges of the graph or attributes of the data matrix are changed to satisfy protecting against the determined attacks. The anonymized graph/matrix should retain as much information as possible of the original graph/matrix (Lin & Wei, 2009). Various clustering-based techniques based on K-anonymity (KA) (Sweeney, 2002) and its extensions, L-diversity (LD) (Machanavajjhala, Gehrke, Kifer & Venkitasubramaniam, 2006) and T-closeness (TC) (Li, Li & Venkatasubramanian, 2007) have been proposed for graph/matrix anonymizing.
A new geometric shape-based genetic clustering algorithm for the multi-depot vehicle routing problem
2011, Expert Systems with ApplicationsCitation Excerpt :The other is to adapt genetic algorithm directly to the clustering problem. This way, genetic algorithm is the core of the clustering method, and the solution is encoded inside the chromosomes of genetic algorithm (Lin & Wei, 2009). Generally in clustering problems, the number of clusters in a data set is not known, as in most real-life situations.
(k, m, t)-anonymity: Enhanced privacy for transactional data
2022, Concurrency and Computation: Practice and Experience