Genetic algorithm-based clustering approach for k-anonymization

doi:10.1016/j.eswa.2009.02.009

Expert Systems with Applications

Volume 36, Issue 6, August 2009, Pages 9784-9792

https://doi.org/10.1016/j.eswa.2009.02.009 Get rights and content

Abstract

k-Anonymity has been widely adopted as a model for protecting public released microdata from individual identification. This model requires that each record must be identical to at least $k - 1$ other records in the anonymized dataset with respect to a set of privacy-related attributes. Although anonymizing the original dataset to satisfy the requirement of k-anonymity is easy, the anonymized dataset must preserve as much information as possible of the original dataset. Clustering techniques have recently been successfully adapted for k-anonymization. This work proposes a novel genetic algorithm-based clustering approach for k-anonymization. The proposed approach adopts various heuristics to select genes for crossover operations. Experimental results show that this approach can further reduce the information loss caused by traditional clustering-based k-anonymization techniques.

Introduction

Privacy protection is an important societal concern. Protection of public released microdata from individual identification becomes increasingly important as the public becomes increasingly concerned with privacy. Most privacy protection techniques work by randomizing (Agrawal and Srikant, 2000, Lin and Cheng, 2009) or generalizing Samarati, 2001 the original data, but can also degrade the quality of the data. Therefore, a dilemma exists between data quality and data privacy.

As a privacy-preserving approach, has been extensively studied in recent years, the k-anonymity model (Samarati, 2001, Sweeney, 2002) works by ensuring that each record of a table is identical to at least $k - 1$ other records with respect to a set of privacy-related attributes, called quasi-identifiers, that could be used to identify individuals by linking these attributes to external datasets. For instance, consider the hospital data in Table 1, where the attributes ZipCode, Gender and Age are considered as quasi-identifiers. Table 2 shows a 3-anonymization version of Table 1, where anonymization is achieved via generalization at the attribute level (Ciriani, di Vimercati, Foresti, & Samarati, 2007), i.e., if two records contain the same value at a quasi-identifier before anonymization, then they are generalized to the same value at the quasi-identifier by the anonymization process. Table 3 shows another 3-anonymization version of Table 1, where anonymization is achieved via generalization at the cell-level (Ciriani et al., 2007), i.e., two cells with same value could be generalized to different values (e.g., value “75275” in the ZipCode column and value “Male” in the Gender column). Because anonymization via generalization at the cell-level generates data containing different generalization levels within a column (e.g., values “7527^∗” and “75275” in the ZipCode column and values “Male” and “Person” in the Gender column of Table 3), utilizing such data becomes more complex than utilizing the data generated via generalization at the attribute level. However, in terms of data quality, generalization at the cell-level causes less information loss than generalization at the attribute level, and therefore is adopted in this study.

Anonymization via generalization at the cell-level can proceed in two steps, partitioning and anonymizing. In the partitioning step, the dataset to be anonymized is partitioned into several groups such that each group contains at least k records. Then, in the anonymizing step, records in the same group are generalized such that their values at each quasi-identifier are identical. Minimizing the information loss incurred by the anonymizing step requires that the partitioning step places similar records (with respect to the quasi-identifiers) in the same group. In data mining, clustering is an effective means of partitioning records into clusters such that records within a cluster resemble each other, while records in different clusters are clearly distinct from each other. Hence, clustering techniques have been successfully adapted for k-anonymization (Byun et al., 2007, Chiu and Tsai, 2007, Lin and Wei, 2008, Lin et al., 2008, Loukides and Shao, 2007).

Genetic algorithms (GA) are well known for their global search capabilities. The constraint of k-anonymity means that traditional GA-based clustering techniques cannot be easily adapted for the k-anonymization problem without causing much overhead. Section 2.3 provides details. This work proposes a GA-based clustering approach for k-anonymization. This approach starts with a dataset that has been partitioned using a traditional clustering-based k-anonymization technique, called the Hybrid method (Lin et al., 2008). The output of the Hybrid method is encoded as a population of chromosomes. Various heuristics are then adopted to select genes to undergo the crossover operations to reduce the information loss of the dataset. This approach is, to our knowledge, the first GA-based clustering method proposed for k-anonymization at the cell-level. Experimental results demonstrate that the proposed approach further reduces the information loss caused by the Hybrid method.

The rest of this paper is organized as follows. Section 2 reviews basic concepts on k-anonymization with a focus on clustering-based methods for k-anonymization and GA-based clustering techniques. Section 3 describes the proposed GA-based approach for k-anonymization. Section 4 presents a performance analysis of the proposed approach. Conclusions are finally drawn in Section 5, along with recommendations for further research.

Section snippets

Basic concepts

The k-anonymity model has attracted considerable attention in recent years. Many approaches have been proposed for k-anonymization and its variations. Please refer to Ciriani et al. (2007) for a survey of various k-anonymization approaches. This section first describes the concept of information loss, which measures the effectiveness of various k-anonymization approaches. Several recently proposed clustering methods for k-anonymization are then reviewed. Finally, GA-based clustering methods are

A GA-based clustering approach for k-anonymization

To scale up the assignment-oriented methods for large datasets, while also avoid the feasibility check of chromosomes, this section presents an assignment-oriented method for the k-anonymization problem. The proposed approach differs from previous assignment-oriented methods mainly in terms of the encoding scheme of chromosomes, and in the gene selection heuristic for crossover operations.

Experimental results

This section evaluates various parameter settings of the proposed GA-based approach, and then compares its performance against the Hybrid method (Lin et al., 2008). Both methods were implemented in Java, and run on a desktop PC with Intel Core2Duo 2.2 GHz CPU and 2GB of RAM under the MS Window Vista operating system.

This experiment used the Adult dataset from the UC Irvine Machine Learning Repository (Asuncion & Newman, 2007), which is considered to be a standard benchmark for k-anonymization.

Conclusions

Most k-anonymization approaches rely on heuristics to minimize the information loss incurred by the anonymization process. This work formalizes the k-anonymization problem as a constraint optimization problem, and proposes a GA-based clustering approach for this problem. The proposed approach further reduces the total information loss caused by the Hybrid method.

Many variations of the k-anonymity model have recently been proposed to further protect the data from identification, e.g., l

References (30)

M. Cowgill et al.
A genetic algorithm approach to cluster analysis
Computers and Mathematics with Applications
(1999)
J.-L. Lin et al.
Privacy preserving itemset mining through noisy items
Expert Systems with Applications
(2009)
U. Maulik et al.
Genetic algorithm-based clustering technique
Pattern Recognition
(2000)
P. Scheunders
A genetic c-means clustering algorithm applied to color image quantization
Pattern Recognition
(1997)
Agrawal, R., & Srikant, R. (2000). Privacy-preserving data mining. In Proceedings of the 2000 ACM SIGMOD international...
Asuncion, A., & Newman, D. J. (2007). UCI Machine Learning Repository. Irvine, CA: University of California, School of...
G.P. Babu et al.
A near-optimal initial seed value selection in k-means algorithm using a genetic algorithm
Pattern Recognition Letters
(1993)
Byun, J.-W., Kamra, A., Bertino, E., & Li, N. (2007). Efficient k-anonymization using clustering techniques. In...
Chiu, C.-C., & Tsai, C.-Y. (2007). A k-anonymity clustering method for effective data privacy preservation. In Third...
V. Ciriani et al.
K-Anonymity

B.C.M. Fung et al.

Top-down specialization for information and privacy preservation

D.E. Goldberg

Genetic algorithms in search, optimization and machine learning

(1989)

L. Hall et al.

Clustering with a genetically optimized approach

IEEE Transactions on Evolutionary Computation

(1999)

LeFevre, K., DeWitt, D. J., & Ramakrishnan, R. (2006). Mondrian multidimensional k-anonymity. In Proceedings of the...

Li, J., Gao, X., & Jiao, L.-C. (2003). A GA-based clustering algorithm for large data sets with mixed and categorical...

Cited by (36)

Multi-level personalized k-anonymity privacy-preserving model based on sequential three-way decisions
2024, Expert Systems with Applications
K-anonymity is a widely used privacy-preserving technique which defends against linking attacks by suppression and generalization. The existing k-anonymity algorithms prevent attackers from illegally obtaining private information by constraining at least k records in an equivalence group. However, this unified anonymity method ignores individual differences and leads to a large amount of information loss. To this end, we introduce sequential three-way decisions into k-anonymity, using a dynamic k-value sequence instead of the fixed k-value to achieve personalized k-anonymity. Specifically, we first construct a hierarchical decision table for k-anonymity by attribute generalization trees and sensitive decision values provided with users. Then, we propose a multi-level personalized k-anonymity privacy-preserving model based on sequential three-way decisions, where we anonymize the partitioning granular data with a dynamic k-value sequence, respectively. Furthermore, we present three practical algorithms to implement the proposed model and discuss the differences between them. Finally, the experimental results demonstrate that the proposed model not only provides a more flexible anonymization method to achieve personalized anonymity, but greatly reduces the information loss. This study provides a complete framework for multi-level privacy protection and enriches the application of sequential three-way decisions.
KAB: A new k-anonymity approach based on black hole algorithm
2022, Journal of King Saud University - Computer and Information Sciences
Citation Excerpt :
In each case, the new crossover operator converges faster to more effective solutions. Lin and Wei (Lin and Wei, 2009) proposed a Genetic Algorithm (GA)-based clustering approach for achieving k-anonymity. In this approach, the initial population of GA is created based on Hybrid Method proposed in (Lin and Wei, 2008).
K-anonymity is the most widely used approach to privacy preserving microdata which is mainly based on generalization. Although generalization-based k-anonymity approaches can achieve the privacy protection objective, they suffer from information loss. Clustering-based approaches have been successfully adapted for k-anonymization as they enhance the data quality, however, the computational complexity of finding an optimal solution has shown as NP-hard. Nature-inspired optimization algorithms are effective in finding solutions to complex problems. We propose, in this paper, a novel algorithm based on a simple nature-inspired metaheuristic called Black Hole Algorithm (BHA), to address such limitations. Experiments on real data set show that data utility has been improved by our approach compared to k-anonymity, BHA-based k-anonymity and clustering-based k-anonymity approaches.
Combined fuzzy clustering and firefly algorithm for privacy preserving in social networks
2020, Expert Systems with Applications
Citation Excerpt :
Generally, in the anonymizing techniques, some edges of the graph or attributes of the data matrix are changed to satisfy protecting against the determined attacks. The anonymized graph/matrix should retain as much information as possible of the original graph/matrix (Lin & Wei, 2009). Various clustering-based techniques based on K-anonymity (KA) (Sweeney, 2002) and its extensions, L-diversity (LD) (Machanavajjhala, Gehrke, Kifer & Venkitasubramaniam, 2006) and T-closeness (TC) (Li, Li & Venkatasubramanian, 2007) have been proposed for graph/matrix anonymizing.
In recent years, an explosive growth of social networks has been made publicly available for understanding the behavior of users and data mining purposes. The main challenge in sharing the social network databases is protecting public released data from individual identification. The most common privacy preserving technique is anonymizing data by removing or changing some information, while the anonymized data should retain as much information as possible of the original data. K-anonymity and its extensions (e.g., L-diversity and T-closeness) have widely been used for data anonymization. The main drawback of the existing anonymity techniques is the lack of protection against attribute/link disclosure and similarity attacks. Moreover, they suffer from high amount of information loss in the released database. In order to overcome these drawbacks, this paper proposes a combined anonymizing algorithm based on K-member Fuzzy Clustering and Firefly Algorithm (KFCFA) to protect the anonymized database against identity disclosure, attribute disclosure, link disclosure, and similarity attacks, and significantly minimize the information loss. In KFCFA, at first, a modified K-member version of fuzzy c-means is utilized to create balanced clusters with at least K members in each cluster. Then, firefly algorithm is performed for further optimizing the primary clusters and anonymizing the network graph and data. To achieve this purpose, a constrained multi-objective function is introduced to simultaneously minimize the clustering error rate and the generated information loss, while satisfying the defined anonymity constraints. The proposed methodology can be utilized for both network graph structures and micro data. Simulation results over four social network databases from Facebook, Google+, Twitter and YouTube demonstrate the efficiency of the proposed KFCFA algorithm to minimize the information loss of the published data and graph, while satisfying K-anonymity, L-diversity and T-closeness conditions.
A new geometric shape-based genetic clustering algorithm for the multi-depot vehicle routing problem
2011, Expert Systems with Applications
Citation Excerpt :
The other is to adapt genetic algorithm directly to the clustering problem. This way, genetic algorithm is the core of the clustering method, and the solution is encoded inside the chromosomes of genetic algorithm (Lin & Wei, 2009). Generally in clustering problems, the number of clusters in a data set is not known, as in most real-life situations.
In this paper, a new type of geometric shape based genetic clustering algorithm is proposed. A genetic algorithm based on this clustering technique is developed for the solution process of the multi-depot vehicle routing problem. A set of problems obtained from the literature is used to compare the efficiency of the proposed algorithm with the nearest neighbor algorithm so as to solve the multi-depot vehicle routing problem. The experimental results show that the proposed algorithm provides a better clustering performance in terms of the distance of each customer to each depot in clusters. This result in a considerably less computation time required, when compared with the nearest neighbor algorithm.
Navigating Ethical Challenges in Psychological Research Involving Digital Remote Technologies and People Who Use Alcohol or Drugs
2024, American Psychologist
(k, m, t)-anonymity: Enhanced privacy for transactional data
2022, Concurrency and Computation: Practice and Experience

View all citing articles on Scopus

View full text