Genetic algorithm-based clustering approach for k-anonymization

https://doi.org/10.1016/j.eswa.2009.02.009Get rights and content

Abstract

k-Anonymity has been widely adopted as a model for protecting public released microdata from individual identification. This model requires that each record must be identical to at least k-1 other records in the anonymized dataset with respect to a set of privacy-related attributes. Although anonymizing the original dataset to satisfy the requirement of k-anonymity is easy, the anonymized dataset must preserve as much information as possible of the original dataset. Clustering techniques have recently been successfully adapted for k-anonymization. This work proposes a novel genetic algorithm-based clustering approach for k-anonymization. The proposed approach adopts various heuristics to select genes for crossover operations. Experimental results show that this approach can further reduce the information loss caused by traditional clustering-based k-anonymization techniques.

Introduction

Privacy protection is an important societal concern. Protection of public released microdata from individual identification becomes increasingly important as the public becomes increasingly concerned with privacy. Most privacy protection techniques work by randomizing (Agrawal and Srikant, 2000, Lin and Cheng, 2009) or generalizing Samarati, 2001 the original data, but can also degrade the quality of the data. Therefore, a dilemma exists between data quality and data privacy.

As a privacy-preserving approach, has been extensively studied in recent years, the k-anonymity model (Samarati, 2001, Sweeney, 2002) works by ensuring that each record of a table is identical to at least k-1 other records with respect to a set of privacy-related attributes, called quasi-identifiers, that could be used to identify individuals by linking these attributes to external datasets. For instance, consider the hospital data in Table 1, where the attributes ZipCode, Gender and Age are considered as quasi-identifiers. Table 2 shows a 3-anonymization version of Table 1, where anonymization is achieved via generalization at the attribute level (Ciriani, di Vimercati, Foresti, & Samarati, 2007), i.e., if two records contain the same value at a quasi-identifier before anonymization, then they are generalized to the same value at the quasi-identifier by the anonymization process. Table 3 shows another 3-anonymization version of Table 1, where anonymization is achieved via generalization at the cell-level (Ciriani et al., 2007), i.e., two cells with same value could be generalized to different values (e.g., value “75275” in the ZipCode column and value “Male” in the Gender column). Because anonymization via generalization at the cell-level generates data containing different generalization levels within a column (e.g., values “7527” and “75275” in the ZipCode column and values “Male” and “Person” in the Gender column of Table 3), utilizing such data becomes more complex than utilizing the data generated via generalization at the attribute level. However, in terms of data quality, generalization at the cell-level causes less information loss than generalization at the attribute level, and therefore is adopted in this study.

Anonymization via generalization at the cell-level can proceed in two steps, partitioning and anonymizing. In the partitioning step, the dataset to be anonymized is partitioned into several groups such that each group contains at least k records. Then, in the anonymizing step, records in the same group are generalized such that their values at each quasi-identifier are identical. Minimizing the information loss incurred by the anonymizing step requires that the partitioning step places similar records (with respect to the quasi-identifiers) in the same group. In data mining, clustering is an effective means of partitioning records into clusters such that records within a cluster resemble each other, while records in different clusters are clearly distinct from each other. Hence, clustering techniques have been successfully adapted for k-anonymization (Byun et al., 2007, Chiu and Tsai, 2007, Lin and Wei, 2008, Lin et al., 2008, Loukides and Shao, 2007).

Genetic algorithms (GA) are well known for their global search capabilities. The constraint of k-anonymity means that traditional GA-based clustering techniques cannot be easily adapted for the k-anonymization problem without causing much overhead. Section 2.3 provides details. This work proposes a GA-based clustering approach for k-anonymization. This approach starts with a dataset that has been partitioned using a traditional clustering-based k-anonymization technique, called the Hybrid method (Lin et al., 2008). The output of the Hybrid method is encoded as a population of chromosomes. Various heuristics are then adopted to select genes to undergo the crossover operations to reduce the information loss of the dataset. This approach is, to our knowledge, the first GA-based clustering method proposed for k-anonymization at the cell-level. Experimental results demonstrate that the proposed approach further reduces the information loss caused by the Hybrid method.

The rest of this paper is organized as follows. Section 2 reviews basic concepts on k-anonymization with a focus on clustering-based methods for k-anonymization and GA-based clustering techniques. Section 3 describes the proposed GA-based approach for k-anonymization. Section 4 presents a performance analysis of the proposed approach. Conclusions are finally drawn in Section 5, along with recommendations for further research.

Section snippets

Basic concepts

The k-anonymity model has attracted considerable attention in recent years. Many approaches have been proposed for k-anonymization and its variations. Please refer to Ciriani et al. (2007) for a survey of various k-anonymization approaches. This section first describes the concept of information loss, which measures the effectiveness of various k-anonymization approaches. Several recently proposed clustering methods for k-anonymization are then reviewed. Finally, GA-based clustering methods are

A GA-based clustering approach for k-anonymization

To scale up the assignment-oriented methods for large datasets, while also avoid the feasibility check of chromosomes, this section presents an assignment-oriented method for the k-anonymization problem. The proposed approach differs from previous assignment-oriented methods mainly in terms of the encoding scheme of chromosomes, and in the gene selection heuristic for crossover operations.

Experimental results

This section evaluates various parameter settings of the proposed GA-based approach, and then compares its performance against the Hybrid method (Lin et al., 2008). Both methods were implemented in Java, and run on a desktop PC with Intel Core2Duo 2.2 GHz CPU and 2GB of RAM under the MS Window Vista operating system.

This experiment used the Adult dataset from the UC Irvine Machine Learning Repository (Asuncion & Newman, 2007), which is considered to be a standard benchmark for k-anonymization.

Conclusions

Most k-anonymization approaches rely on heuristics to minimize the information loss incurred by the anonymization process. This work formalizes the k-anonymization problem as a constraint optimization problem, and proposes a GA-based clustering approach for this problem. The proposed approach further reduces the total information loss caused by the Hybrid method.

Many variations of the k-anonymity model have recently been proposed to further protect the data from identification, e.g., l

References (30)

  • B.C.M. Fung et al.

    Top-down specialization for information and privacy preservation

  • D.E. Goldberg

    Genetic algorithms in search, optimization and machine learning

    (1989)
  • L. Hall et al.

    Clustering with a genetically optimized approach

    IEEE Transactions on Evolutionary Computation

    (1999)
  • LeFevre, K., DeWitt, D. J., & Ramakrishnan, R. (2006). Mondrian multidimensional k-anonymity. In Proceedings of the...
  • Li, J., Gao, X., & Jiao, L.-C. (2003). A GA-based clustering algorithm for large data sets with mixed and categorical...
  • Cited by (36)

    • KAB: A new k-anonymity approach based on black hole algorithm

      2022, Journal of King Saud University - Computer and Information Sciences
      Citation Excerpt :

      In each case, the new crossover operator converges faster to more effective solutions. Lin and Wei (Lin and Wei, 2009) proposed a Genetic Algorithm (GA)-based clustering approach for achieving k-anonymity. In this approach, the initial population of GA is created based on Hybrid Method proposed in (Lin and Wei, 2008).

    • Combined fuzzy clustering and firefly algorithm for privacy preserving in social networks

      2020, Expert Systems with Applications
      Citation Excerpt :

      Generally, in the anonymizing techniques, some edges of the graph or attributes of the data matrix are changed to satisfy protecting against the determined attacks. The anonymized graph/matrix should retain as much information as possible of the original graph/matrix (Lin & Wei, 2009). Various clustering-based techniques based on K-anonymity (KA) (Sweeney, 2002) and its extensions, L-diversity (LD) (Machanavajjhala, Gehrke, Kifer & Venkitasubramaniam, 2006) and T-closeness (TC) (Li, Li & Venkatasubramanian, 2007) have been proposed for graph/matrix anonymizing.

    • A new geometric shape-based genetic clustering algorithm for the multi-depot vehicle routing problem

      2011, Expert Systems with Applications
      Citation Excerpt :

      The other is to adapt genetic algorithm directly to the clustering problem. This way, genetic algorithm is the core of the clustering method, and the solution is encoded inside the chromosomes of genetic algorithm (Lin & Wei, 2009). Generally in clustering problems, the number of clusters in a data set is not known, as in most real-life situations.

    • (k, m, t)-anonymity: Enhanced privacy for transactional data

      2022, Concurrency and Computation: Practice and Experience
    View all citing articles on Scopus
    View full text