scroll identifier for mobile
main-content

## Über dieses Buch

Privacy and security risks arising from the application of different data mining techniques to large institutional data repositories have been solely investigated by a new research domain, the so-called privacy preserving data mining. Association rule hiding is a new technique in data mining, which studies the problem of hiding sensitive association rules from within the data.

Association Rule Hiding for Data Mining addresses the problem of "hiding" sensitive association rules, and introduces a number of heuristic solutions. Exact solutions of increased time complexity that have been proposed recently are presented, as well as a number of computationally efficient (parallel) approaches that alleviate time complexity problems, along with a thorough discussion regarding closely related problems (inverse frequent item set mining, data reconstruction approaches, etc.). Unsolved problems, future directions and specific examples are provided throughout this book to help the reader study, assimilate and appreciate the important aspects of this challenging problem.

Association Rule Hiding for Data Mining is designed for researchers, professors and advanced-level students in computer science studying privacy preserving data mining, association rule mining, and data mining. This book is also suitable for practitioners working in this industry.

## Inhaltsverzeichnis

### Chapter 1. Introduction

The significant advances in data collection and data storage technologies have provided the means for the inexpensive storage of enormous amounts of transactional data in data warehouses that reside in companies and public sector organizations. Apart from the benefit of using this data per se (e.g., for keeping up to date profiles of the customers and their purchases, maintaining a list of the available products, their quantities and price, etc), the mining of these datasets with the existing data mining tools can reveal invaluable knowledge that was unknown to the data holder beforehand. The extracted knowledge patterns can provide insight to the data holders as well as be invaluable in important tasks, such as decision making and strategic planning. Moreover, companies are often willing to collaborate with other entities who conduct similar business, towards the mutual benefit of their businesses. Significant knowledge patterns can be derived and shared among the partners through the collaborative mining of their datasets. Furthermore, public sector organizations and civilian federal agencies usually have to share a portion of their collected data or knowledge with other organizations having a similar purpose, or even make this data and/or knowledge public in order to comply with certain regulations. For example, in the United States, the National Institutes of Health (NIH) [2] endorses research that leads to significant findings which improve human health and provides a set of guidelines which sanction the timely dissemination of NIH-supported research findings for use by other researchers. At the same time, the NIH acknowledges the need to maintain privacy standards and, thus, requires NIH-sponsored investigators to disclose data collected or studied in a manner that is “free of identifiers that could lead to deductive disclosure of the identity of individual subjects” [2] and deposit it to the Database of Genotype and Phenotype (dbGaP) [45] for broad dissemination.
Aris Gkoulalas-Divanis, Vassilios S. Verykios

### Chapter 2. Background

In this chapter we provide the background and terminology that are necessary for the understanding of association rule hiding. Specifically, in Section 2.1, we present the theory behind association rule mining and introduce the notion of the positive and the negative borders of the frequent itemsets. Following that, Section 2.2 explicitly states the goals of association rule hiding methodologies, discusses the different types of solutions that association rule hiding algorithms can produce, as well as it delivers the formal problem statement for association rule hiding and its popular variant, frequent itemset hiding.
Aris Gkoulalas-Divanis, Vassilios S. Verykios

### Chapter 3. Classes of Association Rule Hiding Methodologies

In this chapter, we present a taxonomy of frequent itemset and association rule hiding algorithms after having reviewed a large collection of independent works in this research area. The chapter is organized as follows. Section 3.1 presents a set of four orthogonal dimensions that we used to classify the existing methodologies by taking into consideration a number of parameters related to their workings. Following that, Section 3.2 straightens out the three principal classes of association rule hiding methodologies that have been proposed over the years and discusses the main properties of each class of approaches.
Aris Gkoulalas-Divanis, Vassilios S. Verykios

### Chapter 4. Other Knowledge Hiding Methodologies

Association rule hiding algorithms aim at protecting sensitive knowledge captured in the form of frequent itemsets or association rules. However, (sensitive) knowledge may appear in various forms directly related to the applied data mining algorithm that achieved to expose it. Consequently, a set of hiding approaches have been proposed over the years to allow for the safeguarding of sensitive knowledge exposed by data mining tasks such as clustering, classification and sequence mining. In this chapter, we briefly discuss some state-of-the-art approaches for the hiding of sensitive knowledge that is depicted in any of the aforementioned formats.
Aris Gkoulalas-Divanis, Vassilios S. Verykios

### Chapter 5. Summary

Association rule hiding is a subarea of privacy preserving data mining that focuses on the privacy implications originating from the application of association rule mining to large public databases. In this first part of the book, we provided the basics for the understanding of the problem, which investigates how sensitive association rules can escape the scrutiny of malevolent data miners by modifying certain values in the original database, and presented some related problems on the knowledge hiding thread. Specifically, in the first two chapters we motivated the problem of association rule hiding, presented the necessary background for its understanding and derived the problem statement along two popular variants of the problem: frequent itemset hiding and association rule hiding. In Chapter 3, we provided a classification of the association rule hiding algorithms to facilitate the organization that we follow for the presentation of the methodologies in the rest of this book. Our proposed taxonomy partitions the association rule hiding methodologies along four orthogonal directions based on the employed hiding strategy, the data modification strategy, the number of rules that are concurrently hidden, and the nature of the algorithm. Elaborating on the last direction, we identified three classes of association rule hiding approaches, namely heuristic-based, border-based and exact approaches, and discussed the differences among them. Last, in Chapter 4 we examined the problem of knowledge hiding in the related research areas of clustering, classification and sequence mining. For each of these areas we briefly discussed some state-of-the-art approaches that have been proposed.
Aris Gkoulalas-Divanis, Vassilios S. Verykios

### Chapter 6. Distortion Schemes

In this chapter, we review some popular support-based and confidence-based distortion algorithms that have been proposed in the association rule hiding literature. Distortion-based approaches operate by selecting specific items to include to (or exclude from) selected transactions of the original database in order to facilitate the hiding of the sensitive association rules. Two of the most commonly employed strategies for data distortion involve the swapping of values between transactions [10, 20], as well as the deletion of specific items from the database [54]. In the rest of this chapter, we present an overview of these approaches along with other methodologies that also fit in the same class.
Aris Gkoulalas-Divanis, Vassilios S. Verykios

### Chapter 7. Blocking Schemes

An interesting line of research in the class of heuristic methodologies for association rule hiding regards approaches that block certain original values of a database by introducing unknowns (i.e., by replacing the real values of some items in selected transactions of the original database with question marks “?”). Unlike data distortion schemes, blocking methodologies do not add any false information to the original database and thus provide a much safer alternative for critical real life applications, where the distinction between “false” and “unknown” can be vital. Due to the introduction of unknowns, the support and the confidence of association rules that are mined from the sanitized database becomes fuzzified to an interval (rather than being an exact value) and can no longer be safely estimated. Blocking approaches, similarly to distortion approaches, can be partitioned into support-based and confidence-based, depending on whether they use the support or the confidence of the association rules to drive the hiding process. In this chapter, we review two methodologies that belong to this class of approaches.
Aris Gkoulalas-Divanis, Vassilios S. Verykios

### Chapter 8. Summary

In the second part of the book, we presented some state-of-the-art algorithms that have been proposed for association rule hiding which belong to the class of heuristic-based approaches. The heuristic class of approaches collects computationally and memory efficient algorithms that operate in a series of steps by optimizing certain subgoals to drive the hiding process. We partitioned the current heuristic approaches into two main categories: distortion-based schemes, which operate by alternating certain items in selected transactions from 1’s to 0’s (and vice versa), and blocking-based schemes, which replace certain items in selected transactions with unknowns, in order to facilitate association rule hiding. Each category of approaches was further partitioned into support-based and confidence-based methodologies, depending on whether the algorithm uses the support or the confidence of the rule to drive the hiding process. A large amount of research has been conducted over the past years, leading to several interesting heuristic methodologies being proposed for association rule hiding. In Chapters 6 and 7 it was our intention to cover a selection of the existing methodologies by considering the most representative ones for this domain. As a final remark, we should note down that, as is also evident from the number of presented works in each category, research on heuristic methodologies for association rule hiding has mostly concentrated on the direction of distortionbased approaches rather than blocking techniques. However, blocking techniques are certainly more preferable than conventional distortion methodologies in several real life scenarios and for this reason we feel that such approaches are expected to attract more scientific interest in the years to come. Moreover, the combination of distortion and blocking techniques is also a prominent research direction, which may lead to solutions that further minimize the data loss that is introduced to the original database to account for the hiding of the sensitive knowledge.
Aris Gkoulalas-Divanis, Vassilios S. Verykios

### Chapter 9. Border Revision for Knowledge Hiding

In this chapter, we highlight the process of border revision, which plays a significant role on both border-based and exact approaches for association rule hiding. Simply stated, the process of border revision captures those itemsets of the original database which need to remain frequent and those that need to become infrequent in the sanitized version of the database, in order to allow furnishing an optimal hiding solution. After presenting the theory behind border revision, we introduce a set of algorithms (adopted from [27]) that can be applied for the efficient computation of both the original and the revised borders in a transactional database, when given a minimum support threshold.
Aris Gkoulalas-Divanis, Vassilios S. Verykios

### Chapter 10. BBA Algorithm

Sun & Yu [66, 67] in 2005 proposed the first frequent itemset hiding methodology that relies on the notion of the border [46] of the nonsensitive frequent itemsets to track the impact of altering transactions in the original database. By evaluating the impact of each candidate item modification to the itemsets of the revised positive border, the algorithm greedily selects to apply those modifications (item deletions) that cause the least impact to the border itemsets. As already covered in the previous chapter, the border itemsets implicitly dictate the status (i.e., frequent vs. infrequent) of every itemset in the database. Consequently, the quality of the borders directly affects the quality of the sanitized database that is produced by the hiding algorithm.
Aris Gkoulalas-Divanis, Vassilios S. Verykios

### Chapter 11. Max-Min Algorithms

Moustakides & Verykios in [50, 51] propose two border based methodologies that rely on the max–min criterion for the hiding of the sensitive itemsets. Both methodologies use the revised positive border of the frequent itemsets to track the impact of each tentative item modification that helps towards the hiding of a sensitive itemset. Then, they select to apply those item modifications to the original database that effectively conceal all the sensitive knowledge, while minimally affecting the itemsets of the revised positive border and, consequently, the nonsensitive frequent itemsets. For each item of a sensitive itemset, the Max–Min algorithms identify the set of itemsets from the revised positive border which depend on it, and select among them the ones that are supported the least. Then, from among all minimum supported border itemsets (coming from the previously computed sets for the different items of the sensitive itemset), the itemset with the highest support is selected as it is the one with the maximum distance from the borderline that separates the frequent from the infrequent itemsets.
Aris Gkoulalas-Divanis, Vassilios S. Verykios

### Chapter 12. Summary

In the third part of the book, we examined two popular border-based approaches that have been recently proposed for the hiding of sensitive frequent itemsets. The first approach, called BBA, was presented in Chapter 10 and was developed by Sun & Yu in [66, 67]. This approach uses the border of the nonsensitive frequent itemsets to track the impact of altering items of transactions in the original database. The second approach, called Max–Min, was proposed by Moustakides & Verykios in [50] and was covered in Chapter 11. It involves two heuristic algorithms that rely on the max–min criterion and use the revised positive border to hide the sensitive knowledge with minimal impact on the nonsensitive frequent itemsets. By restricting the impact of any tentative item modifications to the itemsets of the revised positive border, both BBA and Max–Min achieve to identify hiding solutions with fewer side-effects, when compared to pure heuristic approaches. This is due to the fact that tracking the impact of item modifications to the itemsets of the borders provides a measure of the side-effects that are induced by the hiding algorithm to the original database (in terms of lost itemsets) and thus serves as an optimization criterion to guide the hiding process towards high quality solutions. As a concluding remark, we should point out that although border-based approaches provide an improvement over pure heuristic approaches, they are still reliant on heuristics to decide upon the item modifications that they apply on the original database. As a result, in many cases these methodologies are unable to identify optimal hiding solutions, although such solutions may exist for the problem at hand.
Aris Gkoulalas-Divanis, Vassilios S. Verykios

### Chapter 13. Menon's Algorithm

Menon, et al. in [47] are the first to propose a frequent itemset hiding methodology that consists of an exact part and a heuristic part to facilitate the hiding of the sensitive knowledge. The exact part uses the original database to formulate a Constraints Satisfaction Problem (CSP)1 in the universe of the sensitive itemsets, with the objective of identifying the minimum number of transactions that have to be sanitized for the proper hiding of the sensitive knowledge. The optimization process of solving the CSP is driven by a criterion function that is inspired from the measure of accuracy [42, 60], essentially penalizing the hiding algorithm based on the number of transactions (instead of items) that are sanitized from the original database to accommodate the hiding of the sensitive itemsets. The constraints imposed in the integer programming formulation aim to capture the transactions of the database that need to be sanitized for the hiding of each sensitive itemset. An integer programming solver is then used to find the solution of the CSP that optimizes the objective function and to derive the value of the objective.
Aris Gkoulalas-Divanis, Vassilios S. Verykios

### Chapter 14. Inline Algorithm

In this chapter, we present in detail the first algorithm that has been proposed which does not rely on any heuristics to secure the sensitive knowledge derived by association rule mining. Similarly to Menon’s algorithm (covered in Chapter 13), the inline algorithm of [23] aims to hide the sensitive frequent itemsets of the original database $$\mathcal{D_O}$$ that can lead to the production of sensitive association rules. As a first step, we will introduce the notion of distance between two databases along with a measure for quantifying it. The quantification of distance provides us with important knowledge regarding the minimum data modification that is necessary to be induced to the original database to facilitate the hiding of the sensitive itemsets, while minimally affecting the nonsensitive ones. By trying to minimize the distance between the original database and its sanitized counterpart, the inline algorithm formulates the hiding process as an optimization problem in which distance is the optimization criterion. Specifically, the hiding process is modeled as a CSP which is subsequently solved by using a technique called Binary Integer Programming (BIP). The attained solution is such that the distance measure is minimized.
Aris Gkoulalas-Divanis, Vassilios S. Verykios

### Chapter 15. Two-Phase Iterative Algorithm

In this chapter, we present a two–phase iterative algorithm (proposed in [27]) that extends the functionality of the inline algorithm of [23] (Chapter 14) to allow for the identification of exact hiding solutions for a wider spectrum of problem instances. A problem instance is defined as the set of (i) the original dataset $$\mathcal{D_O}$$, (ii) the minimum frequency threshold mfreq that is considered for its mining, and (iii) the set of sensitive itemsets S that have to be protected. Since the inline algorithm allows only supported items in $$\mathcal{D_O}$$ to become unsupported in $$\mathcal{D}$$, there exist problem instances that although they allow for an exact hiding solution, the inline approach is incapable of finding it. The truthfulness of this statement can be observed in the experiments provided in Section 15.4, as well as in the experimental evaluation of [26].
Aris Gkoulalas-Divanis, Vassilios S. Verykios

### Chapter 16. Hybrid Algorithm

Gkoulalas–Divanis & Verykios in [26] introduce the first exact methodology to strategically perform sensitive frequent itemset hiding based on a new notion of hybrid database generation. This approach broadens the regular process of data sanitization (as introduced in [10] and adopted by the overwhelming majority of researchers [10, 47, 55]) by applying an extension to the original database instead of either modifying existing transactions (directly or through the application of transformations), or rebuilding the dataset from scratch to accommodate sensitive knowledge hiding. The extended portion of the database contains a set of carefully crafted transactions that achieve to lower the importance of the sensitive patterns to a degree that they become uninteresting from the perspective of the data mining algorithm, while minimally affecting the importance of the nonsensitive ones. The hiding process is guided by the need to maximize the data utility of the sanitized database by introducing the least possible amount of side-effects. The released database, which consists of the initial part (original database) and the extended part (database extension), can guarantee the protection of the sensitive knowledge, when mined at the same or higher support as the one used in the original database.
Aris Gkoulalas-Divanis, Vassilios S. Verykios

### Chapter 17. Parallelization Framework for Exact Hiding

In this chapter, we elaborate on a novel framework, introduced in [25], that is suitable for decomposition and parallelization of the exact hiding algorithms that were covered in Chapters 14, 15 and 16. The framework operates in three phases to decompose CSPs that are produced by the exact hiding algorithms, into a set of smaller CSPs that can be solved in parallel. In the first phase, the original CSP is structurally decomposed into a set of independent CSPs, each of which is assigned to a different processor. In the second phase, for each independent CSP a decision is made on whether it should be further decomposed into a set of dependent CSPs, through a function that questions the gain of any further decomposition. In the third and last step, the solutions of the various CSPs, produced as part of the decomposition process, are appropriately combined to provide the solution of the original CSP (i.e. the one prior to the decomposition). The generality of the framework allows it to efficiently handle any CSP that consists of linear constraints involving binary variables and whose objective is to maximize (or minimize) the summation of these variables. Together with existing approaches for the parallel mining of association rules [6, 34, 80], the framework of [25] can be applied to parallelize the most time consuming steps of the exact hiding algorithms.
Aris Gkoulalas-Divanis, Vassilios S. Verykios

### Chapter 18. Quantifying the Privacy of Exact Hiding Algorithms

The exact hiding algorithms that were presented in Chapters 14, 15, and 16 are all based on the principle of minimum harm (distortion), which requires the minimum amount of modifications to be made to the original database to facilitate sensitive knowledge hiding. As an effect, in most cases (depending on the problem instance at hand), the sensitive itemsets are expected to be positioned just below the revised borderline in the computed sanitized database $$\mathcal{D}$$. However, the selection of the minimum support threshold based on which the hiding is performed can lead to radically different solutions, some of which are bound to be superior to others in terms of offered privacy. In this chapter, we present a layered approach that was originally proposed in [27], which enables the owner of the data to quantify the privacy that is offered on a given database by the employed exact hiding algorithm. Assuming that an adversary has no knowledge regarding which of the infrequent itemsets in $$\mathcal{D}$$ are the sensitive ones, this approach can compute the disclosure probability for the hidden sensitive itemsets, given the sanitized database $$\mathcal{D}$$ and a minimum support or frequency threshold msup/mfreq.
Aris Gkoulalas-Divanis, Vassilios S. Verykios

### Chapter 19. Summary

In this part of the book, we studied the third class of association rule hiding approaches that involves methodologies which lead to superior hiding solutions when compared to heuristic and border-based algorithms. The methodologies of this class operate by (i) using the process of border revision (Chapter 9) to model the optimal hiding solution, (ii) constructing a constraints satisfaction problem using the itemsets of the computed revised borders, and (iii) solving the constraints satisfaction problem by applying integer programming. Transforming the hiding problem to an optimization problem guarantees the minimum amount of side-effects in the computed hiding solution and offers minimal distortion of the original database to accommodate sensitive knowledge hiding. Unlike the previous classes of approaches, exact hiding algorithms perform the sanitization process by considering all relevant data modifications to select the one that leads to an optimal hiding solution for the problem at hand.
Aris Gkoulalas-Divanis, Vassilios S. Verykios

### Chapter 20. Conclusions

The serious privacy concerns that are raised due to the sharing of large transactional databases with untrusted third parties for association rule mining purposes, soon brought into existence the area of association rule hiding, a very popular subarea of privacy preserving data mining. Association rule hiding focuses on the privacy implications originating from the application of association rule mining to shared databases and aims to provide sophisticated techniques that effectively block access to sensitive association rules that would otherwise be revealed when mining the data. The research in this area has progressed mainly along three principal directions: (i) heuristic-based approaches, (ii) border-based approaches, and (iii) exact hiding approaches. Taking into consideration the rich work proposed so far, in this book we tried to collect the most significant research findings since 1999, when this area was brought into existence, and effectively cover each principal line of research. The detail of our presentation was intentionally finer on exact hiding approaches, since they comprise the most recent direction, offering increased quality guarantees in terms of distortion and side-effects introduced by the hiding process.
Aris Gkoulalas-Divanis, Vassilios S. Verykios

### Chapter 21. Roadmap to Future Work

There is a plethora of open issues related to the problem of association rule hiding that are still under investigation. First of all, the emergence of sophisticated exact hiding approaches of high complexity, especially for very large databases, causes the consideration of efficient parallel approaches to be employed for improving the runtime of these algorithms. Parallel approaches allow for decomposition of the constraints satisfaction problem into numerous components that can be solved independently. The overall solution is then attained as a function of the objectives of the individual solutions. A framework for decomposition and parallelization of exact hiding approaches has been recently proposed in [25] and is covered in Chapter 17 of the book. Although this framework improves the runtime of solving the constraints satisfaction problem that is produced by the exact hiding algorithms, we are confident the further optimizations can be achieved by exploiting the inherent characteristics of the constraints that are involved in the CSP. Also, different optimization techniques can be adopted to allow searching the space of possible solutions for the CSP, in a more advanced way.
Aris Gkoulalas-Divanis, Vassilios S. Verykios

### Backmatter

Weitere Informationen

## BranchenIndex Online

Die B2B-Firmensuche für Industrie und Wirtschaft: Kostenfrei in Firmenprofilen nach Lieferanten, Herstellern, Dienstleistern und Händlern recherchieren.

## Whitepaper

- ANZEIGE -

### Best Practices für die Mitarbeiter-Partizipation in der Produktentwicklung

Unternehmen haben das Innovationspotenzial der eigenen Mitarbeiter auch außerhalb der F&E-Abteilung erkannt. Viele Initiativen zur Partizipation scheitern in der Praxis jedoch häufig. Lesen Sie hier  - basierend auf einer qualitativ-explorativen Expertenstudie - mehr über die wesentlichen Problemfelder der mitarbeiterzentrierten Produktentwicklung und profitieren Sie von konkreten Handlungsempfehlungen aus der Praxis.