Background
Related work and problem analysis
Related work
Traditional heuristic approaches
S. no | Approach | Technique used | Achieved | Issues |
---|---|---|---|---|
1 | [11] | Constructs a lattice-like graph of the dataset; greedy iterative traversal to the immediate subset; selects the victim with maximum support | Good privacy level; simple and fast | Does not consider the extent of support loss for large itemsets, so data quality is affected; poor scalability to large-scale data |
2 | [14] | Increases support of the antecedent \((A) \to B\); decreases support of the consequent \(A \to (B)\); hybrid decrement until confidence or support falls below the threshold | Decreases the support or the confidence, but not both | Relies on the strong assumption that an item contained in one sensitive itemset will not appear in another sensitive itemset; scalability |
3 | [15] | First algorithm deletes the maximum-support item \(i \in s\) from the minimum-length transaction; second algorithm sorts sensitive itemsets by size and support and masks them in round-robin fashion | Sanitizes minimum-length transactions first in order to reduce side effects on non-sensitive data; round-robin masking makes the second algorithm fair | Scalability remains an issue; high execution time for large datasets |
4 | [9] | MaxFIA: deletes the maximum-support item \(i \in s\) where \(s \subseteq T\); MinFIA: deletes the minimum-support item \(i \in s\) where \(s \subseteq T\); IGA: clusters sensitive patterns sharing the same itemsets and deletes the max- or min-support item | Cluster formation hides a set of sensitive itemsets at once; no traversal required, since max- and min-support items are counted and selected easily; sensitive data is separated out to reduce data size and sanitization time | Scalability and high execution time in the case of large-scale datasets |
5 | [10] | SWA masks sensitive rules by hiding the maximum-frequency item \(i \in s\) where \(s \subseteq T\); requires a single database scan | Conceals all sensitive rules; requires a single database scan; the sliding-window concept makes the approach scalable to some extent | High execution time and scalability problems when the data is big data |
6 | [12] | Aggregate: deletes the transaction \(T \cap S\) supporting the maximum number of sensitive itemsets; Disaggregate: deletes the maximum-support item \(i \in s\) where \(s \subseteq T\) from the remaining transactions | The hybrid approach is fast, as it selectively identifies a transaction and deletes the maximum-support item | Direct deletion of transactions harms data quality, since deleted transactions may also contain non-sensitive information; scalability issues remain |
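To make the victim-deletion heuristics in the table concrete, the following is a minimal single-machine sketch of a MaxFIA-style step in Python. The function name `maxfia_step` and the `min_support` threshold semantics (an itemset is frequent when its support is at least `min_support`) are illustrative assumptions, not the original algorithm's exact interface:

```python
from collections import Counter

def maxfia_step(transactions, sensitive, min_support):
    """MaxFIA-style heuristic (illustrative sketch): repeatedly delete
    the maximum-support item of a sensitive itemset from a supporting
    transaction until the itemset's support drops below min_support."""
    txns = [set(t) for t in transactions]
    support = Counter(i for t in txns for i in t)
    while True:
        # transactions still supporting the sensitive itemset
        supporting = [t for t in txns if sensitive <= t]
        if len(supporting) < min_support:
            return [sorted(t) for t in txns]
        # victim = item of the sensitive itemset with maximum support
        victim = max(sensitive, key=lambda i: support[i])
        supporting[0].discard(victim)  # sanitize one transaction
        support[victim] -= 1
```

For example, hiding \(\{a, b\}\) with `min_support = 2` in `[["a","b","c"], ["a","b"], ["b","c"]]` deletes the higher-support item `b` from one supporting transaction, leaving the itemset infrequent.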
Problem analysis
Challenges in collaborating basic heuristics with MapReduce
- Traditionally, all sensitive itemsets are masked one by one.
- After each victim-item removal, the transaction list of every affected sensitive itemset is revised. In a parallel environment, this lookahead procedure requires all nodes to communicate and revise their sets of sensitive itemsets and the respective lists, which ultimately increases communication and computation cost.
- If every node sanitizes its set of transactions (data chunk) independently, there is a chance of over-hiding, which degrades the data quality of the sanitized dataset.
- It must be studied how to divide a dataset into small data chunks such that, even during parallel sanitization, the minimum-length/DoC transaction can be modified first.
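The over-hiding risk can be illustrated with a toy Python example (a hypothetical two-node split, not the paper's algorithm): suppose a sensitive itemset with global support 3 and threshold \(Th = 3\) needs only \(\Delta = 1\) masking, but each node hides it independently, so two victims are deleted and support drops further than necessary:

```python
def count_support(chunks, itemset):
    """Global support of itemset across all data chunks."""
    return sum(1 for chunk in chunks for t in chunk if itemset <= t)

sensitive = {"a", "b"}
# global support = 3; with threshold Th = 3, a single masking suffices
chunks = [[{"a", "b", "c"}], [{"a", "b"}, {"a", "b", "d"}]]

# each node independently deletes the victim "b" from one of its
# supporting transactions, unaware of the other node's deletion
for chunk in chunks:
    for t in chunk:
        if sensitive <= t:
            t.discard("b")
            break

print(count_support(chunks, sensitive))  # prints 1: over-hidden, 2 was enough
```

Two deletions happen where one would do, needlessly lowering the support of non-sensitive patterns that share the victim item.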
MapReduce
Proposed MapReduce version of MaxFIA and SWA
Overview
- Phase-I
  - Sub-routine 1 (Data separator): Sensitive transactions \(T^{*} = \{T \in D \mid s \subseteq T \text{ for some } s \in S\}\) are separated from the non-sensitive transactions \(T'\) in order to reduce the size of the dataset that needs to be processed further.
  - Sub-routine 2 (Frequency calculator): The support of each 1-frequent item is calculated and stored in a support index file (SIF), so that the support of any item can be accessed directly by any node in the cluster.
  - Sub-routine 3 (Victim identifier): For each sensitive itemset, the item with maximum support is selected as the victim.
  - Sub-routine 4 (Transaction sorting): Sensitive transactions are sorted according to the given condition (e.g. length of transaction, degree of conflict).
  - Sub-routine 5 (Sensitive itemset effect calculator): It calculates \(\Delta = \text{current support} - Th(\text{minimum support})\), the minimum number of times a sensitive itemset needs to be masked to make it infrequent. All sensitive itemsets and their corresponding \(\Delta\) values are stored in a global index file (GIF).
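The five Phase-I sub-routines can be sketched sequentially in Python. This is a simplified single-process sketch under stated assumptions: the function name `phase_one` and its parameters are hypothetical, the SIF and GIF are modeled as in-memory dictionaries rather than files, and the MapReduce distribution itself is omitted:

```python
from collections import Counter

def phase_one(dataset, sensitive_sets, min_support):
    """Sequential sketch of Phase-I sub-routines 1-5 (illustrative)."""
    # Sub-routine 1: separate sensitive (T*) from non-sensitive (T') transactions
    sensitive_txns = [t for t in dataset
                      if any(s <= t for s in sensitive_sets)]
    other_txns = [t for t in dataset
                  if not any(s <= t for s in sensitive_sets)]

    # Sub-routine 2: support index file (SIF), here an in-memory Counter
    sif = Counter(i for t in dataset for i in t)

    # Sub-routine 3: victim = maximum-support item of each sensitive itemset
    victims = {frozenset(s): max(s, key=lambda i: sif[i])
               for s in sensitive_sets}

    # Sub-routine 4: sort sensitive transactions, e.g. by transaction length
    sensitive_txns.sort(key=len)

    # Sub-routine 5: GIF of Delta values (Delta = current support - Th)
    gif = {frozenset(s): sum(1 for t in dataset if s <= t) - min_support
           for s in sensitive_sets}
    return sensitive_txns, other_txns, sif, victims, gif
```

In the distributed setting each sub-routine would be a MapReduce job; the sketch only fixes the data flow between them.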
- Phase-II
  - Sensitive transactions are further divided into data chunks using a data partitioner and distributed over n computing machines.
  - At each node, the victim item of each sensitive itemset is removed until all restricted information is hidden.
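Phase-II at a single node can be sketched as follows. The per-chunk masking quota used here (the share of each itemset's \(\Delta\) assigned to this node) is a simplifying assumption introduced for the sketch to avoid over-hiding; the partitioner and cluster distribution are likewise assumed away:

```python
def sanitize_chunk(chunk, victims, quota):
    """Per-node Phase-II sketch: remove the victim item of each
    sensitive itemset from this chunk's transactions until the
    chunk's masking quota for that itemset is exhausted."""
    for s, victim in victims.items():
        remaining = quota.get(s, 0)
        for t in chunk:
            if remaining == 0:
                break
            if s <= t:           # transaction still supports the itemset
                t.discard(victim)
                remaining -= 1
    return chunk
```

With `victims = {frozenset({"a","b"}): "b"}` and a quota of 1, only the first supporting transaction in the chunk loses its victim item and the rest are left untouched.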