Background
Privacy and security concerns in big data
Privacy and security concerns
S.No | Privacy | Security |
---|---|---|
1 | Privacy is the appropriate use of a user’s information | Security is the “confidentiality, integrity and availability” of data |
2 | Privacy is the ability to decide what information of an individual goes where | Security offers the ability to be confident that decisions are respected |
3 | Privacy often concerns a consumer’s right to safeguard their information from other parties | Security may provide for confidentiality. The overall goal of most security systems is to protect an enterprise or agency [72] |
4 | It is possible to have poor privacy and good security practices | However, it is difficult to have good privacy practices without a good data security program |
5 | For example, if a user makes a purchase from XYZ Company and provides payment [13] and address information so that the product can be shipped, the company cannot then sell the user’s information to a third party without the user’s prior consent | XYZ Company uses various techniques (encryption, firewalls) to prevent data from being compromised through technology or network vulnerabilities |
Privacy requirements in big data
Big data privacy in data generation phase
-
A sockpuppet (a false online identity) is used to hide an individual’s online identity by deception. When multiple sockpuppets are used, the data belonging to one specific individual will be regarded as belonging to different people. The data collector then does not have enough knowledge to relate the different sockpuppets to one individual.
-
Certain security tools, such as Mask Me, can be used to mask an individual’s identity. This is especially useful when the data owner needs to give credit card details during online shopping.
Big data privacy in data storage phase
Approaches to privacy preservation storage on cloud
-
Attribute-based encryption: Access control is based on the identity of a user, with complete access over all resources.
-
Homomorphic encryption: Can be deployed in IBE or ABE scheme settings; updating the ciphertext receiver is possible.
-
Storage path encryption: It secures the storage of big data on clouds.
-
Usage of hybrid clouds: A hybrid cloud is a cloud computing environment that uses a blend of on-premises private cloud and third-party public cloud services, with orchestration between the two platforms.
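As a rough illustration of the homomorphic encryption entry above, the following Python sketch implements a toy version of the Paillier cryptosystem, which is additively homomorphic: multiplying two ciphertexts yields an encryption of the sum of the two plaintexts. Paillier is used here only as a well-known example; it is not an IBE or ABE scheme, and the text does not name a specific construction. The function names and the tiny primes are assumptions chosen for readability and provide no real security.

```python
import math
import random


def keygen(p: int = 1789, q: int = 1861):
    """Build a toy Paillier key pair from two small primes (illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    g = n + 1                      # common simplifying choice of generator
    mu = pow(lam, -1, n)           # modular inverse (Python 3.8+)
    return (n, g), (lam, mu)


def encrypt(pub, m: int) -> int:
    """Encrypt an integer 0 <= m < n under the public key."""
    n, g = pub
    n2 = n * n
    while True:                    # pick a random r that is coprime to n
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2


def decrypt(pub, priv, c: int) -> int:
    """Recover the plaintext from a ciphertext."""
    n, _ = pub
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n


def add_encrypted(pub, c1: int, c2: int) -> int:
    """Combine two ciphertexts so the result decrypts to the sum of the plaintexts."""
    n, _ = pub
    return (c1 * c2) % (n * n)


if __name__ == "__main__":
    pub, priv = keygen()
    total = add_encrypted(pub, encrypt(pub, 20), encrypt(pub, 22))
    print(decrypt(pub, priv, total))
```

In this sketch the party that adds the ciphertexts never sees 20 or 22, which is the property that makes such schemes attractive for computation on data stored in a cloud.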
Integrity verification of big data storage
Big data privacy preserving in data processing
Privacy preserving methods in big data
De-identification
-
Privacy-preserving big data analytics remains challenging because of either issues of flexibility and effectiveness or de-identification risks.
-
De-identification becomes more feasible for privacy-preserving big data analytics if efficient privacy-preserving algorithms are developed to help mitigate the risk of re-identification.
-
Identifier attributes include information that uniquely and directly distinguishes individuals, such as full name, driver’s license number, or social security number.
-
Quasi-identifier attributes are a set of information (for example, gender, age, date of birth, zip code) that can be combined with other external data in order to re-identify individuals.
-
Sensitive attributes are private and personal information, for example illness or salary.
-
Insensitive attributes are general, innocuous information.
-
Equivalence classes are sets of all records that share the same values on the quasi-identifiers.
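The attribute categories above can be made concrete with a small Python sketch. It assumes a hypothetical data set whose column names, role assignments, and records are invented for illustration; none of them come from the text. Identifier attributes are dropped, and the remaining records are grouped into equivalence classes by their quasi-identifier values.

```python
from collections import defaultdict

# Hypothetical role assignment for each attribute (an assumption, not from the text).
ROLES = {
    "name": "identifier",
    "age": "quasi",
    "gender": "quasi",
    "zip_code": "quasi",
    "disease": "sensitive",
}

# Hypothetical records used only to exercise the functions below.
RECORDS = [
    {"name": "P1", "age": 29, "gender": "F", "zip_code": "600001", "disease": "Flu"},
    {"name": "P2", "age": 29, "gender": "F", "zip_code": "600001", "disease": "Asthma"},
    {"name": "P3", "age": 31, "gender": "M", "zip_code": "560001", "disease": "Flu"},
]


def drop_identifiers(records, roles):
    """Remove every attribute whose role is 'identifier'."""
    return [
        {key: value for key, value in record.items() if roles[key] != "identifier"}
        for record in records
    ]


def equivalence_classes(records, roles):
    """Group records that agree on all quasi-identifier attributes."""
    quasi = sorted(key for key, role in roles.items() if role == "quasi")
    classes = defaultdict(list)
    for record in records:
        classes[tuple(record[attr] for attr in quasi)].append(record)
    return dict(classes)


if __name__ == "__main__":
    for key, members in equivalence_classes(drop_identifiers(RECORDS, ROLES), ROLES).items():
        print(key, [m["disease"] for m in members])
```

A class that contains a single record, such as the third record here, is the case that generalization or suppression is meant to eliminate.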
K-anonymity
Name | Age | Gender | State of domicile | Religion | Disease |
---|---|---|---|---|---|
Ramya | 29 | Female | Tamil Nadu | Hindu | Cancer |
Yamini | 24 | Female | Andhra Pradesh | Hindu | Viral infection |
Salini | 28 | Female | Tamil Nadu | Muslim | TB |
Sunny | 27 | Male | Karnataka | Parsi | No illness |
Joshna | 24 | Female | Andhra Pradesh | Christian | Heart-related |
Badri | 23 | Male | Karnataka | Buddhist | TB |
Ramu | 19 | Male | Andhra Pradesh | Hindu | Cancer |
Kishor | 29 | Male | Karnataka | Hindu | Heart-related |
John | 17 | Male | Andhra Pradesh | Christian | Heart-related |
Jhonny | 19 | Male | Andhra Pradesh | Christian | Viral infection |
Name | Age | Gender | State of domicile | Religion | Disease |
---|---|---|---|---|---|
* | 20 < Age ≤ 30 | Female | Tamil Nadu | * | Cancer |
* | 20 < Age ≤ 30 | Female | Andhra Pradesh | * | Viral infection |
* | 20 < Age ≤ 30 | Female | Tamil Nadu | * | TB |
* | 20 < Age ≤ 30 | Male | Karnataka | * | No illness |
* | 20 < Age ≤ 30 | Female | Andhra Pradesh | * | Heart-related |
* | 20 < Age ≤ 30 | Male | Karnataka | * | TB |
* | Age ≤ 20 | Male | Andhra Pradesh | * | Cancer |
* | 20 < Age ≤ 30 | Male | Karnataka | * | Heart-related |
* | Age ≤ 20 | Male | Andhra Pradesh | * | Heart-related |
* | Age ≤ 20 | Male | Andhra Pradesh | * | Viral infection |
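The step from the first table to the second can be sketched in Python as follows. The sketch assumes that age, gender, and state of domicile are the quasi-identifiers, that name and religion are suppressed, and that age is generalized into the two bands visible in the second table; the function names and the extra "Age > 30" band are assumptions added for completeness. The value of k is then the size of the smallest equivalence class.

```python
from collections import Counter

# Rows of the original table: name, age, gender, state, religion, disease.
ORIGINAL = [
    ("Ramya", 29, "Female", "Tamil Nadu", "Hindu", "Cancer"),
    ("Yamini", 24, "Female", "Andhra Pradesh", "Hindu", "Viral infection"),
    ("Salini", 28, "Female", "Tamil Nadu", "Muslim", "TB"),
    ("Sunny", 27, "Male", "Karnataka", "Parsi", "No illness"),
    ("Joshna", 24, "Female", "Andhra Pradesh", "Christian", "Heart-related"),
    ("Badri", 23, "Male", "Karnataka", "Buddhist", "TB"),
    ("Ramu", 19, "Male", "Andhra Pradesh", "Hindu", "Cancer"),
    ("Kishor", 29, "Male", "Karnataka", "Hindu", "Heart-related"),
    ("John", 17, "Male", "Andhra Pradesh", "Christian", "Heart-related"),
    ("Jhonny", 19, "Male", "Andhra Pradesh", "Christian", "Viral infection"),
]

QUASI_COLUMNS = (1, 2, 3)  # age, gender, state (assumed quasi-identifiers)


def generalize_age(age: int) -> str:
    """Map an exact age to the bands used in the anonymized table."""
    if age <= 20:
        return "Age ≤ 20"
    if age <= 30:
        return "20 < Age ≤ 30"
    return "Age > 30"  # not needed for this data; added as an assumption


def anonymize(row):
    """Suppress name and religion, generalize age, keep the other columns."""
    _name, age, gender, state, _religion, disease = row
    return ("*", generalize_age(age), gender, state, "*", disease)


def k_of(rows, quasi_columns=QUASI_COLUMNS) -> int:
    """Return the size of the smallest equivalence class."""
    counts = Counter(tuple(row[i] for i in quasi_columns) for row in rows)
    return min(counts.values())


if __name__ == "__main__":
    anonymized = [anonymize(row) for row in ORIGINAL]
    print("k before:", k_of(ORIGINAL))
    print("k after:", k_of(anonymized))
```

Under these assumptions the original table contains records that are unique on the quasi-identifiers, while every record in the generalized table shares its quasi-identifier values with at least one other record.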
L-diversity
T-closeness
Comparative analysis of de-identification privacy methods
S.No | Privacy measure | Definitions | Limitations | Computational complexity |
---|---|---|---|---|
1 | K-anonymity | It is a framework for constructing and evaluating algorithms and systems that release information such that released information limits what can be revealed about the properties of entities that are to be protected | Homogeneity attack, background knowledge attack | |
2 | L-diversity | An equivalence class is said to have L-diversity if there are at least L “well-represented” values for the sensitive attribute. A table is said to have L-diversity if every equivalence class of the table has L-diversity | L-diversity may be difficult and unnecessary to achieve, and it is insufficient to prevent attribute disclosure | O(n²/k) |
3 | T-closeness | An equivalence class is said to have T-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t. A table is said to have T-closeness if all equivalence classes have T-closeness | T-closeness requires that the distribution of a sensitive attribute in any equivalence class is close to the distribution of the sensitive attribute in the overall table | 2^(O(n)O(m)) [36] |
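The L-diversity and T-closeness definitions in the table can be checked on the generalized table from the K-anonymity section with a short Python sketch. Two simplifying assumptions are made: "well-represented" is read as "distinct" (the simplest variant, often called distinct L-diversity), and the distance between distributions is the total variation distance, which is what the Earth Mover's Distance reduces to for a categorical attribute when all values are treated as equally far apart. The function names are hypothetical.

```python
from collections import Counter, defaultdict

# (quasi-identifier values, sensitive value) pairs from the generalized table.
ROWS = [
    (("20 < Age ≤ 30", "Female", "Tamil Nadu"), "Cancer"),
    (("20 < Age ≤ 30", "Female", "Andhra Pradesh"), "Viral infection"),
    (("20 < Age ≤ 30", "Female", "Tamil Nadu"), "TB"),
    (("20 < Age ≤ 30", "Male", "Karnataka"), "No illness"),
    (("20 < Age ≤ 30", "Female", "Andhra Pradesh"), "Heart-related"),
    (("20 < Age ≤ 30", "Male", "Karnataka"), "TB"),
    (("Age ≤ 20", "Male", "Andhra Pradesh"), "Cancer"),
    (("20 < Age ≤ 30", "Male", "Karnataka"), "Heart-related"),
    (("Age ≤ 20", "Male", "Andhra Pradesh"), "Heart-related"),
    (("Age ≤ 20", "Male", "Andhra Pradesh"), "Viral infection"),
]


def group_by_class(rows):
    """Collect the sensitive values of each equivalence class."""
    classes = defaultdict(list)
    for quasi, sensitive in rows:
        classes[quasi].append(sensitive)
    return classes


def distribution(values):
    """Relative frequency of each value in a list."""
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}


def distinct_l(rows) -> int:
    """Largest L for which every class has at least L distinct sensitive values."""
    return min(len(set(values)) for values in group_by_class(rows).values())


def smallest_t(rows) -> float:
    """Smallest t for which the table has t-closeness under total variation distance."""
    overall = distribution([sensitive for _, sensitive in rows])
    worst = 0.0
    for values in group_by_class(rows).values():
        local = distribution(values)
        distance = 0.5 * sum(abs(local.get(v, 0.0) - p) for v, p in overall.items())
        worst = max(worst, distance)
    return worst


if __name__ == "__main__":
    print("distinct L:", distinct_l(ROWS))
    print("smallest t:", round(smallest_t(ROWS), 3))
```

Small classes tend to give a low L and a large t at the same time, which illustrates why both measures are harder to satisfy than K-anonymity alone.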
HybrEx
Privacy-preserving aggregation
Operations over encrypted data
Recent techniques of privacy preserving in big data
Differential privacy
Identity based anonymization
Privacy preserving Apriori algorithm in MapReduce framework
Hiding a needle in a haystack [46]
Privacy-preserving big data publishing
Improvement of k-anonymity and l-diversity privacy model
MapReduce-based anonymization
K-anonymity with MapReduce
MapReduce-based l-diversity
Fast anonymization of big data streams
Proactive heuristic
Privacy and security aspects healthcare in big data
Data governance
Real-time security analytics
Privacy-preserving analytics
Data quality
Data sharing and privacy
Relying on predictive models
Variety of methods and complex mathematics
Summary on recent approaches used in big data privacy
S.No | Research paper | Publication and year | Focus | Limitations |
---|---|---|---|---|
1 | “Toward Efficient and Privacy Preserving Computing in Big Data Era” [38] | IEEE Network July/Aug 2014 | Introduced an efficient and privacy-preserving cosine similarity computing protocol | Needs significant research effort to address the unique privacy issues of some specific big data analytics |
2 | “Hiding a needle in a Haystack: privacy preserving Apriori algorithm in map reduce framework” [46] | ACM Nov 7, 2014 | Proposed a privacy-preserving data mining technique in Hadoop that addresses privacy violations without utility degradation | Execution time of the proposed technique is affected by noise size |
3 | “Making big data, privacy, and anonymization work together in the enterprise: experiences and issues” [41] | IEEE International Congress 2014 | Discusses experiences and issues encountered when successfully combining anonymization, privacy protection, and big data techniques to analyse usage data while protecting the identities of users | Uses the K-anonymity technique, which is vulnerable to correlation attacks |
4 | “Microsoft Differential Privacy for Everyone” [40] | Microsoft Research 2015 | Discussed and suggested how an existing approach, “differential privacy”, is suitable for big data | This method depends entirely on the curator’s calculation of the amount of noise, so if the curator is compromised the whole system fails |
5 | “A scalable two-phase top-down specialization approach for data anonymization using MapReduce on cloud” [69] | IEEE transactions on parallel and distributed systems 2014 | Proposed a scalable two-phase top-down specialization (TDS) approach to anonymize large-scale data sets using the MapReduce framework on cloud | Uses an anonymization technique that is vulnerable to correlation attacks |
6 | “HireSome-II: towards privacy-aware cross-cloud service composition for big data applications” [74] | IEEE transactions on parallel and distributed systems 2014 | Proposed a privacy-aware cross-cloud service composition method, named HireSome-II (History record-based Service optimization method) based on its previous basic version HireSome-I | |
7 | Protection of big data privacy [7] | IEEE transactions 2016 | Presented various privacy issues in big data applications | Customer segmentation and profiling can easily lead to discrimination based on age, gender, ethnic background, health condition, social background, and so on |
8 | Fast anonymization of big data streams [55] | ACM August, 2014 | Proposed an anonymization algorithm (FAST) to speed up anonymization of big data streams | Further research required to design and implement FAST in a distributed cloud-based framework in order to gain cloud computation power and achieve high scalability |
9 | Privacy preserving Ciphertext multi-sharing control for big data storage [75] | IEEE Transactions on Information Forensics and Security 2015 | Proposed a privacy-preserving ciphertext multi-sharing mechanism | The proxy can create delegation rights between two parties that have never agreed upon the delegation process |
10 | Privacy-preserving machine learning algorithms for big data systems [76] | IEEE international conference on distributed computing systems 2015 | Proposed a novel framework to achieve privacy-preserving machine learning where the training data are distributed and each shared data portion is of large volume | Not able to achieve distributed feature selection |
11 | Privacy-preserving big data publishing [50] | ACM June–July 2015 | Proposed an approach towards privacy-preserving data mining of very massive data sets using MapReduce | Generalization cannot handle high-dimensional data and reduces data utility. Perturbation also reduces the utility of data |
12 | Proximity-aware local-recoding anonymization with map reduce for scalable big data privacy preservation in cloud [70] | IEEE Transactions on Computers August 2015 | Models the problem of big data local recoding against proximity privacy breaches as a proximity-aware clustering problem, and proposes a scalable two-phase clustering approach accordingly | Further research is needed to integrate the approach with Apache Mahout to achieve highly scalable privacy-preserving big data mining or analytics |
13 | Deduplication on encrypted big data in cloud [77] | IEEE transactions on big data 2016 | Proposed a practical scheme to manage encrypted big data in the cloud with deduplication based on ownership challenge and Proxy Re-Encryption (PRE) | Convergent encryption (CE) is subject to an inherent security limitation, namely susceptibility to offline brute-force dictionary attacks |
14 | Security and privacy for storage and computation in cloud computing [22] | International Journal of Science and Research (IJSR) ISSN (Online): 2319–7064 | Proposed methodology provides data confidentiality, secure data sharing without Re-encryption, access control for malicious insiders, and forward and backward access control | Limiting the trust level in the cryptographic server (CS) |
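Row 4 of the table notes that differential privacy depends on the curator calculating the amount of noise. A minimal Python sketch of one standard way to do this, the Laplace mechanism for a counting query, is given below. The function names, the example data, and the choice of epsilon are assumptions for illustration; the sketch is not the method of any paper listed in the table.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Answer a counting query with noise calibrated to epsilon."""
    true_count = sum(1 for record in records if predicate(record))
    sensitivity = 1.0  # adding or removing one record changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    # Disease column of the example table in the K-anonymity section.
    diseases = [
        "Cancer", "Viral infection", "TB", "No illness", "Heart-related",
        "TB", "Cancer", "Heart-related", "Heart-related", "Viral infection",
    ]
    rng = random.Random(2024)  # fixed seed only to make the demo repeatable
    answer = private_count(diseases, lambda d: d == "Heart-related", 0.5, rng)
    print(round(answer, 2))
```

A smaller epsilon gives a larger noise scale and therefore stronger privacy but less accurate answers. Because the curator sees the true count before adding noise, the trust concern stated in the table applies directly to this design.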