Introduction
Related works
Key findings and recommendations from related works include:
- 325 large breaches of protected health information (PHI), compromising 16,612,985 individual patient records.
- 3,620,000 breached patient records in the year's single largest incident.
- 40% of large breach incidents involved unauthorized access or disclosure.
- Policy attention should focus more on the actual uses of big data and less on its collection and analysis; policies that target only collection and analysis are unlikely to yield effective privacy strategies or to scale over time.
- Policy concerning privacy protection should address the purpose rather than prescribe the mechanism.
- Research is needed on technologies that help protect privacy, on social mechanisms that influence privacy-preserving behavior, and on legal options that are robust to changes in technology and strike an appropriate balance among economic opportunity, national priorities, and privacy protection.
- Education and training opportunities concerning privacy protection should be increased, including career paths for professionals; programs that provide education leading to privacy expertise are essential and need encouragement.
- Privacy protections should be extended to non-US citizens, as privacy is a worldwide value that should be reflected in how the federal government handles personally identifiable information from non-US citizens [16].
Privacy and security concerns in big data
| Security | Privacy |
|---|---|
| Security is the "confidentiality, integrity and availability" of data | Privacy is the appropriate use of a user's information |
| Various techniques such as encryption and firewalls are used to prevent data compromise arising from technology or vulnerabilities in an organization's network | An organization cannot sell its patients'/users' information to a third party without the user's prior consent |
| It may provide for confidentiality or protect an enterprise or agency | It concerns the patient's right to safeguard their information from any other parties |
| Security offers the ability to be confident that decisions are respected | Privacy is the ability of an individual to decide what information goes where |
Security of big healthcare data
A. Big data security lifecycle
- Data collection phase This is the obvious first step. It involves collecting data from different sources in various formats. From a security perspective, securing big health data is a necessary requirement from the first phase of the lifecycle. It is therefore important to gather data from trusted sources, preserve patient privacy (there must be no attempt to identify individual patients in the database), and make sure that this phase is secured and protected. Mature security measures must be used to ensure that all data and information systems are protected from unauthorized access, disclosure, modification, duplication, diversion, destruction, loss, misuse, or theft (a minimal encryption-at-ingestion sketch follows this list).
- Data transformation phase Once the data is available, the first step is to filter and classify it based on its structure and apply any transformations needed for meaningful analysis. More broadly, data filtering, enrichment, and transformation are needed to improve the quality of the data ahead of the analytics or modeling phase and to remove or appropriately handle noise, outliers, missing values, duplicate data instances, and so on (see the cleaning sketch after this list). At the same time, the collected data may contain sensitive information, which makes it extremely important to take sufficient precautions during data transformation and storage. To guarantee the safety of the collected data, it should remain isolated and protected by maintaining access-level security and access control (utilizing an extensive list of directories and databases as a central repository for user credentials, application logon templates, password policies and client settings) [22], and by defining security measures such as data anonymization, permutation, and data partitioning.
- Data modeling phase Once the data has been collected, transformed, and stored in secured storage solutions, data processing and analysis are performed to generate useful knowledge. In this phase, data mining techniques such as clustering, classification, and association rule mining can be employed for feature selection and predictive modeling, and ensemble learning techniques can improve the accuracy and robustness of the final model (a minimal modeling sketch follows this list). At the same time, it is crucial to provide a secure processing environment: the same powerful data mining algorithms can be turned against the data to extract sensitive information. Therefore, the data mining process, and the network components in general, must be configured and protected against mining-based attacks and any security breach that may occur, and only authorized staff should work in this phase. This process helps eliminate some vulnerabilities and mitigates others to a lower risk level.
- Knowledge creation phase Finally, the modeling phase yields new information and valuable knowledge to be used by decision makers. This derived knowledge is itself sensitive, especially in a competitive environment; healthcare organizations must ensure that their sensitive data (e.g., personal patient data) are not publicly released. Accordingly, security compliance and verification are a primary objective of this phase.
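As referenced in the collection phase above, records should be protected from the moment they are gathered. Below is a minimal sketch of encrypting records at ingestion; it assumes the third-party Python `cryptography` package, and the record format is purely illustrative.

```python
from cryptography.fernet import Fernet

# Symmetric key; in practice it would live in a key-management service,
# never alongside the data it protects.
key = Fernet.generate_key()
cipher = Fernet(key)

def ingest_record(record: str) -> bytes:
    """Encrypt a raw record before it is written to storage."""
    return cipher.encrypt(record.encode("utf-8"))

token = ingest_record("patient_id=123;diagnosis=pneumonia")
assert cipher.decrypt(token).decode("utf-8").startswith("patient_id")
```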
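For the transformation phase, the following sketch (using pandas; the column names are hypothetical) illustrates the filtering steps mentioned above: removing duplicates, imputing missing values, and dropping implausible outliers.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [34, 34, None, 29, 130],  # 130 is an implausible outlier
    "zip": ["20502", "20502", "20042", "20056", "20303"],
    "diagnosis": ["flu", "flu", "cancer", "flu", "pneumonia"],
})

df = df.drop_duplicates()                         # duplicate instances
df["age"] = df["age"].fillna(df["age"].median())  # missing values
df = df[df["age"].between(0, 120)]                # implausible outliers
```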
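For the modeling phase, here is a minimal sketch of fitting an ensemble classifier on already de-identified features, assuming scikit-learn; the features and labels are illustrative stand-ins, not real patient data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# De-identified features (e.g., generalized age, region code) and labels.
X = [[30, 0], [35, 1], [62, 0], [64, 1], [29, 0], [61, 1]]
y = [0, 0, 1, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on held-out records
```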
B. Technologies in use
Privacy of big healthcare data
Data protection laws
| Country | Law | Salient features |
|---|---|---|
| USA | HIPAA Act; Patient Safety and Quality Improvement Act (PSQIA); HITECH Act | Requires the establishment of national standards for electronic healthcare transactions; gives the right to privacy to individuals from age 12 through 18; requires a signed disclosure from the affected individual before giving out any information on provided healthcare to anyone, including parents; Patient Safety Work Product must not be disclosed [63]; an individual violating the confidentiality provisions is subject to a civil penalty; protects the security and privacy of electronic health information |
| EU | Data Protection Directive | Protects people's fundamental rights and freedoms, in particular their right to privacy with respect to the processing of personal data [64] |
| Canada | Personal Information Protection and Electronic Documents Act (PIPEDA) | Individuals are given the right to know the reasons for the collection or use of their personal information, and organizations are required to protect this information in a reasonable and secure way [65] |
| UK | Data Protection Act (DPA) | Provides a way for individuals to control information about themselves; personal data shall not be transferred to a country or territory outside the European Economic Area unless that country or territory ensures an adequate level of protection for the rights and freedoms of data subjects |
| Morocco | The 09-08 act of 18 February 2009 | Protects individual privacy through the establishment of the CNDP authority, limiting the use of personal and sensitive data by data controllers in any data processing operation [66] |
| Russia | Russian Federal Law on Personal Data | Requires data operators to take "all the necessary organizational and technical measures required for protecting personal data against unlawful or accidental access" |
| India | IT Act and IT (Amendment) Act | Requires the implementation of reasonable security practices for sensitive personal data or information; provides for compensation to persons affected by wrongful loss or wrongful gain; provides for imprisonment and/or a fine for a person who causes wrongful loss or wrongful gain by disclosing personal information of another person while providing services under a lawful contract |
| Brazil | Constitution | The intimacy, private life, honor and image of the people are inviolable, with an assured right to indemnification for material or moral damage resulting from their violation |
| Angola | Data Protection Law (Law no. 22/11 of 17 June) | With respect to sensitive data, collection and processing are only allowed where there is a legal provision permitting such processing and prior authorization from the APD is obtained |
Privacy preserving methods in big data
A. De-identification
- k-anonymity In this technique, the higher the value of k, the lower the probability of re-identification. However, k-anonymization may distort the data and hence cause greater information loss, and excessive anonymization can make the disclosed data less useful to recipients because some analyses become impossible or produce biased and erroneous results. If the quasi-identifiers in the data are linked with other publicly available data to identify individuals, the sensitive attribute (such as disease) is revealed (a sketch for checking k over the quasi-identifiers appears after this list). Table 3 is a non-anonymized database consisting of the patient records of a fictitious hospital in Casablanca.

Table 3 A non-anonymized database comprising the patient records

| Name | Birth | Sex | ZIP code | Religion | Disease |
|---|---|---|---|---|---|
| Yasmine | 12/03/1962 | Female | 20502 | Muslim | Heart-related |
| Khalid | 21/11/1962 | Male | 20042 | Muslim | Cancer |
| John | 01/08/1964 | Male | 20056 | Christian | Viral infection |
| Aicha | 30/01/1962 | Female | 29004 | Muslim | Diabetes mellitus |
| Abraham | 15/09/1964 | Male | 20303 | Jewish | Pneumonia |

After anonymization, names and religion are suppressed, and birth dates and ZIP codes are generalized:
Name | Birth | Sex | ZIP code | Religion | Disease |
---|---|---|---|---|---|
* | 1962 | Female | 20000 | * | Heart-related |
* | 1962 | Male | 20000 | * | Cancer |
* | 1964 | Male | 20000 | * | Viral infection |
* | 1962 | Female | 20000 | * | Diabetes mellitus |
* | 1964 | Male | 20000 | * | Pneumonia |
- l-diversity This is a form of group-based anonymization used to safeguard privacy in data sets by diminishing the granularity of the data representation. The model (with Distinct, Entropy, and Recursive variants) [46, 47, 51] extends k-anonymity, which uses generalization and suppression so that any given record maps onto at least k different records in the data. l-diversity addresses a weakness of the k-anonymity model: protecting identities to the level of k individuals is not the same as protecting the corresponding sensitive values, which may all be identical within a group. The drawback is that the method depends on the range of the sensitive attribute: if we want to make the data l-diverse but the sensitive attribute has too few distinct values, fictitious records must be inserted, which improves security but can cause problems during analysis. l-diversity also remains subject to skewness and similarity attacks [51] and thus cannot prevent attribute disclosure (a sketch for computing l per equivalence class appears after this list).
- t-closeness This is a further refinement of l-diversity group-based anonymization. The t-closeness model (with equal or hierarchical distance) [46, 50] extends l-diversity by treating the values of an attribute distinctly, taking into account the distribution of data values for that attribute. Its main advantage is that it prevents attribute disclosure; its drawback is that as the size and variety of the data increase, the odds of re-identification increase too (a sketch for computing t appears after this list).
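As referenced in the k-anonymity item above, here is a minimal sketch for measuring k on the anonymized table, assuming pandas; the column names follow Table 3. Note that the lone (1962, Male) record makes the table above only 1-anonymous.

```python
import pandas as pd

anonymized = pd.DataFrame(
    [(1962, "Female", "20000", "Heart-related"),
     (1962, "Male", "20000", "Cancer"),
     (1964, "Male", "20000", "Viral infection"),
     (1962, "Female", "20000", "Diabetes mellitus"),
     (1964, "Male", "20000", "Pneumonia")],
    columns=["birth_year", "sex", "zip", "disease"],
)

def k_anonymity(df, quasi_identifiers):
    """k is the size of the smallest equivalence class over the QIs."""
    return df.groupby(quasi_identifiers).size().min()

print(k_anonymity(anonymized, ["birth_year", "sex", "zip"]))  # -> 1
```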
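For l-diversity, the same grouping reports the smallest number of distinct sensitive values in any equivalence class; a sketch on the same illustrative frame:

```python
import pandas as pd

# Same illustrative anonymized records as in the k-anonymity sketch.
anonymized = pd.DataFrame(
    [(1962, "Female", "20000", "Heart-related"),
     (1962, "Male", "20000", "Cancer"),
     (1964, "Male", "20000", "Viral infection"),
     (1962, "Female", "20000", "Diabetes mellitus"),
     (1964, "Male", "20000", "Pneumonia")],
    columns=["birth_year", "sex", "zip", "disease"],
)

def l_diversity(df, quasi_identifiers, sensitive):
    """l is the smallest count of distinct sensitive values in any class."""
    return df.groupby(quasi_identifiers)[sensitive].nunique().min()

print(l_diversity(anonymized, ["birth_year", "sex", "zip"], "disease"))  # -> 1
```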
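For t-closeness with the equal-distance ground metric, the Earth Mover's Distance between a class's sensitive-value distribution and the global distribution reduces to half the L1 distance; a minimal sketch, again on the same illustrative frame:

```python
import pandas as pd

anonymized = pd.DataFrame(
    [(1962, "Female", "20000", "Heart-related"),
     (1962, "Male", "20000", "Cancer"),
     (1964, "Male", "20000", "Viral infection"),
     (1962, "Female", "20000", "Diabetes mellitus"),
     (1964, "Male", "20000", "Pneumonia")],
    columns=["birth_year", "sex", "zip", "disease"],
)

def t_closeness(df, quasi_identifiers, sensitive):
    """t is the largest equal-distance EMD (half the L1 distance) between
    a class's sensitive distribution and the global distribution."""
    global_dist = df[sensitive].value_counts(normalize=True)
    worst = 0.0
    for _, group in df.groupby(quasi_identifiers):
        class_dist = group[sensitive].value_counts(normalize=True)
        worst = max(worst,
                    global_dist.sub(class_dist, fill_value=0).abs().sum() / 2)
    return worst

print(t_closeness(anonymized, ["birth_year", "sex", "zip"], "disease"))  # -> 0.8
```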
B. HybrEx
C. Identity based anonymization
Summary on recent approaches used in big data privacy
| Paper | Focus | Limitations |
|---|---|---|
| [56] | Discusses experiences and issues encountered when combining anonymization, privacy protection, and big data techniques to analyze usage data while protecting the identities of users | Still uses the k-anonymity technique, which is vulnerable to correlation attacks |
| [61] | Proposed privacy-preserving data mining techniques in Hadoop, i.e., preventing privacy violations without utility degradation | Its execution time is affected by the noise size |
| [67] | Introduced an efficient and privacy-preserving cosine similarity computing protocol | Significant research effort is still needed to address the unique privacy issues of some specific big data analytics |
| [68] | Discussed and suggested how the existing "differential privacy" approach is suitable for big data | The method depends entirely on the curator's calculation of the amount of noise; if the curator is compromised, the whole system fails |
| [69] | Proposed a scalable two-phase top-down specialization (TDS) approach to anonymize large-scale data sets using the MapReduce framework on the cloud | Uses an anonymization technique that is vulnerable to correlation attacks |
| [70] | Surveyed various privacy issues in big data applications | Customer segmentation and profiling can easily lead to discrimination based on age, gender, ethnic background, health condition, social background, and so on |
| [71] | Proposed an anonymization algorithm (FAST) to speed up the anonymization of big data streams | Further research is required to design and implement FAST in a distributed cloud-based framework in order to gain cloud computation power and achieve high scalability |
| [72] | Proposed a novel framework to achieve privacy-preserving machine learning | When the training data are distributed and each shared data portion is of large volume, it cannot achieve distributed feature selection |
| [73] | Proposed a methodology providing data confidentiality, secure data sharing without re-encryption, access control for malicious insiders, and forward and backward access control | Limited by the trust level placed in the cryptographic server |
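The limitation noted for [68] hinges on how much noise the curator injects. The standard calibration for a counting query is the Laplace mechanism; a minimal sketch using numpy, where the epsilon value and the query are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def laplace_count(true_count: int, epsilon: float) -> float:
    """Release a count with Laplace noise of scale sensitivity/epsilon.
    A counting query changes by at most 1 per individual (sensitivity 1)."""
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Smaller epsilon -> more noise -> stronger privacy but lower utility.
print(laplace_count(325, epsilon=0.5))
```

If the curator's noise generation is compromised (e.g., a fixed or leaked seed), the released counts no longer carry the claimed privacy guarantee, which is exactly the failure mode the table notes.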