1 Introduction
2 Data Privacy for Databases
-
Data-driven or general purpose. In this case, we have no knowledge of the type of analysis to be performed by the third party. This is the usual case in which data is published through a server for future use. It also includes the case in which data is transferred to a data miner or a data scientist for analysis, as we usually do not know which algorithm will be applied to the data. For this purpose, anonymization methods, also known as masking methods, have been developed.
-
Computation-driven or specific purpose. In this case, we know the exact analysis the third party (or third parties) wants to apply to the data. For example, we know that the data scientist wants to find the parameters of a regression model. This can be seen as the computation of a function, or as answering a query over a database without disclosing the database. When a single database is considered and we formulate the problem as answering a query, differential privacy is a suitable privacy model. When multiple databases are considered, the privacy model is based on secure multiparty computation, and cryptographic protocols are used for this purpose.
-
Result-driven. In this case, the analysis (a given data mining algorithm) is also known. The difference with computation-driven approaches is that here we are not worried about the protection of the database per se, but about the protection of some of the outcomes of the algorithm. For example, we know that data scientists will apply association rule mining, and we want to prevent them from inferring that people who buy diapers also buy beer. As in computation-driven analysis, prevention of disclosure for this type of analysis is specific to the given computation producing the specific results. In this case, however, the focus is on the knowledge inferred from the data rather than on the actual data.
-
Masking methods. Methods that, given a database X, transform it into another database \(X'\) of lower quality. Masking methods are usually classified into three categories: perturbative methods, non-perturbative methods, and synthetic data generators. Perturbative methods reduce quality by modifying the data, introducing some kind of error into them. Noise addition and multiplication, microaggregation, and rank swapping are examples of perturbative methods. Non-perturbative methods reduce the quality of the data by making them less detailed (but not erroneous); generalization and suppression are examples. Synthetic data generators replace the original data with data generated from a model that has been extracted from the original database, so the data in \(X'\) are not the original data but artificial data generated from the model.
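As a minimal illustration of the perturbative family, the following Python sketch masks the numerical attributes of a database with additive Gaussian noise. The noise level sigma and the toy data are invented for illustration; this is one simple masking scheme among many, not a prescribed standard.

```python
import numpy as np

def noise_addition(X, sigma=0.1, seed=0):
    """Perturbative masking: add Gaussian noise scaled to each
    attribute's standard deviation (columns of X are attributes)."""
    rng = np.random.default_rng(seed)
    scale = sigma * X.std(axis=0)          # per-attribute noise level
    return X + rng.normal(0.0, scale, size=X.shape)

# Toy numerical database: (age, salary) records.
X = np.array([[30.0, 50000.0], [40.0, 62000.0], [35.0, 58000.0]])
X_masked = noise_addition(X)               # the lower-quality X'
```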
-
Information loss measures. They measure to what extent the transformation of X into \(X'\) reduces the utility of the data, that is, the information that is lost in the process. Information loss measures are typically defined in terms of an analysis f to be performed on the data. Then, given this analysis f and the original and anonymized files X and \(X'\), we define information loss as \(IL(X,X') = \mathrm{divergence}(f(X), f(X'))\), where divergence is a function that evaluates how far apart f(X) and \(f(X')\) are. A distance on the space of f(X) can be used for this purpose. Naturally, we expect \(\mathrm{divergence}(Y,Y)=0\) for all Y. Typical examples of functions f include statistics (means, variances, covariances, regression coefficients), as well as machine learning algorithms (clustering and classification algorithms). Specific measures for some types of databases have also been considered in the literature (e.g., measures on graphs).
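The definition above translates directly into code. A minimal sketch follows, taking f to be the vector of attribute means and the Euclidean distance as the divergence; both choices are illustrative assumptions, and any statistic or model fit could play the role of f.

```python
import numpy as np

def information_loss(X, X_masked, f):
    """IL(X, X') = divergence(f(X), f(X')), with the Euclidean distance
    as divergence; by construction IL(X, X) = 0 for any f."""
    return float(np.linalg.norm(f(X) - f(X_masked)))

# f = vector of attribute means; covariances, regression coefficients,
# or cluster centroids could play the same role.
f_means = lambda X: X.mean(axis=0)

X = np.array([[30.0, 50.0], [40.0, 62.0], [35.0, 58.0]])
X_masked = X + np.random.default_rng(0).normal(0.0, 1.0, X.shape)
print(information_loss(X, X_masked, f_means))
```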
-
Privacy models and disclosure risk measures. They assess to what extent anonymized (masked) data still contain sensitive information that can be used to compromise the privacy of the individuals in the database.
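One common way to make such a risk measure concrete is distance-based record linkage: an intruder links masked records back to the original ones, and the fraction of correct links gives an empirical risk estimate. A minimal sketch under that assumption (the toy data are invented):

```python
import numpy as np

def reidentification_risk(X, X_masked):
    """Distance-based record linkage: link every original record to its
    nearest masked record; the fraction of correct links is an empirical
    disclosure risk estimate (1.0 means everyone is re-identified)."""
    d = np.linalg.norm(X[:, None, :] - X_masked[None, :, :], axis=2)
    return float((d.argmin(axis=1) == np.arange(len(X))).mean())

X = np.array([[30.0, 50.0], [40.0, 62.0], [35.0, 58.0]])
X_masked = X + np.random.default_rng(1).normal(0.0, 2.0, X.shape)
print(reidentification_risk(X, X_masked))
```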
3 Data Privacy for Big Data
3.1 Big Data
-
Volume. Databases include huge amounts of data. For example, Facebook generated 4 new petabytes of data per day as of October 2014 (see [21]).
-
Velocity. Data flows into the databases in real time: real-time streams of data arriving from diverse sources, either sensors or the internet (e.g., e-commerce or social media).
-
Variety. Data is no longer of a single type (or a few simple types). Databases include data from a vast range of systems and sensors in different formats and datatypes. This may include unstructured text, logs, and videos.
3.2 Moving Privacy to Big Data: Disclosure Risk
-
Lack of control and transparency. It is increasingly difficult to know who holds our data. Various organizations can have information about us without our knowledge. Information is gathered from sensors and cameras, obtained by screening posts in social networks, and derived from the analysis of web searches. Note also the case of tracking cookies. Finally, there are data brokers that gather as much information as possible about citizens.
-
Linkability. It is usual in big data to link databases in order to increase the amount and quality of the information. Linking databases increases the risk of identification, as there is more information on each individual. Note that the more information we have on individuals, the easier it is to re-identify them and the more difficult it is to protect them.
-
Inference and data reusability. There exist effective inference algorithms that infer sensitive information (e.g., sexual orientation, political and religious affiliation [12]). One of the main goals of big data analytics is to use existing data for new purposes, which increases this inference ability. As a side effect, data is never deleted but kept for future use.
3.3 Open Research Questions for Big Data
-
Issue #1. Technology should help people to know what others know and can infer about them. As stated above, effective machine learning and data mining algorithms can infer sensitive information, and some of these models use data that do not seem sensitive a priori. It is not enough to protect sensitive information if we do not also protect the data that permit sensitive information to be inferred. Technology should help people become aware of this, e.g., by providing tools in social networks that make these possible inferences visible.
-
Issue #2. Databases should be anonymized/masked at origin. Machine learning algorithms for masked data are required. On the one hand, there exist masking methods that are effective in the sense that they achieve low information loss (with low disclosure risk). On the other hand, there are machine learning and data mining algorithms that are resistant to errors. In the same direction, not all data is equally important for machine learning algorithms, and some data mining algorithms for big data do not use all the data but only a sample. Because of that, it is meaningful to consider privacy-by-design machine learning algorithms, that is, machine learning methods that are appropriate for data that has already been protected. Preprocessing methods for machine learning (dimensionality reduction, sampling, etc.) should be combined with and integrated into masking methods. Masking methods can be seen as methods that introduce noise and reduce quality, but they can also be seen as methods for dimensionality reduction. See, e.g., the case of microaggregation and, in general, methods to achieve k-anonymity: they reduce the number of (different) records in a database by means of generalization or clustering (i.e., building centroids). These generalized records or centroids can be seen as more consolidated (error-free) data.
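To make the connection between microaggregation, k-anonymity, and consolidated centroids concrete, here is a minimal univariate sketch. It uses a simple fixed-size partition of the sorted values, not an optimized algorithm such as MDAV; the toy ages are invented.

```python
import numpy as np

def microaggregation_1d(values, k=3):
    """Univariate microaggregation: sort the values, partition them into
    groups of at least k, and replace each value by its group centroid.
    The output is k-anonymous and contains far fewer distinct records."""
    order = np.argsort(values)
    out = np.empty(len(values), dtype=float)
    start, n = 0, len(values)
    while start < n:
        end = n if n - start < 2 * k else start + k  # last group absorbs rest
        group = order[start:end]
        out[group] = values[group].mean()
        start = end
    return out

ages = np.array([21, 23, 22, 45, 47, 44, 46, 80])
print(microaggregation_1d(ages, k=3))  # two centroids instead of 8 values
```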
-
Issue #3. Anonymization needs to provide controlled linkability. We have noted above that linkability is one of the basic components of big data. Companies want to combine databases to increase the information about individuals (enlarging the set of variables/attributes available on them). If databases are anonymized at origin, we need ways to ensure that these databases can still be linked in order to fulfill big data requirements. k-Anonymity allows linkability at the group level. Algorithms for controlled linkability are needed, as well as methods that can exploit, e.g., linkability at the group level.
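The following toy sketch illustrates what linkability at the group level can look like: two databases anonymized at origin share only generalized quasi-identifiers (the (age range, zip prefix) keys below are invented), so they can be joined per group, never record to record, and only group-level aggregates can be derived.

```python
from collections import defaultdict

# Two databases anonymized at origin share only generalized
# quasi-identifiers, here a hypothetical (age range, zip prefix) key.
salaries = {("30-39", "080*"): [52000, 61000], ("40-49", "081*"): [70000]}
visits = {("30-39", "080*"): [2, 5], ("40-49", "081*"): [1]}

# Linking is possible per group, so only group-level aggregates
# can be derived from the combined data.
linked = defaultdict(dict)
for key in salaries.keys() & visits.keys():
    linked[key]["mean_salary"] = sum(salaries[key]) / len(salaries[key])
    linked[key]["mean_visits"] = sum(visits[key]) / len(visits[key])
print(dict(linked))
```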
-
Issue #4. Privacy models need to be composable. Given several data sets, each with a given privacy guarantee, their combination also needs to satisfy the privacy requirements. There are results on the composability of differential privacy; see, e.g., [15].
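As a concrete instance, sequential composition in differential privacy states that answering two queries on the same data with budgets \(\epsilon _1\) and \(\epsilon _2\) is \((\epsilon _1 + \epsilon _2)\)-differentially private overall. A minimal sketch with the Laplace mechanism; the query values, sensitivities, and budgets below are invented for illustration.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release a numeric query answer with epsilon-differential privacy."""
    return true_value + rng.laplace(0.0, sensitivity / epsilon)

rng = np.random.default_rng(0)
eps1, eps2 = 0.5, 0.5
# Two releases from the same database...
count = laplace_mechanism(1000, sensitivity=1, epsilon=eps1, rng=rng)
mean_age = laplace_mechanism(38.5, sensitivity=0.1, epsilon=eps2, rng=rng)
# ...jointly satisfy (eps1 + eps2)-differential privacy.
total_budget = eps1 + eps2
```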
-
Issue #5. User privacy should be in place: decentralized anonymity. User privacy [17] refers to the situation in which users take an active role in ensuring their own privacy. For this purpose, there are methods to protect the identity of the users as well as to protect their data; for example, there are methods for user privacy in communications and in information retrieval. While the research questions mentioned above are to be implemented and used by data holders, user privacy provides users with tools to be used by themselves. User privacy permits data to be anonymized before transmission to data collectors (or to the service provider), so there is no need to trust the data collector. Local anonymization and collaborative anonymization are keywords for tools for user privacy.
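A classical example of local anonymization is randomized response: each user perturbs their own answer before sending it, and the collector can still estimate aggregate statistics without ever seeing a true individual value. A minimal sketch; the truth probability p is an illustrative parameter.

```python
import random

def randomized_response(answer: bool, p: float = 0.75) -> bool:
    """Report the true answer with probability p, otherwise a fair coin.
    The collector never learns any individual's true answer."""
    return answer if random.random() < p else random.random() < 0.5

# Collector side: E[report] = p*pi + (1-p)/2, so the true proportion pi
# of 'True' answers can still be estimated from the noisy reports.
p = 0.75
reports = [randomized_response(True, p) for _ in range(10000)]
pi_hat = (sum(reports) / len(reports) - (1 - p) / 2) / p
print(pi_hat)  # close to 1.0, without trusting the collector
```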
-
Issue #6. Methods for big data. Big data has particularities (the three or more Vs discussed above) that have to be taken into account when developing methods for ensuring privacy. These particularities matter both for respondent and holder privacy (i.e., methods applied by data holders) and for user privacy. We can distinguish three types of situations.
-
Issue #6.1. Large volumes of data. Efficient algorithms are required for high-dimensional data: algorithms for producing masked databases, but also for computing information loss and disclosure risk measures. Some masking methods have already been developed with efficiency in mind for standard databases (e.g., some algorithms for microaggregation), for graphs and social networks (e.g., based on random noise on edges, on generalization, and on microaggregation), and for location privacy. New methods are needed.
-
Issue #6.2. Dynamic data. When data change over time, we may need to publish several versions of the database. In this case, specific data masking algorithms are required. Note that independent applications of k-anonymity algorithms to the same database can cause disclosure; the same applies when the database has changed between two applications of the algorithm.
-
Issue #6.3. Streaming data. Data is received continuously and should be processed as soon as possible, because we cannot hold it and process it later. In this case, difficulties arise because at any given time the available information is only partial. Methods based on sliding windows have been developed for this purpose.
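A deliberately simplified sketch of the sliding-window idea follows: records are buffered until k of them are available and only their centroid is published. Real stream anonymization methods additionally bound how long a record may wait before publication; this only illustrates the principle, with invented data.

```python
from collections import deque

def anonymize_stream(stream, k=3):
    """Buffer incoming numeric records until k are available, then
    publish only their centroid; no record is ever released on its own."""
    window = deque()
    for record in stream:
        window.append(record)
        if len(window) == k:
            yield sum(window) / k      # publish an aggregate
            window.clear()
    # Records still buffered at the end remain unpublished: at any point
    # in time the available information is only partial.

for centroid in anonymize_stream([5, 7, 6, 40, 42, 41, 9], k=3):
    print(centroid)                    # 6.0, then 41.0
```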
-
Issue #7. Data provenance and data privacy. The new EU General Data Protection Regulation grants citizens the right to rectify and delete the information that companies hold about them. Data provenance comprises the data structures that permit companies to know where the information about customers and users resides in their databases. Different open research questions appear at the crossroads between provenance and privacy. One of them is the fact that data provenance can itself contain sensitive information and, thus, privacy technologies need to be applied to it. At the same time, the fact that data can be modified using data provenance according to customers' requirements poses new privacy problems. We discuss these issues in more detail in the next section. These research topics can also be considered for databases of small and medium size, but it is with big data that the research becomes challenging.
4 Data Privacy and Data Provenance
4.1 Securing Data Provenance
-
Distributed. When databases flow through untrusted environments and provenance data is associated with them, secure data provenance systems need to be defined so that they work in a distributed environment. We cannot use a centralized approach with trusted hardware.
-
Integrity. In distributed environments it is important that nobody can forge provenance data. Provenance data is transmitted, and provenance structures are modified to add the new processes applied to the data. Nevertheless, as stated above, the structure is immutable, and no adversary should be able to change any part of it. In addition, the provenance system should not allow the modification of a value without extending the provenance structure. Finally, the deletion of provenance data should not make a record of the database unreadable. Additional aspects to be taken into account are collusion (coalitions of intruders should not be able to attack integrity), repudiation (intruders should not be able to repudiate a record, claiming it is not theirs), and forgery (intruders should not be able to create new provenance structures).
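One standard building block for these integrity requirements is a hash chain over the provenance entries, so that modifying or deleting any step invalidates all subsequent hashes. A minimal sketch; a real system would add digital signatures to bind entries to actors and to prevent an adversary from rewriting the whole chain, and the entry fields below are invented.

```python
import hashlib, json

def append_entry(chain, process, actor):
    """Append-only provenance: each entry commits to its predecessor via
    a hash, so changing or deleting any step breaks verification."""
    prev = chain[-1]["hash"] if chain else "genesis"
    entry = {"process": process, "actor": actor, "prev": prev}
    entry["hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()).hexdigest()
    return chain + [entry]

def verify(chain):
    prev = "genesis"
    for e in chain:
        body = {k: v for k, v in e.items() if k != "hash"}
        digest = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        if e["prev"] != prev or e["hash"] != digest:
            return False
        prev = e["hash"]
    return True

chain = append_entry([], "collect", "hospital-A")
chain = append_entry(chain, "anonymize", "data-processor")
print(verify(chain))   # True; tampering with any entry yields False
```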
-
Availability. We are interested in providing security mechanisms to ensure provenance data availability. Auditors should be able to access provenance information in a secure, fast and reliable manner to perform any required operation, e.g. verify the integrity of an ownership sequence without knowing the individual records.
-
Privacy and confidentiality. We need to ensure that disclosure does not take place, and this is needed for both the database and the provenance data. Only authorized users should be able to access the information.
-
Completeness. All actions that are relevant to the computation should be detected and represented in the provenance structure. Note that this is not always easy, because some operations, such as cut & paste or manual copying, can escape the system and exclude relevant provenance information.
-
Efficiency. Data provenance introduces an overhead on the data. Fine-grained provenance can double (or more than double) the size of a database. In addition, operations on the provenance structure need to be efficient, because they also introduce an overhead on the computation time.
4.2 Considerations About Privacy and Provenance
-
Case 1: The data are kept private and provenance data are also private. Both need to be protected and their relation has to be preserved.
-
Case 2: The data itself are not protected but provenance data are private.
-
Case 3: Data are private, but the provenance data are not protected.
-
Case 4: No privacy protection is applied to either the data itself or the provenance data.
| | Data | Provenance |
|---|---|---|
| Case 1 | Private | Private |
| Case 2 | Non-private | Private |
| Case 3 | Private | Non-private |
| Case 4 | Non-private | Non-private |
4.3 Example of Privacy Problem with Provenance Information
-
An intruder knows \(S \subset X\), \(\varGamma \), and \(\varGamma '\). Can this intruder gain knowledge of \(\mu \) and of \(S' \subseteq X \setminus S\) with certainty?
-
An intruder knows \(\chi \) and \(\chi '\). Will this intruder be able to determine \(\mu \)?