Privacy-aware collection of aggregate spatial data
Introduction
Collecting aggregate data from users is important to many applications. A typical data collection shows a number of categories to participants and requires each participant to select one category. Such a data collection system normally maintains aggregate information, e.g., total number of participants, for each category. It is well known that privacy concerns of participants have a strong impact on data collection, especially when the participants need to answer sensitive questions [1], [2]. The effects of privacy concerns can be magnified when spatial data is collected due to the fact that spatial information usually relates to the physical presence of people. Failure to protect spatial privacy can result in serious harm, including physical assault, to individuals [3]. This can explain the fact that spatial privacy draws increasingly more attention from the public [4]. As the perceived privacy is a deciding factor for the participation in data collection [5], many participants may intentionally provide false data, e.g., report categories that does not apply to them, and may even refuse to provide any data [6]. Consequently, the collected data may contain significant errors in certain categories. In addition, the data may become more unusable when answering range aggregate queries as errors can lead to more unexpected results or hide problems of the data due to errors canceling each other.
Although the quality of the collected data can be improved using background knowledge [7], we are interested in collecting high quality data in the first place. To achieve this, one needs to minimize the effects of the privacy concerns during data collection. Recent research proposes negative surveys [8], which are a type of privacy-aware data collection. Negative surveys collect negative data [9], [10], [11], which is complementary to the true data. Researchers guarantee a strong privacy protection by encouraging participants to report categories that do not apply to them. These categories are called negative categories. In contrast to negative surveys, we call the traditional data collections that collect true data positive surveys.
We envisage a simplified scenario that is used throughout the paper: a transportation authority wishes to know the number of passengers who arrive at a central station from other stations on a railway route. In this example, there are seven stations where a passenger can get on a train (Fig. 1). Aggregate data is maintained for each station. The query range Q shown in the example covers three adjacent stations, S3, S4, and S5. The true answer to the query is 541. We compare two types of data collections in Fig. 2. Assume that a participant starts a trip from station S4. The participant has to report S4 in a positive survey. However, in a negative survey the participant reports any station but S4. One only knows that the reported station, for example S2, is not the actual station where the participant started the trip in a negative survey. In this example, there are 6 possible stations, which safeguards the participant's location privacy.
We call the existing approach of collecting negative data Uniform Negative Surveys (UNSs) [8] because each category, except the true category, has an equal probability to be selected by participants. As the reported counts can be vastly different to the true counts, UNSs apply a technique to reconstruct the true data from the negative data. We detail the data reconstruction technique in Section 3.1. UNSs provide an accurate estimation of the true data if the ratio of participants to categories is high. Using the previous scenario of surveying passengers, we simulate a data collection with 10,000 participants. We can observe that the estimated data from UNS are close to the true data (Fig. 3).
UNSs are not immune to errors during data reconstruction. We found that this problem is particularly severe when the ratio of participants to categories is low. For example, Fig. 4 compares the true counts and the reconstructed counts from an UNS using the previous scenario of collecting aggregate data from passengers. The results are from a simulation with 100 participants. As the chart shows, the reconstructed data significantly deviate from the true data. When the data from adjacent categories are summed up for answering range aggregate queries, errors in individual categories may offset each other, if some of them cause over-counting while others cause under-counting. This may be desirable for an individual query but may lead to more significant problems in the future as it relies on erroneous data. For example, other queries may not show the same behavior leading to user confusion. Also, in many situations, the offset does not happen or the offset is insignificant compared to the answer. For example, using the data in Fig. 4, if the query range covers S2, S3 and S4, errors in those categories do not offset each other as all of them are lower than the true counts in corresponding categories. We can observe from Fig. 4 that there are three categories with negative counts after reconstruction. The negative counts are unavoidable as shown in the original work on negative surveys. Those counts, which are wrong, further decrease the use of negative surveys. Thus, we propose GNSs, a method that does not generate negative counts.
In this paper, we propose Gaussian Negative Surveys (GNSs). Different to UNSs, the reported categories are not uniformly chosen in GNSs. Instead, negative categories that are close to the true category are more likely to be selected than the negative categories far from the true category. Our work is focused on surveying geo-spatial data, which often shows strong correlation between adjacent locations in two-dimensional environments. One should not confuse the Gaussian distribution used by GNSs with the underlying distribution of the phenomena being surveyed, which does not have to follow Gaussian distribution. Our experimental results show the high accuracy levels achieved by GNSs for real geo-spatial data. We should also note that GNSs are applied to discrete categories because public surveys normally collect the data based on spatial decomposition with a certain resolution.
Our technique is particularly useful when aggregate spatial data is collected. Data collections involving spatial data may have hundreds or thousands of categories, each of which represents a certain location or area. The success of our approach lies in the fact that GNSs retain the correlation between adjacent categories in collected data because a majority of participants are expected to report the categories that are close to their true categories. As a range aggregate query normally covers a number of categories, a large portion of the reported counts within the query range would come from the participants who are actually in the range. Thus, the reported data can be used to approximate the true data in the query range. We should note that the participants can still achieve a high level of privacy protection when the data collection is modulated as above. This is because a category can be reported by participants from a number of nearby categories with similar probabilities. Consequently, similar to UNSs, it is still difficult for an adversary to derive the true category of a participant in GNSs. Our experimental results (Section 6) show that GNSs can achieve a high level of privacy while providing accurate answers to queries.
The contributions of this paper are:
- •
We develop GNSs for collecting aggregate data while protecting individuals' privacy. GNSs are particularly suitable for collecting spatial data. GNSs are also significantly better than UNSs in answering range aggregate queries.
- •
We define a privacy metric for individuals in data collection.
- •
We analyze the factors that affect the privacy level and the accuracy level of GNSs.
- •
We compare GNSs with UNSs and a common data perturbation technique, Uniform Retention Replacement Perturbations [12], through comprehensive experiments. Our results show that GNSs achieve high privacy levels similar to the existing approaches but are significantly more accurate in solving queries.
The rest of the paper is organized as follows. Section 2 presents the related work. Section 3 details UNSs and gives our analysis of the errors in UNSs. We present GNSs in Section 4. Privacy and accuracy metrics are introduced in Section 5. The experimental results are shown in Section 6. We conclude our paper in Section 7.
Section snippets
Privacy protection in statistical databases
Our work is related to privacy protection in statistical databases. Adam et al. [13] categorize the privacy-preserving approaches for statistical databases: conceptual framework, query restriction, data perturbation and output perturbation. Conceptual frameworks address the security issues by defining the rules for constructing a database, such as avoiding the insertion of individual entities and building multi-dimensional tables with aggregated information. Query restriction protects the
Data collection and reconstruction in uniform negative surveys
In UNSs, participants are shown mutually exclusive categories. For a participant, there is exactly one category, which is called the positive category; other categories are called the negative categories [8]. A participant randomly reports a negative category during data collection. The counts of participants for each category are maintained without knowing the true categories for the participants. An important assumption of UNSs is that the reported category is uniformly selected from the
Gaussian negative surveys
We call our approach Gaussian negative surveys because the probabilities for selecting negative categories follow a Gaussian distribution centered at the true category. Due to this characteristic, the majority of true counts for a category is spread into the categories that are close to the category. Using the same example of collecting aggregate data from passengers (Fig. 1), we show the probabilities that participants in positive category i report negative category j in Fig. 5, where i and j
Privacy
We define a measure of privacy in GNSs. We do not measure the actual privacy level achieved by individuals based on k-anonymity. Anonymity-based metrics evaluate the privacy levels when people can report true sensitive information but want to be indistinguishable from others. Survey participants, however, usually want to hide sensitive information in the first place. We define a metric that estimates the privacy levels of an individual without knowing the true data. We measure privacy based on
Experiments
We compare our proposed approach, GNSs, with the existing approach of negative surveys, UNSs. We also compare GNSs with a common data perturbation technique, Retention Replacement Perturbations [12]. This approach works as follows. During a survey, a random value is generated for each participant. If the value is below a pre-determined retention threshold, the true category of the participant is collected. Otherwise, one of the categories (including the true category) is randomly selected based
Conclusion
This paper shows that the existing approach for negative surveys, i.e., Uniform Negative Surveys (UNSs), works fine for data collection with a small number of categories but with the increase of the number of categories, the collected data becomes unusable. The effects of the errors are particularly significant when UNSs are used for collecting aggregate spatial data. In this work, we focus on collecting statistical spatial data, although for certain applications, detailed personal data might
Acknowledgment
This work is partially supported by ARC Grant DP110100757.
Hairuo Xie is a Ph.D. student in the Department of Computer Science and Software Engineering, the University of Melbourne. He received his Master of IT degree from the University of Melbourne in 2005. His research focuses on distributed data structures and aggregate query processing in sensor networks.
References (53)
- et al.
Conceptual construction on incomplete survey data
Data & Knowledge Engineering
(2004) - et al.
Privacy-preserving data publishing for cluster analysis
Data & Knowledge Engineering
(2009) - et al.
Data privacy protection in multi-party clustering
Data & Knowledge Engineering
(2008) - et al.
Accurate and large-scale privacy-preserving data mining using the election paradigm
Data & Knowledge Engineering
(2009) - et al.
Discovering private trajectories using background information
Data & Knowledge Engineering
(2010) - et al.
A privacy preserving technique for distance-based classification with worst case privacy guarantees
Data & Knowledge Engineering
(2008) - et al.
Asking sensitive questions: the impact of data collection mode, question format, and question context
Public Opinion Quarterly
(1996) Survey research and societal change
Annual Review of Psychology
(2004)- et al.
Location privacy and location-aware computing
Attitudes toward spatial privacy in the United States of America
The influence of privacy and security on respondents' trust and participation in e-surveys
Information privacy concerns, procedural fairness, and impersonal trust: an empirical investigation
Organization Science
Anonymous data collection in sensor networks
MobiQuitous
Hiding a needle in a haystack using negative databases
Protecting data privacy through hard-to-reverse negative databases
International Journal of Information Security
Negative representations of information
International Journal of Information Security
Privacy preserving OLAP
SIGMOD
Security-control methods for statistical databases: a comparative study
ACM Computing Surveys
Privacy-preserving data mining
SIGMOD Record
Practical data-swapping: the first steps
ACM Transactions on Database Systems
Data swapping: variations on a theme by Dalenius and Reiss
Privacy in Statistical Databases
Aggregate query answering on anonymized tables
ICDE
Anonymization of set-valued data via top-down, local generalization
PVLDB
Privacy-preserving anonymization of set-valued data
PVLDB
Answering aggregation queries in a secure system model
VLDB
Order preserving encryption for numeric data
SIGMOD
Cited by (40)
A negative survey based privacy preservation method for topology of social networks
2023, Applied Soft ComputingEnhancing the privacy of negative surveys using negative combined categories
2020, Applied Soft Computing JournalCitation Excerpt :Therefore, a participant having B as the positive category, should select a category with the same probability from A, C, D, and E for Q1. A Gaussian negative survey model was proposed in [17], which simulates the probability of selecting different categories by the Gaussian distribution. For Q1, a participant with B as the positive category selects categories A and C with a higher probability than his/her selects D and E.
Multiple-negative survey method for enhancing the accuracy of negative survey-based cloud data privacy: Applications and extensions
2017, Engineering Applications of Artificial IntelligenceLocation privacy preferences: A survey-based analysis of consumer awareness, trade-off and decision-making
2015, Transportation Research Part C: Emerging TechnologiesCitation Excerpt :Given both the dynamic nature of location information and the constraints faced in balancing data privacy with usability, work on techniques for privacy preservation has seen substantial growth in recent years. Techniques for privacy preservation have ranged from methods of background data perturbation, masking, and cloaking (Hoh and Gruteser, 2005; Hoh et al., 2012; Xie et al., 2011) to methods focused on how location data are presented and shared publically (particularly within social networks) (Li and Chen, 2010; Puttaswamy et al., 2014; Mascetti et al., 2011). Additional exploration has also taken place regarding issues such as the economic valuation of privacy in cases of differentiated tolling (Zangui et al., 2013) and the implementation of ‘Privacy-By-Design’ approaches in the collection of mobile data for fine-grained modeling purposes (Sun et al., 2013).
On the dependable level of the negative survey
2014, Statistics and Probability LettersCitation Excerpt :The negative survey was first proposed by Esponda (2006). In this section, the background of the negative survey is introduced (Esponda, 2006; Esponda and Guerrero, 2009; Horey et al., 2007; Xie et al., 2011). The above formula only gives the point estimation of positive survey results.
How to ask sensitive questions in conservation: A review of specialized questioning techniques
2014, Biological ConservationCitation Excerpt :However, in order to reduce the chance of bias in respondents’ selection of response categories, a randomizing device with c − 1 options is used in private by the respondent to obtain a value m, they then choose the mth true alternative from the list accordingly (Esponda and Guerrero, 2009). Rather than using a randomizing device with known probabilities drawn from a uniform distribution, Xie et al. (2011) proposed that the probability of selecting response categories should follow a Gaussian distribution centred at the positive category. This approach achieves higher accuracy but reduces respondent privacy.
Hairuo Xie is a Ph.D. student in the Department of Computer Science and Software Engineering, the University of Melbourne. He received his Master of IT degree from the University of Melbourne in 2005. His research focuses on distributed data structures and aggregate query processing in sensor networks.
Lars Kulik received his PhD from the University of Hamburg, Germany, in 2002. He is a senior lecturer in the Department of Computer Science and Software Engineering at the University of Melbourne. His research focuses on efficient algorithms for moving objects and traffic applications, methods for safeguarding location privacy, information dissemination and aggregation algorithms in sensor networks, spatial algorithms in pervasive computing environments, and robust algorithms that cope with imperfection, especially in mobile computing.
Egemen Tanin is a senior lecturer in the Department of Computer Science and Software Engineering, the University of Melbourne. He received his PhD degree in computer science from the University of Maryland, College Park, Maryland, where he also held a postdoctoral research associate position from 2001 until 2003. His areas of research include spatial data management and database visualization.