Privacy-aware collection of aggregate spatial data

https://doi.org/10.1016/j.datak.2011.03.007Get rights and content

Abstract

Privacy concerns can be a major barrier to collecting aggregate data from the public. Recent research proposes negative surveys that collect negative data, which is complementary to the true data. This opens a new direction for privacy-aware data collection. However, the existing approach cannot avoid certain errors when applied to many spatial data collection tasks. The errors can make the data unusable in many real scenarios. We propose Gaussian negative surveys. We modulate data collection based on Gaussian distribution. The collected data can be used to compute accurate spatial distribution of participants and can be used to accurately answer range aggregate queries. Our approach avoids the errors that can occur with the existing approach. Our experiments show that we achieve an excellent balance between privacy and accuracy.

Introduction

Collecting aggregate data from users is important to many applications. A typical data collection shows a number of categories to participants and requires each participant to select one category. Such a data collection system normally maintains aggregate information, e.g., total number of participants, for each category. It is well known that privacy concerns of participants have a strong impact on data collection, especially when the participants need to answer sensitive questions [1], [2]. The effects of privacy concerns can be magnified when spatial data is collected due to the fact that spatial information usually relates to the physical presence of people. Failure to protect spatial privacy can result in serious harm, including physical assault, to individuals [3]. This can explain the fact that spatial privacy draws increasingly more attention from the public [4]. As the perceived privacy is a deciding factor for the participation in data collection [5], many participants may intentionally provide false data, e.g., report categories that does not apply to them, and may even refuse to provide any data [6]. Consequently, the collected data may contain significant errors in certain categories. In addition, the data may become more unusable when answering range aggregate queries as errors can lead to more unexpected results or hide problems of the data due to errors canceling each other.

Although the quality of the collected data can be improved using background knowledge [7], we are interested in collecting high quality data in the first place. To achieve this, one needs to minimize the effects of the privacy concerns during data collection. Recent research proposes negative surveys [8], which are a type of privacy-aware data collection. Negative surveys collect negative data [9], [10], [11], which is complementary to the true data. Researchers guarantee a strong privacy protection by encouraging participants to report categories that do not apply to them. These categories are called negative categories. In contrast to negative surveys, we call the traditional data collections that collect true data positive surveys.

We envisage a simplified scenario that is used throughout the paper: a transportation authority wishes to know the number of passengers who arrive at a central station from other stations on a railway route. In this example, there are seven stations where a passenger can get on a train (Fig. 1). Aggregate data is maintained for each station. The query range Q shown in the example covers three adjacent stations, S3, S4, and S5. The true answer to the query is 541. We compare two types of data collections in Fig. 2. Assume that a participant starts a trip from station S4. The participant has to report S4 in a positive survey. However, in a negative survey the participant reports any station but S4. One only knows that the reported station, for example S2, is not the actual station where the participant started the trip in a negative survey. In this example, there are 6 possible stations, which safeguards the participant's location privacy.

We call the existing approach of collecting negative data Uniform Negative Surveys (UNSs) [8] because each category, except the true category, has an equal probability to be selected by participants. As the reported counts can be vastly different to the true counts, UNSs apply a technique to reconstruct the true data from the negative data. We detail the data reconstruction technique in Section 3.1. UNSs provide an accurate estimation of the true data if the ratio of participants to categories is high. Using the previous scenario of surveying passengers, we simulate a data collection with 10,000 participants. We can observe that the estimated data from UNS are close to the true data (Fig. 3).

UNSs are not immune to errors during data reconstruction. We found that this problem is particularly severe when the ratio of participants to categories is low. For example, Fig. 4 compares the true counts and the reconstructed counts from an UNS using the previous scenario of collecting aggregate data from passengers. The results are from a simulation with 100 participants. As the chart shows, the reconstructed data significantly deviate from the true data. When the data from adjacent categories are summed up for answering range aggregate queries, errors in individual categories may offset each other, if some of them cause over-counting while others cause under-counting. This may be desirable for an individual query but may lead to more significant problems in the future as it relies on erroneous data. For example, other queries may not show the same behavior leading to user confusion. Also, in many situations, the offset does not happen or the offset is insignificant compared to the answer. For example, using the data in Fig. 4, if the query range covers S2, S3 and S4, errors in those categories do not offset each other as all of them are lower than the true counts in corresponding categories. We can observe from Fig. 4 that there are three categories with negative counts after reconstruction. The negative counts are unavoidable as shown in the original work on negative surveys. Those counts, which are wrong, further decrease the use of negative surveys. Thus, we propose GNSs, a method that does not generate negative counts.

In this paper, we propose Gaussian Negative Surveys (GNSs). Different to UNSs, the reported categories are not uniformly chosen in GNSs. Instead, negative categories that are close to the true category are more likely to be selected than the negative categories far from the true category. Our work is focused on surveying geo-spatial data, which often shows strong correlation between adjacent locations in two-dimensional environments. One should not confuse the Gaussian distribution used by GNSs with the underlying distribution of the phenomena being surveyed, which does not have to follow Gaussian distribution. Our experimental results show the high accuracy levels achieved by GNSs for real geo-spatial data. We should also note that GNSs are applied to discrete categories because public surveys normally collect the data based on spatial decomposition with a certain resolution.

Our technique is particularly useful when aggregate spatial data is collected. Data collections involving spatial data may have hundreds or thousands of categories, each of which represents a certain location or area. The success of our approach lies in the fact that GNSs retain the correlation between adjacent categories in collected data because a majority of participants are expected to report the categories that are close to their true categories. As a range aggregate query normally covers a number of categories, a large portion of the reported counts within the query range would come from the participants who are actually in the range. Thus, the reported data can be used to approximate the true data in the query range. We should note that the participants can still achieve a high level of privacy protection when the data collection is modulated as above. This is because a category can be reported by participants from a number of nearby categories with similar probabilities. Consequently, similar to UNSs, it is still difficult for an adversary to derive the true category of a participant in GNSs. Our experimental results (Section 6) show that GNSs can achieve a high level of privacy while providing accurate answers to queries.

The contributions of this paper are:

  • We develop GNSs for collecting aggregate data while protecting individuals' privacy. GNSs are particularly suitable for collecting spatial data. GNSs are also significantly better than UNSs in answering range aggregate queries.

  • We define a privacy metric for individuals in data collection.

  • We analyze the factors that affect the privacy level and the accuracy level of GNSs.

  • We compare GNSs with UNSs and a common data perturbation technique, Uniform Retention Replacement Perturbations [12], through comprehensive experiments. Our results show that GNSs achieve high privacy levels similar to the existing approaches but are significantly more accurate in solving queries.

The rest of the paper is organized as follows. Section 2 presents the related work. Section 3 details UNSs and gives our analysis of the errors in UNSs. We present GNSs in Section 4. Privacy and accuracy metrics are introduced in Section 5. The experimental results are shown in Section 6. We conclude our paper in Section 7.

Section snippets

Privacy protection in statistical databases

Our work is related to privacy protection in statistical databases. Adam et al. [13] categorize the privacy-preserving approaches for statistical databases: conceptual framework, query restriction, data perturbation and output perturbation. Conceptual frameworks address the security issues by defining the rules for constructing a database, such as avoiding the insertion of individual entities and building multi-dimensional tables with aggregated information. Query restriction protects the

Data collection and reconstruction in uniform negative surveys

In UNSs, participants are shown mutually exclusive categories. For a participant, there is exactly one category, which is called the positive category; other categories are called the negative categories [8]. A participant randomly reports a negative category during data collection. The counts of participants for each category are maintained without knowing the true categories for the participants. An important assumption of UNSs is that the reported category is uniformly selected from the

Gaussian negative surveys

We call our approach Gaussian negative surveys because the probabilities for selecting negative categories follow a Gaussian distribution centered at the true category. Due to this characteristic, the majority of true counts for a category is spread into the categories that are close to the category. Using the same example of collecting aggregate data from passengers (Fig. 1), we show the probabilities that participants in positive category i report negative category j in Fig. 5, where i and j

Privacy

We define a measure of privacy in GNSs. We do not measure the actual privacy level achieved by individuals based on k-anonymity. Anonymity-based metrics evaluate the privacy levels when people can report true sensitive information but want to be indistinguishable from others. Survey participants, however, usually want to hide sensitive information in the first place. We define a metric that estimates the privacy levels of an individual without knowing the true data. We measure privacy based on

Experiments

We compare our proposed approach, GNSs, with the existing approach of negative surveys, UNSs. We also compare GNSs with a common data perturbation technique, Retention Replacement Perturbations [12]. This approach works as follows. During a survey, a random value is generated for each participant. If the value is below a pre-determined retention threshold, the true category of the participant is collected. Otherwise, one of the categories (including the true category) is randomly selected based

Conclusion

This paper shows that the existing approach for negative surveys, i.e., Uniform Negative Surveys (UNSs), works fine for data collection with a small number of categories but with the increase of the number of categories, the collected data becomes unusable. The effects of the errors are particularly significant when UNSs are used for collecting aggregate spatial data. In this work, we focus on collecting statistical spatial data, although for certain applications, detailed personal data might

Acknowledgment

This work is partially supported by ARC Grant DP110100757.

Hairuo Xie is a Ph.D. student in the Department of Computer Science and Software Engineering, the University of Melbourne. He received his Master of IT degree from the University of Melbourne in 2005. His research focuses on distributed data structures and aggregate query processing in sensor networks.

References (53)

  • R. Hsieh et al.

    The influence of privacy and security on respondents' trust and participation in e-surveys

  • M.J. Culnan et al.

    Information privacy concerns, procedural fairness, and impersonal trust: an empirical investigation

    Organization Science

    (1999)
  • J. Horey et al.

    Anonymous data collection in sensor networks

    MobiQuitous

    (2007)
  • F. Esponda

    Hiding a needle in a haystack using negative databases

  • F. Esponda et al.

    Protecting data privacy through hard-to-reverse negative databases

    International Journal of Information Security

    (2007)
  • F. Esponda et al.

    Negative representations of information

    International Journal of Information Security

    (2009)
  • R. Agrawal et al.

    Privacy preserving OLAP

    SIGMOD

    (2005)
  • N.R. Adam et al.

    Security-control methods for statistical databases: a comparative study

    ACM Computing Surveys

    (1989)
  • R. Agrawal et al.

    Privacy-preserving data mining

    SIGMOD Record

    (2000)
  • S.P. Reiss

    Practical data-swapping: the first steps

    ACM Transactions on Database Systems

    (1984)
  • S.E. Fienberg et al.

    Data swapping: variations on a theme by Dalenius and Reiss

    Privacy in Statistical Databases

    (2004)
  • Q. Zhang et al.

    Aggregate query answering on anonymized tables

    ICDE

    (2007)
  • Y. He et al.

    Anonymization of set-valued data via top-down, local generalization

    PVLDB

    (2009)
  • M. Terrovitis et al.

    Privacy-preserving anonymization of set-valued data

    PVLDB

    (2008)
  • T. Ge et al.

    Answering aggregation queries in a secure system model

    VLDB

    (2007)
  • R. Agrawal et al.

    Order preserving encryption for numeric data

    SIGMOD

    (2004)
  • Cited by (40)

    • Enhancing the privacy of negative surveys using negative combined categories

      2020, Applied Soft Computing Journal
      Citation Excerpt :

      Therefore, a participant having B as the positive category, should select a category with the same probability from A, C, D, and E for Q1. A Gaussian negative survey model was proposed in [17], which simulates the probability of selecting different categories by the Gaussian distribution. For Q1, a participant with B as the positive category selects categories A and C with a higher probability than his/her selects D and E.

    • Location privacy preferences: A survey-based analysis of consumer awareness, trade-off and decision-making

      2015, Transportation Research Part C: Emerging Technologies
      Citation Excerpt :

      Given both the dynamic nature of location information and the constraints faced in balancing data privacy with usability, work on techniques for privacy preservation has seen substantial growth in recent years. Techniques for privacy preservation have ranged from methods of background data perturbation, masking, and cloaking (Hoh and Gruteser, 2005; Hoh et al., 2012; Xie et al., 2011) to methods focused on how location data are presented and shared publically (particularly within social networks) (Li and Chen, 2010; Puttaswamy et al., 2014; Mascetti et al., 2011). Additional exploration has also taken place regarding issues such as the economic valuation of privacy in cases of differentiated tolling (Zangui et al., 2013) and the implementation of ‘Privacy-By-Design’ approaches in the collection of mobile data for fine-grained modeling purposes (Sun et al., 2013).

    • On the dependable level of the negative survey

      2014, Statistics and Probability Letters
      Citation Excerpt :

      The negative survey was first proposed by Esponda (2006). In this section, the background of the negative survey is introduced (Esponda, 2006; Esponda and Guerrero, 2009; Horey et al., 2007; Xie et al., 2011). The above formula only gives the point estimation of positive survey results.

    • How to ask sensitive questions in conservation: A review of specialized questioning techniques

      2014, Biological Conservation
      Citation Excerpt :

      However, in order to reduce the chance of bias in respondents’ selection of response categories, a randomizing device with c − 1 options is used in private by the respondent to obtain a value m, they then choose the mth true alternative from the list accordingly (Esponda and Guerrero, 2009). Rather than using a randomizing device with known probabilities drawn from a uniform distribution, Xie et al. (2011) proposed that the probability of selecting response categories should follow a Gaussian distribution centred at the positive category. This approach achieves higher accuracy but reduces respondent privacy.

    View all citing articles on Scopus

    Hairuo Xie is a Ph.D. student in the Department of Computer Science and Software Engineering, the University of Melbourne. He received his Master of IT degree from the University of Melbourne in 2005. His research focuses on distributed data structures and aggregate query processing in sensor networks.

    Lars Kulik received his PhD from the University of Hamburg, Germany, in 2002. He is a senior lecturer in the Department of Computer Science and Software Engineering at the University of Melbourne. His research focuses on efficient algorithms for moving objects and traffic applications, methods for safeguarding location privacy, information dissemination and aggregation algorithms in sensor networks, spatial algorithms in pervasive computing environments, and robust algorithms that cope with imperfection, especially in mobile computing.

    Egemen Tanin is a senior lecturer in the Department of Computer Science and Software Engineering, the University of Melbourne. He received his PhD degree in computer science from the University of Maryland, College Park, Maryland, where he also held a postdoctoral research associate position from 2001 until 2003. His areas of research include spatial data management and database visualization.

    View full text