Published in: Data Mining and Knowledge Discovery, Issue 5-6/2014

1 September 2014

Generalization-based privacy preservation and discrimination prevention in data publishing and mining

Authors: Sara Hajian, Josep Domingo-Ferrer, Oriol Farràs

Abstract

Living in the information society facilitates the automatic collection of huge amounts of data on individuals, organizations, etc. Publishing such data for secondary analysis (e.g. learning models and finding patterns) may be extremely useful to policy makers, planners, marketing analysts, researchers and others. Yet, data publishing and mining do not come without dangers, namely privacy invasion and also potential discrimination of the individuals whose data are published. Discrimination may ensue from training data mining models (e.g. classifiers) on data which are biased against certain protected groups (ethnicity, gender, political preferences, etc.). The objective of this paper is to describe how to obtain data sets for publication that are: (i) privacy-preserving; (ii) unbiased regarding discrimination; and (iii) as useful as possible for learning models and finding patterns. We present the first generalization-based approach to simultaneously offer privacy preservation and discrimination prevention. We formally define the problem, give an optimal algorithm to tackle it and evaluate the algorithm in terms of both general and specific data analysis metrics (i.e. various types of classifiers and rule induction algorithms). It turns out that the impact of our transformation on the quality of data is the same or only slightly higher than the impact of achieving just privacy preservation. In addition, we show how to extend our approach to different privacy models and anti-discrimination legal concepts.

Footnotes
1
The use of PD (resp., PND) attributes in decision making does not necessarily lead to (or exclude) discriminatory decisions (Ruggieri et al. 2010).
 
2
In full-domain generalization, if a value is generalized, all its instances are generalized. There are alternative generalization schemes, such as multi-dimensional generalization or cell generalization, in which some instances of a value may remain ungeneralized while other instances are generalized.
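The difference between the two schemes can be illustrated with a minimal sketch. The attribute name `age`, the one-level hierarchy `AGE_HIERARCHY`, the records and both helper functions are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
# Hypothetical one-level generalization hierarchy for an "age" attribute.
# Attribute name, values and intervals are illustrative assumptions.
AGE_HIERARCHY = {
    "23": "[20-30)",
    "27": "[20-30)",
    "34": "[30-40)",
    "38": "[30-40)",
}


def full_domain_generalize(records, attribute, hierarchy):
    """Replace every instance of every value of `attribute` by its parent."""
    return [
        {**record, attribute: hierarchy[record[attribute]]}
        for record in records
    ]


def cell_generalize(records, attribute, hierarchy, row_indices):
    """Generalize `attribute` only in the rows listed in `row_indices`."""
    chosen = set(row_indices)
    return [
        {**record, attribute: hierarchy[record[attribute]]}
        if index in chosen
        else dict(record)
        for index, record in enumerate(records)
    ]


records = [
    {"age": "23", "job": "clerk"},
    {"age": "23", "job": "nurse"},
    {"age": "34", "job": "clerk"},
]

# Full-domain: no specific age value survives anywhere in the column.
full = full_domain_generalize(records, "age", AGE_HIERARCHY)

# Cell-level: only row 0 is generalized, so "23" and "[20-30)" co-exist.
cell = cell_generalize(records, "age", AGE_HIERARCHY, row_indices=[0])
```

In the cell-level result the same original value appears both in specific and in generalized form, which is the mixed representation the alternative schemes allow.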
 
3
Although algorithms using multi-dimensional or cell generalizations (e.g. the Mondrian algorithm, Lefevre et al. 2006) cause less information loss than algorithms using full-domain generalization, the former suffer from the problem of data exploration (Fung et al. 2010). This problem is caused by the co-existence of specific and generalized values in the generalized data set, which makes data exploration and interpretation difficult for the data analyst.
 
4
On the legal side, different measures are adopted worldwide; see Pedreschi et al. (2013) for parallels between different measures and anti-discrimination acts.
 
5
Discrimination occurs when a group is treated “less favorably” than others.
 
6
Discrimination of a group occurs when a higher proportion of people not in the group is able to comply with a qualifying criterion.
 
7
\(\alpha \) states an acceptable level of discrimination according to laws and regulations. For example, the U.S. Equal Pay Act (United States Congress 1963) states that “a selection rate for any race, sex, or ethnic group which is less than four-fifths of the rate for the group with the highest rate will generally be regarded as evidence of adverse impact”. This amounts to using clift with \(\alpha =1.25\).
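The arithmetic behind the value 1.25 can be checked with a short sketch. It assumes that the measure compared against \(\alpha\) is a plain ratio of two group rates; the exact definition of clift is not given in this section, and the counts and function name below are hypothetical.

```python
from fractions import Fraction


def selection_rate(selected, total):
    """Exact fraction of a group that was selected."""
    return Fraction(selected, total)


# Hypothetical counts, chosen only for illustration.
rate_protected = selection_rate(30, 100)   # 3/10
rate_reference = selection_rate(50, 100)   # 1/2

# Four-fifths rule: the lower rate divided by the highest rate is below 4/5.
ratio = rate_protected / rate_reference    # 3/5
adverse_impact = ratio < Fraction(4, 5)

# The same condition seen from the other side: the reference rate exceeds
# the protected rate by a factor greater than 1 / (4/5) = 5/4 = 1.25.
alpha = 1 / Fraction(4, 5)
inverse_ratio = rate_reference / rate_protected   # 5/3
exceeds_alpha = inverse_ratio > alpha
```

For positive rates the two tests always agree, because `ratio < 4/5` holds exactly when `1 / ratio > 5/4`.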
 
References
Aggarwal CC, Yu PS (eds) (2008) Privacy preserving data mining: models and algorithms. Springer, Berlin
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th international conference on very large data bases, VLDB, pp 487–499
Agrawal R, Srikant R (2000) Privacy preserving data mining. In: ACM SIGMOD 2000, pp 439–450
Australian Legislation (2008) (a) Equal Opportunity Act (Victoria State), (b) Anti-Discrimination Act (Queensland State)
Bayardo RJ, Agrawal R (2005) Data privacy through optimal k-anonymization. In: ICDE 2005, IEEE, pp 217–228
Berendt B, Preibusch S (2012) Exploring discrimination: a user-centric evaluation of discrimination-aware data mining. In: IEEE 12th international conference on data mining workshops-ICDMW 2012, IEEE Computer Society, pp 344–351
Calders T, Verwer S (2010) Three naive Bayes approaches for discrimination-free classification. Data Mining Knowl Discov 21(2):277–292
Custers B, Calders T, Schermer B, Zarsky TZ (eds) (2013) Discrimination and privacy in the information society: data mining and profiling in large databases. Studies in applied philosophy, epistemology and rational ethics, vol 3. Springer, Berlin
Domingo-Ferrer J, Torra V (2005) Ordinal, continuous and heterogeneous \(k\)-anonymity through microaggregation. Data Mining Knowl Discov 11(2):195–212
Dwork C (2006) Differential privacy. In: ICALP 2006, LNCS 4052, Springer, pp 1–12
Dwork C (2011) A firm foundation for private data analysis. Commun ACM 54(1):86–95
Dwork C, Hardt M, Pitassi T, Reingold O, Zemel RS (2012) Fairness through awareness. In: ITCS 2012, ACM, pp 214–226
European Union Legislation (1995) Directive 95/46/EC
European Union Legislation (2009) (a) Race Equality Directive, 2000/43/EC, 2000; (b) Employment Equality Directive, 2000/78/EC, 2000; (c) Equal Treatment of Persons, European Parliament legislative resolution, P6_TA(2009)0211
Fung BCM, Wang K, Yu PS (2005) Top-down specialization for information and privacy preservation. In: ICDE 2005, IEEE, pp 205–216
Fung BCM, Wang K, Fu AW-C, Yu P (2010) Introduction to privacy-preserving data publishing: concepts and techniques. Chapman & Hall/CRC, New York
Hajian S, Domingo-Ferrer J, Martínez-Ballesté A (2011) Rule protection for indirect discrimination prevention in data mining. In: MDAI 2011, LNCS 6820, Springer, pp 211–222
Hajian S, Domingo-Ferrer J (2013) A methodology for direct and indirect discrimination prevention in data mining. IEEE Trans Knowl Data Eng 25(7):1445–1459
Hajian S, Monreale A, Pedreschi D, Domingo-Ferrer J, Giannotti F (2012) Injecting discrimination and privacy awareness into pattern discovery. In: IEEE 12th international conference on data mining workshops-ICDMW 2012, IEEE Computer Society, pp 360–369
Hajian S, Domingo-Ferrer J (2012) A study on the impact of data anonymization on anti-discrimination. In: 2012 IEEE 12th international conference on data mining workshops-ICDMW 2012, IEEE Computer Society, pp 352–359
Hundepool A, Domingo-Ferrer J, Franconi L, Giessing S, Schulte-Nordholt E, Spicer K, de Wolf P-P (2012) Statistical disclosure control. Wiley, Chichester
Iyengar VS (2002) Transforming data to satisfy privacy constraints. In: SIGKDD 2002, ACM, pp 279–288
Kamiran F, Calders T (2011) Data preprocessing techniques for classification without discrimination. Knowl Inf Syst 33(1):1–33
Kamiran F, Calders T, Pechenizkiy M (2010) Discrimination aware decision tree learning. In: ICDM 2010, IEEE, pp 869–874
Kamishima T, Akaho S, Asoh H, Sakuma J (2012) Fairness-aware classifier with prejudice remover regularizer. In: ECML/PKDD, LNCS 7524, Springer, pp 35–50
Lefevre K, Dewitt DJ, Ramakrishnan R (2005) Incognito: efficient full-domain k-anonymity. In: SIGMOD 2005, ACM, pp 49–60
Lefevre K, Dewitt DJ, Ramakrishnan R (2006) Mondrian multidimensional k-anonymity. In: ICDE 2006, IEEE, p 25
Li N, Li T, Venkatasubramanian S (2007) \(t\)-Closeness: privacy beyond \(k\)-anonymity and \(l\)-diversity. In: IEEE ICDE 2007, IEEE, pp 106–115
Lindell Y, Pinkas B (2000) Privacy preserving data mining. In: Bellare M (ed) Advances in cryptology-CRYPTO’00, LNCS 1880, Springer, Berlin, pp 36–53
Loung BL, Ruggieri S, Turini F (2011) k-NN as an implementation of situation testing for discrimination discovery and prevention. In: KDD 2011, ACM, pp 502–510
Machanavajjhala A, Kifer D, Gehrke J, Venkitasubramaniam M (2007) \(l\)-Diversity: privacy beyond \(k\)-anonymity. ACM Trans Knowl Discov Data (TKDD) 1(1):Article 3
Mohammed N, Chen R, Fung BCM, Yu PS (2011) Differentially private data release for data mining. In: KDD 2011, ACM, pp 493–501
Pedreschi D, Ruggieri S, Turini F (2008) Discrimination-aware data mining. In: KDD 2008, ACM, pp 560–568
Pedreschi D, Ruggieri S, Turini F (2009) Measuring discrimination in socially-sensitive decision records. In: SDM 2009, SIAM, pp 581–592
Pedreschi D, Ruggieri S, Turini F (2009) Integrating induction and deduction for finding evidence of discrimination. In: ICAIL 2009, ACM, pp 157–166
Pedreschi D, Ruggieri S, Turini F (2013) The discovery of discrimination. In: Custers BHM, Calders T, Schermer BW, Zarsky TZ (eds) Discrimination and privacy in the information society: studies in applied philosophy, epistemology and rational ethics. Springer, Berlin, pp 91–108
Ruggieri S, Pedreschi D, Turini F (2010) Data mining for discrimination discovery. ACM Trans Knowl Discov Data (TKDD) 4(2):Article 9
Samarati P (2001) Protecting respondents’ identities in microdata release. IEEE Trans Knowl Data Eng 13(6):1010–1027
Samarati P, Sweeney L (1998) Generalizing data to provide anonymity when disclosing information. In: Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems (PODS 98), Seattle, WA, p 188
Sweeney L (1998) Datafly: a system for providing anonymity in medical data. In: Proceedings of the IFIP TC11 WG11.3 11th international conference on database security XI: status and prospects, pp 356–381
Wang K, Yu PS, Chakraborty S (2004) Bottom-up generalization: a data mining solution to privacy protection. In: ICDM 2004, IEEE, pp 249–256
Willenborg L, de Waal T (1996) Elements of statistical disclosure control. Springer, Berlin
Witten I, Frank E (2005) Data mining: practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco
Zliobaite I, Kamiran F, Calders T (2011) Handling conditional discrimination. In: ICDM 2011, IEEE, pp 992–1001
Metadata
Title
Generalization-based privacy preservation and discrimination prevention in data publishing and mining
Authors
Sara Hajian
Josep Domingo-Ferrer
Oriol Farràs
Publication date
1 September 2014
Publisher
Springer US
Published in
Data Mining and Knowledge Discovery / Issue 5-6/2014
Print ISSN: 1384-5810
Electronic ISSN: 1573-756X
DOI
https://doi.org/10.1007/s10618-014-0346-1
