Choquet integral for record linkage

Abril, Daniel; Navarro-Arribas, Guillermo; Torra, Vicenç

doi:10.1007/s10479-011-0989-x

Published: 05 October 2011

Volume 195, pages 97–110, (2012)

Annals of Operations Research

Abstract

Record linkage is used in data privacy to evaluate the disclosure risk of protected data. It models potential attacks in which an intruder attempts to link records from the protected data to the original data. In this paper we introduce a novel distance-based record linkage method that uses the Choquet integral to compute the distance between records. A fuzzy measure weights each subset of the variables in a record. This allows us to improve on standard record linkage and to obtain information about the re-identification risk of each variable and of the interactions among variables. To that end, we use a supervised learning approach that determines the optimal fuzzy measure for the linkage.
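
As an illustration of the idea in the abstract, the following is a minimal sketch of distance-based record linkage with a Choquet integral. It is not the authors' implementation. It assumes that the squared per-variable differences between two numeric records are aggregated with the discrete Choquet integral, and that the fuzzy measure is given as a dictionary from subsets of variable indices to weights. The function names (`choquet_integral`, `choquet_distance_sq`, `link_records`, `additive_measure`) are hypothetical, and any normalization of the variables and the supervised learning of the measure are left out because this excerpt does not describe them.

```python
from itertools import combinations


def choquet_integral(values, mu):
    """Discrete Choquet integral of non-negative values.

    mu maps frozenset(indices) -> weight, with mu[frozenset()] == 0,
    mu[all indices] == 1, and mu monotone under set inclusion.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])  # ascending
    total, previous = 0.0, 0.0
    for k, i in enumerate(order):
        upper = frozenset(order[k:])  # variables whose value is >= values[i]
        total += (values[i] - previous) * mu[upper]
        previous = values[i]
    return total


def choquet_distance_sq(a, b, mu):
    """Assumed distance: Choquet integral of squared per-variable differences."""
    return choquet_integral([(x - y) ** 2 for x, y in zip(a, b)], mu)


def link_records(protected, original, mu):
    """Link each protected record to the index of its nearest original record."""
    return [
        min(range(len(original)),
            key=lambda j: choquet_distance_sq(p, original[j], mu))
        for p in protected
    ]


def additive_measure(n):
    """Equal-weight additive measure; the integral then equals the mean."""
    return {
        frozenset(c): len(c) / n
        for r in range(n + 1)
        for c in combinations(range(n), r)
    }


if __name__ == "__main__":
    # Illustrative toy data, not taken from the paper.
    original = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
    protected = [(0.1, -0.1), (4.8, 5.3), (1.2, 0.9)]
    # A non-additive measure on two variables: each variable alone has a
    # small weight, but together they carry the full weight.
    mu = {
        frozenset(): 0.0,
        frozenset({0}): 0.2,
        frozenset({1}): 0.3,
        frozenset({0, 1}): 1.0,
    }
    print(link_records(protected, original, mu))
```

With an additive measure the Choquet integral reduces to a weighted mean, so this distance generalizes the usual weighted Euclidean distance. A non-additive measure can additionally express interactions between variables, which is the property the abstract highlights.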

Author information

Authors and Affiliations

IIIA, Institut d’Investigació en Intel·ligència Artificial - CSIC, Consejo Superior de Investigaciones Científicas, Campus UAB s/n, 08193, Bellaterra, Catalonia, Spain
Daniel Abril, Guillermo Navarro-Arribas & Vicenç Torra

Corresponding author

Correspondence to Guillermo Navarro-Arribas.

About this article

Cite this article

Abril, D., Navarro-Arribas, G. & Torra, V. Choquet integral for record linkage. Ann Oper Res 195, 97–110 (2012). https://doi.org/10.1007/s10479-011-0989-x

Published: 05 October 2011
Issue Date: May 2012
DOI: https://doi.org/10.1007/s10479-011-0989-x
