2011 | OriginalPaper | Book chapter

8. Salt: Scalable Automated Linking Technology for Data-Intensive Computing

Written by: Anthony M. Middleton, David Alan Bayliss

Published in: Handbook of Data Intensive Computing

Publisher: Springer New York

Abstract

One of the most complex tasks in a data processing environment is record linkage, the data integration process of accurately matching or clustering records or documents from multiple data sources containing information that refers to the same entity, such as a person or business. The massive amount of data being collected at many organizations has led to what is now called the “Big Data” problem, which limits the capability of organizations to process and use their data effectively and makes the record linkage process even more challenging [3, 13]. New high-performance data-intensive computing architectures that support scalable parallel processing, such as Hadoop MapReduce and HPCC, allow government, commercial organizations, and research environments to process massive amounts of data and solve complex data processing problems, including record linkage. A fundamental challenge of data-intensive computing is developing new algorithms that can scale to search and process big data [17]. SALT (Scalable Automated Linking Technology) is a new tool that automatically generates code in the ECL language for the open source HPCC scalable data-intensive computing platform, based on a simple specification, to address most common data integration tasks, including data profiling, data cleansing, data ingest, and record linkage.

References
1. Bilenko, M., & Mooney, R. J. (2003, August 24–27). Adaptive duplicate detection using learnable string similarity measures. Proceedings of the KDD ’03 Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, D.C., 39–48.
2. Branting, L. K. (2003). A comparative evaluation of name-matching algorithms. Proceedings of the ICAIL ’03 9th International Conference on Artificial Intelligence and Law, Edinburgh, Scotland, 224–232.
3. Christen, P. (2008). Automatic record linkage using seeded nearest neighbor and support vector machine classification. Proceedings of the KDD ’08 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, NV, 151–159.
4. Cochinwala, M., Dalal, S., Elmagarmid, A. K., & Verykios, V. V. (2001). Record matching: Past, present and future (No. Technical Report CSD-TR #01-013): Department of Computer Sciences, Purdue University.
5. Cohen, W., & Richman, J. (2001). Learning to match and cluster entity names. Proceedings of the ACM SIGIR’01 workshop on Mathematical/Formal Methods in IR.
6. Cohen, W., & Richman, J. (2002). Learning to match and cluster large high-dimensional data sets for data integration. Proceedings of the KDD ’02 Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada.
7. Cohen, W. W. (2000). Data integration using similarity joins and a word-based information representation language. ACM Transactions on Information Systems, 18(3).
8. Cohen, W. W., Ravikumar, P., & Fienberg, S. E. (2003, August). A comparison of string distance metrics for name matching tasks. Proceedings of the IJCAI-03 Workshop on Information Integration, Acapulco, Mexico, 73–78.
9. Dunn, H. L. (1946). Record linkage. American Journal of Public Health, 36, 1412–1415.
10. Fellegi, I. P., & Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64(328), 1183–1210.
11. Gravano, L., Ipeirotis, P. G., Koudas, N., & Srivastava, D. (2003, May 20–24). Text joins in an RDBMS for web data integration. Proceedings of the WWW ’03 12th international conference on World Wide Web, Budapest, Hungary.
12. Gu, L., Baxter, R., Vickers, D., & Rainsford, C. (2003). Record linkage: Current practice and future directions (No. CMIS Technical Report No. 03/83): CSIRO Mathematical and Information Sciences.
13. Herzog, T. N., Scheuren, F. J., & Winkler, W. E. (2007). Data quality and record linkage techniques. New York: Springer Science and Business Media LLC.
14. Jones, K. S. (1972). A statistical interpretation of term specificity and its application in information retrieval. Journal of Documentation, 28(1), 11–21.
15. Koudas, N., Marathe, A., & Srivastava, D. (2004). Flexible string matching against large databases in practice. Proceedings of the 30th VLDB Conference, Toronto, Canada, 1078–1086.
16. Maggi, F. (2008). A survey of probabilistic record matching models, techniques and tools (No. Advanced Topics in Information Systems B, Cycle XXII, Scientific Report TR-2008-22): DEI, Politecnico di Milano.
17. Middleton, A. M. (2010). Data-intensive technologies for cloud computing. In B. Furht & A. Escalante (Eds.), Handbook of cloud computing (pp. 83–136). New York: Springer.
18. Newcombe, H. B., & Kennedy, J. M. (1962). Record linkage. Communications of the ACM, 5(11), 563–566.
19. Newcombe, H. B., Kennedy, J. M., Axford, S. J., & James, A. P. (1959). Automatic linkage of vital records. Science, 130, 954–959.
20. Robertson, S. (2004). Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation, 60(5), 503–520.
21. Winkler, W. E. (1989). Frequency-based matching in Fellegi-Sunter model of record linkage. Proceedings of the Section on Survey Research Methods, American Statistical Association, 778–783.
22. Winkler, W. E. (1994). Advanced methods for record linkage. Proceedings of the Section on Survey Research Methods, American Statistical Association, 274–279.
23. Winkler, W. E. (1995). Matching and record linkage. In B. G. Cox, D. A. Binder, B. N. Chinnappa, M. J. Christianson, M. J. Colledge & P. S. Kott (Eds.), Business survey methods. New York: John Wiley & Sons.
24. Winkler, W. E. (1999). The state of record linkage and current research problems: U.S. Bureau of the Census Statistical Research Division.
25. Winkler, W. E. (2001). Record linkage software and methods for merging administrative lists (No. Statistical Research Report Series No. RR/2001/03). Washington, D.C.: US Bureau of the Census.
Metadata
Title
Salt: Scalable Automated Linking Technology for Data-Intensive Computing
Written by
Anthony M. Middleton
David Alan Bayliss
Copyright year
2011
Publisher
Springer New York
DOI
https://doi.org/10.1007/978-1-4614-1415-5_8