2016 | Original Paper | Book chapter

Entity Resolution-Based Jaccard Similarity Coefficient for Heterogeneous Distributed Databases

Written by: Ramesh Dharavath, Abhishek Kumar Singh

Published in: Proceedings of the Second International Conference on Computer and Communication Technologies

Publisher: Springer India

Abstract

Entity resolution (ER) is the task of identifying records that refer to the same real-world entity. It is also known as data object matching or deduplication, and it has been a leading research topic in the field of structured databases. Because of its significance, entity resolution remains one of the most important challenges for heterogeneous distributed databases. Several methods have been proposed for entity resolution, but they have yielded unsatisfactory results. In this paper, we propose an efficient integrated solution to the entity resolution problem in heterogeneous distributed databases that combines Markov logic with the Jaccard similarity coefficient. The approach that we have implemented gives an overall success rate of about 98%, which is better than that of previously implemented algorithms.
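
For readers unfamiliar with the measure named in the abstract, the Jaccard similarity coefficient of two sets A and B is |A ∩ B| / |A ∪ B|. The following is a minimal sketch of how it could be applied to compare two records. It is not the authors' implementation and leaves out the Markov logic component entirely. The whitespace tokenization, the function names `tokenize`, `jaccard`, and `is_match`, and the 0.5 threshold are illustrative assumptions.

```python
def tokenize(record: str) -> set:
    """Lowercase a record string and split it on whitespace into a token set.

    Whitespace tokenization is an assumption made for this sketch only.
    """
    return set(record.lower().split())


def jaccard(a: set, b: set) -> float:
    """Return |A ∩ B| / |A ∪ B| for two sets.

    Two empty sets are treated as identical (1.0) to avoid division by zero;
    this convention is a choice of the sketch, not something the abstract states.
    """
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_match(record_a: str, record_b: str, threshold: float = 0.5) -> bool:
    """Flag two records as candidate duplicates when their token-set
    Jaccard coefficient reaches the (hypothetical) threshold."""
    return jaccard(tokenize(record_a), tokenize(record_b)) >= threshold
```

For example, the records "John A Smith" and "john smith" share two of three distinct tokens, so their coefficient is 2/3 and `is_match` would flag them as candidates at the assumed threshold.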

Metadata
Title
Entity Resolution-Based Jaccard Similarity Coefficient for Heterogeneous Distributed Databases
Written by
Ramesh Dharavath
Abhishek Kumar Singh
Copyright year
2016
Publisher
Springer India
DOI
https://doi.org/10.1007/978-81-322-2517-1_48