
2018 | Original Paper | Book Chapter

Duplicate Resource Detection in RDF Datasets Using Hadoop and MapReduce

Authors: Kumar Sharma, Ujjal Marjit, Utpal Biswas

Published in: Advances in Electronics, Communication and Computing

Publisher: Springer Singapore


Abstract

In the Semantic Web community, many approaches have evolved for generating RDF (Resource Description Framework) resources. However, they often capture duplicate resources, which are stored without being eliminated. As a consequence, duplicate resources reduce data quality and unnecessarily increase the size of the dataset. We propose an approach for detecting duplicate resources in RDF datasets using Hadoop and the MapReduce framework. RDF resources are compared using similarity metrics defined at the resource level, the RDF statement level, and the object level. We assess performance with evaluation metrics, and the experimental results indicate that the proposed approach is accurate, effective, and efficient.
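The three comparison levels named in the abstract can be illustrated with a small in-process sketch. It is not the authors' implementation: the sample triples, the use of the Jaccard coefficient at the object level, the mean best-match aggregation at the resource level, and the 0.8 threshold are all assumptions made for illustration. The map, shuffle, and reduce steps are simulated with plain Python functions rather than the Hadoop API.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample triples (subject, predicate, object); not from the paper.
TRIPLES = [
    ("ex:r1", "ex:name", "Alan Turing"),
    ("ex:r1", "ex:born", "1912"),
    ("ex:r2", "ex:name", "Alan Turing"),
    ("ex:r2", "ex:born", "1912"),
    ("ex:r3", "ex:name", "Grace Hopper"),
    ("ex:r3", "ex:born", "1906"),
]


def jaccard(a, b):
    """Jaccard coefficient of two sets; 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def object_similarity(o1, o2):
    """Object level (assumed metric): token overlap of two literal values."""
    return jaccard(set(o1.lower().split()), set(o2.lower().split()))


def statement_similarity(s1, s2):
    """Statement level: predicates must match, then the objects are compared."""
    (p1, o1), (p2, o2) = s1, s2
    return object_similarity(o1, o2) if p1 == p2 else 0.0


def resource_similarity(stmts1, stmts2):
    """Resource level (assumed): mean best-match statement score, symmetric."""
    def directed(xs, ys):
        return sum(max(statement_similarity(x, y) for y in ys) for x in xs) / len(xs)
    return min(directed(stmts1, stmts2), directed(stmts2, stmts1))


def map_phase(triples):
    """Map: emit (subject, (predicate, object)) for every triple."""
    for s, p, o in triples:
        yield s, (p, o)


def shuffle(pairs):
    """Group mapped values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(resources, threshold=0.8):
    """Reduce: compare every pair of resources and keep likely duplicates."""
    for (r1, s1), (r2, s2) in combinations(sorted(resources.items()), 2):
        score = resource_similarity(s1, s2)
        if score >= threshold:
            yield r1, r2, score


duplicates = list(reduce_phase(shuffle(map_phase(TRIPLES))))
print(duplicates)
```

Comparing all pairs in one reduce step is quadratic in the number of resources. A distributed job would more plausibly key the map output on some blocking attribute so that each reducer only compares resources within its own group; that choice is likewise an assumption here, not something the abstract specifies.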


Metadata
Title
Duplicate Resource Detection in RDF Datasets Using Hadoop and MapReduce
Authors
Kumar Sharma
Ujjal Marjit
Utpal Biswas
Copyright year
2018
Publisher
Springer Singapore
DOI
https://doi.org/10.1007/978-981-10-4765-7_26
