Skip to main content

2025 | OriginalPaper | Buchkapitel

CRAWD: Sampling-Based Estimation of Count-Distinct SPARQL Queries

verfasst von : Thi Hoang Thi Pham, Pascal Molli, Brice Nédelec, Hala Skaf-Molli, Julien Aimonier-Davat

Erschienen in: The Semantic Web – ISWC 2024

Verlag: Springer Nature Switzerland

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Count-distinct SPARQL queries compute the number of unique values in the results of a query executed on a Knowledge Graph. However, counting the exact number of distinct values is often computationally demanding and time-consuming. As a result, these queries often fail on public SPARQL endpoints due to fair use policies. In this paper, we propose CRAWD, a new sampling-based approach designed to approximate count-distinct SPARQL queries. CRAWD significantly improves sampling efficiency and allows feasible execution of count-distinct SPARQL queries on public SPARQL endpoints, considerably improving existing methods.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Fußnoten
1
Both engines crash after 5 h, running out of memory.
 
3
We were unable to compute frequencies of distinct subjects and objects for WDBench with our computing resources.
 
Literatur
1.
Zurück zum Zitat Aimonier-Davat, J., Nédelec, B., Dang, M.H., Molli, P., Skaf-Molli, H.: RAW-JENA: approximate query processing for SPARQL endpoints. In: 22nd International Semantic Web Conference, ISWC 2023, Athens, Greece, 6–10 November 2023 (2023) Aimonier-Davat, J., Nédelec, B., Dang, M.H., Molli, P., Skaf-Molli, H.: RAW-JENA: approximate query processing for SPARQL endpoints. In: 22nd International Semantic Web Conference, ISWC 2023, Athens, Greece, 6–10 November 2023 (2023)
2.
Zurück zum Zitat Aimonier-Davat, J., Skaf-Molli, H., Molli, P., Grall, A., Minier, T.: Online approximative SPARQL query processing for COUNT-DISTINCT queries with Web Preemption. Semantic Web - Interoperability, Usability, Applicability (2022) Aimonier-Davat, J., Skaf-Molli, H., Molli, P., Grall, A., Minier, T.: Online approximative SPARQL query processing for COUNT-DISTINCT queries with Web Preemption. Semantic Web - Interoperability, Usability, Applicability (2022)
5.
Zurück zum Zitat Chao, A., Lee, S.M.: Estimating the number of classes via sample coverage. J. Am. Stat. Assoc. 87(417), 210–217 (1992)MathSciNetCrossRef Chao, A., Lee, S.M.: Estimating the number of classes via sample coverage. J. Am. Stat. Assoc. 87(417), 210–217 (1992)MathSciNetCrossRef
6.
Zurück zum Zitat Flajolet, P., Fusy, É., Gandouet, O., Meunier, F.: HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In: Conference on Analysis of Algorithms, AofA 07, DMTCS Proceedings, AH, pp. 137–156. Discrete Mathematics and Theoretical Computer Science, Juan les Pins, France (2007) Flajolet, P., Fusy, É., Gandouet, O., Meunier, F.: HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In: Conference on Analysis of Algorithms, AofA 07, DMTCS Proceedings, AH, pp. 137–156. Discrete Mathematics and Theoretical Computer Science, Juan les Pins, France (2007)
8.
Zurück zum Zitat Haas, P.J., Naughton, J.F., Seshadri, S., Stokes, L.: Sampling-based estimation of the number of distinct values of an attribute. In: Proceedings of VLDB’95: 21th International Conference on Very Large Data Bases, Zurich, Switzerland, 11–15 September 1995, pp. 311–322. Morgan Kaufmann (1995) Haas, P.J., Naughton, J.F., Seshadri, S., Stokes, L.: Sampling-based estimation of the number of distinct values of an attribute. In: Proceedings of VLDB’95: 21th International Conference on Very Large Data Bases, Zurich, Switzerland, 11–15 September 1995, pp. 311–322. Morgan Kaufmann (1995)
9.
Zurück zum Zitat Hasnain, A., Mehmood, Q., e Zainab, S.S., Hogan, A.: SPORTAL: profiling the content of public SPARQL endpoints. Int. J. Semant. Web Inf. Syst. 12(3), 134–163 (2016) Hasnain, A., Mehmood, Q., e Zainab, S.S., Hogan, A.: SPORTAL: profiling the content of public SPARQL endpoints. Int. J. Semant. Web Inf. Syst. 12(3), 134–163 (2016)
10.
Zurück zum Zitat Heule, S., Nunkesser, M., Hall, A.: HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In: Proceedings of the 16th International Conference on Extending Database Technology, pp. 683–692 (2013) Heule, S., Nunkesser, M., Hall, A.: HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In: Proceedings of the 16th International Conference on Extending Database Technology, pp. 683–692 (2013)
11.
Zurück zum Zitat Horvitz, D.G., Thompson, D.J.: A generalization of sampling without replacement from a finite universe. J. Am. Stat. Assoc. 47(260), 663–685 (1952)MathSciNetCrossRef Horvitz, D.G., Thompson, D.J.: A generalization of sampling without replacement from a finite universe. J. Am. Stat. Assoc. 47(260), 663–685 (1952)MathSciNetCrossRef
12.
Zurück zum Zitat Kaminski, M., Kostylev, E.V., Grau, B.C.: Query nesting, assignment, and aggregation in SPARQL 1.1. ACM Trans. Database Syst. 42(3), 1–46 (2017) Kaminski, M., Kostylev, E.V., Grau, B.C.: Query nesting, assignment, and aggregation in SPARQL 1.1. ACM Trans. Database Syst. 42(3), 1–46 (2017)
13.
Zurück zum Zitat , Li, F., Wu, B., Yi, K., Zhao, Z.: Wander Join: online aggregation via random walks. In: Proceedings of the 2016 International Conference on Management of Data, SIGMOD 2016, pp. 615–629. ACM, New York (2016) , Li, F., Wu, B., Yi, K., Zhao, Z.: Wander Join: online aggregation via random walks. In: Proceedings of the 2016 International Conference on Management of Data, SIGMOD 2016, pp. 615–629. ACM, New York (2016)
14.
Zurück zum Zitat Li, K., Li, G.: Approximate query processing: what is new and where to go? Data Sci. Eng. 3(4), 379–397 (2018)CrossRef Li, K., Li, G.: Approximate query processing: what is new and where to go? Data Sci. Eng. 3(4), 379–397 (2018)CrossRef
15.
Zurück zum Zitat Maillot, P., Corby, O., Faron, C., Gandon, F., Michel, F.: IndeGx: a model and a framework for indexing RDF knowledge graphs with SPARQL-based test suits. J. Web Semant. 76, 100775 (2023)CrossRef Maillot, P., Corby, O., Faron, C., Gandon, F., Michel, F.: IndeGx: a model and a framework for indexing RDF knowledge graphs with SPARQL-based test suits. J. Web Semant. 76, 100775 (2023)CrossRef
16.
Zurück zum Zitat Park, Y., Ko, S., Bhowmick, S.S., Kim, K., Hong, K., Han, W.: G-CARE: a framework for performance benchmarking of cardinality estimation techniques for subgraph matching. In: International Conference on Management of Data, SIGMOD Conference 2020, pp. 1099–1114. ACM (2020) Park, Y., Ko, S., Bhowmick, S.S., Kim, K., Hong, K., Han, W.: G-CARE: a framework for performance benchmarking of cardinality estimation techniques for subgraph matching. In: International Conference on Management of Data, SIGMOD Conference 2020, pp. 1099–1114. ACM (2020)
17.
Zurück zum Zitat Pérez, J., Arenas, M., Gutiérrez, C.: Semantics and complexity of SPARQL. ACM Trans. Database Syst. 34(3), 1–45 (2009)CrossRef Pérez, J., Arenas, M., Gutiérrez, C.: Semantics and complexity of SPARQL. ACM Trans. Database Syst. 34(3), 1–45 (2009)CrossRef
18.
Zurück zum Zitat Pham, T.H.T., Skaf-Molli, H., Molli, P., Nédelec, B.: Online sampling of summaries from public SPARQL endpoints. In: Companion Proceedings of the ACM on Web Conference 2024, WWW 2024, pp. 617–620, ACM, New York (2024) Pham, T.H.T., Skaf-Molli, H., Molli, P., Nédelec, B.: Online sampling of summaries from public SPARQL endpoints. In: Companion Proceedings of the ACM on Web Conference 2024, WWW 2024, pp. 617–620, ACM, New York (2024)
19.
Zurück zum Zitat Saleem, M., Hasnain, A., Ngomo, A.C.N.: LargeRDFBench: a billion triples benchmark for SPARQL endpoint federation. J. Web Semant. 48, 85–125 (2018)CrossRef Saleem, M., Hasnain, A., Ngomo, A.C.N.: LargeRDFBench: a billion triples benchmark for SPARQL endpoint federation. J. Web Semant. 48, 85–125 (2018)CrossRef
20.
Zurück zum Zitat Särndal, C.E., Swensson, B., Wretman, J.: Model Assisted Survey Sampling. Springer, New York (2003) Särndal, C.E., Swensson, B., Wretman, J.: Model Assisted Survey Sampling. Springer, New York (2003)
22.
Zurück zum Zitat Wu, R., et al.: Learning to be a statistician: learned estimator for number of distinct values. Proc. VLDB Endow. 15(2), 272–284 (2021)CrossRef Wu, R., et al.: Learning to be a statistician: learned estimator for number of distinct values. Proc. VLDB Endow. 15(2), 272–284 (2021)CrossRef
23.
Zurück zum Zitat Zhao, Z., Christensen, R., Li, F., Hu, X., Yi, K.: Random sampling over joins revisited. In: Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, 10–15 June 2018, pp. 1525–1539. ACM (2018) Zhao, Z., Christensen, R., Li, F., Hu, X., Yi, K.: Random sampling over joins revisited. In: Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, 10–15 June 2018, pp. 1525–1539. ACM (2018)
Metadaten
Titel
CRAWD: Sampling-Based Estimation of Count-Distinct SPARQL Queries
verfasst von
Thi Hoang Thi Pham
Pascal Molli
Brice Nédelec
Hala Skaf-Molli
Julien Aimonier-Davat
Copyright-Jahr
2025
DOI
https://doi.org/10.1007/978-3-031-77850-6_6