Skip to main content
Top

2025 | OriginalPaper | Chapter

CRAWD: Sampling-Based Estimation of Count-Distinct SPARQL Queries

Authors : Thi Hoang Thi Pham, Pascal Molli, Brice Nédelec, Hala Skaf-Molli, Julien Aimonier-Davat

Published in: The Semantic Web – ISWC 2024

Publisher: Springer Nature Switzerland

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

Count-distinct SPARQL queries compute the number of unique values in the results of a query executed on a Knowledge Graph. However, counting the exact number of distinct values is often computationally demanding and time-consuming. As a result, these queries often fail on public SPARQL endpoints due to fair use policies. In this paper, we propose CRAWD, a new sampling-based approach designed to approximate count-distinct SPARQL queries. CRAWD significantly improves sampling efficiency and allows feasible execution of count-distinct SPARQL queries on public SPARQL endpoints, considerably improving existing methods.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Footnotes
1
Both engines crash after 5 h, running out of memory.
 
3
We were unable to compute frequencies of distinct subjects and objects for WDBench with our computing resources.
 
Literature
1.
go back to reference Aimonier-Davat, J., Nédelec, B., Dang, M.H., Molli, P., Skaf-Molli, H.: RAW-JENA: approximate query processing for SPARQL endpoints. In: 22nd International Semantic Web Conference, ISWC 2023, Athens, Greece, 6–10 November 2023 (2023) Aimonier-Davat, J., Nédelec, B., Dang, M.H., Molli, P., Skaf-Molli, H.: RAW-JENA: approximate query processing for SPARQL endpoints. In: 22nd International Semantic Web Conference, ISWC 2023, Athens, Greece, 6–10 November 2023 (2023)
2.
go back to reference Aimonier-Davat, J., Skaf-Molli, H., Molli, P., Grall, A., Minier, T.: Online approximative SPARQL query processing for COUNT-DISTINCT queries with Web Preemption. Semantic Web - Interoperability, Usability, Applicability (2022) Aimonier-Davat, J., Skaf-Molli, H., Molli, P., Grall, A., Minier, T.: Online approximative SPARQL query processing for COUNT-DISTINCT queries with Web Preemption. Semantic Web - Interoperability, Usability, Applicability (2022)
5.
go back to reference Chao, A., Lee, S.M.: Estimating the number of classes via sample coverage. J. Am. Stat. Assoc. 87(417), 210–217 (1992)MathSciNetCrossRef Chao, A., Lee, S.M.: Estimating the number of classes via sample coverage. J. Am. Stat. Assoc. 87(417), 210–217 (1992)MathSciNetCrossRef
6.
go back to reference Flajolet, P., Fusy, É., Gandouet, O., Meunier, F.: HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In: Conference on Analysis of Algorithms, AofA 07, DMTCS Proceedings, AH, pp. 137–156. Discrete Mathematics and Theoretical Computer Science, Juan les Pins, France (2007) Flajolet, P., Fusy, É., Gandouet, O., Meunier, F.: HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In: Conference on Analysis of Algorithms, AofA 07, DMTCS Proceedings, AH, pp. 137–156. Discrete Mathematics and Theoretical Computer Science, Juan les Pins, France (2007)
8.
go back to reference Haas, P.J., Naughton, J.F., Seshadri, S., Stokes, L.: Sampling-based estimation of the number of distinct values of an attribute. In: Proceedings of VLDB’95: 21th International Conference on Very Large Data Bases, Zurich, Switzerland, 11–15 September 1995, pp. 311–322. Morgan Kaufmann (1995) Haas, P.J., Naughton, J.F., Seshadri, S., Stokes, L.: Sampling-based estimation of the number of distinct values of an attribute. In: Proceedings of VLDB’95: 21th International Conference on Very Large Data Bases, Zurich, Switzerland, 11–15 September 1995, pp. 311–322. Morgan Kaufmann (1995)
9.
go back to reference Hasnain, A., Mehmood, Q., e Zainab, S.S., Hogan, A.: SPORTAL: profiling the content of public SPARQL endpoints. Int. J. Semant. Web Inf. Syst. 12(3), 134–163 (2016) Hasnain, A., Mehmood, Q., e Zainab, S.S., Hogan, A.: SPORTAL: profiling the content of public SPARQL endpoints. Int. J. Semant. Web Inf. Syst. 12(3), 134–163 (2016)
10.
go back to reference Heule, S., Nunkesser, M., Hall, A.: HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In: Proceedings of the 16th International Conference on Extending Database Technology, pp. 683–692 (2013) Heule, S., Nunkesser, M., Hall, A.: HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In: Proceedings of the 16th International Conference on Extending Database Technology, pp. 683–692 (2013)
11.
go back to reference Horvitz, D.G., Thompson, D.J.: A generalization of sampling without replacement from a finite universe. J. Am. Stat. Assoc. 47(260), 663–685 (1952)MathSciNetCrossRef Horvitz, D.G., Thompson, D.J.: A generalization of sampling without replacement from a finite universe. J. Am. Stat. Assoc. 47(260), 663–685 (1952)MathSciNetCrossRef
12.
go back to reference Kaminski, M., Kostylev, E.V., Grau, B.C.: Query nesting, assignment, and aggregation in SPARQL 1.1. ACM Trans. Database Syst. 42(3), 1–46 (2017) Kaminski, M., Kostylev, E.V., Grau, B.C.: Query nesting, assignment, and aggregation in SPARQL 1.1. ACM Trans. Database Syst. 42(3), 1–46 (2017)
13.
go back to reference , Li, F., Wu, B., Yi, K., Zhao, Z.: Wander Join: online aggregation via random walks. In: Proceedings of the 2016 International Conference on Management of Data, SIGMOD 2016, pp. 615–629. ACM, New York (2016) , Li, F., Wu, B., Yi, K., Zhao, Z.: Wander Join: online aggregation via random walks. In: Proceedings of the 2016 International Conference on Management of Data, SIGMOD 2016, pp. 615–629. ACM, New York (2016)
14.
go back to reference Li, K., Li, G.: Approximate query processing: what is new and where to go? Data Sci. Eng. 3(4), 379–397 (2018)CrossRef Li, K., Li, G.: Approximate query processing: what is new and where to go? Data Sci. Eng. 3(4), 379–397 (2018)CrossRef
15.
go back to reference Maillot, P., Corby, O., Faron, C., Gandon, F., Michel, F.: IndeGx: a model and a framework for indexing RDF knowledge graphs with SPARQL-based test suits. J. Web Semant. 76, 100775 (2023)CrossRef Maillot, P., Corby, O., Faron, C., Gandon, F., Michel, F.: IndeGx: a model and a framework for indexing RDF knowledge graphs with SPARQL-based test suits. J. Web Semant. 76, 100775 (2023)CrossRef
16.
go back to reference Park, Y., Ko, S., Bhowmick, S.S., Kim, K., Hong, K., Han, W.: G-CARE: a framework for performance benchmarking of cardinality estimation techniques for subgraph matching. In: International Conference on Management of Data, SIGMOD Conference 2020, pp. 1099–1114. ACM (2020) Park, Y., Ko, S., Bhowmick, S.S., Kim, K., Hong, K., Han, W.: G-CARE: a framework for performance benchmarking of cardinality estimation techniques for subgraph matching. In: International Conference on Management of Data, SIGMOD Conference 2020, pp. 1099–1114. ACM (2020)
17.
go back to reference Pérez, J., Arenas, M., Gutiérrez, C.: Semantics and complexity of SPARQL. ACM Trans. Database Syst. 34(3), 1–45 (2009)CrossRef Pérez, J., Arenas, M., Gutiérrez, C.: Semantics and complexity of SPARQL. ACM Trans. Database Syst. 34(3), 1–45 (2009)CrossRef
18.
go back to reference Pham, T.H.T., Skaf-Molli, H., Molli, P., Nédelec, B.: Online sampling of summaries from public SPARQL endpoints. In: Companion Proceedings of the ACM on Web Conference 2024, WWW 2024, pp. 617–620, ACM, New York (2024) Pham, T.H.T., Skaf-Molli, H., Molli, P., Nédelec, B.: Online sampling of summaries from public SPARQL endpoints. In: Companion Proceedings of the ACM on Web Conference 2024, WWW 2024, pp. 617–620, ACM, New York (2024)
19.
go back to reference Saleem, M., Hasnain, A., Ngomo, A.C.N.: LargeRDFBench: a billion triples benchmark for SPARQL endpoint federation. J. Web Semant. 48, 85–125 (2018)CrossRef Saleem, M., Hasnain, A., Ngomo, A.C.N.: LargeRDFBench: a billion triples benchmark for SPARQL endpoint federation. J. Web Semant. 48, 85–125 (2018)CrossRef
20.
go back to reference Särndal, C.E., Swensson, B., Wretman, J.: Model Assisted Survey Sampling. Springer, New York (2003) Särndal, C.E., Swensson, B., Wretman, J.: Model Assisted Survey Sampling. Springer, New York (2003)
22.
go back to reference Wu, R., et al.: Learning to be a statistician: learned estimator for number of distinct values. Proc. VLDB Endow. 15(2), 272–284 (2021)CrossRef Wu, R., et al.: Learning to be a statistician: learned estimator for number of distinct values. Proc. VLDB Endow. 15(2), 272–284 (2021)CrossRef
23.
go back to reference Zhao, Z., Christensen, R., Li, F., Hu, X., Yi, K.: Random sampling over joins revisited. In: Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, 10–15 June 2018, pp. 1525–1539. ACM (2018) Zhao, Z., Christensen, R., Li, F., Hu, X., Yi, K.: Random sampling over joins revisited. In: Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, 10–15 June 2018, pp. 1525–1539. ACM (2018)
Metadata
Title
CRAWD: Sampling-Based Estimation of Count-Distinct SPARQL Queries
Authors
Thi Hoang Thi Pham
Pascal Molli
Brice Nédelec
Hala Skaf-Molli
Julien Aimonier-Davat
Copyright Year
2025
DOI
https://doi.org/10.1007/978-3-031-77850-6_6

Premium Partner