2014 | Original Paper | Book Chapter

Uncertainty Estimation and Analysis of Categorical Web Data

Written by: Davide Ceolin, Willem Robert van Hage, Wan Fokkink, Guus Schreiber

Published in: Uncertainty Reasoning for the Semantic Web III

Publisher: Springer International Publishing

Abstract

Web data often manifest high levels of uncertainty. We focus on categorical Web data and we represent these uncertainty levels as first- or second-order uncertainty. By means of concrete examples, we show how to quantify and handle these uncertainties using the Beta-Binomial and the Dirichlet-Multinomial models, as well as how to take into account possibly unseen categories in our samples by using the Dirichlet process. We conclude by exemplifying how these higher-order models can be used as a basis for analyzing datasets, once at least part of their uncertainty has been taken into account. We demonstrate how to use the Bhattacharyya statistical distance to quantify the similarity between Dirichlet distributions, and use such results to analyze a Web dataset of piracy attacks both visually and automatically.
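
As a rough illustration of the last point, the sketch below computes the Bhattacharyya distance between two Dirichlet distributions from the standard closed form D_B = (ln B(α) + ln B(β)) / 2 − ln B((α + β) / 2), where B is the multivariate Beta function. It is a minimal sketch and not the chapter's own implementation: the function names, the symmetric prior of 1 per category, and the two count vectors are assumptions made up for this example, not data from the piracy dataset.

```python
from math import lgamma


def log_multivariate_beta(alpha):
    """Log of the multivariate Beta function B(alpha)."""
    return sum(lgamma(a) for a in alpha) - lgamma(sum(alpha))


def dirichlet_posterior(counts, prior=1.0):
    """Dirichlet-Multinomial update: add observed counts to a symmetric prior.

    The symmetric prior value is an assumption of this sketch.
    """
    return [c + prior for c in counts]


def predictive_probabilities(alpha):
    """Probability that the next observation falls in each category."""
    total = sum(alpha)
    return [a / total for a in alpha]


def bhattacharyya_distance(alpha, beta):
    """Bhattacharyya distance between Dirichlet(alpha) and Dirichlet(beta)."""
    if len(alpha) != len(beta):
        raise ValueError("both Dirichlet distributions need the same categories")
    midpoint = [(a + b) / 2 for a, b in zip(alpha, beta)]
    return (
        0.5 * (log_multivariate_beta(alpha) + log_multivariate_beta(beta))
        - log_multivariate_beta(midpoint)
    )


# Hypothetical category counts for two samples (illustrative values only).
counts_a = [12, 3, 5]
counts_b = [2, 9, 9]

alpha_a = dirichlet_posterior(counts_a)
alpha_b = dirichlet_posterior(counts_b)

distance_ab = bhattacharyya_distance(alpha_a, alpha_b)
distance_aa = bhattacharyya_distance(alpha_a, alpha_a)
```

The distance is zero when both parameter vectors coincide and grows as the two posteriors put their mass on different categories, which is what makes it usable as a similarity measure between samples. Working with `lgamma` rather than the Gamma function itself keeps the computation stable for large counts.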

Metadata
Title
Uncertainty Estimation and Analysis of Categorical Web Data
Written by
Davide Ceolin
Willem Robert van Hage
Wan Fokkink
Guus Schreiber
Copyright year
2014
DOI
https://doi.org/10.1007/978-3-319-13413-0_14