2018 | OriginalPaper | Book chapter

Randomness of Data Quality Artifacts

Written by: Toon Boeckling, Antoon Bronselaer, Guy De Tré

Published in: Information Processing and Management of Uncertainty in Knowledge-Based Systems. Applications

Publisher: Springer International Publishing

Abstract

Quality of data is often measured by counting artifacts. While this procedure is very simple and applicable to many different types of artifacts, such as errors, inconsistencies and missing values, counts do not differentiate between different distributions of data artifacts. A possible solution is to add a randomness measure that indicates how randomly data artifacts are distributed. It has been proposed to calculate randomness by means of the Lempel-Ziv complexity algorithm, but this approach has some drawbacks. Most importantly, the Lempel-Ziv approach assumes that there is some implicit order among data objects, and the measured randomness depends on this order. To overcome this problem, a new method is proposed which measures randomness in proportion to the average number of bits needed to compress, by means of unary coding, the bit matrix that marks the artifacts in a database relation. It is shown that this method has several interesting properties that align the proposed measurement procedure with the intuitive perception of randomness.
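
The order dependence mentioned in the abstract can be illustrated with a small sketch. This is not the chapter's method or the exact Lempel-Ziv complexity algorithm it refers to: it uses a simple LZ78-style phrase count as a stand-in, and the function name `lz78_phrase_count` and the two example bit strings are assumptions made for illustration only. Each bit marks whether a data object contains an artifact (1) or not (0).

```python
def lz78_phrase_count(bits: str) -> int:
    """Count the phrases of an LZ78-style parse of a bit string.

    Illustrative stand-in for a Lempel-Ziv complexity measure:
    a higher count suggests a less compressible sequence.
    """
    seen = set()   # phrases already in the dictionary
    phrase = ""    # phrase currently being extended
    count = 0
    for b in bits:
        phrase += b
        if phrase not in seen:
            # New phrase: record it and start a fresh one.
            seen.add(phrase)
            count += 1
            phrase = ""
    if phrase:
        # A trailing, already-seen phrase still counts as one phrase.
        count += 1
    return count


# Hypothetical artifact indicators for the same 16 data objects,
# listed in two different orders (4 artifacts in both cases).
clustered = "1111000000000000"
spread = "0001000100010001"

print(clustered.count("1"), lz78_phrase_count(clustered))
print(spread.count("1"), lz78_phrase_count(spread))
```

Both strings contain the same number of artifacts, yet the phrase counts differ, so a measure of this kind changes when the data objects are merely reordered. A relation has no inherent tuple order, which is the difficulty the abstract raises.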

Footnotes
1
The paper in question deals with correctness of data, but we aim at a more general approach here.
 
2
In [8], Fisher et al. also add a parameter indicating the probability distribution of the errors. This is out of scope of this paper.
 
3
Determining that data is in error is difficult, while determining that it is missing is easy.
 
References
2. Bronselaer, A., De Mol, R., De Tré, G.: A measure-theoretic foundation for data quality. IEEE Trans. Fuzzy Syst. (2017) (published online)
3. Chaitin, G.: Randomness and mathematical proof. Sci. Am. 232(5), 47–52 (1975)
4. Codd, E.F.: A relational model of data for large shared data banks. Commun. ACM 13(6), 377–387 (1970)
5. Even, A., Shankaranarayanan, G.: Utility-driven assessment of data quality. ACM SIGMIS Database 38(2), 75–93 (2007)
6. Even, A., Shankaranarayanan, G.: Dual assessment of data quality in customer databases. J. Data Inf. Qual. 1(3), 15:1–15:29 (2009)
7. Falk, R., Konold, C.: Making sense of randomness: Implicit encoding as a basis for judgment. Psychol. Rev. 104(2), 301–318 (1997)
9. Haegemans, T., Reusens, M., Baesens, B., Lemahieu, W., Snoeck, M.: Towards a visual approach to aggregate data quality measurements. In: Proceedings of the International Conference on Information Quality (Accepted 2017)
10. Haegemans, T., Snoeck, M., Lemahieu, W.: Towards a precise definition of data accuracy and a justification for its measure. In: Proceedings of the International Conference on Information Quality, pp. 16:1–16:13 (2016)
11. Haegemans, T., Snoeck, M., Lemahieu, W., Stumpe, F., Goderis, A.: Towards a theoretical framework to explain root causes of errors in manually acquired data. In: Proceedings of the International Conference on Information Quality, pp. 15:1–15:10 (2016)
12. Heinrich, B., Klier, M.: Metric-based data quality assessment–developing and evaluating a probability-based currency metric. Decis. Support Syst. 72, 82–96 (2015)
14. Krantz, D.H., Luce, D.R., Suppes, P., Tversky, A.: Foundations of Measurement Volume I: Additive and Polynomial Representations. Academic Press, New York (1971)
15. Lee, Y.W., Pipino, L.L., Funk, J.D., Wang, R.Y.: Journey to Data Quality. MIT Press, Cambridge (2006)
17. Leszak, M., Perry, D.E., Stoll, D.: A case study in root cause defect analysis. In: Proceedings of the 22nd International Conference on Software Engineering, pp. 428–437. IEEE (2000)
19. Pearson, K.: On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that can be reasonably supposed to have arisen from random sampling. Phil. Mag. 50, 157–175 (1900)
20. Pipino, L.L., Lee, Y.W., Wang, R.Y.: Data quality assessment. Commun. ACM 45(4), 211–218 (2002)
21. Pipino, L.L., Wang, R.Y., Kopcso, D., Rybolt, W.: Developing measurement scales for data-quality dimensions. In: Wang, R.Y., Pierce, E.M., Madnick, S.E., Fisher, C.W. (eds.) Information Quality, chap. 3, pp. 37–51. M.E. Sharpe (2005)
22. Redman, T.C.: Data Quality for the Information Age, 1st edn. Artech House Inc., Norwood (1997)
23. Rissanen, J.: Modeling by shortest data description. Automatica 14, 465–471 (1978)
24. Sayood, K.: Introduction to Data Compression. Morgan Kaufmann Series in Multimedia Information and Systems, 3rd edn. Morgan Kaufmann Publishers Inc., San Francisco (2005)
26. Wang, R.Y.: A product perspective on total data quality management. Commun. ACM 41(2), 58–65 (1998)
27. Wang, R.Y., Storey, V.C., Firth, C.P.: A framework for analysis of data quality research. IEEE Trans. Knowl. Data Eng. 7(4), 623–640 (1995)
28. Wilson, P., Dell, L., Anderson, G.: Root Cause Analysis: A Tool for Total Quality Management. ASQ Quality Press, Milwaukee (1993)
Metadata
Title
Randomness of Data Quality Artifacts
Written by
Toon Boeckling
Antoon Bronselaer
Guy De Tré
Copyright year
2018
DOI
https://doi.org/10.1007/978-3-319-91479-4_44