2018 | OriginalPaper | Chapter

Randomness of Data Quality Artifacts

Abstract

Data quality is often measured by counting artifacts. While this procedure is very simple and applies to many types of artifacts, such as errors, inconsistencies and missing values, counts do not distinguish between different distributions of data artifacts. A possible solution is to add a randomness measure that indicates how randomly data artifacts are distributed. Although it has been proposed to calculate randomness by means of the Lempel-Ziv complexity algorithm, this approach has some drawbacks. Most importantly, the Lempel-Ziv approach assumes that there is some implicit order among data objects, and the measured randomness depends on this order. To overcome this problem, a new method is proposed which measures randomness in proportion to the average number of bits needed to compress, by means of unary coding, the bit matrix that matches the artifacts in a database relation. It is shown that this method has several interesting properties that align the proposed measurement procedure with the intuitive perception of randomness.
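
The order dependence mentioned in the abstract can be illustrated with a small sketch. It uses an LZ78-style phrase count as a stand-in for Lempel-Ziv complexity. The abstract does not say which Lempel-Ziv variant is meant, so the parsing scheme, the function name `lz78_phrase_count` and the example bit strings are all assumptions made for illustration, not the authors' procedure. Each bit marks whether a row of a relation contains an artifact (1) or not (0).

```python
def lz78_phrase_count(bits: str) -> int:
    """Count the phrases in an LZ78-style parsing of a bit string.

    Hypothetical helper: a simple proxy for Lempel-Ziv complexity,
    not necessarily the variant referred to in the chapter.
    """
    seen = set()   # phrases encountered so far
    phrase = ""    # phrase currently being extended
    count = 0
    for b in bits:
        phrase += b
        if phrase not in seen:
            seen.add(phrase)
            count += 1
            phrase = ""
    if phrase:
        count += 1  # trailing phrase that was already in the dictionary
    return count


# The same 16 rows, 8 of which contain an artifact, in two different orders.
clustered = "1111111100000000"
alternating = "1010101010101010"

# A plain artifact count cannot tell the two orderings apart ...
print(clustered.count("1"), alternating.count("1"))

# ... whereas the Lempel-Ziv style measure changes with the row order.
print(lz78_phrase_count(clustered), lz78_phrase_count(alternating))
```

Both orderings contain the same rows, yet the phrase counts differ. Since a relation is a set of tuples with no inherent order, a measure that changes when rows are permuted is problematic, which is the concern the abstract raises.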

Footnotes
1
The paper in question deals with the correctness of data, but we aim at a more general approach here.
 
2
In [8], Fisher et al. also add a parameter indicating the probability distribution of the errors. This is beyond the scope of this paper.
 
3
Determining that data is in error is difficult, while determining that it is missing is easy.
 
Literature
2.
Bronselaer, A., De Mol, R., De Tré, G.: A measure-theoretic foundation for data quality. IEEE Trans. Fuzzy Syst. (2017) (published online)
3.
Chaitin, G.: Randomness and mathematical proof. Sci. Am. 232(5), 47–52 (1975)
4.
Codd, E.F.: A relational model of data for large shared data banks. Commun. ACM 13(6), 377–387 (1970)
5.
Even, A., Shankaranarayanan, G.: Utility-driven assessment of data quality. ACM SIGMIS Database 38(2), 75–93 (2007)
6.
Even, A., Shankaranarayanan, G.: Dual assessment of data quality in customer databases. J. Data Inf. Qual. 1(3), 15:1–15:29 (2009)
7.
Falk, R., Konold, C.: Making sense of randomness: Implicit encoding as a basis for judgment. Psychol. Rev. 104(2), 301–318 (1997)
9.
Haegemans, T., Reusens, M., Baesens, B., Lemahieu, W., Snoeck, M.: Towards a visual approach to aggregate data quality measurements. In: Proceedings of the International Conference on Information Quality (Accepted 2017)
10.
Haegemans, T., Snoeck, M., Lemahieu, W.: Towards a precise definition of data accuracy and a justification for its measure. In: Proceedings of the International Conference on Information Quality, pp. 16:1–16:13 (2016)
11.
Haegemans, T., Snoeck, M., Lemahieu, W., Stumpe, F., Goderis, A.: Towards a theoretical framework to explain root causes of errors in manually acquired data. In: Proceedings of the International Conference on Information Quality, pp. 15:1–15:10 (2016)
12.
Heinrich, B., Klier, M.: Metric-based data quality assessment–developing and evaluating a probability-based currency metric. Decis. Support Syst. 72, 82–96 (2015)
14.
Krantz, D.H., Luce, D.R., Suppes, P., Tversky, A.: Foundations of Measurement Volume I: Additive and Polynomial Representations. Academic Press, New York (1971)
15.
Lee, Y.W., Pipino, L.L., Funk, J.D., Wang, R.Y.: Journey to Data Quality. MIT Press, Cambridge (2006)
16.
17.
Leszak, M., Perry, D.E., Stoll, D.: A case study in root cause defect analysis. In: Proceedings of the 22nd International Conference on Software Engineering, pp. 428–437. IEEE (2000)
19.
Pearson, K.: On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that can be reasonably supposed to have arisen from random sampling. Phil. Mag. 50, 157–175 (1900)
20.
Pipino, L.L., Lee, Y.W., Wang, R.Y.: Data quality assessment. Commun. ACM 45(4), 211–218 (2002)
21.
Pipino, L.L., Wang, R.Y., Kopcso, D., Rybolt, W.: Developing measurement scales for data-quality dimensions. In: Wang, R.Y., Pierce, E.M., Madnick, S.E., Fisher, C.W. (eds.) Information Quality, chap. 3, pp. 37–51. M.E. Sharpe (2005)
22.
Redman, T.C.: Data Quality for the Information Age, 1st edn. Artech House Inc., Norwood (1997)
23.
Rissanen, J.: Modeling by shortest data description. Automatica 14, 465–471 (1978)
24.
Sayood, K.: Introduction to Data Compression. Morgan Kaufmann Series in Multimedia Information and Systems, 3rd edn. Morgan Kaufmann Publishers Inc., San Francisco (2005)
26.
Wang, R.Y.: A product perspective on total data quality management. Commun. ACM 41(2), 58–65 (1998)
27.
Wang, R.Y., Storey, V.C., Firth, C.P.: A framework for analysis of data quality research. IEEE Trans. Knowl. Data Eng. 7(4), 623–640 (1995)
28.
Wilson, P., Dell, L., Anderson, G.: Root Cause Analysis: A Tool for Total Quality Management. ASQ Quality Press, Milwaukee (1993)
Metadata
Title
Randomness of Data Quality Artifacts
Authors
Toon Boeckling
Antoon Bronselaer
Guy De Tré
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-319-91479-4_44
