1 Introduction
2 Data and Information Quality
In the relation Movies of Table 1, the title of movie 3 contains a misspelling: Rman stands for Roman, thus causing an accuracy problem. Moreover, another accuracy problem is related to the exchange of the directors between movies 1 and 2: Weir is actually the director of movie 2 and Curtiz the director of movie 1. Other data quality problems are: a missing value for the director of movie 4, causing a completeness problem, and a 0 value for the number of remakes of movie 4, causing a currency problem, because a remake of the movie has actually been produced. Finally, there are two consistency problems: first, for movie 1, the value of LastRemakeYear cannot be lower than Year; second, for movie 4, the value of LastRemakeYear cannot be different from null, because the value of #Remakes is 0 (a rule-based check for these two constraints is sketched after Table 1).

Table 1: Movies with data quality problems
ID | Title | Director | Year | #Remakes | LastRemakeYear |
---|---|---|---|---|---|
1 | Casablanca | Weir | 1942 | 3 | 1940 |
2 | Dead poets society | Curtiz | 1989 | 0 | Null |
3 | Rman Holiday | Wylder | 1953 | 0 | Null |
4 | Sabrina | Null | 1964 | 0 | 1985 |
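These inter-attribute constraints can be expressed as simple rules and checked mechanically. Below is a minimal sketch in Python; the dictionary encoding of the tuples and the rule set are illustrative, not part of the original example:

```python
# Illustrative encoding of Table 1; each rule returns True when a tuple
# satisfies the corresponding consistency constraint.
movies = [
    {"ID": 1, "Title": "Casablanca", "Director": "Weir",
     "Year": 1942, "Remakes": 3, "LastRemakeYear": 1940},
    {"ID": 2, "Title": "Dead poets society", "Director": "Curtiz",
     "Year": 1989, "Remakes": 0, "LastRemakeYear": None},
    {"ID": 3, "Title": "Rman Holiday", "Director": "Wylder",
     "Year": 1953, "Remakes": 0, "LastRemakeYear": None},
    {"ID": 4, "Title": "Sabrina", "Director": None,
     "Year": 1964, "Remakes": 0, "LastRemakeYear": 1985},
]

rules = [
    # LastRemakeYear, when present, cannot be lower than Year.
    lambda t: t["LastRemakeYear"] is None or t["LastRemakeYear"] >= t["Year"],
    # If #Remakes is 0, LastRemakeYear must be null.
    lambda t: t["Remakes"] != 0 or t["LastRemakeYear"] is None,
]

for t in movies:
    violated = [i for i, rule in enumerate(rules) if not rule(t)]
    if violated:
        print(f"movie {t['ID']} violates rule(s) {violated}")
# movie 1 violates rule(s) [0]
# movie 4 violates rule(s) [1]
```

Note that, as discussed below, such a check detects a violation but does not localize it: for movie 1, either Year or LastRemakeYear could be the wrong value.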
This example allows us to make some general observations:

- Data quality is a multifaceted concept, and different dimensions concur to define it;
- Quality problems related to some dimensions, such as accuracy, can be easily detected in some cases (e.g., misspellings) but are more difficult to detect in other cases (e.g., where admissible but not correct values are provided);
- A simple example of a completeness error has been shown, but, as happens with accuracy, completeness can also be very difficult to evaluate (e.g., if a tuple representing a movie is entirely missing from the relation Movies);
- Consistency detection does not always localize the errors (e.g., for movie 1, either the value of Year or the value of LastRemakeYear is wrong, but we cannot tell which).
2.1 On the Definition and Measurement of Information Quality: Dimensions and Metrics
Cluster | Information type |
---|---|
Accuracy | Structured data |
Completeness | Structured data |
Consistency | Structured data |
Redundancy | Linked data—structured Web data |
Readability | Texts—unstructured data |
Accessibility | Web sites’ data |
Trust | Web data sources |
Usefulness | Images |
2.1.1 The Accuracy Cluster
Accuracy is defined as the closeness between a data value \(v\) and a data value \(v'\), considered as the correct representation of the real-life phenomenon that \(v\) aims to represent. As an example, if the name of a person is John, the value \(v'\) = John is correct, while the value \(v\) = Jhn is incorrect. The world around us changes (velocity is one of the 3 V's of big data), and what we have referred to in the above definition as "the real-life phenomenon that the data value \(v\) aims to represent" reflects such changes. So, there is a particular yet relevant type of data accuracy that refers to the rapidity with which a change in the real-world phenomenon is reflected in the update of the data value; we call this temporal accuracy, in contrast to structural accuracy (or, simply, accuracy), which characterizes the accuracy of data as observed in a specific time frame, where the data value can be considered stable and unchanged. In the following, we consider first structural accuracy and later temporal accuracy. Two kinds of (structural) accuracy can be identified, namely syntactic accuracy and semantic accuracy.
Syntactic accuracy is defined as the closeness of the value \(v\) to the elements of the corresponding definition domain \(D\). In syntactic accuracy, we are not interested in comparing \(v\) with the true value \(v'\); rather, we are interested in checking whether \(v\) belongs to \(D\), whatever it is. So, if \(v\) = Jack, even if \(v'\) = John, \(v\) is considered syntactically correct, as Jack is an admissible value in the domain of persons' names. Syntactic accuracy is measured by means of functions, called comparison functions, that evaluate the distance between \(v\) and the values in \(D\). The edit distance is a simple example of a comparison function, taking into account the minimum number of character insertions, deletions, and replacements needed to convert a string \(s\) to a string \(s'\). More complex comparison functions exist, e.g., taking into account similar sounds or character transpositions (see [8]).
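As an illustration, below is a minimal sketch of the edit (Levenshtein) distance and of its use to find the closest admissible value in a domain; the domain of titles and the flagging threshold are illustrative:

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of character insertions, deletions,
    and replacements needed to convert s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # replacement
        prev = curr
    return prev[-1]

# Illustrative domain of movie titles: "Rman Holiday" is one edit away
# from "Roman Holiday", so it is flagged as a likely syntactic error.
domain = {"Casablanca", "Roman Holiday", "Sabrina"}
value = "Rman Holiday"
closest = min(domain, key=lambda d: edit_distance(value, d))
print(closest, edit_distance(value, closest))  # Roman Holiday 1
```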
Semantic accuracy is defined as the closeness of the value \(v\) to the true value \(v'\). Let us consider again the relation Movies of Table 1. The exchange of directors' names in tuples 1 and 2 is an example of a semantic accuracy error. Indeed, for movie 1, a director named Curtiz would be admissible, and thus it is syntactically correct. Nevertheless, Curtiz is not the director of Casablanca; therefore, a semantic accuracy error occurs. The above examples clearly show the difference between syntactic and semantic accuracy. Note that, while it is reasonable to measure syntactic accuracy using a distance function, semantic accuracy is better measured with a \(<\)yes, no\(>\) or a \(<\)correct, not correct\(>\) domain. Consequently, semantic accuracy coincides with the concept of correctness. In contrast with what happens for syntactic accuracy, in order to measure the semantic accuracy of a value \(v\), the corresponding true value has to be known, or else it should be possible, by considering additional knowledge, to infer whether that value \(v\) is or is not the true value. In a general context, a technique for checking semantic accuracy consists of looking for the same data in different data sources and finding the correct data by comparisons. This latter approach also requires the solution of an object identification problem, i.e., the problem of understanding whether two tuples refer to the same real-world entity or not [14].
- Currency concerns how promptly data are updated with respect to changes that occurred in the real world. As an example, in Table 1 the attribute #Remakes of movie 4 has low currency because a remake of movie 4 has been made, but this information did not result in an increased value for the number of remakes. Similarly, if the residential address of a person is up to date, i.e., it corresponds to the address where the person actually lives, then currency is high.
- Volatility characterizes the frequency with which data vary in time. For instance, stable data such as birth dates have volatility equal to 0, as they do not vary at all. Conversely, stock quotes, a kind of frequently changing data, have a high degree of volatility, because they remain valid for very short time intervals.
- Timeliness expresses how current data are for the task at hand. The timeliness dimension is motivated by the fact that it is possible to have current data that are actually useless because they are late for a specific usage. For instance, the timetable for university courses is current if it contains the most recent data, but it is not timely if it is available only after the start of classes.
2.1.2 The Completeness Cluster
Different types of completeness can be defined in the relational model:

- a value completeness, to capture the presence of null values for some fields of a tuple;
- a tuple completeness, to characterize the completeness of a tuple with respect to the values of all its fields;
- an attribute completeness, to measure the number of null values of a specific attribute in a relation;
- a relation completeness, to capture the presence of null values in a whole relation.
In Table 3, a Student relation is shown. The tuple completeness evaluates the percentage of specified values in the tuple with respect to the total number of attributes of the tuple itself. Therefore, in the example, the tuple completeness is 1 for tuples 6754 and 8907, 0.8 for tuple 6578, 0.6 for tuple 0987, and so on. A possible way of interpreting tuple completeness is as the ratio between the information content of the tuple and its maximum potential information content. With reference to this interpretation, we are implicitly assuming that all values of the tuple contribute equally to its total information content. Of course, this may not be the case, as different applications can weight the attributes of a tuple differently. Attribute completeness is useful, for example, when aggregate values must be computed over an attribute: a null value for the Vote attribute simply implies a deviation in the calculation of the average; therefore, a characterization of Vote completeness may be useful.

Table 3: Student relation exemplifying the completeness of tuples, attributes, and relations

Student ID | Name | Surname | Vote | Examination date |
---|---|---|---|---|
6754 | Mike | Collins | 29 | 07/17/2004 |
8907 | Anne | Herbert | 18 | 07/17/2004 |
6578 | Julianne | Merrals | Null | 07/17/2004 |
0987 | Robert | Archer | Null | Null |
1243 | Mark | Taylor | 26 | 09/30/2004 |
2134 | Bridget | Abbott | 30 | 09/30/2004 |
6784 | John | Miller | 30 | Null |
0098 | Carl | Adams | 25 | 09/30/2004 |
1111 | John | Smith | 28 | 09/30/2004 |
2564 | Edward | Monroe | Null | Null |
8976 | Anthony | White | 21 | Null |
8973 | Marianne | Collins | 30 | 10/15/2004 |
As an example, the relation completeness of the Student relation in Table 3 is 53/60.
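A minimal sketch of these completeness measures, applied to the data of Table 3 (nulls encoded as None; all attributes weighted equally, per the assumption discussed above):

```python
# The Student relation of Table 3; None stands for a null value.
students = [
    ("6754", "Mike", "Collins", 29, "07/17/2004"),
    ("8907", "Anne", "Herbert", 18, "07/17/2004"),
    ("6578", "Julianne", "Merrals", None, "07/17/2004"),
    ("0987", "Robert", "Archer", None, None),
    ("1243", "Mark", "Taylor", 26, "09/30/2004"),
    ("2134", "Bridget", "Abbott", 30, "09/30/2004"),
    ("6784", "John", "Miller", 30, None),
    ("0098", "Carl", "Adams", 25, "09/30/2004"),
    ("1111", "John", "Smith", 28, "09/30/2004"),
    ("2564", "Edward", "Monroe", None, None),
    ("8976", "Anthony", "White", 21, None),
    ("8973", "Marianne", "Collins", 30, "10/15/2004"),
]

def tuple_completeness(t):
    """Fraction of specified (non-null) values in a tuple."""
    return sum(v is not None for v in t) / len(t)

def attribute_completeness(rows, i):
    """Fraction of non-null values of the i-th attribute."""
    return sum(r[i] is not None for r in rows) / len(rows)

def relation_completeness(rows):
    """Fraction of non-null values in the whole relation."""
    total = len(rows) * len(rows[0])
    return sum(v is not None for r in rows for v in r) / total

print(tuple_completeness(students[2]))      # 0.8 (tuple 6578)
print(attribute_completeness(students, 3))  # Vote: 9/12 = 0.75
print(relation_completeness(students))      # 53/60 ≈ 0.8833
```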
2.1.3 The Consistency Cluster
2.1.4 The Redundancy Cluster
Conciseness, i.e., the absence of redundancy, can be characterized in two ways:

- intensional conciseness, which refers to the case in which the data set does not contain redundant schema elements (properties and classes), i.e., only essential properties and classes are included in the schema;
- extensional conciseness, which refers to the case in which the data set does not contain redundant objects (instances).
An example of an intensional conciseness problem is the same piece of information, such as an airline identifier A123, being represented by two different properties in the same data set, such as http://flights.org/airlineID and http://flights.org/name. In this case, the redundancy between airlineID and name can ideally be resolved by merging the two properties and keeping only one unique identifier. In other words, conciseness should push stakeholders to reuse as much as possible schema elements from existing schemata/ontologies rather than creating new ones, since such reuse supports data interoperability.

2.1.5 The Readability Cluster
Readability metrics are typically computed from a small set of basic text features, where:

- characters are the number of characters in the text;
- words are the number of words in the text;
- sentences are the number of sentences in the text;
- complex words are difficult words, defined as those with three or more syllables.
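One classical way of combining these counts is the Gunning fog index, which adds the average sentence length to the percentage of complex words. A minimal sketch, using a naive vowel-group heuristic as a stand-in for a real syllable counter:

```python
import re

def count_syllables(word: str) -> int:
    # Crude approximation: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text: str) -> float:
    """Fog index = 0.4 * (words/sentences + 100 * complex_words/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))

print(round(gunning_fog("Data quality is a multifaceted concept. "
                        "Different dimensions concur to define it."), 1))
```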
2.1.6 The Accessibility Cluster
2.1.7 The Trust Cluster
2.1.8 The Usefulness Cluster
3 Big Data
- Volume refers to the size of the data;
- Velocity refers to the data provisioning rate and to the time within which it is necessary to act on them. Every minute, about 400,000 tweets are posted on Twitter, 200 million e-mails are sent, and 2 million Google search queries are submitted [40];
- Variety refers to the heterogeneity of data acquisition, data representation, and semantic interpretation.
3.1 Human-Sourced Information Sources
Source | Structure | Human influence |
---|---|---|
Human sourced | Loosely structured | Direct |
Process mediated | Structured | Indirect (e.g., data entry activities) |
Machine generated | Well structured | None |
3.2 Process-Mediated Sources
3.3 Machine-Generated Sources
4 Big Data Quality
4.1 Process-Mediated Sources
Typical quality problems affecting these sources include:

- outdated values;
- incomplete values;
- conflicting values;
- wrong values;
- noise in the data extraction.
At a given time t, the entities of a source S can be classified as:

- up-to-date, Up(S, t), including the entities that also exist in the real world and have their attribute values in agreement with the world;
- out-of-date, Out(S, t), including the entities for which the latest value changes are not captured by the source;
- nondeleted, including all the remaining entities, i.e., entities that have disappeared from the real world but have not yet been removed from the source.
4.1.1 Redundancy
Redundancy across sources can be measured at two granularities:

- redundancy on objects is the percentage of sources that provide a particular object;
- redundancy on data items is the percentage of sources that provide a particular data item.
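Both measures reduce to counting, over the set of sources, how many provide the object or data item of interest. A minimal sketch (source contents are illustrative):

```python
# Each source is represented by the set of objects (or data items)
# it provides.
sources = {
    "S1": {"flight-17", "flight-42"},
    "S2": {"flight-17"},
    "S3": {"flight-17", "flight-99"},
}

def redundancy(item, sources):
    """Fraction of sources providing the given object or data item."""
    return sum(item in provided for provided in sources.values()) / len(sources)

print(redundancy("flight-17", sources))  # 1.0
print(redundancy("flight-42", sources))  # ≈ 0.33
```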
4.1.2 Consistency
Given a data item d, let S(d) denote the set of sources providing d, V(d) the set of different values provided on d, and S(d, v) the set of sources providing the value v on d. The consistency of the values provided on d can then be measured as follows:

- number of values is the number of different values provided on d, which is the size of V(d);
- entropy is
  $$\begin{aligned} E(d)=-\sum _{v \in V(d)} \frac{|S(d,v)|}{|S(d)|} \log \frac{|S(d,v)|}{|S(d)|} \end{aligned}$$(1)
  (the higher the inconsistency, the higher the entropy);
- deviation is
  $$\begin{aligned} D(d)=\sqrt{\frac{1}{|V(d)|} \sum _{v \in V(d)} \left( \frac{v-v_0}{v_0}\right) ^{2}} \end{aligned}$$(2)
  where \(v_0\) is the value provided by the largest number of sources (it applies to data items with numerical values).
4.1.3 Accuracy
- source accuracy is the fraction of values provided by the given source that are correct;
- accuracy deviation: let us denote by T the set of time points in a period, by A(t) the accuracy of a source at a time \(t \in T\), and by \(A'\) the mean accuracy over T; the accuracy deviation is
  $$\begin{aligned} Dev(S)=\sqrt{\frac{1}{|T|} \sum _{t \in T} (A(t)-A')^2} \end{aligned}$$(3)
- average accuracy is the average source accuracy.
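For instance, Eq. (3) can be computed directly from a series of per-period accuracy measurements (the values below are illustrative):

```python
import math

# Accuracy A(t) of a source measured at each time point of a period T.
accuracy = [0.92, 0.95, 0.90, 0.93]

mean = sum(accuracy) / len(accuracy)  # A'
dev = math.sqrt(sum((a - mean) ** 2 for a in accuracy) / len(accuracy))
print(round(dev, 4))  # how much the source accuracy fluctuates over T
```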
- loss function is defined based on the data type:
  - Categorical data: the most commonly used loss function is the 0-1 loss, in which an error is incurred if the provided value v differs from the gold standard value \(v^*\):
    $$\begin{aligned} L(d)={\left\{ \begin{array}{ll} 1&{}\quad \text { if }\; v \ne v^* \\ 0&{}\quad \text { otherwise} \end{array}\right. } \end{aligned}$$(4)
  - Continuous data: the loss function should characterize the distance from the value to the gold standard with respect to the dispersion of values across sources. One common loss function is the normalized squared loss, defined as
    $$\begin{aligned} L(d)=\frac{(v^*-v)^2}{\text {std}(V(d))} \end{aligned}$$(5)
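A minimal sketch of both loss functions; V(d) is assumed to collect the values provided for d across sources, and the population standard deviation is used for std:

```python
import statistics

def zero_one_loss(v, v_star):
    """0-1 loss for categorical data: error iff v differs from v*."""
    return 0 if v == v_star else 1

def normalized_squared_loss(v, v_star, values):
    """Squared distance to the gold standard, normalized by the
    standard deviation of the values provided across sources."""
    return (v_star - v) ** 2 / statistics.pstdev(values)

values = [100.0, 100.0, 110.0, 95.0]        # V(d), illustrative
print(zero_one_loss("Weir", "Curtiz"))      # 1
print(normalized_squared_loss(110.0, 100.0, values))
```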
4.1.4 Copying
- schema commonality is the average Jaccard similarity between the sets of provided attributes of each pair of sources:
  $$\begin{aligned} C=\text {avg}_{S,S'}\frac{|A(S) \cap A(S')|}{|A(S) \cup A(S')|} \end{aligned}$$(6)
- object commonality is the average Jaccard similarity, but between the sets of provided objects;
- value commonality is the average percentage of common values over all shared data items between each source pair.
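A minimal sketch of schema commonality, Eq. (6); object and value commonality follow the same pattern with object sets and shared data items. The attribute sets are illustrative:

```python
from itertools import combinations

# Attributes A(S) provided by each source (illustrative).
schemas = {
    "S1": {"title", "director", "year"},
    "S2": {"title", "director", "runtime"},
    "S3": {"title", "year"},
}

def jaccard(a, b):
    return len(a & b) / len(a | b)

pairs = list(combinations(schemas.values(), 2))
schema_commonality = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
print(round(schema_commonality, 3))  # ≈ 0.472
```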
4.1.5 Spread
- spread: if one only needs to identify and wrap a few top sites in order to build a comprehensive set of sources, the spread is low. A comprehensive set should also include some redundancy to overcome errors introduced by a single source;
- value of tail: if one needs to construct a comprehensive database, including the extraction of unpopular entities (i.e., relevant to a smaller group of users), the tail has high value;
- connectivity: if the data sources can be easily discovered by bootstrapping-based Web-scale extraction algorithms (i.e., where one starts with seed entities, uses them to reach all sites covering these entities, and iterates), the sources are connected.
4.1.6 Freshness
- the freshness of a source at a time t is the probability that a randomly selected entity is up-to-date, i.e.,
  $$\begin{aligned} F(S)=\frac{|Up(S,t)|}{|S_t|} \end{aligned}$$(7)
  where \(S_t\) is the set of entities in the source at a time t;
- the coverage of a source is the probability that a random entity of the real world at a time t belongs to S, i.e.,
  $$\begin{aligned} Cov(S)=\frac{|Up(S,t) \cup Out(S,t)|}{|W_t|} \end{aligned}$$(8)
  where \(W_t\) is the set of entities in the real world at a time t.
4.2 Machine-Generated Sources
Typical sources of quality problems for machine-generated data include:

- hardware noise;
- inaccuracies and imprecisions in sampling methods and derived data;
- environmental effects;
- adverse weather conditions;
- faulty equipment.
4.2.1 Accuracy
4.2.2 Completeness
- attribute ratio is the ratio of the number of attributes available to the total number of attributes of the sample;
- weighted attributes ratio is the same as the attribute ratio, where the contribution of each attribute is proportional to its importance for the application of interest.
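A minimal sketch of both ratios, assuming per-attribute weights that sum to 1 and express the importance of each attribute (the sample and weights are illustrative):

```python
# A sample from a sensor: attribute -> value (None = missing).
sample = {"temperature": 21.4, "humidity": None, "timestamp": 1700000000}
weights = {"temperature": 0.5, "humidity": 0.2, "timestamp": 0.3}

attribute_ratio = sum(v is not None for v in sample.values()) / len(sample)
weighted_ratio = sum(weights[a] for a, v in sample.items() if v is not None)
print(round(attribute_ratio, 3), weighted_ratio)  # 0.667 0.8
```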
4.2.3 Consistency
Type of consistency | Category (numerical/temporal/frequency) | Scope (individual data/data streams/both) | Definition |
---|---|---|---|
Numerical | Numerical | Individual data | Collected data should be accurate |
Temporal | Temporal | Individual data | Data should be delivered to the sink before or by the time it is expected |
Frequency | Frequency | Both | Controls the frequency of dramatic data changes and abnormal readings of data streams |
Absolute numerical | Numerical | Both | Sensor reading is out of the normal range, which can be preset by the application |
Relative numerical | Numerical | Both | Error between the real field reading and the corresponding data at the sink |
Hop | Numerical | Individual data | Data should keep consistency at each hop |
Single path | Numerical and temporal | Individual data | Consistency holds when data are transmitted from the source to the sink using a single path |
Multiple path | Numerical and temporal | Individual data | Consistency holds when data are transmitted from the source to the sink using multiple paths |
Strict | Numerical and temporal | Data streams | Differs from hop consistency because it is defined on a set of data and requires no data loss |
Alpha-loss | Numerical and temporal | Data streams | Similar to strict consistency except that alpha-data loss is accepted at the sink |
Partial | Numerical and temporal | Data streams | Similar to alpha consistency except that temporal consistency is released |
Trend | Numerical and temporal | Data streams | Similar to partial consistency except that numerical consistency is released |
Range frequency | Frequency | Data streams | The number of abnormal readings exceeds a certain number preset by the application |
Change frequency | Frequency | Data streams | Changes of sensor readings exceed a preset threshold |
4.2.4 Trustworthiness
4.2.5 Freshness
- age of a data item d, calculated by taking the difference between the current time, \(t_\mathrm{curr}\), and the measurement time t(d) of that data item;
- up-to-dateness, which decreases as age increases; specifically,
  $$\begin{aligned} U(d)={\left\{ \begin{array}{ll}1-\frac{Age(d)}{Lifetime(d)} &{}\quad \text { if }\; Age(d) < {\textit{Lifetime}}(d) \\ 0 &{}\quad \text {otherwise} \end{array}\right. } \end{aligned}$$(10)
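A minimal sketch of age and up-to-dateness, Eq. (10), using Unix timestamps for measurement times and an illustrative lifetime:

```python
import time

def age(measured_at, now=None):
    """Age of a data item: time elapsed since its measurement."""
    now = time.time() if now is None else now
    return now - measured_at

def up_to_dateness(measured_at, lifetime, now=None):
    """Decreases linearly from 1 to 0 as age approaches the item's lifetime."""
    a = age(measured_at, now)
    return 1 - a / lifetime if a < lifetime else 0.0

# A reading taken 30 s ago with a 120 s lifetime.
now = time.time()
print(up_to_dateness(now - 30, lifetime=120, now=now))  # 0.75
```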
4.3 Human-Sourced Information Sources
4.3.1 Ambiguity
- Level 0 refers to entities that most people regard as unambiguous. These entities carry only one meaning, such as dog (animal), California (state), and potato (vegetable).
- Level 1 refers to entities that make sense both when treated as ambiguous and when treated as unambiguous. These entities usually have several meanings, but all of these meanings are related to some extent. For example, Google (company & search engine), French (language & country), and truck (vehicle & public transportation) all belong to Level 1.
- Level 2 refers to entities that most people regard as ambiguous. These entities have two or more meanings which are extremely different from each other, such as apple (fruit & company), jaguar (animal & company), and python (animal & programming language).