1 Introduction

The increasing awareness of the importance of Crowdsourcing (Howe 2008) as a means of obtaining labeled data is promoting a shift in machine learning towards models that are annotator-aware. A good example is that of online platforms such as Amazon’s Mechanical Turk (AMT). These platforms provide an accessible and inexpensive resource for obtaining labeled data whose quality, in many situations, competes directly with that of “experts” (Snow et al. 2008; Novotney and Callison-Burch 2010). Moreover, by distributing a labeling task among multiple annotators, it can be completed in considerably less time. For such reasons, these online work-recruiting platforms are rapidly changing the way datasets are built.

Furthermore, the social web promotes an implicit form of Crowdsourcing, as multiple web users interact and share content (e.g., document tags, product ratings, opinions, user clicks, etc.). As the social web expands, so does the need for annotator-aware models.

From another perspective, there are tasks for which ground truth labels are simply very hard to obtain. Consider, for instance, the tasks of Sentiment Analysis, Movie Rating or Keyphrase Extraction. These tasks are subjective in nature, and hence the definition of ground truth requires very strict guidelines, which can be very hard to devise and follow. Even in well-studied tasks like Named Entity Recognition, linguists argue about what should and should not be considered a named entity, and consensus is not easily obtained. In cases where the task is inherently subjective, an attainable goal is to build a model that captures the wisdom of the crowds (Surowiecki 2004) as well as possible while paying less attention to dissonant views.

Another example can be found in the field of medical diagnosis, where obtaining ground truth can mean expensive or invasive medical procedures like biopsies. On the other hand, it is much simpler for a physician to consult his colleagues for an opinion, resulting in a multiple “experts” scenario.

Sequence labeling refers to the supervised learning task of assigning a label to each element of a sequence. Typical examples are Part-of-Speech tagging, Named Entity Recognition and Gene Prediction (Allen et al. 2004; Allen and Salzberg 2005). In such tasks, the individual labels cannot be considered detached from their context (i.e. the preceding and succeeding elements of the sequence and their corresponding labels). Two of the most popular sequence models are hidden Markov models (HMM) (Rabiner 1989) and Conditional Random Fields (CRF) (Lafferty et al. 2001). Due to the usually high-dimensional feature spaces (especially with CRFs), these models frequently require large amounts of labeled data to be properly trained, which hinders the construction and release of datasets and makes it almost prohibitive for a single annotator to label them. Although in some domains the use of unlabeled data can make this problem less severe (Bellare and Mccallum 2007), a more natural solution is to rely on multiple annotators. For example, for many tasks, AMT can be used to label large amounts of data (Callison-Burch and Dredze 2010). However, the large numbers of labels needed to compensate for the heterogeneity of the annotators’ expertise rapidly raise the actual cost beyond acceptable values. A parsimonious solution is therefore needed that can deal with such real-world constraints and heterogeneity.

In the past few years many approaches have been proposed to deal with the problem of supervised learning from multiple annotators in different paradigms (classification, regression, ranking, etc.); however, the particular problem of sequence labeling from multiple annotators has been left practically untouched, and most applications typically rely on majority voting (e.g. Laws et al. 2011). Given its relevance to fields such as Natural Language Processing, Bioinformatics, Computer Vision, Speech and Ubiquitous Computing, sequence labeling from multiple annotators is a very important problem. Unfortunately, due to its nature, typical approaches proposed for binary or categorical classification cannot be directly applied to sequences.

In this paper we propose a probabilistic approach, based on the Expectation-Maximization (EM) algorithm, for sequence labeling with CRFs in the scenario where multiple annotators provide labels with different levels of “reliability” but no actual ground truth is available. The proposed method jointly learns the CRF model parameters, the reliabilities of the annotators and the estimated ground truth label sequences. It is empirically shown that this method outperforms the baselines even under high levels of noise in the annotators’ labels and when the less “trustworthy” annotators dominate. The proposed approach also has the advantage of not requiring repeated labeling of the same input sequences by different annotators. Finally, this approach can be easily modified to work with other sequence labeling models such as HMMs.

2 Related work

The first works that relate to the problem of learning from multiple annotators go back to 1979, when Dawid and Skene proposed an approach for estimating the error rates of multiple observers (annotators) given their responses (labels) to multiple medical questions. Although this work focused only on estimating the hidden ground truth labels, it inspired other works in which there is an explicit attempt to learn a classifier. For example, Smyth et al. (1995) propose a similar approach to solve the problem of volcano detection and classification in Venus imagery with data labeled by multiple experts. As in previous works, their approach relies on a latent variable model in which the ground truth labels are treated as latent variables. The main difference is that the authors use the estimated (probabilistic) ground truth to explicitly learn a classifier.

More recently, Snow et al. (2008) demonstrated that learning from labels provided by multiple non-expert annotators can be as good as learning from the labels of one expert. Such findings inspired the development of new approaches that, unlike previous ones (Smyth et al. 1995; Donmez and Carbonell 2008; Sheng et al. 2008), do not rely on repeated labeling, i.e. having the same annotators labeling the same set of instances. In Raykar et al. (2009, 2010) an approach is proposed where the classifier and the annotators’ reliabilities are learned jointly. Later works relaxed the assumption that the annotators’ reliabilities do not depend on the instances they are labeling (Yan et al. 2010), and extended the proposed methodology to an active learning scenario (Yan et al. 2011). All these approaches share a few key aspects: (1) they use a latent variable model where the ground truth labels are treated as latent variables; (2) they rely on the EM algorithm (Dempster et al. 1977) to find maximum likelihood estimates for the model parameters; and (3) they deal mostly with binary classification problems (although some suggest extensions to handle categorical, ordinal and even continuous data).

The acclaimed importance of supervised learning from multiple annotators has led to many interesting alternative approaches and variations/extensions of previous works in the past couple of years. In Donmez et al. (2010) the authors propose the use of a particle filter to model the time-varying accuracies of the different annotators. Groot et al. (2011) propose an annotator-aware methodology for regression problems using Gaussian processes, and Wu et al. (2011) present a solution for ranking problems with multiple annotators.

Despite the variety of approaches presented for different learning paradigms, the problem of sequence labeling from multiple annotators has been left practically untouched, the only directly relevant work being that of Dredze et al. (2009). In this work the authors propose a method for learning structured predictors, namely CRFs, from instances with multiple labels in the presence of noise. This is achieved by modifying the CRF objective function used for training through the inclusion of a per-label prior, thereby preventing the model from straying too far from the provided priors. The per-label priors are then re-estimated by making use of their likelihoods under the whole dataset. In this way, the model is capable of using knowledge from other parts of the dataset to prefer certain labels over others. By iterating between the computation of the expected values of the label priors and the estimation of the model parameters in an EM-like fashion, the model is expected to give preference to the less noisy labels. Hence, we can view this process as self-training, a process whereby the model is trained iteratively on its own output. Although this approach keeps the model computationally tractable, their experimental results indicate that the method only improves performance in scenarios where there is a small amount of training data (low quantity) and where the labels are noisy (low quality).

It is important to stress that, contrary to the model proposed in this paper, the model by Dredze et al. (2009) is a multi-label model, and not a multi-annotator model, in the sense that the knowledge about who provided the multiple label sequences is completely discarded. The obvious solution for including this knowledge would be to use a latent ground truth model similar to the one proposed by Raykar et al. (2009, 2010), thus extending this work to sequence labeling tasks. However, treating the ground truth label sequences as latent variables and using the EM algorithm to estimate the model parameters would be problematic, since the number of possible label sequences grows exponentially with the length of the sequence, making the marginalization over the latent variables intractable. In contrast, the approach presented in this paper avoids this problem by treating the annotators’ reliabilities as latent variables, making the marginalization over the latent variables tractable (see Sect. 3).

In the field of Bioinformatics a similar problem has been attracting attention, in which multiple sources of evidence are combined for gene prediction (e.g. Allen et al. 2004; Allen and Salzberg 2005). In these approaches the outputs of multiple predictors (e.g. HMMs) are usually combined through a vote over the predicted labels, weighted by the confidence (posteriors) of the various sources in their predictions (Allen et al. 2004). Non-linear decision schemes also exist, for example using Decision Trees (Allen and Salzberg 2005), but, as with the linearly weighted voting schemes, the confidence weights are estimated once and never corrected. This contrasts with the approach discussed in this paper, where the goal is to build a single predictor (CRF) from the knowledge of multiple annotators (sources), and where the confidence of each source is iteratively re-estimated.

3 Approach

3.1 Measuring the reliability of the annotators

Let \(\mathbf{y}^{r}\) be a sequence of labels assigned by the \(r\)th annotator to some observed input sequence \(\mathbf{x}\). If we were told the actual (unobserved) sequence of true labels \(\mathbf{y}\) for that same input sequence \(\mathbf{x}\), we could evaluate the quality (or reliability) of the \(r\)th annotator in a dataset by measuring its precision and recall. Furthermore, we could combine precision and recall in a single measure by using the traditional F1-measure, and use this combined measure to evaluate how “good” or “reliable” a given annotator is according to some ground truth. In practice, any appropriate loss function can be used to evaluate the quality of the annotators. The choice of one metric over others is purely problem-specific. The F-measure was used here due to its wide applicability in sequence labeling problems and, particularly, in the tasks used in the experiments (Sect. 4).
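For concreteness, the sketch below shows how the strict phrase-level precision, recall and F1-measure of an annotator could be computed against a reference labeling, assuming BIO-encoded tag sequences. The helper names are ours and the code is only an illustration, not the exact evaluation code used in the paper.

```python
# Minimal sketch: strict phrase-level F1 of one annotator against a reference
# labeling, assuming BIO-encoded tags (e.g. "B-PER", "I-PER", "O").

def extract_spans(tags):
    """Return the set of (start, end, type) spans encoded by a BIO sequence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):          # sentinel to close the last span
        if tag.startswith("B-") or tag == "O" or (
                tag.startswith("I-") and etype != tag[2:]):
            if start is not None:
                spans.add((start, i, etype))        # close the currently open span
            start, etype = (i, tag[2:]) if tag.startswith("B-") else (None, None)
    return spans

def f1_measure(annotator_seqs, reference_seqs):
    """Strict phrase-level precision/recall/F1 over a list of sequences."""
    tp = fp = fn = 0
    for pred, ref in zip(annotator_seqs, reference_seqs):
        p, r = extract_spans(pred), extract_spans(ref)
        tp += len(p & r)
        fp += len(p - r)
        fn += len(r - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```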

3.2 Sequence labeling

If for a dataset of N input sequences \(\mathcal {X}= \{\mathbf{x}_{i}\}_{i=1}^{N}\) we knew the actual ground truth label sequences \(\mathcal {Y}= \{\mathbf{y}_{i}\}_{i=1}^{N}\), we could model the probabilities of the label sequences \(\mathcal {Y}\) given the input sequences \(\mathcal {X}\) using a linear-chain Conditional Random Field (CRF) (Lafferty et al. 2001).

In a linear-chain CRF the conditional probability of a sequence of labels y given a sequence of observations x is given by

$$ p_{\mathit{crf}}(\mathbf{y}|\mathbf{x}, \boldsymbol{\lambda}) = \frac{1}{\varPsi(\mathbf{x})} \prod_{t=1}^{T} \exp \Biggl\{ \sum_{k=1}^{K} \lambda_{k} f_{k}(y_{t-1},y_{t}, \mathbf{x},t) \Biggr\} $$
(1)

where \(\varPsi(\mathbf{x})\) is a normalization constant that makes the probabilities of all label sequences sum to one, \(f_{k}(y_{t-1},y_{t},\mathbf{x},t)\) is a feature function (often binary-valued, but possibly real-valued) and \(\lambda_{k}\) is a learned weight associated with feature \(f_{k}\). The feature functions can capture any aspect of the state transitions \(y_{t-1} \rightarrow y_{t}\) and of the whole input sequence \(\mathbf{x}\), and can thus be used to model the relationship between the labels and the characteristics of the input sequence at a given position \(t\).

According to the model defined in Eq. (1), the most probable label sequence for an input sequence \(\mathbf{x}\) is given by \(\mathbf{y}^{*} = \arg\max_{\mathbf{y}} p_{\mathit{crf}}(\mathbf{y}|\mathbf{x},\boldsymbol{\lambda})\), which can be efficiently determined through dynamic programming using the Viterbi algorithm.
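As an illustration of this decoding step, the sketch below runs the Viterbi recursion over per-position log-scores. The arrays `log_emit` and `log_trans` are assumptions of this example, standing in for a simplified decomposition of the feature-weighted sums \(\sum_{k} \lambda_{k} f_{k}(\cdot)\) of Eq. (1).

```python
import numpy as np

def viterbi(log_emit, log_trans):
    """Most probable label sequence under a linear-chain model.

    log_emit:  (T, C) array, log_emit[t, y] = score of label y at position t.
    log_trans: (C, C) array, log_trans[y_prev, y] = transition score.
    Returns the arg-max label sequence as a list of label indices.
    """
    T, C = log_emit.shape
    delta = np.full((T, C), -np.inf)      # best score of a path ending in y at t
    backptr = np.zeros((T, C), dtype=int)
    delta[0] = log_emit[0]
    for t in range(1, T):
        # rows: previous label, columns: current label
        scores = delta[t - 1][:, None] + log_trans + log_emit[t][None, :]
        backptr[t] = np.argmax(scores, axis=0)
        delta[t] = np.max(scores, axis=0)
    # backtrack from the best final label
    y = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        y.append(int(backptr[t, y[-1]]))
    return y[::-1]
```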

The parameters \(\boldsymbol{\lambda}\) of the CRF model are typically estimated from an i.i.d. dataset by maximum likelihood using limited-memory BFGS (Liu and Nocedal 1989). The loglikelihood for a dataset \(\{\mathbf{x}_{i}, \mathbf{y}_{i}\}_{i=1}^{N}\) is given by \(\sum_{i=1}^{N} \ln p_{\mathit{crf}}(\mathbf{y}_{i}|\mathbf{x}_{i}, \boldsymbol{\lambda})\).
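In practice, this kind of maximum likelihood training is available in off-the-shelf toolkits. As a brief illustration (the library and the toy data are our choices, not ones used in the paper), the snippet below trains a linear-chain CRF with L-BFGS using the sklearn-crfsuite package.

```python
# Example only: training a linear-chain CRF by (regularized) maximum likelihood
# with L-BFGS using sklearn-crfsuite. The toy data is purely illustrative.
import sklearn_crfsuite

X_train = [[{"word.lower()": "john", "word.istitle()": True},
            {"word.lower()": "lives", "word.istitle()": False}]]
Y_train = [["B-PER", "O"]]

crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",     # limited-memory BFGS optimization
    c2=1.0,                # Gaussian (L2) prior over the weights lambda
    max_iterations=100,
)
crf.fit(X_train, Y_train)
print(crf.predict(X_train))
```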

3.3 Maximum likelihood estimator

Since we do not know the set of actual ground truth label sequences \(\mathcal {Y}\) for the set of input sequences \(\mathcal {X}\), we must find a way to estimate it using the sets of label sequences provided by the R different annotators \(\{\mathcal {Y}^{1},\ldots,\mathcal {Y}^{R}\}\), and learn a CRF model along the way.

Let the observed data \(\{\mathbf{y}_{i}^{1},\ldots,\mathbf{y}_{i}^{R}, \mathbf{x}_{i}\}_{i=1}^{N}\) be denoted by \(\mathcal{D}\). Given this data, and assuming the instances are i.i.d., the likelihood function, for some parameters θ (their definition can be ignored for now) can be factored as

$$ p(\mathcal{D}|\theta) = \prod_{i=1}^{N} p \bigl(\mathbf{y}_{i}^{1},\ldots,\mathbf{y}_{i}^{R}| \mathbf{x}_{i},\theta\bigr). $$
(2)

We now introduce a random vector \(\mathbf{z}\) that represents the reliability of the annotators. We define \(\mathbf{z}\) to be an \(R\)-dimensional vector with values \(\{z^{1},\ldots,z^{R}\}\), such that \(\mathbf{z} \sim \mathrm{Multinomial}(\pi_{1},\ldots,\pi_{R})\) with probability

$$ p(\mathbf{z}|\boldsymbol{\pi}) = \prod _{r=1}^{R} (\pi_{r})^{z^{r}} $$
(3)

where we make the parameters of the multinomial (\(\boldsymbol{\pi}\)) explicit. If we tie the reliability of the \(r\)th annotator to his F1-measure, the parameters \(\boldsymbol{\pi}\) of the multinomial can then be defined as

$$ \pi_{r} = p\bigl(z^{r}=1\bigr) = \frac{F1\mbox{-}\mathit{measure}_{r}}{\sum_{j=1}^{R} F1\mbox{-}\mathit{measure}_{j}} $$
(4)

thus ensuring the constraints \(\pi_{r} \geq 0\) (since the F1-measure is always non-negative) and \(\sum_{r} \pi_{r} = 1\). The expectation of this multinomial random variable, \(\mathbb{E}\{z^{r}\} = p(z^{r}=1)\), can be interpreted as the probability of picking the label sequences provided by the \(r\)th annotator as the correct ones (i.e. for which \(F1\mbox{-}\mathit{measure}_{r}=1\)) and using those for training. An analogy for this model would be a student picking a book to learn about some subject. The student is provided by the University’s library with a set of books that cover that subject but differ in the correctness of their content. The student then has to pick one of the books from which to learn about that subject. Transferring this analogy back to our multiple-annotator setting, the random vector \(\mathbf{z}\) can be viewed as picking the best annotator from which to learn, thus enforcing competition among the annotators. The correct annotator is assumed to provide label sequences according to \(p_{\mathit{crf}}(\mathbf{y}_{i}^{r}|\mathbf{x}_{i},\boldsymbol{\lambda})\). The others are assumed to provide incorrect labels, which we assume to come from a random model \(p_{\mathit{rand}}(\mathbf{y}_{i}^{r}|\mathbf{x}_{i})\). The generative process can then be summarized as follows (a small numerical sketch of this construction is given after the generative process below):

  1. draw \(\mathbf{z} \sim \mathrm{Multinomial}(\pi_{1},\ldots,\pi_{R})\)

  2. for each instance \(\mathbf{x}_{i}\):

     (a) for each annotator \(r\):

        (i) if \(z^{r}=1\), draw \(\mathbf{y}_{i}^{r}\) from \(p_{\mathit{crf}}(\mathbf{y}_{i}^{r}|\mathbf{x}_{i},\boldsymbol{\lambda})\)

        (ii) if \(z^{r}=0\), draw \(\mathbf{y}_{i}^{r}\) from \(p_{\mathit{rand}}(\mathbf{y}_{i}^{r}|\mathbf{x}_{i})\)
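To make the construction of \(\boldsymbol{\pi}\) in Eq. (4) and the draw of \(\mathbf{z}\) concrete, the short sketch below normalizes a hypothetical set of F1-measures and samples which annotator is taken as the correct one; the numbers are made up purely for illustration.

```python
import numpy as np

# Hypothetical F1-measures of R = 3 annotators (illustrative values only).
f1 = np.array([0.90, 0.60, 0.30])

# Eq. (4): multinomial parameters, pi_r >= 0 and sum_r pi_r = 1.
pi = f1 / f1.sum()

# Draw z ~ Multinomial(pi): exactly one component equals 1.
rng = np.random.default_rng(0)
z = rng.multinomial(1, pi)            # e.g. array([1, 0, 0])
correct_annotator = int(np.argmax(z))
print(pi, z, correct_annotator)
```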

Figure 1 shows a plate representation of the proposed model.

Fig. 1 Plate representation of the proposed model

For the sake of simplicity, we assume the random model \(p_{\mathit{rand}}(\mathbf{y}_{i}^{r}|\mathbf{x}_{i})\) to be uniformly distributed, hence

$$ p_{\mathit{rand}}\bigl(\mathbf{y}_i^r| \mathbf{x}_i\bigr) = \prod_{t=1}^T \frac{1}{\mathcal{C}} $$
(5)

where \(T\) denotes the length of the sequence and \(\mathcal{C}\) is the number of possible classes/labels for a sequence element.

Although it might seem too restrictive to assume that only one annotator provides the correct label sequences, it is important to note that the model can still capture the uncertainty about who the correct annotator should be. As an alternative to this approach, one could replace the multinomial random variable with multiple Bernoullis (one for each annotator). From a generative perspective, this would allow multiple annotators to be correct. However, it places too much emphasis on the form of \(p_{\mathit{rand}}(\mathbf{y}_{i}^{r}|\mathbf{x}_{i})\), since this distribution would be crucial for deciding whether an annotator is likely to be correct. On the other hand, as we shall see later, by using a multinomial the probabilities \(p_{\mathit{rand}}(\mathbf{y}_{i}^{r}|\mathbf{x}_{i})\) cancel out from the updates of the annotators’ reliabilities, thus forcing the annotators to “compete” with each other for the best label sequences.

Following the generative process described above, we can now define

$$ p\bigl(\mathbf{y}_i^1,\ldots, \mathbf{y}_i^R|\mathbf{x}_i,\mathbf{z}, \boldsymbol{\lambda}\bigr) = \prod_{r=1}^R \bigl\{ p_{\mathit{crf}}\bigl(\mathbf{y}_i^r| \mathbf{x}_i,\boldsymbol{\lambda}\bigr) \bigr\}^{(z^r)} \bigl\{ p_{\mathit{rand}}\bigl(\mathbf{y}_i^r| \mathbf{x}_i\bigr) \bigr\}^{(1-z^r)} $$
(6)

where we made use of the assumption that the annotators make their decisions independently of each other.

Including the vector z in our model as observed would yield the following expression for the likelihood

$$ p(\mathcal{D},\mathbf{z}|\theta) = p(\mathbf{z}|\boldsymbol{\pi}) \prod _{i=1}^N p\bigl(\mathbf{y}_i^1,\ldots, \mathbf{y}_i^R|\mathbf{x}_i,\mathbf{z}, \boldsymbol{\lambda}\bigr) $$
(7)

where \(\theta=\{\boldsymbol{\pi},\boldsymbol{\lambda}\}\) are the model parameters.

Since we do not actually observe z, we must treat it as latent and marginalize over it by summing over all its possible values. The likelihood of our model then becomes

$$ p(\mathcal{D}|\theta) = \sum_{\mathbf{z}} p(\mathbf{z}| \boldsymbol{\pi}) \prod_{i=1}^N p\bigl( \mathbf{y}_i^1,\ldots,\mathbf{y}_i^R| \mathbf{x}_i,\mathbf{z},\boldsymbol{\lambda}\bigr). $$
(8)

The choice of explicitly including the reliabilities of the annotators (which we represent through the vector \(\mathbf{z}\)) as latent variables and marginalizing over them contrasts with typical approaches to learning from multiple annotators (e.g. Raykar et al. 2009, 2010; Dredze et al. 2009; Yan et al. 2011), where the unobserved ground truth labels are treated as latent variables. Since those variables are not observed (i.e. latent), they must be marginalized over. For sequence labeling problems, this marginalization can be problematic due to the combinatorial explosion of possible label sequences over which we would have to marginalize. By explicitly handling the annotators’ reliabilities as latent variables instead, this problem is completely avoided.

Making use of Eqs. (3) and (6), the likelihood can be further simplified giving

$$\begin{aligned} p(\mathcal{D}|\theta) &= \sum _{r=1}^{R} \pi_r \prod _{i=1}^N \Biggl\{ p_{\mathit{crf}}\bigl( \mathbf{y}_i^r|\mathbf{x}_i,\boldsymbol{\lambda}\bigr) \prod_{\substack{j=1\\ j \neq r}}^R p_{\mathit{rand}}\bigl(\mathbf{y}_i^j| \mathbf{x}_i\bigr) \Biggr\}. \end{aligned}$$
(9)

The maximum likelihood estimator is then found by determining the parameters \(\theta_{\mathit{MLE}}\) that maximize

$$ \theta_{\mathit{MLE}} = \arg \max_{\theta} \mbox{ } \ln \sum _{r=1}^{R} \pi_r \prod _{i=1}^N \Biggl\{ p_{\mathit{crf}}\bigl( \mathbf{y}_i^r|\mathbf{x}_i,\boldsymbol{\lambda}\bigr) \prod_{\substack{j=1\\ j \neq r}}^R p_{\mathit{rand}}\bigl(\mathbf{y}_i^j| \mathbf{x}_i\bigr) \Biggr\}. $$
(10)
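Numerically, the products over sequences in Eqs. (9) and (10) are best handled in log-space. The sketch below evaluates the marginal log-likelihood from per-annotator log-probabilities; the inputs are hypothetical placeholders for \(\ln p_{\mathit{crf}}(\mathbf{y}_{i}^{r}|\mathbf{x}_{i},\boldsymbol{\lambda})\) and \(\ln p_{\mathit{rand}}(\mathbf{y}_{i}^{j}|\mathbf{x}_{i})\), which would be supplied by a CRF implementation.

```python
import numpy as np
from scipy.special import logsumexp

def marginal_loglik(log_pi, log_crf, log_rand):
    """Log of Eq. (9), computed in log-space for numerical stability.

    log_pi:   (R,)   log of the multinomial parameters pi_r.
    log_crf:  (N, R) log_crf[i, r] = ln p_crf(y_i^r | x_i, lambda).
    log_rand: (N, R) log_rand[i, j] = ln p_rand(y_i^j | x_i).
    """
    total_rand = log_rand.sum()                 # sum over all i and all j
    # For the term where annotator r is "correct", his random-model
    # contribution is replaced by the CRF contribution:
    per_r = log_pi + total_rand + (log_crf - log_rand).sum(axis=0)
    return logsumexp(per_r)                     # ln sum_r exp(per_r)
```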

3.4 EM algorithm

As with other latent variable models, we rely on the Expectation-Maximization algorithm (Dempster et al. 1977) to find maximum likelihood estimates of the parameters of the proposed model.

If we observed the complete dataset \(\{\mathcal{D},\mathbf{z}\}\), then the loglikelihood function would simply take the form \(\ln p(\mathcal{D},\mathbf{z}|\theta)\). Since we only have the “incomplete” dataset \(\mathcal{D}\), our state of knowledge about the values of the latent variable \(\mathbf{z}\) (the reliabilities of the annotators) is given by the posterior distribution \(p(\mathbf{z}|\mathcal{D},\theta)\). Therefore, instead of the complete-data loglikelihood, we consider its expected value under this posterior distribution, which corresponds to the E-step of the EM algorithm. Hence, in the E-step we use the current parameter values \(\theta^{old}\) to find the posterior distribution of the latent variable \(\mathbf{z}\). We then use this posterior distribution to find the expectation of the complete-data loglikelihood evaluated for some general parameter values \(\theta\). This expectation is given by

$$\begin{aligned} \mathbb{E}_{p(\mathbf{z}|\mathcal {D},\theta^{old})}\bigl\{ \ln p(\mathcal {D}, \mathbf{z}|\theta) \bigr\} &= \sum_{\mathbf{z}} p\bigl( \mathbf{z}|\mathcal{D},\theta^{old}\bigr) \ln p(\mathcal{D},\mathbf{z}| \theta) \\ &= \sum_{\mathbf{z}} p\bigl(\mathbf{z}|\mathcal{D}, \theta^{old}\bigr) \ln \Biggl\{ p(\mathbf{z}|\boldsymbol{\pi}) \prod _{i=1}^N p\bigl(\mathbf{y}_i^1,\ldots, \mathbf{y}_i^R|\mathbf{x}_i,\mathbf{z}, \boldsymbol{\lambda}\bigr) \Biggr\}. \end{aligned}$$
(11)

The posterior distribution of the latent variable \(\mathbf{z}\) can be obtained using Bayes’ theorem, giving

$$\begin{aligned} \gamma\bigl(z^r\bigr) &= p\bigl(z^r=1|\mathcal {D}, \theta^{old}\bigr) \\ &= \frac{ p(z^r=1|\boldsymbol{\pi}^{old}) \mbox{ } p(\mathcal {Y}^1,\ldots,\mathcal {Y}^R|\mathcal {X},z^r=1,\boldsymbol{\lambda}^{old}) }{ \sum_{j=1}^R p(z^j=1| \boldsymbol{\pi}^{old}) \mbox{ } p(\mathcal {Y}^1,\ldots,\mathcal {Y}^R|\mathcal {X},z^j=1,\boldsymbol{\lambda}^{old}) } \\ &= \frac{ \pi_r^{old} \prod_{i=1}^N \{ p_{\mathit{crf}}(\mathbf{y}_i^r| \mathbf{x}_i,\boldsymbol{\lambda}^{old}) \prod_{\substack{k=1\\ k \neq r}}^R p_{\mathit{rand}}(\mathbf{y}_i^k|\mathbf{x}_i) \} }{ \sum_{j=1}^{R} \pi_j^{old} \prod_{i=1}^N \{ p_{\mathit{crf}}(\mathbf{y}_i^j|\mathbf{x}_i,\boldsymbol{\lambda}^{old}) \prod_{\substack{k=1\\ k \neq j}}^R p_{\mathit{rand}}(\mathbf{y}_i^k|\mathbf{x}_i) \} }. \end{aligned}$$
(12)

Since we assume a uniform model for \(p_{\mathit{rand}}(\mathbf{y}_{i}^{r}|\mathbf{x}_{i})\), this expression can be further simplified, giving

$$\begin{aligned} \gamma\bigl(z^r\bigr) = \frac{ \pi_r^{old} \prod_{i=1}^N p_{\mathit{crf}} (\mathbf{y}_i^r|\mathbf{x}_i,\boldsymbol{\lambda}^{old}) }{ \sum_{j=1}^{R} \pi_j^{old} \prod_{i=1}^N p_{\mathit{crf}}(\mathbf{y}_i^j|\mathbf{x}_i,\boldsymbol{\lambda}^{old}) }. \end{aligned}$$
(13)

Making use of Eqs. (3), (6) and (11) the expected value of the complete-data loglikelihood becomes

$$\begin{aligned} &\mathbb{E}_{p(\mathbf{z}|\mathcal {D},\theta^{old})}\bigl\{ \ln p(\mathcal {D},\mathbf{z}|\theta) \bigr\} \\ &\quad = \sum_{r=1}^R \gamma\bigl(z^r\bigr) \Biggl\{ \ln \pi_r + \sum_{i=1}^N \Biggl( \ln p_{\mathit{crf}}\bigl(\mathbf{y}_i^r|\mathbf{x}_i, \boldsymbol{\lambda}\bigr) + \sum_{\substack{j=1\\ j \neq r}}^R \ln p_{\mathit{rand}}\bigl(\mathbf{y}_i^j| \mathbf{x}_i\bigr) \Biggr) \Biggr\}. \end{aligned}$$
(14)

In the M-step of the EM algorithm we maximize this expectation with respect to the model parameters \(\theta\), obtaining new parameter values \(\theta^{new}\).

The EM algorithm can then be summarized as follows:

E-step:

Evaluate

$$ \gamma\bigl(z^r\bigr) \propto \pi_r^{old} \prod _{i=1}^N p_{\mathit{crf}} \bigl(\mathbf{y}_i^r|\mathbf{x}_i,\boldsymbol{\lambda}^{old}\bigr). $$
(15)
M-step:

Estimate the new ground truth label sequences \(\widehat{\mathcal {Y}}^{new}\) and the new parameters \(\theta^{new}=\{\boldsymbol{\pi}^{new},\boldsymbol{\lambda}^{new}\}\) given by

$$ \boldsymbol{\lambda}^{new} = \arg\max _{\boldsymbol{\lambda}} \sum_{i=1}^{N} \sum _{r=1}^{R} \gamma\bigl(z^r\bigr) \bigl\{ \ln p_{\mathit{crf}}\bigl(\mathbf{y}_{i}^{r}| \mathbf{x}_{i}, \boldsymbol{\lambda}\bigr) \bigr\} $$
(16)
$$ \widehat{\mathcal {Y}}^{new} = \arg\max_{\widehat{\mathcal {Y}}} p_{\mathit{crf}}\bigl(\widehat{\mathcal {Y}}|\mathcal {X},\boldsymbol{\lambda}^{new}\bigr) $$
(17)
$$ \pi_r^{new} = \frac{F1\mbox{-}\mathit{measure}_{r}}{\sum_{j=1}^{R} F1\mbox{-}\mathit{measure}_{j}} $$
(18)

where in Eq. (17) the new ground truth estimate is efficiently determined using the Viterbi algorithm, and in Eq. (16) the new CRF model parameters \(\boldsymbol{\lambda}^{new}\) are determined using limited-memory BFGS, just as in standard CRF training (Sutton and McCallum 2006). However, the loglikelihood function now includes a weighting factor: \(\gamma(z^{r})\). From this perspective, when learning from the label sequences of the different annotators, the proposed approach weights them by how much we expect each annotator to be right, taking into account how likely the other annotators are to be correct. If, for example, there are only two “good” annotators among a group of five, they will share the responsibility of “teaching” the CRF model.

The EM algorithm can be initialized simply by assigning random values to the annotators’ reliabilities or by estimating the ground truth label sequences \(\widehat{\mathcal {Y}}\) using majority voting. The algorithm stops when the expectation in Eq. (11) converges or when the updates to the annotators’ reliabilities fall below a given threshold.
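The following sketch puts the E- and M-steps together. The CRF routines (`fit_weighted_crf`, `crf_log_prob`, `viterbi_decode`) and the scalar F1 helper `f1_fn` are hypothetical placeholders: any CRF implementation that supports per-sequence weights in its loglikelihood could play their role. For simplicity the sketch starts from uniform reliabilities rather than a majority-voting estimate.

```python
import numpy as np
from scipy.special import logsumexp

def em_crf_ma(X, Y_annotators, f1_fn, fit_weighted_crf,
              crf_log_prob, viterbi_decode, n_iter=20):
    """Sketch of the EM procedure (hypothetical CRF interface).

    X:            list of N input sequences.
    Y_annotators: list of R lists; Y_annotators[r][i] is the label sequence y_i^r.
    f1_fn:        returns a scalar F1-measure of a set of label sequences
                  against a reference set.
    """
    R, N = len(Y_annotators), len(X)
    pi = np.full(R, 1.0 / R)            # initial reliabilities (uniform here)
    lam = None
    for _ in range(n_iter):
        # ----- E-step: gamma(z^r) proportional to pi_r * prod_i p_crf(y_i^r | x_i) -----
        if lam is None:
            gamma = pi.copy()           # no CRF yet on the very first pass
        else:
            log_g = np.log(pi) + np.array(
                [sum(crf_log_prob(lam, X[i], Y_annotators[r][i])
                     for i in range(N)) for r in range(R)])
            gamma = np.exp(log_g - logsumexp(log_g))
        # ----- M-step -----
        # Eq. (16): weighted CRF training, each annotator's sequences weighted
        # by gamma(z^r) in the loglikelihood.
        lam = fit_weighted_crf(X, Y_annotators, weights=gamma)
        # Eq. (17): current ground truth estimate via Viterbi decoding.
        Y_hat = [viterbi_decode(lam, x) for x in X]
        # Eq. (18): reliabilities re-estimated against the estimated ground
        # truth (squaring the F1-measure, as the paper reports works best).
        f1 = np.array([f1_fn(Y_annotators[r], Y_hat) ** 2 for r in range(R)])
        pi = f1 / f1.sum()
    return lam, pi, Y_hat
```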

3.5 Maximum-a-posteriori

Sometimes we know a priori that some annotators are better or more trustworthy than others. This knowledge can be incorporated in the model by imposing a Dirichlet prior with parameters \(\{\alpha_{1},\ldots,\alpha_{R}\}\) over the annotators’ reliabilities. It is also useful to add a zero-mean Gaussian prior with variance \(\sigma^{2}\) over the CRF parameters \(\boldsymbol{\lambda}\) to enforce regularization (Sutton and McCallum 2006). The maximum-a-posteriori (MAP) estimator is found by determining

$$ \theta_{\mathit{MAP}} = \arg \max_{\theta} \bigl\{ \ln p( \mathcal{D}|\theta) + \ln p(\theta)\bigr\}. $$
(19)

An EM algorithm can then be derived in a similar fashion.

When no prior knowledge about the annotators’ reliabilities is available, the Dirichlet prior can also be used as a non-informative prior with all parameters \(\alpha_{r}\) equal. This prior acts as a regularization term, preventing the model from overfitting the data provided by a few annotators. The strength of the regularization depends on the value of \(\alpha\).
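Expanding Eq. (19) with the stated priors (reading the Dirichlet as a prior over the multinomial parameters \(\boldsymbol{\pi}\)) gives, up to additive constants, the following objective; this expansion is ours and is only meant to make the regularization terms explicit:

$$ \theta_{\mathit{MAP}} = \arg\max_{\theta} \Biggl\{ \ln p(\mathcal{D}|\theta) + \sum_{r=1}^{R} (\alpha_{r}-1) \ln \pi_{r} - \frac{1}{2\sigma^{2}} \sum_{k=1}^{K} \lambda_{k}^{2} \Biggr\} $$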

4 Experiments

The proposed approach is evaluated in the field of Natural Language Processing (NLP) on the particular tasks of Named Entity Recognition (NER) and Noun Phrase (NP) chunking. NER refers to the Information Extraction subtask of identifying and classifying atomic elements in text into predefined categories such as the names of persons, organizations, locations and others, while NP chunking consists of recognizing the chunks of sentences that correspond to noun phrases. Because of their many applications, these tasks are considered very important in NLP and related areas.

We conduct our experiments with two types of data: artificial data generated by simulating multiple annotators, and real data obtained using Amazon’s Mechanical Turk (AMT). In both cases, the label sequences are represented using the traditional BIO (Begin, Inside, Outside) scheme introduced by Ramshaw and Marcus (1995).

The proposed approach (henceforth referred to as “CRF-MA”) is compared with four baselines:

  • MVseq: majority voting at the sequence level (i.e., the label sequence with the most votes wins);

  • MVtoken: majority voting at the token level (i.e., for each token, the BIO label with the most votes wins); a minimal sketch of this baseline is given after this list;

  • MVseg: a two-step majority voting performed over the BIO labels of the tokens. First, a majority vote is used for the segmentation process (i.e. to decide whether a token should be considered part of a segment, such as a named entity); then a second majority vote is used to decide the labels of the identified segments (e.g. what type of named entity it is).

  • CRF-CONC: a CRF using all the data from all annotators concatenated for training.
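As an illustration of the simplest of these baselines, the sketch below implements token-level majority voting (MVtoken); the helper name is ours and ties are broken arbitrarily.

```python
from collections import Counter

def mv_token(label_seqs):
    """Token-level majority voting (MVtoken) over the BIO label sequences
    provided by several annotators for the *same* sentence.

    label_seqs: list of equal-length lists of BIO labels, one per annotator.
    Returns a single consensus label sequence (ties broken arbitrarily).
    """
    return [Counter(labels_at_t).most_common(1)[0][0]
            for labels_at_t in zip(*label_seqs)]

# Example: three annotators labeling the same 3-token sentence.
print(mv_token([["B-PER", "I-PER", "O"],
                ["B-PER", "O",     "O"],
                ["O",     "I-PER", "O"]]))   # -> ['B-PER', 'I-PER', 'O']
```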

The proposed model is also compared with the two variants of the multi-label model proposed in Dredze et al. (2009): MultiCRF and MultiCRF-MAX. The latter differs from the former by training the CRF on the most likely (maximum) label instead of training on the (fuzzy) probabilistic labels (see Dredze et al. (2009) for details).

As an upper-bound, we also show the results of a CRF trained on ground truth (gold) data. We refer to this as “CRF-GOLD”.

For all the experiments a simple set of features that is typical in NLP tasks was used. The features used are summarized in Table 1. In CRF-MA, the EM algorithm was initialized with token-level majority voting (MVtoken). The MultiCRF model was initialized with uniform label priors. All the results are reported using (strict) phrase-level F1-score.

Table 1 Summary of CRF features

During our experiments we found that using the square of the F1-measure when computing π r gives the best results. This has the effect of emphasizing the differences between the reliabilities of the different annotators, and consequently their respective importances when learning the CRF model from the data. Hence, we use this version in all our experiments.

4.1 Artificial data

4.1.1 Named entity recognition

There are a few publicly available “golden” datasets for NER, such as the 2003 CONLL English NER task dataset (Sang and Meulder 2003), which is a common benchmark for sequence labeling tasks in the NLP community. From the 2003 CONLL English NER data we obtained a training set of 14987 sentences and a test set of 3466 sentences.

Since the 2003 CONLL shared NER dataset does not contain labels from multiple annotators, these are simulated for different reliabilities using the following method. First, a CRF is trained on the complete training set. Then, random Gaussian noise (with zero mean and variance \(\sigma^{2}\)) is added to the CRF parameters, and the modified CRF model is used to produce new label sequences for the training-set texts. The resulting label sequences deviate from the ground truth to a degree that depends on \(\sigma^{2}\). By repeating this procedure many times we can simulate multiple annotators with different levels of reliability.

An alternative approach would take the ground truth dataset and randomly flip the label of each token with uniform probability \(p\). However, this would result in simulated annotators that are inconsistent throughout the dataset, since they label the data with a certain level of randomness. We believe that the scenario where the annotators are consistent but might not be as good as an “expert” is more realistic and challenging, and thus more interesting to investigate. Therefore we give preference to the CRF-based method in most of our experiments with artificial data. Nonetheless, we also run experiments using this alternative label-flipping method to simulate annotators for the NP chunking task. Both simulation strategies are sketched below.
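In the sketch, `crf_weights`, `decode_with_weights` and the label set are hypothetical placeholders, since the paper does not prescribe a particular implementation of either simulation strategy.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_annotator_crf(crf_weights, decode_with_weights, X, sigma2):
    """CRF-based simulation: perturb the weights of a CRF trained on the gold
    data with zero-mean Gaussian noise of variance sigma2, then decode the
    training sentences with the perturbed model (hypothetical helpers)."""
    noisy = crf_weights + rng.normal(0.0, np.sqrt(sigma2), size=crf_weights.shape)
    return [decode_with_weights(noisy, x) for x in X]

def simulate_annotator_flip(gold_seqs, p, labels):
    """Label-flipping simulation: flip each token's label with probability p,
    replacing it by one of the other labels chosen uniformly at random
    (for NP chunking there are effectively only two labels per token)."""
    out = []
    for seq in gold_seqs:
        out.append([rng.choice([l for l in labels if l != y])
                    if rng.random() < p else y for y in seq])
    return out
```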

Using the CRF-based method described above, we simulated 5 artificial annotators with \(\sigma^{2}=[0.005,0.05,0.05,0.1,0.1]\). This choice of values intends to reproduce a scenario with one “good”, two “average” and two “bad” annotators. The proposed approach (CRF-MA) and the four baselines were then evaluated on the test set. This process was repeated 30 times and the average results are presented in Table 2. We also report the results obtained on the training set. Note that, unlike in “typical” supervised learning tasks, the F1 on the training set is important in our case because it reflects how well the “unobserved” ground truth is estimated from the opinions of multiple annotators with different levels of expertise.

Table 2 Results for the NER task with 5 simulated annotators (with \(\sigma^{2}=[0.005,0.05,0.05,0.1,0.1]\)) with repeated labeling

The results in Table 2 indicate that CRF-MA outperforms the four proposed baselines on both the training and test sets. In order to assess the statistical significance of this result, a paired t-test was used to compare the mean F1-score of CRF-MA on the test set against the MVseq, MVtoken, MVseg and CRF-CONC baselines. The obtained p-values were \(4\times10^{-25}\), \(7\times10^{-10}\), \(4\times10^{-8}\) and \(1\times10^{-14}\) respectively, which indicates that the differences are all highly significant.

Regarding the MultiCRF model, we can see that, at best, it performs almost as well as MVtoken. Not surprisingly, the “MAX” version of MultiCRF performs better than the standard version. This behavior is expected, since the “hard” labels obtained from majority voting also perform better than the “soft” label effect obtained with CRF-CONC. Nonetheless, neither version of MultiCRF performs as well as CRF-MA (test set p-values are \(1\times10^{-26}\) and \(1\times10^{-11}\) for MultiCRF and MultiCRF-MAX respectively).

In order to empirically show that the proposed approach does not rely on repeated labeling (i.e., multiple annotators labeling the same data instances), the same “golden” NER dataset was split into 5 subsets, and for each subset an annotator was simulated with a different level of reliability (namely \(\sigma^{2}=[0.005,0.05,0.05,0.1,0.1]\)) according to the CRF-based procedure described above. This process was repeated 30 times and the average results on the provided test set can be found in Table 3. Since there was no repeated labeling, the majority voting baselines, as well as the multi-label models (MultiCRF and MultiCRF-MAX), do not apply. The obtained results indicate that, in a scenario without any repeated labeling, the proposed approach (CRF-MA) still outperforms the CRF-CONC baseline. The statistical significance of the difference between the F1-scores of the two methods on the test set was evaluated using a paired t-test, indicating that the difference of the means is highly significant (p-value \(=1.47\times10^{-11}\)).

Table 3 Results for the NER task with 5 simulated annotators (with \(\sigma^{2}=[0.005,0.05,0.05,0.1,0.1]\)) without repeated labeling

The comparison of the two experiments (i.e. with and without repeated labeling) indicates that, in this setting, having less repeated labeling hurts the performance of CRF-MA. Since this model differentiates between annotators with different levels of expertise, its performance is best when the more reliable annotators have labeled more sequences, which is more likely to happen with more repeated labeling. Naturally, the opposite occurs with CRF-CONC. Since in this setting the less reliable annotators dominate, more repeated labeling translates into an even greater predominance of lower-quality annotations, which hurts the performance of CRF-CONC.

4.1.2 Noun phrase chunking

For the NP chunking task, the 2003 CONLL English NER dataset was also used. Besides named entities, this dataset also provides part-of-speech tags and syntactic tags (i.e. noun phrases, verbal phrases, prepositional phrases, etc.). The latter were used to generate training and test sets for NP chunking with the same sizes as the corresponding NER datasets.

In order to simulate multiple annotators for the NP chunking data, the alternative method of randomly flipping the label of each token with uniform probability \(p\) was used. Since for this task there are only two possible labels for each token (part of a noun phrase or not part of a noun phrase), it is trivial to simulate multiple annotators by randomly flipping labels. This annotator-simulation process reproduces situations where there is noise in the labels of the annotators. Using this method we simulated 5 annotators with label flip probabilities \(p=[0.01,0.1,0.3,0.5,0.7]\). This process was repeated 30 times and the average results are presented in Table 4. Unlike NER, NP chunking is purely a segmentation task; therefore the results for the MVseg baseline would be equal to those for MVtoken. The experimental evidence shows that the proposed approach (CRF-MA) achieves a higher F1-score than the MVseq, MVtoken and CRF-CONC baselines. The statistical significance of the difference between the test set F1-scores of CRF-MA and these three baselines (MVseq, MVtoken and CRF-CONC) was evaluated using a paired t-test, yielding p-values of \(2\times10^{-30}\), \(7\times10^{-22}\) and \(2\times10^{-16}\) respectively. As in the NER task, the CRF-MA model also outperforms the MultiCRF and MultiCRF-MAX approaches (test set p-values are \(6\times10^{-32}\) and \(2\times10^{-21}\) respectively).

Table 4 Results for the NP chunking task with 5 simulated annotators (with \(p=[0.01,0.1,0.3,0.5,0.7]\)) with repeated labeling

4.2 Real data

The use of Crowdsourcing platforms to annotate sequences is currently a very active research topic (Laws et al. 2011), with many different solutions being proposed to improve both the annotation and the learning processes at various levels, for example by evaluating annotators through the use of an expert (Voyer et al. 2010), by using a better annotation interface (Lawson et al. 2010), or by learning from partially annotated sequences to reduce annotation costs (Fernandes and Brefeld 2011).

With the purpose of obtaining real data from multiple annotators, we put 400 news articles from the 2003 CONLL shared NER task (for which we had ground truth) on Amazon’s Mechanical Turk for workers to label. In this experiment, the workers were required to identify the named entities in the sentences and classify them as persons, locations, organizations or miscellaneous. Together with the definition of named entity and the description of the categories, the workers were also provided with two example sentences. Workers with just a couple of answers were considered uninterested in the task and their answers were discarded, giving a total of 47 valid annotators. The average number of annotators per news article was 4.93, and each annotator labeled an average of 42 news articles (see Figs. 2a and 2b). In order to assess the “quality” of the annotators, we measured their F1-scores against the ground truth. Figure 2c shows a boxplot of the F1-scores obtained. It is interesting to note that the quality of the AMT workers varies enormously, with the lowest F1-score being 17.60 % (a very unreliable annotator), while the highest is 89.11 % (arguably almost an expert).

Fig. 2 Boxplots for (a) the number of answers per AMT worker, (b) the number of answers per news article, and (c) the F1-scores of the annotators

As in the experiments with simulated annotators, the different approaches are evaluated on the provided test set, as well as on the ground truth labels of those 400 news articles. The obtained results are presented in Table 5. These results indicate that the proposed approach is better at uncovering the ground truth than all the other approaches tested, which in turn results in better performance on the test set. Furthermore, the RMSE between the true F1-scores of the annotators (measured against the actual ground truth) and their estimated F1-scores according to the CRF-MA approach (measured against the estimated ground truth) was 8.61 %, meaning that the reliabilities of the annotators are approximated quite well. These results also indicate that crowdsourcing is an interesting alternative for obtaining labeled data that could be used to train an NER system.

Table 5 Results for the NER task using real data obtained from Amazon’s Mechanical Turk

In order to evaluate the impact of repeated labeling, a random subsampling of the AMT data was performed. This experiment allows us to reproduce a situation where each article is labeled by only one annotator, thus representing the minimum cost attainable with AMT (at the same price per task). For each of the 400 news articles, a single annotator was picked at random from the set of workers who labeled that article. This process was repeated 30 times to produce 30 subsampled datasets. The average precision, recall and F1-scores of the different methods are shown in Table 6. Note that, since there is no repeated labeling, neither the majority voting baselines nor the multi-label models (MultiCRF and MultiCRF-MAX) apply. The obtained results show that CRF-MA also outperforms CRF-CONC in this setting (p-value \(=3.56\times10^{-7}\)). Interestingly, when compared to the results in Table 5, this experiment also shows how much can be gained by repeated labeling, thus providing a perspective on the trade-off between repeated labeling and cost.

Table 6 Results for the NER task using data from Amazon’s Mechanical Turk without repeated labeling (subsampled data from the original dataset)

5 Conclusion

This paper presented a probabilistic approach for sequence labeling using CRFs with data from multiple annotators, which relies on a latent variable model where the reliabilities of the annotators are handled as latent variables. The EM algorithm is then used to find maximum likelihood estimates for the CRF model parameters, the reliabilities of the annotators and the ground truth label sequences. The proposed approach is empirically shown to significantly outperform traditional approaches, such as majority voting and training on the concatenated labeled data from all the annotators, even under high levels of noise in the annotators’ labels and when the less “trustworthy” annotators dominate. This approach also has the advantage of not requiring repeated labeling of the same input sequences by different annotators (unlike majority voting, for example). Although we presented a formulation using CRFs, it could easily be modified to work with other sequence labeling models such as HMMs.

Future work will explore dependencies of the annotators’ reliabilities on the input sequences they are labeling, which can be challenging due to the high dimensionality of the feature space, as well as the inclusion of a Dirichlet prior over the qualities of the annotators. Furthermore, the extension of the proposed model to an active learning setting will also be considered. Since the annotators’ reliabilities are estimated by the EM algorithm, this information can be used, for example, to decide which annotators are the most trustworthy. Requesting new labels from those annotators should eventually improve the model’s performance and reduce annotation costs.