Introduction

Quality of Experience (QoE) is a complex concept, riddled with subtleties spanning several confluent domains, such as systems performance, psychology, physiology, etc., as well as contextual aspects of where, when and how a service is used. Despite this complexity, it is most often treated in the most simplistic way in terms of statistical analysis: basically just looking at averages, and perhaps standard deviations and confidence intervals.

In this article, we extend our previous work (Hoßfeld et al. 2015), putting forward the idea that it is necessary to go beyond these simple measures of quality when performing subjective assessments, in order to a) get a proper understanding of the QoE being measured, and b) be able to exploit it fully. We present the reasons why it is important to look beyond the Mean Opinion Score (MOS) when thinking about QoE, as well as other measures that can be extracted from subjective assessment data, why they are useful, and how they can be used.

Our main contribution is in highlighting the importance of the insight found in the uncertainty of the opinion scores. This uncertainty is masked by the MOS, and such an insight will enable service providers to manage QoE in a more effective way. We propose different approaches to quantify the uncertainty: standard deviation, cumulative distribution functions (CDF), and quantiles, as well as looking into the impact of different types of rating scales on the results. We provide a formal proof that the user diversity of a study can be compared by means of the SOS parameter a, independently of the rating scale used. We also look at the relationship between quality and acceptance, both implicitly and explicitly. We provide several examples where going beyond simple MOS calculations allows for a better understanding of how the quality is actually perceived by the user population (as opposed to a hypothetical “average user”). A service provider might be interested, for instance, in the conditions under which at least 95 % of the users are satisfied with the service quality, which may be quantified in terms of quantiles. In particular, we take a closer look at the link between acceptance and opinion ratings (for a possible classification of QoE measures, cf. Fig. 3). Behavioral metrics such as acceptance are important for service providers to plan, dimension and operate their services. It is therefore tempting to establish a link between opinion measurements from subjective QoE studies and behavioral measurements, which we approach by defining the \(\theta \)-acceptability. The analysis of acceptance in relation to MOS values is another key contribution of the article. To cover a variety of relevant applications, we consider speech, video, and web QoE.

The remainder of this article is structured as follows. In “Motivation” we discuss why a more careful statistical treatment of subjective QoE assessments is needed.  “Background and related work” discusses related work. We present our proposed approach and define the QoE metrics in  “Definition of QOE metrics”, while in  “Application to real data sets: some examples” we look at several subjective assessment datasets, using other metrics besides MOS in our analysis, and also considering the impact of the scales used. We conclude the article in “Conclusions”, discussing the practical implications of our results.

Motivation

Objective and subjective QoE metrics

It is a common and well-established practice to use MOS (ITU-T 2003) to quantify perceived quality, both in the research literature and in practical applications such as QoE models. This is simple and useful for some instances of “technical” evaluation of systems and applications, such as network dimensioning, performance evaluation of new networking mechanisms, assessment of new codecs, etc.

There is a wealth of literature on different objective metrics, subjective methods, models, etc, (Engelke and Zepernick 2007; Van Moorsel 2001; Chikkerur et al. 2011; Mohammadi et al. 2014; Korhonen et al. 2012; Mu et al. 2012). However, none of them consider anything more complex than MOS in terms of analyzing subjective data or producing QoE estimates. In Streijl et al. (2014), the authors discuss the limitations of MOS and other related issues.

Collapsing the results of subjective assessments into MOS values, however, hides information related to inter-user variation. Simply using the standard deviation to assess this variation might not be sufficient to understand what is really going on, either. Two very different assessment distributions could “hide” behind the same MOS and standard deviation, and in some QoE exploitation scenarios, this could have a significant impact both for the users and the service providers. Figures 1 and 2 show examples of such distributions, continuous and discrete (the latter type being closer to the 5-point scales commonly used for subjective assessment), respectively. As can be seen, while votes following these distributions would present the same MOS (and also standard deviation values in Fig. 2), the underlying ground truths would be significantly different in each case. For the discrete case, they differ significantly in skewness and in their quantiles, both of which have practical implications, e.g., for service providers.
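As a minimal numerical illustration of this effect, the following Python sketch computes MOS, standard deviation, skewness and quantiles for two hypothetical discrete rating distributions on a 5-point scale (these are illustrative distributions constructed for this example, not the ones plotted in Figs. 1 and 2):

```python
import numpy as np

scores = np.array([1, 2, 3, 4, 5])
# Two hypothetical rating distributions with identical mean (3.0)
# and identical standard deviation (sqrt(2) ~ 1.41)
pmf_a = np.array([0.25, 0.0, 0.50, 0.0, 0.25])   # symmetric
pmf_b = np.array([1/3,  0.0, 0.0,  2/3, 0.0])    # left-skewed

def describe(pmf):
    mean = np.sum(scores * pmf)
    var = np.sum((scores - mean) ** 2 * pmf)
    skew = np.sum((scores - mean) ** 3 * pmf) / var ** 1.5
    cdf = np.cumsum(pmf)
    q10 = scores[np.searchsorted(cdf, 0.10)]   # 10 %-quantile
    q90 = scores[np.searchsorted(cdf, 0.90)]   # 90 %-quantile
    return mean, np.sqrt(var), skew, q10, q90

for name, pmf in [("A", pmf_a), ("B", pmf_b)]:
    mos, sd, skew, q10, q90 = describe(pmf)
    print(f"{name}: MOS={mos:.2f} SD={sd:.2f} skew={skew:+.2f} "
          f"10%-q={q10} 90%-q={q90}")
# Both distributions yield MOS=3.00 and SD=1.41, yet their skewness,
# medians and 90%-quantiles differ - information the MOS alone hides.
```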

Fig. 1

Different continuous distributions with identical mean (2.5) which differ in other measures like standard deviation \(\sigma \) or 90 % quantiles Q

Fig. 2

Different discrete distributions with identical mean (3.5) and standard deviation (0.968). It can easily be seen that e.g. the median (and other important quantiles, in fact) are significantly different in each distribution

Fig. 3

Classification into opinion and behavioral metrics. Perceptual quality dimensions include for example loudness, noisiness, etc. Qualitative opinions are typically ‘yes/no’ questions like for acceptance. Within the article we address the bold-faced and blue colored opinion metrics. Some of the opinion metrics are related to the behavioral metrics in italics and colored in green. See  “Definition of QOE metrics” for formal definitions of some of the terms above

When conducting subjective assessments, a researcher may try to answer different types of questions regarding the quality of the service under study. These questions might relate to the overall perception of the quality (probably the most common case found in the literature), some more specific perceptual dimensions of quality (e.g., intelligibility, in the case of speech, or blockiness in the case of video), or other aspects such as usability or acceptability of the service. The assessment itself can either explicitly ask opinions from the subjects, or try to infer those opinions through more indirect, behavioral or physiological measurements. Figure 3 presents an overview of approaches to measuring and estimating QoE, both subjectively and objectively.

The need to go beyond MOS

Using average values (such as MOS) may be sufficient in some application areas, for instance when comparing the efficiency of different media encoding mechanisms (where quality is not the only consideration, or is a secondary one), or when only a single, simple indicator of quality is sought (e.g., some monitoring dashboard applications). For most other applications—and in particular from a service provider’s point of view—however, MOS values are not really sufficient. Averages only consider—well—averages, and do not provide a way to address variations between users. As an extreme example, if the MOS of a given service under a given condition is 3, it is a priori impossible to know whether all users perceived quality as acceptable (all scores are 3), or maybe half the users rated the quality 5 while the other half rated it 1, or anything in between, in principle. To some extent, this can be mitigated by quantifying user rating variation via e.g. standard deviations. However, the question often faced by service providers is of the type: “Assuming they observe comparable conditions, are at least 95 % of my users satisfied with the service quality they receive?”. As we will see, it is a common occurrence that mean quality values indicated as acceptable or better (e.g. MOS 3 or higher) hide a large percentage of users who deem the quality unacceptable. This clearly poses a problem for the service provider (who might get customer complaints despite seeing the estimated quality as “good” in their monitoring systems), and for the users, who might receive poor quality service while the provider is unaware of the issue, or worse, believes the problem to be rooted outside of their system.

Likewise, using higher order moments such as skewness and kurtosis can provide insight as to how differently users perceive the quality under a given condition, relative to the mean (e.g. are most users assessing “close” to the mean, and on which side of it).

Very little work has been done on this type of characterization of subjective assessment. One notable exception is  (Janowski and Papir 2009), where the authors propose a generalized linear model able to estimate a distribution of ratings for different conditions (with an example use case of FTP download times versus link capacity).

Background and related work

The suitability of the methods used to assess quality has historically been a contentious subject, which in a way reflects the multi-disciplinary nature of QoE research, where media, networking, user experience, psychology and other fields converge.

Qualitative approaches to quality assessment, whereby users describe their experiences with the service in question, have been proposed as tools to identify relevant factors that affect quality (Bouch et al. 2000).

In other contexts (see Nachlieli and Shaked 2011 for a nice example related to subjective validation of objective image quality assessment tools via subjective assessment panels), pair-wise comparisons, or preference rank ordering can be better suited than quantitative assessments.

In practice, most QoE research in the literature typically follows the (quantitative) assessment approaches put forward by the ITU (e.g., ITU-T P.800 (ITU-T 1996) for telephony, or ITU-R Rec. BT.500-13 (ITU-R 2012) for broadcast video), whereby a panel of users is asked to rate the quality of a set of media samples that have been subjected to different degradations. These approaches have been shown to be useful in many contexts, but they are not without limitations.

In particular, different scales, labels, and rating mechanisms have been proposed (e.g. Watson and Sasse 1998), as well as other mechanisms for assessing quality in more indirect ways, for example, by seeing how it affects the way users perform certain tasks (Knoche et al. 1999; Gros et al. 2005, 2006; Durin and Gros 2008). These approaches provide, in some contexts, a more useful notion of quality, by considering its effects on the users, rather than considering user ratings. Their applicability, however, is limited to services and use cases where a clear task with measurable performance can be identified. This is limiting in many common scenarios, such as entertainment services. Moreover, the use of averages is still pervasive in them, posing the same type of limitations that the use of MOS values has. Other indirect measures of quality and how it affects users can be found in willingness to pay studies, which aim at understanding how quality affects the spending behavior of users (Sackl et al. 2013; Mäki et al. 2016).

Other approaches of quality assessment focus on (or at least explicitly include) the notion of acceptability (Pinson et al. 2007; Sasse and Knoche 2006; Spachos et al. 2015; Pessemier et al. 2011). Acceptability is a critical concept in certain quality assessment contextsFootnote 1 and application domains, both from the business point of view (“will customers find this level of quality acceptable, given the price they pay?”) and on more technical aspects, for instance in telemedicine, where applications often have a certain quality threshold below which they are no longer acceptable to use safely. Later in the article we discuss the relation between quality and acceptability (by looking at measures such as “Good or Better”, “Poor or Worse”, and introducing a more generic one, \(\theta \)-acceptability) in more detail.

QoE and influence factors on user ratings

From the definition of quality first introduced by Jekosch (2005), it follows that quality is the result of an individual’s perception and judgment process, see also Le Callet et al. (2013). Both processes lead to a certain degree of delight or annoyance of the judging individual when s/he is using an application or service, i.e. the Quality of Experience (QoE). The processes are subject to a number of influence factors (IFs) which are grouped in Le Callet et al. (2013) into human, system and context influence factors. Human IFs are static or dynamic user characteristics such as the demographic and socio-economic background, the physical or mental constitution, or the user’s mental state. They may influence the quality building processes at a lower, sensory level, or at a higher, cognitive level. System IFs subsume all technical content, media, network and device related characteristics of the system which impact quality. Context IFs “embrace any situational property to describe the user’s environment in terms of physical, temporal, social, economic, task, and technical characteristics” (Le Callet et al. 2013; Jumisko-Pyykkö and Vainio 2010) which impact the quality judgment. Whereas the impact of System IFs is a common object of analysis when new services are to be implemented, with few exceptions little is known about the impact of Human and Context IFs on the quality judgment.

Two well-known examples of actually including context factors into quality models are the so-called “advantage of access” factor in the E-model (Möller 2000), and the type of conversation and its impact on the quality judgment with respect to delay in telephony scenarios (Egger 2014; ITU-T 2011). Some of these contextual factors, such as the aforementioned “advantage of access” incorporated in the E-model might even vary with time, as different usage contexts become more or less common.

Influence factors in subjective experiments

In order to cope with the high number of IFs, subjective experiments which aim at quantifying QoE are usually carried out under controlled conditions in a laboratory environment, following standardized methodologies (ITU-T 2003, 2008; ITU-R 2012) in order to obtain quality ratings for different types of media and applications. These methodologies have been designed with consistency and reproducibility in mind, which allow results to be comparable across studies done in similar conditions. For the most part, these methodologies result in MOS ratings, along with standard deviation and confidence intervals, whereas even early application guidelines [such as the ones given in the ITU-T Handbook on Telephonometry (ITU-T 1992)] already state that the consideration of distributions of subjective ratings would be more appropriate, given the characteristics of the obtained ratings.

Regarding the Context IFs, the idea of laboratory experiments is to keep the usage context as constant as possible between the participants of an experiment. This is commonly achieved by designing a common test task, e.g. perceiving pre-recorded stimuli and providing a quality judgment task, with or without a parallel (e.g. content-transcription) task, or providing scenarios for conversational tasks (ITU-T 2007). A context effect within the test results from presenting different test conditions (e.g. test stimuli) in a sequence, so that the previous perception process sets a new reference for the following process. This effect can partially be ruled out by factorial designs, distributing test conditions across participants in a mostly balanced way, or (approximately) by simple randomization of test sequences. Another context effect results from the rating scales which are used to quantify the subjective responses.

System IFs also carry an influence on the test outcome, in terms of the selection of test conditions chosen for a particular test (session). It is commonly known that a medium-quality stimulus will obtain a relatively bad judgment in a test where all the other stimuli are of better quality; in turn, the same stimulus will get a relatively positive judgment if it is nested in a test with only low-quality stimuli. This impact of the test conditions was ruled out in the past by applying the same stimuli with known “reference degradations” in different tests. In speech quality evaluation, for example, the Modulated Noise Reference Unit (MNRU) was used for this purpose (ITU-T 1996).

Service provider’s interest in QoE metrics

In order to stay in business in a free market, ISPs and other service providers need to keep a large portion of their users satisfied, lest they stop using the service or change providers—the dreaded “churn” problem. For any given service level the provider can furnish, there will be a certain proportion of users who might find it unacceptable, and the perceived quality of the service is one of the key factors determining user churn (Kim and Yoon 2004). Moreover, a large majority (\(\sim 90\,\%\)) of users will simply defect from a service provider without even complaining to them about service quality, and report their bad experience within their social circles (Soldani et al. 2006), resulting in a possibly even larger business impact in terms of e.g., brand reputation. With only a mean value as an indicator for QoE, such as the MOS, the service provider cannot know what this number of unsatisfied users might be, as user variation is lost in the averaging process.

For many applications, however, it is desirable to gauge the portion of users that is satisfied given a set of conditions (e.g., under peak-time traffic, for an IPTV service). For example, a service provider might want to ensure that at least, say, 95 % of its users find the service acceptable or better. In order to ascertain this, some knowledge of how the user ratings are distributed for any given condition is needed. In particular, calculating the corresponding quantile (in this example, checking that the 5 % quantile of the ratings lies at or above the acceptance threshold) would be sufficient for the provider.

In the past, service providers have also based their planning on (estimated) percentages of users judging a service as “poor or worse” (\(\%\mathrm {PoW}\)), “good or better” (\(\%\mathrm {GoB}\)), or the percentage of users abandoning a service (Terminate Early, \(\%\mathrm {TME}\)). These percentages have been calculated from MOS distributions on the basis of large collections of subjective test data, or of customer surveys. Whereas the original source data is proprietary in most cases, the resulting distributions and transformation laws have been published in some instances. One of the first service providers to do this was Bellcore (ITU-T 1993), who provided transformation laws between an intermediate variable, called the Transmission Rating R, and \(\%\mathrm {PoW}\), \(\%\mathrm {GoB}\) and \(\%\mathrm {TME}\). These transformations were further extended to other customer behavior predictions, like retrial (to use the service again) and complaints (to the service provider). The Transmission Rating could further be linked to MOS predictions, and in this way a link between MOS, \(\%\mathrm {PoW}\) and \(\%\mathrm {GoB}\) could be established. The E-model, a parametric model for planning speech telephony networks, took up this idea and slightly modified the Transmission Rating calculation and the transformation rules between R and MOS, see ETSI (1996). The resulting links can be seen in Fig. 4. Such links can be used for estimating the percentage of dissatisfied users from the ratings of a subjective laboratory test; there is, however, no guarantee that similar numbers would be observed with the real service in the field. In addition, the subjective data the links are based on mostly stem from the 1970–1980s; establishing such links anew, and for new types of services, is thus highly desirable.

In an attempt to go beyond user satisfaction and into user acquisition, many service providers have turned to the Net Promoter Score (NPS)Footnote 2, which purports to classify users into “promoters” (enthusiastic users likely to keep buying the service and “promoting growth”), “passives” (users that are apathetic towards the service and might churn if a better offer from a competitor comes along) and “detractors” (vocal, dissatisfied users who can damage the service’s reputation). While popular with business people, the research literature on the NPS is critical of the reliability of such subjective assessments (e.g. Keiningham et al. 2007; de Haan et al. 2015). The NPS is based on a single-item questionnaire whereby a user is asked how likely they are to recommend the service or product to a friend or colleague, which might explain its shortcomings.

Fig. 4

Relationship between MOS, \(\%\mathrm {PoW}\) and \(\%\mathrm {GoB}\) as used in the E-model (ETSI 1996). The ratio of users rating neither poor or worse nor good or better is referred to as ‘neutral’ and is computed as \(1 - \%\mathrm {GoB} - \%\mathrm {PoW}\)

Definition of QoE metrics

The key QoE metrics are defined in this section: the mean of the opinion scores (MOS); the standard deviation of opinion scores (SOS) reflecting the user diversity and its relation to MOS; the newly introduced \(\theta \)-acceptability as well as acceptance; the ratio of (dis-)satisfied users rating good or better \(\%\mathrm {GoB}\) and poor or worse \(\%\mathrm {PoW}\), respectively. The detailed formal definitions of the QoE metrics are added in the technical report (Hoßfeld et al. 2016).

Preamble

In this article we consider studies where users are asked their opinion on the overall quality (QoE) of a specific service. The subjects (the participants in a study that represent users), rate the quality as a quality rating on a quality rating scale. As a result, we obtain an opinion score by interpreting the results on the rating scale numerically. An example is a discrete 5-point scale with the categories \(1 \triangleq \) ‘bad’, \(2 \triangleq \) ‘poor’, \(3 \triangleq \) ‘fair’, \(4 \triangleq \) ‘good’, and \(5 \triangleq \) ‘excellent’, referred to as an Absolute Category Rating (ACR) scale (Möller 2000).

Expected value and its estimate: MOS

Let U be a random variable (RV) that represents the quality ratings, \(U \in \Omega \), where \(\Omega \) is the rating scale, which is also the state space of the random variable U. The RV U can be either discrete, with probability mass function \(f_s\), or continuous, with probability density function f(s) for rating score s. The estimated probability of opinion score s from the R user ratings \(U_i\) is

$$\begin{aligned} \hat{f}_s = \frac{1}{R} \sum _{i=1}^{R} \delta _{U_i,s} \end{aligned}$$
(1)

with the Kronecker delta \(\delta _{U_i,s}=1\) if user i is rating the quality with score s, i.e. \(U_i=s\), and 0 otherwise.

The Mean Opinion Score (MOS) is an estimate of E[U].

$$\begin{aligned} u = \hat{U} = \frac{1}{R} \sum _{i=1}^{R} U_i \end{aligned}$$
(2)
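Read numerically, Eqs. (1) and (2) amount to simple counting and averaging. A minimal Python sketch with a hypothetical vector of R = 8 ratings on a discrete 5-point scale:

```python
import numpy as np

ratings = np.array([4, 5, 3, 4, 2, 5, 4, 3])  # hypothetical ratings U_i, R = 8

def empirical_pmf(ratings, scale=(1, 2, 3, 4, 5)):
    """Eq. (1): relative frequency of each opinion score s."""
    R = len(ratings)
    return {s: np.sum(ratings == s) / R for s in scale}

def mos(ratings):
    """Eq. (2): the MOS is simply the sample mean of the opinion scores."""
    return np.mean(ratings)

print(empirical_pmf(ratings))  # {1: 0.0, 2: 0.125, 3: 0.25, 4: 0.375, 5: 0.25}
print(mos(ratings))            # 3.75
```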

SOS as function of MOS

In Hoßfeld et al. (2011), the minimum SOS, \(S^{-}(u)\), and the maximum SOS, \(S^{+}(u)\), were obtained as functions of the MOS u. The minimum SOS is \(S^{-}(u) = 0\) on a continuous scale, \([U^{-};U^{+}]\), and

$$\begin{aligned} S^{-}(u) = \sqrt{u (2 \lfloor u\rfloor +1)-\lfloor u\rfloor (\lfloor u\rfloor +1)-u^2} \end{aligned}$$
(3)

on a discrete scale, \(\{U^{-}, \ldots , U^{+}\}\).

The maximum SOS, on both continuous and discrete scales (as defined above), is

$$\begin{aligned} S^+(u)=\sqrt{-u^2+(U^{-} + U^+)u - U^{-} \cdot U^+} \end{aligned}$$
(4)

The SOS hypothesis (Hoßfeld et al. 2011) formulates a generic relationship between MOS and SOS values, independent of the type of service or application under consideration:

$$\begin{aligned} S(u) = \sqrt{a} \cdot S^{+}(u) \end{aligned}$$
(5)

Note that the SOS parameter a is scale invariant when linearly transforming the user ratings and computing MOS and SOS values for the transformed ratings. The SOS parameter therefore allows comparing user ratings across various rating scales: any linear transformation of the user ratings does not affect the SOS parameter a, which is formally proven in Appendix 2. However, it has to be clearly noted that if the participants are exposed to different scales, then different SOS parameters may be observed. This will be shown in “SOS hypothesis and modeling of complete distributions”, e.g. for the results on speech QoE in Fig. 12a. The parameter a depends on the application or service, and on the test conditions. It is derived from subjective tests; a few examples are included in “SOS hypothesis and modeling of complete distributions”.
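The bounds of Eqs. (3) and (4) and the hypothesis of Eq. (5) translate directly into code. A minimal sketch for a discrete 5-point scale; the MOS value is illustrative, and the SOS parameter a = 0.27 is the value fitted later for the web QoE study:

```python
import numpy as np

U_MIN, U_MAX = 1, 5  # discrete 5-point scale

def sos_min(u):
    """Eq. (3): minimum SOS for MOS u on a discrete scale."""
    fl = np.floor(u)
    return np.sqrt(u * (2 * fl + 1) - fl * (fl + 1) - u ** 2)

def sos_max(u):
    """Eq. (4): maximum SOS for MOS u (continuous or discrete scale)."""
    return np.sqrt(-u ** 2 + (U_MIN + U_MAX) * u - U_MIN * U_MAX)

def sos_hypothesis(u, a):
    """Eq. (5): SOS predicted by the SOS hypothesis with parameter a."""
    return np.sqrt(a) * sos_max(u)

u = 3.5   # illustrative MOS value
a = 0.27  # SOS parameter, e.g. as fitted for the web QoE study below
print(sos_min(u), sos_max(u), sos_hypothesis(u, a))  # 0.5  1.936  1.006
```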

\(\theta \)-Acceptability

For service providers, acceptance is an important metric to plan, dimension and operate their services. Therefore, we would like to establish a link between opinion measurements from subjective QoE studies and behavioral measurements. In particular, it would be very useful to derive the “accept” behavioral measure from opinion measurements of existing QoE studies. This would make it possible to reinterpret existing QoE studies from a business-oriented perspective. Therefore, we introduce the notion of \(\theta \)-acceptability, which is based on opinion scores.

The \(\theta \)-acceptability, \(\mathbb {A}_{\theta }\), is defined as the probability that the opinion score is above a certain threshold \(\theta \), \(P(U \ge \theta )\), and can be estimated by \(\hat{f}_s\) from Eq. (1) or by counting all user ratings \(U_i \ge \theta \) out of the R ratings.

$$\begin{aligned} \mathbb {A}_{\theta } = \int _{s = \theta }^{U^+} \hat{f}_s ds = \frac{1}{R} \left| \{U_i \ge \theta : i = 1, \dots , R \} \right| \end{aligned}$$
(6)
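For a finite sample of ratings, the estimator in Eq. (6) reduces to counting. A minimal sketch (the ratings and the threshold are hypothetical):

```python
import numpy as np

def theta_acceptability(ratings, theta):
    """Eq. (6): fraction of ratings U_i >= theta."""
    return np.mean(np.asarray(ratings) >= theta)

ratings = [4, 5, 3, 4, 2, 5, 4, 3]  # hypothetical opinion scores on a 5-point scale
print(theta_acceptability(ratings, theta=4))  # 0.625 -> 62.5 % rate 'good or better'
```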

Acceptance

When a subject is asked to rate the quality as either acceptable or not acceptable, this means that U is Bernoulli-distributed. The quality ratings are then samples of \(U_i \in \{0,1 \}\), where 1 \(\triangleq \) ‘accepted’ and 0 \(\triangleq \) ‘not accepted’. The probability of acceptance is then \(f_u = P(U=u)\), \(U\in \{0,1\}\), and can be estimated by Eq. (1) with \(u=1\):

$$\begin{aligned} \hat{f}_1 = \frac{1}{R} \sum _{i=1}^{R} \delta _{U_i,1} \end{aligned}$$
(7)

(this is equal to \(\mathbb {A}_{1}\) in Eq. (6) with \(U^{-}=0\) and \(U^{+}=1\) on a discrete scale).

\(\%\mathrm {GoB}\) and \(\%\mathrm {PoW}\)

Section “Service provider’s interest in QoE metrics” describes the use of the percentage of Poor-or-Worse (\(\%\mathrm {PoW}\)) and Good-or-Better (\(\%\mathrm {GoB}\)). These are quantile levels in the distribution of the quality rating U, or in the empirical distribution of \(\mathcal {U} = \{ U_i \}\).

The two terms are used in the E-model (ETSI 1996), where the RV of the quality rating, \(U \in [0;100]\), refers to the Transmission Rating R, which represents an objective (estimated) rating of the voice quality. The E-model assumes a normal distribution of the ratings, standardized as \(U \sim N(0,1)\), the standard normal distribution.

Under this assumption, the measures have been defined asFootnote 3

$$\begin{aligned} \mathrm {GoB} (u) &= F_U \left( \frac{u - 60}{16} \right) = P_U\left( U\ge 60 \right) \end{aligned}$$
(8)
$$\begin{aligned} \mathrm {PoW} (u) &= F_U \left( \frac{45 - u}{16} \right) = P_U\left( U\le 45 \right) \end{aligned}$$
(9)

The E-model also defines a transformation of U onto a continuous MOS scale \( \in [1;4.5]\), by the following relation:

$$\begin{aligned} MOS(u) = 7 \,u \, (u-60)(100-u) \, 10^{-6} + 0.035 \,u+1 \end{aligned}$$
(10)

The plot of (continuous) MOS (\(\in [1;4.5]\)) in Fig. 4 is an example where this transformation has been applied to map the MOS to \(\%\mathrm {GoB}\) and \(\%\mathrm {PoW}\). Observe that \(\%\mathrm {GoB}\) + \(\%\mathrm {PoW}\) does not add up to 100 %, because the probability (denoted “neutral” in the figure), \(P(45 < U < 60)\), is included in neither \(\%\mathrm {PoW}\) nor \(\%\mathrm {GoB}\). The thresholds used for the two measures (i.e. 45 and 60), and the assumed standard normal distribution, were chosen as a result of a large number of subjective audio quality tests conducted while developing the E-model (ETSI 1996). Table 1 includes the MOS and the Transmission Rating R, with their corresponding valuesFootnote 4 of \(\%\mathrm {PoW}\) and \(\%\mathrm {GoB}\).

Table 1 E-model: MOS and transmission Rating R with the quantile measures for speech quality
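As a numerical illustration of Eqs. (8)–(10), the sketch below evaluates the E-model relations, assuming the standard-normal reading of \(F_U\) (i.e. \(\mathrm{GoB}(R)=\Phi ((R-60)/16)\) and \(\mathrm{PoW}(R)=\Phi ((45-R)/16)\)); the chosen R values are illustrative:

```python
from scipy.stats import norm

def gob(R):
    """Eq. (8): probability of users rating 'good or better' in the E-model."""
    return norm.cdf((R - 60) / 16)

def pow_(R):
    """Eq. (9): probability of users rating 'poor or worse' in the E-model."""
    return norm.cdf((45 - R) / 16)

def mos(R):
    """Eq. (10): E-model mapping from Transmission Rating R to MOS in [1; 4.5]."""
    return 7 * R * (R - 60) * (100 - R) * 1e-6 + 0.035 * R + 1

for R in (45, 60, 80):
    print(f"R={R}: MOS={mos(R):.2f}  GoB={gob(R):.2%}  PoW={pow_(R):.2%}")
# mos(60) ~ 3.1 and mos(45) ~ 2.32: these are the MOS-scale thresholds
# theta_gb and theta_pw discussed in the text below.
```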

The measures are estimated based on the ordered set of quality ratings, \(\mathcal {U} = \{ U^{(i)} \}\), by using the \(\theta \)-acceptability estimator from Eq. (6). First, discretise the quality rating scale to \(\{0,\ldots ,100\}\). Then, using Eq. (6), the following applies

$$\begin{aligned} \hat{\%\mathrm {GoB}}&= \mathbb {A}_{\theta _{gb}} \end{aligned}$$
(11)
$$\begin{aligned} \hat{\%\mathrm {PoW}}&= 1-\mathbb {A}_{\theta _{pw}} \end{aligned}$$
(12)

For example, in the E-model \(\theta _{gb}=60\) and \(\theta _{pw}=45\) for \(\mathcal {U} \in \{0,\ldots ,100\}\), and \(\theta _{gb}=3.1\) and \(\theta _{pw}=2.3\) on a \([1;5]\) MOS scale (when using Eq. 10).
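Correspondingly, Eqs. (11) and (12) become simple counting operations. A minimal, self-contained sketch with hypothetical ratings on the [0; 100] scale and the E-model thresholds quoted above:

```python
import numpy as np

def theta_acceptability(ratings, theta):
    """Eq. (6): fraction of ratings U_i >= theta."""
    return np.mean(np.asarray(ratings) >= theta)

def gob_hat(ratings, theta_gb):
    """Eq. (11): estimated %GoB."""
    return theta_acceptability(ratings, theta_gb)

def pow_hat(ratings, theta_pw):
    """Eq. (12): estimated %PoW, i.e. the fraction of ratings below theta_pw."""
    return 1 - theta_acceptability(ratings, theta_pw)

ratings = [72, 55, 38, 90, 61, 47, 30, 66]  # hypothetical ratings on the [0; 100] scale
print(gob_hat(ratings, theta_gb=60))         # 0.5
print(pow_hat(ratings, theta_pw=45))         # 0.25
```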

The purpose of the example above is to demonstrate GoB and PoW using an ACR scale (1–5). This is a theoretical exercise (valid for the E-model) where we apply the transformation from R to “MOS” (term used when E-model was introduced) as given in Eq. (10), and transform Eqs. (8), (9) into Eqs. (11), (12), using the notation introduced in Sect. “\(\theta \)-Acceptability”. Samples from Eq. (10) are given in Table 1. The \(\%\mathrm {GoB} = P(R \ge 60)\) corresponds to \(\%\mathrm {GoB} = P(\text {MOS} \ge 3.1)\) which on an integer scale is \(\%\mathrm {GoB} = P(\text {MOS} \ge 4)\). Correspondingly, for \(\%\mathrm {PoW} = P(R \le 45) = P(\text {MOS} \le 2.32) = P(\text {MOS} \le 2)\).

It is important to note that the quantiles in the examples are valid for speech quality tests under the assumptions given in the E-model. The mapping of the MOS to the \(\%\mathrm {PoW}\) and \(\%\mathrm {GoB}\) metrics in Table 1 are specific for this E-model, but the \(\%\mathrm {PoW}\) and \(\%\mathrm {GoB}\) metrics are general and can be obtained from any quality study, provided that the thresholds \(\theta _{gb}\) and \(\theta _{pw}\) are determined.

In the following we demonstrate the use of \(\%\mathrm {PoW}\) and \(\%\mathrm {GoB}\) metrics also for other quality tests.

Application to real data sets: some examples

Overview on selected applications and subjective studies

The presented QoE measures are applied to real data sets available in the literatureFootnote 5. We did not conduct new subjective studies, but rather used the opinion scores from the existing studies to apply the QoE measures and interpret the results in a novel way, obtaining a deeper understanding of them and comparing MOS values to other quantities. To cover a variety of relevant applications, we consider speech, video, and web QoE. The example studies highlight which conclusions can be drawn from measures beyond the MOS, such as SOS, quantiles, or \(\theta \)-acceptability. The limitations of MOS become clear from the results. These additional insights are valuable e.g., to service providers to properly plan or manage their systems.

Section “θ-Acceptability derived from user ratings” focuses on the link between acceptance and opinion ratings. The study considers web QoE; however, users have to complete a certain task when browsing. Test subjects are asked to rate the overall quality as well as to answer an acceptance question. This allows us to investigate the relation between MOS, acceptance, \(\theta \)-acceptability, \(\%\mathrm {GoB}\), and \(\%\mathrm {PoW}\) based on the subjects’ opinions. The relation between acceptance as a behavioral measure and overall quality as an opinion measure is particularly interesting. To wit, it would be very useful to be able to derive the “accept” behavioral measure from QoE studies and subjects’ opinions. This would provide a powerful tool to re-interpret existing QoE studies from a different, more business-oriented perspective.

Section “%GoB and %PoW: ratio of (dis-)satisfied users” investigates the ratio of (dis-)satisfied users. The study on speech quality demonstrates the impact of rating scales and compares \(\%\mathrm {PoW}\) and \(\%\mathrm {GoB}\) related to MOS when subjects are rating on a discrete and a continuous scale. The results are also checked against the E-model to analyze its validity when linking overall quality (MOS) to those quantities. Additional results for web QoE can be found in "Experimental Setup for Task-Related Web QoE", "Speech Quality on Discrete and Continuous Scale", "Web QoE and Discrete Rating Scale" in Appendix 1 (Fig. 13). In this subjective study on web QoE, page load times are varied while subjects are viewing a simple web page. The web QoE results confirm the gap between the \(\%\mathrm {GoB}\) and \(\%\mathrm {PoW}\) estimates (as defined e.g. for speech QoE by the E-model), and the measured \(\%\mathrm {GoB}\) and \(\%\mathrm {PoW}\).

Section “SOS hypothesis and modeling of complete distributions” relates the diversity in user ratings in terms of SOS to MOS. Results from subjective studies on web, speech, and video QoE are analyzed. For the web QoE study, we find that the opinion scores can be very well approximated with a binomial distribution—which allows us to fully specify the voting distribution using only the SOS parameter a. For the video QoE study, a continuous rating scale was used and we find that the opinion scores follow a truncated normal distribution. Again, the SOS parameter a derived for this video QoE study then fully describes the distribution of opinion scores for any given MOS value. Thus, the SOS parameter allows modeling the entire distribution and then deriving measures such as quantiles. We highlight the discrepancy between quantiles and MOS, which is of major interest for service providers.

Section “Comparison of results” provides a brief comparison of the studies presented in the article. It serves mainly as an overview of interesting QoE measures beyond MOS and as a guideline on how to properly describe subjective studies and their results.

For the sake of completeness, the reader finds a detailed summary of the experimental descriptions in Appendix 1.

\(\theta \)-Acceptability derived from user ratings

The experiments in Schatz and Egger (2014) investigated task-related web QoE in conformance with ITU-T Rec. P.1501 (ITU-T 2013). In the campaign conducted, subjects were asked to carry out a certain task (e.g. ‘Browse to search for three recipes you would like to cook in the given section.’) on a certain cooking web page (cf. Table 3). The network conditions were changed and the impact of page load times during the web session was investigated. Besides assessing the overall quality of the web browsing session, subjects additionally answered an acceptance question. In particular, after each condition, subjects were asked to rate their overall experienced quality on a 9-point ACR scale (see Fig. 11), as well as to answer a binary acceptance question. The experiment was carried out in a laboratory environment, with 32 subjects.

Fig. 5

Task-Related Web QoE and Acceptance. Results of the task-related web QoE and acceptance study (Schatz and Egger 2014) in “θ-Acceptability derived from user ratings”. The data is based on a subjective lab experiment in which participants had to browse four different websites at different network speeds resulting in different levels of experienced responsiveness. The network speeds determined the page load times while browsing and executing a certain task. Defined tasks for each technical condition should stimulate the interaction between the web site and the subject, see Table 3. In total, there are 23 different test conditions in the data set. The overall quality for each test condition was evaluated by 10–30 subjects on a discrete 9-point scale which was subsequently mapped into a 5-point ACR scale. Furthermore, subjects gave their opinion on the acceptance (yes/no) of that test condition. a MOS & Acceptance per Condition. The blue bars in the foreground depict the MOS values per test condition on the left y-axis. The grey bars in the background depict the acceptance values for that test condition on the right axis. While the acceptance values reach the upper bound of 100 %, the maximum MOS observed is 4.39. The minimum MOS over all test conditions is 1.09, while the minimum acceptance ratio is 27.27 %. b Acceptance per Rating Category. The users are rating the overall quality on a 9-point ACR scale and additionally answer an acceptance question. All users who rate an arbitrary test condition with x are considered and the acceptance ratio y is computed. The plot shows how many users accept a condition and rate QoE with x. For each rating category 1,…,9, there are at least 20 ratings. Still, 20 % of the users accept the service, although the overall quality is bad. c %GoB-MOS Plot. The markers depict θ-acceptability \(P(U\ge \theta) \) depending on the MOS for θ = 3 ‘diamond’ and θ = 4 ‘triangle’, i.e. %GoB. The %GoB (solid line) overestimates the true ratio of users rating good or better (θ = 4). This can be adjusted by considering users rating fair or better, P(U ≥ 3), which is close to the %GoB estimation. In addition, the acceptance ratio ‘Square’ is plotted depending on the MOS. However, the θ-acceptability curves as well as the %GoB estimates do not match the acceptance curve. In particular, for the minimum MOS of 1.09, the θ-acceptability is 0 %, while the acceptance ratio is 27.27 %. d %PoW-MOS Plot. The markers depict the ratio of users not accepting a test condition ‘Square’ depending on the MOS for all 23 test conditions. The results are compared with %PoW estimation, but again the characteristics are not matched. Especially, 27.27 % of users are still accepting the service, although the MOS value is 1.09. The %PoW is close to 0 %. Nevertheless, this indicates that overall quality can be mapped roughly to other dimensions like ‘no acceptance’.

Figure 5 quantifies the acceptance and QoE results from the subjective study in Schatz and Egger (2014). This study also considered web QoE; however, users had to complete a certain task when browsing. The test subjects were asked to rate the overall quality as well as to answer an acceptance question. This allowed us to investigate the relation between MOS, acceptance, \(\theta \)-acceptability, \(\%\mathrm {GoB}\), and \(\%\mathrm {PoW}\) based on the subjects’ opinions.

Figure 5a shows the MOS and the acceptance ratio for each test condition. The blue bars in the foreground depict the MOS values on the left y-axis. The grey bars in the background depict the acceptance values on the right y-axis. While the acceptance values reach the upper bound of 100 %, the maximum MOS observed is 4.3929. The minimum MOS over all test conditions is 1.0909, while the minimum acceptance ratio is 27.27 %. These results indicate that users may tolerate significant quality degradation for web services, provided they are able to successfully execute their task. This result contrasts with e.g., speech services, where very low speech quality makes it almost impossible to have a phone call, and hence results in non-acceptance of the service. Accordingly, the \(\%\mathrm {PoW}\) estimator defined in the E-model is almost 100 % for low MOS values.

Figure 5b makes this even clearer. The plot shows how many users accept a condition and rate QoE with x for \(x=1,\,\ldots\,,9\). All users who rate an arbitrary test condition with x are considered and the acceptance ratio y is computed over those users. For each rating category \(1,\ldots ,9\), there are at least 20 ratings. Even when the quality is perceived as bad (‘1’), 20 % of the users accept the service. For category ‘2’, between ‘poor’ and ‘bad’ (see Fig. 11), up to 75 % accept the service at an overall quality which is at most ‘poor’.

Figure 5c takes a closer look at the relation between MOS and acceptance, \(\theta \)-acceptability, as well as the \(\%\mathrm {GoB}\) estimation as defined in “%GoB and %PoW”. The markers depict the \(\theta \)-acceptability \(P(U\ge \theta )\) depending on the MOS for \(\theta =3\) ‘\(\lozenge \)’ and \(\theta =4\) ‘\(\vartriangle \)’, i.e. \(\%\mathrm {GoB}\). The \(\%\mathrm {GoB}\) estimator (solid line) overestimates the true ratio of users rating good or better (\(\theta =4\)). This can be adjusted by considering users rating fair or better, \(P(U \ge 3)\), which is close to the \(\%\mathrm {GoB}\) estimator. In addition, the acceptance ratio ‘\(\square \)’ is plotted depending on the MOS. However, neither the \(\theta \)-acceptability curves nor the \(\%\mathrm {GoB}\) match the acceptance curve. In particular, for the minimum MOS of 1.0909, the \(\theta \)-acceptability is 0 %, while the acceptance ratio is 27.27 %.

The discrepancy between acceptance and the \(\%\mathrm {GoB}\) estimator is also rather large, see Fig. 5c. The estimator in the E-model maps a MOS value of 1 to a \(\%\mathrm {GoB}\) of 0 %, as a speech service is no longer usable if the QoE is too bad. In contrast, in the context of web QoE, a very bad QoE can still result in a usable service which is accepted by the end user. Thus, the user can still complete, for example, the task of finding a Wikipedia article, even though the page load time is rather high. This may explain why 20 % of the users accept the service even though they rate the QoE with bad quality (1).

We conclude that it is not generally possible to map opinion ratings on the overall quality to acceptance.Footnote 6 The conceptual difference between acceptance and the concept of \(\theta \)-acceptability is the following. In a subjective experiment, each user defines his or her own threshold determining when the overall quality is good enough to accept the service. Additional contextual factors like task or price strongly influence acceptance (Reichl et al. 2015). In contrast, \(\theta \)-acceptability considers a globally defined threshold (e.g. defined by the ISP) which is the same for all users. Results that are only based on user ratings do not reflect user acceptance, although the correlation is quite high (Pearson’s correlation coefficient of 0.9266).

Figure 5d compares acceptance and \(\%\mathrm {PoW}\). The markers depict the ratio of users not accepting a test condition ‘\(\square \)’ depending on the MOS for all 23 test conditions. The \(\%\mathrm {PoW}\) is a conservative estimator of the ‘no acceptance’ characteristics. In particular, 27.27 % of users are still accepting the service, although the MOS value is 1.0909, while the \(\%\mathrm {PoW}\) is close to 0 %. This indicates that overall quality can only be roughly mapped to other dimensions like ‘no acceptance’.

\(\%\mathrm {GoB}\) and \(\%\mathrm {PoW}\): Ratio of (dis-)satisfied users

The opinion ratings of the subjects on speech quality are taken from Köster et al. (2015). The listening-only experiments were conducted with 20 subjects in an environment fulfilling the requirements in ITU-T Rec. P.800 (ITU-T 2003), using the source speech material in Gibbon (1992). The subjects assessed the same test stimuli on two different scales: the ACR scale (Fig. 6) and the extended continuous scale (Fig. 7). To be more precise, each subject used both scales during the experiment. The labels were internally assigned to numbers of the interval [0,6] in such a manner that the attributes corresponding to ITU-T Rec. P.800 were exactly assigned to the numbers \(1,\ldots ,5\).

Fig. 6

Five point discrete quality scale as used for the speech QoE experiments (Köster et al. 2015)

Fig. 7

Five point continuous quality scale as used for the speech QoE experiments (Köster et al. 2015)

Fig. 8

Speech QoE Results of the speech QoE study Köster et al. (2015). For the 86 test conditions, the MOS and \(\%\mathrm {PoW}\), \(\%\mathrm {GoB}\) values were computed over the 20 subjects for the discrete 5-point ACR scale (Fig. 6) and the extended continuous scale (Fig. 7). The results for the discrete scale are marked with ‘square’, while the QoE measures for the continuous scale are marked with ‘diamond’. The dashed lines represent logistic fitting functions of the subjective data. a %POW-MOS Plot. The markers depict the MOS and the ratio \(P(U\le 2 )\) from the subjective study on the discrete and the continuous scale. The solid black line shows the %PoW ratio depending on MOS for the E-model. The E-model underestimates the measured %PoW on the discrete scale which is larger than the %PoW on the continuous scale. b %GoB-MOS Plot. The markers depict the MOS and the ratio \(P(U\ge 4 )\) from the subjective study on the discrete and the continuous scale. The solid black line shows the %GoB ratio depending on MOS for the E-model. The E-model overestimates the ratio of satisfied users on the discrete scale which is smaller than the %GoB on the continuous scale

Figure 8a investigates the impact of the rating scale on the ratio of dissatisfied users. For 86 test conditions, the MOS, \(\%\mathrm {PoW}\), and \(\%\mathrm {GoB}\) values were computed over the opinions from the 20 subjects on the discrete rating scale and the continuous rating scale. The results for the discrete scale are marked with ‘\(\square \)’, while the QoE measures for the continuous scale are marked with ‘\(\lozenge \)’.

Although the MOS is larger than 3, about 30 and 20 % of the users are not satisfied, rating poor or worse, on the discrete and the continuous scale, respectively. The results are also checked against the E-model to analyze its validity when linking overall quality (MOS) to \(\%\mathrm {PoW}\). We consider the ratio \(P(U \le 2)\) of users rating a test condition poor or worse. For that test condition, the MOS value is computed, and each marker in Fig. 8a represents the measurement tuple (MOS, \(P(U \le 2)\)) for a certain test condition. In addition, a logistic fitting is applied to the measurement values, depicted as a dashed line. It can be seen that the ratio \(\%\mathrm {PoW}\) of the subjects on the discrete rating scale is always above the E-model (solid curve). The maximum difference between the logistic fitting function and the E-model is 13.78 % at MOS 2.2867. Thus, the E-model underestimates the measured \(\%\mathrm {PoW}\) for the discrete scale.

For the continuous rating scale, the ratio \(P(U\le 2)\) is below the E-model. However, we can determine the parameter \(\theta \) in such a way that the mean squared error (MSE) between the \(\%\mathrm {PoW}\) of the E-model and the subjective data \(P(U \le \theta )\) is minimized. In the appendix, Fig. 12b shows the MSE for different realizations of \(\theta \). The value \(\theta =2.32 > 2\) leads to a minimum MSE regarding \(\%\mathrm {PoW}\). The E-model overestimates the measure \(\%\mathrm {PoW}\), i.e. \(P(U \le 2)\), for the continuous scale. However, \(P(U \le \theta )\) leads to a very good match with the E-model.
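The described \(\theta \) search can be sketched as follows; the per-condition ratings below are purely hypothetical, and the E-model \(\%\mathrm {PoW}\)-versus-MOS curve is reconstructed by numerically inverting Eq. (10) and applying Eq. (9):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def emodel_mos(R):
    """Eq. (10): E-model mapping from Transmission Rating R to MOS."""
    return 7 * R * (R - 60) * (100 - R) * 1e-6 + 0.035 * R + 1

def emodel_pow_from_mos(mos):
    """E-model %PoW as a function of MOS: invert Eq. (10), then apply Eq. (9)."""
    R = brentq(lambda r: emodel_mos(r) - mos, 0, 100)  # unique root for MOS in (1, 4.5)
    return norm.cdf((45 - R) / 16)

def best_theta(conditions, thetas):
    """Grid search for the threshold theta minimizing the MSE between the empirical
    P(U <= theta) per test condition and the E-model %PoW at the condition's MOS."""
    def mse(theta):
        return np.mean([(np.mean(u <= theta) - emodel_pow_from_mos(np.mean(u))) ** 2
                        for u in conditions])
    return min(thetas, key=mse)

# hypothetical per-condition ratings on the continuous scale (one array per condition)
rng = np.random.default_rng(0)
conditions = [np.clip(rng.normal(loc=m, scale=0.8, size=20), 0, 5)
              for m in (1.8, 2.5, 3.2, 3.8, 4.3)]
print(best_theta(conditions, thetas=np.linspace(1.5, 3.0, 16)))
```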

In a similar way, Fig. 8b investigates the \(\theta \)-acceptability and compares the results with \(\%\mathrm {GoB}\) of the E-model. Even when the MOS is around 4, the subjective results show that the ratio of users rating good or better is only 80 and 70 % on the discrete and the continuous scale, respectively. The E-model overestimates the ratio \(P(U \ge 4\)) of satisfied users rating good or better on the discrete scale. The maximum difference between the logistic fitting function and the \(\%\mathrm {GoB}\) of the E-model is 17.49 % at MOS 3.3379. For the continuous rating scale, the E-model further overestimates the ratio of satisfied users, with the maximum difference being 46.20 % at MOS 3.4862. The value \(\theta ={3.0140}\) leads to a minimum MSE between the E-model and \(P(U \ge \theta )\) on the continuous scale, as numerically derived from Fig. 12b. Thus, for the speech QoE study, the \(\%\mathrm {GoB}\) of the E-model corresponds to the ratio of users rating fair or better.

In summary, the E-model does not match the results from the speech QoE study for \(\mathrm {PoW}\), i.e. \(P(U\le 2)\), and \(\mathrm {GoB}\), i.e. \(P(U\ge 4)\), on either rating scale. The results on the discrete rating scale lead to a higher ratio of dissatisfied users rating poor or worse than a) the \(\%\mathrm {PoW}\) of the E-model and b) the \(\%\mathrm {PoW}\) for the continuous scale. The \(\%\mathrm {GoB}\) of the E-model overestimates the \(\%\mathrm {GoB}\) on the discrete and the continuous scale.Footnote 7 Thus, in order to understand the ratio of satisfied and dissatisfied users it is necessary to compute those QoE metrics for each subjective experiment, since the E-model does not match all subjective experiments. Due to the non-linear relationship between MOS and \(\theta \)-acceptability, the additional insights become evident. For service providers, the \(\theta \)-acceptability makes it possible to go beyond the ‘average’ user in terms of MOS and to derive the ratio of satisfied users with ratings larger than \(\theta \).

SOS hypothesis and modeling of complete distributions

We relate the SOS values to MOS values and show that the entire distribution of user ratings for a certain test condition can be modeled by means of the SOS hypothesis. A discrete and a continuous rating scale lead to a discrete and a continuous distribution, respectively.

Results for web QoE on a discrete rating scale

Figure 9 shows the results of the web QoE study (Hoßfeld et al. 2011). In the study, the page load time was influenced for each test condition and 72 subjects rated the overall quality on a discrete 5-point ACR scale. Each user viewed 40 web pages with different images on the page and page load times (PLTs) from 0.24 to 1.2 s, resulting in 40 test conditions per user.Footnote 8 For each test condition, MOS and SOS are computed over the opinions of the 72 subjects. As users conducted the test remotely, excessively high page load times might have caused them to cancel or restart the test. In order to avoid this, a maximum PLT of 1.2 s was chosen. As a result, the minimum MOS value observed is 2.1111 for the maximum PLT.

Figure 9a shows the relationship between SOS and MOS and reveals the diversity in user ratings. The markers ‘\({\square }\)’ depict the tuple (MOS,SOS) for each of the 40 test conditions. For a given MOS the individual user rating is relatively unpredictable due to the user rating diversity (in terms of standard deviation).

Fig. 9

Web QoE for PLT only. Results of the web QoE study (Hoßfeld et al. 2011). The page load time was influenced for each test condition and 72 subjects rated the overall quality on a discrete 5-point ACR scale. Each user viewed 40 web pages with different images on the page and PLTs from 0.24 to 1.2 s, resulting in 40 test conditions per user. For each test condition, the MOS, SOS, as well as 10 and 90 %-quantiles are computed over the opinions of the 72 subjects. a SOS-MOS Plot. The markers ‘square’ depict the tuple (MOS, SOS) for each of the 40 test conditions. The solid blue line shows the SOS fitting function with the SOS parameter a = 0.27. The resulting MSE is 0.01. We observe that the measurements can be well approximated by a binomial distribution with a = 0.25 (MSE = 0.01) plotted as dashed curve. The solid black curve depicts the maximum SOS. b Quantile-MOS Plot. The 10 and 90 %-quantiles ‘square’ for the web browsing study as well as the MOS ‘filled diamond’ are given for the different test conditions (increasingly sorted by MOS). There are strong differences between the MOS and the quantiles. The maximum difference between the 90 %-quantile and MOS is 4 − 2.14 = 1.86. The quantiles for the shifted binomial distribution ‘filled circle’ are also given, which match the empirically derived quantiles

The results in Fig. 9a confirm the SOS hypothesis; the SOS parameter is obtained by minimizing the least squared error between the subjective data and Eq. 5. As a result, a SOS parameter of \(\tilde{a}=0.27\) is obtained. The mean squared error between the subjective data and the SOS hypothesis (solid curve) is close to zero (MSE 0.0094), indicating a very good match. In addition, the MOS-SOS relationship for the binomial distribution \((a_B=0.25)\) is plotted as dashed line. To be more precise, if user ratings U follow a binomial distribution for each test condition, the SOS parameter is \(a_B=0.25\) on a 5-point scale. The parameters of the binomial distribution per test condition are given by the fixed number \(N=4\) of rating scale steps and the MOS value \(\mu \), which determines \(p=(\mu -1)/N\). Since the binomial distribution is defined for values \(x=0,\ldots ,N\), the distribution is shifted by one to obtain user ratings on a discrete 5-point scale from 1 to 5. Thus, for a test condition, the user ratings U follow the shifted binomial distribution with \(N=4\) and \(p=(\mu -1)/N\) for a MOS value \(\mu \), i.e. \(U \sim B(N,(\mu -1)/N) + 1\) and \(P(U=i)=\left( {\begin{array}{c}N\\ i-1\end{array}}\right) p^{i-1}(1-p)^{N-i+1}\) for \(i=1,\ldots ,N+1\) and \(\mu \in [1;5]\).

We observe that the measurements can be well approximated by a binomial distribution with \(a_B=0.25\) (MSE = 0.0126), plotted as dashed curve. The SOS parameter of the measurement data is only a factor of \(\sqrt{\frac{a}{a_B}}=1.04\) higher than the SOS for the binomial distribution. The SOS parameter a is thus a powerful tool for selecting appropriate distributions of the user opinions. In the study here, we observe roughly \(a=0.25\) on a discrete 5-point scale, which means that the distribution follows the aforementioned shifted binomial distribution. Thus, for any MOS value, the entire distribution (and deducible QoE metrics like quantiles) can be derived.
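A sketch of this shifted binomial model: given only a MOS value \(\mu \) on the 5-point scale (the value below is illustrative), the full rating distribution and its quantiles can be reconstructed:

```python
import numpy as np
from scipy.stats import binom

N = 4  # number of rating steps above the minimum on a 5-point scale

def shifted_binomial_pmf(mu):
    """P(U = i) for i = 1..5 when U ~ B(N, (mu - 1)/N) + 1 (SOS parameter a = 0.25)."""
    p = (mu - 1) / N
    return {i: binom.pmf(i - 1, N, p) for i in range(1, N + 2)}

def quantile(pmf, alpha):
    """Smallest score whose cumulative probability reaches alpha."""
    cum = 0.0
    for score in sorted(pmf):
        cum += pmf[score]
        if cum >= alpha:
            return score

mu = 3.2  # illustrative MOS value
pmf = shifted_binomial_pmf(mu)
print({k: round(v, 3) for k, v in pmf.items()})
print("10%-quantile:", quantile(pmf, 0.10), " 90%-quantile:", quantile(pmf, 0.90))
```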

Figure 9b shows the measured \(\alpha \)-quantiles ‘\({\square }\)’ as well as the quantiles from the binomial distribution ‘\({\bullet }\)’ compared to the MOS values ‘\({\blacklozenge }\)’. The quantiles for the shifted binomial distribution ‘\({\bullet }\)’ match the empirically derived quantiles very well. The 10 and 90 %-quantiles quantify the opinion score of the 10 % most critical and the 10 % most satisfied users, respectively. There are strong differences between the MOS and the quantiles. The maximum difference between the 90 %-quantile and MOS is \(4-2.14=1.86\). For the 10 %-quantile, we observe a similarly strong discrepancy, \(2.903-1=1.903\).

This information, while very significant to service providers, is masked out by the averaging used to calculate MOS values. As a conclusion from the study, we recommend reporting different quantities beyond the MOS to fully understand the meaning of the subjective results. While the SOS values reflect the user diversity, the quantiles help to understand the fraction of users with very bad (e.g. 10 % quantile) or very good quality perception (e.g. 90 % quantile).

Results for video QoE on a continuous rating scale

Figure 10 shows the results of the video QoE study (De Simone et al. 2009). A continuous rating scale from 0 to 5 (cf. Fig. 14) was used. The two labs where the study was carried out are denoted as “EPFL” and “PoLiMi” in the result figures. The packet loss in the video transmission was varied in \(p_L \in \{0;0.1;0.4;1;3;5;10\}\) (in %) for four different videos. In total, 40 subjects assessed 28 test conditions. The MOS, SOS, as well as the 10 and 90 %-quantiles were computed for each test condition over all 40 subjects from both labs. More details on the setup can be found in "Experimental setup for task-related Web QoE", "Speech quality on discrete and continuous scale", "Web QoE and discrete rating scale", "Video QoE and continuous rating scale" in Appendix 1.

Fig. 10

Video QoE. Results of the video QoE study (De Simone et al. 2009). A continuous rating scale from 0 to 5, cf. Fig. 14, was used in the experiments for subjects evaluating the quality of videos transmitted over a noisy channel (De Simone et al. 2009). The study was repeated in two different labs denoted as ‘EPFL’ and ‘PoLiMi’ in the result figures. The packet loss in the video transmission was varied in \(p_L \in \{0; 0.1;0.4;1;3;5;10\}\) (in %) for four different videos. In total, 40 subjects evaluated 28 test conditions. a SOS-MOS Plot. The markers depict the tuple (MOS, SOS) for each of the 28 test conditions (PoliMi ‘square’ and EPFL ‘diamond’). The dashed lines show the SOS fitting functions with the corresponding SOS parameters for the two labs, which are almost identical. When merging the results from both labs, we arrive at the SOS parameter a = 0.10. The diversity is lower than for web QoE: subjects are more sure how to rate an impaired video, while the impact of temporal stimuli, i.e. PLT for web QoE, is more difficult for subjects to evaluate. The solid black curve depicts the maximum SOS. b Quantile-MOS Plot. The markers depict the empirically derived 90 %-quantiles ‘filled circle’ and 10 %-quantiles ‘open circle’, respectively. Furthermore, we plot the quantiles depending on MOS for user ratings following a truncated normal distribution and SOS parameter a = 0.1, 0.5, 1. The SOS hypothesis returns for each MOS value μ the related SOS value σ, which allows computing the quantiles of the truncated normal distribution, i.e. U ~ N(μ; σ; 0; 5). The solid and dashed lines depict the 90 and 10 %-quantiles, respectively

Figure 10a provides a SOS-MOS plot. The markers depict the tuple (MOS, SOS) for each of the 28 test conditions (PoLiMi ‘\({\square }\)’ and EPFL ‘\({\lozenge }\)’). The dashed lines show the SOS fitting function with the corresponding SOS parameters for the two labs, which are almost identical. When merging the results from both labs, we arrive at the SOS parameter \(a=0.10\). Due to the user diversity, we of course observe positive SOS values for any test condition (the theoretical minimum SOS is zero for the continuous scale), but the diversity is lower than for web QoE. Subjects are presumably more confident on (or familiar with) how to rate an impaired video, while the impact of a temporal stimulus, i.e. PLT for web QoE, is more difficult to evaluate.

For each test condition, we observe a MOS value and the corresponding SOS value according to the SOS parameter. We fit the user ratings per packet loss ratio with a truncated normal distribution in [0; 5] with the measured mean \(\mu \) (MOS) and standard deviation \(\sigma \) (SOS). Thus, the user ratings U follow the truncated normal distribution, i.e. \(U \sim N(\mu ;\sigma ;0;5)\) with \(U \in [0;5]\). We observe a very good match between the empirical CDF and the truncated normal distribution, see Fig. 15b in the appendix. This is neither obvious nor a trivial result: although the first two moments of both distributions are identical, the underlying distributions could still be very different, see “Motivation”. Thus, together with the SOS parameter a, the user rating distribution is completely specified for any MOS value \(\mu \) on the rating scale, i.e. \(\mu \in [0;5]\).
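
The following Python sketch illustrates, under these assumptions, how such a truncated normal model can be constructed from the measured MOS and SOS of a single test condition and compared against the empirical CDF. The example ratings are hypothetical, and note that scipy's truncnorm is parameterized via standardized truncation bounds.

import numpy as np
from scipy.stats import truncnorm

ratings = np.array([0.8, 1.5, 2.0, 2.2, 2.4, 2.9, 3.1, 3.6])  # hypothetical ratings on the continuous [0, 5] scale
mu, sigma = ratings.mean(), ratings.std(ddof=1)                # measured MOS and SOS of this test condition

# Truncated normal U ~ N(mu, sigma; 0, 5); scipy expects standardized truncation bounds.
a_std, b_std = (0 - mu) / sigma, (5 - mu) / sigma
model = truncnorm(a_std, b_std, loc=mu, scale=sigma)

# Compare the empirical CDF with the model CDF on a grid of rating values.
grid = np.linspace(0, 5, 11)
ecdf = np.array([(ratings <= x).mean() for x in grid])
print(np.round(ecdf - model.cdf(grid), 3))                     # small values indicate a good match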

Figure 10b shows the quantiles as a function of MOS. The filled ‘\(\bullet \)’ and non-filled markers ‘\(\circ \)’ depict the empirically derived 90 and 10 %-quantiles for the 28 test conditions, respectively. Furthermore, we plot the quantiles depending on MOS for user ratings U following a truncated normal distribution and SOS parameter \(a=0.1, 0.5, 1\). Note that we measure \({a=0.096}\) in the experiments on video QoE. The SOS parameter 0.5 leads to \(\sqrt{\frac{0.5}{0.1}}={2.2361}\) times higher SOS values for an observed MOS. The SOS parameter 1 leads to the maximum possible SOS, which is 3.1623 times higher than in the subjective data. Due to the SOS hypothesis and a given SOS parameter a, we obtain for each MOS value \(\mu \) the related SOS value \(\sigma (\mu ;a)\), see (5). Thereby, a MOS value represents the outcome of a concrete test condition. The parameters \(\mu \) and \(\sigma \) are the input parameters of the truncated normal distribution \(U \sim N(\mu ;\sigma ;0;5)\), which allows us to compute its \(\alpha \)-quantiles. The solid and dashed lines depict the 90 and 10 %-quantiles, respectively. We observe that the truncated normal distribution corresponding to the SOS parameter \(a=0.1\) fits the empirical quantiles very well. With the information of the SOS parameter, the quantiles, etc., can be completely derived for any MOS value. Similarly to the discrete rating scale results from the web QoE study, we observe strong differences between the MOS and the quantiles when using a continuous rating scale. The maximum difference between the 90 %-quantile and MOS is \({3.623250}-{2.420470}={1.202780}\). Also on the continuous scale, the MOS masks out such meaningful information for providers.
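
A minimal sketch of how such quantile curves can be reproduced, assuming the SOS hypothesis on the [0, 5] scale in the form \(\sigma (\mu ;a)=\sqrt{a\,\mu \,(5-\mu )}\) (the exact form of Eq. (5) should be used instead if it differs):

import numpy as np
from scipy.stats import truncnorm

a = 0.1                                        # SOS parameter measured in the video QoE study
for mu in np.linspace(0.5, 4.5, 9):            # MOS values on the continuous [0, 5] scale
    sigma = np.sqrt(a * mu * (5 - mu))         # assumed SOS hypothesis sigma(mu; a) on the [0, 5] scale
    lo, hi = (0 - mu) / sigma, (5 - mu) / sigma
    q10 = truncnorm.ppf(0.10, lo, hi, loc=mu, scale=sigma)
    q90 = truncnorm.ppf(0.90, lo, hi, loc=mu, scale=sigma)
    print(f"MOS={mu:.2f}  10 %-quantile={q10:.2f}  90 %-quantile={q90:.2f}")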

Results for speech QoE—comparison between continuous and discrete rating scale

When comparing the SOS values from the web and video study, we observe that the discrete rating scale leads to higher SOS values than the continuous scale. However, the higher user diversity may also be caused by the application (Hoßfeld et al. 2011). Therefore, we briefly discuss the speech QoE study (as already discussed in Sect. “%GoB and %PoW: ratio of (dis-)satisfied users” and described in "Experimental setup for task-related Web QoE" and "Speech quality on discrete and continuous scale" in Appendix 1). Subjects rated the QoE for certain test conditions on both a discrete and a continuous scale, which allows a direct comparison.

As a result (cf. Fig. 12a), the SOS parameters \(a_d=0.23\) and \(a_c=0.12\) are obtained for the discrete and the continuous scales, respectively. For the discrete scale, we observe larger SOS values than for the continuous scale, which is also reflected in the larger SOS parameter \(a_d>a_c\). In particular, on the discrete scale, the SOS values are larger by a factor of \(\sqrt{\frac{a_d}{a_c}} \approx {1.3844}\). This observation seems reasonable, as the continuous scale has more discriminatory power than the discrete scale. Subjects can assess the quality at a finer granularity on the continuous scale by choosing a value \(x \in [i;i+1]\), while they have to decide between i and \(i+1\) on a discrete scale. The minimum SOS for a given MOS value is zero for a continuous scale, while the minimum SOS is larger than zero and depends on the actual MOS value, cf. (3).
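
The factor follows directly if one assumes that the SOS hypothesis holds with the same functional form on both (mapped) scales; a short derivation under this assumption:

\[
\frac{\mathrm{SOS}_d(x)}{\mathrm{SOS}_c(x)} = \frac{\sqrt{a_d\,f(x)}}{\sqrt{a_c\,f(x)}} = \sqrt{\frac{a_d}{a_c}} = \sqrt{\frac{0.23}{0.12}} \approx 1.3844,
\]

where \(f(x)\) denotes the scale-dependent term of the SOS hypothesis (e.g. \(f(x)=-x^2+6x-5\) on a [1; 5] scale) and x is the MOS of a test condition.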

Although the results seem to be valid from a statistical point of view, the literature shows conflicting results. In Siahaan et al. (2014), subjective studies on image aesthetic appeal were conducted using a discrete 5-point ACR scale as well as a continuous scale. However, similar SOS parameters were obtained for both rating scales. Péchard et al. (2008) compared two different subjective quality assessment methodologies for video QoE: absolute category rating (ACR) using a 5-point discrete rating scale and subjective assessment methodology for video quality (SAMVIQ) using a continuous rating scale. As a key finding, SAMVIQ is more precise (in terms of confidence interval width of a MOS value) than ACR for the same number of subjects. However, SAMVIQ uses multiple stimuli assessment, i.e. multiple viewings of a sequence. There are further works (Tominaga et al. 2010; Pinson and Wolf 2003; Brotherton et al. 2006; Huynh-Thu and Ghanbari 2005) comparing different (discrete and continuous) rating scales as well as assessment methodologies like SAMVIQ in terms of reliability and consistency of the user ratings. We note, however, that they do not address the issues of using averages to characterize the results of those assessments. A detailed analysis of the comparison of continuous and discrete rating scales and their impact on QoE metrics is left for future work.

Table 2 Description of the subjective studies conducted for analyzing QoE for different applications

Comparison of results

All experiments and some key quantities are summarized in Table 2, which may serve as a guideline to properly describe subjective studies and their results in order to extract as much insight from them as possible. To compare the key measures across the experiments with different rating scales, the user ratings in all experiments are mapped onto a scale from 1 (bad quality) to 5 (excellent quality).
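
A simple sketch of such a mapping, assuming a purely linear transformation between rating scales (the scale bounds and the example value below are illustrative):

def map_rating(x, lo, hi, new_lo=1.0, new_hi=5.0):
    """Linearly map a rating x from the scale [lo, hi] onto [new_lo, new_hi]."""
    return new_lo + (new_hi - new_lo) * (x - lo) / (hi - lo)

# e.g. a rating of 2.5 on the continuous [0, 5] video scale ends up at 3.0 on the [1, 5] scale
print(map_rating(2.5, lo=0, hi=5))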

The user rating diversity seems to be lower when using a continuous rating scale than when using a discrete one. This can be observed from the SOS parameter a, but also from the maximum SOS at a certain MOS. It should be noted, however, that for more interactive services such as web browsing, there might be an inherently higher variation of user ratings, due, e.g., to uncertainty on how to rate the overall quality.

The MSE-optimal parameter \(\theta \) is determined by minimizing the MSE between the \(\theta \)-acceptability of the measurement data and the \(\%\mathrm {GoB}\) estimator as a function of MOS. With a discrete rating scale, only discrete values of \(\theta \) can be chosen, and therefore stronger deviations between the \(\%\mathrm {GoB}\) estimator and the \(\theta \)-acceptability arise. We see that for the task-related web QoE, the MSE-optimal parameter is \(\theta =3\). This means that the ratio of users rating the quality as fair or better matches the \(\%\mathrm {GoB}\) curve best. For the continuous rating scales, optimal continuous thresholds can be derived. For the speech QoE and the video QoE on continuous scales, a value of \(\theta \) around 3 matches the \(\%\mathrm {GoB}\) curve.
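
A sketch of this optimization under illustrative assumptions: the per-condition ratings and a generic gob_from_mos estimator (e.g. Eq. (8) for speech) are assumed to be given, and the data and the placeholder estimator below are hypothetical.

import numpy as np

def mse_optimal_theta(conditions, gob_from_mos, thetas):
    """Return the threshold theta whose theta-acceptability P(U >= theta) best matches the %GoB estimator (MSE)."""
    best_theta, best_mse = None, np.inf
    for theta in thetas:
        errors = []
        for ratings in conditions:
            ratings = np.asarray(ratings, dtype=float)
            acceptability = (ratings >= theta).mean()         # theta-acceptability of this test condition
            gob = gob_from_mos(ratings.mean())                # %GoB estimated from the MOS (as a fraction)
            errors.append((acceptability - gob) ** 2)
        mse = np.mean(errors)
        if mse < best_mse:
            best_theta, best_mse = theta, mse
    return best_theta, best_mse

# Hypothetical usage: three test conditions on a 5-point scale and a placeholder %GoB estimator.
conditions = [[1, 2, 2, 3, 4], [3, 3, 4, 4, 5], [4, 4, 5, 5, 5]]
gob_from_mos = lambda mos: min(max((mos - 1) / 4, 0.0), 1.0)  # placeholder mapping, not Eq. (8)
print(mse_optimal_theta(conditions, gob_from_mos, thetas=np.arange(1, 5.01, 0.25)))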

The limitations of the MOS are made evident by the minimum \(\%\mathrm {GoB}\) ratio \(P(U \ge 4)\) over all test conditions that lead to a MOS value equal to or larger than 4. This ratio shows how many users accept (or do not accept) the condition, although the MOS exceeds the threshold.

Another limitation of the MOS is highlighted by the quantiles. In particular, the maximum difference between the 90 %-quantile and the MOS values is shown to reach up to 2 points on the 5-point scale. This highlights the importance of considering QoE beyond the MOS.

Conclusions

In this article, we argued for going beyond the MOS when performing subjective quality tests. While the MOS is a practical way to convey QoE measures and a simple-to-interpret scalar value, it hides important details about the results. These details often have a significant impact on the technical performance of the service, as well as on its business aspects.

Our contributions are manifold. Firstly, while there are many works in the literature dealing with subjective and objective quality assessment, they are mostly limited to the MOS, ignoring higher order statistics and the relation between quality and acceptance. Our first contribution is thus to show that other tools are available for understanding QoE besides the MOS, to explain their importance, and to describe how they can be used. A second contribution is a survey of the available QoE measures, their definition and interpretation. Using these tools brings more insight into QoE analysis. Our third contribution is a showcase, by means of analyzing several concrete use cases, of how these analysis tools are used, highlighting the extra insight they bring beyond that of the MOS. We analyze, e.g., the impact of using continuous vs. discrete scales on the accuracy of the assessment, and the relation between quality and acceptance.

Concerning acceptability ratings, we note the following difference between acceptability (as an explicit question to the users) and the concept of \(\theta \)-acceptability. In a subjective experiment, each user defines their own threshold reflecting the point where QoE is good enough to accept the service. This is the result of a complex cognitive process. In contrast, \(\theta \)-acceptability considers a globally defined threshold (e.g. defined by the ISP, or by whoever designed the subjective test scale used) which is the same for all users. This leads to a discrepancy with the subjective results, which can vary significantly with the application considered. For instance, in the case of web QoE with a task, the discrepancy is rather large. In the case of speech, the E-model-inspired \(\%\mathrm {GoB}\) estimator in Eq. (8) maps a MOS value of 1 to a \(\%\mathrm {GoB}\) of 0 %, as a speech service is no longer usable if the quality is too degraded, and hence it is unacceptable. In contrast, in the web QoE case, a very bad QoE can still result in a usable service which is accepted by the end user. Thus, the user can still complete, for example, the task of finding a Wikipedia article, although the page load times are very high. This may explain why 20 % of the users accept the service although they rate the QoE as bad (1). From this, we recommend that acceptability be included explicitly as part of subjective assessments, as it cannot be directly inferred from user ratings of the quality of a service, e.g. on a 5-point MOS scale.

These differences in the way that users accept (or not) the service quality, and how this relates to MOS values, can provide key insights to providers when assessing the QoE delivered to their users, and how it may relate to issues such as churn. Asking explicitly about acceptability seems like a necessary step in certain use cases (e.g. where business considerations are important). Likewise, thinking in terms of distributions, or at least quantiles, provides more actionable information to service and content providers, as it allows them to better grasp how their users actually perceive the quality of the service, and how many of those users may be happy or unhappy (or, in line with the QoE definition, delighted or annoyed) with it. This implies that existing quality models that provide MOS estimates should be complemented (or eventually replaced) by new models that estimate rating distributions, or at least key quantiles. These results are directly relevant to several aspects of service provisioning, from the more technical ones, such as network management, to marketing and pricing strategies, and customer support.

In summary, we have made the case for going beyond the MOS, and delving deeper into the analysis of QoE assessment results, with practical applications (e.g., business and engineering considerations on the service providers’ part) in mind.