
Open Access 28-11-2023

The sound of respondents: predicting respondents’ level of interest in questions with voice data in smartphone surveys

Authors: Jan Karem Höhne, Christoph Kern, Konstantin Gavras, Stephan Schlosser

Published in: Quality & Quantity | Issue 3/2024


Abstract

Web surveys completed on smartphones open novel ways for measuring respondents’ attitudes, behaviors, and beliefs that are crucial for social science research and many adjacent research fields. In this study, we make use of the built-in microphones of smartphones to record voice answers in a smartphone survey and extract non-verbal cues, such as amplitudes and pitches, from the collected voice data. This allows us to predict respondents’ level of interest (i.e., disinterest, neutral, and high interest) based on their voice answers, which expands the opportunities for researching respondents’ engagement and answer behavior. We conducted a smartphone survey in a German online access panel and asked respondents four open-ended questions on political parties with requests for voice answers. In addition, we measured respondents’ self-reported survey interest using a closed-ended question with an end-labeled, seven-point rating scale. The results show a non-linear association between respondents’ predicted level of interest and answer length. Respondents with a predicted medium level of interest provide longer answers in terms of number of words and response times. However, respondents’ predicted level of interest and their self-reported interest are weakly associated. Finally, we argue that voice answers contain rich meta-information about respondents’ affective states, which are yet to be utilized in survey research.

1 Introduction

The use of web surveys has continuously increased in recent years, replacing other, more established survey modes, such as face-to-face and telephone surveys. This trend especially applies to web surveys on smartphones (Gummer et al. 2023, 2019; Peterson et al. 2017; Revilla et al. 2016). For instance, the smartphone rate in the probability-based German Internet Panel (GIP) increased from 4% in September 2012 (first regular GIP wave) to 12% in July 2016 (first GIP wave with a mobile-optimized survey design) and further to 43% in May 2023 (last GIP wave available at submission of this article). The main reasons for this increase are growing mobile (high-speed) Internet coverage and increasing smartphone ownership (Pew Research Center 2018a, 2018b). In addition, smartphones allow respondents to take part in surveys with almost no location and time restrictions (Mavletova 2013), which may increase the attractiveness of using smartphones for web survey completion.
Another appealing aspect of smartphone surveys is that they allow researchers to collect a variety of data from built-in sensors, such as the Global Positioning System (GPS) sensor, accelerometer, and microphone, which have great potential to augment and extend web surveys (Struminskaya et al. 2020). To put it differently, data collected from or via smartphone sensors may help researchers to describe and understand the survey completion process. For instance, GPS data inform about respondents’ geolocation and can thus be used to infer the environmental setting (Kelly et al. 2013; Struminskaya et al. 2020). Similarly, acceleration data can help to learn about different motion conditions of smartphone respondents, such as standing or walking, during survey completion (Kern et al. 2021). Smartphone sensors also provide novel ways to measure respondents’ attitudes, behaviors, and beliefs. More specifically, the built-in microphones of smartphones allow researchers to administer open-ended questions with requests for voice instead of text answers (Gavras and Höhne 2022; Gavras et al. 2022; Revilla and Couper 2021; Revilla et al. 2020; Schober et al. 2015).
Administering open-ended questions with requests for voice answers potentially allows researchers to passively capture an important aspect of the survey answering process: respondents’ interest in the question topic. Respondents’ interest level provides valuable insights into their commitment while answering survey questions (Krosnick 1991). For example, Holland and Christian (2009) found that respondents who are very interested in the question topic are more likely to provide text answers to open-ended questions (less item nonresponse) and that these text answers are of higher quality (more words, more themes, and greater elaboration). Collecting voice answers to open-ended questions in smartphone surveys may provide a new way to infer respondents’ level of interest in situ, i.e., in parallel to the substantive answers to questions. In addition to the spoken content, voice answers include non-verbal cues, such as amplitudes and pitches (Frank et al. 2015; Schober et al. 2015). Developments in Natural Language Processing (NLP) allow researchers to utilize such cues to gather information on affective states and the level of interest of the speaker or respondent (Eyben et al. 2009; Koolagudi and Rao 2012; Poria et al. 2017). In this study, we predict respondents’ level of interest based on their voice answers to open-ended questions in a smartphone survey and investigate the association between respondents’ level of interest and their answer behavior. We address the following two research questions:
(1)
How is respondents’ predicted level of interest in the question associated with answer behavior?
 
(2)
Does the predicted level of interest in the question align with the self-reported survey interest of respondents?
 
In line with our research questions, we first investigate the relationship between respondents’ answer behavior in terms of answer length (measured in words) and response times (measured in seconds) and their predicted level of interest in the question. Both measures have proven to be good indicators of answer behavior when it comes to open-ended questions with requests for voice answers (Gavras 2019; Gavras et al. 2022; Revilla and Couper 2021; Revilla et al. 2020). As indicated by previous research, respondents’ interest can positively affect their answer behavior (Holland and Christian 2009; Kunz et al. 2021). We thus expect that higher predicted levels of interest coincide with longer answers to open-ended questions. Second, we attempt to validate the interest predictions based on respondents’ voice answers by studying their relationship with self-reported survey interest. We assume that the predicted interest (based on respondents’ voice answers) is positively associated with their self-reported survey interest.
In what follows, we outline the current state of research on smartphone surveys with open-ended questions requesting voice answers from respondents. We then outline the data collection procedure, sample characteristics, questions used in this study, and analytical strategy. Afterwards, we report our results and provide a comprehensive discussion of our findings including perspectives for future web survey research.

1.1 Background and literature

Voice answers collected in smartphone surveys have great potential because they facilitate collecting rich and in-depth information by triggering open narrations (Gavras and Höhne 2022; Revilla et al. 2020). Respondents can express their attitudes with almost no burden; they only need to press a recording button to record their answers (Gavras and Höhne 2022). For text answers, in contrast, respondents need to enter text, which might be problematic for two reasons. On the one hand, some respondents find it difficult to express themselves in writing (e.g., respondents with literacy issues). On the other hand, it might be burdensome to enter answers in text fields via keyboards. This especially applies to smartphones, whose virtual on-screen keyboards shrink the viewing space available for substantive content on the screen (Höhne et al. 2020).
Gavras (2019) and Revilla et al. (2020), for instance, report that voice answers, compared to text answers, are longer in terms of the number of words and characters, indicating that they result in more information on the object of interest. Revilla et al. (2020) also show that even though voice answers are longer than text answers, they are associated with shorter response times than their text counterparts, indicating less respondent burden. Finally, Gavras and Höhne (2022) reveal that voice answers produce (somewhat) higher data quality in terms of criterion validity than text answers. These findings promote the use of open-ended questions with requests for voice answers in future smartphone surveys.
However, voice answers in smartphone surveys suffer from missing data. Gavras and Höhne (2020) reported a dropout rate of about 45% for voice answers, compared to about 13% for text answers. This is in line with a dropout rate of about 50% for voice answers reported by Lütters et al. (2018). Similarly, voice answers are associated with comparatively high item nonresponse rates: about 25% for voice answers compared to about 5% for text answers (Gavras et al. 2022), and about 60% for voice answers1 compared to less than 5% for text answers (Revilla et al. 2020). In addition, Revilla and Couper (2021) tested instructions explaining how to record voice answers in order to decrease item nonresponse, but did not find a decreasing effect; item nonresponse rates remained at about 40%.
In examining the association between respondents’ level of interest in the question and answer behavior, we follow the work of Conrad et al. (2013), who used voice data to investigate the correlation between the speech of interviewers and the success of invitations to a telephone survey. The authors found that survey invitations were most successful when interviewers were moderately disfluent. Interestingly, they also found that respondents who produced more backchannels (i.e., a behavior indicating the interest of a listener, such as “uh huh” and “I see”) were more likely to participate in the survey. Accordingly, we assume that respondents’ tonal cues can also be used to investigate their interest when answering questions. Extracting respondents’ level of interest from their voice answers to open-ended questions may provide valuable information to learn about data quality throughout the survey completion process.
So far, respondents’ level of interest has commonly been measured with self-report questions (e.g., as part of the survey evaluation). Typically, such questions have been used to study the relationship between respondents’ self-reported interest and their answer behavior. Holland and Christian (2009), for instance, investigate the association between self-reported interest and answering open-ended questions with requests for text answers (see also Kunz et al. 2021). They show that interest is positively associated with providing substantive answers, increasing data quality.
Even though self-reported interest measures may shed light on respondents’ answer behavior, they are associated with methodological drawbacks. First, the additional inclusion of questions for measuring respondents’ interest increases completion time and, thus, respondent burden. This has the potential to decrease respondent motivation, which, in turn, can lead to superficial answer behavior (Krosnick 1991). Second, and most importantly, questions or scales for measuring respondents’ interest are usually placed at a specific position in the survey, such as the end. Therefore, they only represent a broad, aggregated measure without informing about respondents’ interest in specific questions or survey parts.
Frequently, researchers build on respondents’ answer behavior, such as item nonresponse and speeding (i.e., extremely fast answering without the chance of careful question processing), to draw conclusions about their engagement and interest in question answering (see, for instance, Conrad et al. 2017; Höhne et al. 2017; Zhang and Conrad 2014). Even though such measures are useful to infer respondents’ (cognitive) involvement in question answering, they only represent a vague and indirect proxy. In contrast, predicting respondents’ interest level based on their voice answers may provide more direct information on their engagement and interest and goes beyond conventional measures, expanding the methodological toolkit in web survey research.
To the best of our knowledge, no previous studies have attempted to predict respondents’ level of interest based on voice answers to open-ended questions collected through smartphone surveys and to investigate the effect of the predicted level of interest on respondents’ answer behavior. We use pre-trained NLP models for inferring the level of interest of respondents; i.e., interest predictions are obtained using models that learned to classify interest based on a different (non-survey) database. Consequently, this study is a very first step in this research direction and primarily represents a proof of concept that investigates the use of pre-trained interest recognition models in the context of smartphone surveys.

2 Method

2.1 Data

Data were collected in the Forsa Omninet Panel (omninet.forsa.de) in Germany in December 2019 and January 2020. The Omninet Panel is offline-recruited. Respondents cannot sign up themselves (preventing mock accounts and duplicates) but are invited via a probability-based telephone sample. The survey mode in the Forsa Omninet Panel is online.
Forsa drew a quota sample from their panel based on age, gender, education, and region (East and West Germany). The quotas were calculated using the German Microcensus, which served as a population benchmark.
The email invitation to the survey included information on the survey duration (about 15 min), the device (i.e., smartphone) to be used for survey completion, and a link to the survey. The first survey page outlined the topic and procedure of the survey and included a statement of confidentiality assuring that the study adheres to existing data protection laws and regulations. In addition, we obtained respondents’ informed consent for collecting, storing, processing, and analyzing their voice answers.
To restrict survey completion to smartphone respondents, we detected respondents’ device at the beginning of the survey. Respondents who attempted to access the survey using a non-smartphone device were prevented from proceeding with the survey and were asked to use a smartphone. In addition, we used the open source “Embedded Client Side Paradata (ECSP)” tool developed by Schlosser and Höhne (2018, 2020) to collect user agent strings that inform about device properties, such as type and operating system.
In total, 1679 panelists started the survey with requests for voice answers, of which 754 panelists broke off before being asked any study-relevant questions.2 This leaves us with 925 panelists available for statistical analysis.3

2.2 Sample

On average, the respondents in the analytic sample were born between 1970 and 1974, and about 49% of them were female. About 24% had completed lower secondary school (low education level), about 34% intermediate secondary school (medium education level), and about 42% college preparatory secondary school or university-level education (high education level).

2.3 Questions

In this study, we predicted respondents’ level of interest based on their voice answers to four open-ended questions concerning the evaluation of the following German political parties: CDU/CSU (Christian Democratic Union/Christian Social Union), SPD (Social Democratic Party), Greens (Alliance 90/The Greens), and AfD (Alternative for Germany). These questions were adopted from major social surveys, such as the German Longitudinal Election Study (GLES), and were presented on separate web survey pages (single question presentation) in the center of the web survey. The questions were developed in German, which was the mother tongue of about 98% of the respondents. We employed a mobile-optimized survey design, which prevents horizontal scrolling and thus facilitates survey navigation and completion. The questions were preceded by an instruction explaining how to record voice answers (see Appendix A for English translations of the voice questions including the instruction).
In order to record respondents’ voice answers, we implemented the open source “SurveyVoice (SVoice)” tool developed by Höhne et al. (2021). SVoice can be implemented in browser-based smartphone surveys and records voice answers via the microphone of smartphones, regardless of the operating system. It resembles the voice input function of popular Instant-Messaging Services and uses Hypertext Transfer Protocol Secure (HTTPS) for assuring the secure transmission of voice answers from SVoice to a server. Figure 1 displays screenshots of the four open-ended questions with requests for voice answers.
We also included a self-report question on respondents’ survey interest. This question was asked with a vertically aligned, seven-point, and end-verbalized rating scale without numeric values (see Appendix A for an English translation of the question including answer options). It was placed at the end of the web survey. There were 16 survey questions between the open-ended questions with requests for voice answers and the self-reported survey interest question.

2.4 Analytical strategy

2.4.1 Predicting respondents’ level of interest with OpenEAR

In order to predict respondents’ level of interest based on their voice answers to open-ended questions, we use the open source OpenEAR tool developed by Eyben et al. (2009). OpenEAR allows extracting features from audio data, such as signal energy, voice quality, pitch, and spectral features, and includes pre-trained classification models, such as Support Vector Machines, for predicting various affective states based on these features. More specifically, in this study, we use the Audiovisual Interest Corpus (AVIC) model-set, which predicts three levels of interest (disinterest, neutral, and high interest). The prediction models were trained with voice data and labels that were designed to capture the spontaneous interest of speakers in a topic. The training data were collected in a scenario in which subjects listened to a product presentation and then naturally interacted with the presenter by asking questions on the addressed topics (Schuller et al. 2009a). The subjects’ level of interest in the topic was subsequently hand-coded by human annotators for each sub-speaker turn (i.e., short speech segments), resulting in three levels of interest (Schuller et al. 2009b, pp. 552–553): disinterest (i.e., the subject is bored with listening and talking about the topic), neutral (i.e., indifferent), and high interest (i.e., a strong wish of the subject to talk and learn more about the topic). Eyben et al. (2009) report state-of-the-art “in-corpus” prediction performance of the pre-trained models on common benchmark tasks when cross-validating predictions with hold-out data from the same database.4

2.4.2 Measures of predicted level of interest

The AVIC model-set of the OpenEAR tool (Eyben et al. 2009) predicts probabilities for each level of interest for each segment (about 2.5 seconds) of voice data input. We calculate the mean of the predicted probabilities over segments for each respondent and each of the four open-ended questions with requests for voice answers. The resulting (numeric) variables, denoted \({LOI}_{low}\), \({LOI}_{med}\), and \({LOI}_{high}\), represent our first set of predicted level of interest measures, defined on the question (i.e., survey page) level. We further condensed these measures into a single variable by thresholding the mean predicted probabilities as follows:
$${LOI}_{cat}=\begin{cases}\text{high} & \text{if } {LOI}_{high}\ge Q_{{LOI}_{high}}(0.75)\\ \text{med. high} & \text{if } {LOI}_{high}< Q_{{LOI}_{high}}(0.75)\ \&\ {LOI}_{high}\ge Q_{{LOI}_{high}}(0.5)\\ \text{med. low} & \text{if } {LOI}_{high}< Q_{{LOI}_{high}}(0.5)\ \&\ {LOI}_{med}\ge Q_{{LOI}_{med}}(0.5)\\ \text{low} & \text{if } {LOI}_{high}< Q_{{LOI}_{high}}(0.5)\ \&\ {LOI}_{med}< Q_{{LOI}_{med}}(0.5)\end{cases}$$
The resulting (categorical) variable represents our second measure of respondents’ predicted level of interest, also defined on the question level.5 Finally, we calculated the mean and variance of \({LOI}_{low}\) and \({LOI}_{high}\) over the four open-ended questions with requests for voice answers for each respondent to create aggregated level of interest measures on the respondent level.
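To make the aggregation and thresholding steps concrete, the following R sketch reproduces the logic described above. It is a minimal sketch: the data frame and column names are hypothetical, the segment-level probabilities are assumed to have been exported from OpenEAR beforehand, and the quantile thresholds are computed over all voice answers as one plausible reading of the formula.

```r
# Minimal sketch of the aggregation and thresholding described above.
# Assumes a data frame 'segments' with one row per ~2.5-second OpenEAR segment
# and (hypothetical) columns: respondent, question, p_low, p_med, p_high.
library(dplyr)

# Question-level measures: mean predicted probability per respondent and question
loi <- segments %>%
  group_by(respondent, question) %>%
  summarise(LOI_low  = mean(p_low),
            LOI_med  = mean(p_med),
            LOI_high = mean(p_high),
            .groups  = "drop")

# Thresholds: upper quartile and medians of the predicted probabilities over all answers
q75_high <- quantile(loi$LOI_high, 0.75)
q50_high <- quantile(loi$LOI_high, 0.50)
q50_med  <- quantile(loi$LOI_med,  0.50)

# Condensed categorical measure LOI_cat (conditions are checked in order)
loi <- loi %>%
  mutate(LOI_cat = case_when(
    LOI_high >= q75_high ~ "high",
    LOI_high >= q50_high ~ "med. high",
    LOI_med  >= q50_med  ~ "med. low",
    TRUE                 ~ "low"
  ))

# Respondent-level aggregates: mean and variance over the four questions
loi_resp <- loi %>%
  group_by(respondent) %>%
  summarise(LOI_low_mean  = mean(LOI_low),  LOI_low_var  = var(LOI_low),
            LOI_high_mean = mean(LOI_high), LOI_high_var = var(LOI_high),
            .groups = "drop")
```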

2.4.3 Association between predicted level of interest and answer behavior

In this study, we measure respondents’ answer behavior in terms of the number of words6 and response times (in seconds). We determine the number of words by counting the number of “tokens” in the transcribed text of each voice answer (see further information below). Response times, in contrast, are simply extracted from the audio files containing respondents’ voice answers; they correspond to the length of the audio files.
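As an illustration, the following minimal R sketch shows one way these two measures could be computed. quanteda is the text package used in this study, whereas tuneR and all file, object, and column names are assumptions for illustration only.

```r
# Minimal sketch of the two answer-behavior measures; file and object names
# are hypothetical. Word counts come from the automatic transcript, response
# times from the duration of the recorded audio file.
library(quanteda)  # text package used in this study
library(tuneR)     # assumption: one way to read WAV files in R

# Number of words: count tokens in the transcribed voice answer
transcript <- "Die Partei vertritt meiner Meinung nach solide Positionen"
n_words <- ntoken(tokens(transcript, remove_punct = TRUE))

# Response time: length of the audio file in seconds
wav <- readWave("answer_cdu_csu.wav")             # hypothetical file name
response_time <- length(wav@left) / wav@samp.rate
```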
In studying answer length in terms of the number of words and response times, we are interested in how the explanatory power of our voice data-based measures (i.e., the inferred level of respondents’ interest) compares to predictor variables that can be derived solely from the spoken text of respondents’ voice answers. For this purpose, we calculated sentiment scores of respondents’ voice answers. We used Google’s “Speech-to-Text” transcription API to automatically transcribe the audio files into text (Google 2020). Proksch et al. (2019, p. 342), for instance, show that the performance of the API does not substantially differ from human transcription in German. They report an average cosine similarity of r > 0.9 between automatically transcribed and human-transcribed political speeches.
We run sentiment analyses to investigate the level of extremity of respondents’ voice answers to the four open-ended questions on political parties. For this purpose, we use the German sentiment vocabulary SentiWS (Remus et al. 2010) in which words are assigned scores ranging from –1 (very negative) to 1 (very positive). The scores indicate the strength of the sentiment-afflicted words. We estimate the extremity of voice answers using the following formula (Lowe et al. 2011):
$$S=log\frac{pos+0.001}{\left|neg\right|+0.001}$$
where pos denotes the weighted sum of positive sentiment words and |neg| denotes the absolute weighted sum of negative sentiment words. We add a small constant (0.001) to prevent division by zero when an answer contains no positive or no negative sentiment words and take the natural logarithm (log) of the ratio. Finally, we normalize the sentiment scores to a scale ranging from 0 (very negative) to 1 (very positive) to facilitate the interpretation of the results.
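A minimal R sketch of the extremity score is shown below. The SentiWS lookup object and all names are hypothetical, and min-max scaling is shown as one plausible way to implement the normalization to the 0–1 range, since the exact normalization is not spelled out above.

```r
# Minimal sketch of the extremity score S; 'sentiws' is assumed to be a named
# numeric vector mapping word forms to SentiWS scores between -1 and 1
# (all names are hypothetical).
answer_tokens <- c("gut", "solide", "schlecht")   # tokens of one voice answer
scores <- sentiws[answer_tokens]
scores <- scores[!is.na(scores)]                  # keep sentiment-bearing words only

pos <- sum(scores[scores > 0])                    # weighted sum of positive words
neg <- abs(sum(scores[scores < 0]))               # absolute weighted sum of negative words

S <- log((pos + 0.001) / (neg + 0.001))

# Normalization to [0, 1]: min-max scaling over all answers, shown here as
# one plausible implementation of the normalization step described above
# S_norm <- (S - min(S_all)) / (max(S_all) - min(S_all))
```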
In order to investigate the association between respondents’ predicted level of interest, sentiment scores, and answer behavior, we run multilevel linear regressions with random intercepts (open-ended questions nested in respondents). We use the log number of words and log response times (in seconds) as dependent variables to account for the strong skewness of the raw data (see Appendix B for descriptive statistics). We include the condensed categorical measure of predicted level of interest (\({LOI}_{cat}\)) and the extracted sentiment scores (continuous scale ranging from 0 to 1) as our main independent variables of interest.7 We use the predicted level of interest “low” as reference. We further include indicators for the four open-ended questions with requests for voice answers (i.e., CDU/CSU, SPD, Greens, and AfD) and a variable that measures whether the current open-ended question refers to the preferred party of the respondent. We additionally control for the following demographics: age (12 ascending categories), female (1 = yes), education: medium (1 = yes) and high (1 = yes) with low as reference, and native German speaker (1 = yes).
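A minimal sketch of such a multilevel model using the lme4 package (listed in the software note below) is shown here; the data frame and all variable names are hypothetical.

```r
# Minimal sketch of the multilevel models with random intercepts (open-ended
# questions nested in respondents) using lme4; the data frame 'answers' and
# all variable names are hypothetical.
library(lme4)

# Use the predicted level of interest "low" as the reference category
answers$LOI_cat <- relevel(factor(answers$LOI_cat), ref = "low")

m3 <- lmer(log_n_words ~ LOI_cat + sentiment + I(sentiment^2) +
             party + preferred_party +
             age + female + edu_medium + edu_high + native_german +
             (1 | respondent),
           data = answers)
summary(m3)
```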
We restrict the statistical analyses to voice answers with a length of at least two seconds.8 First, this excludes voice files that do not contain substantive answers to the survey questions (e.g., empty or incomplete recordings). Second, it excludes voice files that are shorter than the default segment length for OpenEAR output predictions (see above).

2.4.4 Predicted level of interest and self-reported survey interest

To evaluate the association between respondents’ predicted level of interest in the question and self-reported survey interest, we run ordered probit regressions on the respondent level. We use respondents’ survey interest as the dependent variable (seven ordered categories) and the aggregated level of interest measures as independent variables. Specifically, we include the mean and the variance of \({LOI}_{low}\) and \({LOI}_{high}\) over the four survey pages as predictors. In the next model set, we add the interaction between the mean and variance measures to capture the intuition that consistently high predicted interest (i.e., with low variance across questions) should align with high self-reported interest, and vice versa.
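A minimal sketch of such an ordered probit model using the ordinal package (listed in the software note below) follows; the data frame and all variable names are hypothetical.

```r
# Minimal sketch of the ordered probit models on the respondent level using
# the ordinal package; the data frame 'respondents' and all variable names
# are hypothetical.
library(ordinal)

# Self-reported survey interest as an ordered factor with seven categories
respondents$interest <- factor(respondents$interest, ordered = TRUE)

# Model 2: mean and variance of LOI_low plus their interaction
m2 <- clm(interest ~ LOI_low_mean * LOI_low_var,
          data = respondents, link = "probit")
summary(m2)
```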
All data preparations and analyses are conducted with R (version 4.0.3) using the quanteda (version 3.0.0), lme4 (version 1.1-27), and ordinal (version 2019.12-10) packages. Code for obtaining, processing, and analyzing the OpenEAR predictions is available at the following Open Science Framework (OSF) repository: https://osf.io/hj58u/?view_only=e57db6950de8474abc117e459f9440e9

3 Results

3.1 Distribution of predicted level of interest

In Table 1, we report the average of the mean predicted probabilities (plus standard deviations) of the three numeric measures of interest (\({LOI}_{low}\), \({LOI}_{med}\), and \({LOI}_{high}\)) across respondents for each open-ended question on political parties. In Table 2, in contrast, the distribution of the condensed categorical measure of interest (\({LOI}_{cat}\)) for each open-ended question is presented. Overall, a medium or high level of interest is predicted for the majority of voice answers, while a low level of interest is predicted less frequently. This pattern holds for all four open-ended questions with requests for voice answers. Substantively, these results might reflect that attitudes towards political parties represent a rather interesting and engaging topic. Nonetheless, the low variation in predicted levels of interest across political parties is rather unexpected, given the different degrees of polarization that may be triggered by the parties that are covered in this study (i.e., CDU/CSU, SPD, Greens, and AfD).
Table 1
Distribution of predicted probabilities for each level of interest (means and standard deviations)

          CDU/CSU       SPD           Greens        AfD
Low       0.07 (0.09)   0.09 (0.11)   0.08 (0.08)   0.07 (0.06)
Medium    0.47 (0.19)   0.40 (0.21)   0.49 (0.18)   0.45 (0.20)
High      0.46 (0.22)   0.52 (0.24)   0.43 (0.21)   0.48 (0.23)
N         617           620           619           623

Standard deviations in parentheses
Table 2
Distribution of the combined measure of predicted level of interest based on thresholding predicted probabilities (frequencies and percentages)

             CDU/CSU     SPD         Greens      AfD
Low          22 (4%)     46 (7%)     25 (4%)     19 (3%)
Medium low   310 (50%)   198 (32%)   340 (55%)   288 (46%)
Medium high  148 (24%)   178 (29%)   142 (23%)   154 (25%)
High         137 (22%)   198 (32%)   112 (18%)   162 (26%)
N            617         620         619         623

Percentages in parentheses

3.2 Association between predicted level of interest and answer behavior

We first investigate whether and to what extent respondents’ predicted level of interest in the question is associated with their answer behavior in terms of number of words and response times, respectively. Both indicators have proven their worth in previous studies on open-ended questions with requests for voice answers (Gavras 2019; Gavras et al. 2022; Revilla and Couper 2019; Revilla et al. 2020).
We start by non-parametrically exploring the association between the predicted probabilities of high interest (\({LOI}_{high}\)) and the log number of words using loess (locally estimated scatterplot smoothing) curves (see Fig. 2). For all four open-ended questions with requests for voice answers, the loess curves show an inverse U-shaped relationship between predicted interest of respondents and answer length. This means that both low and high predicted probabilities of high interest are associated with shorter answers, while medium levels of \({LOI}_{high}\) correspond to longer answers, on average. We observe a similar non-linear relationship between \({LOI}_{high}\) and log response times (see Appendix C for the corresponding loess curves).
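A minimal R sketch of this kind of loess exploration is shown below; ggplot2 as well as the data frame and variable names are assumptions for illustration.

```r
# Minimal sketch of the loess exploration of predicted high interest against
# (log) answer length; data frame and variable names are hypothetical.
library(ggplot2)

ggplot(answers, aes(x = LOI_high, y = log(n_words))) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "loess") +
  facet_wrap(~ party) +          # one panel per open-ended question
  labs(x = "Predicted probability of high interest",
       y = "Log number of words")
```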
The multilevel regression models in Table 3 further investigate the relationship between the predicted level of interest and answer length in terms of the log number of words. In the regression models, we aim to test the explanatory power of the combined measure of predicted interest (\({LOI}_{cat}\)).
Table 3
Multilevel regression models predicting answer length in terms of log number of words

                     Model 1                    Model 2                    Model 3
Med. low interest    0.136 (0.064), p = 0.033   –                          0.137 (0.063), p = 0.032
Med. high interest   0.126 (0.067), p = 0.062   –                          0.132 (0.066), p = 0.047
High interest        − 0.021 (0.071), p = 0.769 –                          0.011 (0.070), p = 0.876
Sentiment            –                          − 1.471 (0.224), p = 0.000 − 1.339 (0.213), p = 0.000
Sentiment squared    –                          1.523 (0.213), p = 0.000   1.296 (0.202), p = 0.000
Constant             2.964 (0.070)              3.303 (0.066)              2.372 (0.297)
Control variables    No                         No                         Yes
Observations         2479                       2617                       2463
Respondents          705                        718                        699
Level-1 r2           0.12                       0.005                      0.16
Level-2 r2           0.13                       0.14                       0.21

Standard errors in parentheses; p-values reported after each coefficient
In Model 1, we find that, compared to low predicted interest, medium low and medium high levels of interest are associated with longer answers in terms of the log number of words. Notably, we find no substantial effect for a predicted high level of interest. This result is in line with the previously reported inverse U-shaped relationship seen in the loess curves. In summary, this indicates that an increased level of interest corresponds to longer answers, but only up to a certain degree of predicted interest. This may indicate that some forms of negative arousal in respondents’ voice answers might have been misclassified as representing a high level of interest. The predicted level of interest explains about 12% of the level-1 variance in answer length.
In Model 2, we include sentiment scores as predictors to test their capability of explaining answer length in terms of log number of words. The negative sign of the sentiment coefficient and the positive sign of the sentiment squared coefficient indicate that, compared to answers with moderate sentiment levels, answers with both strongly negative and strongly positive sentiments are associated with an increase in the number of words. Nonetheless, compared to Model 1, the level-1 r2 of Model 2 is considerably lower. This indicates that the level of interest that is inferred from respondents’ voice is a stronger predictor of answer length than the sentiment of the spoken text itself.
Finally, in Model 3, we include both the predicted level of interest and sentiment scores as predictors and control for survey page and respondent characteristics. The results correspond to those in Models 1 and 2, with medium low and medium high predicted levels of interest being associated with longer answers. Note that this effect holds while controlling for the political party that is being evaluated and whether it matches respondents’ self-reported party preference. The level-1 r2 in Model 3 increased to 0.16.
In a next step, we study answer behavior in terms of response times (in seconds). The results of the corresponding multilevel regression models are shown in Table 4. In Model 1, we find positive effects for medium low and medium high predicted interest. This again indicates that these levels of interest are associated with substantially longer answers. In Model 2, we model the association between sentiment scores and response times. We find similar patterns as in the previous analyses (see Table 3). Comparing the level-1 r2 values between Model 1 and Model 2, we again observe that the sentiment scores are less predictive of answer length than the inferred level of interest of respondents. In Model 3, we include the predicted level of interest, sentiment scores, and additional control variables. The results correspond to those in Models 1 and 2 and show that the inferred level of interest remains an important predictor of response times, while controlling for the sentiment of respondents’ voice answers as well as survey page and respondent characteristics. Nonetheless, compared to the values of the previous model set in Table 3, the level-1 and level-2 r2 values remain low in Model 3.
Table 4
Multilevel regression models predicting answer length in terms of log response times in seconds

                     Model 1                    Model 2                    Model 3
Med. low interest    0.149 (0.051), p = 0.004   –                          0.141 (0.051), p = 0.006
Med. high interest   0.097 (0.054), p = 0.074   –                          0.102 (0.054), p = 0.058
High interest        − 0.053 (0.057), p = 0.353 –                          − 0.024 (0.056), p = 0.671
Sentiment            –                          − 1.096 (0.175), p = 0.000 − 1.014 (0.172), p = 0.000
Sentiment squared    –                          1.118 (0.166), p = 0.000   0.981 (0.163), p = 0.000
Constant             2.499 (0.058)              2.782 (0.053)              2.068 (0.260)
Control variables    No                         No                         Yes
Observations         2479                       2617                       2463
Respondents          705                        718                        699
Level-1 r2           0.07                       0.01                       0.10
Level-2 r2           0.02                       0.03                       0.06

Standard errors in parentheses; p-values reported after each coefficient

3.3 Predicted level of interest and self-reported survey interest

Next, we test the association between the predicted level of interest that we inferred from respondents’ voice answers and their self-reported survey interest. As the self-reported survey interest is measured on the respondent level, we turn to our aggregated measures of predicted interest that summarize inferred interest across the four open-ended questions with requests for voice answers. Specifically, we include the mean and the variance (and their interaction) of the predicted probabilities of low interest (\({LOI}_{low}\)) across survey pages as predictors of self-reported survey interest in the models of Table 5. Corresponding models that include the mean and variance (and their interaction) of the predicted probabilities of high interest (\({LOI}_{high}\)) as predictors are presented in Appendix D.9
Table 5
Ordered probit regression models predicting self-reported survey interest

                                          Model 1                     Model 2                     Model 3
Mean of low interest                      − 1.488 (0.753), p = 0.049  − 1.565 (0.758), p = 0.039  − 1.501 (0.786), p = 0.057
Variance of low interest                  3.471 (3.138), p = 0.269    − 9.574 (7.538), p = 0.205  − 8.157 (7.669), p = 0.288
Interaction mean (low)*variance (low)     –                           48.577 (25.712), p = 0.059  47.181 (26.081), p = 0.071
Mean of response time                     –                           –                           0.007 (0.002), p = 0.0002
Control variables                         No                          No                          Yes
Observations                              686                         686                         683
AIC                                       2195.63                     2194.00                     2159.34
BIC                                       2231.88                     2234.78                     2227.24

Standard errors in parentheses; p-values reported after each coefficient
Model 1 in Table 5 shows a negative effect of the mean of the predicted probabilities of low interest on self-reported interest. That is, the higher the average predicted probabilities of low interest, the lower the self-reported survey interest of respondents.
Model 2 shows negative conditional main effects of the mean and variance of the predicted probabilities of low interest and a positive interaction between both terms. This result indicates that when the predicted probabilities of low interest are similar across the open-ended questions (i.e., low variance), a higher average predicted probability of low interest coincides with lower self-reported interest. However, this effect is weakened as the variance of the predicted probabilities across the open-ended questions increases. This result matches the intuition that a consistently predicted low interest for all four open-ended questions should align with a generally low self-reported interest in the survey. However, the observed effects are rather weak.
Model 3 additionally includes the mean response time across the four open-ended questions and socio-demographic characteristics as predictors. Longer average response times are associated with higher self-reported survey interest. At the same time, the effects of the predicted probabilities of low interest remain relatively stable, indicating that the predicted interest variables have distinct effects over and above the effect of response times.

4 Discussion and conclusion

The aim of this study was to investigate the use of automated interest recognition to predict respondents’ level of interest based on their voice answers in a smartphone survey. For this purpose, we used the open source SurveyVoice (SVoice) tool (Höhne et al. 2021) for recording voice answers and the open source OpenEAR tool (Eyben et al. 2009) for predicting respondents’ level of interest. We argued that the predicted level of interest may be used to study respondents’ answer behavior during survey completion on the survey-page level. Against this background, we explored the association between the predicted level of interest and answer behavior (research question 1) and investigated the link between the predicted level of interest and respondents’ self-reported interest in the survey (research question 2). We found that respondents’ predicted level of interest is non-linearly associated with the number of words and response times: Respondents with a predicted medium level of interest provide longer answers. In addition, the results indicate that respondents’ predicted level of interest is only weakly associated with their self-reported survey interest.
The distribution of the predicted level of interest shows that the bulk of respondents is predicted to have a medium to high interest level. This similarly applies to all four open-ended questions with requests for voice answers. Even though political parties frequently have a rather negative image in public discourse and are frequently deemed a “necessary evil” in modern Western democracies (Dalton and Weldon 2005), respondents seem to have a comparatively high level of interest when evaluating them. Nonetheless, this study is a very first step in the direction of automatically predicting respondents’ level of interest based on voice answers. We therefore suggest that future studies keep investigating the usefulness and usability of Natural Language Processing (NLP) tools, such as OpenEAR (Eyben et al. 2009), for predicting respondents’ level of interest in smartphone surveys.
The results on answer behavior show that respondents’ predicted level of interest is associated with their answer behavior. This similarly applies to the number of words and response times. Compared to low predicted interest, medium low and medium high predicted interest are positively associated with answer length in terms of number of words and response times. To put it differently, respondents with a predicted low interest produce shorter answers. We also show that sentiment scores are less predictive of answer length than the interest predictions (see level-1 r2 values in Tables 3 and 4). Overall, it appears worthwhile to further investigate the association between respondents’ predicted level of interest and answer behavior.
We also argue that it is important for future research to go a step further by investigating the quality of voice answers across respondents with different predicted levels of interest. For this purpose, researchers could additionally look at the topics of voice answers (Roberts et al. 2014). This would allow drawing more informed conclusions about the association between respondents’ level of interest and answer behavior. In addition, downstream effects with respect to data quality in later survey sections could be analyzed. Eventually, this line of research could investigate the use of automated interest recognition as a tool to monitor respondents’ engagement and motivation during web survey completion. Since interest predictions can be obtained in real time, this approach might offer an avenue to inform potential design adjustments during the survey completion process to maintain engagement and motivation and to prevent dropouts.
Respondents’ interest in the survey is an important aspect because it can help to shed light on respondents’ engagement and motivation during survey completion. Typically, survey interest is measured with closed-ended questions placed at a specific position in the survey, so that they only represent a global measure that does not inform about respondents’ interest in specific questions. In this study, we tried to tackle this limitation by automatically predicting respondents’ level of interest from their voice answers to open-ended questions. In line with our second research question, we investigated the alignment of respondents’ predicted level of interest and self-reported survey interest. The overall results, however, indicate a weak association between both measures. One potential reason for this finding is that respondents’ predicted level of interest was determined on a question level, whereas self-reported survey interest was measured on a survey level at the end of the survey. As outlined in the method section, there were 16 questions between the open-ended questions with requests for voice answers (for which we predicted respondents’ level of interest) and the self-report question on survey interest. It is possible that the two measures capture different facets of interest, and we thus encourage future research to employ a more tailored study design. Specifically, it would be worthwhile to place the self-report measure closer to the questions for which the level of interest is predicted. This way, both interest measures would refer to the same questions or part of the survey.
A more general point associated with survey measures including self-report questions is whether and to what extent respondents provide true answers (or values). It is commonly assumed that survey measures are biased by, for example, deficiencies associated with the measurement instrument (e.g., insufficiently designed rating scales) or inaccurate answers (e.g., affected by social desirability concerns). Thus, there can be a mismatch between survey answers and true values. Such a mismatch can negatively affect measurement quality and the conclusions that can be drawn from empirical findings. For a brief discussion of the concept of true values, we refer interested readers to Lavrakas (2008).
This study has some limitations that provide avenues for future research. First, we drew a quota sample from a non-probability access panel in Germany. Since spoken language is not isolated from cultural aspects (e.g., pronunciation and intonation), this may impede the generalizability of our findings (see Koolagudi and Rao 2012; Poria et al. 2017). Future studies may use voice data that were collected—from a probability-based panel—in a cross-cultural setting to draw more robust conclusions about respondents’ answer behavior and interest levels. In doing so, it would be worthwhile to take respondents’ personality traits into account by, for example, employing the Big Five Inventory (see Rammstedt et al. 2014).
Second, we only used four open-ended questions with requests for voice answers dealing with the evaluation of German political parties. In our opinion, it is worthwhile to employ questions that contain a more diverse set of topics. It might also be interesting to employ questions with more sensitive topics, such as extremism and populism.
Third, in this study, we measured self-reported survey interest with a seven-point rating scale running from “Very interested” to “Not at all interested”. However, the OpenEAR tool by Eyben et al. (2009) predicts the following levels of interest: disinterest, neutral, and high interest. From a methodological point of view, it would be worthwhile to harmonize the two measures in future studies because it would allow to draw more robust conclusions about their alignment.
Fourth, it is important to mention that we applied the interest recognition models in a “cross-corpus” setting; that is, predictions were obtained for naturalistic voice data using models that were trained with a different database. Such prediction tasks are considerably more challenging than “in-corpus” predictions and building recognition models that are particularly tailored to voice data from smartphone surveys might result in more robust predictions. Relatedly, in computational linguistics and machine or deep learning, there is a general discussion on the feasibility and accuracy of predicting affective states and emotions based on non-verbal cues extracted from spoken language. For a more comprehensive discussion, we refer interested readers to Khalil et al. (2019).
The collection of voice answers to open-ended questions in smartphone surveys extends the existing methodological toolkit and potentially results in more in-depth information on respondents’ attitudes, behaviors, and beliefs (Gavras 2019; Gavras et al. 2022; Revilla et al. 2020). However, research on the usefulness and usability of voice answers is still in its infancy. This especially applies to the investigation of respondents’ level of interest and its association with answer behavior. This study was a very first step into this research direction and illustrates the research potentials that voice answers in smartphone surveys offer.

Declarations

Conflict of interest

There are no conflicts of interest or competing interests.

Ethical approval

The study was conducted in line with existing ethical research standards.
Consent for participation was obtained through the survey company.
We have consent to publish this study.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix

Appendix A

English translations of the instruction, the four open-ended questions with requests for voice answers, and the self-report question on survey interest.

Instruction

Next, we would like to ask you some questions on political issues and parties. You will be asked to provide the answers in your own words. You can record your answers via the microphone of your smartphone.
Press and hold the microphone icon while recording your answer.
Once you have recorded your answer, you can stop pressing the microphone icon. A tick will indicate that you have successfully recorded your answer.
After successful recording, click on “Next” to continue with the survey as usual.

Open-ended questions with requests for voice answers

What do you think about the CDU/CSU?
What do you think about the SPD?
What do you think about the Greens?
What do you think about the AfD?
Additional instruction: Please press the microphone icon while recording your answer

Self-report question on survey interest

Overall, how interesting did you find the survey?
Answer scale: 1 “Very interesting” to 7 “Not at all interesting”
Note. The question order in the smartphone survey corresponds to the presentation order in Appendix A. These four questions were preceded by two other open-ended questions with requests for voice answers, which are not the subject of this article. The first one dealt with the most important political problem in Germany and the second one dealt with the performance of the German chancellor (Angela Merkel). The self-report question on survey interest was asked with a vertically aligned, seven-point, end-verbalized rating scale without numeric values. The original German wordings are available from the first author on request.

Appendix B

See Tables 6, 7, 8 and 9.
Table 6
Descriptive statistics for number of words

              CDU/CSU   SPD      Greens   AfD
Mean          33.25     33.69    40.14    36.85
5% quantile   2         2        2        2
Median        19        21       21       19
95% quantile  104       101.80   130.35   119.70
Stand. dev    45.66     42.87    55.99    56.78
Skewness      5.14      4.88     5.42     6.78
N             655       665      674      654
Table 7
Descriptive statistics for log number of words

              CDU/CSU   SPD      Greens   AfD
Mean          2.98      3.02     3.12     2.98
5% quantile   1.10      1.10     1.10     1.10
Median        3.00      3.09     3.09     3.00
95% quantile  4.65      4.63     4.88     4.79
Stand. dev    1.08      1.09     1.14     1.18
Skewness      − 0.24    − 0.34   − 0.21   − 0.11
N             655       665      674      654
Table 8
Descriptive statistics for response times in seconds

              CDU/CSU   SPD      Greens   AfD
Mean          18.63     18.10    22.45    20.59
5% quantile   2.82      2.61     2.63     2.47
Median        11.35     11.35    13.39    11.09
95% quantile  54.73     55.12    66.77    66.65
Stand. dev    23.61     21.67    34.76    28.10
Skewness      4.37      4.73     8.68     5.03
N             655       665      674      654
Table 9
Descriptive statistics for log response times in seconds

              CDU/CSU   SPD      Greens   AfD
Mean          2.56      2.55     2.68     2.58
5% quantile   1.34      1.28     1.29     1.24
Median        2.51      2.51     2.67     2.49
95% quantile  4.02      4.03     4.22     4.21
Stand. dev    0.87      0.88     0.93     0.96
Skewness      0.42      0.31     0.35     0.40
N             655       665      674      654

Appendix C

See Fig. 3.

Appendix D

See Table 10.
Table 10
Ordered probit regression models predicting self-reported survey interest

                                            Model 1                     Model 2                      Model 3
Mean of high interest                       0.232 (0.217), p = 0.285    0.238 (0.289), p = 0.409     0.052 (0.294), p = 0.861
Variance of high interest                   − 2.769 (1.691), p = 0.102  − 2.574 (6.390), p = 0.688   − 1.503 (6.425), p = 0.816
Interaction mean (high)*variance (high)     –                           − 0.372 (11.755), p = 0.975  − 1.018 (11.788), p = 0.932
Mean of response time                       –                           –                            0.006 (0.002), p = 0.001
Control variables                           No                          No                           Yes
Observations                                686                         686                          683
AIC                                         2196.27                     2198.27                      2164.87
BIC                                         2232.52                     2239.05                      2232.77

Standard errors in parentheses; p-values reported after each coefficient
Footnotes

1. This item nonresponse rate only refers to the Android (voice input) condition but not to the iOS (dictation) condition. For the iOS condition, the item nonresponse rate was less than 5% (Revilla et al. 2020).

2. Some other respondents (about 50%) were randomly assigned to an identical smartphone survey employing open-ended questions with requests for text instead of voice answers. Statistical tests revealed no significant differences between the two experimental conditions (text and voice) with respect to age, gender, and education.

3. We face item nonresponse of about 26%. The results of logistic regressions on item nonresponse indicate no significant differences with respect to age, gender, and education.

4. Eyben et al. (2009) report a weighted average recall rate of 74.5 based on a tenfold cross-validation. For the level “high interest”, recall represents the fraction of individuals that were correctly predicted as highly interested out of all individuals that were in fact highly interested.

5. We also condensed the three numeric measures into one variable by directly taking the level of interest with the highest predicted probability of each question as the observed category. However, this measure led to a very sparsely populated level of “low interest” and thus limited variability.

6. The reason for using words instead of characters is that (strong) accents and dialects can affect the number of characters (e.g., omitting the final letters of a word) when automatically transcribing voice answers. This would decrease the accuracy of the answer length.

7. We additionally include a quadratic term for sentiment because it can be assumed that answers with strongly negative or strongly positive sentiments differ from answers with moderate sentiments.

8. We conducted several robustness checks varying the minimum number of words and response time lengths, respectively. The main conclusions did not change.

9. In the models in Appendix D, we cannot observe a similar effect pattern for aggregated measures of the predicted probabilities of high interest. At best, a negative effect of the variance measure can be observed in Model 1. In all three models, there is little evidence that a (consistent) increase in the predicted probabilities of high interest coincides with a considerable increase in self-reported survey interest.
 
Literature
Eyben, F., Wöllmer, M., Schuller, B.: OpenEAR: introducing the Munich open-source emotion and affect recognition toolkit. Paper presented at the 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops, Amsterdam (2009). https://doi.org/10.1109/ACII.2009.5349350
Frank, M.G., Griffin, D.J., Svetieva, E., Maroulis, A.: Nonverbal elements of the voice. In: Kostić, A., Chadee, D. (eds.) The Social Psychology of Nonverbal Communication, pp. 92–113. Palgrave Macmillan, London (2015)
Gavras, K.: Voice recording in mobile web surveys: evidence from an experiment on open-ended responses to the “final comment”. Paper presented at the General Online Research Conference, Cologne, Germany (2019)
Gavras, K., Höhne, J.K., Blom, A., Schoen, H.: Innovating the collection of open-ended answers: the linguistic and content characteristics of written and oral answers to political attitude questions. J. R. Stat. Soc. (Ser. A) 185(3), 872–890 (2022). https://doi.org/10.1111/rssa.12807
Lavrakas, P.J.: True value. In: Lavrakas, P.J. (ed.) Encyclopedia of Survey Research Methods, p. 910. Sage, Thousand Oaks (2008)
Peterson, G., Griffin, J., LaFrance, J., Li, J.: Smartphone participation in web surveys. In: Biemer, P.B., de Leeuw, E., Eckman, S., Edwards, B., Kreuter, F., Lyberg, L.E., Tucker, N.C., West, B.T. (eds.) Total Survey Error in Practice, pp. 203–233. John Wiley & Sons, Hoboken (2017)
Remus, R., Quasthoff, U., Heyer, G.: SentiWS—a publicly available German-language resource for sentiment analysis. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation, Valletta, Malta (2010). https://aclanthology.org/L10-1339/. Accessed 02 Dec 2022
Schlosser, S., Höhne, J.K.: Embedded client side paradata (ECSP). Zenodo (2018). https://doi.org/10.5281/zenodo.1218941
Schlosser, S., Höhne, J.K.: Embedded client side paradata (ECSP). Zenodo (2020). https://doi.org/10.5281/zenodo.3782591
Metadata
Title
The sound of respondents: predicting respondents’ level of interest in questions with voice data in smartphone surveys
Authors
Jan Karem Höhne
Christoph Kern
Konstantin Gavras
Stephan Schlosser
Publication date
28-11-2023
Publisher
Springer Netherlands
Published in
Quality & Quantity / Issue 3/2024
Print ISSN: 0033-5177
Electronic ISSN: 1573-7845
DOI
https://doi.org/10.1007/s11135-023-01776-8
