2.2.1 CDRs
Many works in the literature are based on the analysis of mobile phone data, the so-called CDRs (Call Detail Records) of calling and texting activity of users, because they guarantee the repeatability of experiments in different countries and on different scales given the worldwide diffusion of mobile phones [
56].
CDRs collect geographical, temporal, and interaction information on mobile phone use [
57‐
62], hence providing a comprehensive picture of human behavior at a societal scale. Each time an individual makes a call, the mobile phone operator registers the connection between the caller and the callee, the duration of the call, and the coordinates of the phone tower communicating with the served phone. Table
2 illustrates an example of the structure of CDRs.
Table 2
Example of Call Detail Records (CDRs). Every time a user makes a call, a record is created with timestamp, the phone tower serving the call, the caller identifier and the callee identifier (a). For each tower, the latitude and longitude coordinates are available to map the tower on the territory (b)
2007/09/10 23:34 | 36 | 4F80460 | 4F80331 |
2007/10/10 01:12 | 36 | 2B01359 | 9H80125 |
2007/10/10 01:43 | 38 | 2B19935 | 6W1199 |
\(\vdots \) | \(\vdots \) | \(\vdots \) | \(\vdots \) |
36 | 49.54 | 3.64 |
37 | 48.28 | 1.258 |
38 | 48.22 | -1.52 |
\(\vdots \) | \(\vdots \) | \(\vdots \) |
Note that CDRs suffer from different types of bias [
63,
64]. For example, the position of a user is known at the granularity level of phone towers, and only when they make a phone call. Moreover, phone calls are sparse in time, i.e., the time between consecutive calls follows a heavy tail distribution [
65,
66]. In other words, since users are inactive most of their time, CDRs allow reconstructing only a subset of a user’s behavior.
CDRs are used to monitor several dimensions of well-being, notably health (A1), job opportunities (A2), socioeconomic development (A3), environment (A4), and safety (A5).
CDRs provide one of today’s most exciting opportunities to study human mobility and its influence on disease dynamics (A1). Many researchers use mobile phone data for public health, as the analysis of individual and population mobility patterns is more objective and with finer spatiotemporal resolution in comparison to traditional methods. Furthermore, mobile network data can also provide insights into human behavior that can support the assessment and monitoring of the health of specific communities at risk, thus paving the way toward improved health promotion and prevention [
67]. Taking into consideration that the spatiotemporal evolution of human mobility and the related fluctuations of population density are essential drivers of disease outbreaks, Finger et al. [
68] use CDRs to track the cholera outbreak in 2005 in Senegal. Findings show that a mass gathering taking place during the initial phase of the outbreak has an essential impact on the course of the disease. Besides, Kafsi et al. [
69] contribute to the fight against epidemics of infectious diseases using CDRs provided by France Telecom-Orange. They use 2.5 billion calls made by 5 million users in the Ivory Coast, recorded over 5 months, from December 2011 to April 2012, to study and model behavioral patterns of the affected population and propose several strategies for personalized behavioral recommendations to reduce the infections. Lima et al. [
70] use the same data set to build a model that describes how diseases circulate the country as people move between regions, and they enhance the model with a concurrent process of relevant information spreading. This process corresponds to people disseminating disease prevention information, e.g., hygiene practices, vaccination campaign notices, and others, within their social network. Finally, Madan et al. [
71] use CDRs and mobile phone-based co-location sensing to measure characteristic behavior changes in symptomatic individuals. These behavior changes are reflected in their total communication, interactions with respect to time of day, diversity, and entropy of face-to-face interactions and movement. Using these extracted mobile features, they manage to predict the health status of an individual, without having actual health measurements from the subject.
Besides, researchers use CDRs to study job opportunities (A2). Pappalardo et al. [
72] use CDRs to study the link between human mobility and the employment rate of French cities, finding a strong correlation between measures of mobility entropy and the unemployment rate in urban environments. Toole et al. [
73] show that changes in the calling behaviors of individuals, aggregated at regional level, can improve forecasts of macro unemployment rates. Sunds et al. [
74], use CDRs to create a model which predicts unemployment with a 70.4% of accuracy. They also provide promising support to the collection of data for populations in developing countries, which are often under-represented in official surveys.
Most of researchers use CDRs to investigate socioeconomic development (A3). A seminal work analyzes landline calls and a nationwide mobile phone data set to show that, in the UK, regional communication diversity is positively associated with a socioeconomic ranking [
75]. Other works address the issue of mapping poverty [
76] and other socioeconomic determinants [
77] with mobile phone communication data, combined with airtime credit purchases data in the Ivory Coast [
78]. Blumenstock et al. [
79,
80] show preliminary evidence of a relationship between individual wealth and the history of mobile phone transactions. Frias-Martinez et al. [
81‐
84] analyze the relationship between human mobility and the socioeconomic status of urban zones, presenting which mobility indicators correlate best with socioeconomic levels and building a model to predict the socioeconomic level from mobile phone traces. Pappalardo et al. [
85] analyze mobile phone data and extract meaningful mobility measures for cities, discovering an interesting correlation between human mobility aspects and socioeconomic determinants. Lotero et al. [
86] analyze the architecture of urban mobility networks in two Latin-American cities from the multiplex perspective. They discover that the socioeconomic characteristics of the population have an extraordinary impact on the layer organization of these multiplex systems. In a successive work, Lotero et al. [
86] analyze urban mobility in Colombia representing cities by mobility networks. They encode the origin-destination trips performed by a subset of the population corresponding to a particular socioeconomic status and they show that spatial and temporal patterns vary across these socioeconomic groups. Amini et al. [
87] use mobile phone data to compare the human mobility patterns of a developing country (the Ivory Coast) and a developed country (Portugal). They show that cultural diversity in developing regions can present challenges to mobility models defined in less culturally diverse regions. Smith-Clarke et al. [
88] analyze the aggregated mobile phone data of two developing countries and extract features that are strongly correlated with poverty indexes derived from official statistics census data.
Moreover, researchers use CDRs to monitor the quality of the environment and its impact on people’s lives (A4). For example, Picornell et al. [
89] evaluate the population exposure to
\(\mathrm{NO}_2\) on a research published recently. They use CDRs from one of the three most important Spanish mobile phone network operators (MNOs), with around 30% market share. The analysis is conducted for the capital of Spain, Madrid, for the 17th of November 2014, as a typical day in terms of population mobility and
\(\mathrm{NO}_2\) levels. Comparing the results with traditional census-based methods, they demonstrate relevant discrepancies at disaggregated levels and underline the importance of integrating CDRs data for the evaluation of population exposure to
\(\mathrm{NO}_2\). Lu et al. [
90] study people’s behavior affected by climate stress. In particular, by exploring the individuals’ behavioral response to the Cyclone Mahasen, which struck Bangladesh in May 2013, they find out that anomalous patterns of mobility and calling frequency correlate with rainfall intensity, showing the affected regions and when the storm moves. Lu and Bengtsson [
91,
92] analyze the movement of 1.9 million mobile phone users before and after the 2010 Haiti earthquake, and they show that CDRs can be a valid data source for estimates of population movements during disasters. Wilson et al. [
93] build a tool within nine days of the Nepal earthquake of 2015, to provide spatiotemporally detailed estimates of population displacements from CDRs based on movements of 12 million mobile phones users. Nyarku et al. [
94] use CDRs to explore whether mobile phones could be reliably used to monitor individual exposure to selected air pollutants when moving between indoor and outdoor microenvironments. In particular, data are collected from two BROAD life mobile phones, which are equipped with sensors for direct measurements of air pollutants. The two phones bring similar results, both for particles and formaldehyde, making them potentially suitable for applications in polluted environments, even if there seem to be some exceptions where the readings of the two phones do not correspond well to each other. Liu et al. [
95] map personal trajectories using mobiles in an urban environment to assess the impact of traffic-related air pollution in society. They estimate traffic pollution exposure to individuals based on the exposure along the individual human trajectories in the estimated pollution concentration fields by utilizing modeling tools and manage to identify trajectory patterns of particularly exposed human groups. In addition, Decuyper et al. [
96] use CDRs to study food security indicators finding a strong correlation between the consumption of vegetables rich in vitamins and airtime purchase.
Other studies focus on the safety dimension (A5). Bogomolov et al. [
97] use CDRs for 3 weeks from the 9th to the 15th of December 2012 , and from the 23rd December 2012 to the 5th of January 2013, in combination with demographic data from December 2012 to January 2013, to predict crime in the city of London. Experimental results show 70% of accuracy in predicting whether an area could be a crime hotspot or not. Similarly, Ferrara et al. [
98] study criminal networks to detect and characterize criminal organizations in networks reconstructed from the CDRs. They also introduce an expert system to support law enforcement agencies in unveiling the underlying structure of criminal networks.
2.2.2 GPS and transportation data
Since the 1990s, Global Positioning Systems (GPS) have been used for tracking the movements of the individuals [
99‐
102]. In particular, GPS data provide time and location coordinates information, which can be used to link locations with environments and to calculate the speed of movements [
103]. For insurance reasons, some vehicles have a black box installed. The device records the position of the vehicle at regular intervals and sends it to the database. Table
3 illustrates an example of the structure of GPS records.
Table 3
Example of GPS records
63 | 2014-06-18 06:31:24 | 43.557703 | 10.337913 |
63 | 2014-06-18 06:31:26 | 43.557725 | 10.33794 |
63 | 2014-06-18 06:31:27 | 43.557735 | 10.337955 |
\(\vdots \) | \(\vdots \) | \(\vdots \) | \(\vdots \) |
GPS data can also cover rural areas, as opposed to other data, mostly collected among citizens of urban areas [
104]. Comparing to the traditional ways of measuring mobility, usually by self-reports assessed with questionnaires, GPS does not bring any biases and misclassification, [
104,
105], as it eliminates the social desirability usually brought by self-report participants [
106,
107]. Another advantage of GPS data is that they provide real-time monitoring. However, while there are studies based on GPS data covering hundreds of thousands of individuals [
108] most of the GPS studies are conducted with fewer than 300 participants [
104,
109], usually due to privacy issues. Apart from this drawback, when a GPS is used indoors, the spatial accuracy of the measurements is fairly detected [
110], which creates problems in specific fields, such as on epidemiology research.
GPS data are used to explore several dimensions of objective well-being, notably health (B1), socioeconomic development (B3), and safety (B5).
Table 4
The table contains a subset of the information returned by a Twitter API
240556 | null | #ny #dinner | [10214;452879] | ..... | {.....} | ... |
4261063 | NY | null | null | ..... | {.....} | ... |
72096 | 42.10;10.2 | #wellbeing | [964215] | ..... | {.....} | ... |
Health (B1) exploration has also attracted the interest of researchers. For example, Saelens et al. [
111] track the movements of an individual through GPS devices and bring to the surface growing evidence that transit users are more physically active than non-transit users, which could potentially lead to the health improvement of the first ones. Similarly, Rundle et al. [
112] explore health in terms of physical activity, and conclude that neighborhood walkability influences other residents’ choice of space utility and is also associated with higher weekly physical activity. Additionally, Sadler et al. [
113] use GPS data to understand children’s exposure to junk food in Canada and compare the results to a validated food environment database. They demonstrate that official results underestimate exposure to junk food up to 68%, which should be taken into consideration by policy-makers. Finally, Canzian and Musolesi [
114] analyze mobility patterns from GPS traces to answer whether mobile phones can be used to monitor individuals affected by depressive mood disorders. They develop a smartphone application that periodically collects the locations of the users and the answers to daily questionnaires that quantify their depressive mood. They find a significant correlation between mobility trace characteristics and the depressive moods of individuals.
Some of these works using GPS data focus on exploring socioeconomic development (B3). Marchetti et al. [
115] perform a study at regional level, analyzing GPS tracks from cars in Tuscany to extract measures of human mobility at province and municipality level. They find that there is a strong correlation between the mobility measures and a poverty index independently surveyed by the Italian official statistics institute. Smith et al. [
116] use an automated fare collection data set of journeys made on the London rail system to build a classifier that identifies areas of the city with high economic deprivation. They highlight that, given its high precision, the classifier provides potential benefits for city planning and policy-making. Lathia et al. [
117] use the same data set to find that more deprived areas tend to receive passenger flow from a higher number of other areas compared to less deprived areas, also uncovering some evidence of social segregation.
Another objective well-being dimension that is explored with GPS data is safety (B5). Robinson et al. [
118] compare the spatial distribution of objective crime incidents and self-reported physical activity among adolescents in Massachusetts, between 2011 and 2012, and show that there is a positive association between them (
\(r=0.72\),
\(p<0.0001\)). Ariel et al. [
119] use GPS data to replicate findings published from US official research on the effect of hot spots policing for the prevention of crime in England and Wales and demonstrate that victim-generated crimes (the primary outcome measured in previous studies) increase in both the near vicinity and in catchment areas.
Social media, such as Twitter, Facebook, and Instagram, can be considered as a digital database of information about online users, hence rendering individuals’ online activities accessible for analysis. Given this enormous potential, researchers, governments, and corporations are turning their interest on social media to understand human behavior and interactions better [
120]. Among all social media, Twitter is the most popular, since it provides public access to data through APIs with the least restrictive policy. The Twitter APIs return information about locations, date of the event, interactions with other users, or tags inserted in the tweet. Twitter also returns some information about the user profile. Table
4 illustrates an example of the structure of Twitter records.
Despite their indubitable usefulness, social media data may also encounter some concerns [
121]. First of all, they may reflect social desirability biases, since individuals manage their online profiles [
122]. Besides, social media users may not be as representative of the general population as traditional anonymized self-reports conducted through a chosen representative sample [
123].
All dimensions of objective well-being are monitored through social media data, i.e., health (C1), job opportunities (C2), socioeconomic development (C3), environment (C4), safety (C5) and politics (C6).
Several studies provide valuable insights into how the analysis of social media data can lead to next-generation automated methodologies for public health (C1). As an example, Eichstaedt et al. [
123] use Twitter data, in combination with atherosclerotic heart disease (AHD) mortality rates and country-level socioeconomic variables. They predict country-level heart disease mortality since the language expressed on Twitter reveals important psychological characteristics that are significantly associated with heart disease mortality risk. Besides, De Choudhury et al. [
124] use Twitter data in combination with traditional depression screening test data for the detection and diagnose of the individuals’ major depressive disorders and even to predict the likelihood of depression of individuals. Signorini et al. [
125] use data from Twitter to track rapidly-evolving public sentiment concerning H1N1 and to measure actual disease activity. They show that Twitter can be used as a measure of public interest or concern about health-related events and that estimates of influenza-like illness derived from Twitter chatter accurately track reported disease levels. Paul et al. [
126] incorporate in their forecasting models the historical influenza data and Twitter data. Lampos et al. [
127] measure the prevalence of flu-like symptoms in the general UK population, based on the contents of Twitter, searching for symptom-related statements, turning this information into a flu-score and they obtain on average a statistically significant linear correlation which is higher than 95%. In a later work, the authors [
128] instead of choosing the keywords and phrases themselves, they use machine learning algorithms to find out which words in the database of tweets occurred more often at times of elevated levels of flu, and they obtained very positive results. They claim that flu epidemics can be detected based on Twitter content. Chen and Yang [
129] use individuals’ tweets with spatiotemporally tagged information to demonstrate that people’s healthy diet is elicited by exposure to their immediate food environment.
Regarding the monitoring of job opportunities (C2), Llorente et al. [
130] quantify the extent to which deviations in diurnal rhythm, mobility patterns, and communication styles across regions relate to unemployment. For this purpose, they examine country-wide Twitter data describing 19 million geo-located messages and find that the regions exhibiting more diverse mobility fluxes, earlier diurnal rhythms, and more correct grammatical styles display lower unemployment rates. Antenucci et al. [
131] use data from Twitter, from July 2011 to early November 2013, to create indexes of job loss, job search, and job posting. They derive signals by counting job-related phrases in tweets such as “lost my job”. They construct social media indexes from the principal components of these signals and manage to track events that affect the job market in real-time, such as Hurricane Sandy and the federal government shutdown.
A large number of works in the literature focus on monitoring socioeconomic development from social media data (C3). Bollen et al. [
132], in a further study, analyze data from Twitter and consider the emotions of traders, rather than their information gathering processes, suggesting that changes in the calmness of Twitter messages could be linked to changes in stock market prices. Still, regarding socioeconomic development, social media data are also extensively used to nowcast and forecast stock market prices and traded volumes. Seminal works in this field leverage information contained within investment discussion boards and blogs. For example, Bar-Haim et al. [
133] use StockTwits data to uncover relevant correlations between Web-derived indicators and the stock market. In detail, they leverage sentiment scores of messages shared in the Yahoo message boards to find correlations with the stock market. In a different web platform study, De Choudhury et al. [
134] try to find correlations between the stock market and blog communications. Last, Cresci et al. [
135,
136] assess the risks and vulnerability of stock markets to automation, manipulation, and disinformation, with the ultimate goal of safeguarding people’s investments.
Researchers also use social media for the exploration of the environment dimension (C4). Avvenuti et al. [
137] claim that the analysis of social media proves valuable for quickly acquiring situational awareness and estimates of the impact of disasters. As an example of the predictive power of social media, Kryvasheyeu et al. [
138], Avvenuti et al. [
139] and Mendoza et al. [
140] demonstrate the viability of predicting or nowcasting the damage produced by earthquakes by analyzing social media communications in the aftermath of the event. The results of these models can also be displayed in real-time, interactive maps that highlight stricken areas and provide support to emergency responders. Notable examples of this kind are the systems developed by Avvenuti et al. [
141,
142]. Preis et al. [
143] find that the number of photos taken and subsequently uploaded to Flickr with titles, descriptions, or tags related to Hurricane Sandy bears a striking correlation to the atmospheric pressure in the US state New Jersey. They claim that appropriate leverage of such information could be useful to policy-makers and emergency crisis managers.
Safety is another dimension that can be monitored using data from social media (C5). For example, Chen et al. [
144] use Twitter data and create a model that predicts the specific time and location a crime occurs. This model combines kernel density estimation based on historical crime incidents and prediction via linear modeling with sentiment and weather predictors. By adding the latter determinants, they show that their model improves significantly with respect to existing models. Similarly, Boni et al. [
145] use spatio-temporally tagged tweets and create a model for crime prediction. In particular, they combine real crime data with individuals’ micro-level movement patterns extracted from Twitter and demonstrate improved predictions. Likewise, Kadar et al. [
146] describe urban crime by using Foursquare and considering these data as a measurement for the ambient population of a neighborhood, to further describe crime levels. They also confirm that such models improve the traditional models, based on census data. Additionally, the city of Chicago applies text analytics on Twitter and 311 (the local emergency number) records to detect and prevent phenomena like rat infestations and to track civil unrest and violent crimes (CrimeScan and CityScan software) [
147‐
149].
Finally, the politics dimension (C6) is extensively studied, in particular, during the last years with the rise of the political crisis across the world. Colleoni et al. [
150] investigate the political homophily on Twitter to classify users as Democrats or as Republicans based on their tweets. They show that, in general, the former exhibit higher levels of political homophily than the latter. Goh et al. [
151] use Facebook pages of a group of 12 politicians and demonstrate that political engagement can be achieved by creating social media consumption habits, as supported by the habit formation in consumption from macroeconomics. Similarly to the field of socioeconomic and financial analyses, social media data can be easily manipulated also for achieving political goals [
152,
153]. As such, results of political analyses based on social media should be carefully weighed to minimize issues related to biases and manipulations.
2.2.4 Health and fitness data
These data mainly consist of Electronic Health Records (EHRs) and mobile application data that are mainly used for monitoring the health dimension (D1). EHRs, initially created for the facilitation of the billing and patient care, are widely used for clinical studies and clinical risk prediction. Table
5 reports an example of clinical records concerning the hospitalization of some patients.
Table 5
The table shows an example of clinical records, including the pathology for which a patient is admitted to the hospital, the duration of hospitalization and the medicines she/he took
01/02/2019 | 01/02/2019 | Asthma | m1,m2,m3 |
03/02/2019 | 08/03/2019 | Head trauma | m5 |
Out of a systematic review, Goldstein et al. [
154] demonstrate both opportunities and challenges of EHRs. On the one hand, compared to the traditionally used cohort data developed and collected for research purposes (such as the Framingham Heart Study [
155]), EHRs are cost-effective. In contrast with cohort data, EHRs can indeed be used for multiple health studies and, since they are collected at a high frequency, they allow a better prediction of near-term risk of events. On the other hand, EHRs include only individuals that have been ill or at least have had a clinic visit, which could generate a problem of representativeness. Moreover, they are not publicly available and might include limited time slots.
Table 6
Subset of the main fields provided by GDELT platform
815176338 | Arrest, detain | − 70 | 20180110 | US | |
815176339 | Use conventional military force | − 30 | 20180110 | UK | ... |
815176340 | Consider policy option | + 25 | 20180110 | IT | ... |
Researchers use EHRs to monitor several aspects of personal health (D1). For example, Sultana et al. [
156] use the Integrated Primary Care Information (IPCI) database to look for elements that could contribute to traditional methodologies. For example, multimorbidity and polypharmacy are elements that could help in identifying frailty methodologies. They demonstrate that the Mini-Mental State Examination score, which is the most commonly recorded data item, could be potentially used as a frailty identifier. Ghaderighahfarokhi et al. [
157] use medical records of newborns in the educational Hospitals affiliated to the Ilam University of Medical Sciences (from April 2015 to April 2016) to identify accurate predictors of Low Birth Weight (LBW). They demonstrate that LBW is a multi-factorial condition requiring a systematic and accurate program to be reduced, such as education through mass media, repeated monitoring of pregnancy, and others. Metzger et al. [
158] use EHRs with Emergency Department patient visits in 2012, from Lyon University Hospital, to demonstrate that machine learning can contribute to more accurate estimations of suicide attempts in France, in relation to the current national surveillance system based on manual coding by emergency practitioners. Mhaskar et al. [
159] investigate the 30 minutes prediction of blood glucose levels based on Continuous Glucose Monitoring (CGM). In particular, they use data from the DirecNet Central Laboratory, containing time series for 25 patients, who are less than 18 years old. By training a deep learning model on a data set designed to explore the performance of CGM devices in children with Type I diabetes, they demonstrate how deep neural networks can outperform shallow networks on this task. In addition, Santillana et al. [
160] use a clinician’s database, named as UpToDate, to predict influenza epidemics in the United States promptly. They show that digital disease surveillance tools based on experts’ databases may be able to provide an alternative, reliable, and stable signal for accurate predictions of influenza outbreaks. Besides EHRs, mobile app data, such as lifestyle habits data concerning eating and physical activity behaviors, are used for the monitoring of objective well-being in terms of health (D1). These data demonstrate for once more that smartphones can contribute to research with valuable new insights, although they might apply biases towards people with lower socioeconomic status or towards people who are more interested in their health. In addition, such data collected through web surveys for research purposes might bring the disadvantages discussed before. A critical study using mobile app data is conducted by Althoff et al.[
161]. They use a data set consisted of physical activity for 717,527 Apple iPhone smartphone users of the Azumio Argus app, which tracks users’ diet and fitness and other healthy behaviors, between July 2013 and December 2014. They demonstrate inequality in how the activity is distributed within countries and that this inequality is a better predictor of obesity than average activity level. Similarly, Hayeri [
162] uses continuous glucose monitors (CGM) and fitness wearables (Fitbit) to predict blood glucose values. The study uses data gathered from each participant for 60 days, where the data from the first 30 days are used to train the algorithm and the remaining 30 days to test the predictions. On average, the software is able to predict a user’s future glucose values with a 93% accuracy rate for 60-mins ahead of time.
2.2.5 News
News data sources, such as the GDELT database [
163], contain information extracted from the news of newspapers around the world. News records generally describe a variety of subject domains (e.g., economic events, political events), represent a wide range of targets (e.g., opposing politicians) [
164] and are continuously updated, containing even archived historical news of the last decades. Nevertheless, such data contain three main biases [
165]: the gatekeeping bias, i.e., the editors or the journalists decide on which event to publish; the coverage bias, related to the coverage of an event (e.g., western countries are over-covered, whereas African countries are under-covered); the statement bias, when the content written by the journalist, even if tried to be objective, is favorable or unfavorable towards certain events. Table
6 shows an example of news records.
News records are used to measure health (E1), socioeconomic development (E3), environment (E4), and politics (E6) dimensions of objective well-being.
Emerging infectious diseases and the rise of modern technology have generated new demands and possibilities for disease surveillance and response (E1). Growing numbers of outbreak reports must be assessed rapidly so that control efforts can be initiated. For example, the World Health Organization (WHO) sets up a process for timely disease outbreak verification to convert large amounts of data from some 600 sources, including all major news wires, newspapers, and biomedical journals, into accurate information for suitable action [
166,
167]. Brownstein et al. [
168] in a similar effort, create HealthMap, a freely accessible, automated real-time system that monitors, organizes, integrates, filters, visualizes, and disseminates online information about emerging diseases. Wilson et al. [
169] use the HealthMap project to monitor listeriosis. Chunara et al. [
170] use social and news media to validly estimate the 2010 Haitian cholera outbreak.
News records on financial affairs and financial markets are intrinsically interlinked (E3). Alanyali et al. [
171] quantify the relation between movements in financial news and movements in financial markets by exploiting a corpus of six years of financial news from 2007 to 2012 from the Financial Times. Their results suggest that greater interest in a company in the news is related to greater interest in the corresponding company in stock markets. Lillo et al. [
172] show that the flux of news of the previous day affects the trading activity of companies, households, and foreign investors and the dynamics of volatility.
News can also help capturing the environmental dimension of well-being (E4). As an example, Kleinschmit et al. [
173] investigate 394 articles on forest and climate change published in the Swedish newspaper Dagens Nyheter from 1992 to 2009. They show that there has been an increasing discussion on forests in a changing climate over the last 18 years from both scientists and politicians. The increased number of these news events correlate with real environmental events happening internationally. Similarly, Boykoff [
174] uses data extracted from the Vanderbilt University Television News Archive, consisting of television news from US news broadcasts (e.g., ABC World News Tonight) for the period between 1995 and 2004. He demonstrates that 70% of the US television news provide balanced coverage on anthropogenic contributions to climate change compared to natural radiative forcing. He also shows that there is a significant difference between this television coverage and scientific consensus on the topic.
News records are also used to understand the coverage of political issues (E6). Van Aelst and De Swert [
175] use daily news of politics of campaign periods, extracted from the Electronical News Archive over the 2003 to 2006 period, and show that campaign periods have a high impact on the amount, style and actors of the political news in Belgium. To the best of our knowledge, the dimension politics (E6) has not yet been adequately explored through news data and constitutes inspiration for future research.
2.2.6 Scanner data
Scanner data are generated by point-of-sales terminals in shops and provide information at the level of the single product. Sales terminals record each transaction, and the resultant data can provide considerable insights into consumer purchasing patterns. They can be obtained from a wide variety of retailers: supermarkets, pharmacies, do-it-yourself stores, home electronics or clothing shops, and many others [
176].
Scanner data are used from social researchers, as they can offer useful detailed information and the possibility to model the dynamic behavior of households, as well as to control for unobservable time-invariant characteristics [
177]. Also, scanner data provide information over long periods of time than only one day or a couple of weeks. This happens because the final data used are produced from customers that purchase several items on each store visit, for several store visits, over a period of time [
178,
179]. It is also worth mentioning that scanner data can contribute to the improvement of the quality of the Harmonized Index of Consumer Prices (HICP) [
180]. However, using scanner data is challenging since researchers are dependent on the retailer’s permission, and they should also overcome the legal constraints in order to obtain them [
179]. Table
7 shows an example of supermarket records.
Table 7
Subset of the main fields provided by a supermarket database for the purchases in different shops
2018020156287 | 109745368 | 2018-02-01 17:30:14 | Pisa, Italy | 2018020101567 | [bread, milk, eggs, tissues] |
2018020578256 | 104827423 | 2018-02-05 10:14:57 | Torino, Italy | 2018020500234 | ... |
2018020743624 | 012753862 | 2018-02-07 19:57:00 | Florence, Italy | 2018020721987 | ... |
Scanner data are used to measure health (F1), socioeconomic development (F3), and environment (F4) dimensions of objective well-being.
To begin with, researchers use scanner data to monitor several aspects of public health (F1). For example, pharmaceutical sales may be used to predict changes in clinical conditions with a useful time lead. Magruder et al. [
181] find a 90% correlation between flu-related drug sales and physician diagnoses of acute respiratory conditions, at several subregions of the National Capital Area. They show that these sales occur approximately three days before the physician-patient encounter. Scanner data are also used to study the nutrients and saturated fat of several food categories and their implications on personal health. For example, Griffith et al. [
177] use supermarket scanner data from the UK to study the nutrients in foods. They show that there is a lot of variation in nutrients at individual product level, even with food categories such as butter, which are very narrow. Bonnet et al. [
182] use data from French supermarkets to explore consumer behavior with respect to the consumption of saturated fat, while Griffith et al.[
183] model the potential impact of a tax on saturated fats. Finally, Janssen et al. [
184] use scanner data from the Nielsen Consumer Panel data set that covers the years from 2004 to 2017. They aim to identify households with a pregnant household member and also to estimate the effect during and after pregnancy on alcohol purchases and relative expenditure on fruit and vegetables. Results show that during and after pregnancy, households reduce their alcohol purchases by 22–27%. In contrast, the relative expenditure on fruit and vegetables does not increase during pregnancy but decreases post-pregnancy by 19%.
The majority of studies with scanner data focus on exploring the socioeconomic development (F3). Van der et al. [
186] introduce a new method for computing the Dutch Consumer Price Index (CPI) based on supermarket scanner data. In the meanwhile, in 2017, Eurostat issued a practical guide for processing supermarket scanner data to calculate the CPIs of EU countries in order to ensure the comparability of the values across Europe, as well as to modernize the official statistics [
179]. Silver et al. [
187] outline the potential use of scanner data from retailers for the measurement of inflation. They use monthly scanner data for television sets in 1998 in the UK to study the two primary forms of bias in CPIs. Moreover, Pennacchioli et al. [
188] study the retail activity of the customer subset of an Italian supermarket chain. They discover that highly ranked customers, with more sophisticated needs, tend to buy niche products, i.e., low-ranked products. On the other hand, low-ranked, low purchase volume customers tend to buy only high-ranked products, very popular products that everyone buys. In addition, Sobolevsky et al. [
189] use a complete set of bank card transactions in 2011 in Spain and demonstrate that there is a clear correlation between individual spending behavior and official socioeconomic indexes denoting the quality of life.
Finally, researchers use scanner data to monitor the impact of humans on the environment (F4). Panzone et al. [
190] use scanner data from the largest UK food retailer for the creation of an Environmentally Sensitive Shopper (ESS) index measuring the environmental sustainability of food consumption at household level. In addition, Gadema et al. [
191] use data from UK supermarket shoppers to examine whether carbon footprinting and labeling food products are tools that could facilitate consumers to make greener purchasing decisions. They claim that this could be a sensible way to potentially achieve a low carbon future. Food waste is a significant problem in modern society and carries considerable social, economic, and environmental costs. For example, Brancoli et al. [
192] use scanner data to analyze the impacts of food waste at a supermarket in Sweden. They discover the importance of not only measuring food waste in terms of mass but also in terms of environmental impacts and economic costs. They also show that meat and bread waste contribute the most to the environmental footprint of the supermarket. Last, Scholz et al. [
193] analyze food waste data of six Swedish supermarkets from 2010 to 2012 in terms of mass and carbon footprint. They calculate the wastage carbon footprint for fresh products such as meat, deli, cheese, dairy, and fruits and vegetables.
2.2.7 Web search queries
Web search queries data report the frequency of specific terms over time, entered into a web search engine from users to satisfy their information needs. Data are represented as time series of the frequency, and therefore we do not provide an example of search queries records in this paper.
Comparing to other data sources that require customized and often complicated collection strategies, search data can be collected for many domains simultaneously. They can also be easily analyzed across several countries or regions in real-time. Search data are often helpful in making forecasts. However, their utility for predicting real-world events is based on convenience, speed, and flexibility and has less to do with their superiority over other data sources. Goel et al. [
194] provide a useful survey in this area and describe some of the limitations of this data source. First, for different domains, the size of the relevant population varies considerably, along with difficulty in identifying relevant queries. Additionally, in specific domains, searching may be more closely tied to the measured outcomes than in others.
Web search queries data are used to measure health (G1), job opportunities (G2), socioeconomic development (G3), safety (G5), and politics (G6) dimensions of objective well-being.
Public health is a dimension of well-being that is explored through web search queries (G1). In order to improve early detection, researchers monitor health-seeking behavior in the form of web search queries, which are submitted by millions of users around the world every day. For example, Cooper et al. [
195] study Yahoo! search activity related to cancer in the USA. They find out that the Yahoo! search activity associated with cancer correlates with the estimated cancer incidence and estimated cancer mortality. Polgreen et al. [
196] show that search volume for handpicked influenza-related queries is correlated with the reported number of cases over the period from 2004 to 2008. Hulth et al. [
197] find similar results in a study of search queries submitted on a Swedish medical Web site. Yuan et al.[
198] monitor influenza epidemics in China with search queries from Baidu. Additionally, an automated procedure for identifying informative queries is described by Ginsberg et al. [
199]. Based on that, Google Flu Trends [
200] was introduced by Google in 2008 to provide real-time estimates of flu incidence for more than 25 countries and to help predict outbreaks of flu. Nsoesie et al. [
201] present a framework for near real-time forecast of influenza epidemics using web-based estimates of influenza activity from Google Flu Trends for 2004–2005, 2007–2008, and 2012–2013 flu seasons. Yang et al. [
202] use Google Flu Trends and historical data to infer the evolving epidemiological features of influenza and its impacts among the large population during 2003–2013, including the 2009 pandemic. Wilson et al. [
203] use data from Google Flu Trends to study the spread of the pandemic H1N1 influenza in New Zealand during 2009. Furthermore, Chan and Althouse [
204,
205] use Google queries to monitor Dengue epidemics, Dukic et al. [
206] to predict hospitalizations for methicillin-resistant Staphylococcus aureus infections and Ocampo et al. [
207] for malaria surveillance. Moreover, Yang et al. [
208] evaluate the association between suicide and Google searches trends for 37 suicide-related terms representing major known risks of suicide in Taipei City, Taiwan, from 2004 to 2009. Their results show that a set of suicide-related search terms, the trends of which either temporally coincided or preceded trends of suicide data, are associated with suicide death. Searches for “major depression” and “divorce”, for example, account for at most, 30.2% of the variance in suicide data. McCarthy [
209] uses annually-averaged Google search activity for “suicide” from the same period, from 2004 to 2009 to study suicide rate data in the United States. The study shows that searches for most medical, familial, and socioeconomic terms precede suicide deaths, and most searches for psychiatric-related terms coincide with suicide data. In a later work, Kristoufek et al. [
210], use Google data from 2004 to 2013 in combination with suicide occurrences data to estimate the number of suicide occurrences in England. Finally, Adler et al. [
211] combine official statistics on demographic information with data generated through search queries from Bing, between November 2016 and February 2017, to gain insight into suicide rates per state in India. In this way, their search data work as a proxy for unmeasured (hidden) factors corresponding to suicide rates.
Table 8
Example of the information provided by users of influenzanet
784590 | 35 | M | 2017-12-03 | 38.0\(^\circ \) | [cough, sore throat] |
275173 | 28 | F | 2018-01-05 | 36.6\(^\circ \) | [no symptoms] |
428415 | 64 | M | 2018-04-13 | 38.2\(^\circ \) | [tired, runny nose] |
The first to explore the job opportunities dimension (G2), are Ettredge et al. [
212] as they find that counts of the top 300 search terms during from 2001 to 2003 are correlated with US Bureau of Labor Statistics unemployment figures. Later on, Askitas et al. [
213], D’Amuri et al. [
214], Suhoy et al. [
215] confirm the value of search data in forecasting unemployment in the US, Germany, and Israel. Baker et al. [
216] use Google search data to examine how job search responds to extensions of unemployment payments. Finally, McLaren et al. [
217] summarise how online search data can be used for economic nowcasting by central banks. They show that the volume of online searches can be used as indicators of economic activity, more specifically for unemployment and housing markets in the United Kingdom.
Researchers use search queries to monitor socioeconomic development (G3) as well. Choi and Varian [
218,
219] consider Google Trends as a source of data on real-time economic activity, and they show that by using its query indices accurate predictions can, for example, be made for retail, automotive, etc., and could be helpful for short-term economic prediction or nowcasting. Koop and Onorante [
220] use Dynamic Model Selection (DMS) methods, which allow for model switching between time-varying parameter regression models. They extend the DMS methodology by allowing Google variables to determine the nowcasting model to be used at each point in time. Guzman [
221] examines Google data as a predictor of inflation. Additionally, Preis et al. [
222] provide evidence that search engine query data and US stock market fluctuations are correlated. In a later [
223] work, they analyze changes in Google query volumes for search terms related to finance, and they find patterns that may be interpreted as “early warning signs” of stock market moves. Furthermore, Curme et al. [
224] present a method that allows identifying topics for which levels of online interest change before large movements of the Standard & Poor’s 500 index (S&P 500). They find that search volumes from Google related to politics and business can be linked to subsequent stock market moves. This demonstration of a connection between stock market transaction volume and search volume is also replicated using Yahoo! data, where Bordino et al. [
225] show that query volumes precede in many cases peaks of trading by one day or more. Finally, Moat et al. [
226] show that data on views of Wikipedia pages can also be related to market movements, providing evidence that increases in the number of views of financially related pages on Wikipedia can be detected before stock market falls.
Search data are also used for the exploration of safety (G5). Qi et al. [
227] show that a simple low-level indicator of civil unrest can be obtained from online data at an aggregate level through Google Trends or similar tools. The study covers countries across Latin America from 2011 to 2014 in which diverse civil unrest events took place. In each case, they find that the combination of the volume and momentum of searches from Google Trends surrounding pairs of simple keywords, tailored for the specific cultural setting, provide useful indicators of periods of civil unrest. Qi et al. [
228] study online search activity from Google Trends surrounding the topics of social unrest over several countries in Latin America from 2011 to 2014. They find that the volume and momentum of searches surrounding mass protest language, can detect—and may even pre-empt—the macroscopic on-street activity. They also find that the most crucial search keywords differ subtlety from country to country, even though the language may be the same. They explain this by the fact that civil unrest is a time-varying coordinated interaction between individuals, groups, or populations within a given cultural and socioeconomic setting.
Finally, the politics dimension is explored with search data (G6). Chykina et al. [
229] study how Google Trends can be used to examine issue salience for hard-to-survey mass populations in the US, from 2010 to 2017. They apply this method to immigrant concerns over deportation. They show that anxieties over removal increase in response to (potential) policy changes, such as immigration policies that are considered in the wake of Donald Trump’s election. Reilly et al. [
230] use Google search activity for ballot measures’ names and topics in a state one week before the 2008 Presidential election, and they find that they correlate with actual participation on those ballot measures. Their result demonstrates that the more Internet searches there are for a ballot measure, the less likely voters are to roll-off (not answering the question) and establish the validity for this data for a critical topic in state politics research.
2.2.8 Crowdsourced data
Kleemann and Rieder [
231], in 2008, have defined crowdsourcing as the “the intentional mobilization for commercial exploitation of creative ideas and other forms of work performed by consumers”. In other words, crowdsourcing involves obtaining work, information, or opinions from a large group of people who submit their data via the Internet, smartphone apps, etc. Naturally, crowdsourcing brings several advantages. Crowdsourcing can provide researchers with a huge amount of data, which can be accessed quickly and at a relatively low cost. Besides, comparing to traditional research (such as studies using traditional surveys), the use of crowdsourcing can provide researchers with data from samples that are more diverse [
232]. However, crowdsourcing yields various challenges, as well. Firstly, crowdsourcing may bring relatively low-quality results, e.g., a participant of a crowdsourced study may intentionally give wrong answers. Secondly, mobile platforms pose new challenges for crowdsourced data management. Table
8 shows an example of crowdsourced data.
Crowdsourced data are used to capture all dimensions of objective well-being, i.e., health (H1), job opportunities (H2), socioeconomic development (H3), environment (H4), safety (H5) and politics (H6) dimensions of well-being.
To improve early detection, researchers started monitoring the health of individuals (H1) through crowdsourced self-reporting mobile apps, such as Influenzanet (Europe) [
233], Flutracking (Australia) [
234], and Flu Near You (United States) [
235]. Hashemian et al. [
236] introduce iEpi, an end-to-end system for epidemiologists and public health workers to collect, visualize, and analyze contextual microdata through smartphones. Additionally, Madan et al. [
237] use data from a smartphone application provided to university students to study their health state. Participants fill out self-report surveys related to their health habits, diet, exercise, weight changes, daily symptoms related to common colds, fever, influenza, and mental health. The researchers find that phone-based features can be used to predict changes in health, such as common colds, influenza, and stress. For longer-term health outcomes such as obesity, they find that weight changes of participants are correlated with exposure to peers who gain weight in the same period. Finally, Martinucci et al. [
238] study Gastroesophageal Reflux Disease (GERD) symptoms among Italian university students from a data set collected from a web-app. The app allows users a self–diagnosis for the gastrointestinal disturbances through a simple questionnaire and data about the students’ food consumption at the university canteen. They show that 792 students reported typical GERD symptoms to occur at least weekly. Among all users, females, smokers, and high in BMI students tend to show increased GERD values.
Researchers use crowdsourced data to explore the job opportunities dimension (H2) and the direct socioeconomic benefits associated with it. For example, Green et al. [
239] use the crowdsourced employer review website named Glassdoor, an online crowdsourced employer branding platform, to explore employees’ satisfaction and work–life balance. This exploration is preliminary for the direct economic benefit and most important finding of the study; companies experiencing improvements in employer ratings are significantly associated with future stock returns, comparing to companies with declines in employer rating. Similarly, Dabirian et al. [
240] analyze reviews of the highest and lowest-ranked employers on Glassdoor. Using IBM Watson to analyze the data, they show how employers could use crowdsourced employer branding intelligence to turn into a workplace that attracts highly qualified employees. Furthermore, Könsgen et al. [
241] analyze employee reviews data, listed on the German employee review site named Kununu.de, combined with
\(2\times 2\times 2\) between-subjects experimental design. Results show that such studies can complement the research on the online reputation by underlying the relevance of discrepant reviews for job candidates’ application intentions.
Crowdsourced data are also used to estimate the socioeconomic (H3) well-being. For example, Tingzon et al. [
242] show the feasibility to map poverty by combing crowdsourced geospatial information with nighttime lights, daytime satellite imagery, and human settlement data. In particular, they use the popular geospatial data crowd-sourcing platform named OpenStreetMap [
243] to map poverty in the Philippines. Similarly, Piaggesi et al. [
244] use OpenStreetMap [
243] crowdsourced data merged with official data at a city scale. They demonstrate the possibility of estimating the socioeconomic conditions of different neighborhoods of five different cities in North and South America. In order to increase the efficiency of direct money transfers to impoverished villages in Kenya and Uganda, Abelson et al. [
245] develop and deploy a crowdsourcing interface to obtain labeled satellite imagery training data. They train and deploy a predictive model for detecting impoverished villages. Their estimations are leveraged to build a fine-scale heat map of poverty that is used to recommend donations to the most impoverished villages.
Crowdsourcing is also used to capture the environmental dimension of well-being (H4). There are plenty of examples of crowdsourcing platforms for emergency management, such as Ushahidi [
246], where volunteers provide updated environmental information in the aftermath of mass emergencies. These platforms are shown to contribute significantly to organizing a prompt emergency response [
247]. Another category of crowdsourced platforms is the so-called citizens’ observatories [
248], a community-based network of environmental monitoring and information systems. On these platforms volunteers monitor and provide data about a plethora of environmental dimensions, such as comprising water availability and water quality, air pollution, land use, and flood risk management [
249]. As an example, Schneider et al. [
250] combine crowdsourced data from the EU-funded CITI-SENSE project, which measures the air-quality with data obtained from statistical or deterministic air quality models. Their goal is to present a novel data fusion-based technique for combining real-time crowdsourced observations with model output that maps the urban air quality in detail. This could help users find the least polluted routes or control their exposure to pollution while moving around the city. Besides, Meier et al. [
251] use crowdsourced atmospheric data from Netatmo weather stations in the city of Berlin, as well as available metadata to explore the urban atmosphere. Results show a distinctive urban heat island pattern in Berlin during the night and are also validated, confirming that crowdsourced atmospheric data can contribute to advancement in climate research. Similarly, Chapman et al. [
252] use Netatmo weather station crowdsourced data to quantify the urban heat island in the city of London over the summer of 2015. Their results are similar to previous studies with official data and are therefore validated.
Crowdsourced data are considered an important data source for studying safety (H5). Suzanne Goodney et al. [
253] map violence against women with the use of a crowdsourced app named as Safecity.in, which includes anonymous reporting of violence against women. The goal of the study is to highlight the importance of crowd mapping violence, as it can make women aware of potentially dangerous locales, encourage violence reporting, and provide advice on practical solutions for navigating street harassment and assault in public buses. Furthermore, Gosselt et al. [
254] use the Internet Movie Database (
www.imdb.com) to study the violent behavior and victimization of male and female film characters over time in the United States. In particular, using IMDb synopsis texts, they analyze reviewers’ movie descriptions. They demonstrate that both perpetrators and victims are mainly male, as well as that violence becomes less severe and more often non-deadly over the years. Researchers underline the future potentiality of using such data sources to explore matching results with actual crime figures. Additionally, Ozkan et al. [
255] use crowdsourced police-involved killings data from FatalEncounters.org, as well as media data, to control whether police killings is counted and reported correctly in the aforementioned unofficial data, as compared to official data in the city of Dallas. Results mostly show consistency between all data sources. In conjunction with social media and crowdsourcing data sources, as well as environmental and safety dimensions, Avvenuti et al. [
256] collect targeted and detailed information from people involved in natural disasters through crowdsourcing surveys via social media. These data are used to monitor unfolding disasters better and to monitor their consequences (i.e., damage caused)
Last, crowdsourced data are also used to study the politics dimension (H6) of objective well-being. For instance, crowdsourced data have been used within NGOs to set strategic priorities and involvement in the referendum activities based on participants’ responses to a survey [
257]. Yasseri and Bright [
258] use Wikipedia traffic data for electoral prediction. In particular, they get insights about changes in overall turnout at elections and changes in vote share for certain parties. Furthermore, Gellers [
259] explores whether crowdsourcing can overcome the democratic deficit in global environmental governance. He uses data from the United Nations MY World survey, a multi-year (2012–2015) global poll designed to identify post-2015 development priorities, as well as e-discussions data, organized by the UNDG and the thematic consultation on environmental sustainability ran from November 2012 to July 2013. Results suggest that although crowdsourcing may present an attractive technological approach to enhance participation in global governance, ultimately, the representativeness of this participation and the legitimacy of the policy results depend on the way the contributions are sought and filtered by international organizations.