1 Introduction
The 2030 Agenda for Sustainable Development [1] reflects a unique commitment of the world’s countries to work towards a set of Sustainable Development Goals (SDGs). These 17 ambitious goals come with a set of indicators that serve as a scorecard against which to measure progress. Furthermore, to aid outcome-oriented decision making to improve lives, the data on development progress should be up-to-date and disaggregated across various dimensions, including gender.
Unfortunately, especially for those countries most in need of development, high-quality and up-to-date data on the SDGs is hard to come by. For example, for SDG #1 “No poverty”, of 7 South and 19 South-East Asian countries only 4 and 9 countries, respectively, have poverty data collected since 2015 [2]. Furthermore, poverty data disaggregated by gender is even less available [3].
To overcome challenges related to the timeliness of data, researchers have investigated the use of non-traditional data sources for the purpose of mapping poverty levels [4]. Nightlights data from satellites have been used as a proxy of human well-being [5] and for mapping poverty globally [6, 7] and at sub-national levels [8, 9], as night light, typically linked to electricity usage, correlates with economic activity [10–12]. Other work has examined the use of daytime satellite imagery for poverty mapping [13, 14], for tracking human development indicators [15] and for estimating household-level poverty in rural locations based on land use information extracted from satellite images [16]. Beyond satellite imagery, mobile phone Call Detail Record (CDR) data have been used in predictive models to map aggregate population-level socioeconomic characteristics [17, 18] and poverty levels in a variety of countries [18–20] as well as at the individual level for mobile phone subscribers [21]. Other research has combined satellite imagery with CDR data [22, 23] and with crowd-sourced geographic information from OpenStreetMap (OSM) [24].
In this work, we evaluate the potential value that publicly accessible, anonymous advertising data holds for the mapping of wealth and poverty. Concretely, we use data from Facebook’s Marketing API on how many Facebook users match certain criteria. These audience estimates, which are traditionally used for advertising campaign planning purposes, have shown promising results for tasks such as estimating stocks of migrants [25, 26] and generating measures of digital gender inequalities [27, 28].
We test this approach for creating small area estimates (SAE) across the Philippines and India. As ground truth we use an asset-based measure of poverty, the Wealth Index (WI), derived from the Demographic and Health Surveys (DHS) for each country. According to Pew surveys, 58% and 24% of adults in the Philippines and India, respectively, use Facebook [29], which enables testing this approach in two countries with relatively high and low penetration of Facebook usage. We generate a dataset containing estimates of the proportion of Facebook users utilizing different internet connection types, mobile operating systems and device types.
We use these audience estimates to obtain insights into the spatial distribution of Facebook users, including information on (i) iOS vs. Android device usage, or (ii) 2G vs. 4G connectivity. We demonstrate that these insights provide strong signals for the distributions of wealth and poverty.
Furthermore, these audience estimates can be disaggregated by gender, age or self-declared education level, creating opportunities for more disaggregated estimates of asset ownership and wealth. Focusing on the example of gender, we show that in countries with gender-equal Facebook usage, such as the Philippines, it seems feasible to derive gender-disaggregated models for poverty. However, in India, where the gender selection bias is too strong, our approach fails to provide plausible gender-disaggregated poverty estimates.
2 Materials and methods
2.1 The Demographic and Health Survey (DHS)
The Demographic and Health Survey (DHS) collects survey data in many countries around the globe with the aim of providing nationally representative data on health and population. The survey consists of several types of questionnaires, including a household questionnaire that collects data for the household unit and individual questionnaires that collect data on eligible women and men from the surveyed households. In addition to health-related information, the household survey also collects data on household ownership of various assets such as televisions and bicycles, on housing materials, and on access to water and sanitation facilities. The data on asset ownership is used to compute the Wealth Index for each surveyed household through a Principal Component Analysis (PCA) [30]. The Wealth Index is a real-valued score that takes both negative and positive values, with higher values indicating higher wealth. The Wealth Index is the ground truth measure of poverty we use in this study. The data used here are from the 2017 DHS survey for the Philippines [31] and the 2015-16 DHS survey for India [32].
In the reported DHS data, households are grouped into units called clusters, and the geographic location of each cluster is reported as the latitude and longitude coordinates of its center. In order to preserve respondent confidentiality, the actual coordinates undergo a spatial perturbation process before being reported: location coordinates are perturbed by up to 2 km for urban clusters and up to 5 km for rural clusters, with a further 1% of rural clusters displaced by up to 10 km.
As the analysis here is done at the cluster level, the Wealth Index values reported for surveyed households were averaged across all households in a cluster to get an aggregated mean Wealth Index value for the cluster. Table 1 provides a summary breakdown of the survey cluster locations from the Demographic and Health Survey (DHS) for each country. Geographic coordinates were not reported for some clusters (36 in the Philippines and 131 in India). These clusters with missing coordinates could not be used in the analysis as Facebook data could not be collected for them. Some further clusters had to be excluded due to sparsity of the Facebook data (8 in the Philippines and 350 in India). The row “Clusters with ≥100 FB users 18+” in Table 1 shows the subset of clusters that were used in the analysis. Data from 1205 survey clusters in the Philippines and 28,043 in India were used in the analysis.
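The cluster-level aggregation and the two exclusion rules described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used in the study; the function names and the record fields (`lat`, `lon`, `fb_users_18plus`) are hypothetical, and the threshold of 100 users is taken from Table 1.

```python
from collections import defaultdict
from statistics import mean


def cluster_mean_wealth(households):
    """Average household Wealth Index values within each DHS cluster.

    `households` is an iterable of (cluster_id, wealth_index) pairs
    (a hypothetical input layout).
    """
    by_cluster = defaultdict(list)
    for cluster_id, wealth_index in households:
        by_cluster[cluster_id].append(wealth_index)
    return {cid: mean(values) for cid, values in by_cluster.items()}


def usable_clusters(clusters, min_fb_users=100):
    """Keep clusters that are geo-located and have enough Facebook users.

    Each cluster is a dict with hypothetical keys 'lat', 'lon' and
    'fb_users_18plus'; coordinates are None when DHS did not report them.
    """
    return [
        c for c in clusters
        if c["lat"] is not None
        and c["lon"] is not None
        and c["fb_users_18plus"] >= min_fb_users
    ]
```

Applied to the counts in Table 1, the two filters correspond to 1249 − 36 − 8 = 1205 clusters in the Philippines and 28,524 − 131 − 350 = 28,043 clusters in India.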
Table 1
Breakdown of the data for each country for clusters with at least one surveyed household
| Philippines | India |
Number of DHS clusters | 1249 | 28,524 |
Clusters missing geo-location | 36 | 131 |
Geo-located DHS clusters | 1213 | 28,393 |
Clusters with <100 FB users 18+ | 8 | 350 |
Clusters with ≥100 FB users 18+ | 1205 | 28,043 |
Clusters with >1000 FB users 18+ | 1043 | 25,316 |
Median number of households surveyed (DHS) | 23 | 21 |
Tables S1 and S2 in Additional file 1 report the summary statistics of the DHS Wealth Index distribution for different subsets of clusters in both countries. The clusters used in the analysis had on average a slightly higher Wealth Index (Philippines: mean = 5599; India: mean = 1346) than all clusters taken together (Philippines: mean = 4130; India: mean = 783) but a roughly similar spread of the distribution (Philippines: standard deviation for all clusters = 71,532, for clusters used in the analysis = 70,626; India: standard deviation for all clusters = 79,299, for clusters in the analysis = 79,390). The excluded clusters had lower Wealth Index scores on average (Philippines: mean = −36,105; India: mean = −32,035) than the overall group of clusters.
DHS survey datasets can be accessed for research purposes from the DHS website¹ after creating an account and requesting access for the desired surveys.
2.2 Facebook’s marketing platform
Facebook’s marketing platform makes a rich array of targeting options available to advertisers. Using this platform, advertisements can be targeted based on various user characteristics including geographic location, demographics such as age and gender, and the type of devices and networks that are used to access the social media platform. To help advertisers budget their ads, the platform provides an estimate of the aggregate number of users (called the Monthly Active Users (MAU)) matching a given set of targeting criteria. For example, in the Philippines there are an estimated 63 million Monthly Active Users on Facebook who are aged 18+.²
In this study we investigate how data collected from this platform on the types of networks/devices used by the Facebook users in a given location can be used to predict the socioeconomic situation in that location. For each of the geo-located DHS clusters, we collected data on estimates of Monthly Active Users using a variety of network and device types for the 18+ Facebook user population. Since DHS cluster locations are reported as spatially perturbed latitude and longitude coordinates, we collected data for a given radius around the reported coordinates so that the original location is included in the area for which data is collected. In the Philippines we collected data for a 2 km radius around urban clusters and a 5 km radius around rural clusters. In India we used a radius of 5 km and 10 km for urban and rural clusters, respectively; this was done to alleviate data sparsity issues due to the lower Facebook penetration in India. Sect. 1.2 of Additional file 1 provides more details on the choice of the radius of data collection.
Table 2 provides a list of network and device types for which data were collected. These include various network types, mobile operating systems, high-end Apple and Samsung devices plus a variety of other device types. For the high-end devices, the Apple and Samsung devices released in the two years prior to the data collection were targeted.³ For each network/device type in the list, a feature was generated by computing the fraction of Facebook users who used that network/device type to access Facebook. These are the features used in the predictive models to predict the Wealth Index. In addition to the above-mentioned features, we also include the Facebook penetration as a feature in the model. This variable is the number of Monthly Active Facebook users aged 18+ as a fraction of the total population in a given cluster location, where the cluster population was computed using high-resolution population estimates from WorldPop [33].
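As an illustration of the feature construction described above, the following sketch computes the per-type user fractions and the Facebook penetration for one cluster. It is a minimal example under stated assumptions, not the study's code: the function names and the dictionary keys are hypothetical, and the capping of penetration at 1 follows the rule stated in the caption of Table 2.

```python
def device_fraction_features(mau_by_type, total_mau):
    """Fraction of 18+ Facebook users per network/device type.

    `mau_by_type` maps a (hypothetical) feature name such as 'ios' or
    '4g' to the Monthly Active Users estimate for that targeting
    criterion; `total_mau` is the estimate for all 18+ users in the
    same location.
    """
    if total_mau <= 0:
        raise ValueError("total_mau must be positive")
    return {name: count / total_mau for name, count in mau_by_type.items()}


def facebook_penetration(total_mau, population):
    """18+ Monthly Active Users as a fraction of the estimated population.

    The value is capped at 1 when the user estimate exceeds the
    population estimate.
    """
    if population <= 0:
        raise ValueError("population must be positive")
    return min(1.0, total_mau / population)
```

For example, a location with 10,000 estimated users of whom 2,000 match the iOS criterion yields an iOS feature of 0.2; if the estimated population of that location is 8,000, the penetration is capped at 1.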
Table 2
List of features derived from the Facebook advertising audience estimate data. All features, with the exception of Facebook penetration, are the fraction of Facebook users in the targeted location who use a given network/device type to access Facebook. All data are for users aged 18+. The Facebook penetration is the number of users divided by the total population of the location; where there were more estimated users than the estimated population the value was capped at 1. Note that according to the Facebook audience estimates, of all users who use a smartphone, the percentage who do not use any of the three specified mobile OS types (Android, iOS, Windows) is 61% (India) and 51% (Philippines); of all users, the percentage who do not use any of the four specified network types (2G, 3G, 4G, WiFi) to access Facebook is 25% (India) and 37% (Philippines)
Feature group | Feature | Philippines | India |
| Facebook penetration | 0.664 | 0.555 |
Network access | 2G Network | 0.115 | 0.346 |
| 3G Network | −0.378 | 0.296 |
| 4G Network | 0.693 | 0.003 |
| WiFi | 0.740 | 0.524 |
Mobile OS | Android | 0.449 | 0.510 |
| iOS | 0.663 | 0.567 |
| Windows phones | 0.387 | 0.357 |
High-end phones | Apple iPhone X | 0.573 | 0.435 |
| Apple iPhone X/8/8 Plus | 0.628 | 0.454 |
| Samsung Galaxy phone S9+ | 0.540 | 0.391 |
| Samsung Galaxy phone S8/S8+/S9/S9+ | 0.643 | 0.499 |
| Samsung Galaxy phone S8/S8+/S9/S9+ or Apple iPhone X/8/8 Plus | 0.669 | 0.524 |
Other device types | All mobile devices | 0.264 | −0.061 |
| Feature phones | 0.096 | 0.163 |
| Smartphone and tablets | 0.217 | −0.072 |
| Tablet | 0.492 | 0.423 |
| Cherry mobile | −0.275 | – |
| VIVO mobile devices | 0.539 | 0.024 |
| Huawei mobile devices | 0.534 | 0.292 |
| Oppo mobile devices | 0.499 | 0.129 |
| Oppo/VIVO/Cherry devices | 0.184 | 0.013 |
| Samsung Android devices | 0.123 | 0.087 |
For clusters where the number of estimated Monthly Active Facebook users exceeded the estimated offline population, the Facebook penetration values were set to 1. There are two possible reasons why the Facebook user population may exceed the offline population. First, the offline population of a cluster may be under-counted, as we used high-resolution gridded population estimates to calculate the cluster population. A study evaluating the methodology that was used to generate these population estimates [34] compared the high-resolution estimates, aggregated to the level of census units, with census populations and reported relative Root Mean Squared Errors (as a percentage of the mean population size of the respective census units) ranging from 39% in Cambodia to 91% in Kenya. Second, the Facebook user population may be over-counted, as about 10% of Facebook accounts are estimated to be duplicate accounts (such as pet accounts, or duplicate for-my-family vs. for-my-private-friends accounts) and some further fraction are fake accounts [35].
For locations and targeting criteria with a low number of users, the marketing platform does not return estimates of monthly active users below 1000. For such instances, to alleviate data sparsity, we attempted to estimate the number of users following the approach in [36], which gives an estimate in the hundreds \((0, 100, 200, \ldots , 900)\) for such locations. Using this data augmentation approach resulted in a small improvement in modeling performance. Details of this data augmentation approach as well as its effect on model performance are explained in Additional file 1, Sect. 1.6.
The data used in the main analysis is for the age 18+ user demographic on Facebook. Data were also collected for different age brackets, by gender and by self-declared education status to test the potential to produce demographically disaggregated estimates. With the exception of the age-disaggregated data collections, all other data collections (disaggregated by gender/education) were for the 18+ age group. Data for the Philippines were collected over the period March-April 2019 and data for India were collected over the period June-September 2019. Data collection was done using ‘pySocialWatcher’,⁴ a Python-based wrapper library that automates the data collection process by using Facebook’s Marketing Application Programming Interface (API) [37].
2.3 Population data
Population data were acquired for the DHS cluster locations using population estimates released by WorldPop [33, 38]. WorldPop provides high-resolution population estimates for countries around the world. The population data are provided for an approximately 100 m resolution grid of the entire country for the year 2015. For each cluster, the estimated population living in that cluster was computed by adding together the population counts for all grid cells that fell within a given radius of the cluster coordinates, matching the radius for which the Facebook data were collected. The population data were used to compute (i) the Facebook penetration and (ii) the log of population density for each cluster. These variables were used as predictive features in the models predicting the Wealth Index.
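The population aggregation described above can be sketched as follows. This is an illustrative example rather than the procedure used in the study, and it rests on several assumptions that the text does not specify: a grid cell counts as inside the radius when its center does, distances are great-circle (haversine) distances, density is population per square kilometre of the circular collection area, and the natural logarithm is used. All function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def cluster_population(lat, lon, radius_km, grid_cells):
    """Sum population counts of grid cells whose centers lie within the radius.

    `grid_cells` is an iterable of (cell_lat, cell_lon, population) tuples.
    """
    return sum(
        pop for cell_lat, cell_lon, pop in grid_cells
        if haversine_km(lat, lon, cell_lat, cell_lon) <= radius_km
    )


def log_population_density(population, radius_km):
    """Natural log of people per square kilometre within the collection circle.

    Assumes population > 0; clusters with zero estimated population would
    need separate handling.
    """
    area_km2 = math.pi * radius_km ** 2
    return math.log(population / area_km2)
```

The radius passed in would match the Facebook collection radius for the cluster, for example 2 km for an urban and 5 km for a rural cluster in the Philippines.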
2.4 Regional indicators
In addition to the Facebook features and population density, regional indicator variables were used as additional features in the models. These are binary variables that indicate whether a given DHS cluster falls within a given administrative region in the country. We used the level 1 administrative divisions that were reported in the DHS data. There were a total of 17 administrative regions in the Philippines and 36 in India. As both India and the Philippines are large countries, different regions may exhibit different dynamics of poverty, and the regional indicator variables enable models to account for such region-specific trends in the data. Generally, the inclusion of the regional indicator variables resulted in improved model performance.
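A minimal sketch of how such binary regional indicators could be built is given below. The function name and the `region_` prefix are hypothetical, the region codes in the usage note are purely illustrative, and keeping one indicator per region (rather than dropping a reference level) is an assumption of this example.

```python
def regional_indicators(cluster_regions, all_regions=None):
    """One binary indicator per administrative region for each cluster.

    `cluster_regions` lists the level 1 region of each cluster;
    `all_regions` optionally fixes the full set of regions so that the
    same columns are produced even if a region is absent from the data.
    """
    if all_regions is None:
        regions = sorted(set(cluster_regions))
    else:
        regions = list(all_regions)
    return [
        {f"region_{r}": int(r == region) for r in regions}
        for region in cluster_regions
    ]
```

For three clusters located in regions "NCR", "CAR" and "NCR", the function returns two indicator columns, `region_CAR` and `region_NCR`, with exactly one of them set to 1 for each cluster.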
2.5 Models for predicting the Wealth Index
We evaluated the performance of (i) linear regression models selected using LASSO and (ii) tree-based regression models to predict the Wealth Index using data from the available set of covariates. The distribution of the Wealth Index for the clusters used in the analysis is reported in Tables S1 and S2 in Additional file 1. The Wealth Index is a real-valued score ranging from negative to positive values, with higher values being better. The linear LASSO models were fitted using the ‘glmnet’⁵ package and the tree models were fitted using the ‘gbm’⁶ package in the R programming language; the ‘gbm’ package fits regression trees using gradient boosting. Models were fitted and evaluated separately for each country using data from that country.
Model parameters were tuned using cross validation. For the tree models, the optimal number of trees was chosen through cross validation for up to a maximum of 5000 trees. Each model was fit and evaluated using 10-fold cross validation. The predictions over the cross validation folds were then used to evaluate the cross-validated \(R^{2}\) which captures the proportion of the variation in the Wealth Index that is explained by the model predictions. In addition to \(R^{2}\) values, we also compute and report the Root Mean Squared Error (RMSE) metric for all models using the cross-validated predictions.
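The evaluation procedure described above can be illustrated with a small sketch. The models in this study were fitted in R with ‘glmnet’ and ‘gbm’; the example below is written in plain Python for illustration only and uses a one-feature least-squares line as a stand-in for those models. The fold assignment, the function names and the definition of \(R^{2}\) as one minus the ratio of residual to total sum of squares over the pooled out-of-fold predictions are assumptions of this sketch.

```python
import math
import random


def kfold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_predictions(X, y, fit, predict, k=10, seed=0):
    """Predict every observation from a model fitted on the other folds."""
    preds = [None] * len(y)
    for test_idx in kfold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in test_idx:
            preds[i] = predict(model, X[i])
    return preds


def r_squared(y, preds):
    """1 - SS_res / SS_tot computed on the pooled out-of-fold predictions."""
    y_bar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, preds))
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    return 1 - ss_res / ss_tot


def rmse(y, preds):
    """Root Mean Squared Error of the pooled out-of-fold predictions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, preds)) / len(y))


def fit_line(X, y):
    """Stand-in model: ordinary least squares on a single feature."""
    n = len(y)
    mx, my = sum(X) / n, sum(y) / n
    sxx = sum((x - mx) ** 2 for x in X)
    slope = sum((x - mx) * (v - my) for x, v in zip(X, y)) / sxx
    return slope, my - slope * mx


def predict_line(model, x):
    """Evaluate the fitted line at x."""
    slope, intercept = model
    return slope * x + intercept
```

Any model exposing a `fit` and a `predict` function of the same shape could be substituted for the stand-in, which keeps the metric computation independent of the model family being compared.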
4 Discussion
Our results demonstrate the potential of social media advertising data from Facebook’s marketing platform to capture geographic variations in wealth and poverty levels. The analysis indicates that the types of devices and network connections used by the Facebook user population act as proxies for the socioeconomic status of a given location. Such an approach can be used to estimate the levels of socioeconomic well-being at high spatial resolutions. The results from India, where only about a quarter of the population use Facebook, suggest that this approach could be useful even in countries with low penetration of Facebook users.
The analysis here looked at data from a single snapshot. Furthermore, the DHS ground truth data was not aligned in terms of collection period with the Facebook data. For the purpose of long-term monitoring of poverty for the Sustainable Development Goals, it is important to understand the temporal stability of the models as well as whether and how changes in the device types used by Facebook users reflect changes in the socioeconomic situation of a particular location. This would be a potential area for future exploration as more data, both in terms of ground truth and in terms of social media, becomes available.
Beyond aggregate estimates of the geographic variation in socioeconomic well-being, the potential to use demographically disaggregated social media data to create disaggregated estimates, such as by gender, age and education, was explored as well. It was not possible to directly validate these estimates due to a lack of ground truth. However, as shown by the results for the Philippines and India, one must take into account potential selection biases for different demographic groups when interpreting such predictions.
Selection bias also affected a small number of DHS clusters that were dropped from the analysis due to data sparsity (see Sect. 2.1 and Tables S1 and S2 in Additional file 1). These clusters had a lower than average Wealth Index.
Especially for sparsely populated areas, social media data could be further combined with data from other sources, in particular satellite data, for the purpose of monitoring socioeconomic well-being. Such an approach can combine the strengths of different data sources to boost predictive accuracy. In particular, it combines the spatial resolution and truly global coverage of satellite data with the demographic disaggregation capabilities of Facebook data and its direct link to a particular type of asset ownership: a mobile phone. Such a combination provides an interesting avenue for exploration in future work.