1 Introduction
UserID | Event Time | Cell Tower | Caller | Callee | In/Out | Duration (s) |
---|---|---|---|---|---|---|
38DA6 | 2015-05-01 18:26:50 | 1921 | 38DA6 | 163B7 | Out | 52 |
78EC3 | 2015-05-01 14:16:09 | 2189 | 53808 | 78EC3 | In | 600 |
9FAFE | 2015-05-01 23:20:09 | 2189 | 9FAFE | 7BBF1 | Out | 41 |
708A2 | 2015-05-01 08:21:10 | 1988 | 96EC4 | 708A2 | In | 37 |
A27AD | 2015-05-01 21:51:09 | 2189 | EA33F | A27AD | In | 108 |
9F3C7 | 2015-05-01 13:21:25 | 20102 | C5691 | 9F3C7 | In | 134 |
D4578 | 2015-05-01 17:03:46 | 20103 | D4578 | 5B9A3 | Out | 30 |
F904A | 2015-05-01 23:24:03 | 1998 | F904A | F3C88 | Out | 10 |
4CCEA | 2015-05-01 20:11:38 | 21104 | 4CCEA | 5EF18 | Out | 438 |
A77B8 | 2015-05-01 09:40:26 | 21104 | A77B8 | BD3E5 | Out | 33 |
D5761 | 2015-05-01 20:34:40 | 1999 | DAA24 | D5761 | In | 600 |
CTR), which allows completing all missing trajectory data with reasonable accuracy. CTR enables human mobility analyses over much larger populations, and without biases due to incomplete information. We apply CTR to the problem of reconstructing trajectories from CDR, since they represent the primary type of movement data employed for large-scale human mobility analysis [1, 2]. However, our method is general and can be applied to other classes of mobile phone data; specifically, by running CTR on data originally collected at a frequency of a few hundred samples per day, one could aim at completing individual trajectories at high temporal resolutions, e.g., of minutes. The design, evaluation, and application of our solution yield the following contributions.
-
First, we provide evidence of the severe sparsity that affects mobile phone data, by analyzing the CDR of 1.8 million users collected during three consecutive months. We quantify the phenomenon by means of relevant metrics, including the duration, sampling frequency, and completeness of individual trajectories in the data. In particular, our results show that legacy preprocessing techniques that discard users based on arbitrary completeness thresholds, in fact, ignore a vast user population with substantial and potentially serviceable mobility information. Details are in Sect. 3.
-
Second, we introduce our novel approach, CTR, which leverages well-known features of human mobility to customize tensor factorization in a way that befits our problem. This original methodology lets CTR transform sparse mobile phone data into seamless individual trajectories that span the full dataset duration, for all users. Details are in Sects. 4 and 5.
-
Third, we validate the proposed strategy with ground truth data. Comprehensive performance evaluation shows that CTR achieves full reconstruction of individual trajectories on an hourly basis, with a median displacement between 1 and 2 network cells that depends on the sparsity of the original CDR data. This effectively means that, in the reconstructed data, a user is typically placed in the correct cell or in one that is very close to it. Such a level of accuracy is acceptable for metropolitan-scale analyses (where the urban surface is typically covered by hundreds of cells) and is excellent for national-scale studies (as inter-city mobility is perfectly captured). Details are in Sect. 6.
-
Fourth, we demonstrate the importance of trajectory reconstruction in CDR-driven human mobility analysis. Specifically, we revisit three seminal studies [3‐5] by using the complete mobility of 1.7 million users, instead of the incomplete trajectories of a small fraction of especially active users as in the original works. Our analysis shows that key results in these studies may change, even quite dramatically, in the presence of complete mobility, proving that trajectory reconstruction is an indispensable first step for analyses of human mobility that rely on mobile phone data. Details are in Sect. 7.
2 Related work
2.1 Time discretization
2.2 User filtering
2.3 Trajectory reconstruction
3 Sparsity in CDR-based trajectory data
3.1 Trajectory completeness
-
Seamless trajectories with completeness equal to 1 are extremely rare in mobile phone data, under any combination of \(\mathcal{T}\) and τ. Even with short observation periods (\(\mathcal{T}=7\) days) and low resolution (\(\tau =2\) hours), not a single complete trajectory is extracted from our reference dataset, despite the large user population. In fact, fewer than 1% of users would satisfy a minimum completeness requirement of 0.5, which allows half of the locations to be unknown; and more than 20% of trajectories do not even reach a completeness of 0.1, i.e., they miss more than 9 locations out of 10. The fact that such poor figures refer to the best case, i.e., the least demanding combination of \(\mathcal{T}\) and τ, illustrates well the severity of the sparsity problem in CDR datasets.
-
The duration of the observation period has a marginal impact on trajectory completeness. More precisely, and as one would expect, slightly higher completeness is recorded for trajectories that span a shorter observation period when \(\mathcal{T} \in [30,90]\) days; however, and quite interestingly, completeness hardly varies when \(\mathcal{T} \leq 30\) days. We argue that the result can be linked to the weekly periodicity of human activities (which entails a reduced difference for \(\mathcal{T} \in [7,30]\) days), plus seasonal effects such as summer holidays that only intervene over long timescales (\(\mathcal{T} \geq 30\) days in our dataset).
-
The temporal resolution has a stronger effect on completeness. For instance, when \(\mathcal{T}=30\) days, only 8% of the trajectories have completeness above 0.1 when \(\tau =15\) minutes, whereas this percentage grows to 26%, 53% and 75% when \(\tau =30\), 60 and 120 minutes, respectively. To better understand the scaling of completeness with the temporal resolution, we show in Fig. 2 the mean and median values of completeness, computed across all users, versus τ. We conclude that completeness grows almost linearly (Pearson correlation coefficients are at 0.981 and 0.983, for the mean and median, respectively) with τ in the range 15 to 120 minutes, under all observation periods \(\mathcal{T}\). The takeaway is that one cannot easily escape the tradeoff whereby a higher temporal resolution comes at the cost of lower completeness in CDR data.
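As a minimal sketch of the completeness metric discussed above, the following Python snippet counts the fraction of time slots of length τ that contain at least one CDR event over an observation period \(\mathcal{T}\). The function name `completeness`, its parameters, and the sample timestamps are illustrative assumptions rather than the code used in the study; the definition assumed here is the share of time slots with a known location.

```python
from datetime import datetime, timedelta

def completeness(event_times, start, period_days, tau_minutes):
    """Fraction of tau-long slots in the observation period with >= 1 event."""
    slot = timedelta(minutes=tau_minutes)
    n_slots = timedelta(days=period_days) // slot
    observed = set()
    for t in event_times:
        idx = (t - start) // slot  # index of the slot that contains the event
        if 0 <= idx < n_slots:
            observed.add(idx)
    return len(observed) / n_slots

# Hypothetical events of one user during a single day.
start = datetime(2015, 5, 1)
events = [
    datetime(2015, 5, 1, 8, 21, 10),
    datetime(2015, 5, 1, 8, 40, 0),
    datetime(2015, 5, 1, 14, 16, 9),
    datetime(2015, 5, 1, 23, 20, 9),
]
hourly = completeness(events, start, period_days=1, tau_minutes=60)
two_hourly = completeness(events, start, period_days=1, tau_minutes=120)
```

With the same events, a coarser τ yields fewer slots to fill and hence a higher completeness, which is the tradeoff described in the text.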
3.2 A statistical model of completeness
Duration \(\mathcal{T}\) | Resolution τ | Weibull \(D_{\mathrm{KS}}\) | Weibull \(R^{2}\) | Lognormal \(D_{\mathrm{KS}}\) | Lognormal \(R^{2}\) | Gamma \(D_{\mathrm{KS}}\) | Gamma \(R^{2}\) | Pareto \(D_{\mathrm{KS}}\) | Pareto \(R^{2}\) | Levy \(D_{\mathrm{KS}}\) | Levy \(R^{2}\) | Power law \(D_{\mathrm{KS}}\) | Power law \(R^{2}\) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7 d | 15 min | 0.0334 | 0.9990 | 0.0318 | 0.9973 | 0.3710 | 0.4679 | 0.0495 | 0.9969 | 0.2509 | 0.8046 | 0.3538 | 0.5031 |
7 d | 30 min | 0.0302 | 0.9993 | 0.0345 | 0.9967 | 0.0548 | 0.9962 | 0.1716 | 0.8973 | 0.2649 | 0.7848 | 0.3587 | 0.4894 |
7 d | 60 min | 0.0278 | 0.9990 | 0.0372 | 0.9958 | 0.0475 | 0.9977 | 0.1198 | 0.9516 | 0.2888 | 0.7506 | 0.4162 | 0.2041 |
7 d | 120 min | 0.0375 | 0.9972 | 0.0443 | 0.9938 | 0.0629 | 0.9853 | 0.1043 | 0.9768 | 0.3238 | 0.6977 | 0.2456 | 0.7450 |
15 d | 15 min | 0.0264 | 0.9986 | 0.0233 | 0.9984 | 0.3594 | 0.5007 | 0.0627 | 0.9893 | 0.2620 | 0.7853 | 0.3949 | 0.4120 |
15 d | 30 min | 0.0208 | 0.9995 | 0.0264 | 0.9980 | 0.0271 | 0.9991 | 0.0724 | 0.9851 | 0.2765 | 0.7647 | 0.3576 | 0.4975 |
15 d | 60 min | 0.0188 | 0.9997 | 0.0279 | 0.9974 | 0.0257 | 0.9987 | 0.0935 | 0.9719 | 0.2997 | 0.7315 | 0.3176 | 0.6076 |
15 d | 120 min | 0.0254 | 0.9987 | 0.0332 | 0.9961 | 0.0913 | 0.9572 | 0.1880 | 0.8780 | 0.3349 | 0.6792 | 0.2721 | 0.7116 |
30 d | 15 min | 0.0239 | 0.9985 | 0.0207 | 0.9985 | 0.3514 | 0.5261 | 0.0700 | 0.9835 | 0.2619 | 0.7872 | 0.3829 | 0.4365 |
30 d | 30 min | 0.0205 | 0.9992 | 0.0216 | 0.9983 | 0.0203 | 0.9991 | 0.1528 | 0.8976 | 0.2763 | 0.7661 | 0.3912 | 0.4149 |
30 d | 60 min | 0.0149 | 0.9996 | 0.0263 | 0.9975 | 0.0289 | 0.9982 | 0.0895 | 0.9702 | 0.3003 | 0.7346 | 0.3131 | 0.6212 |
30 d | 120 min | 0.0239 | 0.9984 | 0.0315 | 0.9962 | 0.0995 | 0.9479 | 0.1069 | 0.9538 | 0.3337 | 0.6893 | 0.3154 | 0.6176 |
60 d | 15 min | 0.0266 | 0.9980 | 0.0239 | 0.9977 | 0.1990 | 0.8284 | 0.0458 | 0.9934 | 0.2527 | 0.8076 | 0.3850 | 0.4156 |
60 d | 30 min | 0.0233 | 0.9985 | 0.0234 | 0.9976 | 0.0286 | 0.9974 | 0.1944 | 0.8669 | 0.2649 | 0.7903 | 0.3897 | 0.4212 |
60 d | 60 min | 0.0207 | 0.9985 | 0.0264 | 0.9970 | 0.0310 | 0.9980 | 0.0772 | 0.9769 | 0.2883 | 0.7622 | 0.3451 | 0.5635 |
60 d | 120 min | 0.0245 | 0.9975 | 0.0316 | 0.9958 | 0.0754 | 0.9750 | 0.0946 | 0.9617 | 0.3209 | 0.7196 | 0.3491 | 0.5070 |
90 d | 15 min | 0.0298 | 0.9954 | 0.0329 | 0.9954 | 0.3619 | 0.5248 | 0.0322 | 0.9954 | 0.2371 | 0.8390 | 0.3866 | 0.4528 |
90 d | 30 min | 0.0336 | 0.9958 | 0.0335 | 0.9950 | 0.0583 | 0.9864 | 0.0398 | 0.9942 | 0.2487 | 0.8264 | 0.3819 | 0.4774 |
90 d | 60 min | 0.0305 | 0.9960 | 0.0355 | 0.9944 | 0.0454 | 0.9913 | 0.0611 | 0.9843 | 0.2705 | 0.8020 | 0.3231 | 0.6123 |
90 d | 120 min | 0.0369 | 0.9940 | 0.0379 | 0.9932 | 0.0486 | 0.9927 | 0.0760 | 0.9743 | 0.2998 | 0.7687 | 0.3359 | 0.5885 |
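To illustrate how a goodness-of-fit figure such as \(D_{\mathrm{KS}}\) can be obtained, the sketch below computes the Kolmogorov-Smirnov distance between the empirical distribution of a sample of completeness values and a Weibull cumulative distribution function. The function names, the sample values, and the shape and scale parameters are hypothetical and chosen only for illustration; the fitting procedure actually used for the table is not specified here.

```python
import math

def weibull_cdf(x, shape, scale):
    """CDF of a two-parameter Weibull distribution (zero for x <= 0)."""
    if x <= 0:
        return 0.0
    return 1.0 - math.exp(-((x / scale) ** shape))

def ks_distance(samples, cdf):
    """Largest gap between the empirical CDF of samples and a model CDF."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x: check both sides.
        d = max(d, abs(f - i / n), abs((i + 1) / n - f))
    return d

# Hypothetical completeness values of a handful of users.
sample = [0.02, 0.05, 0.08, 0.11, 0.15, 0.21, 0.30, 0.42]
# Shape and scale are placeholders, not fitted parameters from the study.
d_ks = ks_distance(sample, lambda x: weibull_cdf(x, shape=1.2, scale=0.18))
```

A smaller distance indicates a closer match between the model and the data, which is how the columns of the table are meant to be compared.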
4 Missing location inference
4.1 Trajectory reconstruction problem
4.2 Design rationale
-
Prevalence of static phases. People tend to spend a substantial amount of their time at a few fixed locations (e.g., workplace, school, shopping mall, sports center), where they linger for long continuous periods on the order of hours. Transitions among these locations are instead fairly rapid: in fact, people typically perceive movement phases between points of actual interest as a waste of time and strive to reduce them to a minimum. This results in a pattern of long static phases with fast movements in between [18]. The prevalence of static behaviors allows adopting temporal resolutions that are granular enough to capture the important stay points of each individual, yet are still tractable for reconstruction. For instance, in the example of Fig. 4, the user tends to stay for long times at the same location during working hours, and considering \(\tau =1\) hour as done in the plot does not lose substantial positioning information during such hours.
-
Overnight invariance. A user is typically at the same location, i.e., at home, during nighttime, as also demonstrated by the fact that even sparse CDR data can be effectively employed to identify dwelling units [14, 23]. The observation that most nighttime locations match is important for trajectory reconstruction since mobile phone events tend to be especially scarce overnight, yet knowledge gaps can be filled with a limited amount of observed data during night hours. As an illustrative example, the user in Fig. 4 is always found at the same place early in the morning, late in the evening, and once overnight: it is apparent that such a location can be sensibly extended to all night hours, which are otherwise very poorly sampled.
-
Regularity of movement. Human mobility is strongly regular, from multiple perspectives. People frequently return to the same limited set of locations [3, 24], which results in repetitive sequences of visits to a few places [7, 24], and infrequent trips to other locations in a (typically large) geographical region [24]. Regularity also occurs in time, as the locations above are visited in highly periodic patterns [4]. When considered jointly, these results have critical implications for trajectory reconstruction: they suggest that sparse location data may still be highly informative of the mobility of one individual if such data are sufficiently distributed in time to capture diverse moments of the day and week. This is typically the case for mobile phone data, which feature a combination of temporal irregularity of the sampling and long observation periods \(\mathcal{T}\). As an example, in the case of Fig. 4, the fact that the user generates CDR events at different moments of the morning, working hours, and evening makes it possible to draw a fairly complete picture of her mobility pattern during a typical day, by observing mobile communication samples for a sufficiently long amount of time (15 days in the considered sample).
5 Context-enhanced trajectory reconstruction
Context-enhanced trajectory reconstruction (CTR) is a novel trajectory reconstruction approach that solves the problem in (1), by receiving the observed trajectory \(L_{\varOmega }\) as the input, and generating an estimated complete trajectory \(\hat{L}_{\mathcal{T}}\). To this end, CTR builds upon the observations in Sect. 4.2, via the three steps detailed next. The notation employed throughout the discussion is summarized in the Abbreviations section.
5.1 Nighttime trajectory reconstruction
CTR aims at reconstructing missing portions of a trajectory during the night hours. The rationale is that, as seen in Sect. 4.2, these are easily reconstructed from the extractable knowledge about users’ home locations, which is also straightforward to derive from mobile phone data. Specifically, we assume that nighttime spans between 10 PM and 9 AM of the subsequent day: this period is adapted to the typical schedule in the Latin American country where our data is collected, but is easily adjusted to any other setting with statistical data on the local population habits. For each user in the dataset, we proceed as follows.
5.2 Seamless trajectory reconstruction with tensor factorization
CTR aims at reconstructing the trajectory during the remaining time intervals in \(\varOmega ^{C}\), by tailoring state-of-the-art tensor factorization techniques to our problem. Tensor factorization exploits redundancy to recover missing data; in our context, redundancy stems from the regularity of human movement, which creates repeated patterns of visited locations over the many days and weeks covered by a CDR dataset, as discussed in Sect. 4.2.
-
First, the trajectory is split into multiple one-day sub-trajectories. Each one-day sub-trajectory is then converted into a one-dimensional vector: for instance, the sub-trajectory of the jth day of the ith week is denoted by \(\mathbf{x}_{i,j}\) and collects, for every time step k of that day, the two coordinates \(\mathbf{l}_{{k}}^{(x)}\) and \(\mathbf{l}_{{k}}^{(y)}\) that identify the location \(\mathbf{l}_{{k}}=(\mathbf{l}_{{k}} ^{(x)}, \mathbf{l}_{{k}}^{(y)})\), for a total of \(2N_{\tau }\) entries.
-
Second, we enter all of the one-day sub-trajectories of a user trajectory into the location tensor by organizing them into a matrix of one-day sub-trajectories for all \(N_{w}\) weeks in the observation period \(\mathcal{T}\), i.e., \(\mathcal{X}= [\mathbf{x}_{i,j}]_{N_{w} \times N_{d}} = \{\mathcal{X}_{i,j,k}\}_{N_{w} \times N_{d} \times 2N _{\tau }}\).
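The two steps above can be sketched in Python as follows. The snippet arranges an hourly trajectory into a nested list indexed by week, day of the week, and position within the day, with two entries per time step, which matches the \(N_{w} \times N_{d} \times 2N_{\tau }\) shape of the location tensor. The function name `build_location_tensor`, the side-by-side (x, y) ordering within each daily vector, and the use of `None` for unobserved locations are assumptions made for illustration; the factorization itself is not shown.

```python
def build_location_tensor(trajectory, n_weeks, n_days=7, n_tau=24):
    """Arrange a trajectory into an n_weeks x n_days x 2*n_tau nested list.

    trajectory: list of (x, y) tuples or None (missing), one per time step,
    ordered in time and covering n_weeks * n_days * n_tau steps.
    """
    expected = n_weeks * n_days * n_tau
    if len(trajectory) != expected:
        raise ValueError(f"expected {expected} time steps, got {len(trajectory)}")
    tensor = []
    for i in range(n_weeks):
        week = []
        for j in range(n_days):
            first = (i * n_days + j) * n_tau
            day_vector = []
            for loc in trajectory[first:first + n_tau]:
                # Assumed layout: x and y of each time step stored side by side.
                day_vector.extend(loc if loc is not None else (None, None))
            week.append(day_vector)
        tensor.append(week)
    return tensor

# Toy example: two weeks of hourly samples, observed only at 9:00 each day.
steps = 2 * 7 * 24
toy = [(1.0, 2.0) if k % 24 == 9 else None for k in range(steps)]
X = build_location_tensor(toy, n_weeks=2)
```

In this layout, the entries left as `None` are the ones a factorization step would have to estimate from the repeated daily and weekly patterns.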
5.3 Homogeneous quantization of locations
CTR aims at associating each location estimated as per Sect. 5.2 to the position of a real-world cellular tower.
6 Validation
To assess the CTR solution, we employ a set of ground-truth trajectories with completeness equal to 1, presented in Sect. 6.1. We generate incomplete trajectories from the ground-truth data, following the same sampling process observed in our reference CDR dataset, as described in Sect. 6.2. Then, complete trajectories are reconstructed with CTR from the downsampled ones, and compared against the ground truth, in Sect. 6.3: the level of agreement lets us comment on the quality of the reconstruction.
6.1 Fine-grained ground-truth trajectories
6.2 Incomplete trajectories
CTR trajectory reconstructions in a wide diversity of settings.
6.3 Quality of CTR trajectory reconstruction
We run CTR on the incomplete CDR-like data and update the incomplete CDR-like trajectories as \(\hat{L}_{ \mathcal{T}}=\{ \mathbf{l}_{i}\mid i \in \varOmega \} \cup \{ \hat{\mathbf{l}}_{i} \mid i \in \varOmega ^{C}\}\). We then assess the quality of the reconstructed trajectories \(\hat{L}_{\mathcal{T}}\) against the complete ground-truth known locations \(L_{ \mathcal{T}}\).
CTR is shown against the completeness of the input CDR-like data. The candlesticks report the mean and median cell displacement, as well as the 10th, 25th, 75th, and 90th percentiles. The horizontal red line highlights a cell displacement of 1: below this value, the estimation error is lower than the spatial precision of the original mobile phone data, i.e., the geographical coverage radius of the cell tower the user is presently associated with. Therefore, the error of the completion process is smaller than that inherent to the original data.
CTR with added information to fill gaps in the data, leading to an improved performance where the typical estimation error is pushed closer to the correct cell. Interestingly, the gain of accuracy is higher at low completeness levels, and a trajectory with just 5% completeness already reduces the median cell displacement to less than 1.5. Some variability is observed around the median, which is however natural, given the heterogeneity that characterizes the mobility patterns and communication activities of different users.
CTR is acceptable for metropolitan-scale analyses: in urban areas, network cells usually span a few hundred square meters and cover, e.g., (portions of) individual neighborhoods, hence the displacement in Fig. 8 would still allow locating users fairly precisely and investigating mobility flows at an inter-neighborhood level. The reconstructed trajectory precision is instead excellent for regional- or national-scale studies, since cell displacements of 1 or 2, over large surfaces covered by tens of thousands of cell towers, allow capturing human mobility at, e.g., the inter-city level perfectly.
7 Revisiting key results in the literature
We run CTR on the reference dataset presented in Sect. 3 and reconstruct all trajectories which have completeness higher than 1% in the original data. Selecting such a lower threshold on completeness bounds the typical cell displacement to a small value according to the validation in Sect. 6, hence ensures high confidence in the quality of the estimated locations; moreover, a 1% completeness threshold allows retaining 95% of the user population. Overall, we reconstruct the complete trajectories of around 1.7 million individuals in a large geographic region, with a temporal resolution of \(\tau =1\) hour during three months.
CTR, and discuss any differences that we observe in the results. The three analyses are separately presented next.
7.1 Laws of individual mobility
-
As a preliminary result in their analysis, Gonzalez et al. find that the travel distances and radii of gyration aggregated over the whole user population follow truncated power laws \((x+x_{0})^{- \beta }e^{-x/k}\). We confirm that this is also the case under complete trajectory data, as shown in the top plots of Fig. 9. The exponent values are also consistent, as Gonzalez et al. find \(\beta _{\Delta _{r}}=1.77\) and \(\beta _{r_{g}} = 1.65\), whereas we obtain \(\beta _{\Delta _{r}}=1.78\) and \(\beta _{r_{g}} = 1.68\) from our complete trajectory data. However, we observe an appreciable difference in the cutoff values. Travel distances can be appreciably higher in complete data; also, \(k_{r_{g}}\) is at 400 km in [3], which is not far from the 340 km from our original (i.e., incomplete) CDR-based trajectories but is reduced in the complete data, where the cutoff occurs instead at 175 km. We ascribe the difference to the fact that completion often adds missing locations that are far from the linear interpolation between CDR positions, such as those generated by infrequent but recurrent long-haul trips. Our conclusion is that CDR data sparsity risks both underestimating long trips and overestimating the region within which the Lévy flight behavior of human mobility occurs.
-
Concerning individual movements, we can reproduce the truncated power-law behavior found by Gonzalez et al. with both our sparse CDR-based trajectories and the complete ones reconstructed via CTR. This is illustrated in the bottom plots of Fig. 9. In the original work, the authors find an exponent \(\alpha \approx 1.2\), and we estimate the α parameter at 1.8 and 1.4 for original and complete trajectories, respectively. These exponents are qualitatively consistent since they all are in the \((1,3)\) range that characterizes Lévy walks [30].
-
We ascribe the quantitative differences with respect to [3] to the inherent specificity of each dataset. Although we also use locations from voice calls as in [3], the data refer to very different cultural and economic settings (i.e., a European country and a South American one) and geographical span (our country covers a territory that is four times wider than the largest country in Europe).
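For reference, the two quantities discussed above can be sketched in Python. The first function computes the radius of gyration of a trajectory as the root mean square distance of its locations from their center of mass, which is the standard definition; the second evaluates the unnormalized truncated power law \((x+x_{0})^{-\beta }e^{-x/k}\). The function names, the planar coordinates in kilometers, and the value of \(x_{0}\) used in the example are illustrative assumptions, not values from the study.

```python
import math

def radius_of_gyration(locations):
    """Root mean square distance of (x, y) locations from their center of mass."""
    n = len(locations)
    cx = sum(x for x, _ in locations) / n
    cy = sum(y for _, y in locations) / n
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in locations) / n)

def truncated_power_law(x, x0, beta, k):
    """Unnormalized density (x + x0)^(-beta) * exp(-x / k)."""
    return (x + x0) ** (-beta) * math.exp(-x / k)

# Toy trajectory in km: three samples at one place and one sample 4 km away.
toy_locations = [(0.0, 0.0)] * 3 + [(4.0, 0.0)]
rg = radius_of_gyration(toy_locations)

# Exponent and cutoff as reported for complete data; x0 = 1 km is a placeholder.
density_near = truncated_power_law(10.0, x0=1.0, beta=1.68, k=175.0)
density_far = truncated_power_law(400.0, x0=1.0, beta=1.68, k=175.0)
```

The exponential term makes the density drop sharply once x exceeds the cutoff k, which is why a smaller cutoff implies a more confined region of movement.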
7.2 Uniqueness of individual trajectories
7.3 Next-location predictability
original in Fig. 11. The mean predictability is at 81%, which is high yet quite far from the 93% found by Song et al.: as users are selected in the exact same way in the two cases, we ascribe the difference to the diverse mobility habits of the user populations in the two datasets, which are collected in different countries, at different scales, and during different time periods.
CTR. The curve labeled as complete in Fig. 11 is considerably shifted to the right, with a much-reduced variance around the peak, now at 94%. Our hypothesis for this outcome is that the filtering introduced in [4] dramatically reduces the set of users, favoring individuals who are very active from a mobile communication viewpoint. It has been repeatedly demonstrated that interactions with the cellular network are more frequent for users with higher mobility [31‐33]. As a result, the sparse nature of CDR data lets Song et al. introduce an unwanted bias in their study, which is ultimately focused on very mobile individuals whose displacements are more difficult to anticipate. Instead, considering a much larger population lets us account for the vast majority of fairly static users, and reveals that people’s movements are on average even easier to predict than estimated in [4].
complete/filtered is very consistent with that derived with the sparse CDR-based trajectories of those subscribers, i.e., the original curve. The result supports our intuition that it is the population sampling and not the reconstruction process that determines the striking difference in the figure.
8 Conclusions
CTR, to reconstruct a seamless trajectory from sparse CDR data, and validated our methodology using ground-truth movement patterns. We have also demonstrated the importance of trajectory reconstruction by revisiting well-known results on human mobility based on raw CDR, and showing that complete trajectories can in some cases affect the outcome of those analyses substantially.