Skip to main content
Log in

A probabilistic approach to mining mobile phone data sequences

  • Original Article
  • Published:
Personal and Ubiquitous Computing Aims and scope Submit manuscript

Abstract

We present a new approach to address the problem of large sequence mining from big data. The particular problem of interest is the effective mining of long sequences from large-scale location data to be practical for Reality Mining applications, which suffer from large amounts of noise and lack of ground truth. To address this complex data, we propose an unsupervised probabilistic topic model called the distant n-gram topic model (DNTM). The DNTM is based on latent Dirichlet allocation (LDA), which is extended to integrate sequential information. We define the generative process for the model, derive the inference procedure, and evaluate our model on both synthetic data and real mobile phone data. We consider two different mobile phone datasets containing natural human mobility patterns obtained by location sensing, the first considering GPS/wi-fi locations and the second considering cell tower connections. The DNTM discovers meaningful topics on the synthetic data as well as the two mobile phone datasets. Finally, the DNTM is compared to LDA by considering log-likelihood performance on unseen data, showing the predictive power of the model. The results show that the DNTM consistently outperforms LDA as the sequence length increases.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15
Fig. 16
Fig. 17
Fig. 18
Fig. 19

Similar content being viewed by others

References

  1. Ashbrook D, Starner T (2003) Using GPS to learn significant locations and predict movement across multiple users. Personal Ubiquitous Comput 7(5):275–286

    Article  Google Scholar 

  2. Bao T, Cao H, Chen E, Tian J, Xiong H (2010) An unsupervised approach to modeling personalized contexts of mobile users. In: IEEE International Conference on Data Mining (ICDM), pp 38–47

  3. Becker RA, Cáceres R, Hanson K, Loh JM, Urbanek S, Varshavsky A, Volinsky C (2011) Route classification using cellular handoff patterns. In: International Conference on Ubiquitous Computing (UbiComp), pp 123–132

  4. Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022

    MATH  Google Scholar 

  5. Candia J, Gonzalez MC, Wang P, Schoenharl T, Madey G, Barabasi AL (2008) Uncovering individual and collective human dynamics from mobile phone records. J Phys A Math Theor 41(22):224015–224025

    Google Scholar 

  6. Do T, Gatica-Perez D (2011) Groupus: smartphone proximity data and human interaction type mining. In: Proceedings of IEEE international symposium on wearable computers (ISWC). San Francisco, USA

  7. Eagle N, Pentland A (2009) Eigenbehaviors: identifying structure in routine. Behav Ecol Sociobiol 63(7):1057–1066

    Article  Google Scholar 

  8. Farrahi K (2011) A probabilistic approach to socio-geographic reality mining. Ph.D. thesis, Ecole Polytechnique Federale de Lausanne (EPFL), Switzerland. doi:10.5075/epfl-thesis-5018. URL http://library.epfl.ch/theses/?nr=5018

  9. Farrahi K, Gatica-Perez D (2010) Mining human location-routines using a multi-level topic model. In: Socialcom symposium on social intelligence and networking (Socialcom SIN). Minneapolis, USA

  10. Farrahi K, Gatica-Perez D (2010) Probabilistic mining of socio-geographic routines from mobile phone data. IEEE J Sel Top Signal Process (J-STSP) 4(4):746–755

    Article  Google Scholar 

  11. Farrahi K, Gatica-Perez D (2012) Extracting mobile behavioral patterns with the distant n-gram topic model. In: International symposium on wearable computers (ISWC), pp 1–8

  12. Gonzalez MC, Hidalgo CA, Barabasi AL (2008) Understanding individual human mobility patterns. Nature 453(7196):779–782

    Article  Google Scholar 

  13. Görnerup O (2012) Scalable mining of common routes in mobile communication network traffic data. Pervasive. Newcastle upon Tyne, pp 99–106

  14. Griffiths TL, Steyvers M (2004) Finding scientific topics. Proc Natl Acad Sci 101(Suppl. 1):5228–5235

    Google Scholar 

  15. Hightower J, Consolvo S, Lamarca A, Smith I, Hughes J (2005) Learning and recognizing the places we go. In: International Conference on Ubiquitous Computing (UbiComp), pp 159–176

  16. Hofmann T (1999) Probabilistic latent semantic analysis. In: Proceedings of the conference on uncertainty in artificial intelligence (UAI). Stockholm, Sweden, pp 289–296

  17. Huynh T, Fritz M, Schiele B (2008) Discovery of activity patterns using topic models. In: International Conference on Ubiquitous Computing (UbiComp), pp 10–19

  18. Kang JH, Welbourne W, Stewart B, Borriello G (2005) Extracting places from traces of locations. ACM SIGMOBILE Mob Comput Commun Rev 9(3):58–68

    Article  Google Scholar 

  19. Kiukkonen N, Blom J, Dousse O, Gatica-Perez D, Laurila J (2010) Towards rich mobile phone datasets: Lausanne data collection campaign. In: Proceedings of ACM international conference on pervasive services (ICPS). Berlin, Germany

  20. Mackay DJC (2003) Information theory, inference, and learning algorithms. Cambridge University Press, Cambridge

    MATH  Google Scholar 

  21. Marmasse N, Schmandt C (2000) Location-aware information delivery with commotion. Springer, Berlin, pp 157–171

    Google Scholar 

  22. Montoliu R, Gatica-Perez D (2010) Discovering human places of interest from multimodal mobile phone data. In: Proceedings of ACM international conference on mobile and ubiquitous multimedia (MUM). Cypress, Limassol

  23. Patterson D, Liao L, Fox D, Kautz H (2003) Inferring high-level behavior from low-level sensors. In: International Conference on Ubiquitous Computing (UbiComp), pp 73–89

  24. Petterson J, Smola AJ, Caetano TS, Buntine WL, Narayanamurthy S (2010) Word features for latent dirichlet allocation. Adv Neural Inf Process Syst (NIPS) 23:1921–1929

    Google Scholar 

  25. Phithakkitnukoon S, Horanont T, Lorenzo GD, Shibasaki R, Ratti C (2010) Activity-aware map: Identifying human daily activity pattern using mobile phone data. In: Proceedings of the first international conference on human behavior understanding. Springer, Berlin, pp 14–25

  26. Varadarajan J, Emonet R, Odobez JM (2012) Sparsity in topic models. In: Rish I, Lozano A, Cecchi G, Niculescu-Mizil A (eds) Practical applications of sparse modeling: biology, signal processing and beyond. MIT Press, Cambridge

  27. Wallach H (2006) Topic modeling: beyond bag-of-words. In: Proceedings of the international conference on machine learning (ICML). Pittsburgh, USA

  28. Wang X, McCallum A, Wei X (2007) Topical n-grams: phrase and topic discovery, with an application to information retrieval. In: IEEE international conference on data mining (ICDM).Washington, USA, pp 697–702

  29. Yavas G, Katsaros D, Ulusoy O, Manolopoulos Y (2005) A data mining approach for location prediction in mobile environments. Data Knowl Eng 54(2):121–146

    Article  Google Scholar 

  30. Zheng J, Ni LM (2012) An unsupervised framework for sensing individual and cluster behavior patterns from human mobile data. In: International Conference on Ubiquitous Computing (UbiComp)

  31. Zheng Y, Zhang L, Xie X, Ma WY (2009) Mining interesting locations and travel sequences from GPS trajectories. In: Proceedings of the 18th international conference on world wide web. ACM, New York, pp 791–800

Download references

Acknowledgments

This research was funded by the SNSF HAI project and the LS-CONTEXT project funded by Nokia Research. K. Farrahi also acknowledges the Socionical project and the Pervasive Computing Group at JKU, Linz. We thank Olivier Bornet (Idiap) for help with location data processing and visualization, Gian Paolo Perrucci (Nokia Research Lausanne) for insights on routine visualization, and Trinh-Minh-Tri Do (Idiap) for discussions on sequence modeling methods.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Katayoun Farrahi.

Appendix: Derivation of the distant n-gram topic model

Appendix: Derivation of the distant n-gram topic model

From the graphical model in Fig. 1, we can determine the following relationship:

$$ \begin{aligned} p({\bf z, q}|\alpha,\beta) &= p({\bf z}, \user2{w}_{\bf 1}, \ldots, \user2{w}_{\user2{N}}|\alpha, \beta) \\ &= p(\user2{w}_{\bf 1}, \ldots, \user2{w}_{\user2{n}}|{\bf z}, \alpha, \beta)\cdot p({\bf z}|\alpha, \beta) \\ &= p( \user2{w}_{\bf 2}, \ldots,\user2{w}_{\user2{n}}|\user2{w}_{\bf 1}, {\bf z}, \alpha, \beta) \cdot p(\user2{w}_{\bf 1}|{\bf z}, \alpha, \beta) \cdot p({\bf z}|\alpha, \beta) \\ &= p({\bf z}|\alpha)p(\user2{w}_{\bf 1}|{\bf z},\beta_1)\prod_{j=2}^N p(\user2{w}_{\user2{j}}|{\bf z}, \user2{w}_{\bf 1},\beta_j)\\ &= \int\limits_{\varvec{\Uptheta}} p({\bf z}|\varvec{\Uptheta})p(\varvec{\Uptheta}|\alpha)d\varvec{\Uptheta} \cdot \int\limits_{\varvec{\Upphi}_{\bf 1}} p(\user2{w}_{\bf 1}|{\bf z},\varvec{\Upphi}_{\bf 1})p(\varvec{\Upphi}_{\bf 1}|\beta_1)d\varvec{\Upphi}_{\bf 1}\\ & \quad \cdot \prod_{j=2}^n \int\limits_{\varvec{\Upphi}_{\bf j}} p(\user2{w}_{\user2{j}}|\user2{w}_{\bf 1}, {\bf z},\varvec{\Upphi}_{\bf j})p(\varvec{\Upphi}_{\bf j}|\beta_j)d\varvec{\Upphi}_{\bf j}\\ &= \prod_{m=1}^M \left(\frac{1}{B(\alpha)} \int \prod_{k=1}^T \varvec{\theta}_{m,k}^{n_m^{k}+\alpha-1} d\varvec{\theta}\right) \cdot \prod_{k=1}^T \left( \frac{1}{B(\beta_1)} \int \prod_{t=1}^V \varvec{\phi}_{1_{k,t}}^{n_k^t + \beta_1 - 1} d\varvec{\phi}_1\right) \\ & \quad\cdot \prod_{j=2}^n \prod_{k=1}^T \frac{1}{B(\beta_j)} \left( \int \prod_{t_1=1}^V \prod_{t_2=1}^V \varvec{\phi}_{j_{k, t_1, t_2}}^{n_{k_i^{\prime}}^{(t_1, t_2)_j} + \beta_j - 1} d\phi_j \right) \\ &= \prod_{m=1}^M \frac{B(n_m + \alpha)}{B(\alpha)}\cdot \prod_{k=1}^T \left(\frac{B(n_k + \beta_1)}{B(\beta_1)}\cdot \prod_{j=2}^n\frac{B(n_{k_j^{\prime}} + \beta_j)}{B(\beta_j)} \right) \end{aligned} $$

The joint probability of observations and latent topics can be obtained by marginalizing over the hidden parameters \(\varvec{\Uptheta}\) and \(\varvec{\Upphi}\). These relations are then used for inference and parameter estimation where \(p({\bf z}|\alpha), p(\user2{w}_{\bf 1}|{\bf z}, \beta_1)\), and \(p(\user2{w}_{\user2{j}}|\user2{w}_{\bf 1}, \beta_j)\) are derived in [8] resulting in the following.

$$ p({\bf z}|\alpha) = \prod_{m=1}^M \frac{B(n_m + \alpha)}{B(\alpha)} $$
(11)
$$ p(\user2{w}_{\bf 1}|{\bf z},\beta_1) = \prod_{k=1}^T \frac{B(n_k + \beta_1)}{B(\beta_1)} $$
(12)
$$ p(\user2{w}_{\user2{j}}|\user2{w}_{\bf 1}, {\bf z},\beta_j) = \prod_{k=1}^T \frac{B(n_{k_j^{\prime}} + \beta_j)}{B(\beta_j)}, \quad 1<j\leqq n $$
(13)

We then derive the model parameters based on the MCMC approach of Gibbs sampling [14].

$$ p(z_i = k|{\bf z}_{-i},{\bf q},\alpha,\beta) = \frac{p({\bf z},{\bf q}|\alpha,\beta)}{p({\bf z}_{-i},{\bf q}|\alpha ,\beta)} $$
(14)

using the knowledge z i , or \({\bf w}_{x_{-i}}\) indicate that token i is excluded from the topic or label w x

$$ \propto \frac{B(n_m + \alpha)}{B(n_{m_{-i}} + \alpha)} \cdot \frac{B(n_k + \beta_1)}{B(n_{k_{-i}} + \beta_1)} \cdot \prod_{j=2}^n \frac{B(n_{k_j^{'}} + \beta_j)}{B(n_{k_{j,-i}^{'}} + \beta_j)} $$
(15)

Note the proportionality stems from the terms \(\user2{w}_{1_i}\) and \(\user2{w}_{j_i}\)

$$ \begin{aligned} &\propto (n_{m,-i}^k + \alpha) \cdot \frac{n_{k,-i}^{w_1} + \beta_1}{\sum_{w_1=1}^V n_{k,-i}^t + \beta_1}\\ & \quad \cdot \prod_{j=2}^n \frac{n_{k,-i}^{(w_1, w_2)_j} + \beta_j}{\sum_{w_1=1}^V \sum_{w_2=1}^V n_{k,-i}^{(w_1, w_2)_j} + \beta_j}\nonumber \end{aligned} $$
(16)

where \(n_{x}^{(y)} = n_{x,-i}^{(y)} + 1\) if x = x i and y = y i and \(n_{x}^{(y)} = n_{x,-i}^{(y)}\) in other cases.

Where \(n_k = \{n_k^{w_1}\}_{w_{1}=1}^V\) and \(n_{k_j^{'}} = \{n_{k^{'}}^{(w_1,w_2)_j}\}_{w_1=1, w_2=1}^{w_1=V, w_2=V}\).

We use the properties \(B(x) = \frac{\prod_{k=1}^{dim x} \Upgamma(x_k)}{\Upgamma(\sum_{k=1}^{dim x} x_k)}\), and \(\Upgamma(y) = (y-1)!\)

The model parameters can then be estimated as follows:

$$ \varvec{\theta}_m^k = \frac{n_m^k + \alpha}{\sum_{k=1}^T(n_m^k + \alpha)} $$
(17)
$$ \varvec{\phi}_{1, k}^t = \frac{n_k^t + \beta_1}{\sum_{w_1=1}^V(n_k^{w_1} + \beta_1)} $$
(18)
$$ \varvec{\phi}_{j, k}^{(w_1, w_2)_j} = \frac{n_k^{(w_1, w_2)_j} + \beta_j}{\sum_{w_1=1}^V \sum_{w_2=1}^V (n_k^{(w_1, w_2)_j} + \beta_j)}. $$
(19)

Rights and permissions

Reprints and permissions

About this article

Cite this article

Farrahi, K., Gatica-Perez, D. A probabilistic approach to mining mobile phone data sequences. Pers Ubiquit Comput 18, 223–238 (2014). https://doi.org/10.1007/s00779-013-0640-8

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00779-013-0640-8

Keywords

Navigation