A probabilistic approach to mining mobile phone data sequences

Farrahi, Katayoun; Gatica-Perez, Daniel

doi:10.1007/s00779-013-0640-8

A probabilistic approach to mining mobile phone data sequences

Original Article
Published: 20 February 2013

Volume 18, pages 223–238, (2014)
Cite this article

Personal and Ubiquitous Computing Aims and scope Submit manuscript

Katayoun Farrahi¹ &
Daniel Gatica-Perez^2,3

1958 Accesses
22 Citations
1 Altmetric
Explore all metrics

Abstract

We present a new approach to address the problem of large sequence mining from big data. The particular problem of interest is the effective mining of long sequences from large-scale location data to be practical for Reality Mining applications, which suffer from large amounts of noise and lack of ground truth. To address this complex data, we propose an unsupervised probabilistic topic model called the distant n-gram topic model (DNTM). The DNTM is based on latent Dirichlet allocation (LDA), which is extended to integrate sequential information. We define the generative process for the model, derive the inference procedure, and evaluate our model on both synthetic data and real mobile phone data. We consider two different mobile phone datasets containing natural human mobility patterns obtained by location sensing, the first considering GPS/wi-fi locations and the second considering cell tower connections. The DNTM discovers meaningful topics on the synthetic data as well as the two mobile phone datasets. Finally, the DNTM is compared to LDA by considering log-likelihood performance on unseen data, showing the predictive power of the model. The results show that the DNTM consistently outperforms LDA as the sequence length increases.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A Probabilistic Approach for Discovering Daily Human Mobility Patterns with Mobile Data

Mobility Pattern Identification Based on Mobile Phone Data

Discovering locations and habits from human mobility data

Article 07 September 2020

References

Ashbrook D, Starner T (2003) Using GPS to learn significant locations and predict movement across multiple users. Personal Ubiquitous Comput 7(5):275–286
Article Google Scholar
Bao T, Cao H, Chen E, Tian J, Xiong H (2010) An unsupervised approach to modeling personalized contexts of mobile users. In: IEEE International Conference on Data Mining (ICDM), pp 38–47
Becker RA, Cáceres R, Hanson K, Loh JM, Urbanek S, Varshavsky A, Volinsky C (2011) Route classification using cellular handoff patterns. In: International Conference on Ubiquitous Computing (UbiComp), pp 123–132
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
MATH Google Scholar
Candia J, Gonzalez MC, Wang P, Schoenharl T, Madey G, Barabasi AL (2008) Uncovering individual and collective human dynamics from mobile phone records. J Phys A Math Theor 41(22):224015–224025
Google Scholar
Do T, Gatica-Perez D (2011) Groupus: smartphone proximity data and human interaction type mining. In: Proceedings of IEEE international symposium on wearable computers (ISWC). San Francisco, USA
Eagle N, Pentland A (2009) Eigenbehaviors: identifying structure in routine. Behav Ecol Sociobiol 63(7):1057–1066
Article Google Scholar
Farrahi K (2011) A probabilistic approach to socio-geographic reality mining. Ph.D. thesis, Ecole Polytechnique Federale de Lausanne (EPFL), Switzerland. doi:10.5075/epfl-thesis-5018. URL http://library.epfl.ch/theses/?nr=5018
Farrahi K, Gatica-Perez D (2010) Mining human location-routines using a multi-level topic model. In: Socialcom symposium on social intelligence and networking (Socialcom SIN). Minneapolis, USA
Farrahi K, Gatica-Perez D (2010) Probabilistic mining of socio-geographic routines from mobile phone data. IEEE J Sel Top Signal Process (J-STSP) 4(4):746–755
Article Google Scholar
Farrahi K, Gatica-Perez D (2012) Extracting mobile behavioral patterns with the distant n-gram topic model. In: International symposium on wearable computers (ISWC), pp 1–8
Gonzalez MC, Hidalgo CA, Barabasi AL (2008) Understanding individual human mobility patterns. Nature 453(7196):779–782
Article Google Scholar
Görnerup O (2012) Scalable mining of common routes in mobile communication network traffic data. Pervasive. Newcastle upon Tyne, pp 99–106
Griffiths TL, Steyvers M (2004) Finding scientific topics. Proc Natl Acad Sci 101(Suppl. 1):5228–5235
Google Scholar
Hightower J, Consolvo S, Lamarca A, Smith I, Hughes J (2005) Learning and recognizing the places we go. In: International Conference on Ubiquitous Computing (UbiComp), pp 159–176
Hofmann T (1999) Probabilistic latent semantic analysis. In: Proceedings of the conference on uncertainty in artificial intelligence (UAI). Stockholm, Sweden, pp 289–296
Huynh T, Fritz M, Schiele B (2008) Discovery of activity patterns using topic models. In: International Conference on Ubiquitous Computing (UbiComp), pp 10–19
Kang JH, Welbourne W, Stewart B, Borriello G (2005) Extracting places from traces of locations. ACM SIGMOBILE Mob Comput Commun Rev 9(3):58–68
Article Google Scholar
Kiukkonen N, Blom J, Dousse O, Gatica-Perez D, Laurila J (2010) Towards rich mobile phone datasets: Lausanne data collection campaign. In: Proceedings of ACM international conference on pervasive services (ICPS). Berlin, Germany
Mackay DJC (2003) Information theory, inference, and learning algorithms. Cambridge University Press, Cambridge
MATH Google Scholar
Marmasse N, Schmandt C (2000) Location-aware information delivery with commotion. Springer, Berlin, pp 157–171
Google Scholar
Montoliu R, Gatica-Perez D (2010) Discovering human places of interest from multimodal mobile phone data. In: Proceedings of ACM international conference on mobile and ubiquitous multimedia (MUM). Cypress, Limassol
Patterson D, Liao L, Fox D, Kautz H (2003) Inferring high-level behavior from low-level sensors. In: International Conference on Ubiquitous Computing (UbiComp), pp 73–89
Petterson J, Smola AJ, Caetano TS, Buntine WL, Narayanamurthy S (2010) Word features for latent dirichlet allocation. Adv Neural Inf Process Syst (NIPS) 23:1921–1929
Google Scholar
Phithakkitnukoon S, Horanont T, Lorenzo GD, Shibasaki R, Ratti C (2010) Activity-aware map: Identifying human daily activity pattern using mobile phone data. In: Proceedings of the first international conference on human behavior understanding. Springer, Berlin, pp 14–25
Varadarajan J, Emonet R, Odobez JM (2012) Sparsity in topic models. In: Rish I, Lozano A, Cecchi G, Niculescu-Mizil A (eds) Practical applications of sparse modeling: biology, signal processing and beyond. MIT Press, Cambridge
Wallach H (2006) Topic modeling: beyond bag-of-words. In: Proceedings of the international conference on machine learning (ICML). Pittsburgh, USA
Wang X, McCallum A, Wei X (2007) Topical n-grams: phrase and topic discovery, with an application to information retrieval. In: IEEE international conference on data mining (ICDM).Washington, USA, pp 697–702
Yavas G, Katsaros D, Ulusoy O, Manolopoulos Y (2005) A data mining approach for location prediction in mobile environments. Data Knowl Eng 54(2):121–146
Article Google Scholar
Zheng J, Ni LM (2012) An unsupervised framework for sensing individual and cluster behavior patterns from human mobile data. In: International Conference on Ubiquitous Computing (UbiComp)
Zheng Y, Zhang L, Xie X, Ma WY (2009) Mining interesting locations and travel sequences from GPS trajectories. In: Proceedings of the 18th international conference on world wide web. ACM, New York, pp 791–800

Download references

Acknowledgments

This research was funded by the SNSF HAI project and the LS-CONTEXT project funded by Nokia Research. K. Farrahi also acknowledges the Socionical project and the Pervasive Computing Group at JKU, Linz. We thank Olivier Bornet (Idiap) for help with location data processing and visualization, Gian Paolo Perrucci (Nokia Research Lausanne) for insights on routine visualization, and Trinh-Minh-Tri Do (Idiap) for discussions on sequence modeling methods.

Author information

Authors and Affiliations

JKU University Linz, Linz, Austria
Katayoun Farrahi
Idiap Research Institute, Martigny, Switzerland
Daniel Gatica-Perez
EPFL, Lausanne, Switzerland
Daniel Gatica-Perez

Authors

Katayoun Farrahi
View author publications
You can also search for this author in PubMed Google Scholar
Daniel Gatica-Perez
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Katayoun Farrahi.

Appendix: Derivation of the distant n-gram topic model

From the graphical model in Fig. 1, we can determine the following relationship:

$$ \begin{aligned} p({\bf z, q}|\alpha,\beta) &= p({\bf z}, \user2{w}_{\bf 1}, \ldots, \user2{w}_{\user2{N}}|\alpha, \beta) \\ &= p(\user2{w}_{\bf 1}, \ldots, \user2{w}_{\user2{n}}|{\bf z}, \alpha, \beta)\cdot p({\bf z}|\alpha, \beta) \\ &= p( \user2{w}_{\bf 2}, \ldots,\user2{w}_{\user2{n}}|\user2{w}_{\bf 1}, {\bf z}, \alpha, \beta) \cdot p(\user2{w}_{\bf 1}|{\bf z}, \alpha, \beta) \cdot p({\bf z}|\alpha, \beta) \\ &= p({\bf z}|\alpha)p(\user2{w}_{\bf 1}|{\bf z},\beta_1)\prod_{j=2}^N p(\user2{w}_{\user2{j}}|{\bf z}, \user2{w}_{\bf 1},\beta_j)\\ &= \int\limits_{\varvec{\Uptheta}} p({\bf z}|\varvec{\Uptheta})p(\varvec{\Uptheta}|\alpha)d\varvec{\Uptheta} \cdot \int\limits_{\varvec{\Upphi}_{\bf 1}} p(\user2{w}_{\bf 1}|{\bf z},\varvec{\Upphi}_{\bf 1})p(\varvec{\Upphi}_{\bf 1}|\beta_1)d\varvec{\Upphi}_{\bf 1}\\ & \quad \cdot \prod_{j=2}^n \int\limits_{\varvec{\Upphi}_{\bf j}} p(\user2{w}_{\user2{j}}|\user2{w}_{\bf 1}, {\bf z},\varvec{\Upphi}_{\bf j})p(\varvec{\Upphi}_{\bf j}|\beta_j)d\varvec{\Upphi}_{\bf j}\\ &= \prod_{m=1}^M \left(\frac{1}{B(\alpha)} \int \prod_{k=1}^T \varvec{\theta}_{m,k}^{n_m^{k}+\alpha-1} d\varvec{\theta}\right) \cdot \prod_{k=1}^T \left( \frac{1}{B(\beta_1)} \int \prod_{t=1}^V \varvec{\phi}_{1_{k,t}}^{n_k^t + \beta_1 - 1} d\varvec{\phi}_1\right) \\ & \quad\cdot \prod_{j=2}^n \prod_{k=1}^T \frac{1}{B(\beta_j)} \left( \int \prod_{t_1=1}^V \prod_{t_2=1}^V \varvec{\phi}_{j_{k, t_1, t_2}}^{n_{k_i^{\prime}}^{(t_1, t_2)_j} + \beta_j - 1} d\phi_j \right) \\ &= \prod_{m=1}^M \frac{B(n_m + \alpha)}{B(\alpha)}\cdot \prod_{k=1}^T \left(\frac{B(n_k + \beta_1)}{B(\beta_1)}\cdot \prod_{j=2}^n\frac{B(n_{k_j^{\prime}} + \beta_j)}{B(\beta_j)} \right) \end{aligned} $$

The joint probability of observations and latent topics can be obtained by marginalizing over the hidden parameters $\varvec{\Uptheta}$ and $\varvec{\Upphi}$. These relations are then used for inference and parameter estimation where $p({\bf z}|\alpha), p(\user2{w}_{\bf 1}|{\bf z}, \beta_1)$, and $p(\user2{w}_{\user2{j}}|\user2{w}_{\bf 1}, \beta_j)$ are derived in [8] resulting in the following.

$$ p({\bf z}|\alpha) = \prod_{m=1}^M \frac{B(n_m + \alpha)}{B(\alpha)} $$

(11)

$$ p(\user2{w}_{\bf 1}|{\bf z},\beta_1) = \prod_{k=1}^T \frac{B(n_k + \beta_1)}{B(\beta_1)} $$

(12)

$$ p(\user2{w}_{\user2{j}}|\user2{w}_{\bf 1}, {\bf z},\beta_j) = \prod_{k=1}^T \frac{B(n_{k_j^{\prime}} + \beta_j)}{B(\beta_j)}, \quad 1<j\leqq n $$

(13)

We then derive the model parameters based on the MCMC approach of Gibbs sampling [14].

$$ p(z_i = k|{\bf z}_{-i},{\bf q},\alpha,\beta) = \frac{p({\bf z},{\bf q}|\alpha,\beta)}{p({\bf z}_{-i},{\bf q}|\alpha ,\beta)} $$

(14)

using the knowledge z _−i, or ${\bf w}_{x_{-i}}$ indicate that token i is excluded from the topic or label w _x

$$ \propto \frac{B(n_m + \alpha)}{B(n_{m_{-i}} + \alpha)} \cdot \frac{B(n_k + \beta_1)}{B(n_{k_{-i}} + \beta_1)} \cdot \prod_{j=2}^n \frac{B(n_{k_j^{'}} + \beta_j)}{B(n_{k_{j,-i}^{'}} + \beta_j)} $$

(15)

Note the proportionality stems from the terms $\user2{w}_{1_i}$ and $\user2{w}_{j_i}$

$$ \begin{aligned} &\propto (n_{m,-i}^k + \alpha) \cdot \frac{n_{k,-i}^{w_1} + \beta_1}{\sum_{w_1=1}^V n_{k,-i}^t + \beta_1}\\ & \quad \cdot \prod_{j=2}^n \frac{n_{k,-i}^{(w_1, w_2)_j} + \beta_j}{\sum_{w_1=1}^V \sum_{w_2=1}^V n_{k,-i}^{(w_1, w_2)_j} + \beta_j}\nonumber \end{aligned} $$

(16)

where $n_{x}^{(y)} = n_{x,-i}^{(y)} + 1$ if x = x _i and y = y _i and $n_{x}^{(y)} = n_{x,-i}^{(y)}$ in other cases.

Where $n_k = \{n_k^{w_1}\}_{w_{1}=1}^V$ and $n_{k_j^{'}} = \{n_{k^{'}}^{(w_1,w_2)_j}\}_{w_1=1, w_2=1}^{w_1=V, w_2=V}$.

We use the properties $B(x) = \frac{\prod_{k=1}^{dim x} \Upgamma(x_k)}{\Upgamma(\sum_{k=1}^{dim x} x_k)}$, and $\Upgamma(y) = (y-1)!$

The model parameters can then be estimated as follows:

$$ \varvec{\theta}_m^k = \frac{n_m^k + \alpha}{\sum_{k=1}^T(n_m^k + \alpha)} $$

(17)

$$ \varvec{\phi}_{1, k}^t = \frac{n_k^t + \beta_1}{\sum_{w_1=1}^V(n_k^{w_1} + \beta_1)} $$

(18)

$$ \varvec{\phi}_{j, k}^{(w_1, w_2)_j} = \frac{n_k^{(w_1, w_2)_j} + \beta_j}{\sum_{w_1=1}^V \sum_{w_2=1}^V (n_k^{(w_1, w_2)_j} + \beta_j)}. $$

(19)

Rights and permissions

Reprints and permissions

About this article

Cite this article

Farrahi, K., Gatica-Perez, D. A probabilistic approach to mining mobile phone data sequences. Pers Ubiquit Comput 18, 223–238 (2014). https://doi.org/10.1007/s00779-013-0640-8

Download citation

Received: 02 October 2012
Accepted: 21 November 2012
Published: 20 February 2013
Issue Date: January 2014
DOI: https://doi.org/10.1007/s00779-013-0640-8

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A probabilistic approach to mining mobile phone data sequences

Abstract

Access this article

Similar content being viewed by others

A Probabilistic Approach for Discovering Daily Human Mobility Patterns with Mobile Data

Mobility Pattern Identification Based on Mobile Phone Data

Discovering locations and habits from human mobility data

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Appendix: Derivation of the distant n-gram topic model

Rights and permissions

About this article

Cite this article

Keywords

Navigation

A probabilistic approach to mining mobile phone data sequences

Abstract

Access this article

Similar content being viewed by others

A Probabilistic Approach for Discovering Daily Human Mobility Patterns with Mobile Data

Mobility Pattern Identification Based on Mobile Phone Data

Discovering locations and habits from human mobility data

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Appendix: Derivation of the distant n-gram topic model

Appendix: Derivation of the distant n-gram topic model

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation