1 Introduction
1.1 Main contributions
- We introduce BNPM, a Bayesian non-parametric model for spatio-temporal behavior modeling at the subgroup level. BNPM can handle diverse, uncertain, large-scale, and multi-modal information in collective spatio-temporal data.
- We define a new evaluation method for exceptional model mining. The global distribution is generated by the mixture of behavioral patterns in BNPM. By comparing the posterior distribution of a candidate subgroup with this global distribution, we can quantify the exceptionality of the subgroup.
- We conduct experiments on four real-world datasets. The results show that our method is effective and efficient at finding exceptional social media posts at the subgroup level.
2 Related work
2.1 Anomaly detection
2.2 Exceptional model mining
2.3 Spatio-temporal modeling
3 Preliminaries
4 Subgroup-level spatio-temporal modeling (BNPM)
Notation | Description |
---|---|
n | Number of subgroups |
m | Number of geo-tagged social media posts |
\(d_i\) | Number of posts belonging to subgroup i |
D | Description of a subgroup |
\(r_{ij}\) | Social media post j in subgroup i |
\(l_{ij}=(x,y)\) | Spatial location of post j in subgroup i |
\(t_{ij}=t\) | Time of post j in subgroup i |
\(w_{ij}=\{w_1, \dots , w_q\}\) | Words of post j in subgroup i |
\(n_k\) | Number of subgroups in component k |
\(z_i\) | Component assignment of subgroup i |
K | Number of components |
V | Vocabulary of all words |
\(\alpha \) | Concentration parameter of the Chinese restaurant process (CRP) |
\(\beta _k\) | Probability of choosing component k |
\(\mu _{i},\varSigma _{i}\) | Mean and covariance of spatial locations in subgroup i |
\(\upsilon _i,\sigma ^2_i\) | Mean and variance of time in subgroup i |
\(\theta _i\) | Word distribution for posts in subgroup i |
\(\mu _{0z_i},\lambda _{z_i},W_{z_i},\nu _{z_i}\) | Normal–inverse–Wishart (\(\mathcal {NIW}\)) prior for \(\mu _{i},\varSigma _{i}\) |
\(\upsilon _{0z_i},\kappa _{z_i},\rho _{z_i},\psi _{z_i}\) | Normal-Gamma (\(\mathcal {NG}\)) prior for \(\upsilon _i,\sigma ^2_i\) |
\(\theta _{0z_i}\) | Dirichlet prior for \(\theta _i\) |
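To make the roles of \(\alpha\), \(n_k\), and \(z_i\) in the notation concrete, the following is a minimal sketch of the standard Chinese restaurant process assignment rule: a subgroup joins an existing component k with probability proportional to \(n_k\) and opens a new component with probability proportional to \(\alpha\). This is the textbook CRP prior only, not the full BNPM inference procedure; the function names `crp_assignment_probs` and `sample_crp` are hypothetical and chosen for illustration.

```python
import random


def crp_assignment_probs(counts, alpha):
    """Prior probabilities of joining each existing component, then a new one.

    counts[k] plays the role of n_k; alpha is the concentration parameter
    and is assumed to be strictly positive.
    """
    total = sum(counts) + alpha
    return [n_k / total for n_k in counts] + [alpha / total]


def sample_crp(n, alpha, seed=0):
    """Sequentially draw component assignments z_1..z_n from the CRP prior."""
    rng = random.Random(seed)
    counts = []  # n_k for each component created so far
    z = []       # z_i for each subgroup
    for _ in range(n):
        probs = crp_assignment_probs(counts, alpha)
        k = rng.choices(range(len(probs)), weights=probs)[0]
        if k == len(counts):
            counts.append(1)   # open a new component
        else:
            counts[k] += 1     # join existing component k
        z.append(k)
    return z, counts
```

Larger values of `alpha` make new components more likely, so the number of components K grows with the data rather than being fixed in advance.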
4.1 The Bayesian non-parametric model
4.2 Inference method
4.3 Subgroup evaluation method
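As a rough illustration of comparing a subgroup's posterior with the global distribution, the sketch below restricts attention to the word distribution \(\theta_i\). It uses the standard Dirichlet-multinomial update (posterior parameters are the prior parameters plus the observed word counts) and then measures the gap to a global word distribution with total variation distance. The choice of total variation distance is an assumption made here for illustration and is not necessarily the measure behind \(\varphi_{sd}(D)\); the function names are hypothetical.

```python
def dirichlet_posterior_mean(prior, word_counts):
    """Posterior mean of theta_i under a Dirichlet prior and multinomial counts."""
    posterior = [a + c for a, c in zip(prior, word_counts)]
    total = sum(posterior)
    return [p / total for p in posterior]


def total_variation(p, q):
    """Total variation distance between two discrete distributions, in [0, 1]."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Toy example over a three-word vocabulary (all numbers are made up).
global_dist = [0.5, 0.3, 0.2]
subgroup_theta = dirichlet_posterior_mean([1.0, 1.0, 1.0], [0, 2, 10])
score = total_variation(subgroup_theta, global_dist)
```

A score near 0 means the subgroup's word usage resembles the global pattern, while a score near 1 marks a subgroup whose usage is concentrated on words that are globally rare.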
5 Experiments
Dataset | \(\#\) Tweets | \(\#\) Users | Timeframe | \(\#\) Attributes |
---|---|---|---|---|
London | 169,033 | 48,232 | April 2016 | 10 |
New York | 210,820 | 87,510 | April 2016 | 10 |
Tokyo | 201,643 | 49,214 | April 2016 | 10 |
Shenzhen | 303,161 | 100,000 | October 2016 | 8 |
D | \(\varphi _{sd}(D)\) | \(\frac{|D|}{|\varOmega |}\) | High-frequency words |
---|---|---|---|
\(D_1\) | 0.79 | 0.04 | New song, come on, music, support, like, rank |
\(D_2\) | 0.64 | 0.04 | Thailand, selfie, holiday, Weibo, tour, photography |
\(D_3\) | 0.62 | 0.03 | New song, come on, music, support, like, rank |
\(D_4\) | 0.61 | 0.03 | Team, investment, customer, finance, refine, ability |
\(D_5\) | 0.51 | 0.04 | Stadium, sports, run, insist, seaside, struggle |
5.1 London and Shenzhen
D | \(\varphi _{sd}(D)\) | \(\frac{|D|}{|\varOmega |}\) | High-frequency words |
---|---|---|---|
\(D_1\) | 0.95 | 0.03 | London, Chelsea, Stamford, bridge, football, bar |
\(D_2\) | 0.90 | 0.07 | Stockmarket, trade, stock, intern, broker, forecast |
\(D_3\) | 0.88 | 0.07 | Street, kingcross, station, camdenlock, transport, driver |
\(D_4\) | 0.86 | 0.05 | Hackney, gym, class, image, orange, boss |
\(D_5\) | 0.85 | 0.04 | History, restaurant, sweet, healthy, cover, Paddington |