1 Introduction
- RQ1: Do users’ demographics correlate with their CPS behaviors? If so, how predictable are users’ demographics given their CPS behaviors?
- RQ2: To what degree do the logs reveal users’ self-declared CPS behaviors?
- RQ3: How do the log-based CPS behaviors perform in terms of demographic predictability, compared with their self-declared counterparts?
- To the best of our knowledge, this is the first study to investigate the predictability of user demographics by considering CPS behaviors.
- We provide a comprehensive analysis of users’ CPS behaviors and their demographics.
- We demonstrate the degree to which log-based CPS behaviors reveal users’ self-declared CPS behaviors.
- We examine and compare the predictability of users’ demographics using both self-declared and log-based CPS behaviors.
2 Related work
2.1 Cyber behavior
2.2 Physical behavior
2.3 Social behavior
2.4 Gaps
3 Data acquisition & processing
3.1 Questionnaire data
| Attribute | Description | Possible values |
|---|---|---|
| Age | Age | 18-24 yrs |
| | | 25-39 yrs |
| | | 40-54 yrs |
| | | 55+ yrs |
| Education | Education level | Secondary/high school |
| | | Diploma/university degree |
| | | Honors degree |
| | | Master degree |
| | | Higher than Master degree |
| Income | Annual income | 0-$18,200 |
| | | $18,201-$37,000 |
| | | $37,001-$80,000 |
| | | $80,001-$180,000 |
| | | $180,001 and over |
| Parent | Having children? | Yes |
| | | No |
| User type | Are you? | Inner Sydney resident |
| | | Rest of Sydney resident |
| | | Central Business District Worker |
| | | Domestic tourist |
| | | International tourist |

| Attribute | Description | Possible values |
|---|---|---|
| Online duration | Percentage of time spent online | 0-100% |
| WiFi frequency | Frequency of using WiFi in a visit | Seldom |
| | | Occasionally |
| | | Often |
| | | Every visit |
| What to browse | What to browse online | BrightCloud category |
| How many searches | How many queries issued in a visit | >0 |
| What to search | What to search online | BrightCloud category |

| Attribute | Description | Possible values |
|---|---|---|
| Frequency | Visiting frequency | Daily |
| | | Weekly |
| | | Bi-weekly |
| | | Monthly |
| | | Yearly |
| | | Sporadically |
| Weekdays | Days of visits | Mon-Sun |
| Duration | Duration of visit | Numeric values |
| Interests | Interests in shop categories | Mall owner defined shop categories |

| Attribute | Description | Possible values |
|---|---|---|
| Social | Coming with who? | Alone |
| | | With child/children |
| | | With an adult |
| | | With adults |
3.2 Log data
3.2.1 Association log
3.2.2 Browsing log
4 Self-reported behaviors
4.1 Tendencies in durations
| Demographics | Physical duration: F | Physical duration: Sig. | Online duration: F | Online duration: Sig. |
|---|---|---|---|---|
| Age | 4.367 | 0.005 | 3.576 | 0.014 |
| Education | 5.534 | 0.000 | 3.955 | 0.004 |
| Income | 2.997 | 0.018 | 2.624 | 0.034 |
| Parental status | 10.612 | 0.001 | 2.999 | 0.084 |
| User type | 2.995 | 0.018 | 1.454 | 0.215 |
4.2 Tendencies by content categories
- Age: Five Web categories are significantly associated with age. As age increases, the popularity of Social Network decreases while that of News increases. Finance and Business show a similar trend to News, and Real Estate peaks in the 25-39 age group. For physical shop categories (Figure 2(b)), interest in Jewellery decreases from younger to older respondents, while interest in shops in the Children category peaks at age 25-39. The social dynamics of mall visits also change with age (Figure 2(c)). The likelihood that a user will visit With an Adult decreases with age. The 40-54 age group tends to visit the mall With Kids, while the 18-24 and 55+ age groups are comparatively more likely to visit With a Group.
- Education: People with a Higher than Master degree have significantly different interests from groups with lower education levels (Figures 2(d) and 2(e)). The most popular categories of physical locations visited, content browsed and content queried all differ across education levels. Yet no pattern in the social group status of visits is significantly associated with Education, so Figure 2(f) is blank.
- Income: Only two Web content categories are significantly associated with Income (Figure 2(g)), although the change across groups is relatively small. Moreover, for social behaviours (Figure 2(i)), With Kids and With a Group show similar trends to Work and Travel in cyber categories, respectively. No physical shop category is significantly associated with Income, so Figure 2(h) is left blank.
- Parental Status: People Having Kids tend to browse Social Networks less and search Society more; perhaps unsurprisingly, they shop significantly more at stores selling things for Children than those with No Kids (Figures 2(j) and 2(k)). Those Having Kids have a higher probability of visiting the mall With Kids, and a relatively lower probability of visiting with others or alone.
- User Type: For cyber categories (Figure 2(m)), Domestic and International tourists are more interested in Tourism, Travel, Local Information, Travel & Recreation and Food & Drink than the other groups. They are, however, less interested in Shopping. CBD workers are more interested in Shopping and Entertainment & Arts, Rest of Sydney residents are mostly interested in Entertainment & Arts and Food & Drink, and Inner Sydney residents are more interested in Food & Drink and Shopping. When it comes to categories of physical shops (Figure 2(n)), tourists (both domestic and international) appear less interested in Food & Drink shops, although they tend to search for Food & Drink on the Web. Local residents (Inner Sydney and Rest of Sydney residents) show more interest in Fashion than CBD workers and tourists. Domestic tourists show the highest interest in Leisure, followed by Rest of Sydney residents, CBD workers, Inner Sydney residents and International tourists. While visiting the mall alone (Single) is popular across all user types (Figure 2(o)), CBD workers are the group most likely to do so. Tourists (both domestic and international) are more likely to visit With a Group, but domestic tourists also tend to be accompanied by children. Rest of Sydney residents also tend to visit with children or in a group, even compared to Inner Sydney residents.
| Demographics | Attributes | Category | \( \boldsymbol{\chi^{2}} \) | p-value |
|---|---|---|---|---|
| Age | Cyber | Social Network (browsing) | 52.0510 | 0.0000 |
| | | News (querying) | 18.8830 | 0.0000 |
| | | Finance (querying) | 10.2200 | 0.0170 |
| | | Real Estate (querying) | 7.7560 | 0.0499 |
| | | Business (querying) | 13.8110 | 0.0030 |
| | Physical | Jewellery | 7.7480 | 0.0499 |
| | | Children | 8.3800 | 0.0390 |
| | Social | With Kids | 76.743 | 0.0000 |
| | | With an Adult | 13.509 | 0.0040 |
| | | With a Group | 10.199 | 0.0170 |
| Education | Cyber | Tourism (browsing) | 24.7210 | 0.0000 |
| | | Local Services (querying) | 12.1910 | 0.0160 |
| | | Travelling (querying) | 12.2590 | 0.0160 |
| | | Health & Beauty (querying) | 11.6060 | 0.0210 |
| | Physical | Food & Drink | 10.4070 | 0.0340 |
| | | Children | 17.8750 | 0.0010 |
| Income | Cyber | Work (browsing) | 14.7190 | 0.0050 |
| | | Travel (querying) | 10.3620 | 0.0350 |
| | Social | With Kids | 14.795 | 0.0050 |
| | | With a Group | 18.914 | 0.0010 |
| Parental Status | Cyber | Social Network (browsing) | 11.3700 | 0.0010 |
| | | Society (querying) | 6.2010 | 0.0130 |
| | Physical | Children | 65.5470 | 0.0000 |
| | Social | Single | 11.411 | 0.0010 |
| | | With Kids | 147.475 | 0.0000 |
| | | With an Adult | 4.574 | 0.0320 |
| | | With a Group | 4.044 | 0.0440 |
| User Type | Cyber | Tourism (browsing) | 83.9740 | 0.0000 |
| | | Travel (browsing) | 39.8930 | 0.0000 |
| | | Shopping (browsing) | 11.8410 | 0.0190 |
| | | Local Information (querying) | 26.8420 | 0.0000 |
| | | Entertainment & Arts (querying) | 9.9750 | 0.0410 |
| | | Travel & Recreation (querying) | 38.6290 | 0.0000 |
| | | Food & Drink (querying) | 12.2080 | 0.0160 |
| | Physical | Fashion | 10.0990 | 0.0390 |
| | | Food & Drink | 10.4410 | 0.0340 |
| | | Leisure | 11.2550 | 0.0240 |
| | Social | Single | 19.659 | 0.0010 |
| | | With Kids | 11.602 | 0.0210 |
| | | With a Group | 17.449 | 0.0020 |
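For readers who want to reproduce this kind of association test, the following is a minimal sketch of a Pearson chi-square test of independence on a contingency table. It assumes each row of the table above comes from cross-tabulating a demographic attribute against whether a category was selected; the function name and the counts are hypothetical, not data from the study.

```python
def chi_square_independence(table):
    """Return (chi2, degrees_of_freedom) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, dof


# Hypothetical counts: rows are two groups, columns are (selected, not selected).
counts = [[10, 20], [20, 10]]
chi2, dof = chi_square_independence(counts)
print(f"chi2 = {chi2:.4f}, dof = {dof}")
```

The p-value follows from the chi-square distribution with `dof` degrees of freedom; a statistics library would normally supply that step.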
5 Logs vs. questionnaire
5.1 Cyber-physical behaviors from logs
- median(gaps): the median of the gaps (in days) between consecutive visits of the same user, used to estimate the visiting frequency to the mall.
- Weekdays occurrence: the number of occurrences of each day of the week among the user’s visits recorded in AL.
- Time in AL: the total time the user is connected to the WiFi system.
- Time@ShopCat: the time spent in each shop category.
- WiFi frequency: the ratio of the number of visits accessing the Web to the number of visits in AL.
- # of queries: the number of issued queries, which are extracted manually from the log as described in Section 3.2.2.
- URL Category: URLs are categorized using BrightCloud. We then compute the likelihood of accessing each category per user, to characterise what users browse online when visiting the mall.
- Query Category: the category of the query click-through, as categorized by BrightCloud.
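Two of the features above, median(gaps) and WiFi frequency, can be sketched as follows. This is an illustrative sketch only: it assumes at most one visit per calendar day and that a visit "accesses the Web" when the same day also appears in the browsing log. The function names and dates are hypothetical.

```python
from datetime import date
from statistics import median


def median_gap_days(visit_dates):
    """Median gap in days between consecutive visits; None with fewer than two visits."""
    days = sorted(set(visit_dates))
    gaps = [(later - earlier).days for earlier, later in zip(days, days[1:])]
    return median(gaps) if gaps else None


def wifi_frequency(al_visit_dates, bl_visit_dates):
    """Share of AL visits (association log) that also appear in BL (browsing log)."""
    al_days = set(al_visit_dates)
    if not al_days:
        return 0.0
    return len(al_days & set(bl_visit_dates)) / len(al_days)


# Hypothetical visits of one user.
al_visits = [date(2021, 3, 1), date(2021, 3, 8), date(2021, 3, 15), date(2021, 3, 29)]
bl_visits = [date(2021, 3, 8), date(2021, 3, 29)]

gap = median_gap_days(al_visits)          # gaps are 7, 7 and 14 days
ratio = wifi_frequency(al_visits, bl_visits)
print(gap, ratio)
```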
5.2 Log behaviors vs. questionnaire behaviors
- Mean Absolute Error (MAE) is used if both attributes are numeric:
  $$ \frac{\vert v_{s} - v_{l} \vert }{n}, \tag{1} $$
  where \(v_{s}\) denotes the value of the self-declared questionnaire attribute, \(v_{l}\) denotes the corresponding attribute extracted from logs, and \(n\) denotes the number of relevant data points, serving as a normalizing factor.
- Mean Symmetric Difference (MSD) is used if both attributes are categorical. The symmetric difference set operation is applied to measure the consistency between the two sets of attributes:
  $$ \frac{\vert v_{s} \triangle v_{l} \vert }{n}. \tag{2} $$
  In other words, MSD is the average size of the symmetric difference of the two sets.
- Probability Distribution Examination (PDE): if one attribute is categorical and its counterpart is numeric, we examine the probability distribution of the numeric values against each categorical value, and visually inspect their closeness.
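The two numeric measures can be sketched in a few lines. This sketch assumes that the per-user absolute differences (for MAE) and symmetric-difference sizes (for MSD) are summed over the \(n\) data points before normalizing; the function names and sample values are hypothetical.

```python
def mean_absolute_error(self_declared, log_based):
    """Average |v_s - v_l| over paired numeric values."""
    pairs = list(zip(self_declared, log_based))
    return sum(abs(v_s - v_l) for v_s, v_l in pairs) / len(pairs)


def mean_symmetric_difference(self_declared, log_based):
    """Average size of the symmetric difference over paired sets of categories."""
    pairs = list(zip(self_declared, log_based))
    return sum(len(set(v_s) ^ set(v_l)) for v_s, v_l in pairs) / len(pairs)


# Hypothetical visit durations (hours) for three users.
mae = mean_absolute_error([3.0, 2.0, 4.0], [3.5, 1.0, 4.0])

# Hypothetical visiting days for two users: questionnaire vs logs.
msd = mean_symmetric_difference(
    [{"Mon", "Sat"}, {"Sun"}],
    [{"Mon", "Sun"}, {"Sun"}],
)
print(mae, msd)
```

In the second example the first user contributes a symmetric difference of size 2 (Sat and Sun) and the second user contributes 0, giving an average of 1.0.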
- Physical Attributes
  - Frequency: The average difference between the user self-reported frequency and the one captured in logs is 0.88 days. For example, if a user reported visiting the mall on a weekly (7 days) basis, the log-observed frequency is between \(7-0.88\) and \(7+0.88\) days.
  - Weekdays: On average, 2.13 visits did not fall on the self-reported visiting days, while the average number of visits per participant is 12.05. Figure 3(a) shows the user self-declared visiting days in a week versus the corresponding values computed from the logs. Taking Sunday on the self-declared axis as an example, users declaring that they visit the mall on Sundays are also most likely to be log-observed in the mall on Sundays. This means that the log-recorded visits faithfully capture the behaviour declared in the questionnaire.
  - Duration: The average difference between the user self-reported visiting duration in the mall and the one captured in logs is 0.70 hours, while most of the users spend around 3.5 hours in the mall.
  - Interests: The average difference between the user self-reported favourite shop categories and the ones captured in logs is 2.50, while most of the users favour more than 10 shop categories out of the 34 available.
- Web Attributes
  - WiFi Frequency: Figure 3(b) shows the distribution of the log-based observations of Web use frequency (estimated with [42]) for each corresponding categorical value obtained in the questionnaire. For Every Visit, the corresponding log values average around 0.6, with a maximum of 0.83 and a minimum of 0.4, which is clearly higher than for the other groups. For Seldom, the corresponding log values are all zero except for a single outlier (around 0.1). The values for Often and Occasionally are not clearly distinct from each other, which might be due to the ambiguity of the natural-language expressions used in the questionnaire.
  - How many searches: Figure 3(c) shows the probability of the number of issued queries based on the log and its corresponding category from the questionnaire. While most of the users submitted 2 or 3 queries in a visit and correctly reported this, a large number of users underestimated how often they issue a single query and overestimated the occurrence of visits with more queries issued.
  - What to browse/search: The average difference between the user self-reported favourite Web browsing/searching content and the ones captured in logs is 1.66 and 1.65, respectively, while users tend to browse/search 8 categories of Web content overall.
| Questionnaire attributes \(\boldsymbol{(v_{s})}\) | Type | vs | Log attributes \(\boldsymbol{(v_{l})}\) | Type | Method | Result & explanation |
|---|---|---|---|---|---|---|
| Physical attributes | | | | | | |
| Frequency | N | ↔ | median(gaps) | N | MAE | 0.88 (day) |
| Weekdays | C | ↔ | Weekdays occurrence | C | MSD | 2.13 (visit) |
| Duration | N | ↔ | Time in AL | N | MAE | 0.70 (hour) |
| Interests | C | ↔ | Time@ShopCat | C | MSD | 2.50 (category) |
| Web attributes | | | | | | |
| WiFi frequency | C | ↔ | \(\frac{\# \text{of BL visits}}{\# \text{of AL visits}}\) | N | PDE | Figure 3(b) |
| How many searches | C | ↔ | # of queries | N | PDE | Figure 3(c) |
| What to browse | C | ↔ | URL category | C | MSD | 1.66 (category) |
| What to search | C | ↔ | Query category | C | MSD | 1.65 (category) |
6 Predictability of demographics
6.1 Experiment configuration
- Questionnaire-based attributes: (1) questionnaire-cyber: the Web attributes as shown in Table 2; (2) questionnaire-physical: the physical attributes as shown in Table 3; (3) questionnaire-social: the social grouping status as shown in Section 3.1; (4) questionnaire-all: includes all cyber, physical and social attributes.
- Log-based attributes: (1) logs-cyber: the Web attributes extracted from BL as shown in Section 5.1; (2) logs-physical: the physical attributes extracted from AL as shown in Section 5.1; (3) logs-social: Nil; (4) logs-all: contains all attributes extracted from both AL and BL.

In addition, we consider two sets of users here:

- all users: all participants in the questionnaire collection, who may or may not also be present in the logs;
- cyber users: the subset of participants who have cyber browsing/searching logs associated with their questionnaire responses.
6.2 Predicting demographics
- Age: While cyber features outperform physical and social features, the difference in predictive power based on accuracy is not large. Table 10 shows the top 5 best performing features; for Age, these include two cyber, two physical and one social feature.
- Education: This appears to be the most difficult demographic attribute to predict, with an improvement of only around 5% compared to the mostPop baseline. Only a single feature (Work-Browsing) performs better than the mostPop baseline for the prediction of Education (Table 10). Recall the analysis in Section 4.2: although people with higher than master degrees behave differently from others, the other education groups are not distinguishable, possibly with the exception of their association with the shopping category Children. There is also no association between social group status and education, which is consistent with our observation that the social feature performs roughly the same as the mostPop baseline.
- Income: Physical and cyber features significantly outperform social features when predicting Income. The top 5 best performing features in Table 10 include 1 physical feature and 4 cyber features, confirming that physical and cyber behaviors dominate the predictability of users’ income. Specifically, the most predictive physical feature is duration, which confirms the analysis in Section 4.1.
- Parent: This attribute is easier to predict than Age, Education and Income, but not User Type. The social feature clearly dominates the prediction performance and significantly outperforms the cyber feature. As shown in Table 10, this is mainly because of the WithKids social group status. In addition, the physical, self-reported feature Children is also a good (and expected) indicator of users’ parental status. Log-based features perform badly here, simply because they lack the social information.
- User Type: This is the most easily predicted demographic, with a 35% improvement compared with mostPop. From the results of questionnaire-based features, we observe that physical features dominate here. The top performing features are physical visiting frequency and time (days in a week), as shown in Table 10.
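The improvement figures quoted above appear to be relative gains over the mostPop baseline. As a quick check, the sketch below computes the relative improvement from the questionnaire-all and MostPop accuracies reported in the accuracy table of this section. Treating the quoted percentages as relative (rather than absolute) gains is an assumption, and the function name is hypothetical.

```python
def relative_improvement(accuracy, baseline):
    """Relative gain of `accuracy` over `baseline`, in percent."""
    return 100.0 * (accuracy - baseline) / baseline


# Questionnaire "All" accuracy vs MostPop accuracy, as reported in this section.
user_type_gain = relative_improvement(49.68, 36.79)
education_gain = relative_improvement(50.45, 48.01)
print(f"User type: {user_type_gain:.1f}%  Education: {education_gain:.1f}%")
```

Under this reading, User Type gains roughly 35% and Education roughly 5%, in line with the figures in the text.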
| Data | Features | Age | Education | Income | Parent | User type |
|---|---|---|---|---|---|---|
| Questionnaire | Cyber | 48.49 | 49.14 | 36.47 | 77.46 | 43.39 |
| | Physical | 47.67 | 49.22 | 38.09 | 81.58 | 46.85 |
| | Social | 47.35 | 48.84 | 33.96 | 84.76 | 39.62 |
| | All | 52.20 | 50.45 | 38.99 | 88.57 | 49.68 |
| Logs | Cyber | 63.40 | 60.75 | 60.06 | 79.61 | 53.96 |
| | (MostPop) | 50.94 | 56.60 | 43.39 | 76.92 | 39.62 |
| | Physical | 47.89 | 50.03 | 37.98 | 77.18 | 43.14 |
| | Social | - | - | - | - | - |
| | (Cyber-physical) all users | 52.45 | 50.33 | 41.75 | 77.60 | 47.25 |
| | (Cyber-physical) cyber user | 68.68 | 63.01 | 69.06 | 80.77 | 66.04 |
| MostPop | | 45.28 | 48.01 | 33.96 | 76.19 | 36.79 |
| Random | | 25.00 | 20.00 | 20.00 | 50.00 | 20.00 |
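The two baseline rows can be sketched as follows, assuming that mostPop always predicts the most frequent class and that Random guesses uniformly among the classes. The Random values in the table (25.00 for the four age groups, 20.00 for five-class attributes, 50.00 for Parent) are consistent with the uniform-guess reading. The function names and the label list are hypothetical.

```python
from collections import Counter


def most_pop_accuracy(labels):
    """Accuracy (in percent) of always predicting the most frequent label."""
    counts = Counter(labels)
    return 100.0 * counts.most_common(1)[0][1] / len(labels)


def random_accuracy(num_classes):
    """Expected accuracy (in percent) of a uniform random guess."""
    return 100.0 / num_classes


# Hypothetical parental-status labels for four participants.
parent_labels = ["No", "No", "No", "Yes"]
print(most_pop_accuracy(parent_labels))   # share of the majority class
print(random_accuracy(4), random_accuracy(5), random_accuracy(2))
```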
| Data | Features | Paired-t statistic: t | Paired-t statistic: p-value |
|---|---|---|---|
| Questionnaire | All vs cyber | 2.8697 | 0.0455 |
| | All vs physical | 2.9250 | 0.0430 |
| | All vs social | 3.6552 | 0.0217 |
| | All vs mostPop | 3.8710 | 0.0180 |
| | All vs random | 9.2194 | 0.0008 |
| Logs | (Cyber-physical) all users vs cyber | - | - |
| | (Cyber-physical) all users vs physical | 2.8115 | 0.0482 |
| | (Cyber-physical) all users vs social | - | - |
| | (Cyber-physical) all users vs mostPop | 3.3964 | 0.0274 |
| | (Cyber-physical) all users vs random | 19.1973 | <0.0001 |
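The paired t statistics above appear to be computed over the five per-attribute accuracies (Age, Education, Income, Parent, User type) of two feature sets. The sketch below illustrates this under that assumption, using the questionnaire All and Cyber accuracies reported in this section; the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Paired t statistic: mean of the differences over its standard error."""
    diffs = [x - y for x, y in zip(a, b)]
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Accuracies for Age, Education, Income, Parent, User type (questionnaire data).
questionnaire_all = [52.20, 50.45, 38.99, 88.57, 49.68]
questionnaire_cyber = [48.49, 49.14, 36.47, 77.46, 43.39]

t = paired_t_statistic(questionnaire_all, questionnaire_cyber)
print(f"t = {t:.4f}")
```

With these inputs the statistic comes out at approximately 2.87, matching the "All vs cyber" row. The p-value would then be taken from a t distribution with 4 degrees of freedom.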
| Age: Feature | Age: Acc. | Education: Feature | Education: Acc. | Income: Feature | Income: Acc. | Parent: Feature | Parent: Acc. | User type: Feature | User type: Acc. |
|---|---|---|---|---|---|---|---|---|---|
| (S)withKids | 47.35 | (C)Work-Browsing | 49.04 | (P)Duration | 35.17 | (S)withKids | 84.76 | (P)Frequency | 44.75 |
| (C)Online Time | 47.00 | - | - | (C)Work-Browsing | 34.69 | (P)Children | 81.58 | (P)Duration | 40.25 |
| (P)Frequency | 45.91 | - | - | (C)Travel-Browsing | 34.49 | (C)Online Time | 77.14 | (P)Fashion | 39.30 |
| (P)Jewellery | 45.74 | - | - | (C)Communication | 34.28 | (C)Society-Search | 76.82 | (P)Monday | 38.99 |
| (C)Hobbies-search | 45.51 | - | - | (C)Shopping | 34.08 | (P)Frequency | 76.19 | (P)Tuesday | 38.99 |