1 Introduction
1.1 Related Work
1.2 Research Gap
Related work | Difference |
---|---|
Idowu and Kattukottai [9] | The related work clustered purchase behaviour, whereas this paper clusters the underlying visit intents. Purchase clusters are important, but an intent model is equally important because not every visitor commits to a purchase and not every website supports purchases |
Porsche et al. [10] | The related work used custom metrics and descriptive statistics to track visitors’ reading behaviour. This paper instead employs clustering models, an advanced analytics technique that generalizes the underlying intents better than descriptive statistics |
Domazet and Simovic [11] | The related work used descriptive statistics to measure a website’s performance. This paper instead employs clustering models, an advanced analytics technique that generalizes the underlying intents better than descriptive statistics |
Jonathan et al. [12] | The related work gives little detail on the analytics employed. Moreover, the audience behaviour expected on a church website would differ fundamentally from that on the website studied in this paper |
Semeradova and Weinlich [13] | The related work proposed web analytics through the Google Analytics interface, which reports at an aggregate level. Aggregate analytics gives a high-level overview that implicitly assumes every visitor behaves in the same way. The methods employed in this paper allow for better visitor profiling |
Kalyankar and Anute [14] | The related work did not use clustering methods to understand visitors’ behaviour. This paper instead employs clustering models, an advanced analytics technique that generalizes the underlying intents better than descriptive statistics |
Cirlugea et al. [15] | The related work focused on the effect of marketing on website volumes. However, it did not address the quality of a visit by detailing visitors’ activities once on the website |
Rosqa and Ati [16] | The related work here was primarily exploratory |
Pirvu and Anghel [17] | The related work aimed to predict behaviour, but the authors did not explain the types of behaviours or events being predicted. A clustering step (as proposed in this paper) should have come first to identify them |
Mariyapillai and Pratheepan [18] | The related work here was primarily exploratory, aiming to determine the geo-location of the visits |
Stelian and Stoicu-Tivadar [19] | The related work here was primarily exploratory, with its main interest in visitor interaction with virtual bone structures. Clustering may not be appropriate in that application if visitors are guided through it, as opposed to a website where a visitor is free to behave as desired |
2 Materials and Methods
Notation | Description |
---|---|
k | A positive integer representing the number of clusters |
x | A single data point of feature X |
\(\epsilon \) | A predefined hyper-parameter that sets the maximum distance at which two points count as neighbours |
n | Minimum number of points within a cluster |
2.1 K-means
2.2 Hierarchical Clustering
2.3 Density-Based Spatial Clustering of Applications with Noise
- Core point: a data point with at least the minimum number of neighbours within epsilon (\(\epsilon \)) distance.
- Border point: a data point with at least one core point within epsilon (\(\epsilon \)) distance but fewer than the minimum number of neighbours within epsilon (\(\epsilon \)) distance of itself.
- Noise point: a data point with no core points within epsilon (\(\epsilon \)) distance, which therefore cannot be placed into a cluster.
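The three definitions above can be illustrated with a minimal sketch. It is not the implementation used in this paper: the function name `classify_points`, the parameter names `eps` and `min_pts` (standing in for \(\epsilon \) and n), the toy coordinates, and the choice to exclude a point from its own neighbour count are all assumptions made for illustration.

```python
from math import dist

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' (illustrative only)."""
    # Neighbours of each point within eps. Assumption: a point is not
    # counted as its own neighbour (some libraries do count it).
    neighbours = [
        [j for j, q in enumerate(points) if j != i and dist(p, q) <= eps]
        for i, p in enumerate(points)
    ]
    # Core points have at least min_pts neighbours within eps.
    core = {i for i, nb in enumerate(neighbours) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighbours):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            # Not dense enough itself, but within eps of a core point.
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# Hypothetical 2-D toy data: a small dense group, one fringe point, one outlier.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (10.0, 10.0)]
labels = classify_points(points, eps=1.1, min_pts=2)
```

In this toy data the four points of the unit square are core points, the point at (2, 1) is a border point because its only neighbour is a core point, and the point at (10, 10) is noise.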
2.4 Cubic Clustering Criterion
2.5 Silhouette Coefficient
- Clusters with silhouette coefficients closer to +1 imply very tight observations within the cluster (homogeneous).
- Clusters with silhouette coefficients closer to 0 imply possibly overlapping clusters.
- Clusters with silhouette coefficients closer to −1 imply that observations are not very similar to the rest of their cluster and may have been assigned to the wrong one.
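As a rough illustration of how a per-cluster silhouette coefficient can be obtained, the sketch below assumes the standard definition \(s = (b - a) / \max (a, b)\), where a is a point's mean distance to the other members of its own cluster and b is its smallest mean distance to any other cluster, with Euclidean distance. The function name and toy data are hypothetical, and this is not necessarily the exact procedure used in this paper.

```python
from math import dist

def silhouette_per_cluster(points, labels):
    """Mean silhouette coefficient of each cluster (needs >= 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = {lab: [] for lab in clusters}
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores[lab].append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's zero distance to itself adds nothing to the sum).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores[lab].append((b - a) / denom if denom > 0 else 0.0)
    return {lab: sum(s) / len(s) for lab, s in scores.items()}

# Hypothetical toy data: two tight, well-separated clusters.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
scores = silhouette_per_cluster(points, labels=[0, 0, 1, 1])
```

Both toy clusters score close to +1 because each point is far nearer to its own cluster than to the other one.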
3 Cluster Data
3.1 Feature Selection
Feature name | Data type | Feature description |
---|---|---|
Accreditations | Numeric | Count of visits the user made to this page within each session |
Apprenticeship | Numeric | Count of visits the user made to this page within each session |
Bounces | Binary | Flags if the session was a single page visit only |
Contact-us | Numeric | Count of visits the user made to this page within each session |
Courses | Numeric | Count of visits the user made to this page within each session |
Customised-engineering-trading | Numeric | Count of visits the user made to this page within each session |
daysSinceLastSession | Numeric | The number of days since the user’s previous session on the website |
Distance | Numeric | The Euclidean distance between the user’s co-ordinates and the company’s co-ordinates (owner of the website) |
Engineering-academic-studies | Numeric | Count of visits the user made to this page within each session |
Engineering-Trade | Numeric | Count of visits the user made to this page within each session |
Hits | Numeric | Represents any action on a webpage that results in data being sent to Google Analytics (such as page clicks, etc.) |
Home | Numeric | Count of visits the user made to this page within each session |
OrganicSearches | Binary | Flags whether the user reached the webpage through an organic search, i.e. a search the user constructed rather than a clicked web link |
Pageviews | Numeric | The number of instances a page was loaded (or reloaded) |
sessionCount | Numeric | The ordinal number of the session for the user (the nth time the user has accessed the website) |
SessionDuration | Numeric | The duration of the session (seconds) |
Short-courses-skilled-programmes | Numeric | Count of visits the user made to this page within each session |
Trade-test-arpl | Numeric | Count of visits the user made to this page within each session |
University-of-technology-uot | Numeric | Count of visits the user made to this page within each session |
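To make the feature definitions above concrete, the following sketch builds one feature row from a single session. It is only an illustration: the input format (an ordered list of page names plus coordinate pairs), the function name `session_features`, and the three-page subset in `PAGES` are assumptions, and the units of the coordinates are not specified in the source.

```python
from collections import Counter
from math import dist

# Hypothetical subset of the page-level features listed in the table.
PAGES = ["Home", "Courses", "Contact-us"]

def session_features(page_hits, user_xy, company_xy):
    """Build one feature row from a session's ordered list of page loads."""
    counts = Counter(page_hits)
    # Per-page features: count of visits to each page within the session.
    row = {page: counts.get(page, 0) for page in PAGES}
    # Bounces: flag sessions that consist of a single page visit only.
    row["Bounces"] = 1 if len(page_hits) == 1 else 0
    # Pageviews: number of page loads (reloads included) in the session.
    row["Pageviews"] = len(page_hits)
    # Distance: Euclidean distance between user and company co-ordinates.
    row["Distance"] = dist(user_xy, company_xy)
    return row

row = session_features(["Home", "Courses", "Courses"], (3.0, 4.0), (0.0, 0.0))
```

Here the hypothetical session yields a Courses count of 2, a Bounces flag of 0, and a Distance of 5.0 in whatever unit the coordinates use.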
3.2 Data Cleaning
4 Empirical Results
4.1 Number of Clusters
4.2 K-means
K-means clusters | Cluster size | % Size | Avg SessionDuration (s) | Avg bounce rate (%) | Avg organic search rate (%) | Avg hits | Avg sessionCount | Avg daysSinceLastSession | Avg distance | Avg pageviews |
---|---|---|---|---|---|---|---|---|---|---|
1 | 988 | 15.0 | 159.84 | 7.0 | 55.0 | 5.42 | 1.55 | 3.42 | 16.14 | 5.40 |
2 | 3260 | 49.6 | 43.98 | 34.0 | 40.0 | 2.47 | 1.54 | 2.48 | 32.51 | 2.45 |
3 | 1066 | 16.2 | 398.56 | 0.0 | 66.0 | 11.11 | 1.39 | 3.18 | 7.17 | 11.02 |
4 | 778 | 11.8 | 196.44 | 7.0 | 61.0 | 8.41 | 1.38 | 2.97 | 7.35 | 8.36 |
5 | 477 | 7.3 | 611.91 | 0.0 | 70.0 | 26.02 | 1.27 | 3.00 | 4.74 | 25.94 |
K-means clusters | Home page index | Accreditations page index | Apprenticeship page index | Contact-us page index | Courses page index | Engineering-trade page index | Engineering-academic-studies page index | Customised-engineering-trading page index | Short-courses-skilled-programmes page index | Trade-test-arpl page index | University-of-technology-uot page index |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0.94 | 0.09 | 0.08 | 1.15 | 0.26 | 0.02 | 0.06 | 0.02 | 0.00 | 0.11 | 0.04 |
2 | 0.86 | 0.05 | 0.05 | 0.00 | 0.24 | 0.02 | 0.04 | 0.01 | 0.00 | 0.05 | 0.03 |
3 | 1.52 | 0.22 | 0.66 | 0.11 | 1.98 | 0.37 | 0.11 | 0.20 | 0.00 | 0.66 | 0.26 |
4 | 1.15 | 0.11 | 0.11 | 0.15 | 0.97 | 0.16 | 0.17 | 0.09 | 1.21 | 0.20 | 0.10 |
5 | 1.95 | 0.64 | 0.79 | 0.45 | 2.10 | 0.82 | 1.07 | 0.92 | 1.18 | 1.17 | 0.82 |
K-means clusters | Silhouette coefficient | Outstanding attributes |
---|---|---|
1 | 0.31 | Over-index on “contact-us” page, low bounce rate, moderate session duration |
2 | 0.40 | Very low session duration, very low pageviews, high bounce rate, furthest distance from the corporate location |
3 | \(-0.05\) | Fairly high session duration, fairly high pageviews, low bounce rate |
4 | 0.13 | Over-index on “courses” page and “short-courses” page, moderate session duration, low bounce rate |
5 | 0.01 | Very high session duration, very high pageviews, very close geo-proximity to the corporate co-ordinates, low bounce rate and high organic search rate |
4.3 Hierarchical Clustering
Hierarchical clusters | Cluster size | % Size | Avg SessionDuration (s) | Avg bounce rate (%) | Avg organic search rate (%) | Avg hits | Avg sessionCount | Avg daysSinceLastSession | Avg distance | Avg pageviews |
---|---|---|---|---|---|---|---|---|---|---|
1 | 2509 | 38.2 | 306.63 | 6.0 | 57.0 | 11.67 | 1.63 | 6.02 | 8.26 | 11.60 |
2 | 551 | 8.4 | 202.87 | 9.0 | 62.0 | 7.96 | 1.35 | 1.24 | 8.44 | 7.88 |
3 | 720 | 11.0 | 4.06 | 56.0 | 22.0 | 1.54 | 1.03 | 0.02 | 124.30 | 1.54 |
4 | 1994 | 30.4 | 91.47 | 29.0 | 51.0 | 2.92 | 1.48 | 0.15 | 4.30 | 2.90 |
5 | 795 | 12.1 | 131.41 | 9.0 | 53.0 | 4.53 | 1.50 | 3.14 | 17.77 | 4.52 |
Hierarchical clusters | Home page index | Accreditations page index | Apprenticeship page index | Contact-us page index | Courses page index | Engineering-trade page index | Engineering-academic-studies page index | Customised-engineering-trading page index | Short-courses-skilled-programmes page index | Trade-test-arpl page index | University-of-technology-uot page index |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1.28 | 0.35 | 0.53 | 0.22 | 1.30 | 0.38 | 0.33 | 0.30 | 0.34 | 0.49 | 0.35 |
2 | 1.15 | 0.00 | 0.14 | 0.20 | 0.96 | 0.03 | 0.21 | 0.05 | 1.19 | 0.18 | 0.02 |
3 | 0.68 | 0.00 | 0.00 | 0.00 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 |
4 | 1.07 | 0.00 | 0.00 | 0.00 | 0.46 | 0.00 | 0.00 | 0.00 | 0.00 | 0.14 | 0.00 |
5 | 0.89 | 0.00 | 0.00 | 1.15 | 0.21 | 0.00 | 0.00 | 0.00 | 0.00 | 0.09 | 0.00 |
Hierarchical clusters | Silhouette coefficient | Outstanding attributes |
---|---|---|
1 | \(-0.21\) | High session duration, high pageviews, low bounce rate |
2 | 0.15 | Over-index on “short-courses” page and “courses” page, moderate session duration, low bounce rate, highest organic search rate |
3 | 0.60 | Very low session duration, very low pageviews, high bounce rate, furthest distance from the corporate location |
4 | 0.38 | Moderate session duration, few pageviews, moderate bounce rate, close proximity to corporate co-ordinates |
5 | 0.39 | Over-index on “contact-us” page, low bounce rate, moderate session duration |
4.4 DBSCAN
DBSCAN clusters | Cluster size | % Size | Avg SessionDuration (s) | Avg bounce rate (%) | Avg organic search rate (%) | Avg hits | Avg sessionCount | Avg daysSinceLastSession | Avg distance | Avg pageviews |
---|---|---|---|---|---|---|---|---|---|---|
0 | 3151 | 48.0 | 331.19 | 5.0 | 56.0 | 11.17 | 1.62 | 5.74 | 10.68 | 11.10 |
1 | 2440 | 37.1 | 26.41 | 40.0 | 42.0 | 2.05 | 1.37 | 0.16 | 39.13 | 2.04 |
2 | 592 | 9.0 | 58.72 | 8.0 | 58.0 | 3.78 | 1.34 | 0.15 | 11.27 | 3.77 |
3 | 270 | 4.1 | 78.29 | 15.0 | 60.0 | 5.07 | 1.21 | 0.11 | 5.33 | 5.05 |
4 | 116 | 1.8 | 60.01 | 3.0 | 67.0 | 3.44 | 1.30 | 0.13 | 2.87 | 3.38 |
DBSCAN clusters | Home page index | Accreditations page index | Apprenticeship page index | Contact-us page index | Courses page index | Engineering-trade page index | Engineering-academic-studies page index | Customised-engineering-trading page index | Short-courses-skilled-programmes page index | Trade-test-arpl page index | University-of-technology-uot page index |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.32 | 0.28 | 0.40 | 0.30 | 1.28 | 0.31 | 0.30 | 0.25 | 0.39 | 0.54 | 0.28 |
1 | 0.91 | 0.00 | 0.00 | 0.00 | 0.24 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
2 | 0.78 | 0.00 | 0.00 | 1.08 | 0.06 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
3 | 0.95 | 0.00 | 0.00 | 0.00 | 0.80 | 0.00 | 0.00 | 0.00 | 1.09 | 0.00 | 0.00 |
4 | 0.46 | 0.00 | 1.17 | 0.00 | 0.16 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
DBSCAN clusters | Silhouette coefficient | Outstanding attributes |
---|---|---|
0 | \(-0.11\) | High session duration, high pageviews, low bounce rate |
1 | 0.45 | Very low session duration, very low pageviews, high bounce rate, furthest average distance from the corporate location |
2 | 0.45 | Over-index on “contact-us” page, low bounce rate, moderate session duration |
3 | 0.21 | Over-index on “short-courses” page, high index on the “courses” page, moderate session duration, low bounce rate |
4 | \(-0.34\) | Over-index on “apprenticeship” page, moderate session duration, low bounce rate, high organic search rate |
Method | Cluster | Size (%) | Silhouette coeff. | Persona |
---|---|---|---|---|
K-means | 1 | 15.04 | 0.31 | Get-in-touch |
K-means | 2 | 49.63 | 0.40 | Accidentals/Drop-offs |
K-means | 3 | 16.23 | \(-0.05\) | Engrossed: Moderate engagement |
K-means | 4 | 11.84 | 0.13 | Seekers |
K-means | 5 | 7.26 | 0.01 | Engrossed: High engagement |
Hierarchical | 1 | 38.19 | \(-0.21\) | Engrossed |
Hierarchical | 2 | 8.39 | 0.15 | Seekers |
Hierarchical | 3 | 10.96 | 0.60 | Accidentals |
Hierarchical | 4 | 30.35 | 0.38 | Drop-offs |
Hierarchical | 5 | 12.10 | 0.39 | Get-in-touch |
DBSCAN | 0 | 47.97 | \(-0.11\) | Noise (resembles Engrossed) |
DBSCAN | 1 | 37.14 | 0.45 | Accidentals/Drop-offs |
DBSCAN | 2 | 9.01 | 0.45 | Get-in-touch |
DBSCAN | 3 | 4.11 | 0.21 | Seekers: Short-courses |
DBSCAN | 4 | 1.77 | \(-0.34\) | Seekers: Apprenticeships |