Introduction
Background
Topic modelling applications
The development of probabilistic topic modelling
Network theory
Recent advances in network theory
Proposed method
Methodological framework
Corpus collection and text pre-processing
Topic modelling
Topic labelling
Topic network representation
Network topology | |
---|---|
Modularity | Measures non-trivial grouping structure within a network by comparing the observed number of edges within each subset of nodes with the number of edges expected from random assignment \(\sum_{k = 1}^{K} \left[ f_{kk}\left( G \right) - f_{kk}^{*} \right]^{2}\) where K is the number of groups and \(f_{kk}^{*}\) is the expected value of \(f_{kk}\) under some model of random edge assignment |
Transitivity | Measures the extent to which nodes in a network cluster together, based on the ratio of the number of triangles to the number of connected triples \(\frac{3\tau_{\Delta}\left( G \right)}{\tau_{3}\left( G \right)}\) where \(\tau_{\Delta}\left( G \right)\) is the number of triangles in the graph, and \(\tau_{3}\left( G \right)\) is the number of connected triples |
Density | Measures the ratio of the number of edges in a graph to the maximum number of possible edges \(\frac{\left| E \right|}{\left| V \right| \left( \left| V \right| - 1 \right) / 2}\) where \(\left| E \right|\) is the number of edges and \(\left| V \right|\) is the number of nodes in the graph |
Average Path Length | Measures the mean of the shortest path lengths between all pairs of nodes in a network \(\frac{1}{n \cdot \left( n - 1 \right)} \cdot \sum_{i \ne j} d\left( v_{i}, v_{j} \right)\) where \(d\left( v_{i}, v_{j} \right)\) is the length of the shortest path between nodes \(v_{i}\) and \(v_{j}\), and n is the number of nodes in the graph |
Diameter | Measures the largest distance between any pair of nodes in a network \(\max_{u, v} d\left( u, v \right)\) where \(d\left( u, v \right)\) is the distance between nodes u and v |
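
The topology measures defined above can be illustrated on a small graph. The following is a minimal sketch, not the implementation used in the study: it assumes a connected, undirected, unweighted graph without self-loops, and the function names (`bfs_distances`, `topology_metrics`) and the toy edge list are hypothetical. Modularity is left out because it depends on a grouping of nodes and a random-assignment model that the table does not specify.

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source):
    """Shortest-path lengths (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def topology_metrics(edges):
    """Density, transitivity, average path length and diameter.

    Assumes a connected, undirected, unweighted graph without self-loops.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2

    # Density: |E| / (|V| (|V| - 1) / 2)
    density = m / (n * (n - 1) / 2)

    # Connected triples: one per pair of neighbours around each centre node.
    triples = sum(len(nbrs) * (len(nbrs) - 1) // 2 for nbrs in adj.values())
    # Closed triples: every triangle is found once per corner, i.e. 3 times.
    closed = sum(
        1
        for nbrs in adj.values()
        for a, b in combinations(nbrs, 2)
        if b in adj[a]
    )
    transitivity = closed / triples if triples else 0.0

    # Distances over all ordered pairs i != j, via one BFS per node.
    all_dist = [
        d for s in adj for t, d in bfs_distances(adj, s).items() if t != s
    ]
    avg_path_length = sum(all_dist) / (n * (n - 1))
    diameter = max(all_dist)

    return {
        "density": density,
        "transitivity": transitivity,
        "average_path_length": avg_path_length,
        "diameter": diameter,
    }


# Toy graph: a triangle A-B-C with a pendant node D attached to C.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
metrics = topology_metrics(toy_edges)
print(metrics)
```

In the toy graph, the single triangle contributes three closed triples out of five connected triples, which is why the transitivity is below 1 even though most nodes sit in a triangle.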
Topic centrality | |
---|---|
Betweenness | The fraction of shortest paths that pass through a node \(\sum_{s \ne t \ne v \in V} \frac{\sigma\left( s, t \mid v \right)}{\sigma\left( s, t \right)}\) where \(\sigma\left( s, t \mid v \right)\) is the number of shortest paths between s and t that pass through v, and \(\sigma\left( s, t \right) = \sum_{v} \sigma\left( s, t \mid v \right)\) |
Degree | The number of edges connected to a node \(g\left( v \right) = {\text{deg}}\left( v \right)\) |
PageRank | A measure of node importance based on the likelihood of reaching a given node when randomly following links within a network \(\alpha \sum_{j} a_{ij} \frac{x_{j}}{L\left( j \right)} + \beta\) where \(a_{ij}\) is the adjacency-matrix entry for nodes i and j, \(L\left( j \right) = \sum_{i} a_{ij}\) is the number of neighbors of node j, and \(\alpha\) is a damping factor |
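
The three centrality measures above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the study's implementation: the graph is undirected and unweighted, betweenness is left unnormalised and counted over unordered pairs (computed with Brandes' algorithm), and PageRank is solved by simple fixed-point iteration with \(\beta = (1 - \alpha)/n\) and \(\alpha = 0.85\), both of which are assumed values. All function names and the toy graph are hypothetical.

```python
from collections import deque


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree(adj):
    """Number of edges attached to each node."""
    return {v: len(nbrs) for v, nbrs in adj.items()}


def betweenness(adj):
    """Unnormalised betweenness via Brandes' algorithm (unweighted graph)."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        order = []                      # nodes in order of discovery
        preds = {v: [] for v in adj}    # predecessors on shortest paths
        sigma = dict.fromkeys(adj, 0)   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back to s.
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair {s, t} was visited from both ends.
    return {v: b / 2 for v, b in bc.items()}


def pagerank(adj, alpha=0.85, iterations=100):
    """Iterate x_i = alpha * sum_j a_ij * x_j / L(j) + beta.

    beta = (1 - alpha) / n is an assumed choice; it keeps the scores
    summing to 1 when every node has at least one neighbour.
    """
    n = len(adj)
    beta = (1 - alpha) / n
    x = dict.fromkeys(adj, 1 / n)
    for _ in range(iterations):
        x = {
            i: alpha * sum(x[j] / len(adj[j]) for j in adj[i]) + beta
            for i in adj
        }
    return x


# Toy graph: a triangle A-B-C with a pendant node D attached to C.
adj = build_adjacency([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])
deg = degree(adj)
btw = betweenness(adj)
pr = pagerank(adj)
print(deg, btw, pr, sep="\n")
```

In the toy graph, node C is the only node lying on shortest paths between other nodes (the two paths connecting D to A and to B), so it is the only node with non-zero betweenness.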
Topic network evaluation
Additional analyses
Experiment one: case study
Experimental setting
Case study overview
Data description
Results
Topic modelling
Topic network evaluation
Year | Instances (Number of documents) | Vocabulary (Distinct tokens) | Nodes (Number of topics) | Edges (Topic co-occurrences) |
---|---|---|---|---|
1999 | 195 | 15,824 | 65 | 468 |
2000 | 220 | 16,927 | 64 | 456 |
2001 | 211 | 17,017 | 60 | 402 |
2002 | 230 | 17,922 | 63 | 420 |
2003 | 269 | 19,323 | 63 | 444 |
2004 | 303 | 19,848 | 58 | 360 |
2005 | 314 | 20,876 | 63 | 393 |
2006 | 299 | 20,704 | 59 | 363 |
2007 | 401 | 22,691 | 65 | 435 |
2008 | 405 | 22,939 | 62 | 435 |
2009 | 495 | 23,975 | 62 | 399 |
2010 | 509 | 24,531 | 65 | 390 |
2011 | 671 | 25,877 | 64 | 408 |
2012 | 683 | 25,897 | 60 | 360 |
2013 | 793 | 26,853 | 61 | 360 |
2014 | 943 | 27,612 | 61 | 339 |
2015 | 1,019 | 27,864 | 62 | 363 |
2016 | 1,188 | 28,262 | 59 | 306 |
2017 | 1,094 | 28,276 | 65 | 372 |
2018 | 1,229 | 28,527 | 63 | 327 |
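
As a worked example of the density definition (edges divided by \(\left| V \right| \left( \left| V \right| - 1 \right) / 2\)), the node and edge counts in the table can be plugged in directly. This is only an illustrative calculation: it assumes each yearly network is a simple undirected graph, which the table itself does not state, so the values need not match those reported in the study. The helper name `density` is hypothetical.

```python
def density(nodes, edges):
    """Edges divided by the maximum possible number of undirected edges."""
    return edges / (nodes * (nodes - 1) / 2)


# (nodes, edges) for the first and last years, copied from the table above.
yearly = {1999: (65, 468), 2018: (63, 327)}

densities = {year: density(n, m) for year, (n, m) in yearly.items()}
for year, value in densities.items():
    print(f"{year}: density = {value:.3f}")
```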
Topic subject area evaluation
Subject Area (Topic Community) | Short name | Average number of topics (per year) | Total topic interactions (20-year period) |
---|---|---|---|
Consumer Psychology | Psychology | 5.6 (± 1.7) | 12 |
Marketing | Marketing | 10.5 (± 2.5) | 20 |
Commercial Strategy | Strategy | 3.9 (± 1.5) | 6 |
Online & Digital | Digital | 5.7 (± 0.6) | 7 |
Systems & Technology | Technology | 5.8 (± 0.8) | 8 |
Sustainability & Preservation | Sustainability | 14.8 (± 1.7) | 22 |
Health & Wellness | Health | 9.7 (± 0.7) | 12 |
Economics & Finance | Economics | 7.6 (± 1.0) | 11 |
Topic evaluation
Consumer psychology & marketing
Consumer psychology
Marketing
Commercial strategy
Online & digital
Systems & technology
Sustainability & preservation
Health & wellness
Economics & finance
General, transitive & isolated topics
Additional analyses: opportunities for future research
Low degree and low prevalence (N = 4) | |
Given that these topics are under-researched, they present an opportunity to extend the body of knowledge within the field through further study | |
● Credit, Loans, Debt & Repayment (low PageRank) ● Data Security (low PageRank) ● Sport Consumption & Gamification (low PageRank) ● Gambling (moderate PageRank) | |
Low degree and high prevalence (N = 7) | |
These topics function as focal content with high popularity and limited associations. There is an opportunity to broaden the context in which these topics are addressed by combining them with other topics | |
● Automotive Vehicles (low PageRank) ● Commercial Innovation (low PageRank) ● Green Consumption (low PageRank) ● Online Shopping (low PageRank) | ● Social Norms & Identity (low PageRank) ● Subliminal & Social Influence (low PageRank) ● Supply Chain (low PageRank) |
High degree and low prevalence (N = 10) | |
Given their low prevalence but high exposure to other topics, these topics provide added context to more dominant topical subjects. This is particularly the case when PageRank is high | |
● Child & Youth Services (moderate PageRank) ● Coaching, Counselling & Therapy (moderate PageRank) ● Food Contamination & Safety (moderate PageRank) ● Health & Clinical Services (moderate PageRank) ● Smoking & Alcoholism (moderate PageRank) | ● Cattle Farming (high PageRank) ● Consumer Ethnocentrism (high PageRank) ● Illicit Substance Treatment (high PageRank) ● Multisensory & Atmospheric Effects (high PageRank) ● Water Management (high PageRank) |
High degree and high prevalence (N = 4) | |
These topics are focal areas of research that can vary in context. The popularity, influence, and relevance of these topics provide an opportunity to elevate less prevalent and less central topics by incorporating them for added context in future research | |
● Health Education & Intervention (moderate PageRank) ● Healthy Eating, Diet & Nutrition (moderate PageRank) ● Organic & Genetically Modified Foods (moderate PageRank) ● Market Equilibrium & Competition (high PageRank) | |
Discussion and concluding remarks
Experiment two: scalability tests
Experimental setting
Network size
Nodes | Edges | Run-time |
---|---|---|
11,174 | 23,409 | 3.092 secs |
111,740 | 234,090 | 31.09 secs |
558,700 | 1,170,450 | 115.77 secs |
1,117,400 | 2,340,900 | 262.63 secs |
11,174,000 | 23,409,000 | 2,485.04 secs |
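
A quick way to read the run-times above is to normalise them by network size. The sketch below uses only the figures from the table; the variable names are hypothetical, and the per-edge cost is a descriptive ratio rather than a claim about the algorithm's complexity.

```python
# (nodes, edges, run-time in seconds), copied from the table above.
runs = [
    (11_174, 23_409, 3.092),
    (111_740, 234_090, 31.09),
    (558_700, 1_170_450, 115.77),
    (1_117_400, 2_340_900, 262.63),
    (11_174_000, 23_409_000, 2_485.04),
]

_, base_edges, base_time = runs[0]

# Growth in size and run-time relative to the smallest network.
size_factors = [edges / base_edges for _, edges, _ in runs]
time_factors = [seconds / base_time for _, _, seconds in runs]

# Run-time per edge in microseconds.
per_edge_us = [seconds / edges * 1e6 for _, edges, seconds in runs]

for (nodes, _, _), size, time, cost in zip(
    runs, size_factors, time_factors, per_edge_us
):
    print(
        f"{nodes:>12,} nodes  size x{size:7.1f}  "
        f"time x{time:7.1f}  {cost:6.1f} us/edge"
    )
```

If the per-edge cost stays within a narrow band as the network grows by three orders of magnitude, the run-time is growing roughly in proportion to network size.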