Introduction
Related work
Unsupervised learning for event detection in social media streams
References | Data sources/ Features | Algorithms | Inclusion of local vocabulary | Treatment of Slang, Acronym, and Abbreviation (SAB) |
---|---|---|---|---|
[14] | Twitter/ Textual | LSH, Shannon entropy | No | No |
[20] | Twitter/ Textual | EDCoW, Term Frequency | No | No |
[21] | Twitter/ Textual | Term Frequency, Kullback–Leibler divergence | No | No |
[15] | Twitter/ Textual | Similarity score, Cluster summarization | No | No |
[26] | Twitter/ Textual | Fuzzy Hierarchical, Agglomerative Clustering | No | No |
[24] | Twitter/Textual, Spatiotemporal | TF-IDF, Normalised Mutual Information Frequency | No | No |
[16] | Twitter/ Textual | Named entity, TF-IDF, Sigma rule | No | No |
[23] | Twitter/Textual, Spatiotemporal | BIRCH | No | No |
[25] | Twitter/ Textual | OPTICS | No | No |
[27] | Twitter/Textual, Spatiotemporal | Wavelet decomposition, Modularity-based clustering | No | No |
 | Weibo, Twitter/Textual | Expectation Maximization, MapReduce, K-means, Hierarchical agglomerative | No | No |
[29] | Twitter, Flickr, YouTube/Textual | Incremental TF-IDF, Skewness, Learn-and-Forget term selection, Growing Gaussian Mixture Model | No | No |
[12] | Twitter, Tumblr/ Textual | Time-evolving graphs | No | No |
[30] | Twitter/Textual | Longest Common Subsequence, Incremental clustering | No | No |
[31] | Twitter/ Textual | Entity co-occurrence, Louvain clustering, Aggregate ranking | No | No |
[32] | Twitter/Multimedia | Incremental clustering, Influence maximization algorithm | No | No |
[33] | Twitter, Facebook, Weibo/Textual | Index-based algorithm | No | No |
[35] | Twitter, Weibo/Textual | Sub-event representation learning | No | No |
[36] | Twitter/Textual | Spatiotemporal clustering | No | No |
Semi-supervised learning for event detection in social media streams
References | Data sources/ Features | Algorithms | Inclusion of local vocabulary | Treatment of Slang, Acronym, and Abbreviation (SAB) |
---|---|---|---|---|
[37] | Twitter/Textual | TF-IDF, Naive Bayes | No | No |
[38] | Tumblr/Textual, Spatiotemporal | MapReduce, Pig | No | No |
[39] | Twitter/Textual, Spatiotemporal | ATSED | No | No |
[40] | Twitter/Textual, Spatiotemporal | Geometric Discretization, Administrative Hierarchies | No | No |
[41] | Twitter/ Textual | K-means, LDA, Hierarchical agglomerative | No | No |
[42] | Twitter/ Textual | Multinomial Naïve Bayes, Incremental DBSCAN | No | No |
[18] | Twitter/ Textual | ANN, AvgW2V, Mini-batch cluster | No | No |
[43] | Twitter/ Textual | Hierarchical Dirichlet Process | No | No |
Supervised learning for event detection in social media streams
References | Data sources/ Features | Algorithms | Inclusion of local vocabulary | Treatment of Slang, Acronym, and Abbreviation (SAB) |
---|---|---|---|---|
[45] | Twitter/Textual, Spatial | Variable Dimensional Extendible Hash | No | No |
[46] | Twitter/Textual | Association Rule Mining | No | No |
[44] | Twitter/Textual, Geo-spatial | Naïve Bayes, Multilayer perceptron, Pruned C4.5 | No | No |
[47] | Twitter/Textual | Streaming Non-negative Matrix Factorization | No | No |
[48] | Twitter/Textual | Soft Frequent Pattern Mining | No | No |
[49] | Flickr, Instagram/ Multimodal | TF-IDF, SVM | No | No |
[50] | Video/Multimedia | RNN, LSTM | No | No |
[51] | Video/Multimedia | CNN, Smoothing technique | No | No |
[52] | Weibo/Textual | TextRank, SVM | No | No |
[53] | Twitter/Textual | Deep Belief Network, LSTM | No | No |
[54] | Facebook/Textual | Word2Vec, Random Forest, Gated Recurrent Unit, LSTM | No | No |
Semantic-based approaches for event detection in social media streams
References | Data sources/ Features | Algorithms | Inclusion of local vocabulary | Treatment of Slang, Acronym, and Abbreviation (SAB) |
---|---|---|---|---|
[56] | Twitter, Facebook/ Textual | LSH | No | No |
[55] | Twitter/ Textual | Hash key grouping | No | No |
[58] | Twitter/ Textual | RDF | No | No |
[59] | Twitter/ Textual | TF-IDF, Named entity Recognition, Page Rank, CfsSubsetEval | No | No |
[60] | Twitter/Textual | IPLSA, EM, RS Scoring algorithm, word2vec | No | No |
 | Unsupervised learning approaches | Semi-supervised learning approaches | Supervised learning approaches | Semantic-based approaches |
---|---|---|---|---|
Strengths | • Detects events without any particular regard to their nature • Can handle a large volume of data in real time | • Particularly useful when it is difficult to extract relevant features from data • A small amount of labelled data can lead to a significant accuracy improvement | • Results are highly accurate and trustworthy | • Provides contextual knowledge • Valuable for sense disambiguation • User-centric results • More precise results |
Weaknesses | • Difficulty in dealing with high-dimensionality data streams • Does not consider spatial relationships in the data | • Iteration results are not stable • Low accuracy | • Time-consuming • Requires a large amount of training data • Must handle concept drift • Labelling input and output variables requires expertise | • Difficult to construct |
Methodology
Description of main tasks during event detection using SMAFED
Formal definition of main tasks in SMAFED
High-level overview of SMAFED
The conceptual architecture of SMAFED
Data input layer
Data pre-processing layer
Data enrichment layer
The formal definition of the SABDA model
The algorithm for disambiguation of SAB (SABDA)
Illustration of SABDA pseudocode
baddo1 | Definition: When something bad happens, or something goes wrong Usage example: She said she wants a break, baddo Related term: bad |
baddo2 | Definition: Someone who is highly respected or seen as very good at what they do Usage example: I be baddo when it comes to computing Related term: baddo, best, respected, influential |
baddo3 | Definition: A shortened, more legit name for a badass Usage example: I know, what a baddo Related term: baddo, finger food |
baddo4 | Definition: A rear-end that generates noxious emission Usage example: The fume from this generator is baddo Related term: harmful, poisonous, unpleasant |
baddo1 | Definition: When something bad happens, or something goes wrong Usage example: She said she wants a break, baddo Related term: bad | 2 overlaps |
baddo2 | Definition: Someone who is highly respected or seen as very good at what they do Usage example: I be baddo when it comes to computing Related term: baddo, best, respected, influential | 6 overlaps |
baddo3 | Definition: A shortened, more legit name for a badass Usage example: I know, what a baddo Related term: baddo, finger food | 2 overlaps |
baddo4 | Definition: A rear-end that generates noxious emission Usage example: The fume from this generator is baddo Related term: harmful, poisonous, unpleasant | 2 overlaps |
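The overlap counts above follow a Lesk-style pattern: each candidate sense's gloss (definition, usage example, and related terms) is scored by how many tokens it shares with the tweet's context, and the sense with the highest overlap wins (here, baddo2). The full SABDA pseudocode is not reproduced here; the sketch below is a minimal Lesk-style overlap scorer consistent with the illustration, with abridged glosses and a hypothetical example tweet.

```python
# Minimal Lesk-style overlap scorer: score each candidate sense by the
# number of tokens its gloss shares with the tweet context, then pick
# the highest-scoring sense. This is a simplification of SABDA's
# scoring step, not the full algorithm.

def tokens(text):
    """Lower-case word tokens with surrounding punctuation stripped."""
    return {w.strip(".,!?") for w in text.lower().split() if w.strip(".,!?")}

def disambiguate(context, senses):
    """Return (best sense id, per-sense overlap scores)."""
    ctx = tokens(context)
    scores = {sid: len(ctx & tokens(gloss)) for sid, gloss in senses.items()}
    return max(scores, key=scores.get), scores

# Abridged "baddo" glosses from the illustration above
senses = {
    "baddo1": "When something bad happens, or something goes wrong. bad",
    "baddo2": "Someone who is highly respected or seen as very good at "
              "what they do. best respected influential",
    "baddo3": "A shortened, more legit name for a badass. finger food",
    "baddo4": "A rear-end that generates noxious emission. harmful poisonous",
}

# Hypothetical tweet context for illustration
tweet = "He is so good at computing, people say he is highly respected"
best, scores = disambiguate(tweet, senses)
print(best)  # baddo2, the sense whose gloss shares the most words
```

With this context, baddo2 overlaps on "is", "good", "at", "highly", and "respected", while the other senses share no content words, mirroring how baddo2 wins in the illustrated run.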
Event detection layer
The embedder
The Event clusterer
The Event clusterer algorithm
The Event ranker
The Event ranker algorithm
The Event summarizer
The Event summarizer algorithm
Evaluation experiment
Experiment I: impact of the data enrichment layer of SMAFED
Feature | GSMFPM | SMAFED |
---|---|---|
Disambiguation | No | Yes |
Abbreviation handling | No | Yes |
Acronym handling | No | Yes |
Inclusion of localised knowledge source | No | Yes |
Spell-checking module | No | Yes |
Dataset description
S/N | Dataset | Source(s) | Total | Selected | Training/Testing |
---|---|---|---|---|---|
1 | Twitter sentiment analysis training corpus | 1. University of Michigan Sentiment Analysis on Kaggle 2. Twitter sentiment corpus by Niek Sanders | 1,578,627 (original); 1,048,575 (after download) | 104,857 (10%) | 83,886/20,971 |
2 | Naija-Tweets | Tweets of Nigerian origin | 12,920 | 12,920 (100%) | 10,336/2,584 |
Feature extraction and representation
S/N | Dataset | Unigrams (total / Top-K) | Bigrams (total / Top-K) |
---|---|---|---|
1 | Twitter sentiment analysis training corpus | 76,522 / 50,000 | 501,026 / 150,000 |
2 | Naija-Tweets | 3,296 / 3,000 | 10,187 / 8,000 |
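The table reports total n-gram vocabulary sizes alongside the Top-K cutoffs retained as features. The paper does not spell out the selection criterion here; the sketch below assumes the common choice of ranking n-grams by raw corpus frequency and keeping the K most frequent, using a toy corpus for illustration.

```python
from collections import Counter

def top_k_ngrams(docs, n, k):
    """Count all word n-grams across the corpus and keep the k most
    frequent. Frequency ranking is an assumption; the table only gives
    the vocabulary sizes before and after the Top-K cut."""
    counts = Counter()
    for doc in docs:
        words = doc.lower().split()
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return [gram for gram, _ in counts.most_common(k)]

# Toy corpus standing in for the tweet datasets
docs = ["the match was great", "the match ended late", "great match indeed"]
print(top_k_ngrams(docs, 1, 3))  # three most frequent unigrams
print(top_k_ngrams(docs, 2, 2))  # two most frequent bigrams
```

On the real corpora this would reduce the 76,522 unigrams to the 50,000 most frequent, and likewise for the bigram vocabularies.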
Classifiers
Experiment I: result and discussion
Epoch | Unigram SMAFED | Unigram GSMFPM | Bigram SMAFED | Bigram GSMFPM | Unigram+Bigram SMAFED | Unigram+Bigram GSMFPM |
---|---|---|---|---|---|---|
1 | 0.5478 | 0.5499 | 0.5612 | 0.5729 | 0.5535 | 0.5405 |
2 | 0.4911 | 0.5132 | 0.4405 | 0.4872 | 0.5004 | 0.5192 |
3 | 0.4537 | 0.4658 | 0.3909 | 0.4427 | 0.4710 | 0.5087 |
4 | 0.4045 | 0.4147 | 0.3015 | 0.3528 | 0.4047 | 0.4584 |
5 | 0.3741 | 0.3762 | 0.2181 | 0.3177 | 0.3484 | 0.3698 |
Epoch | 1-layer SMAFED | 1-layer GSMFPM | 2-layer SMAFED | 2-layer GSMFPM | 3-layer SMAFED | 3-layer GSMFPM | 4-layer SMAFED | 4-layer GSMFPM |
---|---|---|---|---|---|---|---|---|
1 | 0.5852 | 0.4369 | 0.6360 | 0.7843 | 0.6279 | 0.7762 | 0.5584 | 0.7067 |
2 | 0.4761 | 0.3928 | 0.5399 | 0.6232 | 0.4927 | 0.5760 | 0.4713 | 0.5546 |
3 | 0.4213 | 0.3725 | 0.4674 | 0.5162 | 0.4434 | 0.4922 | 0.4263 | 0.4751 |
4 | 0.3622 | 0.3556 | 0.4195 | 0.4261 | 0.3943 | 0.4009 | 0.3836 | 0.3902 |
5 | 0.3026 | 0.3397 | 0.3797 | 0.4168 | 0.3515 | 0.3886 | 0.3320 | 0.3691 |
6 | 0.2241 | 0.3255 | 0.3403 | 0.4417 | 0.2990 | 0.4004 | 0.2770 | 0.3784 |
7 | 0.1864 | 0.3002 | 0.3045 | 0.4183 | 0.2736 | 0.3874 | 0.2501 | 0.3639 |
8 | 0.1496 | 0.2892 | 0.2576 | 0.3972 | 0.2425 | 0.3821 | 0.2169 | 0.3565 |
Epoch | Unigram SMAFED | Unigram GSMFPM | Bigram SMAFED | Bigram GSMFPM | Unigram+Bigram SMAFED | Unigram+Bigram GSMFPM |
---|---|---|---|---|---|---|
1 | 0.3660 | 0.3943 | 0.2911 | 0.3927 | 0.2632 | 0.3527 |
2 | 0.2552 | 0.2712 | 0.2362 | 0.2661 | 0.1908 | 0.2583 |
3 | 0.2108 | 0.2223 | 0.2009 | 0.2149 | 0.1785 | 0.2003 |
4 | 0.1688 | 0.2172 | 0.1665 | 0.2035 | 0.1216 | 0.1814 |
5 | 0.1534 | 0.2102 | 0.1759 | 0.2058 | 0.1112 | 0.2007 |
Epoch | 1-layer SMAFED | 1-layer GSMFPM | 2-layer SMAFED | 2-layer GSMFPM | 3-layer SMAFED | 3-layer GSMFPM | 4-layer SMAFED | 4-layer GSMFPM |
---|---|---|---|---|---|---|---|---|
1 | 2.1429 | 2.3510 | 0.8408 | 0.9112 | 0.7448 | 0.7953 | 0.7055 | 0.7676 |
2 | 0.7991 | 0.8535 | 0.6119 | 0.6663 | 0.6419 | 0.6949 | 0.6336 | 0.6747 |
3 | 0.4748 | 0.5268 | 0.5363 | 0.5783 | 0.6036 | 0.7111 | 0.6344 | 0.6643 |
4 | 0.3693 | 0.4804 | 0.4152 | 0.5161 | 0.5511 | 0.6353 | 0.6120 | 0.6609 |
5 | 0.2316 | 0.3457 | 0.2479 | 0.3622 | 0.3897 | 0.4465 | 0.5590 | 0.6303 |
6 | 0.1429 | 0.2999 | 0.1266 | 0.2736 | 0.2231 | 0.2822 | 0.4445 | 0.4895 |
7 | 0.0878 | 0.1489 | 0.0873 | 0.1484 | 0.1469 | 0.1852 | 0.2588 | 0.3581 |
8 | 0.0501 | 0.0852 | 0.0520 | 0.0771 | 0.1031 | 0.1431 | 0.1780 | 0.1982 |
SMAFED efficiency
Experiment II: accuracy of SMAFED
Dataset description
Type | Number of Tweets |
---|---|
Event | 82,887 |
Non-Event | 142,652 |
Total | 225,554 |
Experiment II: result and discussion
Approach | Handles concept drift | Handles SAB | Includes disambiguation | Includes localised knowledge source | Clustering | Ranking | Summarisation |
---|---|---|---|---|---|---|---|
CS | No | No | No | No | Similarity score | No. of tweets | No |
LSH | No | No | No | No | Hash function | Shannon entropy | No |
Entity-based | No | No | No | No | TF-IDF | No. of tweets | No |
Repp framework | No | No | No | No | Cosine distance | No. of tweets | No |
SMAFED | Yes | Yes | Yes | Yes | Semantic similarity | Weighting scheme | Yes |
Approach | References | Precision | Recall | F-Measure |
---|---|---|---|---|
LSH | [14] | 382/1340 (0.285) | 156/506 (0.308) | 0.296 |
CS | [15] | 53/1097 (0.048) | 32/506 (0.063) | 0.054 |
Entity-based | [16] | 181/586 (0.302) | 159/506 (0.310) | 0.306 |
Repp framework | [17] | 271/300 (0.901) | 112/150 (0.749) | 0.818 |
SMAFED | Authors (2021) | 296/321 (0.922) | 119/150 (0.793) | 0.853 |