1 Introduction
2 Contributions: a review in a nutshell with anticipated impacts
3 Related work
Authors | Datasets | Feature representation techniques | Feature selection techniques | Classifier | Evaluation metric |
---|---|---|---|---|---|
Ali et al. [14] | Manually classified news corpus | Normalized term frequency | – | NB, SVM | Accuracy |
Usman et al. [15] | News Corpus | Term Frequency (TF) | – | NB, BNB, LSVM, LSGB, RF | Precision, Recall, F1-score |
Sattar et al. [17] | Urdu News Editorials | Term Frequency (TF) | – | NB | Precision, Recall, F1-score |
Ahmed et al. [16] | Urdu News Headlines | TF-IDF | TF-IDF (Thresholding) | SVM | Accuracy |
Zia et al. [18] | EMILLE, self-collected Naive corpus (news) | TF-IDF | Information Gain, Chi-Square, Gain Ratio, Symmetrical Uncertainty | KNN, DT, NB | F1-score |
Adeeba et al. [19] | CLE Urdu Digest (1000K, 1 Million) | Term Frequency (TF), TF-IDF | Pruning | NB, SVM (linear, radial, polynomial) | Precision, Recall, F1-score |
4 Adopted methodologies for Urdu text document classification
4.1 Traditional machine learning-based Urdu text document classification with filter-based feature selection algorithm
4.2 Preprocessing
4.3 Feature selection
 | \(t_j\) | \({\bar{t}}_j\) |
---|---|---|
Positive class | \(t_p\) | \(f_n\) |
Negative class | \(f_p\) | \(t_n\) |
4.3.1 Balanced accuracy measure (ACC2)
4.3.2 Normalized difference measure (NDM)
-
It has a high \(|t_{pr} - f_{pr}|\) value.
-
Either \(t_{pr}\) or \(f_{pr}\) is close to zero.
-
If any two terms have the same \(|t_{pr} - f_{pr}|\) value, then the higher rank must be assigned to the term with the smaller \(\min(t_{pr}, f_{pr})\) value.
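These criteria can be made concrete with a small sketch. Assuming the standard definitions \(t_{pr} = t_p/(t_p+f_n)\) and \(f_{pr} = f_p/(f_p+t_n)\) from the contingency table above, ACC2 and NDM for a term can be computed as follows (function names and example counts are illustrative, not from the original work):

```python
def rates(tp, fn, fp, tn):
    """True/false positive rates of a term from its contingency counts."""
    tpr = tp / (tp + fn)   # fraction of positive-class documents containing the term
    fpr = fp / (fp + tn)   # fraction of negative-class documents containing the term
    return tpr, fpr

def acc2(tp, fn, fp, tn):
    """Balanced accuracy measure: |tpr - fpr|."""
    tpr, fpr = rates(tp, fn, fp, tn)
    return abs(tpr - fpr)

def ndm(tp, fn, fp, tn, eps=1e-12):
    """Normalized difference measure: |tpr - fpr| / min(tpr, fpr)."""
    tpr, fpr = rates(tp, fn, fp, tn)
    return abs(tpr - fpr) / max(min(tpr, fpr), eps)  # eps guards division by zero

# Two hypothetical terms with identical ACC2 scores:
# term A: tpr = 0.6, fpr = 0.1;  term B: tpr = 0.9, fpr = 0.4
print(acc2(60, 40, 10, 90), ndm(60, 40, 10, 90))  # term A: higher NDM
print(acc2(90, 10, 40, 60), ndm(90, 10, 40, 60))  # term B: same ACC2, lower NDM
```

Both terms tie under ACC2, but NDM ranks term A higher because its rates are closer to zero, exactly as required by the third criterion above.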
4.3.3 Max–Min ratio (MMR)
4.3.4 Relative discrimination criterion (RDC)
4.3.5 Information gain (IG)
4.3.6 Chi-squared (CHISQ)
4.3.7 Odds ratio (OR)
4.3.8 Bi-normal separation (BNS)
4.3.9 Gini index (GINI)
4.3.10 Poisson ratio (POISON)
4.4 Feature representation
4.5 Classifiers
5 Deep learning methodologies
5.1 Input layer
5.2 Convolutional neural network (CNN)
5.2.1 Convolution layer
5.2.2 Pooling layer
5.2.3 Activation function
5.2.4 Batch normalization
5.2.5 Dropout
5.2.6 Fully connected layer
5.3 Recurrent neural network (RNN) and its variants (LSTM, GRU)
5.3.1 Long short-term memory (LSTM)
5.3.2 Gated recurrent unit (GRU)
5.4 Selection and optimization of model parameters
5.5 Adopted deep learning methodologies for Urdu text document classification
5.6 Transfer learning using BERT
5.7 Hybrid methodology for Urdu text document classification
6 Datasets
Class | No. of documents | No. of sentences | No. of tokens | No. of tokens after lemmatization |
---|---|---|---|---|
Agriculture | 102 | 669 | 17,967 | 9856 |
Business | 120 | 672 | 20,349 | 9967 |
Entertainment | 101 | 685 | 19,671 | 10,915 |
World | 111 | 631 | 18,589 | 12,812 |
Health-sciences | 108 | 823 | 27,409 | 12,190 |
Sports | 120 | 744 | 24,212 | 9992 |
Class | No. of documents | No. of sentences | No. of tokens | No. of tokens after lemmatization |
---|---|---|---|---|
Culture | 28 | 488 | 8767 | 8767 |
Health | 29 | 608 | 9895 | 9895 |
Letter | 35 | 777 | 11,794 | 11,794 |
Interviews | 36 | 597 | 12,129 | 12,129 |
Press | 29 | 466 | 10,007 | 10,007 |
Religion | 29 | 620 | 9839 | 9839 |
Science | 55 | 468 | 8700 | 8700 |
Sports | 29 | 588 | 10,030 | 10,030 |
Class | No. of documents | No. of sentences | No. of tokens | No. of tokens after lemmatization |
---|---|---|---|---|
Culture | 133 | 8784 | 145,228 | 145,228 |
Health | 153 | 11,542 | 169,549 | 169,549 |
Letter | 105 | 8565 | 115,177 | 115,177 |
Interviews | 38 | 2481 | 41,058 | 41,058 |
Press | 118 | 6106 | 125,896 | 125,896 |
Religion | 100 | 6416 | 107,071 | 107,071 |
Science | 109 | 6966 | 117,344 | 117,344 |
Sports | 31 | 2051 | 33,143 | 33,143 |
7 Experimental setup and results
7.1 Results of traditional machine learning-based text document classification methodology
7.1.1 DSL Urdu news dataset
Feature selection algorithms | Benchmark test points | |||||||
---|---|---|---|---|---|---|---|---|
10 | 20 | 50 | 100 | 200 | 500 | 1000 | 1500 | |
RDC [56] | 0.83 | 0.85 | 0.85 | 0.88 | 0.90 | 0.91 | 0.90 | 0.89 |
NDM [47] | 0.70 | 0.76 | 0.87 | 0.90 | 0.93 | 0.94 | 0.94 | 0.88 |
MMR [21] | 0.71 | 0.82 | 0.88 | 0.91 | 0.91 | 0.91 | 0.91 | 0.89 |
POISON [60] | 0.82 | 0.86 | 0.90 | 0.89 | 0.91 | 0.92 | 0.92 | 0.89 |
GINI [59] | 0.77 | 0.81 | 0.88 | 0.87 | 0.88 | 0.90 | 0.90 | 0.90 |
ACC2 [55] | 0.82 | 0.88 | 0.87 | 0.88 | 0.89 | 0.90 | 0.90 | 0.90 |
ODDS [58] | 0.70 | 0.82 | 0.88 | 0.91 | 0.91 | 0.91 | 0.90 | 0.89 |
IG [57] | 0.81 | 0.86 | 0.90 | 0.91 | 0.91 | 0.92 | 0.91 | 0.89 |
CHISQ [48] | 0.79 | 0.87 | 0.90 | 0.91 | 0.91 | 0.92 | 0.91 | 0.89 |
BNS [55] | 0.81 | 0.88 | 0.87 | 0.88 | 0.89 | 0.90 | 0.91 | 0.90 |
Feature Selection Algorithms | Benchmark Test Points | |||||||
---|---|---|---|---|---|---|---|---|
10 | 20 | 50 | 100 | 200 | 500 | 1000 | 1500 | |
RDC [56] | 0.80 | 0.79 | 0.80 | 0.84 | 0.86 | 0.88 | 0.88 | 0.88 |
NDM [47] | 0.74 | 0.78 | 0.86 | 0.88 | 0.90 | 0.91 | 0.91 | 0.90 |
MMR [21] | 0.77 | 0.83 | 0.87 | 0.89 | 0.89 | 0.88 | 0.89 | 0.90 |
POISON [60] | 0.80 | 0.83 | 0.85 | 0.88 | 0.87 | 0.89 | 0.88 | 0.90 |
GINI [59] | 0.76 | 0.77 | 0.83 | 0.83 | 0.83 | 0.86 | 0.88 | 0.88 |
ACC2 [55] | 0.80 | 0.84 | 0.82 | 0.85 | 0.86 | 0.87 | 0.88 | 0.89 |
ODDS [58] | 0.74 | 0.83 | 0.86 | 0.88 | 0.88 | 0.89 | 0.89 | 0.89 |
IG [57] | 0.83 | 0.85 | 0.84 | 0.88 | 0.89 | 0.89 | 0.89 | 0.90 |
CHISQ [48] | 0.81 | 0.85 | 0.86 | 0.88 | 0.88 | 0.88 | 0.90 | 0.90 |
BNS [55] | 0.81 | 0.83 | 0.82 | 0.85 | 0.87 | 0.87 | 0.88 | 0.88 |
7.1.2 CLE Urdu Digest 1M dataset
Feature Selection Algorithm | Benchmark Test Points | |||||||
---|---|---|---|---|---|---|---|---|
10 | 20 | 50 | 100 | 200 | 500 | 1000 | 1500 | |
RDC [56] | 0.65 | 0.64 | 0.66 | 0.63 | 0.62 | 0.61 | 0.59 | 0.55 |
NDM [47] | 0.51 | 0.58 | 0.63 | 0.64 | 0.65 | 0.61 | 0.57 | 0.60 |
MMR [21] | 0.52 | 0.54 | 0.58 | 0.60 | 0.62 | 0.59 | 0.51 | 0.46 |
POISON [60] | 0.50 | 0.60 | 0.61 | 0.61 | 0.62 | 0.56 | 0.48 | 0.45 |
GINI [59] | 0.13 | 0.14 | 0.46 | 0.59 | 0.62 | 0.62 | 0.62 | 0.60 |
ACC2 [55] | 0.65 | 0.66 | 0.65 | 0.65 | 0.64 | 0.62 | 0.57 | 0.53 |
ODDS [58] | 0.53 | 0.56 | 0.62 | 0.66 | 0.68 | 0.65 | 0.64 | 0.56 |
IG [57] | 0.62 | 0.62 | 0.63 | 0.64 | 0.63 | 0.60 | 0.49 | 0.45 |
CHISQ [48] | 0.57 | 0.59 | 0.64 | 0.63 | 0.62 | 0.57 | 0.48 | 0.48 |
BNS [55] | 0.65 | 0.64 | 0.64 | 0.64 | 0.64 | 0.63 | 0.56 | 0.53 |
Feature Selection Algorithm | Benchmark Test Points | |||||||
---|---|---|---|---|---|---|---|---|
10 | 20 | 50 | 100 | 200 | 500 | 1000 | 1500 | |
RDC [56] | 0.69 | 0.70 | 0.73 | 0.76 | 0.79 | 0.77 | 0.78 | 0.78 |
NDM [47] | 0.55 | 0.67 | 0.76 | 0.81 | 0.81 | 0.79 | 0.76 | 0.80 |
MMR [21] | 0.62 | 0.68 | 0.71 | 0.77 | 0.78 | 0.79 | 0.80 | 0.78 |
POISON [60] | 0.53 | 0.64 | 0.75 | 0.76 | 0.83 | 0.82 | 0.80 | 0.78 |
GINI [59] | 0.27 | 0.34 | 0.60 | 0.67 | 0.70 | 0.70 | 0.78 | 0.79 |
ACC2 [55] | 0.67 | 0.69 | 0.77 | 0.79 | 0.79 | 0.79 | 0.78 | 0.78 |
ODDS [58] | 0.59 | 0.69 | 0.74 | 0.77 | 0.80 | 0.76 | 0.80 | 0.82 |
IG [57] | 0.66 | 0.69 | 0.75 | 0.79 | 0.78 | 0.82 | 0.81 | 0.78 |
CHISQ [48] | 0.62 | 0.70 | 0.77 | 0.77 | 0.83 | 0.82 | 0.81 | 0.79 |
BNS [55] | 0.67 | 0.71 | 0.74 | 0.78 | 0.80 | 0.79 | 0.78 | 0.78 |
7.1.3 CLE Urdu Digest 1000K dataset
Feature selection algorithm | Benchmark test points | |||||||
---|---|---|---|---|---|---|---|---|
10 | 20 | 50 | 100 | 200 | 500 | 1000 | 1500 | |
RDC [56] | 0.57 | 0.58 | 0.56 | 0.55 | 0.51 | 0.31 | 0.25 | 0.17 |
NDM [47] | 0.63 | 0.70 | 0.71 | 0.55 | 0.40 | 0.27 | 0.36 | 0.28 |
MMR [21] | 0.64 | 0.65 | 0.81 | 0.71 | 0.67 | 0.51 | 0.31 | 0.21 |
POISON [60] | 0.50 | 0.50 | 0.55 | 0.57 | 0.43 | 0.36 | 0.21 | 0.22 |
GINI [59] | 0.35 | 0.35 | 0.46 | 0.50 | 0.52 | 0.39 | 0.21 | 0.10 |
ACC2 [55] | 0.60 | 0.63 | 0.67 | 0.60 | 0.44 | 0.37 | 0.17 | 0.17 |
ODDS [58] | 0.64 | 0.73 | 0.70 | 0.70 | 0.62 | 0.41 | 0.31 | 0.20 |
IG [57] | 0.69 | 0.76 | 0.74 | 0.71 | 0.46 | 0.32 | 0.21 | 0.19 |
CHISQ [48] | 0.73 | 0.72 | 0.81 | 0.79 | 0.61 | 0.46 | 0.31 | 0.22 |
BNS [55] | 0.61 | 0.63 | 0.67 | 0.61 | 0.50 | 0.37 | 0.12 | 0.17 |
Feature Selection Algorithms | Benchmark Test Points | |||||||
---|---|---|---|---|---|---|---|---|
10 | 20 | 50 | 100 | 200 | 500 | 1000 | 1500 | |
RDC [56] | 0.63 | 0.64 | 0.67 | 0.65 | 0.74 | 0.73 | 0.72 | 0.65 |
NDM [47] | 0.70 | 0.81 | 0.92 | 0.90 | 0.87 | 0.73 | 0.66 | 0.69 |
MMR [21] | 0.66 | 0.74 | 0.85 | 0.86 | 0.81 | 0.82 | 0.72 | 0.68 |
POISON [60] | 0.61 | 0.70 | 0.84 | 0.85 | 0.86 | 0.84 | 0.75 | 0.66 |
GINI [59] | 0.50 | 0.47 | 0.54 | 0.59 | 0.64 | 0.77 | 0.75 | 0.71 |
ACC2 [55] | 0.61 | 0.65 | 0.67 | 0.73 | 0.72 | 0.75 | 0.74 | 0.67 |
ODDS [58] | 0.71 | 0.79 | 0.74 | 0.86 | 0.85 | 0.78 | 0.68 | 0.64 |
IG [57] | 0.63 | 0.68 | 0.80 | 0.86 | 0.81 | 0.80 | 0.75 | 0.66 |
CHISQ [48] | 0.69 | 0.72 | 0.77 | 0.83 | 0.86 | 0.82 | 0.71 | 0.66 |
BNS [55] | 0.61 | 0.64 | 0.67 | 0.70 | 0.77 | 0.75 | 0.74 | 0.67 |
7.1.4 Discussion
Classifier | Number of Features | |||||||
---|---|---|---|---|---|---|---|---|
10 | 20 | 50 | 100 | 200 | 500 | 1000 | 1500 | |
NB [26] | RDC [56] | BNS [55] | IG [57] | ODDS [58] | NDM [47] | NDM [47] | NDM [47] | BNS [55] |
SVM [25] | IG [57] | IG [57] | MMR [21] | MMR [21] | NDM [47] | NDM [47] | NDM [47] | MMR [21] |
Classifier | Number of Features | |||||||
---|---|---|---|---|---|---|---|---|
10 | 20 | 50 | 100 | 200 | 500 | 1000 | 1500 | |
NB [26] | CHISQ [48] | IG [57] | CHISQ [48] | CHISQ [48] | MMR [21] | MMR [21] | NDM [47] | NDM [47] |
SVM [25] | ODDS [58] | NDM [47] | NDM [47] | NDM [47] | NDM [47] | POISON [60] | IG [57] | GINI [59] |
Classifier | Number of Features | |||||||
---|---|---|---|---|---|---|---|---|
10 | 20 | 50 | 100 | 200 | 500 | 1000 | 1500 | |
NB [26] | ACC2 [55] | ACC2 [55] | RDC [56] | ODDS [58] | ODDS [58] | ODDS [58] | ODDS [58] | NDM [47] |
SVM [25] | RDC [56] | BNS [55] | ACC2 [55] | NDM [47] | CHISQ [48] | POISON [60] | CHISQ [48] | ODDS [58] |
FR Metric | RDC [56] | NDM [47] | MMR [21] | POISON [60] | GINI [59] | ACC2 [55] | ODDS [58] | IG [57] | CHISQ [48] | BNS [55] |
---|---|---|---|---|---|---|---|---|---|---|
Test Point | 500 | 1000 | 100 | 500 | 500 | 500 | 100 | 500 | 500 | 1000 |
F1 Score | 0.91 | 0.94 | 0.91 | 0.92 | 0.90 | 0.90 | 0.91 | 0.92 | 0.92 | 0.91 |
FR Metric | RDC [56] | NDM [47] | MMR [21] | POISON [60] | GINI [59] | ACC2 [55] | ODDS [58] | IG [57] | CHISQ [48] | BNS [55] |
---|---|---|---|---|---|---|---|---|---|---|
Test Point | 200 | 50 | 100 | 200 | 500 | 500 | 100 | 100 | 200 | 200 |
F1 Score | 0.74 | 0.92 | 0.86 | 0.86 | 0.77 | 0.75 | 0.86 | 0.86 | 0.86 | 0.77 |
FR Metric | RDC [56] | NDM [47] | MMR [21] | POISON [60] | GINI [59] | ACC2 [55] | ODDS [58] | IG [57] | CHISQ [48] | BNS [55] |
---|---|---|---|---|---|---|---|---|---|---|
Test Point | 200 | 100 | 1000 | 200 | 1500 | 100 | 1500 | 500 | 200 | 200 |
F1 Score | 0.79 | 0.81 | 0.80 | 0.83 | 0.79 | 0.79 | 0.82 | 0.82 | 0.83 | 0.80 |
FR Metric | DSL Urdu News | CLE Urdu Digest 1000K | CLE Urdu Digest 1M | Average |
---|---|---|---|---|
RDC [56] | 6.25 | 0 | 12.5 | 6.25 |
NDM [47] | 37.5 | 37.5 | 12.5 | 29.16 |
MMR [21] | 18.75 | 12.5 | 0 | 10.41 |
POISON [60] | 0 | 6.25 | 6.25 | 4.16 |
GINI [59] | 0 | 6.25 | 0 | 2.08 |
ACC2 [55] | 0 | 0 | 18.75 | 6.25 |
ODDS [58] | 6.25 | 6.25 | 31.25 | 14.58 |
IG [57] | 18.75 | 12.5 | 0 | 10.41 |
CHISQ [48] | 0 | 18.75 | 12.5 | 10.41 |
BNS [55] | 12.5 | 0 | 6.25 | 6.25 |
7.2 Results of adopted deep learning-based methodologies and the hybrid methodology
Model type | Models | Datasets | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
DSL Urdu News | CLE 1000K | CLE 1M | ||||||||
Full vocab | MF @3K | NDM @250 | Full vocab | MF @2K | NDM @350 | Full vocab | MF @10K | NDM @400 | ||
CNN | Kim et al. [123] | 0.88 | 0.90 | 0.93 | 0.46 | 0.42 | 0.77 | 0.66 | 0.60 | 0.74 |
CNN | Kalchbrenner et al. [124] | 0.89 | 0.91 | 0.91 | 0.57 | 0.63 | 0.69 | 0.63 | 0.70 | 0.75 |
CNN | Yin et al. [125] | 0.90 | 0.90 | 0.90 | 0.70 | 0.67 | 0.78 | 0.71 | 0.76 | 0.80 |
CNN | Zhang et al. [126] | 0.90 | 0.90 | 0.90 | 0.50 | 0.56 | 0.71 | 0.63 | 0.65 | 0.72 |
RNN | Yogatama et al. [127] | 0.88 | 0.89 | 0.91 | 0.51 | 0.64 | 0.56 | 0.71 | 0.68 | 0.68 |
RNN | Palangi et al. [128] | 0.87 | 0.89 | 0.91 | 0.48 | 0.54 | 0.50 | 0.68 | 0.65 | 0.71 |
HYBRID | Lai et al. [142] | 0.88 | 0.88 | 0.91 | 0.60 | 0.57 | 0.69 | 0.70 | 0.66 | 0.77 |
HYBRID | Chen et al. [143] | 0.87 | 0.89 | 0.86 | 0.32 | 0.48 | 0.47 | 0.39 | 0.52 | 0.44 |
HYBRID | Zhou et al. [144] | 0.88 | 0.88 | 0.88 | 0.43 | 0.61 | 0.53 | 0.53 | 0.58 | 0.55 |
HYBRID | Wang et al. [145] | 0.88 | 0.90 | 0.90 | 0.66 | 0.62 | 0.50 | 0.58 | 0.57 | 0.56 |
BERT Multilingual [42] | 12 layers, 768 hidden units, 12 heads | 0.93 | 0.85 | 0.93 | 0.77 | 0.35 | 0.77 | 0.68 | 0.41 | 0.70 |
-
When the full vocabulary of unique words is fed to the model, bidirectional LSTM outperforms all other neural architectures.
-
When highly discriminative features are fed, convolution-based models are the clear winners, performing better than recurrent and hybrid models; in a few scenarios, hybrid models perform on par with CNNs, but not better.
-
A model with a multi-layer CNN architecture using different filter sizes learns a better data representation than one in which CNN layers are linearly stacked on top of each other.
-
According to the experiments, models give better results when the embedding layer is initialized with pre-trained word vectors and updated during training.
-
Implementing wide convolutions increases model performance on text document classification, as wide convolution equalizes the participation of all features while convolving them.
-
For text document classification, the performance of a deep learning model increases significantly when it is fed deterministic features instead of the full vocabulary of all unique terms.
-
The max pooling layer plays a significant role in extracting discriminative features.
-
Using multiple embedding layers with a CNN architecture produces better results when the model is fed deterministic features; in all other scenarios, there is no significant difference between models using single and multiple embedding layers.
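The wide-convolution point can be illustrated with a toy sketch (illustrative code, not one of the cited architectures): under narrow, "valid"-style convolution, tokens near the sequence edges fall into fewer filter windows than interior tokens, whereas wide, "full"-style convolution zero-pads the sequence so every token participates in the same number of windows.

```python
def window_participation(seq_len, filter_size, wide):
    """Count how many 1-D convolution windows each token position falls into."""
    counts = [0] * seq_len
    if wide:   # 'full' convolution: zero-pad by filter_size - 1 on both sides
        starts = range(-(filter_size - 1), seq_len)
    else:      # 'valid' convolution: windows lie entirely inside the sequence
        starts = range(seq_len - filter_size + 1)
    for s in starts:
        for pos in range(s, s + filter_size):
            if 0 <= pos < seq_len:
                counts[pos] += 1
    return counts

print(window_participation(7, 3, wide=False))  # [1, 2, 3, 3, 3, 2, 1]
print(window_participation(7, 3, wide=True))   # [3, 3, 3, 3, 3, 3, 3]
```

With narrow convolution the edge tokens join only one window each, while wide convolution gives every token the same number of windows, which is what "equalizes the participation of all features" refers to in the findings above.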