24.09.2020 | Original Article | Issue 11/2021 Open Access

# Benchmarking performance of machine and deep learning-based methodologies for Urdu text document classification

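Journal:
Neural Computing and Applications > Issue 11/2021
Authors:
Muhammad Nabeel Asim, Muhammad Usman Ghani, Muhammad Ali Ibrahim, Waqar Mahmood, Andreas Dengel, Sheraz Ahmed
Important notes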
The article Benchmarking performance of machine and deep learning-based methodologies for Urdu text document classification, written by Muhammad Nabeel Asim, Muhammad Usman Ghani, Muhammad Ali Ibrahim, Waqar Mahmood, Andreas Dengel, Sheraz Ahmed, was originally published electronically on the publisher’s internet portal (currently SpringerLink) on [10/2020] with open access. With the author(s)’ decision to step back from Open Choice, the copyright of the article changed on [10/2020] to © [Springer-Verlag London Ltd., part of Springer Nature] [2020] and the article is forthwith distributed under the terms of copyright.

## Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## 1 Introduction

Textual resources in diverse domains such as academia and industry are growing enormously on the web due to the rapid growth of technology [ 1, 2]. According to a recent survey of data facts, users have utilized only $$0.5\%$$ of all electronic textual data [ 3]. The amount of electronic textual data created in the last two years exceeds the data previously created by the entire human race [ 3]. This marks a pressing need to classify or categorize such huge volumes of electronic text in order to enable large-scale text processing and the extraction of useful insights. With the emergence of computational methodologies for text classification, a wide range of applications has been developed, such as email spam detection [ 4], gender identification [ 5], product review analysis [ 6], news categorization [ 7–9] and fake news detection [ 10–12], for various languages like English, Arabic, and Chinese. However, despite having more than 100 million speakers [ 13], the Urdu language still lacks such applications. The primary reason behind this limited progress is the lack of publicly available datasets for Urdu. The Urdu text document classification datasets used in previous works are private [ 14–19], which further restricts research and the fair comparison of new methodologies. To overcome this limitation, this paper provides a new publicly available dataset in which news documents are manually tagged against six different classes.
Regarding the performance of traditional machine learning-based text document classification methodologies, feature selection has played a significant role for various languages such as English, Arabic, and Chinese [ 20, 21]. The ultimate aim of feature selection is to eliminate irrelevant and redundant features [ 22]. Feature selection reduces the burden on the classifier, which leads to faster training [ 23, 24]. It also helps the classifier draw a better decision boundary, which eventually results in more accurate predictions [ 23, 24]. State-of-the-art machine learning-based Urdu text document classification methodologies lack discriminative feature selection techniques [ 19]. In this paper, we embed ten widely used filter-based feature ranking metrics in the traditional machine learning pipeline to assess the impact of the selected top k features on the performance of support vector machine (SVM) [ 25] and Naive Bayes (NB) [ 26] classifiers.
Although feature selection techniques reduce the dimensionality of textual data to a great extent, traditional machine learning-based text document classification methodologies still face the sparsity problem of bag-of-words-based feature representation techniques [ 27, 28]. These techniques consider unigrams, n-grams or specific patterns as features [ 27, 28]. They do not capture the complete contextual information of the data and also suffer from data sparsity [ 27, 28]. These problems are addressed by word embeddings, which capture not only syntactic but also semantic information of textual data [ 29]. Deep learning-based text document classification methodologies provide an end-to-end system for text classification by automating the process of feature engineering, and they are outperforming state-of-the-art machine learning-based classification approaches [ 30–35].
Although there exists some work on the development of pre-trained neural word embeddings for the Urdu language (Haider et al. [ 36] and FastText, footnote 1), no researcher has utilized any deep learning-based methodology or pre-trained neural word embeddings for Urdu text document classification. Here, we thoroughly investigate the performance of 10 state-of-the-art deep learning methodologies using pre-trained neural word embeddings. Of these, 4 methodologies are based on a convolutional neural network (CNN), 3 on a recurrent neural network (RNN), and 3 on a hybrid approach (CNN+RNN). Pre-trained neural word embeddings are only shallow representations, as they fuse learned knowledge into the very first layer of a deep learning model, whereas the remaining layers still have to be trained from randomly initialized weights [ 37]. Moreover, although pre-trained neural word embeddings manage to capture the semantic information of words, they fail to acquire high-level information including long-range dependencies, anaphora, negation, and agreement for different domains [ 37–39]. Considering the recent trend of utilizing pre-trained language models to overcome the shortcomings of pre-trained neural word embeddings [ 40, 41], we also explore the impact of language modelling on the task of Urdu text document classification.
However, due to the lack of extensive research, finding an optimal way to obtain the best results on diverse natural language processing tasks with Bidirectional Encoder Representations from Transformers (BERT) [ 42] is not straightforward [ 43–45]. For instance, it is unclear whether pre-training BERT [ 42] on domain-specific data, fine-tuning BERT [ 42] for the target task, or multitask learning is the best option [ 43–45]. In this paper, we thoroughly investigate several ways to fine-tune pre-trained multilingual BERT [ 42] language models and provide key insights to make the best use of BERT [ 42] for Urdu text document classification.
Previously, we proposed a robust machine and deep learning-based hybrid approach [ 46] for English text document classification. The proposed hybrid methodology reaped the benefits of both machine learning-based feature engineering and the automated feature engineering performed by deep learning models, which eventually helped the model to better classify text documents into predefined classes [ 46]. The hybrid approach significantly improved the performance of text document classification on two publicly available benchmark English datasets, 20-Newsgroup (footnote 2) and BBC (footnote 3) [ 46]. This paper investigates whether the utilization of both machine and deep learning-based feature engineering is versatile and effective enough to replicate these promising performance figures with a variety of deep learning models for Urdu text document classification. Extensive experimentation with all machine and deep learning-based methodologies is performed on two closed source datasets, namely CLE Urdu Digest 1000 k and CLE Urdu Digest 1Million, and one newly developed dataset, namely DSL Urdu news.
Among all machine learning-based methodologies, Naive Bayes [ 26] with normalized difference measure [ 47] marks the highest performance of 94% on the newly developed DSL Urdu news dataset, whereas SVM [ 25] proves dominant on both closed source datasets, CLE Urdu Digest 1000 k and CLE Urdu Digest 1Million, marking a performance of 92% with normalized difference measure [ 47] and 83% with Chi-squared (CHISQ) [ 48]. The trivially adopted deep learning-based methodologies manage to outshine state-of-the-art performance by a margin of 6% on CLE Urdu Digest 1000 k and 1% on CLE Urdu Digest 1Million. In contrast, the hybrid methodology, which leverages machine and deep learning-based feature engineering [ 46], and BERT [ 42] achieve similar performance across all three datasets. These methodologies outperform state-of-the-art performance by 18% on CLE Urdu Digest 1000 k and 10% on CLE Urdu Digest 1 M, and almost equal the performance of the machine learning-based methodology on the DSL Urdu news dataset.
The remainder of the paper is organized as follows. The next section details the contributions along with their anticipated impacts, followed by previous work solely related to Urdu text document classification and a detailed explanation of the text document classification methodologies used in this paper. Then, all datasets are described comprehensively. Afterwards, the experimental setup and results are presented in subsequent sections. Finally, we summarize the key points and give future directions.

## 2 Contributions: a review in a nutshell with anticipated impacts

Researchers have employed a variety of ways to improve classification performance for textual data ranging from simple documents to complex genomic sequences [ 49]. To name a few, some researchers have tried different data re-sampling approaches and machine learning classifiers (e.g. generative, probabilistic, and ensemble classifiers such as bagging and boosting) [ 50], while others have employed more effective feature engineering approaches (feature representation and selection) and less biased evaluation metrics [ 21, 51, 52]. With the revolutionary success of deep learning-based classification approaches, some researchers have recently utilized a variety of neural word embeddings, activation and loss functions, deep (MLP) and reasonably deep standalone (CNN, RNN) or hybrid (CNN+RNN) neural networks, and model parameter optimization approaches to achieve optimal classification performance [ 53]. Other researchers have achieved promising performance figures through transfer learning [ 54] (using pre-trained language models, or training a language model from scratch in an unsupervised manner and then fine-tuning it on the target task).
However, a comprehensive review of the variety of machine and deep learning-based classification approaches, with every minor detail of model parameters, is quite scarce, especially for under-resourced languages like Nastaleeq Urdu. In addition, although different filter-based feature selection approaches have significantly raised the performance of machine learning classifiers, no work has assessed the impact of a variety of filter-based feature selection approaches on deep learning models.
Building on the above discussion, the current study performs a comprehensive comparative analysis of machine and deep learning-based classification approaches using a variety of feature representation and feature selection approaches. The contributions of this work are summarized below:
1. Development of a publicly available dataset for Urdu text document classification that contains 662 documents from six different classes (health-science, sports, business, agriculture, world, and entertainment), comprising 130 K words.
2. For machine learning-based text document classification, the optimal combination of feature selection approach and classification algorithm is found through rigorous experimentation with 10 filter-based feature selection algorithms, namely balanced accuracy measure (ACC2) [ 55], normalized difference measure (NDM) [ 47], max–min ratio (MMR) [ 21], relative discrimination criterion (RDC) [ 56], information gain (IG) [ 57], Chi-squared (CHISQ) [ 48], odds ratio (OR) [ 58], bi-normal separation (BNS) [ 55], Gini index (GINI) [ 59], and Poisson's ratio (POISON) [ 60], at predefined benchmark test points. This will save application developers a significant amount of time and effort.
3. For deep learning-based text document classification, the impact of 10 filter-based feature selection algorithms is assessed over 10 state-of-the-art standalone and hybrid neural networks to find the best filter-based feature selection algorithm for deep learning models. This will largely assist deep learning researchers in selecting the most discriminative features from very high-dimensional feature vectors.
4. Considering the lack of research on optimizing the most widely used pre-trained multilingual language model, Bidirectional Encoder Representations from Transformers (BERT [ 42]), for better performance on target tasks, the performance of BERT is assessed both with its base vocabulary and with the vocabulary generated by the top filter-based feature selection algorithm. Key steps to better fine-tune BERT for text classification are also provided. This reveals how effective the vocabulary generated by the top filter-based feature selection algorithm is for a language model, and which parameters, with what values, play a more crucial role in raising classification performance.
5. The effectiveness of a previously proposed hybrid approach [ 46], which reaps the benefits of traditional feature engineering and deep learning-based automated feature engineering, is thoroughly investigated for Urdu text document classification using a variety of classifiers and Urdu datasets. This reduces the sole reliance of researchers on the automated feature engineering performed by deep learning models and will encourage them to employ different approaches to further improve the feature engineering of deep learning models.

## 3 Related work

Text document classification methodologies can be categorized into rule-based and statistical approaches. Rule-based approaches utilize manually written linguistic rules, whereas statistical approaches learn the association between features and class labels in order to classify text documents into predefined classes [ 61]. This section briefly reviews state-of-the-art statistical work on Urdu text document classification.
Ali et al. [ 14] compared the performance of two classifiers, namely Naïve Bayes (NB) [ 26] and Support Vector Machine (SVM) [ 25], for the task of Urdu text document classification. They prepared a dataset by scraping various Urdu news websites and manually classified the documents into six categories (news, sports, finance, culture, consumer information and personal information). Based on their experimental results, they concluded that SVM [ 25] significantly outperformed Naive Bayes [ 26]. Their experiments also revealed that stemming decreased the overall classification performance.
Usman et al. [ 15] utilized a maximum voting approach to classify Urdu news documents. The news corpus was divided into seven categories, namely business, entertainment, culture, health, sports, and weird. After tokenization, stop word removal, and stemming, they extracted 93400 terms and fed them to six machine learning classifiers, namely Naïve Bayes (NB), linear stochastic gradient descent (SGD) [ 62], multinomial Naïve Bayes (MNB), Bernoulli Naïve Bayes (BNB) [ 63], linear SVM [ 25], and random forest classifier [ 64]. Then, they applied max voting such that the class selected by the majority of the classification algorithms was chosen as the final class. Experimentally, they showed that linear SVM [ 25] and linear SGD [ 62] performed better on their developed corpus.
Sattar et al. [ 17] performed Urdu editorial classification using a Naïve Bayes classifier. Moreover, the most frequent terms of the corpus were removed to reduce the dimensionality of the data. Their experimental results showed that the Naïve Bayes classifier performs well when it is fed with frequent terms as compared to feeding it all unique terms of the corpus.
Ahmed et al. [ 16] performed Urdu news headline classification using a support vector machine (SVM) [ 25] classifier. They utilized a TF-IDF-based feature selection approach which removed less important domain-specific terms from the underlying corpus. This was done by applying a threshold to the TF-IDF score, which enabled the extraction of those terms that had a higher TF-IDF score than the defined threshold value. After preprocessing and threshold-based term filtration, they used the SVM [ 25] classifier to make predictions.
Zia et al. [ 18] evaluated the performance of Urdu text document classification by adopting four state-of-the-art feature selection techniques, namely information gain (IG) [ 57], Chi-square (CS) [ 48], gain ratio (GR) [ 65], and symmetrical uncertainty [ 66], with four classification algorithms (K-nearest neighbours (KNN) [ 67], Naïve Bayes (NB), decision tree (DT), and support vector machines (SVM) [ 25]). They found that for larger datasets, the performance of SVM [ 25] with any of the above-mentioned feature selection techniques was better than that of Naïve Bayes [ 26], which was better suited to small corpora.
Adeeba et al. [ 19] presented an automatic Urdu text-based genre identification system that classified Urdu text documents into one of the eight predefined categories, namely culture, science, religion, press, health, sports, letters, and interviews. They investigated the effects of employing both lexical and structural features on the performance of support vector machine [ 25], Naïve Bayes [ 26], and decision tree algorithms. For lexical features, the authors extracted word unigrams, bigrams, along with their term frequency and inverse document frequency. To extract structural features, part of speech tags and word sense information were utilized. Moreover, they reduced the dimensionality of corpora by eliminating low-frequency terms. For the experimentation, CLE Urdu Digest 100K 4 and CLE Urdu Digest 1 Million 5 corpora were used. Their experiments revealed that SVM [ 25] was better than other classifiers irrespective of feature types.
State-of-the-art work on Urdu text document classification is summarized in Table 1 by author name, benchmark dataset, exploited feature representation and selection techniques, classifiers, evaluation metrics, and their respective performances.
Table 1
State-of-the-art work on Urdu text document classification

| Authors | Datasets | Feature representation techniques | Feature selection techniques | Classifier | Evaluation metric |
| --- | --- | --- | --- | --- | --- |
| Ali et al. [ 14] | Manually classified news corpus | Normalized term frequency | | NB, SVM | Accuracy |
| Usman et al. [ 15] | News Corpus | Term Frequency (TF) | | NB, BNB, LSVM, LSGD, RF | Precision, Recall, F1-score |
| Sattar et al. [ 17] | Urdu News Editorials | Term Frequency (TF) | | NB | Precision, Recall, F1-score |
| Ahmed et al. [ 16] | | TF-IDF | TF-IDF (Thresholding) | SVM | Accuracy |
| Zia et al. [ 18] | EMILLE, Self Collected Naive corpus (News) | TF-IDF | Information Gain, Chi Square, Gain Ratio, Symmetrical Uncertainty | KNN, DT, NB | F1-score |
| Adeeba et al. [ 19] | CLE Urdu Digest (1000 K, 1 Million) | Term Frequency (TF), TF-IDF | Pruning | | Precision, Recall, F1-score |

After thoroughly examining the literature, it can be summarized that SVM [ 25] and Naive Bayes [ 26] perform better than other classifiers for the task of Urdu text document classification.
For English text document classification, recent experimentation on public benchmark datasets also shows that the performance of SVM [ 25] and Naive Bayes [ 26] improves significantly with the use of filter-based feature selection algorithms [ 21, 47]. Filter-based feature selection algorithms not only improve the performance of machine learning-based methodologies; they have also substantially raised the performance of deep learning-based text document classification approaches [ 46].
However, Urdu text document classification methodologies have yet to deliver promising performance due to the lack of research in this direction, as only Ahmed et al. [ 16] and Zia et al. [ 18] utilized feature selection approaches to reduce the dimensionality of the data. While Ahmed et al. [ 16] only experimented with a TF-IDF-based feature selection approach, Zia et al. [ 18] assessed just four feature selection algorithms (information gain (IG) [ 57], Chi-Square (CS), gain ratio (GR), and symmetrical uncertainty) in the domain of Urdu text document classification. The performance impact of more recent filter-based feature selection algorithms has never been explored specifically for Urdu text document classification.
In addition, despite the promising performance produced by deep learning methodologies for diverse NLP tasks [ 68, 69], no researcher has utilized any deep learning-based methodology for the task of Urdu text document classification.

## 4 Adopted methodologies for Urdu text document classification

This section comprehensively illustrates machine learning, deep learning, and hybrid methodologies which we have used for the task of Urdu text document classification.

### 4.1 Traditional machine learning-based Urdu text document classification with filter-based feature selection algorithm

This section describes the machine learning-based Urdu text document classification methodology. Our main focus is to investigate the performance boost that traditional machine learning-based Urdu text document classification methodologies gain from embedding filter-based feature selection algorithms. Figure 1 provides a graphical illustration of the machine learning-based Urdu text classification methodology, which utilizes filter-based feature engineering. All phases of this methodology are discussed below.

### 4.2 Preprocessing

Preprocessing of text is considered a preliminary step in almost all natural language processing tasks, as better tokenization and stemming or lemmatization eventually lead to better performance in various machine learning tasks such as text classification [ 70, 71], information retrieval [ 72], and text summarization [ 73].
Stemming undoubtedly plays an important role in alleviating sparsity problems through dimensionality reduction; however, very few rule-based stemmers are available for the Urdu language, and they do not deliver good-quality results. Ali et al. [ 14] claimed that stemming degrades the performance of Urdu text document classification. Our analysis suggests that the stemmer utilized by Ali et al. [ 14] was of poor quality, which caused the decline in performance, since many researchers have shown that stemming often improves the performance of text document classification for various languages (e.g. English) [ 74, 75]. Because the Urdu language lacks good stemming algorithms, we perform lemmatization instead of stemming, using a manually prepared Urdu lexicon containing 9743 possible variants of 4162 base terms. In Sect. 6, Tables 3, 4, 5 reveal the impact of lemmatization on the size reduction of the three datasets used in our experimentation. We believe public access to the developed lexicon will enable researchers to perform lemmatization in several different Urdu processing tasks. In addition, all non-significant words of the corpus are eliminated through a stop word list. The list of 1000 stop words is formed by manually analyzing the 1500 most frequent words of the underlying corpora.
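The lexicon lookup and stop word filtering described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the toy lexicon, and the toy stop word list are hypothetical (Latin-script placeholders stand in for Urdu tokens), and applying the stop word filter after lemmatization is an assumption, since the text does not state the order.

```python
from typing import Dict, Iterable, List, Set


def build_lemma_lookup(lexicon: Dict[str, Iterable[str]]) -> Dict[str, str]:
    """Invert {base term: variants} into {variant: base term}."""
    lookup: Dict[str, str] = {}
    for base, variants in lexicon.items():
        lookup[base] = base
        for variant in variants:
            lookup[variant] = base
    return lookup


def preprocess(tokens: List[str], lookup: Dict[str, str], stop_words: Set[str]) -> List[str]:
    """Map each token to its base term (unknown tokens pass through), then drop stop words."""
    lemmas = (lookup.get(token, token) for token in tokens)
    return [lemma for lemma in lemmas if lemma not in stop_words]


# Hypothetical toy data; the real lexicon maps 9743 variants to 4162 base terms.
toy_lexicon = {"walk": ["walked", "walking"], "dog": ["dogs"]}
toy_stop_words = {"the", "is"}
toy_lookup = build_lemma_lookup(toy_lexicon)
cleaned = preprocess(["the", "dogs", "walked", "home"], toy_lookup, toy_stop_words)
```

A dictionary lookup keeps lemmatization constant-time per token, and tokens missing from the lexicon are left unchanged rather than dropped.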

### 4.3 Feature selection

Feature selection is widely used to reduce the dimensionality of the feature space in different applications such as text classification [ 47], plagiarism detection [ 76], and query expansion in pseudo-relevance feedback-based information retrieval [ 77], which eventually helps to produce better results.
Feature selection approaches can be categorized into three classes: wrapper [ 78], embedded [ 79], and filter [ 80]. In wrapper methods, the classifier is trained and tested over several subsets of features, and the single subset that produces the minimum error is selected [ 78]. Embedded feature selection approaches work like wrapper-based methods; however, whereas wrapper-based methods can use one classifier (e.g. SVM [ 25]) to train over subsets of features and another classifier (e.g. Naive Bayes) to test the optimal set of features, embedded approaches are bound to use the same classifier throughout the classification process [ 79].
On the other hand, filter-based feature selection algorithms do not take into account the error value of a classifier; instead, they rank the features and pick the top k features based on a certain threshold [ 81]. In this way, a highly discriminative, user-specified subset of features is acquired by utilizing the statistics of the data samples.
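The rank-then-truncate step common to all filter-based algorithms can be sketched as below. The function name `select_top_k` and the example scores are hypothetical, and breaking ties alphabetically is an assumption made only to keep the result deterministic.

```python
from typing import Dict, List


def select_top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring terms; ties are broken alphabetically (an assumption)."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical scores produced by some filter-based ranking metric.
toy_scores = {"match": 0.9, "bank": 0.5, "said": 0.1, "crop": 0.5}
top_terms = select_top_k(toy_scores, 3)
```

Because the ranking depends only on per-term statistics, the same function works with any of the scoring metrics discussed in this section.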
Wrapper and embedded feature selection methods are computationally far more expensive than filter-based feature selection algorithms. While the former two approaches assess the usefulness of features by cross-validating classifier performance, the latter operate on the intrinsic properties (e.g. relevance) of features computed through univariate statistics.
In our work, considering the efficiency of filter-based feature selection, we have adopted ten widely used filter-based feature selection algorithms. These algorithms, which are used extensively for English text document classification, are balanced accuracy measure (ACC2) [ 55], normalized difference measure (NDM) [ 47], max–min ratio (MMR) [ 21], relative discrimination criterion (RDC) [ 56], information gain (IG) [ 57], Chi-squared (CHISQ) [ 48], odds ratio (OR) [ 58], bi-normal separation (BNS) [ 55], Gini index (GINI) [ 59], and Poisson's ratio (POISON) [ 60].
Filter-based feature ranking algorithms utilize the confusion matrix (shown in Table 2) to compute the scores of corpus features.
Table 2
Confusion matrix, where $$t_p$$ refers to the number of documents in positive class having term t (true positives), $$f_p$$ refers to the number of documents in negative class having term t (false positives), $$t_n$$ implies the number of documents in negative class not having term t (true negatives), and $$f_n$$ implies the number of documents in positive class not having term t (false negatives)

| | $$t_j$$ | $${\bar{t}}_j$$ |
| --- | --- | --- |
| Positive class | $$t_p$$ | $$f_n$$ |
| Negative class | $$f_p$$ | $$t_n$$ |

In the confusion matrix, the positive and negative classes are the two predefined classes of a typical binary text document classification problem, whereas in a multi-class text document classification problem, each class is in turn considered positive and the rest are combined to form a negative class. $$t_j$$ and $${\bar{t}}_j$$ represent the presence and absence of a term, respectively, in the corresponding classes.
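The four counts of Table 2 can be gathered for one term under the one-versus-rest scheme as sketched below. The function name `term_counts` and the toy documents are hypothetical; each document is assumed to be represented as a set of tokens.

```python
from typing import Hashable, List, Set, Tuple


def term_counts(
    docs: List[Set[str]], labels: List[Hashable], term: str, positive_class: Hashable
) -> Tuple[int, int, int, int]:
    """Return (t_p, f_p, f_n, t_n) for `term`, treating all other classes as negative."""
    tp = fp = fn = tn = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        if label == positive_class:
            if has_term:
                tp += 1  # positive document containing the term
            else:
                fn += 1  # positive document without the term
        else:
            if has_term:
                fp += 1  # negative document containing the term
            else:
                tn += 1  # negative document without the term
    return tp, fp, fn, tn


# Hypothetical toy corpus with three classes.
toy_docs = [{"goal", "match"}, {"match"}, {"bank"}, {"goal", "bank"}]
toy_labels = ["sports", "sports", "business", "world"]
counts = term_counts(toy_docs, toy_labels, "goal", "sports")
```

In a multi-class setting the function is simply called once per class, with that class passed as `positive_class`.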
Here, we only briefly describe these feature selection algorithms; interested readers can explore them in depth in their respective papers.

#### 4.3.1 Balanced accuracy measure (ACC2)

Accuracy measure (ACC) is the predecessor of the balanced accuracy measure (ACC2) [ 47]. ACC is the simplest filter-based feature ranking algorithm: it computes the difference between $$t_p$$, the total number of positive class documents having feature f, and $$f_p$$, the total number of negative class documents having feature f. In multi-class machine learning problems, ACC is biased towards true positives; therefore, it performs well only on balanced data.
$$\begin{aligned} \text{Accuracy measure} = ACC = t_p - f_p \end{aligned} \tag{1}$$
To overcome the $$t_p$$ bias, ACC2 was proposed. It is the absolute difference between the true positive rate $$\left( t_{pr}\right)$$ and the false positive rate $$\left( f_{pr}\right)$$.
$$\begin{aligned} \text{Balanced accuracy measure} = ACC2 = | t_{pr} - f_{pr}| \end{aligned} \tag{2}$$
In Eq. 2, the values of $$t_{pr}$$ and $$f_{pr}$$ are defined in Eqs. 3 and 4.
$$\begin{aligned} t_{\mathrm{pr}}&= \frac{t_p}{t_p + f_n} \end{aligned} \tag{3}$$
$$\begin{aligned} f_{\mathrm{pr}}&= \frac{f_p}{f_p + t_n} \end{aligned} \tag{4}$$
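A small worked sketch shows why ACC2 is less sensitive to class imbalance than ACC. It uses the standard definitions of the true positive rate, $$t_p/(t_p+f_n)$$, and the false positive rate, $$f_p/(f_p+t_n)$$; the function names and counts are hypothetical, and returning 0 for an empty class is an assumption.

```python
from typing import Tuple


def rates(tp: int, fp: int, fn: int, tn: int) -> Tuple[float, float]:
    """True positive rate and false positive rate of a term (0.0 for an empty class)."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def acc(tp: int, fp: int) -> int:
    """Accuracy measure: raw difference of document counts (Eq. 1)."""
    return tp - fp


def acc2(tp: int, fp: int, fn: int, tn: int) -> float:
    """Balanced accuracy measure: absolute difference of the two rates (Eq. 2)."""
    tpr, fpr = rates(tp, fp, fn, tn)
    return abs(tpr - fpr)


# Hypothetical skewed data: 10 positive documents, 100 negative documents.
# The term covers 8 of 10 positives but only 10 of 100 negatives.
acc_score = acc(tp=8, fp=10)
acc2_score = acc2(tp=8, fp=10, fn=2, tn=90)
```

With these counts ACC is negative because the negative class is ten times larger, while ACC2 compares rates (0.8 versus 0.1) and still recognizes the term as discriminative.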

#### 4.3.2 Normalized difference measure (NDM)

ACC2 treats all terms alike which have the same $$|t_{pr} - f_{pr}|$$ value even if the $$t_{pr}$$ and $$f_{pr}$$ values of terms are different from each other. According to Rehman et al. [ 47], the terms located at the bottom right and top left corners of the contour plot are more important than the ones located around the diagonals. Although ACC2 assigns higher values to the terms located at bottom right and top left corners of contour plot, it treats the terms alike which are located around the diagonal. In order to overcome this problem, normalized difference measure (NDM) treats the terms at corners and at diagonals differently.
According to NDM, a term is important if:
• It has a high $$|t_{pr} - f_{pr}|$$ value.
• Either $$t_{pr}$$ or $$f_{pr}$$ is close to zero.
• If any two terms have the same $$|t_{pr} - f_{pr}|$$ value, a higher rank must be assigned to the term with the smaller $$\min(t_{pr}, f_{pr})$$ value.
The mathematical representation of NDM is as follows:
$$\begin{aligned} \text{NDM} =\frac{|t_{\mathrm{pr}}-f_{\mathrm{pr}}|}{\min(t_{\mathrm{pr}},f_{\mathrm{pr}})} \end{aligned} \tag{5}$$
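The effect of the normalization in Eq. 5 can be checked with two terms that ACC2 cannot tell apart. The function name `ndm` and the rate values are hypothetical, and substituting a small constant `eps` when the minimum rate is zero is an assumption made here to avoid division by zero.

```python
def ndm(tpr: float, fpr: float, eps: float = 1e-6) -> float:
    """Normalized difference measure (Eq. 5); eps replaces a zero denominator (assumption)."""
    denominator = min(tpr, fpr)
    if denominator == 0:
        denominator = eps
    return abs(tpr - fpr) / denominator


# Both terms have |tpr - fpr| = 0.3, so ACC2 scores them identically.
corner_term = ndm(tpr=0.4, fpr=0.1)    # one rate is close to zero
diagonal_term = ndm(tpr=0.9, fpr=0.6)  # both rates are large
```

The term whose smaller rate is closer to zero receives the higher NDM score, which matches the third condition in the list above.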

#### 4.3.3 Max–Min ratio (MMR)

Max–min ratio is an improved version of ACC2 and NDM, as it addresses the shortcomings of both feature ranking algorithms. NDM assigns very high scores to all sparse terms ($$t_{pr} \approx 0$$, $$f_{pr} \approx 0$$, or both $$t_{pr}, f_{pr} \approx 0$$), and the denominator factor $$\min(t_{pr}, f_{pr})$$ overwhelms the numerator $$|t_{pr}-f_{pr}|$$ in the case of largely skewed data. MMR, however, is highly capable of estimating true term relevance, especially for datasets in which the predefined classes are highly skewed. The mathematical representation of MMR is as follows:
$$\begin{aligned} \text{MMR} =\max(t_{\mathrm{pr}}, f_{\mathrm{pr}}) \times \frac{|t_{\mathrm{pr}}-f_{\mathrm{pr}}|}{\min(t_{\mathrm{pr}},f_{\mathrm{pr}})} \end{aligned} \tag{6}$$
It is clear from the equation that the factor $$\max(t_{pr}, f_{pr})$$ stops the NDM scores from getting too large. This helps significantly for terms whose $$t_{pr}$$ and $$f_{pr}$$ are both approximately equal to 0. MMR and NDM assign the same score when the maximum of $$t_{pr}$$ and $$f_{pr}$$ is exactly 1. When $$t_{pr}$$ is exactly equal to $$f_{pr}$$, MMR behaves like ACC2. Although MMR still faces the problem of a zero denominator factor $$\min(t_{pr}, f_{pr})$$ for terms that do not occur in one of the predefined classes, it is considered an improved version because of its capability to capture the true relevance of corpus terms, especially for highly skewed datasets.
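The damping effect of the $$\max(t_{pr}, f_{pr})$$ factor can be illustrated by comparing a very sparse term with a frequent one that has the same NDM score. The function names and rate values are hypothetical, and replacing a zero minimum with a small constant `eps` is an assumption.

```python
def ndm(tpr: float, fpr: float, eps: float = 1e-6) -> float:
    """Normalized difference measure; eps replaces a zero denominator (assumption)."""
    denominator = min(tpr, fpr) or eps
    return abs(tpr - fpr) / denominator


def mmr(tpr: float, fpr: float, eps: float = 1e-6) -> float:
    """Max-min ratio: NDM scaled by the larger of the two rates (Eq. 6)."""
    return max(tpr, fpr) * ndm(tpr, fpr, eps)


sparse_term = (0.002, 0.001)   # occurs in almost no documents of either class
frequent_term = (0.8, 0.4)     # occurs often, mostly in the positive class

ndm_scores = (ndm(*sparse_term), ndm(*frequent_term))
mmr_scores = (mmr(*sparse_term), mmr(*frequent_term))
```

NDM gives both terms a score of about 1, whereas MMR scales the sparse term down to roughly 0.002 and keeps the frequent term at about 0.8.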

#### 4.3.4 Relative discrimination criterion (RDC)

Relative discrimination criterion (RDC) [ 56] computes document frequencies of entire corpus terms counts and calculates the difference between document frequencies of each term count by taking into account the presence of term in positives and negative classes. Moreover, in order to tackle the problem of assigning exactly same score to terms which have different discriminative powers, it divides the computed difference by the minimum of two document frequencies. Therefore, term having least minimum document frequency in one of the predefined classes would eventually get a higher score. This is because there is a widely accepted criteria that a term which is frequent in only one specific class shall get a higher score. In addition, in order to assign higher weight to the differences having smaller term counts, the difference is also divided by term count, thus alleviating the bias for higher term counts. Mathematical expression of RDC can be written as:
\begin{aligned} \text{RDC} = \frac{|t\,pr_{\mathrm{tc}} - f\,pr{\mathrm{tc}}|}{min(t\,pr_{\mathrm{tc}},f\,pr_{\mathrm{tc}})*(tc)} \end{aligned}
(7)
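The per-term-count computation in Eq. 7 can be sketched as follows. This is an illustrative reading rather than the reference implementation: it assumes that the rates for a term count tc are the fractions of positive and negative documents in which the term occurs exactly tc times, and the `floor` parameter that guards against a zero denominator is a hypothetical addition.

```python
from collections import Counter


def rdc_per_count(pos_counts, neg_counts, floor=1e-6):
    """Return {tc: RDC score} for one term.

    pos_counts / neg_counts hold the number of occurrences of the term
    in every positive / negative document (0 means absent).
    """
    pos_hist = Counter(c for c in pos_counts if c > 0)
    neg_hist = Counter(c for c in neg_counts if c > 0)
    scores = {}
    for tc in sorted(set(pos_hist) | set(neg_hist)):
        tpr_tc = pos_hist[tc] / len(pos_counts)
        fpr_tc = neg_hist[tc] / len(neg_counts)
        # Assumption: clamp the minimum so that a rate of zero does not divide by zero.
        denom = max(min(tpr_tc, fpr_tc), floor) * tc
        scores[tc] = abs(tpr_tc - fpr_tc) / denom
    return scores
```

A term count that appears in only one class receives a very large score, which matches the intent that class-specific terms should rank highly.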

#### 4.3.5 Information gain (IG)

Information gain (IG) is widely used in text data. It is a measure of how likely a term is to occur in a particular class as compared to other classes. For instance, the word ‘mesmerizing’ is more likely to occur in a positive review and less likely to occur in a negative review. Because the presence of the word ‘mesmerizing’ is a strong indication of positive emotion, it can be regarded as a highly informative word.
Information gain (IG) also calculates whether the information is increased or decreased after adding or removing a term from the feature subset. Information gain for a term t can be calculated as:
\begin{aligned} IG_t = e(p, n) - [P_w \, e(tp, fp) + P_w^- \, e(fn, tn)] \end{aligned}
(8)
where p and n represent the number of positive and negative instances; further, e( p, n) can be calculated as:
\begin{aligned} e(p, n) = -\frac{p}{p+n} \log _2\frac{p}{p+n}-\frac{n}{p+n} \log _2\frac{n}{p + n} \end{aligned}
(9)
and $$P_w$$, $$P_w^-$$ can be calculated as:
\begin{aligned} P_w&= \frac{(tp + fp)}{N} \end{aligned}
(10)
\begin{aligned} P_w^-&= 1 - P_w \end{aligned}
(11)
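A small sketch of Eqs. 8 to 11 is given below, written as the class entropy minus the weighted entropy after splitting on the term. It assumes that tp and fp count the positive and negative documents containing the term, that fn and tn count those without it, and that N = p + n; the function names are illustrative.

```python
import math


def entropy(x: float, y: float) -> float:
    """Binary entropy e(x, y) of Eq. 9, with 0 * log(0) treated as 0."""
    total = x + y
    result = 0.0
    for part in (x, y):
        if part > 0:
            frac = part / total
            result -= frac * math.log2(frac)
    return result


def information_gain(tp: int, fp: int, fn: int, tn: int) -> float:
    p, n = tp + fn, fp + tn           # positive and negative documents
    p_w = (tp + fp) / (p + n)         # Eq. 10: share of documents with the term
    split = p_w * entropy(tp, fp) + (1.0 - p_w) * entropy(fn, tn)
    return entropy(p, n) - split
```

A term that occurs in every positive document and in no negative one yields the maximum gain of 1 bit for balanced classes, while a term spread evenly over both classes yields 0.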

#### 4.3.6 Chi-squared (CHISQ)

It is another widely used feature selection algorithm. In statistics, Chi-squared (CHISQ) is used to measure the dependency of two events, whereas in text document classification, it is used to check the dependency between a term and a class [55]. High CHISQ scores indicate a high dependency between a term and a class.
Moreover, it is a two-sided metric because it takes only the positive value into consideration and ignores the sign [82]. On the basis of the discriminating power of positive and negative classes, a two-sided metric assigns positive values to both classes. One of the downsides of this feature ranking metric is that it performs poorly when the dataset contains infrequent terms. It exaggerates such terms and pays no attention to term distribution. However, the performance of this algorithm can be improved by pruning the dataset with a certain threshold. The CHISQ score for the $$i^{th}$$ term of the $$k^{th}$$ class is given as:
\begin{aligned} \text{CHI}= \frac{\left( t_p\times t_n - f_n \times f_p\right) ^2}{\left( t_p + f_p\right) \left( f_n + t_n\right) \left( t_p + f_n\right) \left( f_p + t_n\right) } \end{aligned}
(12)
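Eq. 12 translates directly into code, as in the sketch below. The function name and the zero-denominator guard are assumptions. Note that the textbook chi-squared statistic additionally multiplies this ratio by the total number of documents N; since N is constant for a given corpus, the ranking of terms is unaffected.

```python
def chi_squared(tp: int, fp: int, fn: int, tn: int) -> float:
    """CHI score of Eq. 12 computed from the 2x2 term/class contingency counts."""
    denom = (tp + fp) * (fn + tn) * (tp + fn) * (fp + tn)
    if denom == 0:
        # Assumption: a degenerate table (empty row or column) scores zero.
        return 0.0
    return (tp * tn - fn * fp) ** 2 / denom
```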

#### 4.3.7 Odds ratio (OR)

Odds ratio (OR) measures the odds of the presence of a feature in a positive class normalized by that of negative class. It is based on the idea that the distribution of feature in positive documents is not the same as distribution in the negative documents. It takes only those features into consideration that repeatedly occur in a particular class and totally ignores the features of same scope in the other classes [ 58].
Also, OR does not prioritize any redundant or irrelevant features. It is good at handling a smaller number of features. Its mathematical formulation is given as:
\begin{aligned} \text{OR}=\frac{t_p \times t_n}{f_p \times f_n} \end{aligned}
(13)
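A hedged sketch of Eq. 13 follows. Because the ratio is undefined when fp or fn is zero, the example adds an optional `smoothing` constant to every cell; this correction is an assumption for illustration and is not part of the equation above. Setting `smoothing=0` reproduces Eq. 13 exactly.

```python
def odds_ratio(tp: int, fp: int, fn: int, tn: int, smoothing: float = 0.5) -> float:
    """Odds ratio of Eq. 13 with an optional additive correction per cell."""
    numerator = (tp + smoothing) * (tn + smoothing)
    denominator = (fp + smoothing) * (fn + smoothing)
    return numerator / denominator
```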

#### 4.3.8 Bi-normal separation (BNS)

Bi-normal separation (BNS), first introduced by Forman [55], is defined as follows:
\begin{aligned} \text{BNS}=|F_c^{-1}(t_{pr})-F_c^{-1}(f_{pr})| \end{aligned}
(14)
Here, $$F_c^{-1}$$ is the inverse cumulative distribution function of the standard normal distribution. The highest weights are assigned to those features that are strongly connected with the positive class or the negative class. The lowest weights are assigned to those features that are evenly distributed among all the classes. The BNS method is not biased towards document frequency and is helpful for extracting useful features in highly skewed datasets.
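Eq. 14 can be sketched with the inverse normal CDF from the Python standard library. The inverse CDF is undefined at 0 and 1, so the rates are clipped first; the clipping bound `clip=0.0005` is an assumption of this example, as is the function name.

```python
from statistics import NormalDist


def bns(tpr: float, fpr: float, clip: float = 0.0005) -> float:
    """Bi-normal separation (Eq. 14) with rates clipped away from 0 and 1."""
    inverse_cdf = NormalDist().inv_cdf  # standard normal, mean 0 and sd 1
    tpr = min(max(tpr, clip), 1.0 - clip)
    fpr = min(max(fpr, clip), 1.0 - clip)
    return abs(inverse_cdf(tpr) - inverse_cdf(fpr))
```

Evenly distributed terms (`tpr == fpr`) score 0, and the score grows as the two rates move apart.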

#### 4.3.9 Gini index (GINI)

Gini index (GINI) is used to measure the purity of an attribute. The purity of a feature can be used to calculate its importance. A feature is pure if all the documents in which it occurs belong to the same class. Therefore, GINI produces useful results when applied to features. A larger GINI value indicates better purity of a feature.
The score of Gini index can be calculated for a feature t using the following formula:
\begin{aligned} GI(t)=\sum _{j=1}^{M}P(t|C_j)^2P(C_j|t)^2 \end{aligned}
(15)
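The sum in Eq. 15 can be sketched as below, estimating both conditional probabilities from document counts. The argument names are hypothetical: `docs_with_term[j]` is the number of documents of class j that contain the term and `docs_per_class[j]` is the size of class j.

```python
def gini_index(docs_with_term, docs_per_class):
    """GI(t) of Eq. 15 from per-class document counts."""
    total_with_term = sum(docs_with_term)
    if total_with_term == 0:
        return 0.0  # assumption: a term that never occurs scores zero
    score = 0.0
    for with_term, class_size in zip(docs_with_term, docs_per_class):
        p_t_given_c = with_term / class_size       # P(t | C_j)
        p_c_given_t = with_term / total_with_term  # P(C_j | t)
        score += (p_t_given_c ** 2) * (p_c_given_t ** 2)
    return score
```

A term found in every document of exactly one class reaches the maximum score of 1.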

#### 4.3.10 Poisson ratio (POISON)

Initially, the Poisson ratio (POISON) was only used to extract query words in information retrieval. Later, Ogura et al. [83] modified POISON for feature selection. It measures the deviation of a feature from the Poisson distribution. The farther a feature deviates from the Poisson distribution, the more effective it is. Conversely, if a feature lies near to or within the range of the distribution, it is a poor feature. Mathematically, POISON is defined as follows:
\begin{aligned} \text{POIS}&= \frac{(a_p-\hat{a_p})^2}{\hat{a_p}}+\frac{(b_{np}-\hat{b_{np}})^2}{\hat{b_{np}}}\nonumber \\&\quad +\frac{(c_{fp}-\hat{c_{fp}})^2}{\hat{c_{fp}}}+ \frac{(d_{tn}-\hat{d_{tn}})^2}{\hat{d_{tn}}} \end{aligned}
(16)
\begin{aligned} \hat{a_p}&= N(C)(1-e^{(-\lambda )}), \hat{b_{np}}=N(C)e^{(-\lambda )}, \end{aligned}
(17)
\begin{aligned} \hat{c_{fp}}&= N(\bar{C})(1-e^{(-\lambda )}) \end{aligned}
(18)
\begin{aligned} \hat{d_{tn}}&= N(\bar{C})e^{(-\lambda )} \end{aligned}
(19)
\begin{aligned} \lambda&= F/N \end{aligned}
(20)
where $$a_p$$ represents the presence and $$b_{np}$$ the absence of a term in a particular class C. If a term is present in a document that does not belong to class C, it is counted as $$c_{fp}$$, and $$d_{tn}$$ counts the documents from which both t and C are absent.
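Eqs. 16 to 20 can be sketched as follows. The example rests on several assumptions that the text does not spell out: N(C) = a + b and N(C̄) = c + d are the numbers of documents inside and outside class C, N is their sum, and F is the total number of occurrences of the term in the corpus. The function name is illustrative.

```python
import math


def poisson_score(a: int, b: int, c: int, d: int, term_freq: int) -> float:
    """Deviation from the Poisson model (Eq. 16) for one term and one class."""
    n_c, n_not_c = a + b, c + d
    lam = term_freq / (n_c + n_not_c)  # Eq. 20
    p_absent = math.exp(-lam)          # Poisson probability of zero occurrences
    expected = (
        n_c * (1.0 - p_absent),        # Eq. 17, expected a
        n_c * p_absent,                # Eq. 17, expected b
        n_not_c * (1.0 - p_absent),    # Eq. 18, expected c
        n_not_c * p_absent,            # Eq. 19, expected d
    )
    observed = (a, b, c, d)
    # Assumption: cells with a zero expectation are skipped to avoid division by zero.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
```

With the same overall frequency, a term concentrated in class C deviates more from the Poisson expectation than a term spread evenly over both groups.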

### 4.4 Feature representation

Diverse domains (e.g. textual, non-textual) have different stacks of features; for example, if we want to classify iris data, then the set of useful features would be sepal length, sepal width, petal length and petal width [84]. However, the set of textual features for a certain domain is not fixed at all. Representation of features plays a vital role in raising the performance of diverse classification methodologies [29, 68, 69]. Machine learning methodologies utilize bag of words-based feature representation approaches. Term frequency [85] is the simplest and most widely used feature representation technique for various natural language processing tasks such as text classification and information retrieval [86–88]. Term frequency (TF) [85] of a term in a document is defined as the number of times the term occurs in that document. One of the most significant problems of TF is that it does not capture the actual importance and usefulness of a term. This shortcoming is well addressed by term frequency-inverse document frequency (TF-IDF) [85], a modified version of term frequency [85] which lowers the weight of words that are commonly used across the underlying corpus and raises the weight of less commonly used words. It is calculated by multiplying term frequency (TF) by inverse document frequency (IDF).
IDF assigns weights to all the terms at corpus level. According to IDF, a term is more important if it occurs in fewer documents. When the IDF weighting scheme is used standalone, it can allocate the same weight to many terms which have the same $$DF_t$$ score. IDF is defined as follows:
\begin{aligned} IDF_t = \log \frac{N}{DF_t} + 1 \end{aligned}
(21)
where N is the total number of documents in the corpus and $$DF_t$$ is the document frequency of term t.
A higher TF-IDF score implies that the term is frequent in the document but rare in the corpus, and vice versa. Its value for term t in a document d can be calculated as
\begin{aligned} \text{TF-IDF}_{t,d} = TF_{t,d} \cdot IDF_t \end{aligned}
(22)
Thus, by using both TF and IDF, TF-IDF captures the actual importance of terms on both document and corpus level.
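Eqs. 21 and 22 can be illustrated with a few lines of Python. This is a sketch only: it assumes raw counts for TF, the natural logarithm in Eq. 21 (the base is not stated above), and a corpus that is already tokenized; the function name is hypothetical.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: weight} dictionary per tokenized document."""
    n_docs = len(corpus)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}  # Eq. 21
    return [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}  # Eq. 22
        for doc in corpus
    ]


weights = tf_idf([["a", "b", "a"], ["a", "c"]])
```

In this toy corpus, "a" occurs in every document and therefore keeps an IDF of 1, whereas "b" occurs in only one document and is weighted up.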

### 4.5 Classifiers

In order to assess the impact of filter-based feature selection algorithms on the performance of traditional machine learning-based Urdu text document classification methodologies, we utilize support vector machine (SVM) [25] and Naive Bayes (NB) [26] classifiers. This is because, in state-of-the-art Urdu text document classification work, we have found that only these two classifiers mark promising performance [14–19].
Naive Bayes [26] uses Bayes' theorem and probability theory in order to make predictions. Naive Bayes [26] classifiers are usually categorized as generative classifiers and are highly useful for applications like document classification [89] and email spam detection [90]. The SVM [91] classifier, in contrast, is categorized as a discriminative classifier and is mostly used for anomaly detection [92] and classification problems [93]. It is a non-probabilistic linear classifier which plots each data sample as a coordinate point in multi-dimensional space and finds an optimal hyperplane that separates the classes effectively.

## 5 Deep learning methodologies

To better understand the multifarious deep learning methodologies adapted for Urdu text document classification, the learning of convolutional neural network (CNN), recurrent neural network (RNN), long short-term memory network (LSTM), and gated recurrent unit (GRU) is illustrated. Their differences are also explained through mathematical expressions in the following subsections.

### 5.1 Input layer

In our work, we have utilized either randomly initialized or 300-dimensional pre-trained neural word embeddings to create numeric representations of corpus words. In other words, given a document of n words $$w_{1}, w_{2}, w_{3}, \ldots , w_{n}$$, for each corpus word $$w_{i}$$, an embedding vector $$e_{i}$$ is generated by computing a matrix-vector product with the embedding matrix $$W \in R^{d \times |V|}$$, where |V| represents the size of the vocabulary and d is the dimension of the real-valued word embedding vector.
\begin{aligned} e_{\mathrm{i}}=Wv_{\mathrm{i}} \end{aligned}
(23)
In this way, each corpus document is represented by a sequence of word vectors $$e = e_{1}, e_{2}, e_{3}, \ldots , e_{n}$$ containing real values between 0 and 1. These features are then fed to a variety of feature extraction layers.
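The lookup in Eq. 23 can be sketched in plain Python. The example assumes that the vector multiplied with W is a one-hot indicator of the word's vocabulary index (the text does not define it explicitly), so the product simply selects one column of W; the function name and the toy matrix are illustrative.

```python
def embed(W, word_index):
    """Return W times a one-hot vector, i.e. column `word_index` of W.

    W is stored as d rows of |V| values each.
    """
    vocab_size = len(W[0])
    one_hot = [1.0 if j == word_index else 0.0 for j in range(vocab_size)]
    return [sum(w * v for w, v in zip(row, one_hot)) for row in W]


# Toy embedding matrix with d = 2 and |V| = 3.
W = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6]]
second_word = embed(W, 1)
```

In practice the product is never materialized; frameworks index the column directly, which gives the same result.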

### 5.2 Convolutional neural network (CNN)

Typically, a convolutional neural network (CNN) is composed of convolution and pooling layers followed by one or multiple fully connected layers, which at times are replaced by a global average pooling layer. Moreover, researchers have also experimented with dropout and batch normalization approaches to improve the performance of CNN [94]. Depth and components of CNN play a pivotal role in enhancing the task performance. Various components with their respective roles within CNN are briefly discussed below.

#### 5.2.1 Convolution layer

A convolutional layer has a collection of kernels that act as feature extractors. In the case of a symmetrical kernel, the convolution operation turns into a correlation operation [95]. Each kernel $$w \in R^{h \times d}$$ is applied to a window containing h words with a certain stride size in order to generate a new feature. For instance, a new feature $$c_i$$ is produced from the window of words $$x_{i:i+h-1}$$
\begin{aligned} c_i=f(w \cdot x_{i:i+h-1}+b) \end{aligned}
(24)
In Eq. 24, $$b \in R$$ represents the bias and f is a nonlinear function such as the hyperbolic tangent. This kernel is applied to every possible window of words present in a sentence, $$x_{1:h}, x_{2:h+1}, \ldots , x_{n-h+1:n}$$, to generate a feature map that can be represented as:
\begin{aligned} c=[c_1,c_2,c_3\ldots c_{n-h+1}] \end{aligned}
(25)
where $$c \in R^{n-h+1}$$. The model actually makes use of multiple kernels with diverse window and stride sizes to obtain a collection of features. Convolution operations can be classified into different types depending on the size and type of filters, padding type, and convolution direction [94].
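Eqs. 24 and 25 can be sketched for a single kernel as follows. The example assumes a stride of 1 and no padding, and treats the kernel as h rows that are matched against h consecutive word vectors; the function name is hypothetical.

```python
import math


def conv_feature_map(x, w, b=0.0, f=math.tanh):
    """Slide one kernel over a document and return the feature map c.

    x: list of n word vectors (each of length d)
    w: kernel given as h rows of length d
    """
    h = len(w)
    feature_map = []
    for i in range(len(x) - h + 1):  # n - h + 1 windows
        window = x[i:i + h]
        total = sum(
            w_val * x_val
            for w_row, x_row in zip(w, window)
            for w_val, x_val in zip(w_row, x_row)
        )
        feature_map.append(f(total + b))  # Eq. 24
    return feature_map
```

For a document of n = 5 words and a window of h = 3, the resulting map has n - h + 1 = 3 entries, as stated in Eq. 25.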

#### 5.2.2 Pooling layer

After extracting features from documents, capturing their relative positions becomes an important task, which is achieved by down-sampling or pooling. Pooling is a local operation that captures the dominant response of a local region by aggregating similar neighbourhood information [96]. The pooling operation can be expressed as:
\begin{aligned} Z_l^k=g_p(F_l^k) \end{aligned}
(26)
Here, $$Z_l^k$$ refers to the down-sampled feature map of the $$l^{th}$$ layer for the $$k^{th}$$ input feature map $$F_l^k$$, and $$g_p$$ represents the kind of pooling operation. Pooling mainly helps to acquire feature combinations that are invariant to small distortions and translational shifts [97, 98]. Reducing the size of the feature map lowers the complexity of the neural network and also helps improve generalization by reducing over-fitting. A variety of pooling operations are applied, such as average, min, max, overlapping, L2, spatial pyramid, etc. [94, 99].

#### 5.2.3 Activation function

It is a decision function and helps the network for learning complex patterns. Appropriate selection of activation function enhances the process of learning. For a feature map acquired through convolution operation, activation function can be written as:
\begin{aligned} T_l^k=g_a(F_l^k) \end{aligned}
(27)
Here, $$F_l^k$$ is produced by convolution and is passed to the activation function $$g_a$$. The activation function embeds nonlinearity and yields the output $$T_l^k$$ for the $$l^{th}$$ layer.
Critical analysis of the literature shows that multifarious activation functions have been used, like tanh, sigmoid, maxout, SWISH, ReLU, and its variants (LeakyReLU, PReLU, ELU) [99–103]. However, ReLU and its variants are mostly preferred by researchers as they greatly assist in dealing with the vanishing gradient issue [104, 105].

#### 5.2.4 Batch normalization

In order to resolve the issues related to internal covariate shift inside feature maps, batch normalization is widely used. Internal covariate shift refers to the change in distribution of the network's hidden unit values, which significantly decreases the convergence rate (by forcing the learning rate to a lower value) and demands careful initialization of model parameters. For a transformed feature map, batch normalization is given as follows:
\begin{aligned} N_l^k=\frac{F_l^k-\mu _b}{\sqrt{\sigma _b^2+\epsilon }} \end{aligned}
(28)
In Eq. 28, for a mini-batch, $$N_l^k$$ and $$F_l^k$$ refer to the normalized and input feature maps, while $$\sigma _b^2$$ and $$\mu _b$$ correspond to the variance and mean of a given feature map. To avoid division by zero, $$\epsilon$$ is added for numerical stability.
Batch normalization standardizes the distribution of feature map values by setting them to zero mean and unit variance [106]. It also smooths the gradient flow and serves as a regularizing factor, which improves network generalization to a great extent.
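The normalization step can be sketched on a flat list of activations, subtracting the batch mean and dividing by the square root of the batch variance plus a small constant. The sketch omits the learned scale and shift parameters that full batch normalization layers usually apply afterwards, and the default `eps` value is an assumption.

```python
import math


def batch_norm(values, eps=1e-5):
    """Normalize a mini-batch of activations to zero mean and (almost) unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(variance + eps) for v in values]


normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
```

The output variance is marginally below 1 because `eps` enlarges the denominator slightly.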

#### 5.2.5 Dropout

A dropout layer serves as a regularizer in a neural network that alleviates overfitting and improves generalization by randomly dropping a few connections or units with a particular probability [107]. In neural networks, several connections based on nonlinear relations become co-adapted at times; randomly dropping a few units creates multiple thinned network architectures, and afterwards one representative network architecture with quite small weights is selected. This selected architecture is then considered an approximation of all proposed networks [108].

#### 5.2.6 Fully connected layer

The layer which is used at the end of neural network is called fully connected layer. Unlike convolution and pooling, it can be classified as a global extraction. It takes the input from all feature extraction phases and analyses the result of former layers [ 109]. As a result, it creates a nonlinear association of selected features that are used for text classification [ 110].

### 5.3 Recurrent neural network (RNN) and its variants (LSTM, GRU)

Researchers have extensively utilized recurrent neural network (RNN) for text classification [ 111, 112]. As RNN allocates more weights to former points and takes into account information of former nodes, it analyses the structure of dataset in a more effective manner. Mostly, RNN makes use of long short-term memory (LSTM) or gated recurrent unit (GRU) that consists of embedding layer, hidden layers, and output layer. This methodology can be expressed as:
\begin{aligned} x_t=f(x_{t-1}, u_{t}, \theta ) \end{aligned}
(29)
At time step t, here $$x_t$$ represents state and $$u_t$$ refers to the input. Using weights, it can be expressed as:
\begin{aligned} x_{t}=W_{rec} \sigma (x_{t-1})+ W_{in} u_{t}+ b \end{aligned}
(30)
Here, $$W_{rec}$$ represents the recurrent weight matrix, $$W_{in}$$ the input weight matrix, b refers to the bias, and $$\sigma$$ denotes an element-wise nonlinear function.
RNN is highly vulnerable to exploding and vanishing gradient problems [ 113] when error of neural network is back propagated in the network. Due to these reasons, its variants LSTM and GRU are used mostly in experimentation, details of which are given below.

#### 5.3.1 Long short-term memory (LSTM)

Long short-term memory (LSTM), a special type of RNN given by Hochreiter et al. [114], addresses the shortcomings of RNN, such as the vanishing gradient issue, by preserving long-range dependencies in a very effective manner [114]. Although its chain-like architecture makes it quite similar to RNN, LSTM makes use of several gates in order to efficiently regulate the information which is allowed into the state of each node.
\begin{aligned} i_t&= \sigma (W_{i}[x_{t}, h_{t-1}]+ b_{i}) \end{aligned}
(31)
\begin{aligned} \tilde{C}_t&= \tanh (W_{c}[x_{t}, h_{t-1}]+ b_{c}) \end{aligned}
(32)
\begin{aligned} f_t&= \sigma (W_{f}[x_{t}, h_{t-1}]+ b_{f}) \end{aligned}
(33)
\begin{aligned} C_t&= i_{t}*\tilde{C}_{t}+f_{t}*C_{t-1} \end{aligned}
(34)
\begin{aligned} o_{t}&= \sigma (W_{o}[x_{t}, h_{t-1}]+ b_{o}) \end{aligned}
(35)
\begin{aligned} h_{t}&= o_{t}*\tanh (C_{t}) \end{aligned}
(36)
Here, Eq. 31 refers to the input gate, Eq. 32 to the value of the candidate memory cell, Eq. 33 represents the forget gate activation, Eq. 34 computes the value of the new memory cell, and Eqs. 35 and 36 describe the output gate and the final output. Moreover, b refers to bias, W represents the weight matrix, $$x_t$$ represents the input at time step t, and the indices i, c, f, o refer to the input gate, cell memory, forget gate, and output gate, respectively.
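One time step of the gates described above can be sketched for a single hidden unit, so that all weights are scalars. The sketch assumes the usual cell update in which the new memory is the input gate times the candidate memory plus the forget gate times the previous memory; the function names and the layout of `params` are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step for a single unit.

    params maps each of 'i', 'c', 'f', 'o' to a tuple (w_x, w_h, b).
    """
    def pre_activation(gate):
        w_x, w_h, b = params[gate]
        return w_x * x_t + w_h * h_prev + b

    i_t = sigmoid(pre_activation("i"))        # input gate
    c_tilde = math.tanh(pre_activation("c"))  # candidate memory
    f_t = sigmoid(pre_activation("f"))        # forget gate
    c_t = i_t * c_tilde + f_t * c_prev        # new memory cell
    o_t = sigmoid(pre_activation("o"))        # output gate
    h_t = o_t * math.tanh(c_t)                # hidden state / output
    return h_t, c_t


zero_params = {gate: (0.0, 0.0, 0.0) for gate in "icfo"}
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=2.0, params=zero_params)
```

With all weights at zero every gate equals 0.5 and the candidate memory is 0, so the step simply halves the previous cell state.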

#### 5.3.2 Gated recurrent unit (GRU)

Gated recurrent unit (GRU) is a more simplified version of LSTM architecture [ 115]. But unlike LSTM, it has two gates and does not have internal memory. In addition, it does not apply second nonlinearity [ 116].
\begin{aligned} z_{t}=\sigma _{g}(W_{z}*x_{t}+U_{z}*h_{t-1}+b_{z}) \end{aligned}
(37)
Here, $$z_t$$ is the update gate vector at time step t, $$x_t$$ is the input vector, and W, U, b are parameter matrices and vectors. The gate activation function $$\sigma _g$$ is either ReLU or sigmoid. The reset gate can be formulated as:
\begin{aligned} r_t =\sigma _{g}(W_{r}*x_{t}+U_{r}*h_{t-1}+b_{r}) \end{aligned}
(38)
where $$r_t$$ is the reset gate vector at time step t.
\begin{aligned} h_{t}=z_{t} \cdot h_{t-1}+(1-z_t)\cdot \sigma _h(W_h*x_t+U_h*(r_t \cdot h_{t-1})+b_h) \end{aligned}
(39)
For t, $$h_t$$ is the final output vector where $$\sigma _h$$ represents hyperbolic tangent operation.
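The three GRU equations can be sketched for a single hidden unit with scalar weights. The sketch assumes a sigmoid gate activation and a hyperbolic tangent for the candidate state; the function names and the layout of `params` are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gru_step(x_t, h_prev, params):
    """One GRU step for a single unit.

    params maps each of 'z', 'r', 'h' to a tuple (w, u, b).
    """
    w_z, u_z, b_z = params["z"]
    w_r, u_r, b_r = params["r"]
    w_h, u_h, b_h = params["h"]
    z_t = sigmoid(w_z * x_t + u_z * h_prev + b_z)                 # update gate, Eq. 37
    r_t = sigmoid(w_r * x_t + u_r * h_prev + b_r)                 # reset gate, Eq. 38
    candidate = math.tanh(w_h * x_t + u_h * (r_t * h_prev) + b_h)
    return z_t * h_prev + (1.0 - z_t) * candidate                 # Eq. 39


zero_params = {gate: (0.0, 0.0, 0.0) for gate in "zrh"}
h_t = gru_step(x_t=1.0, h_prev=0.8, params=zero_params)
```

A strongly positive update-gate bias drives the update gate towards 1, in which case the unit copies its previous state almost unchanged.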

### 5.4 Selection and optimization of model parameters

To achieve optimal performance in multifarious natural language processing (NLP) tasks like classification, selection and tuning of hyperparameters are quite crucial in deep learning approaches. Inappropriate selection or tuning not only hurts the generalization of neural networks, which eventually leads to a significant decline in performance, but may also cause endless training and ineffective consumption of valuable resources. Due to the humongous number of hyperparameters, selecting the most crucial ones based on time and resource complexity and performance impact is not a straightforward task at all. In addition, optimization of hyperparameters is considered a black-box search over x, in a way that for a defined function $$f: S \subset R^d \rightarrow R$$, the values f(x) are quite small and the function f is stochastic by nature. This corresponds to the scenario where one searches for the best hyperparameter setting for a certain model by trying multiple values of such parameters and choosing the value that yields the best performance on validation data.
Existing hyperparameter search approaches can be classified into pattern search [117], Gaussian processes [118, 119], evolution strategies [120], random sampling [121], and grid sampling [121]. Building on the critical findings related to the most crucial hyperparameters acquired by Yin et al. [122] after performing a thorough comparative analysis of deep neural networks for diverse NLP tasks, in our work, we have also selected only the most influential hyperparameters. More specifically, we tweak hidden size, batch size, optimizer, learning rate, momentum, loss criterion, activation function, dropout, number of kernels, and their sizes. To find optimal values of the selected parameters, instead of functional or manual evaluation, which proves extremely expensive, we have employed the most widely used hyperparameter optimization approach, namely grid search. In grid search, a set of hyperparameter values is chosen beforehand (in a random manner or on the grid) and model training is performed in parallel. Grid search is highly scalable and quite easy to execute. In our experimentation, we have tried different batch sizes [10, 20, 30, 40, 50, 60, 70, 80, 90, 100], optimizers [SGD, RMSprop, Adadelta, Adagrad, Adam, Nadam, Adamax], learning rates [0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5], momentum [0.0, 0.2, 0.4, 0.6, 0.8, 0.9], dropout rates [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], hidden units [1, 5, 10, 15, 20, 25, 30], number of kernels [$$2^3$$–$$2^8$$], kernel sizes [25, 50, 75, 100, 125, 150, 200], epochs [10, 20, 30, 40, 50], and loss criteria [categorical cross entropy, binary cross entropy]. For pre-trained language model BERT-based transfer learning, we find the multilingual cased model, containing 12 heads, 12 layers, 110M parameters, and 768 hidden units and pre-trained over corpora of 104 languages, to be the most optimal among all available BERT variants.
In terms of parameters, we experiment with sequence length ranging from $$2^4$$ to $$2^9$$, batch size from $$2^3$$ to $$2^9$$, learning rate [1e-1, 1e-2, 1e-3, 1e-4 or 1e-5, 2e-1, 2e-2, 2e-3, 2e-4 or 2e-5], buffer size [100, 200, 300, 400, 500], and epochs [10, 20, 30, 40, 50]. On the other hand, for machine learning-based Urdu text classification, we have wrapped support vector machine (SVM) into the one-against-rest classification paradigm with linear and rbf kernels and balanced class weights. In order to find optimal gamma and cost values, we utilize grid search to compute the accuracy of every parameter combination ($$\log _2 \gamma$$ ranging from 3 to −15 with a step of −2, and $$\log _2 C$$ ranging from −5 to 15 with a step of 2). In contrast, Naive Bayes is used with default parameters.
In our work, SVM marks better performance with the linear kernel, whereas all deep learning models mark optimal performance with the root mean square propagation (RMSprop) optimizer, a learning rate of 0.001, the categorical cross-entropy loss criterion, and a batch size of 50 when executed for 20 epochs. Using BERT, we achieve optimal performance with a buffer size of 400, sequence length of 512, batch size of 16, and learning rate of 1e-5 by training the model for up to 50 epochs. Experimentation with a variety of deep learning models indicates that changes in batch size and hidden size significantly influence model performance. To sum up, for diverse deep learning models, we consider batch size and hidden size to be really crucial parameters, tweaking of which leads to optimal or sub-optimal performance.
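The exhaustive search over cost and gamma values for the SVM can be sketched as below. The sketch assumes exponent ranges of −5 to 15 in steps of 2 for the cost and 3 to −15 in steps of −2 for gamma, with inclusive endpoints. The `score` argument is a hypothetical callback that would return, for example, the validation accuracy of an SVM trained with the given pair; it is replaced here by a toy function so that the example runs on its own.

```python
from itertools import product


def svm_grid():
    """Enumerate all (C, gamma) pairs on the exponential grid."""
    log2_c = range(-5, 16, 2)       # -5, -3, ..., 15
    log2_gamma = range(3, -16, -2)  # 3, 1, ..., -15
    return [(2.0 ** c, 2.0 ** g) for c, g in product(log2_c, log2_gamma)]


def grid_search(score, grid):
    """Return the grid point with the highest score."""
    return max(grid, key=lambda point: score(*point))


def toy_score(cost, gamma):
    # Hypothetical stand-in for validation accuracy; peaks at C = 2, gamma = 0.125.
    return -((cost - 2.0) ** 2 + (gamma - 0.125) ** 2)


best_cost, best_gamma = grid_search(toy_score, svm_grid())
```

Since every grid point is evaluated independently, the calls to `score` can be distributed across workers, which is what makes grid search easy to parallelize.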

### 5.5 Adopted deep learning methodologies for Urdu text document classification

This section summarizes state-of-the-art deep learning-based methodologies adapted for the task of Urdu text document classification. In order to provide a bird's eye view on adopted deep learning methodologies, generalized architecture is drawn in Fig. 2.
We adapt a multi-channel CNN model presented by Yoon Kim [123] for the task of sentiment, question, and sentence classification. In order to reap the benefits of distinct pre-trained word vectors, the author, for the very first time, made a few channels dynamic and kept the others static throughout training in order to prevent overfitting. Several researchers (e.g. Nabeel et al. [46]) utilized this model for English text document classification and achieved state-of-the-art performance. In our experimentation, we have fed FastText neural word embeddings at one channel and pre-trained neural word embeddings provided by Haider et al. [36] at the second channel. At the third channel, we have used randomly initialized word embeddings. In order to avoid overfitting, we keep the FastText embeddings static and fine-tune the other embeddings during training.
Embedding layer of this model is followed by 3 convolution layers with 128 filters of size 3, 4, and 5, respectively. After that, extracted features of all convolution layers are concatenated and fed to another convolution layer having 128 filters of size 5. After applying max-pooling of size 100, the extracted features are then passed to a flatten layer which flattens the features. These flattened features are then passed to a dense layer with 128 output units which are followed by a dropout layer of rate 0.5. Finally, a last dense layer acts as a classifier.
Another CNN-based approach adapted for Urdu text document classification was presented by Nal Kalchbrenner et al. [124]. A distinct aspect of this model was the use of wide convolutions. The authors claimed that the words at the edges of a document do not actively participate in convolution and get neglected, especially when the filter size is large. An important term can occur anywhere in the document, so by using wide convolution every term takes an equal part while convolving. Although the authors originally did not use any pre-trained word embeddings in the proposed CNN architecture, we have utilized pre-trained word embeddings.
This model begins with an embedding layer, followed by convolution layer with 64 filters of size 50. Top five features are extracted from the convolution layer by using a K-max-pooling layer of value 5. Zero padding is utilized to maintain the wide convolution structure. After that, there is another convolution layer with 64 filters of size 25. This layer is followed by a K-max-pooling layer of value 5. Finally, the extracted features are flattened and passed to a dense layer which classifies the documents.
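K-max pooling, as used in this architecture, keeps the k largest activations of a feature map while preserving their original order. A minimal sketch on a plain list is shown below; the function name is illustrative and ties are resolved in favour of the earlier position.

```python
def k_max_pooling(values, k):
    """Return the k largest values in the order in which they appear."""
    # Indices of the k largest activations (stable for equal values).
    top = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    # Restore the original sequence order.
    return [values[i] for i in sorted(top)]


pooled = k_max_pooling([3, 1, 4, 1, 5, 9, 2, 6], k=5)
```

Unlike ordinary max-pooling, which keeps a single value per region, this retains several strong features together with their relative positions.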
Yin et al. [ 125] proposed a CNN model for the task of binary or multi-class sentiment analysis, and subjectivity-based question classification for the English language. The significance of the multi-channel input layer was deeply explored by the author by using five different pre-trained word vectors. This model has outperformed eighteen baseline machine and deep learning methodologies [ 125] for sentiment and question classification tasks. While adopting this model, we have utilized two embedding layers, two convolution layers along with wide convolutions.
The model starts with two embedding layers, and each embedding layer is followed by two wide convolution layers with 128 filters of size 3 and 5, respectively. Each convolution layer is followed by a K-max-pooling layer of size 30. After that, both convolution layers are followed by two other convolution layers of the same architecture except the value of k which is 4 in K-max-pooling layers. All the features from all convolution layers are then concatenated and flattened by using a flatten layer. These flattened features are then passed to two dense layers from which the first dense layer has 128 output units and the last dense layer acts as a classifier.
Just like the CNN-based approach of Yin et al. [125], Zhang et al. [126] also proposed a CNN-based approach for text classification. In the proposed approach, they not only experimented with three different pre-trained neural word embeddings but also applied l2 norm regularization before and after concatenating all features of different channels. While adopting this model in our experimentation, three embedding layers, l2 norm regularization after features concatenation, and wide convolutions are utilized.
The model starts with three embedding layers, and each embedding layer is then followed by two convolution layers. Both convolution layers have 16 filters of size 3 and 5, respectively, which are followed by a global max-pooling layer. After that, features of all layers are concatenated and l2 norm regularization is applied using a dense layer with 128 output units. These features are then passed to a dense layer which acts as a classifier.
Dani Yogatama et al. [ 127] proposed an LSTM-based neural network model for classifying news articles, questions, and sentiments. Two different versions of the model, namely generative and discriminative LSTM model, were proposed. Both models were the same except that the discriminative model tried to maximize the conditional probability, while the generative model maximized the joint probability. We adopt discriminative version of the model. This model begins with an embedding layer, and output of the previous layer is fed to an LSTM layer which has 32 units. The features extracted by LSTM are then flattened and passed to a dense layer for classification.
Another LSTM-based model was proposed by Hamid Palangi et al. [128] to generate sentence neural embeddings for raising the performance of the document retrieval task. This model was not used for any sort of text classification, but as its architecture is quite similar to the model proposed by Yogatama et al. [127], we have adopted it for our experimentation. The output of the first embedding layer is fed to an LSTM layer which has 64 output units. The output of the LSTM layer is then flattened and fed into a dense layer that acts as a classifier.
As discussed before, both CNN and RNN have their own benefits and drawbacks [122]. In order to reap the benefits of both architectures, researchers proposed hybrid models [122, 129–132] in which a CNN architecture is usually followed by an RNN. CNN extracts global features [129, 133, 134], while RNN learns long-term dependencies for the extracted features [116, 135–141].
A hybrid model was presented by Siwei Lai et al. [ 142] for the task of text classification. The author claimed that RNN was a biased model in which later words were more dominant than earlier words. To tackle this problem, a hybrid model was suggested that consists of bidirectional LSTM, followed by a max-pooling layer. The bidirectional nature of the model reduces the words dominance, whereas max-pooling layer captures more discriminative features. This model has outperformed twelve machine and deep learning-based models for the task of text classification.
The model begins with three embedding layers; the first one is passed to a forward LSTM layer, and the second one is fed to a backward LSTM layer. Both LSTM layers have 100 output units. The features yielded by both LSTMs are concatenated along with the third embedding layer and passed to a dense layer which has 200 output units. The dense layer is followed by a max-pooling layer, and the output of the max-pooling layer is then passed to another dense layer which acts as a classifier.
Guibin Chen et al. [143] proposed another hybrid model that consists of CNN and LSTM and was used for multi-label text classification. Pre-trained word embeddings were used to feed the CNN, and then, features were extracted to feed the LSTM. The authors claimed that the pre-trained word vectors contain the local features of each word, whereas CNN captured the global features of the input document. Both local and global features were then used by LSTM to predict the sequence of labels. We have adopted this model for multi-class classification instead of multi-label classification.
The model starts with an embedding layer which is followed by five convolution layers with 128 filters of sizes 10, 20, 30, 40, and 50, respectively. Each convolution layer is followed by a max-pooling layer of the same filter size. The output features from all five max-pooling layers are concatenated and flattened using a flatten layer. These flattened features are then passed to a dense layer which has 128 output units. The output from the dense layer along with the output of the embedding layer is then passed to an LSTM layer. This LSTM layer is followed by another dense layer that acts as a classifier.
Another hybrid model based on CNN and LSTM was proposed by Chunting Zhou et al. [ 144] for sentiment analysis and question classification. CNN was used to capture high-level word features, whereas LSTM extracted long-term dependencies. Different types of max-pooling layers were applied to the features extracted by the CNN. However, the authors suggested that max-pooling must be avoided if the features are to be passed to an LSTM, because LSTM operates on sequential input and a max-pooling layer would break the sequential structure.
The output of the first embedding layer is passed to five convolution layers which have 64 filters of size 10, 20, 30, 40, and 50, respectively. The features extracted by these five convolution layers are then concatenated and fed to an LSTM layer which has 64 output units. This layer is followed by two dense layers, of which the first has 128 units and the last acts as a classifier.
The last model chosen in our research is also a hybrid model, presented by Xingyou Wang et al. [ 145] for sentiment classification. The theory behind this model is the same as that of the Chunting Zhou et al. [ 144] model, except that it used both LSTM and GRU along with max-pooling layers after the CNN. Based on experimental results, the authors claimed that LSTM and GRU produced the same results; therefore, we have adopted this model with LSTM only for our experimentation.
This model begins with an embedding layer, followed by three convolution layers which have 64 filters of size 3, 4, and 5, respectively. Each convolution layer is followed by a max-pooling layer of the same filter size. After that, the output features of all max-pooling layers are concatenated and passed to an LSTM layer which has 64 units. The features yielded by the LSTM layer are then passed to a dense layer which has 128 units. This layer is followed by another dense layer that finally acts as a classifier.

### 5.6 Transfer learning using BERT

This section discusses the usefulness of transfer learning with the pre-trained language model BERT [ 42] for the task of Urdu text document classification. Pre-training language models has proven extremely useful for learning generic language representations. All deep learning-based classification methodologies discussed in the previous section utilize pre-trained neural word embeddings such as Word2vec [ 146], FastText [ 147], and GloVe [ 148].
Traditional neural word embeddings are classified as static embeddings. They are prepared by training a model on a very large corpus in an unsupervised manner so that it acquires the syntactic and semantic properties of words to a certain extent. However, these embeddings fail to handle polysemy, which requires generating distinct embeddings for the same word in different contexts [ 37–39]. For instance, consider the two sentences “Saim, I’ll get late as I have to deposit some cash in Bank” and “My house is located in canal Bank”. The word “Bank” has a different meaning in each sentence. However, models built on top of static neural word embeddings do not consider the context in which words appear; thus, the word “Bank” gets the same vector representation in both sentences, which is not correct.
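The limitation described above can be illustrated with a minimal sketch. It assumes a toy lookup table whose vectors are made-up values (they are not taken from Word2vec, FastText, or GloVe), and the helper name `embed` is hypothetical. Because each token is looked up independently of its neighbours, “Bank” receives an identical vector in both example sentences.

```python
# Toy static embedding table; the vectors are made-up illustrative values,
# not real Word2vec / FastText / GloVe weights.
STATIC_EMBEDDINGS = {
    "bank": (0.21, -0.47, 0.90),
    "cash": (0.75, 0.10, -0.33),
    "canal": (-0.58, 0.64, 0.02),
}
UNKNOWN = (0.0, 0.0, 0.0)  # fallback vector for out-of-vocabulary tokens


def embed(tokens):
    """Look up every token independently of the words around it."""
    return [STATIC_EMBEDDINGS.get(tok.lower(), UNKNOWN) for tok in tokens]


finance = "I have to deposit some cash in Bank".split()
location = "My house is located in canal Bank".split()

bank_in_finance = embed(finance)[finance.index("Bank")]
bank_in_location = embed(location)[location.index("Bank")]

# A static table cannot tell the two senses apart.
print(bank_in_finance == bank_in_location)
```

A contextual model would instead compute the vector for “Bank” from the whole sentence, so the two occurrences could differ.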
These shortcomings are resolved by pre-trained language models, which learn the vector representation of a word from the context in which it appears. This is why the embeddings of pre-trained language models such as Bidirectional Encoder Representations from Transformers (BERT [ 42]) are categorized as dynamic contextualized embeddings. Dynamic contextualized embeddings capture word semantics in different contexts and thereby address the polysemous, context-dependent nature of words. In this way, language models such as BERT [ 42] create different embeddings for the same word when it appears in different contexts. Traditional language models are trained from left to right; thus, they are framed to predict the next word. In contrast, a few approaches such as Universal Language Model Fine-Tuning (ULMFiT) [ 41] and Embeddings from Language Models (ELMo) [ 40] are based on Bi-LSTM. A Bi-LSTM is trained from left to right to predict the next word and from right to left to predict the previous word, but not in both directions at the same time, whereas BERT [ 42] utilizes the entire sentence to learn from all words located at different positions. It randomly masks words in a given context and then predicts the masked words. In addition, it uses the transformer architecture, which further improves its accuracy.
To summarize, owing to masked language modelling, BERT [ 42] surpasses the performance of other language modelling approaches such as ULMFiT [ 41] and ELMo [ 40]. Moreover, training the transformer architecture bidirectionally for language modelling has proved extremely effective, as it yields a deeper understanding of language context than unidirectional language models. Although BERT [ 42] has produced promising results in several natural language processing (NLP) tasks, only limited research exists on optimizing BERT [ 42] for target NLP tasks. In this paper, we thoroughly investigate how to make the best use of BERT [ 42] for the task of text document classification. We explore several methods of fine-tuning BERT [ 42] in order to maximize its performance for Urdu text document classification. We perform pre-processing in the same manner as discussed in detail in Sect. 4.2.
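The random masking step can be sketched as follows. This is a simplified illustration, not BERT's actual pre-training code: the function name `mask_tokens` and the `seed` argument are hypothetical, the default masking rate of 15% is the value reported in the BERT paper rather than in this section, and BERT's additional rule of sometimes keeping or randomly replacing a selected token is omitted.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK].

    Returns the masked sequence and a dict {position: original token}
    holding the words a masked language model would have to predict.
    """
    rng = random.Random(seed)  # local generator keeps the sketch reproducible
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets[position] = token
        else:
            masked.append(token)
    return masked, targets


tokens = "my house is located in canal bank".split()
masked, targets = mask_tokens(tokens)
print(masked, targets)
```

Every prediction target is surrounded by its left and right context at the same time, which is the property that distinguishes this objective from left-to-right or separately trained bidirectional models.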

### 5.7 Hybrid methodology for Urdu text document classification

This section explains the hybrid methodology for the task of Urdu text document classification. Deep learning-based methodologies are generally considered to automate the process of feature engineering; however, recent research in computer vision [ 149] and natural language processing (NLP) [ 46] suggests that these methodologies also extract some irrelevant and redundant features, which eventually degrade the performance of the underlying methodologies. To remove irrelevant and redundant features in NLP, we [ 46] proposed a hybrid methodology which combines the benefits of traditional machine learning-based feature engineering and deep learning-based automated feature engineering. In the proposed hybrid methodology, a vocabulary of discriminative features is first developed using a filter-based feature selection algorithm, namely normalized difference measure (NDM) [ 47], and the constructed vocabulary is then fed to the embedding layer of a CNN. The hybrid methodology produced promising figures on two benchmark English datasets, 20-Newsgroup 6 and BBC 7, when compared against the performance figures of traditional machine and deep learning methodologies. To assess whether the hybrid approach is versatile, and whether its effectiveness is independent of the size of the training data, the language, and the deep learning model, we experiment with different datasets, a different language, and a variety of deep learning models. We adopt 4 CNN, 2 RNN, and 4 hybrid (CNN+RNN) models which were previously used for text document or sentence classification (discussed in Sect. 5.5). The hybrid approach is evaluated on three Urdu datasets (CLE Urdu Digest 1000K, CLE Urdu Digest 1M, DSL Urdu News) (Fig. 3).
We perform pre-processing in the same manner as discussed in detail in Sect.  4.2.
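The vocabulary construction step of the hybrid methodology can be sketched as below. This section does not restate the NDM formula, so the sketch assumes the one-vs-rest form |tpr − fpr| / min(tpr, fpr) from the cited work [ 47], with a small constant `eps` standing in for a zero denominator. The function names, the toy documents, and the convention of reserving index 0 for padding are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict


def ndm_scores(docs, labels, target, eps=1e-6):
    """One-vs-rest NDM score per term: |tpr - fpr| / min(tpr, fpr).

    Assumed form of NDM; eps replaces a zero denominator.
    """
    n_pos = sum(1 for y in labels if y == target)
    n_neg = len(labels) - n_pos
    tp, fp = defaultdict(int), defaultdict(int)
    for tokens, y in zip(docs, labels):
        for term in set(tokens):  # document frequency, not term frequency
            if y == target:
                tp[term] += 1
            else:
                fp[term] += 1
    scores = {}
    for term in set(tp) | set(fp):
        tpr = tp[term] / max(n_pos, 1)
        fpr = fp[term] / max(n_neg, 1)
        scores[term] = abs(tpr - fpr) / max(min(tpr, fpr), eps)
    return scores


def build_vocabulary(docs, labels, k):
    """Union of the top-k NDM-ranked terms of every class, mapped to indices."""
    selected = set()
    for target in sorted(set(labels)):
        scores = ndm_scores(docs, labels, target)
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        selected.update(ranked[:k])
    # Index 0 is left free for padding in the embedding layer (assumption).
    return {term: i + 1 for i, term in enumerate(sorted(selected))}


def encode(tokens, vocabulary):
    """Keep only selected terms and map them to embedding-layer indices."""
    return [vocabulary[t] for t in tokens if t in vocabulary]


docs = [["cricket", "match", "team"], ["cricket", "score", "team"],
        ["market", "price", "team"], ["market", "stock", "price"]]
labels = ["sports", "sports", "business", "business"]

vocabulary = build_vocabulary(docs, labels, k=2)
print(vocabulary, encode(["cricket", "team", "market"], vocabulary))
```

In this toy run the term "team", which occurs in both classes, is ranked low and dropped, so only discriminative terms reach the embedding layer.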

## 6 Datasets

To evaluate the methodologies based on machine learning, deep learning, the hybrid approach, and language modelling, we use two state-of-the-art closed source corpora, CLE Urdu Digest 1000K and CLE Urdu Digest 1M, and one newly presented, publicly available corpus, namely DSL Urdu news. All textual documents of the DSL Urdu news dataset are crawled from the following web sites: Daily Jang 8, Urdu Point 9, HmariWeb 10, and BBC Urdu 11, and parsed with Beautiful Soup 12. Table 3 illustrates the characteristics of the newly developed corpus, which has 300 K words, 4224 sentences, and a total of 662 documents belonging to the following six categories: health-science, sports, business, agriculture, world, and entertainment. The average length of a document in the developed corpus is approximately 193 words.

Table 3 DSL Urdu news dataset statistics

| Class | No. of documents | No. of sentences | No. of tokens | No. of tokens after lemmatization |
|---|---|---|---|---|
| Agriculture | 102 | 669 | 17,967 | 9856 |
| Business | 120 | 672 | 20,349 | 9967 |
| Entertainment | 101 | 685 | 19,671 | 10,915 |
| World | 111 | 631 | 18,589 | 12,812 |
| Health-sciences | 108 | 823 | 27,409 | 12,190 |
| Sports | 120 | 744 | 24,212 | 9992 |


Table 4 CLE Urdu Digest 1000K dataset statistics before and after lemmatization

| Class | No. of documents | No. of sentences | No. of tokens | No. of tokens after lemmatization |
|---|---|---|---|---|
| Culture | 28 | 488 | 8767 | 8767 |
| Health | 29 | 608 | 9895 | 9895 |
| Letter | 35 | 777 | 11,794 | 11,794 |
| Interviews | 36 | 597 | 12,129 | 12,129 |
| Press | 29 | 466 | 10,007 | 10,007 |
| Religion | 29 | 620 | 9839 | 9839 |
| Science | 55 | 468 | 8700 | 8700 |
| Sports | 29 | 588 | 10,030 | 10,030 |

The state-of-the-art corpus CLE Urdu Digest 1000K contains 270 news documents, and CLE Urdu Digest 1M contains 787 news documents; both cover 8 classes. The former is a compact corpus with an average document length of nearly 140 words, whereas the latter is a large corpus with an average document length of 900 words. Statistics of both corpora with respect to each class are reported in Tables 4 and 5, respectively.

Table 5 CLE Urdu Digest 1M dataset statistics before and after lemmatization

| Class | No. of documents | No. of sentences | No. of tokens | No. of tokens after lemmatization |
|---|---|---|---|---|
| Culture | 133 | 8784 | 145,228 | 145,228 |
| Health | 153 | 11,542 | 169,549 | 169,549 |
| Letter | 105 | 8565 | 115,177 | 115,177 |
| Interviews | 38 | 2481 | 41,058 | 41,058 |
| Press | 118 | 6106 | 125,896 | 125,896 |
| Religion | 100 | 6416 | 107,071 | 107,071 |
| Science | 109 | 6966 | 117,344 | 117,344 |
| Sports | 31 | 2051 | 33,143 | 33,143 |

In order to apply machine learning-based text document classification methodologies to a corpus, the textual documents of each class need to be asymmetric, i.e. distributed differently, when compared with the documents of other classes. To perform a distribution analysis of the experimental datasets with respect to their classes, we employ the widely used Kullback–Leibler (KL) divergence approach [ 150], following the work of Stehlik et al. [ 151]. It helps determine whether samples of distinct classes of one or more datasets are symmetrical or asymmetrical by nature. The divergences we obtain empirically reveal that samples belonging to different classes are asymmetrical for each dataset. This asymmetry supports the conclusion that none of the experimental datasets is biased towards samples of one particular class.
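A minimal sketch of such a distribution analysis is given below. It assumes that each class is represented by a smoothed unigram distribution over a shared vocabulary; the text does not specify which representation the divergence was computed on, so this choice, the add-one smoothing, the function names, and the toy documents are all illustrative assumptions.

```python
import math
from collections import Counter


def term_distribution(docs, vocabulary, alpha=1.0):
    """Smoothed unigram distribution of one class over a shared vocabulary."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts[t] for t in vocabulary) + alpha * len(vocabulary)
    return {t: (counts[t] + alpha) / total for t in vocabulary}


def kl_divergence(p, q):
    """KL(p || q) in nats; p and q must be defined over the same terms."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)


sports_docs = [["match", "team", "score"], ["team", "match", "win"]]
business_docs = [["market", "price", "stock"], ["price", "market", "team"]]
vocabulary = sorted({t for d in sports_docs + business_docs for t in d})

p_sports = term_distribution(sports_docs, vocabulary)
p_business = term_distribution(business_docs, vocabulary)

kl_sb = kl_divergence(p_sports, p_business)  # sports relative to business
kl_bs = kl_divergence(p_business, p_sports)  # business relative to sports
print(round(kl_sb, 4), round(kl_bs, 4))
```

A divergence of zero would mean that two classes share the same term distribution; clearly positive values in both directions indicate that the classes are distinguishable from their vocabulary alone.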

## 7 Experimental setup and results

This section summarizes the APIs that are used to perform Urdu text document classification. It also discusses the results produced by the methodologies based on machine learning, deep learning, and the hybrid approach on the three datasets (DSL Urdu news, CLE Urdu Digest 1000K, CLE Urdu Digest 1M) used in our experimentation. To process Urdu text for the task of Urdu text document classification, we develop a rule-based sentence splitter and tokenizer. To evaluate the machine learning-based Urdu text document classification methodology, all three datasets are split into train and test sets containing 70% and 30% of the documents of each class, respectively. The parameters of the Naive Bayes [ 26] classifier are alpha=1.0, fit_prior=True, and class_prior=None, and the SVM [ 25] classifier is used with a linear kernel and balanced class weights.
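The per-class 70%/30% split can be sketched as follows. The function name `stratified_split`, the fixed seed, and the rounding rule for the test share are assumptions made for illustration; the text only states that each class contributes 70% of its documents to the train set and 30% to the test set.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.30, seed=42):
    """Return (train_indices, test_indices), sampled separately per class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)  # assumed rounding rule
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


labels = ["sports"] * 10 + ["health"] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting within each class keeps the class proportions of the test set equal to those of the full corpus, which matters for the unbalanced CLE corpora.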
On the other hand, to evaluate the performance of the adopted deep learning methodologies and to perform a fair comparison with the machine learning-based approaches on all three datasets, we use 30% of the data as the test set, and the remaining 70% is further split into train and validation sets containing 60% and 10% of the data, respectively. We use the Keras API to implement the ten adopted neural network-based models. Pre-trained Urdu word embeddings provided by Haider et al. [ 36] and FastText 13 are used to feed all embedding layers except the second layer in the Yin et al. [ 125] model and the second and third layers in the Zhang et al. [ 126] model, which are randomly initialized. To evaluate and compare the performance of the filter-based feature selection algorithms, we first rank the features of the training corpus against all classes. Then, at different predefined test points, we take the top k features from all classes and feed these features to two different classifiers, SVM [ 25] and Naive Bayes [ 26]. For the adopted deep learning-based Urdu text document classification methodologies, we perform experimentation in two different ways. In the first case, after pre-processing, we select the entire set of unique terms of each corpus and feed it to the embedding layer of all adopted models (discussed in detail in Sect. 5.5), whereas in the second case, we select the 1000 most frequent terms for the DSL Urdu News and CLE Urdu Digest 1000K datasets, and the 10,000 most frequent terms for the CLE Urdu Digest 1M dataset.
Likewise, to evaluate the performance of the hybrid approach, which reaps the benefits of both machine and deep learning-based feature engineering, we first rank the features of the training corpus of each dataset using the NDM [ 47] feature selection algorithm, as in machine learning-based classification. Then, the top k features of each class are fed to 10 different deep learning models. Rather than repeating extensive experimentation with all feature selection algorithms, and considering the promising performance produced by NDM [ 47] with the machine learning-based methodologies, we only explore the impact of the NDM [ 47] feature selection algorithm on the 10 deep learning-based classification methodologies.
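The two vocabulary settings used for the deep learning models (all unique terms versus only the most frequent terms) can be sketched with a single helper. The function name `frequency_vocabulary`, the alphabetical tie-breaking, and the reservation of index 0 for padding are illustrative assumptions.

```python
from collections import Counter


def frequency_vocabulary(docs, max_terms=None):
    """Map terms to embedding indices, most frequent first.

    max_terms=None keeps every unique term (first setting);
    an integer keeps only that many most frequent terms (second setting).
    """
    counts = Counter(tok for doc in docs for tok in doc)
    ranked = sorted(counts, key=lambda term: (-counts[term], term))
    if max_terms is not None:
        ranked = ranked[:max_terms]
    # Index 0 is left free for padding (assumption).
    return {term: i + 1 for i, term in enumerate(ranked)}


docs = [["a", "b", "a"], ["b", "a", "c"]]
full_vocab = frequency_vocabulary(docs)            # all unique terms
top_vocab = frequency_vocabulary(docs, max_terms=2)  # most frequent terms only
print(full_vocab, top_vocab)
```

For the experiments described above, `max_terms` would correspond to 1000 for the two smaller datasets and 10,000 for CLE Urdu Digest 1M.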
To assess the effectiveness of transfer learning using BERT [ 42], we fine-tune the multilingual cased language model (BERT-Base [ 42]), which has 12 layers, 12 attention heads, 768 hidden units, and 110M parameters and is pre-trained on 104 languages. We utilize the multilingual cased model as it resolves normalization problems in several languages. We fine-tune the multilingual model with a buffer size of 400, a sequence length of 512, a batch size of 16, and a learning rate of 1e-5 for 50 epochs.
As the two closed source experimental datasets (CLE Urdu Digest 1000K and 1M) are highly unbalanced, we perform evaluation using the F1 measure rather than accuracy or any other evaluation measure, as F1 is widely considered a more appropriate evaluation measure for unbalanced datasets.
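The reported fine-tuning settings can be collected in a small configuration object, as in the sketch below. The class and helper names are hypothetical, the role of the buffer size is not specified in the text and is kept only as a reported value, and the training-set size of 397 documents used in the example is an assumed figure (roughly 60% of the 662 DSL Urdu news documents).

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters as reported in the text."""
    max_seq_length: int = 512
    batch_size: int = 16
    learning_rate: float = 1e-5
    epochs: int = 50
    buffer_size: int = 400  # role not specified in the text; kept as reported


def optimizer_steps(n_train_docs, cfg):
    """Total number of parameter updates over all epochs."""
    return math.ceil(n_train_docs / cfg.batch_size) * cfg.epochs


def truncate(token_ids, cfg):
    """Clip a tokenized document to the maximum sequence length.

    Simplification: special tokens such as [CLS] and [SEP] are ignored here.
    """
    return token_ids[: cfg.max_seq_length]


cfg = FineTuneConfig()
print(optimizer_steps(397, cfg), len(truncate(list(range(900)), cfg)))
```

Since the average CLE Urdu Digest 1M document is reported to be about 900 words long, a 512-token limit implies that many of its documents are truncated before classification.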
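The following sketch shows why F1 is preferred over accuracy on unbalanced data. It assumes macro averaging, i.e. an unweighted mean of per-class F1 scores; the text does not state which averaging scheme was used, and the labels are toy values.

```python
def f1_per_class(y_true, y_pred):
    """F1 score of every class, computed from TP, FP and FN counts."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# A classifier that always predicts the majority class.
y_true = ["health"] * 9 + ["sports"]
y_pred = ["health"] * 10

print(accuracy(y_true, y_pred), round(macro_f1(y_true, y_pred), 3))
```

Accuracy rewards the majority-class predictor, whereas the F1 of the ignored minority class is zero and pulls the macro average down.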

### 7.1 Results of traditional machine learning-based text document classification methodology

This section summarizes and compares the performance of ten feature selection algorithms (RDC [ 56], MMR [ 21], NDM [ 47], POISON [ 60], GINIINDEX [ 59], ACC2 [ 55], ODDS [ 58], IG [ 57], CHISQ [ 48], BNS [ 55]) on two closed source corpora (CLE Urdu Digest 1000K, CLE Urdu Digest 1M) and one newly developed corpus (DSL Urdu News) using the Naive Bayes [ 26] and SVM [ 25] classifiers. We compare the performance of the feature selection algorithms in terms of F1 score at a predefined set of test points, i.e. numbers of top-ranked features: {10, 20, 50, 100, 200, 500, 1000, 1500}.

#### 7.1.1 DSL Urdu news dataset

Tables 6 and 7 summarize the performance of ten feature selection algorithms produced against 8 different benchmark test points over DSL Urdu news dataset using Naive Bayes [ 26] and SVM [ 25] classifiers, respectively.

Table 6 Performance of ten feature selection algorithms against 8 different benchmark test points over DSL Urdu news dataset using Naive Bayes classifier

| Feature selection algorithm / Benchmark test point | 10 | 20 | 50 | 100 | 200 | 500 | 1000 | 1500 |
|---|---|---|---|---|---|---|---|---|
| RDC [ 56] | 0.83 | 0.85 | 0.85 | 0.88 | 0.90 | 0.91 | 0.90 | 0.89 |
| NDM [ 47] | 0.70 | 0.76 | 0.87 | 0.90 | 0.93 | **0.94** | **0.94** | 0.88 |
| MMR [ 21] | 0.71 | 0.82 | 0.88 | 0.91 | 0.91 | 0.91 | 0.91 | 0.89 |
| POISON [ 60] | 0.82 | 0.86 | 0.90 | 0.89 | 0.91 | 0.92 | 0.92 | 0.89 |
| GINI [ 59] | 0.77 | 0.81 | 0.88 | 0.87 | 0.88 | 0.90 | 0.90 | 0.90 |
| ACC2 [ 55] | 0.82 | 0.88 | 0.87 | 0.88 | 0.89 | 0.90 | 0.90 | 0.90 |
| ODDS [ 58] | 0.70 | 0.82 | 0.88 | 0.91 | 0.91 | 0.91 | 0.90 | 0.89 |
| IG [ 57] | 0.81 | 0.86 | 0.90 | 0.91 | 0.91 | 0.92 | 0.91 | 0.89 |
| CHISQ [ 48] | 0.79 | 0.87 | 0.90 | 0.91 | 0.91 | 0.92 | 0.91 | 0.89 |
| BNS [ 55] | 0.81 | 0.88 | 0.87 | 0.88 | 0.89 | 0.90 | 0.91 | 0.90 |

Peak performance of every classifier is highlighted in bold for each experimental dataset


Table 7 Performance of ten feature selection algorithms against 8 different benchmark test points over DSL Urdu news dataset using SVM classifier

| Feature selection algorithm / Benchmark test point | 10 | 20 | 50 | 100 | 200 | 500 | 1000 | 1500 |
|---|---|---|---|---|---|---|---|---|
| RDC [ 56] | 0.80 | 0.79 | 0.80 | 0.84 | 0.86 | 0.88 | 0.88 | 0.88 |
| NDM [ 47] | 0.74 | 0.78 | 0.86 | 0.88 | 0.90 | **0.91** | **0.91** | 0.90 |
| MMR [ 21] | 0.77 | 0.83 | 0.87 | 0.89 | 0.89 | 0.88 | 0.89 | 0.90 |
| POISON [ 60] | 0.80 | 0.83 | 0.85 | 0.88 | 0.87 | 0.89 | 0.88 | 0.90 |
| GINI [ 59] | 0.76 | 0.77 | 0.83 | 0.83 | 0.83 | 0.86 | 0.88 | 0.88 |
| ACC2 [ 55] | 0.80 | 0.84 | 0.82 | 0.85 | 0.86 | 0.87 | 0.88 | 0.89 |
| ODDS [ 58] | 0.74 | 0.83 | 0.86 | 0.88 | 0.88 | 0.89 | 0.89 | 0.89 |
| IG [ 57] | 0.83 | 0.85 | 0.84 | 0.88 | 0.89 | 0.89 | 0.89 | 0.90 |
| CHISQ [ 48] | 0.81 | 0.85 | 0.86 | 0.88 | 0.88 | 0.88 | 0.90 | 0.90 |
| BNS [ 55] | 0.81 | 0.83 | 0.82 | 0.85 | 0.87 | 0.87 | 0.88 | 0.88 |

Peak performance of every classifier is highlighted in bold for each experimental dataset

It can be summarized from Tables 6 and 7 that NDM [ 47] marks the lowest performance at the top 10 features; however, its performance rises sharply for both classifiers as the number of features increases. NDM [ 47] outperforms the rest of the feature selection algorithms at the 200, 500, and 1000 feature test points with both classifiers. Although MMR [ 21] does not perform well with Naive Bayes [ 26], it outshines the other nine feature selection algorithms at 50 and 100 features using the SVM [ 25] classifier. BNS [ 55] only manages to beat the other feature selection algorithms at 20 and 1500 features with the Naive Bayes [ 26] classifier, whereas with SVM [ 25] it is beaten by seven feature selection algorithms as it marks the second lowest performance of 89%. While GINI [ 59] and RDC [ 56] show the worst performance for both classifiers, all other feature selection algorithms show a mixed trend across all test points.
In a nutshell, Naive Bayes [ 26] outperforms SVM [ 25] on this dataset: the performance of Naive Bayes [ 26] peaks at 94%, whereas SVM [ 25] reaches only 91%.

#### 7.1.2 CLE Urdu Digest 1M dataset

Tables 8 and 9 show the performance figures of the ten feature selection algorithms against 8 different benchmark test points over the CLE Urdu Digest 1M dataset using the Naive Bayes and SVM classifiers, respectively.

Table 8 Performance of ten feature selection algorithms against 8 different benchmark test points over CLE Urdu Digest 1M dataset using Naive Bayes classifier

| Feature selection algorithm / Benchmark test point | 10 | 20 | 50 | 100 | 200 | 500 | 1000 | 1500 |
|---|---|---|---|---|---|---|---|---|
| RDC [ 56] | 0.65 | 0.64 | 0.66 | 0.63 | 0.62 | 0.61 | 0.59 | 0.55 |
| NDM [ 47] | 0.51 | 0.58 | 0.63 | 0.64 | 0.65 | 0.61 | 0.57 | 0.60 |
| MMR [ 21] | 0.52 | 0.54 | 0.58 | 0.60 | 0.62 | 0.59 | 0.51 | 0.46 |
| POISON [ 60] | 0.50 | 0.60 | 0.61 | 0.61 | 0.62 | 0.56 | 0.48 | 0.45 |
| GINI [ 59] | 0.13 | 0.14 | 0.46 | 0.59 | 0.62 | 0.62 | 0.62 | 0.60 |
| ACC2 [ 55] | 0.65 | 0.66 | 0.65 | 0.65 | 0.64 | 0.62 | 0.57 | 0.53 |
| ODDS [ 58] | 0.53 | 0.56 | 0.62 | 0.66 | **0.68** | 0.65 | 0.64 | 0.56 |
| IG [ 57] | 0.62 | 0.62 | 0.63 | 0.64 | 0.63 | 0.60 | 0.49 | 0.45 |
| CHISQ [ 48] | 0.57 | 0.59 | 0.64 | 0.63 | 0.62 | 0.57 | 0.48 | 0.48 |
| BNS [ 55] | 0.65 | 0.64 | 0.64 | 0.64 | 0.64 | 0.63 | 0.56 | 0.53 |

Peak performance of every classifier is highlighted in bold for each experimental dataset


Table 9 Performance of ten feature selection algorithms against 8 different benchmark test points over CLE Urdu Digest 1M dataset using SVM classifier

| Feature selection algorithm / Benchmark test point | 10 | 20 | 50 | 100 | 200 | 500 | 1000 | 1500 |
|---|---|---|---|---|---|---|---|---|
| RDC [ 56] | 0.69 | 0.70 | 0.73 | 0.76 | 0.79 | 0.77 | 0.78 | 0.78 |
| NDM [ 47] | 0.55 | 0.67 | 0.76 | 0.81 | 0.81 | 0.79 | 0.76 | 0.80 |
| MMR [ 21] | 0.62 | 0.68 | 0.71 | 0.77 | 0.78 | 0.79 | 0.80 | 0.78 |
| POISON [ 60] | 0.53 | 0.64 | 0.75 | 0.76 | **0.83** | 0.82 | 0.80 | 0.78 |
| GINI [ 59] | 0.27 | 0.34 | 0.60 | 0.67 | 0.70 | 0.70 | 0.78 | 0.79 |
| ACC2 [ 55] | 0.67 | 0.69 | 0.77 | 0.79 | 0.79 | 0.79 | 0.78 | 0.78 |
| ODDS [ 58] | 0.59 | 0.69 | 0.74 | 0.77 | 0.80 | 0.76 | 0.80 | 0.82 |
| IG [ 57] | 0.66 | 0.69 | 0.75 | 0.79 | 0.78 | 0.82 | 0.81 | 0.78 |
| CHISQ [ 48] | 0.62 | 0.70 | 0.77 | 0.77 | **0.83** | 0.82 | 0.81 | 0.79 |
| BNS [ 55] | 0.67 | 0.71 | 0.74 | 0.78 | 0.80 | 0.79 | 0.78 | 0.78 |

Peak performance of every classifier is highlighted in bold for each experimental dataset

For the CLE Urdu Digest 1M dataset, the performance of ODDS [ 58] begins at a low of just 53% and 59% with Naive Bayes [ 26] and SVM [ 25], respectively, but it shows an upward trend until 200 features with both classifiers, as depicted in Tables 8 and 9. Although ODDS [ 58] outperforms the other nine feature selection algorithms at four benchmark test points (no. of features = 100, 200, 500, 1000) using Naive Bayes [ 26], it fails to produce the highest performance with the SVM [ 25] classifier. In contrast, CHISQ [ 48] does not perform well with the Naive Bayes [ 26] classifier, but it produces the best performance with the SVM [ 25] classifier, where it either equals or surpasses the performance of the other nine feature selection algorithms at most test points. Although the performance of NDM [ 47] rises almost gradually until 200 features, it fluctuates afterwards and fails to surpass the best performance figures. Likewise, when MMR [ 21] ranked features are fed to both classifiers, performance keeps increasing until 200 features. After 200 features, its performance declines almost gradually with Naive Bayes [ 26] and shows a mixed trend with the SVM [ 25] classifier. While POISON [ 60] marks the worst performance with Naive Bayes [ 26], GINI [ 59] shows the lowest performance with the SVM [ 25] classifier. The rest of the feature selection algorithms show both upward and downward trends across test points for both classifiers, but they produce better performance figures with the SVM [ 25] classifier.
As a whole, SVM [ 25] outshines Naive Bayes [ 26] on the CLE Urdu Digest 1M dataset: the performance of SVM [ 25] peaks at 83%, whereas Naive Bayes [ 26] only reaches 68%.

#### 7.1.3 CLE Urdu Digest 1000K dataset

Tables 10 and 11 elaborate the performance of ten feature selection algorithms on CLE Urdu Digest 1000 k dataset using Naive Bayes [ 26] and SVM [ 25] classifiers, respectively.

Table 10 Performance of ten feature selection algorithms against 8 different benchmark test points over CLE Urdu Digest 1000K dataset using Naive Bayes classifier

| Feature selection algorithm / Benchmark test point | 10 | 20 | 50 | 100 | 200 | 500 | 1000 | 1500 |
|---|---|---|---|---|---|---|---|---|
| RDC [ 56] | 0.57 | 0.58 | 0.56 | 0.55 | 0.51 | 0.31 | 0.25 | 0.17 |
| NDM [ 47] | 0.63 | 0.70 | 0.71 | 0.55 | 0.40 | 0.27 | 0.36 | 0.28 |
| MMR [ 21] | 0.64 | 0.65 | **0.81** | 0.71 | 0.67 | 0.51 | 0.31 | 0.21 |
| POISON [ 60] | 0.50 | 0.50 | 0.55 | 0.57 | 0.43 | 0.36 | 0.21 | 0.22 |
| GINI [ 59] | 0.35 | 0.35 | 0.46 | 0.50 | 0.52 | 0.39 | 0.21 | 0.10 |
| ACC2 [ 55] | 0.60 | 0.63 | 0.67 | 0.60 | 0.44 | 0.37 | 0.17 | 0.17 |
| ODDS [ 58] | 0.64 | 0.73 | 0.70 | 0.70 | 0.62 | 0.41 | 0.31 | 0.20 |
| IG [ 57] | 0.69 | 0.76 | 0.74 | 0.71 | 0.46 | 0.32 | 0.21 | 0.19 |
| CHISQ [ 48] | 0.73 | 0.72 | **0.81** | 0.79 | 0.61 | 0.46 | 0.31 | 0.22 |
| BNS [ 55] | 0.61 | 0.63 | 0.67 | 0.61 | 0.50 | 0.37 | 0.12 | 0.17 |

Peak performance of every classifier is highlighted in bold for each experimental dataset


Table 11 Performance of ten feature selection algorithms against 8 different benchmark test points over CLE Urdu Digest 1000K dataset using SVM classifier

| Feature selection algorithm / Benchmark test point | 10 | 20 | 50 | 100 | 200 | 500 | 1000 | 1500 |
|---|---|---|---|---|---|---|---|---|
| RDC [ 56] | 0.63 | 0.64 | 0.67 | 0.65 | 0.74 | 0.73 | 0.72 | 0.65 |
| NDM [ 47] | 0.70 | 0.81 | **0.92** | 0.90 | 0.87 | 0.73 | 0.66 | 0.69 |
| MMR [ 21] | 0.66 | 0.74 | 0.85 | 0.86 | 0.81 | 0.82 | 0.72 | 0.68 |
| POISON [ 60] | 0.61 | 0.70 | 0.84 | 0.85 | 0.86 | 0.84 | 0.75 | 0.66 |
| GINI [ 59] | 0.50 | 0.47 | 0.54 | 0.59 | 0.64 | 0.77 | 0.75 | 0.71 |
| ACC2 [ 55] | 0.61 | 0.65 | 0.67 | 0.73 | 0.72 | 0.75 | 0.74 | 0.67 |
| ODDS [ 58] | 0.71 | 0.79 | 0.74 | 0.86 | 0.85 | 0.78 | 0.68 | 0.64 |
| IG [ 57] | 0.63 | 0.68 | 0.80 | 0.86 | 0.81 | 0.80 | 0.75 | 0.66 |
| CHISQ [ 48] | 0.69 | 0.72 | 0.77 | 0.83 | 0.86 | 0.82 | 0.71 | 0.66 |
| BNS [ 55] | 0.61 | 0.64 | 0.67 | 0.70 | 0.77 | 0.75 | 0.74 | 0.67 |

Peak performance of every classifier is highlighted in bold for each experimental dataset

It can be seen from Tables 10 and 11 that, with the Naive Bayes [ 26] classifier, CHISQ [ 48] produces the highest performance until 100 features, but its performance dips sharply when more features are added. With the SVM [ 25] classifier, CHISQ [ 48] shows a similar trend until 200 features and decreases slightly afterwards. CHISQ [ 48] manages to beat the other nine feature selection algorithms with Naive Bayes [ 26], as the performance of most feature selection algorithms starts declining after 100 features. With the Naive Bayes [ 26] classifier, NDM [ 47] beats the other feature selection algorithms at two benchmark test points (number of features = {1000, 1500}), and MMR [ 21] reveals better performance at 200 and 500 features. With SVM [ 25], NDM [ 47] outshines the rest of the feature selection algorithms at four test points {20, 50, 100, 200}, compared to POISON [ 60], which surpasses the performance of the other feature selection algorithms at only one test point (number of features = 500). The rest of the feature selection algorithms show a mixed trend for both classifiers, but produce better performance figures with the SVM [ 25] classifier. Among all, GINI [ 59] shows the worst performance with both classifiers.
In a nutshell, SVM [ 25] once again outshines Naive Bayes [ 26]: the performance of SVM [ 25] peaks at 92%, whereas Naive Bayes [ 26] reaches only 81%.

#### 7.1.4 Discussion

In order to provide a bird's-eye view of the performance of each filter-based feature selection algorithm, this section reports the average performance of the ten feature selection algorithms across the three datasets. To assess the discriminative power of the features ranked by the ten feature selection algorithms, we select subsets (10, 20, 50, 100, 200, 500, 1000, 1500) of the top k features to feed the Naive Bayes [ 26] and SVM [ 25] classifiers. Based on these predefined subsets of features, Tables 12, 13 and 14 show the best performing feature selection algorithm at each benchmark test point using the Naive Bayes [ 26] and SVM [ 25] classifiers for the two closed source datasets (CLE Urdu Digest 1000K, CLE Urdu Digest 1M) and the one presented dataset (DSL Urdu News).
For the DSL Urdu news dataset, Table 12 illustrates that NDM [ 47] produces the best performance with both classifiers at the same three benchmark test points (number of features = 200, 500, 1000), whereas MMR [ 21] reveals the top performance at 50, 100, and 1500 features with SVM [ 25] only. While BNS [ 55] produces promising performance at two test points, ODDS [ 58] and RDC [ 56] outperform the rest of the feature selection algorithms at only one test point each with the Naive Bayes [ 26] classifier. On the other hand, with the SVM [ 25] classifier, IG [ 57] attains the best performance at two test points, compared to the Naive Bayes [ 26] classifier, where it marks the best performance at only one test point. All other feature selection algorithms, namely ACC2 [ 55], GINI [ 59], POISON [ 60], and CHISQ [ 48], fail to outshine the peak performance of the other feature selection algorithms with the SVM [ 25] or Naive Bayes [ 26] classifier at any test point.
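The selection of the best performing algorithm per test point can be sketched as below, using the Naive Bayes F1 scores on the DSL Urdu news dataset from Table 6. The function name is hypothetical, and the sketch reports every algorithm that shares the top score at a test point; how ties were resolved when only a single algorithm is listed per test point is not stated in the text.

```python
TEST_POINTS = (10, 20, 50, 100, 200, 500, 1000, 1500)

# F1 scores of Naive Bayes on DSL Urdu news, copied from Table 6.
NB_DSL_F1 = {
    "RDC":    (0.83, 0.85, 0.85, 0.88, 0.90, 0.91, 0.90, 0.89),
    "NDM":    (0.70, 0.76, 0.87, 0.90, 0.93, 0.94, 0.94, 0.88),
    "MMR":    (0.71, 0.82, 0.88, 0.91, 0.91, 0.91, 0.91, 0.89),
    "POISON": (0.82, 0.86, 0.90, 0.89, 0.91, 0.92, 0.92, 0.89),
    "GINI":   (0.77, 0.81, 0.88, 0.87, 0.88, 0.90, 0.90, 0.90),
    "ACC2":   (0.82, 0.88, 0.87, 0.88, 0.89, 0.90, 0.90, 0.90),
    "ODDS":   (0.70, 0.82, 0.88, 0.91, 0.91, 0.91, 0.90, 0.89),
    "IG":     (0.81, 0.86, 0.90, 0.91, 0.91, 0.92, 0.91, 0.89),
    "CHISQ":  (0.79, 0.87, 0.90, 0.91, 0.91, 0.92, 0.91, 0.89),
    "BNS":    (0.81, 0.88, 0.87, 0.88, 0.89, 0.90, 0.91, 0.90),
}


def best_per_test_point(results, test_points=TEST_POINTS):
    """Return {test point: sorted list of algorithms sharing the top F1}."""
    winners = {}
    for column, k in enumerate(test_points):
        best = max(row[column] for row in results.values())
        winners[k] = sorted(a for a, row in results.items() if row[column] == best)
    return winners


winners = best_per_test_point(NB_DSL_F1)
for k in TEST_POINTS:
    print(k, winners[k])
```

At 200, 500, and 1000 features NDM is the sole winner, whereas several test points (for example 20 and 50 features) are shared by more than one algorithm.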

Table 12 Best performing feature selection algorithm against each benchmark test point on DSL Urdu News dataset

| Classifier / Number of features | 10 | 20 | 50 | 100 | 200 | 500 | 1000 | 1500 |
|---|---|---|---|---|---|---|---|---|
| NB [ 26] | RDC [ 56] | BNS [ 55] | IG [ 57] | ODDS [ 58] | NDM [ 47] | NDM [ 47] | NDM [ 47] | BNS [ 55] |
| SVM [ 25] | IG [ 57] | IG [ 57] | MMR [ 21] | MMR [ 21] | NDM [ 47] | NDM [ 47] | NDM [ 47] | MMR [ 21] |

Turning to the CLE Urdu Digest 1000K dataset, Table 13 depicts that NDM [ 47] once again reveals promising performance with both classifiers, as it outshines the other feature selection algorithms at four test points using SVM [ 25] (no. of features = 20, 50, 100, 200) and at two test points using Naive Bayes [ 26] (no. of features = 1000, 1500), whereas CHISQ [ 48] and MMR [ 21] show good performance with the Naive Bayes [ 26] classifier only, at two test points. While ODDS [ 58] and POISON [ 60] produce the best performance at one test point each using the SVM [ 25] classifier, IG [ 57] once again shows a mixed trend, as it produces good performance with both classifiers. The remaining feature selection algorithms, RDC [ 56], BNS [ 55], and ACC2 [ 55], fail to outperform the others at any test point with any classifier.

Table 13 Best performing feature selection algorithm against each benchmark test point on CLE Urdu Digest 1000K dataset

| Classifier / Number of features | 10 | 20 | 50 | 100 | 200 | 500 | 1000 | 1500 |
|---|---|---|---|---|---|---|---|---|
| NB [ 26] | CHISQ [ 48] | IG [ 57] | CHISQ [ 48] | CHISQ [ 48] | MMR [ 21] | MMR [ 21] | NDM [ 47] | NDM [ 47] |
| SVM [ 25] | ODDS [ 58] | NDM [ 47] | NDM [ 47] | NDM [ 47] | NDM [ 47] | POISON [ 60] | IG [ 57] | GINI [ 59] |

In contrast, on the CLE Urdu Digest 1M dataset, ODDS [ 58] reveals promising performance at four test points with Naive Bayes [ 26] and at one test point with SVM [ 25], as shown in Table 14. While ACC2 [ 55] shows the highest performance at 10 and 20 features with Naive Bayes [ 26] and at 50 features with the SVM [ 25] classifier, NDM [ 47] and RDC [ 56] each mark the best performance at one test point with each classifier. On the other hand, CHISQ [ 48] performs well with the SVM [ 25] classifier only. Likewise, BNS [ 55] and POISON [ 60] each beat the other feature selection algorithms at one test point with the SVM [ 25] classifier, whereas MMR [ 21], GINI [ 59], and IG [ 57] fail to beat the other feature selection algorithms with any classifier at any test point.

Table 14 Best performing feature selection algorithm against each benchmark test point on CLE Urdu Digest 1M dataset

| Classifier / Number of features | 10 | 20 | 50 | 100 | 200 | 500 | 1000 | 1500 |
|---|---|---|---|---|---|---|---|---|
| NB [ 26] | ACC2 [ 55] | ACC2 [ 55] | RDC [ 56] | ODDS [ 58] | ODDS [ 58] | ODDS [ 58] | ODDS [ 58] | NDM [ 47] |
| SVM [ 25] | RDC [ 56] | BNS [ 55] | ACC2 [ 55] | NDM [ 47] | CHISQ [ 48] | POISON [ 60] | CHISQ [ 48] | ODDS [ 58] |

Furthermore, Tables 15, 16 and 17 highlight the highest performance attained by each feature selection algorithm, together with the test point at which it is attained, using either classifier for all three corpora.
It can be clearly seen from Table 15 that NDM [ 47] outperforms the rest of the feature ranking metrics by revealing the highest performance figure of 94% at 1000 features on the DSL Urdu news dataset. Three algorithms {POISON [ 60], IG [ 57], CHISQ [ 48]} produce the second highest performance of 92% with the top 500 features. While MMR [ 21], ODDS [ 58], RDC [ 56], and BNS [ 55] attain the third highest performance of 91% at different test points, GINI [ 59] and ACC2 [ 55] both mark a performance of 90% at the same test point (number of features = 500).

Table 15 Peak performance figures of ten feature selection algorithms on DSL Urdu news dataset using Naive Bayes and SVM classifiers

| FR Metric | RDC [ 56] | NDM [ 47] | MMR [ 21] | POISON [ 60] | GINI [ 59] | ACC2 [ 55] | ODDS [ 58] | IG [ 57] | CHISQ [ 48] | BNS [ 55] |
|---|---|---|---|---|---|---|---|---|---|---|
| Test Point | 500 | 1000 | 100 | 500 | 500 | 500 | 100 | 500 | 500 | 1000 |
| F1 Score | 0.91 | 0.94 | 0.91 | 0.92 | 0.90 | 0.90 | 0.91 | 0.92 | 0.92 | 0.91 |

On the basis of the figures reported in Table 16, for the CLE Urdu Digest 1000K dataset, NDM [ 47] reveals the highest performance figure of 92% at 50 features. Half of the algorithms {MMR [ 21], ODDS [ 58], IG [ 57], CHISQ [ 48], POISON [ 60]} reveal the second best performance of 86% at different numbers of features. In addition, BNS [ 55] and GINI [ 59] produce the third highest performance of 77% at different test points, whereas ACC2 [ 55] and RDC [ 56] show the lowest peak performance figures of 75% and 74% at 500 and 200 features, respectively.
Table 16
Peak performance figures of ten feature selection algorithms on CLE Urdu Digest 1000K dataset using Naive Bayes and SVM classifiers

| FR Metric | RDC [56] | NDM [47] | MMR [21] | POISON [60] | GINI [59] | ACC2 [55] | ODDS [58] | IG [57] | CHISQ [48] | BNS [55] |
|---|---|---|---|---|---|---|---|---|---|---|
| Test Point | 200 | 50 | 100 | 200 | 500 | 500 | 100 | 100 | 200 | 200 |
| F1 Score | 0.74 | 0.92 | 0.86 | 0.86 | 0.77 | 0.75 | 0.86 | 0.86 | 0.86 | 0.77 |
However, on the CLE Urdu Digest 1M dataset, POISON [60] and CHISQ [48] produce the highest performance of 83% at the same test point (number of features = 200), as illustrated by Table 17. IG [57] and ODDS [58] produce the second-best performance of 82% at different numbers of features. Likewise, NDM [47] marks the third-highest performance of 81%, whereas BNS [55] and MMR [21] produce the fourth-highest performance of 80% at different numbers of features. The remaining three feature selection algorithms {ACC2 [55], RDC [56], GINI [59]} show the lowest peak performance of 79% at different numbers of features.
Table 17
Peak performance figures of ten feature selection algorithms on CLE Urdu Digest 1M dataset using Naive Bayes and SVM classifiers

| FR Metric | RDC [56] | NDM [47] | MMR [21] | POISON [60] | GINI [59] | ACC2 [55] | ODDS [58] | IG [57] | CHISQ [48] | BNS [55] |
|---|---|---|---|---|---|---|---|---|---|---|
| Test Point | 200 | 100 | 1000 | 200 | 1500 | 100 | 1500 | 500 | 200 | 200 |
| F1 Score | 0.79 | 0.81 | 0.80 | 0.83 | 0.79 | 0.79 | 0.82 | 0.82 | 0.83 | 0.80 |
Moreover, we compare the average performance of all feature selection algorithms on each corpus. Average performance is the ratio of the number of outperforming test points to the total number of test points across both classifiers, which is 16 (eight test points per classifier). For each algorithm, outperforming test points are those at which it beats all other feature selection algorithms on a given dataset, regardless of the classifier. For instance, as IG [57] outperforms the other nine feature ranking metrics at three test points (number of features = 10, 20, 50) on the DSL Urdu News dataset, it receives a score of 18.75, computed as follows:
$$\begin{aligned} \text{Score on DSL Urdu news dataset} &= \frac{\text{outperforming test points}}{\text{total test points}} \times 100 \\ &= \frac{3}{16} \times 100 \\ &= 18.75 \end{aligned}$$
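The scoring rule above can be sketched in a few lines of Python. As an illustration, the snippet applies it to the best performing algorithms listed per test point for the CLE Urdu Digest 1M dataset (8 test points × 2 classifiers = 16). The function name `outperformance_scores` and the data layout are illustrative assumptions, not part of the original experiments.

```python
from collections import Counter

# Winning algorithm at each test point (10, 20, 50, 100, 200, 500, 1000, 1500
# features) on CLE Urdu Digest 1M, copied from the table of best performing
# feature selection algorithms for that dataset.
WINNERS_CLE_1M = {
    "NB": ["ACC2", "ACC2", "RDC", "ODDS", "ODDS", "ODDS", "ODDS", "NDM"],
    "SVM": ["RDC", "BNS", "ACC2", "NDM", "CHISQ", "POISON", "CHISQ", "ODDS"],
}


def outperformance_scores(winners):
    """Return {algorithm: percentage of test points won}, pooled over classifiers."""
    total = sum(len(points) for points in winners.values())
    wins = Counter(name for points in winners.values() for name in points)
    return {name: 100.0 * count / total for name, count in wins.items()}


scores = outperformance_scores(WINNERS_CLE_1M)
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.2f}")
```

Algorithms that never win a test point simply do not appear in the result, which corresponds to a score of 0.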
Table 18
Average performance of ten feature selection algorithms against each corpus

| FR Metric | DSL Urdu News | CLE Urdu Digest 1000K | CLE Urdu Digest 1M | Average |
|---|---|---|---|---|
| RDC [56] | 6.25 | 0 | 12.5 | 6.25 |
| NDM [47] | 37.5 | 37.5 | 12.5 | 29.16 |
| MMR [21] | 18.75 | 12.5 | 0 | 10.41 |
| POISON [60] | 0 | 6.25 | 6.25 | 4.16 |
| GINI [59] | 0 | 6.25 | 0 | 2.08 |
| ACC2 [55] | 0 | 0 | 18.75 | 6.25 |
| ODDS [58] | 6.25 | 6.25 | 31.25 | 14.58 |
| IG [57] | 18.75 | 12.5 | 0 | 10.41 |
| CHISQ [48] | 0 | 18.75 | 12.5 | 10.41 |
| BNS [55] | 12.5 | 0 | 6.25 | 6.25 |
Peak performance of every classifier is highlighted in bold for each experimental dataset
Considering the trends shown in Table 18, on the DSL Urdu News dataset the average performance of NDM [47] (37.5%) is the highest. MMR [21] and IG [57] produce the second-best average performance of 18.75%. BNS [55] marks the third-highest average performance of 12.5%, while RDC [56] and ODDS [58] show the lowest non-zero average performance of 6.25%. Four feature selection algorithms {POISON [60], GINI [59], ACC2 [55], CHISQ [48]} do not outperform the other feature selection algorithms at any test point. Likewise, on the CLE Urdu Digest 1000K dataset, NDM [47] attains the same highest average performance (37.5%). CHISQ [48] marks the second-highest average performance of 18.75%, and MMR [21] and IG [57] both show the third-best average performance of 12.5%. Three feature selection algorithms, POISON [60], GINI [59], and ODDS [58], show the lowest non-zero average performance of 6.25%, and three algorithms {RDC [56], ACC2 [55], BNS [55]} fail to outperform the other feature selection algorithms at any test point.
Furthermore, on the CLE Urdu Digest 1M dataset, ODDS [58] shows the best average performance of 31.25%, followed by 18.75% attained by ACC2 [55]. Three feature selection algorithms {RDC [56], NDM [47], CHISQ [48]} mark the third-highest average performance of 12.5%. POISON [60] and BNS [55] attain the lowest non-zero performance of 6.25%, while three feature selection algorithms {MMR [21], GINI [59], IG [57]} fail to surpass the other feature selection algorithms at any test point.
Taking into account the average performance of the ten feature selection algorithms across all three datasets, NDM [47] performs well on every dataset and thus produces the highest average performance of 29.16%. In contrast, GINI [59] shows the lowest average performance of 2.08%, followed by POISON [60] with 4.16%, while three feature selection algorithms {RDC [56], ACC2 [55], BNS [55]} each average 6.25%.
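As a quick arithmetic check, the Average column of Table 18 can be recomputed as the plain mean of the three per-dataset scores. The sketch below does only that; the dictionary name `TABLE_18` is an illustrative choice, and the values are copied from the table. The reported averages appear to be truncated rather than rounded to two decimals (for example, 87.5 / 3 = 29.1666… is listed as 29.16).

```python
# Per-dataset scores in the order:
# (DSL Urdu News, CLE Urdu Digest 1000K, CLE Urdu Digest 1M)
TABLE_18 = {
    "RDC": (6.25, 0.0, 12.5),
    "NDM": (37.5, 37.5, 12.5),
    "MMR": (18.75, 12.5, 0.0),
    "POISON": (0.0, 6.25, 6.25),
    "GINI": (0.0, 6.25, 0.0),
    "ACC2": (0.0, 0.0, 18.75),
    "ODDS": (6.25, 6.25, 31.25),
    "IG": (18.75, 12.5, 0.0),
    "CHISQ": (0.0, 18.75, 12.5),
    "BNS": (12.5, 0.0, 6.25),
}

# Mean over the three datasets for every algorithm.
averages = {name: sum(vals) / len(vals) for name, vals in TABLE_18.items()}

# Print the algorithms from highest to lowest average score.
for name in sorted(averages, key=averages.get, reverse=True):
    print(f"{name}: {averages[name]:.4f}")
```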
Ideally, a feature selection algorithm should assign higher scores to highly discriminative features and lower scores to less significant or irrelevant features. Typical criteria for selecting highly discriminative features are as follows. Features that appear very rarely in a certain class, or very frequently across all classes, are irrelevant and should be assigned lower scores. Conversely, features that appear frequently in a certain class and rarely or never in the remaining classes are highly discriminative and should be assigned higher scores. Among all the discussed feature selection algorithms, ACC2 [55] is the simplest: it scores a feature by the absolute difference between $$t_{\mathrm{pr}}$$ and $$f_{\mathrm{pr}}$$. However, it assigns the same score to all features whose $$t_{\mathrm{pr}}$$ and $$f_{\mathrm{pr}}$$ differ by the same amount, regardless of how frequent those features are in the positive or negative classes. This is why it fails to perform better than other feature selection algorithms such as NDM [47], MMR [21], ODDS [58], IG [57], and CHISQ [48]. NDM [47], a modified version of ACC2 [55], shows the best performance on the DSL Urdu News and CLE Urdu Digest 1000K datasets. NDM [47] assigns high scores to features that occur often in one class and rarely in the other classes. To achieve this, it normalizes the ACC2 [55] score by dividing it by the minimum of $$t_{\mathrm{pr}}$$ and $$f_{\mathrm{pr}}$$, which results in better performance. As no other feature selection algorithm considers the minimum of $$t_{\mathrm{pr}}$$ and $$f_{\mathrm{pr}}$$ when computing the score of a feature, NDM [47] marks the best performance on most datasets. However, the NDM [47] score may blow up for highly sparse and irrelevant features.
For instance, consider a dataset of six classes in which the $$t_{\mathrm{pr}}$$ of a feature is 0 and its $$f_{\mathrm{pr}}$$ is 5. Replacing 0 with 0.0001 to avoid division by zero, NDM [47] computes $$(0 - 5)/0.0001 = -50{,}000$$, a score of very large magnitude for an irrelevant feature, which is clearly incorrect. MMR [21], a modified version of NDM [47], reduces such high scores by multiplying the NDM [47] score by the maximum of $$t_{\mathrm{pr}}$$ and $$f_{\mathrm{pr}}$$. According to Rehman et al. [47], the terms located at the bottom-right and top-left corners of the contour plot are more important than those located around the diagonal. Although MMR [21] dampens the scores of highly sparse and irrelevant features, on small and non-skewed datasets where the term distribution is fairly balanced the MMR [21] score is distorted by the maximum of $$t_{\mathrm{pr}}$$ and $$f_{\mathrm{pr}}$$. In other words, MMR [21] performs best when the dataset is large and highly skewed, which is not the case for the experimental datasets. This is why MMR [21] fails to surpass the performance of NDM [47]. ODDS [58] shows the second-best performance, along with IG [57] and CHISQ [48]. ODDS [58], IG [57], and CHISQ [48] select features in a univariate fashion; therefore, these feature selection algorithms fail to handle redundant features [152]. They also rank features on the basis of class relevance, but fail to correctly discriminate features that have large distinct values.
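The three related scores discussed above can be written down directly from their verbal descriptions: ACC2 as the absolute difference of the two rates, NDM as ACC2 divided by the smaller rate, and MMR as NDM multiplied by the larger rate. The following is a minimal sketch under those descriptions only. The function names, the `EPSILON` floor (set to the 0.0001 used in the example above) and the sample rates are illustrative assumptions and are not taken from the cited implementations.

```python
EPSILON = 1e-4  # assumed stand-in for a zero rate, as in the worked example


def acc2(tpr, fpr):
    """ACC2: absolute difference between the two rates."""
    return abs(tpr - fpr)


def ndm(tpr, fpr, eps=EPSILON):
    """NDM: ACC2 divided by the smaller rate (floored at eps to avoid dividing by zero)."""
    return acc2(tpr, fpr) / max(min(tpr, fpr), eps)


def mmr(tpr, fpr, eps=EPSILON):
    """MMR: NDM multiplied by the larger rate."""
    return max(tpr, fpr) * ndm(tpr, fpr, eps)


# Hypothetical (tpr, fpr) pairs chosen only to illustrate the behaviour.
examples = {
    "discriminative": (0.60, 0.05),
    "sparse": (0.00, 0.05),
    "balanced": (0.50, 0.45),
}
for label, (tpr, fpr) in examples.items():
    print(label, acc2(tpr, fpr), ndm(tpr, fpr), mmr(tpr, fpr))
```

With these sample rates, ACC2 cannot separate the sparse term from the balanced one because both rates differ by 0.05, NDM gives the sparse term a far larger score than the genuinely discriminative one, and MMR pulls that score back down because the larger rate is well below 1.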

### 7.2 Results of adopted deep learning-based methodologies and the hybrid methodology

This section summarizes the performance of ten adopted deep learning methodologies on the test sets of two closed-source datasets (CLE Urdu Digest 1000K, CLE Urdu Digest 1M) and the newly presented, publicly available dataset (DSL Urdu News) using the F1 measure. It also reports the performance impact of the hybrid approach, which combines traditional machine learning-based feature engineering with deep learning-based automated feature engineering, on the test sets of all three datasets.
Table 19
Performance of adopted deep learning methodologies, and BERT over three corpora using full, most frequent and NDM features

| Model type | Models | DSL Urdu News: Full vocab | DSL Urdu News: MF @3K | DSL Urdu News: NDM @250 | CLE 1000K: Full vocab | CLE 1000K: MF @2K | CLE 1000K: NDM @350 | CLE 1M: Full vocab | CLE 1M: MF @10K | CLE 1M: NDM @400 |
|---|---|---|---|---|---|---|---|---|---|---|
| CNN | Yoon Kim et al [123] | 0.88 | 0.90 | 0.93 | 0.46 | 0.42 | 0.77 | 0.66 | 0.60 | 0.74 |
| CNN | Kalchbrenner et al [124] | 0.89 | 0.91 | 0.91 | 0.57 | 0.63 | 0.69 | 0.63 | 0.70 | 0.75 |
| CNN | Yin et al [125] | 0.90 | 0.90 | 0.90 | 0.70 | 0.67 | 0.78 | 0.71 | 0.76 | 0.80 |
| CNN | Zhang et al [126] | 0.90 | 0.90 | 0.90 | 0.50 | 0.56 | 0.71 | 0.63 | 0.65 | 0.72 |
| RNN | Yogatama et al [127] | 0.88 | 0.89 | 0.91 | 0.51 | 0.64 | 0.56 | 0.71 | 0.68 | 0.68 |
| RNN | Palangi et al [128] | 0.87 | 0.89 | 0.91 | 0.48 | 0.54 | 0.50 | 0.68 | 0.65 | 0.71 |
| HYBRID | Siwei Lai et al [142] | 0.88 | 0.88 | 0.91 | 0.60 | 0.57 | 0.69 | 0.70 | 0.66 | 0.77 |
| HYBRID | Chen et al [143] | 0.87 | 0.89 | 0.86 | 0.32 | 0.48 | 0.47 | 0.39 | 0.52 | 0.44 |
| HYBRID | Zhou et al [144] | 0.88 | 0.88 | 0.88 | 0.43 | 0.61 | 0.53 | 0.53 | 0.58 | 0.55 |
| HYBRID | Wang et al [145] | 0.88 | 0.90 | 0.90 | 0.66 | 0.62 | 0.50 | 0.58 | 0.57 | 0.56 |
| | BERT Multilingual [42] | 0.93 | 0.85 | 0.93 | 0.77 | 0.35 | 0.77 | 0.68 | 0.41 | 0.70 |
Peak performance of every classifier is highlighted in bold for each experimental dataset
Table 19 summarizes the performance of four CNN, two RNN, and four hybrid models, and of the pre-trained multilingual language model BERT [42], with full features, the most frequent 2K, 3K, and 10K features, and the 250, 350, and 400 discriminative features selected by the filter-based feature selection algorithm NDM [47]. In this table, the full vocabulary is the set of unique terms found after corpus preprocessing. MF denotes the most frequent terms of the corpus, whereas NDM@ refers to the discriminative terms selected by NDM [47]. To obtain the discriminative terms, all unique terms of the training corpus are first ranked using NDM [47], and then the top 250, 350, or 400 features of each class are selected to feed the embedding layer.
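The vocabulary construction described above (rank all training terms with NDM, keep the top k per class, and pass only those terms to the embedding layer) might look roughly like the sketch below. It assumes a one-vs-rest reading in which $$t_{\mathrm{pr}}$$ is the fraction of a class's documents containing a term and $$f_{\mathrm{pr}}$$ is the fraction of the remaining documents containing it; that reading, the function names, the `EPSILON` floor, and the toy English tokens are all illustrative assumptions rather than the authors' code.

```python
from collections import Counter

EPSILON = 1e-4  # assumed floor for a zero rate


def ndm_score(tpr, fpr, eps=EPSILON):
    """NDM as described in the text: |tpr - fpr| / min(tpr, fpr)."""
    return abs(tpr - fpr) / max(min(tpr, fpr), eps)


def top_k_vocabulary(docs, labels, k):
    """Union of the k highest NDM-ranked terms of every class (one-vs-rest)."""
    vocabulary = set()
    for cls in sorted(set(labels)):
        pos = [set(d) for d, y in zip(docs, labels) if y == cls]
        neg = [set(d) for d, y in zip(docs, labels) if y != cls]
        pos_df = Counter(t for d in pos for t in d)  # document frequency in class
        neg_df = Counter(t for d in neg for t in d)  # document frequency elsewhere
        scores = {}
        for term in set(pos_df) | set(neg_df):
            tpr = pos_df[term] / len(pos)
            fpr = neg_df[term] / len(neg) if neg else 0.0
            scores[term] = ndm_score(tpr, fpr)
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        vocabulary.update(ranked[:k])
    return vocabulary


def restrict(doc, vocabulary):
    """Keep only the tokens that survive feature selection."""
    return [tok for tok in doc if tok in vocabulary]


# Toy corpus with hypothetical tokens and labels.
docs = [
    ["cricket", "match", "team"],
    ["cricket", "team", "win"],
    ["film", "actor", "team"],
    ["film", "music", "actor"],
]
labels = ["sports", "sports", "entertainment", "entertainment"]

vocab = top_k_vocabulary(docs, labels, k=3)
filtered = [restrict(d, vocab) for d in docs]
print(sorted(vocab), filtered)
```

In this toy corpus the token `team` occurs in both classes and is dropped, while tokens confined to a single class are kept. The filtered token lists are what would then be indexed and passed to a model's embedding layer.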
Considering the figures shown in Table 19, when the adopted deep learning models are fed the full vocabulary of unique words, the CNN-based model of Yin et al. [125] produces the highest performance figures of 70%, 71%, and 90% on the CLE Urdu Digest 1000K, CLE Urdu Digest 1M, and DSL Urdu News datasets, respectively. The model of Zhang et al. [126] also reaches the peak performance of 90% on the DSL Urdu News dataset, and the RNN-based model of Yogatama et al. [127] matches the peak performance of the Yin et al. [125] model on the CLE Urdu Digest 1M dataset.
When the deep learning models are instead fed the most frequent features, the performance of more than half of the adopted deep learning methodologies increases on both the DSL Urdu News and CLE Urdu Digest 1000K datasets, whereas on CLE Urdu Digest 1M exactly half of the methodologies improve by a decent margin. The model of Kalchbrenner et al. [124] surpasses all other models with a performance of 91% on the DSL Urdu News corpus, whereas the CNN-based model of Yin et al. [125] reaches 76% with the 10K most frequent features on the CLE Urdu Digest 1M dataset and 67% with the 2K most frequent features on CLE Urdu Digest 1000K.
With the hybrid methodology, which feeds the most discriminative features ranked by the filter-based feature selection algorithm NDM [47], the performance of most models rises sharply. The performance of the CNN-based model of Yoon Kim et al. [123] jumps from 88% to 93% on the DSL Urdu News dataset, whereas the performance of the CNN-based model of Yin et al. [125] rises from 70% to 78% on CLE Urdu Digest 1000K and from 71% to 80% on the CLE Urdu Digest 1M dataset.
On the other hand, the pre-trained language model BERT [42] shows decent performance with its multilingual vocabulary. The BERT [42] tokenizer is based on a WordPiece model, which greedily builds a fixed-size vocabulary of the most common words, subwords, and individual characters that best fits the language data. BERT [42] handles out-of-vocabulary words effectively by generating embeddings of subword tokens and individual characters, which retain most of the contextual meaning of the words. In this way, BERT [42] learns effective representations for the features present in the experimental datasets. Using the multilingual vocabulary, BERT [42] reaches a promising 93% on the DSL Urdu News dataset. The CLE Urdu Digest 1000K dataset consists of only 270 documents, so on average BERT [42] has only 20 documents of each class for training, yet it still reaches 77%. Because BERT [42] supports a maximum sequence length of 512 due to memory overhead, it achieves only 68% on the CLE Urdu Digest 1M dataset, which is the largest of the experimental datasets with an average document length of around 1100 words. BERT [42] would likely have achieved better results on CLE Urdu Digest 1M had it supported sequences longer than 512. In addition, with a buffer size of 100, BERT [42] reaches only 91%, 70%, and 60%, compared to 93%, 77%, and 68% with a buffer size of 400, on the DSL Urdu News, CLE Urdu Digest 1000K, and CLE Urdu Digest 1M datasets, respectively.
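To make the subword handling and the length limit concrete, the sketch below shows greedy longest-match-first WordPiece tokenization of a word against a fixed vocabulary, followed by truncation to a maximum sequence length. It is a simplified illustration, not the BERT tokenizer itself: the toy vocabulary and the function names are hypothetical, whitespace splitting stands in for real pre-tokenization, and the special tokens that BERT adds to every sequence are ignored.

```python
MAX_SEQ_LEN = 512  # sequence limit mentioned in the text


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the candidate from the right and retry
        if piece is None:
            return [unk]  # no subword matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(text, vocab, max_len=MAX_SEQ_LEN):
    """Tokenize whitespace-separated text and drop everything past max_len."""
    tokens = [p for w in text.split() for p in wordpiece(w, vocab)]
    return tokens[:max_len]


toy_vocab = {"un", "##aff", "##able"}  # hypothetical vocabulary
print(wordpiece("unaffable", toy_vocab))
print(len(encode("un " * 1100, toy_vocab)))
```

The second call mimics a document of about 1100 words: everything after the first 512 tokens is discarded, which is the information loss the paragraph above attributes to the CLE Urdu Digest 1M results.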
In contrast, BERT [42] does not perform well when its multilingual base vocabulary is replaced with a vocabulary that contains only the most discriminative features ranked by NDM [47]. Its performance deteriorates further when a vocabulary of the most frequent features is passed during fine-tuning. This is because most of the frequent features are irrelevant and BERT [42] lacks embeddings for most of them. Although BERT [42] with the most discriminative features exceeds the 68% produced by the multilingual vocabulary on the CLE Urdu Digest 1M dataset, according to our observation this happens only because CLE Urdu Digest 1M has a greater number of discriminative features that overlap with the multilingual vocabulary than the CLE Urdu Digest 1000K and DSL Urdu News datasets.
To summarize, BERT [42] proves highly effective, as it almost replicates the performance of the hybrid methodology. The hybrid methodology substantially raises the performance of the adopted deep learning models because it removes noise from the underlying dataset and selects highly discriminative features to feed the models. For datasets where the average document length lies within 512 tokens, BERT [42] and the hybrid methodology reach similar performance figures; however, for datasets where the average document length exceeds 512 tokens, the hybrid methodology outperforms BERT [42].
Furthermore, to perform a class-level comparison of the hybrid methodology and the adopted state-of-the-art deep learning methodologies, we present accuracy confusion matrices of both on the test sets of all three datasets. To keep the evaluation concise, the three best-performing models among the 10 adopted deep learning models are selected: the CNN of Yin et al. [125], the RNN proposed by Palangi et al. [128], and the hybrid model presented by Siwei Lai et al. [142].
Confusion matrices produced by the three best-performing adopted deep learning methodologies across the three datasets are shown in Fig. 4. As Fig. 4 shows, regardless of model architecture, none of the three adopted methodologies produces promising performance on any dataset, as automated feature engineering fails to extract discriminative features that help the model differentiate class boundaries. For example, in the DSL Urdu News dataset, most of the test samples of the class world are wrongly classified into three classes, namely health-science, entertainment, and sports, because the models extract less discriminative features and therefore struggle to assign the correct class labels. Likewise, in the CLE Urdu Digest 1M dataset, a few samples of almost every class are mistakenly classified into the letter, health, culture, or press classes, whereas in the CLE Urdu Digest 1000K dataset, the three similar classes interviews, letters, and press show the largest number of samples wrongly classified into one another.
In contrast, the hybrid methodology performs far better, as it utilizes the filter-based feature selection algorithm NDM [47]. With NDM [47], a vocabulary of the most discriminative features is fed to the embedding layer of the adopted deep learning models, which helps the models better understand the class boundaries and thus substantially raises their performance on all three datasets. As summarized by Fig. 5, the models no longer confuse highly similar classes such as letter, interview, and press in either the CLE Urdu Digest 1M or the CLE Urdu Digest 1000K dataset.
The general observations drawn while deploying neural network architecture for Urdu text document classification are as follows:
• When the vocabulary of unique words is fed to the model, bidirectional LSTM outperforms all other neural architectures.
• When feeding highly discriminative features, convolution-based models are clearly the winners as they perform better than recurrent- and hybrid-based models. However, in a few scenarios, hybrid models perform similar to CNNs but not better.
• A model with a multi-layer CNN architecture with different filter sizes learns a better data representation than a model in which CNN layers are linearly stacked on top of each other.
• According to the experimentation, models give better results when the embedding layer is initialized with pre-trained word vectors and is updated during training.
• Implementing wide convolutions increases the performance of models on text document classification as wide convolution equalizes the participation of all features while convolving them.
• For text document classification, the performance of a deep learning model increases significantly when the model is fed with discriminative features instead of the full vocabulary of all unique terms.
• Max pooling layer plays a significant role to extract discriminative features.
• Using multiple embedding layers with a CNN architecture produces better results when the model is fed with discriminative features, while in all other scenarios there is no significant difference between the performance of models that use single and multiple embedding layers.

## 8 Conclusion

This paper may be considered a milestone towards Urdu text document classification, as it presents a new publicly available dataset (DSL Urdu News), introduces 10 filter-based feature selection algorithms into state-of-the-art machine learning-based Urdu text document classification methodologies, adopts 10 state-of-the-art deep learning methodologies, assesses the effectiveness of transfer learning using BERT, and evaluates a hybrid methodology that harvests the benefits of both machine learning-based feature engineering and deep learning-based automated feature engineering. Experimental results show that in the machine learning-based Urdu text document classification methodology, the SVM classifier outperforms Naive Bayes, as all feature selection algorithms produce better performance with SVM on two datasets (CLE Urdu Digest 1000K, CLE Urdu Digest 1M). NDM and CHISQ show promising performance with both classifiers, and GINI shows the worst performance with both classifiers. Furthermore, the adopted deep learning methodologies fail to reach promising performance with trivial automated feature engineering. Although using a vocabulary of the most frequent features raises the performance of the adopted deep learning methodologies, it fails to surpass the performance figures of the hybrid approach. The hybrid methodology has proved extremely versatile and effective with different languages. It substantially outperforms the adopted deep learning-based methodologies and almost equals the top performance of the machine learning methodologies on two datasets (DSL Urdu News, CLE Urdu Digest 1M). Similarly, BERT almost matches the performance of the hybrid methodology on datasets where the average document length does not exceed 512 tokens. However, for datasets where the average document length exceeds 512 tokens, the hybrid methodology performs better than BERT.
Nevertheless, on all three datasets the hybrid methodology fails to surpass the peak performance figures produced by the machine learning methodology, owing to the small size of the experimental datasets. To illustrate the point, consider the class interviews of CLE Urdu Digest 1M, which has only 38 documents; in this scenario, the deep learning-based hybrid methodology uses only 22 documents for training, which is far from sufficient. A compelling future line of this work would be the development of a robust neural feature selection algorithm that can assist the models in automatically selecting highly discriminative features from each class. In addition, comparing diverse data augmentation approaches, investigating the impact of ensembling feature selection algorithms, and assessing whether feeding the discriminative features of different filter-based feature selection algorithms to multiple channels of deep learning models can significantly raise the classification performance for the Urdu language are further promising directions.

## Compliance with ethical standards

Not applicable.

### Conflict of interest

The corresponding author, on behalf of all authors, declares that there is no conflict of interest.

### Availability of data and material

The DSL Urdu news dataset and other pre-processing resources will be available at the GitHub repository (https://github.com/minixain/Urdu-Text-Classification).

### Code availability

The source code will be available at the GitHub repository (https://github.com/minixain/Urdu-Text-Classification).
