
Open Access 05-02-2024 | Original Article

Quran reciter identification using NASNetLarge

Authors: Hebat-Allah Saber, Ahmed Younes, Mohamed Osman, Islam Elkabani

Published in: Neural Computing and Applications | Issue 12/2024


Abstract

Speaker identification has significant advantages for the field of human–computer interaction. Recently, many scholars have made contributions in this field and successfully created deep learning models for automatic speaker identification systems. However, most of the speech signal processing work is limited to English-only applications, despite numerous challenges with Arabic speech, particularly with the recitation of the Holy Quran, which is the Islamic holy book. In the light of these considerations, this study proposes a model for identifying the reciter of the Holy Quran using a dataset of 11,000 audio samples extracted from 20 Quran reciters. To enable feeding the audio samples' visual representation to the pre-trained models, the audio samples are converted from their original audio representation to visual representation using the Mel-Frequency Cepstrum Coefficients. Six pre-trained deep learning models are evaluated separately in the proposed model. The results from the test dataset reveal that the NASNetLarge model achieved the highest accuracy rate of 98.50% among the pre-trained models used in this study.
Notes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1 Introduction

Speaker identification is a technique that uses features extracted from speech signals to automatically determine who is speaking. Because of its potential usefulness in a variety of everyday activities, such as surveillance, biometrics, business transactions, telephone banking, access control systems, law enforcement, and voice mail browsing, it has recently attracted significant scientific interest. The task of speaker identification starts with gathering the speaker's audio data, followed by extracting audio features that can be utilized to identify each speaker, and finally selecting and training an appropriate model with the extracted features to achieve correct identification [1]. Hence, feature extraction methods and classification algorithms are essential parts of a speaker identification system [2]. Various tools and techniques with varying rates of success have been created for speaker identification systems. Features can be implemented as Zero Crossing Rate (ZCR) [3], Mel-Frequency Cepstral Coefficients (MFCCs) [4], Spectral Centroid (SC) [5], pitch frequency [6], etc. MFCC features are the most effective and popular features for acoustic modelling, particularly for speaker identification [7]. For the speaker identification task, machine learning models such as Linear Regression [8], Support Vector Machine (SVM) [9], and k-Nearest Neighbour (KNN) [10] have been implemented. However, the efficacy of such machine learning models relies on the richness and variety of their features [11]. Deep learning models have also been successfully employed in speaker identification tasks, although they raise some practical concerns: training a deep learning model from scratch requires an enormous amount of data, long training times, and vast computational power. Consequently, a transfer learning approach can be used to overcome these concerns [12]. Deep transfer learning can be applied by using the new training data to update the weights of a model previously trained on another task. Another way to apply deep transfer learning is to use a previously trained model as a feature extractor, followed by a classifier or a simple model that performs the identification [13]. Deep transfer learning has many benefits, such as shorter training time, improved neural network efficiency, and a reduced need for large amounts of data.
Although many studies have been conducted on speaker identification, relatively few of them have specifically focused on identifying the reciter of the Holy Quran. For Muslims, nothing is more sacred than the Holy Quran. Muslims perform five daily prayers, during which they must read from the Holy Quran. Many Muslims, regardless of their language, listen to and read the Holy Quran, especially during Ramadan (the Islamic month of fasting). However, Muslims cannot always identify the reciter of the Holy Quran when they hear it. Moreover, the Holy Quran is recited using a unique mechanism in which certain guidelines, known as Tajweed, must be followed at all times [14].
In this paper, an identification model for Holy Quran reciters is proposed. Six pre-trained deep learning models are assessed after being separately integrated into the proposed model. A Quranic dataset for twenty reciters of the Holy Quran is created by converting about 11,000 audio files into an image-based visual representation derived from their MFCC features. This paper is the first to use deep transfer learning models for efficient Quranic reciter identification.
The remaining sections are organized as follows: related work on the Arabic language, especially the Holy Quran, is reviewed in Sect. 2. Section 3 presents background on MFCC features, the deep transfer learning approach, and its pre-trained models. Section 4 describes the dataset used with the proposed model. Section 5 explains the proposed model for identifying the reciter of the Holy Quran. Section 6 presents and discusses the experimental results in terms of recognition rate. Section 7 concludes the paper and outlines future work.

2 Related work

Many researchers have explored speech processing; however, most of their efforts concentrate on English, and few have specifically targeted Arabic, even though Arabic speech poses many challenges, particularly for the Holy Quran. Reciting the Holy Quran correctly is one of these challenges, since unique procedures and guidelines, known as Tajweed rules, must be followed while reading. Al-Ayyoub et al. [15] attempted to solve the problem of determining the correct application of Tajweed rules throughout the Holy Quran. In particular, they examined eight Tajweed rules that beginners in recitation have to deal with. They combined traditional features (such as Linear Predictive Coding (LPC) [16], MFCCs [4], Hidden Markov Model-based Spectral Peak Location (HMM-SPL) [17], and Wavelet Packet Decomposition (WPD) [18]) with features extracted by a convolutional Deep Belief Network [19] and used SVM for classification. On an internal dataset of thousands of audio recordings, the obtained accuracy was 97.7%. Alagrami and Eljazzar [20] also addressed automatic recognition of the Arabic recitation rules of the Holy Quran (smart Tajweed). They used filter banks [21] to extract features as a baseline method and SVM for classification; the model achieved 99% validation accuracy for only four Tajweed rules. In addition to Tajweed, Makhraj (the parts of the mouth from which the Arabic alphabet is uttered) is something Muslims should know to read the Holy Quran correctly [22]. As a result, recognizing the Makhraj from the reciter's utterance is another challenge of the Holy Quran. Hamid et al. [23] developed an approach for Makhraj recognition employing MFCC features and the Mean Squared Error (MSE) for pattern matching of the hijaiyah letters. This model achieved 100% precision on a dataset created from people between the ages of 21 and 23 who are experts in Makhraj utterance. The categorization of the specific melodies, known in Arabic as maqams, used by the Holy Quran's reciters is a further challenge. Shahriar and Tariq [14] used MFCC features and an Artificial Neural Network (ANN) consisting of five deep layers to classify the maqams of Holy Quran recitations, achieving a 95.7% accuracy rate on a dataset created from two Quranic reciters. Identifying the reciter of the Holy Quran is also a difficult challenge: although each Quran reciter has a unique voice signal, these individual signals tend to converge due to the influence of Tajweed rules and recitation style. Alkhateeb [24] developed a machine learning algorithm for the identification of Holy Quran reciters. MFCC features are extracted from ten reciters, and both a KNN classifier and an ANN classifier are used for classification. The ANN achieved 97.62% accuracy for Chapter 18 and 96.7% for Chapter 36, while KNN achieved 97.03% for Chapter 18 and 96.08% for Chapter 36. Anazi and Shahin [25] used the same model for reciter identification as in [24], but constructed the MFCC feature dataset from another ten reciters and different Quranic chapters. The ANN performed with an average accuracy of 98.5% on Chapter 7 and 97.2% on Chapter 32, while KNN's average accuracy is 97.02% for Chapter 7 and 96.07% for Chapter 32.
Nahar et al. [26] proposed two models, an ANN and an SVM, to identify the Quranic reciter out of 15 reciters using MFCC features. The accuracy rate reached 96.59% using the SVM classifier and 86.1% using the ANN, on a dataset composed of 230 verses for each of the 15 reciters. In addition, Shah and Ahsan [27] introduced an Arabic speaker identification system using a combination of Discrete Wavelet Transform (DWT) [28] and LPC [16] features, achieving a 90.90% recognition accuracy on a dataset of five reciters.

3 Background

This section describes the necessary background information, which includes the MFCCs audio feature extraction step as well as pre-trained models for deep transfer learning.

3.1 MFCC audio features

Extraction of features from the audio stream is the initial step in the speaker identification task, since features represent the audio stream in a compressed form. Good features are insensitive to environmental changes and audio variations [29]. MFCCs are among the most effective and popular features extracted from audio [30]. The number of MFCCs is small enough to represent the audio effectively, they remain robust when the signal is affected by noise, and they closely mimic human auditory perception.
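As a concrete illustration, the following minimal sketch extracts 13 MFCCs from a 20 s clip using the librosa library; the file name, sampling rate, and framing parameters are illustrative assumptions rather than settings taken from this study.

```python
import librosa

# Minimal sketch: extract 13 MFCCs from a 20-second audio clip.
# "recitation.wav" and the 22,050 Hz sample rate are placeholders,
# not values taken from the paper.
y, sr = librosa.load("recitation.wav", sr=22050, duration=20.0)

# 25 ms windows with a 10 ms hop (one common reading of the 25 ms / 10 ms
# framing described in Sect. 5).
mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=13,
    n_fft=int(0.025 * sr), hop_length=int(0.010 * sr)
)
print(mfcc.shape)  # (13, number_of_frames)
```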

3.2 Deep transfer learning model

The transfer learning technique is one of the most widely used deep learning methods of the past few years. In the deep transfer learning approach, a pre-trained model that was trained on one problem is reused to solve another, related problem. Deep transfer learning can act as a weight-initialization method, in which the pre-trained weights and biases are used to initialize a model trained on a new dataset instead of random weights and biases [31]. Most pre-trained models, such as the NASNet models [33], EfficientNet [34], and EfficientNetV2 [35], are trained on the ImageNet dataset [32]. Figure 1 presents a general pipeline for a traditional deep transfer learning approach. Traditional machine learning and deep learning models need to be trained from scratch, which is computationally expensive and requires a lot of data to work well. The deep transfer learning approach, on the other hand, is computationally cheaper and allows better results to be obtained from a small dataset. Moreover, traditional learning models take longer to reach their best performance than pre-trained models in the transfer learning approach, because transfer learning lets new tasks reuse information (features, weights, etc.) from pre-trained models that have already learned the features. There are two approaches to using deep transfer learning: the develop-model approach and the pre-trained-model approach. The develop-model approach involves keeping the weights of the previously trained model fixed (frozen) in some layers, while the rest of the network is fine-tuned (trained). In the pre-trained-model approach, the pre-trained model is used as a higher-level feature extractor, and the features extracted from it are subsequently fed into either a classifier or another model [36]. The pre-trained models applied in this paper are described in the following subsections.
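The following minimal Keras sketch illustrates the two approaches, using NASNetMobile as an example backbone; the frozen-layer split point, input size, and output dimension are illustrative assumptions, not the exact configuration used in this paper.

```python
import tensorflow as tf

# Illustrative sketch of the two transfer-learning approaches described
# above, using NASNetMobile as an example backbone.
base = tf.keras.applications.NASNetMobile(
    weights="imagenet", include_top=False,
    input_shape=(224, 224, 3), pooling="avg")

# Approach 1 (fine-tuning): freeze the early layers, retrain the rest.
for layer in base.layers[:-20]:   # the "-20" split point is arbitrary here
    layer.trainable = False

# Approach 2 (feature extraction, the approach used in this paper): freeze
# the whole backbone and attach a new classifier head on its features.
base.trainable = False
outputs = tf.keras.layers.Dense(20, activation="softmax")(base.output)
model = tf.keras.Model(base.input, outputs)
```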

3.2.1 NASNet

The Neural Architecture Search Network (NASNet) was presented in 2018 [33] by Google Brain. NASNet can generate network structures automatically, eliminating the need to hand-design the network model for a given dataset, and it can reduce the number of parameters while maintaining accuracy. The main search method in the NASNet pre-trained model is the NAS framework [37], which consists of a Convolutional Neural Network (CNN) [38] and a Controller Recurrent Neural Network (CRNN). The CRNN evaluates the performance of the CNN child network and controls its design using Reinforcement Learning (RL) [39]. The authors searched for an architectural building block on a small dataset (CIFAR-10), a subset of the 80 Million Tiny Images dataset [40], and then transferred the block to a larger dataset (ImageNet) [32] to obtain a higher mean Average Precision (mAP) [41]. They performed a basic search over learning rate schedules to discover the optimal model and obtained multiple models, each consisting of several normal cells and reduction cells. Each cell is built from a set of operations commonly used in CNN models with varying kernel sizes, such as convolutions, average pooling, max pooling, and separable convolutions. The arrangement of normal and reduction cells for CIFAR-10 and ImageNet is depicted in Fig. 2. The ImageNet architecture has more reduction cells than the CIFAR-10 one because the incoming image size is 299 × 299 instead of 32 × 32. The best-performing architecture is NASNet-A with N = 6 and 4032 filters in the initial convolutional cell, known as NASNetLarge, and with N = 6 and 1056 filters, known as NASNetMobile, where N is the number of times normal cells are repeated between reduction cells [33]. NASNetLarge achieved state-of-the-art 82.7% top-1 and 96% top-5 accuracy on ImageNet with 88.9 million parameters, while NASNetMobile achieved 74.4% top-1 and 91.9% top-5 accuracy with 5.3 million parameters [33].

3.2.2 EfficientNet-B

The EfficientNet-B family of pre-trained models was introduced in 2019 [34]. It contains eight models, B0 to B7; as the model index increases, the number of computed parameters does not grow dramatically, while accuracy noticeably improves [42]. These models vary in width, depth, resolution, and size. The MBConv blocks [34], which consist of layers that compress and then expand the channels, serve as the foundation of the EfficientNet model, as shown in Fig. 3. The EfficientNet models differ in the number of MBConv blocks they contain. EfficientNetB7, the top-performing EfficientNet model, is 8.4x smaller and 6.1x faster than other models available at the time [42]. With 66 million parameters, EfficientNetB7 achieved state-of-the-art accuracy on ImageNet of 84.4% top-1 and 97.1% top-5. Depending on the number of channels, the stride, and the filter size, EfficientNetB7 can be divided into seven blocks [42], as shown in Fig. 4.

3.2.3 EfficientNetV2

EfficientNetV2 is a newer family of EfficientNet models presented in 2021 [35]. It delivers high performance with a short training period. The main building blocks of EfficientNetV2 are MBConv and Fused-MBConv [35]; Figs. 3 and 5 show the structure of MBConv and Fused-MBConv, respectively. EfficientNetV2 uses Fused-MBConv in the early stages and MBConv in the later ones [43]. EfficientNetV2 models train significantly faster than state-of-the-art models while being up to 6.8x smaller. The family is divided into seven variants: EfficientNetV2S, EfficientNetV2M, EfficientNetV2L, EfficientNetV2S (21k), EfficientNetV2M (21k), EfficientNetV2L (21k), and EfficientNetV2XL (21k), where '21k' denotes pre-training on the ImageNet-21k images [35]. EfficientNetV2S, EfficientNetV2M, and EfficientNetV2L are the three versions of EfficientNetV2 used in this study. EfficientNetV2S achieves a state-of-the-art 84.9% top-1 accuracy on ImageNet with 22 million parameters; its structure is illustrated in Fig. 6. EfficientNetV2M obtains a state-of-the-art top-1 accuracy of 86.2% on ImageNet with 54 million parameters; Fig. 7 depicts its architecture. EfficientNetV2L uses 208 million parameters to achieve a state-of-the-art 86.1% top-1 accuracy on ImageNet; Fig. 8 illustrates its structure.

4 The proposed dataset

The Holy Quran is divided into 30 chapters and contains 114 surahs. Each chapter has a distinct number of surahs, and each surah has a variable length. Long surahs include Al-Bakara and Al-Emran, which have 286 and 200 verses, respectively, whereas other surahs, such as Al-Ikhlas, Al-Falaq, and Al-Nas, include only a few verses: 4, 5, and 6, respectively. This study uses 7 surahs and the audio recitations of twenty different reciters as its dataset. To make the identification process more challenging, the reciters' voices were chosen to be as similar as possible. The MP3 recitations were obtained from a publicly accessible audio dataset at http://ourquraan.com/. The length of each clipped audio sample is exactly 20 s. Overall, the database is composed of 11,000 audio segments, with 550 segments allocated for each reciter; a sketch of this segmentation step is given after Table 2. The surahs used in the dataset are listed in Table 1.
Table 1
The dataset distribution from the Holy Quran

  No.  The name of the surah  The number of samples (per reciter)
  1    Al-Bakara              95
  2    Al-Emran               75
  3    Al-Nesaa               70
  4    Al-Maeda               50
  5    Al-Anaam               60
  6    Al-Aaraf               135
  7    Al-Anfal               65
In addition, the names of twenty distinct reciters are included in Table 2.
Table 2
The names of twenty reciters

  Label in the dataset  The name of the reciter
  0                     Abdul Basit Abdul Samad
  1                     Saad Al-Ghamdi
  2                     Yasser Al-Dosary
  3                     Abdul Rahman Al-Awsi
  4                     Mahmoud Ali Al-Banna
  5                     Mahmoud Khalil El-Hosary
  6                     Mahmoud El-Tablawy
  7                     Mohamed Gabriel
  8                     Hatem Farid
  9                     Naser Al-Katamy
  10                    Abdullah Basfar
  11                    Al-Ajmi
  12                    Ali Al-Hudhaifi
  13                    Abdul Rashid Al-Sufi
  14                    Mishary Rashid
  15                    Fares Abbad
  16                    Ahmed Naina
  17                    Abdel Moneim Abdel Muttalib
  18                    Wadih Al-Yamani
  19                    Ahmed Amer
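As an illustration of how such a dataset can be prepared, the sketch below cuts one recitation file into non-overlapping 20 s segments; the file names and the 22,050 Hz sampling rate are placeholders and are not details reported in the paper.

```python
import librosa
import soundfile as sf

# Illustrative sketch: cut one recitation file into non-overlapping
# 20-second segments. The file name, output pattern, and 22,050 Hz
# sample rate are placeholders, not details taken from the paper.
SEGMENT_SECONDS = 20

audio, sr = librosa.load("reciter_00_al_bakara.mp3", sr=22050, mono=True)
samples_per_segment = SEGMENT_SECONDS * sr

for idx in range(len(audio) // samples_per_segment):
    start = idx * samples_per_segment
    segment = audio[start:start + samples_per_segment]
    sf.write(f"reciter_00_segment_{idx:03d}.wav", segment, sr)
```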

5 The proposed model

The proposed model focuses on determining the reciter of the Holy Quran; its pipeline is shown in Fig. 9. In this work, the input consists of various audio files containing the reciters' voice signals, where each audio file is 20 s long. The speaker identification task is treated as an image classification problem, so a pre-processing phase is applied first to convert the audio files from a sound representation to an image representation before moving on to the feature extraction phase. To this end, the MFCCs are computed from the audio samples. To obtain the MFCCs [7], the audio signal is first pre-emphasized to boost the energy in the high frequencies and make them more available to the acoustic model [30]. The pre-emphasis is applied to the audio files using the transformation filter given by Eq. 1.
$${S}^{\prime}_{n}={S}_{n}-\alpha {S}_{n-1}, \quad 0.9<\alpha <1$$
(1)
where \({S}_{n}\) is the input signal and \({S}^{\prime}_{n}\) is the output after applying pre-emphasis.
After that, the pre-emphasized audio signal is segmented into frames. In this study, each frame is 25 ms long, and adjacent frames overlap by 10 ms. After the data are split into frames, each frame is multiplied by a window function, as given in Eq. 2.
$$X\left(n\right)=W\left(n\right){S}^{\prime}_{n} , 0\le n\le N-1$$
(2)
where \(N\) is the number of samples, \(W\left(n\right)\) is the window applied to the original input sample, and \({S}^{\prime}_{n}\) is the pre-emphasized audio signal.
The Hamming window is used for windowing [44], which is typically expressed as Eq. 3.
$$W\left(n\right)=0.54-0.46\,{\text{cos}}\left(\frac{2\pi n}{N-1}\right), \quad 0\le n\le N-1$$
(3)
Discrete Fourier Transform (DFT) [45] is then performed on each frame to transform the signal from its time domain representation to its frequency representation using Eq. 4 in order to get the speech signal's spectrum.
$$X\left(k\right)=\sum_{n=0}^{N-1}X\left(n\right){e}^{-j\frac{2\pi }{N}kn}, \quad 0\le k\le N-1$$
(4)
where \(X\left(k\right)\) is the frequency representation that matches up its corresponding time domain signal \(X\left(n\right)\).
Following that, the Mel-scale power spectrum coefficients are calculated from the audio frequency representation in two steps. Firstly, the Mel frequencies corresponding to the lowest and highest linear frequencies are calculated using Eq. 5.
$$Mel\left(f\right)=1127ln\left(1+\frac{f}{700}\right)$$
(5)
where \(Mel\left(f\right)\) is the Mel frequency that matches up with the linear frequency \(f\).
Secondly, a triangular Mel-scale filter bank [21] is applied to highlight significant frequencies in order to mimic what a human would perceive. Mel-scale power spectrum coefficients are calculated using Eq. 6.
$$S\left(m\right)=\sum_{k=0}^{N-1}\left|X\left(k\right)\right|{W}_{m}\left(k\right), \quad 1\le m\le M,\; 0\le k\le N-1$$
(6)
where \(M\) is the number of Mel filter banks, \(k\) is the DFT bin number, \(X\left(k\right)\) is the DFT point of the particular window frame of the input signal, and \({W}_{m}\left(k\right)\) is the \(m\)-th triangular Mel-scale filter.
Since the Mel-spectral values for each frame are closely correlated, it is necessary to apply an appropriate transformation. In this work, the Discrete Cosine Transform (DCT) is used to produce uncorrelated cepstral features using Eq. 7.
$${C}_{n}=\sum_{m=0}^{M-1}{\text{log}}\left(S\left(m\right)\right)\,{\text{cos}}\left(n\left(m+0.5\right)\frac{\pi }{M}\right)$$
(7)
where \(S\left(m\right)\) is the output of the filter bank and \({C}_{n}\) are the cepstral coefficients.
So far, 12 cepstral coefficients have been extracted per frame. The 13th coefficient, the frame energy [30], can also be calculated for each frame from the time domain signal without any windowing using Eq. 8.
$$E=\sum_{n=1}^{N}x(n{)}^{2}$$
(8)
where \(x(n)\) is the audio, and N is the frame’s length.
After completing these calculations, a total of 13 MFCC coefficients are extracted per frame. The proposed method uses only these 13 MFCC cepstral features per frame to reduce the dimensionality of the feature space and make the model feasible for real-time implementation. These steps are applied to all frames of the whole audio file until the MFCC feature sequence of the file is obtained. The MFCC feature sequence is then visualized to produce the MFCC image, which is the visual representation of the audio file. The previous steps are applied to all 11,000 reciter audio files. Figure 10 shows the MFCC feature sequence representation for one of the twenty reciters of the Holy Quran.
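A compact sketch of this pre-processing pipeline (Eqs. 1-8) is given below; the pre-emphasis coefficient, FFT size, number of Mel filters, and the way the MFCC matrix is rendered as an image are assumptions, since these settings are not stated in the paper.

```python
import numpy as np
import librosa
import matplotlib.pyplot as plt
from scipy.fftpack import dct

# Sketch of the pre-processing pipeline in Eqs. 1-8 (illustrative settings;
# the FFT size, Mel-filter count, and image rendering are assumptions).
y, sr = librosa.load("segment.wav", sr=22050)

# Eq. 1: pre-emphasis with alpha = 0.97 (any value in (0.9, 1) is allowed).
alpha = 0.97
y = np.append(y[0], y[1:] - alpha * y[:-1])

# Framing: 25 ms frames, 10 ms hop, as described above.
frame_len, hop = int(0.025 * sr), int(0.010 * sr)
frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop).T

# Eq. 8: per-frame energy, computed before windowing.
energy = np.sum(frames ** 2, axis=1)

# Eqs. 2-3: Hamming window, then Eq. 4: magnitude spectrum via the DFT.
frames = frames * np.hamming(frame_len)
spectrum = np.abs(np.fft.rfft(frames, n=1024, axis=1))

# Eqs. 5-6: triangular Mel filter bank (40 filters is an assumption).
mel_fb = librosa.filters.mel(sr=sr, n_fft=1024, n_mels=40)
mel_energies = spectrum @ mel_fb.T

# Eq. 7: DCT of the log filter-bank energies; keep 12 coefficients,
# then append the frame energy as the 13th feature.
cepstra = dct(np.log(mel_energies + 1e-10), axis=1, norm="ortho")[:, 1:13]
mfcc = np.hstack([cepstra, energy[:, None]])

# Visualize the 13 x frames feature matrix and save it as the model input image.
plt.imshow(mfcc.T, aspect="auto", origin="lower")
plt.axis("off")
plt.savefig("segment_mfcc.png", bbox_inches="tight", pad_inches=0)
```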
After that, the second deep transfer learning approach, which uses the pre-trained model as a feature extractor, is applied. The main advantage of this approach is that the data are run through the pre-trained model only once, rather than once every training epoch, so it is much faster and less computationally expensive. Six pre-trained models are adopted separately at this stage: NASNetMobile, NASNetLarge, EfficientNetB7, EfficientNetV2S, EfficientNetV2M, and EfficientNetV2L. Each MFCC image representing an audio file is resized to match the input size of the pre-trained model; the image size for each pre-trained model is shown in Table 3, and a resizing sketch follows the table.
Table 3
Input image size for transfer learning models

  Pre-trained model  Input image size
  NASNetMobile       224 × 224 × 2
  NASNetLarge        331 × 331 × 2
  EfficientNetB7     600 × 600 × 2
  EfficientNetV2S    300 × 300 × 2
  EfficientNetV2L    380 × 380 × 2
  EfficientNetV2M    380 × 380 × 2
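The short sketch below resizes an MFCC image to the spatial resolution expected by each backbone, following Table 3; loading the plots as 3-channel PNG files is an assumption about how the MFCC images are stored.

```python
import tensorflow as tf

# Sketch: resize an MFCC image to the input resolution expected by each
# backbone (spatial sizes from Table 3; decoding as a 3-channel PNG is an
# assumption about how the MFCC plots are saved).
INPUT_SIZE = {
    "NASNetMobile": (224, 224),
    "NASNetLarge": (331, 331),
    "EfficientNetB7": (600, 600),
    "EfficientNetV2S": (300, 300),
    "EfficientNetV2M": (380, 380),
    "EfficientNetV2L": (380, 380),
}

def load_mfcc_image(path, model_name):
    img = tf.io.read_file(path)
    img = tf.image.decode_png(img, channels=3)
    img = tf.image.resize(img, INPUT_SIZE[model_name])
    return img

x = load_mfcc_image("segment_mfcc.png", "NASNetLarge")  # shape (331, 331, 3)
```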
The features extracted by the pre-trained model are fed into a flatten layer, which flattens the pre-trained model's output. Following that, two dense (1024) hidden layers are applied. For both hidden layers, a Batch Normalization (BN) [46] layer is placed between the linear layer and the nonlinear ReLU activation function [47]. BN is performed to avoid overfitting [48], to normalize the input data for the next layer so as to combat the internal covariate shift issue [46], and to improve the learning speed of the neural network. BN is applied to all batches, where each batch has \(m\) samples. The initial step of BN is to calculate the inference mean \({E}_{x}\) and the inference variance \({Var}_{x}\) of the layer inputs by Eqs. 9 and 10, respectively [46].
$${E}_{x}=\frac{1}{j}\sum_{i=1}^{j}{\mu }_{B}^{(i)}$$
(9)
$${Var}_{x}=\left(\frac{m}{m-1}\right)\frac{1}{j}\sum_{i=1}^{j}{\sigma }_{B}^{2(i)}$$
(10)
where \(j\) is the number of batches; the batch mean \({\mu }_{B}\) and batch variance \({\sigma }_{B}^{2}\) are obtained using Eqs. 11 and 12, respectively [46].
$${\mu }_{B}=\frac{1}{m}\sum_{i=1}^{m}{x}_{i}$$
(11)
$${\sigma }_{B}^{2}=\frac{1}{m}\sum_{i=1}^{m}{({x}_{i}-{\mu }_{B})}^{2}$$
(12)
After that, \({E}_{x}\) and \({Var}_{x}\) are utilized to normalize the layer inputs, and the learnable parameters \(\gamma\) and \(\beta\) are used to scale and shift the normalized input to obtain the layer's output. The normalized output is given by Eq. 13.
$$y=\frac{\gamma }{\sqrt{{Var}_{x}+c}}x+\left(\beta -\frac{\gamma {E}_{x}}{\sqrt{{Var}_{x}+c}}\right)$$
(13)
where \(\gamma\) and \(\beta\) are the learnable parameters of BN, and \(c\) is a small constant added for numerical stability.
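The following small numerical check, with illustrative values, verifies that Eq. 13 is simply the inference-time BN transform \(y=\gamma ({x}-{E}_{x})/\sqrt{{Var}_{x}+c}+\beta\) rearranged into slope and intercept form.

```python
import numpy as np

# Small numeric check of Eq. 13: the inference-time BN output
#   y = gamma * (x - E_x) / sqrt(Var_x + c) + beta
# rearranged into the slope/intercept form used above. All values are
# illustrative; c plays the role of the small stability constant.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
E_x, Var_x = x.mean(), x.var(ddof=0)
gamma, beta, c = 1.5, 0.2, 1e-5

y_direct = gamma * (x - E_x) / np.sqrt(Var_x + c) + beta
y_eq13 = (gamma / np.sqrt(Var_x + c)) * x + (beta - gamma * E_x / np.sqrt(Var_x + c))
print(np.allclose(y_direct, y_eq13))  # True
```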
Also, the mathematical representation of the ReLU activation function, which is fed with the normalized data, is given by Eq. 14.
$$F\left(x\right)={\text{max}}\left(0,x\right)$$
(14)
where \(x\) is the normalized data. Finally, the output of the hidden layers is fed into a SoftMax activation function [49] to identify the twenty identity classes of the Holy Quran reciters. The SoftMax function is given by Eq. 15. It generates outputs in the range between 0 and 1 whose probabilities sum to 1, and the class with the highest probability is taken as the identity of the Quranic reciter.
$${\sigma \left(\mathbf{z}\right)}_{j}=\frac{{e}^{{z}_{j}}}{\sum_{k=1}^{K}{e}^{{z}_{k}}},\quad for\ j=1,\dots ,K$$
(15)
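Putting the pieces together, the sketch below assembles the head described in this section (flatten, two Dense(1024) blocks with BN and ReLU, and a 20-way SoftMax output) on top of a frozen NASNetLarge backbone; implementation details not specified in the paper, such as the exact Keras layer ordering, are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Sketch of the proposed architecture: a frozen NASNetLarge feature
# extractor followed by the head described above (flatten, two Dense(1024)
# blocks with Batch Normalization and ReLU, and a 20-way SoftMax output).
base = tf.keras.applications.NASNetLarge(
    weights="imagenet", include_top=False, input_shape=(331, 331, 3))
base.trainable = False  # features are extracted, not fine-tuned

inputs = tf.keras.Input(shape=(331, 331, 3))
x = base(inputs, training=False)
x = layers.Flatten()(x)
for _ in range(2):                      # two hidden blocks
    x = layers.Dense(1024)(x)           # linear layer
    x = layers.BatchNormalization()(x)  # BN between the linear layer and ReLU
    x = layers.ReLU()(x)
outputs = layers.Dense(20, activation="softmax")(x)  # twenty reciters

model = tf.keras.Model(inputs, outputs)
model.summary()
```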

6 Results and discussion

In this section, experiments are conducted on the created dataset to evaluate the proposed models. The experimental settings, training details, and performance metrics are presented before the results are shown.

6.1 Experimental setup

The proposed model is implemented in Python using the Keras framework with a TensorFlow backend and run on Google Colab Pro. The model was built and trained on a machine with 16 GB of RAM and an 11th-generation Intel(R) Core(TM) i7-11370H processor.

6.2 Training

The created dataset is randomly divided into 60% for training, 20% for validation, and 20% for testing. K-fold cross-validation is not used because it requires a lot of processing on the limited hardware resources available; to prevent over-fitting, batch normalization, a held-out test split, and independent validation are used instead [50]. The proposed model is trained on the training split, validated on the validation split, and the test split is used to make predictions. The images in the splits are augmented and resized to the standard input size of their respective transfer learning models. The dataset is then divided into batches of size 16, and the number of epochs is set to 150. The RMSProp optimizer is applied with a learning rate of 0.0001. The output layer uses the SoftMax activation function, and the categorical cross-entropy loss function is used.
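A minimal training sketch with these settings is shown below; `images`, `labels` (one-hot encoded), and `model` are assumed to be the arrays and network prepared in Sect. 5, and the random seed is arbitrary.

```python
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Sketch of the training configuration described above. `images` and
# `labels` (one-hot, 20 classes) are assumed to be prepared NumPy arrays;
# `model` is the network built in Sect. 5.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    images, labels, test_size=0.4, random_state=42)   # 60% training
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42)     # 20% validation, 20% test

model.compile(
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-4),
    loss="categorical_crossentropy",
    metrics=["accuracy"])

model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          batch_size=16, epochs=150)

test_loss, test_acc = model.evaluate(X_test, y_test)
```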

6.3 Performance metric

In this study, the identification of the Quranic reciter is treated as an audio classification challenge. The top-1 accuracy is used for assessing the performance of the proposed model on the dataset and is also utilized as a performance indicator because the expected class should have the highest probability. Additionally, sensitivity is calculated to assess the model’s proficiency in capturing all relevant instances, representing the ratio of true positives to the sum of true positives and false negatives.

6.4 Results

In this work, six pre-trained deep learning models are assessed after being integrated separately into our proposed model. The pre-trained models used are NASNetMobile, NASNetLarge, EfficientNetB7, EfficientNetV2S, EfficientNetV2M, and EfficientNetV2L. Table 4 reports the performance of the proposed model on the proposed dataset when each of the six pre-trained deep learning models is used separately for the feature extraction phase. The testing accuracy estimates how accurate the chosen deep transfer models are; Eq. 16 calculates the testing accuracy value.
Table 4
The accuracy of training, validation, and testing for the proposed models

  Proposed model   Train    Validate  Test
  NASNetMobile     99.86%   89.68%    87.95%
  NASNetLarge      99.87%   98.45%    98.50%
  EfficientNetB7   99.87%   98.45%    98.40%
  EfficientNetV2S  99.79%   97.50%    97.36%
  EfficientNetV2M  99.77%   96.45%    95.81%
  EfficientNetV2L  99.61%   97.18%    96.54%

Bold font in the original indicates the highest accuracy, obtained with the NASNetLarge deep transfer learning model
$${\text{Testing}}\,{\text{accuracy}}=\frac{{\text{Correct}}\,{\text{Prediction}}}{{\text{Total}}\,{\text{cases}}}$$
(16)
The testing accuracy that was obtained for the six pre-trained deep learning models, which are NASNetMobile, NASNetLarge, EfficientNetB7, EfficientNetV2S, EfficientNetV2M, and EfficientNetV2L, is 87.95, 98.50, 98.40, 97.36, 95.81, and 96.54%, respectively. The NASNetLarge achieved the highest testing accuracy with 98.50%, followed by EfficientNetB7 at 98.40%.
The evaluation of a model involves the calculation of Sensitivity and Specificity. Sensitivity, also referred to as Recall or True Positive Rate, gauges the model's proficiency in correctly recognizing instances of a specific class. A higher sensitivity indicates the model's effectiveness in capturing true positive cases, which is particularly crucial in scenarios where overlooking positive cases carries significant consequences. On the other hand, Specificity assesses the model's accuracy in identifying instances of the negative class. It plays a pivotal role in computing the False Positive Rate (FPR), expressed as FPR = 1 - Specificity. The FPR signifies the proportion of actual negative cases incorrectly predicted as positive by the model. Equations 17 and 18 are employed to compute Sensitivity and Specificity for each class in a multiclass classification model, respectively.
$${{\text{sensitivity}}}_{i}=\frac{{{\text{TP}}}_{i}}{{{\text{TP}}}_{i}+{{\text{FN}}}_{i}}$$
(17)
$${\mathrm{ specificity}}_{i}=\frac{{{\text{TN}}}_{i}}{{{\text{TN}}}_{i}+{{\text{FP}}}_{i}}$$
(18)
where \(i\) denotes each class, \({{\text{TP}}}_{i}\) is True Positives for class\(i\), \({{\text{FN}}}_{i}\) is False Negatives for class \(i\), \({{\text{TN}}}_{i}\) is True Negatives for class \(i\), and \({{\text{FP}}}_{i}\) is False Positives for class \(i.\)
Macro-Averaged Sensitivity and Macro-Averaged Specificity are computed as indicators of the overall Sensitivity and Specificity for the entire model. These values are denoted by Eqs. 19 and 20, encapsulating the comprehensive performance across all classes.
$${\text{Macro}}\,{\text{average}}\,{\text{sensitivity}}= \frac{1}{C}\sum_{i=1}^{C}{{\text{sensitivity}}}_{i}$$
(19)
$${\text{Macro}}\,{\text{average}}\,{\text{specificity}}= \frac{1}{C}\sum_{i=1}^{C}{{\text{specificity}}}_{i}$$
(20)
where \(C\) is the total number of classes.
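The sketch below computes these per-class and macro-averaged metrics from a confusion matrix; the randomly generated labels are stand-ins for the test-set ground truth and model predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Sketch of Eqs. 17-20: per-class sensitivity and specificity from the
# confusion matrix, then macro-averaged. In practice y_true holds the test
# labels and y_pred the model's predicted classes; random stand-ins are
# used here so the snippet runs on its own.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 20, size=2200)
y_pred = rng.integers(0, 20, size=2200)

cm = confusion_matrix(y_true, y_pred, labels=range(20))
TP = np.diag(cm)
FN = cm.sum(axis=1) - TP
FP = cm.sum(axis=0) - TP
TN = cm.sum() - (TP + FN + FP)

sensitivity = TP / (TP + FN)                                # Eq. 17, per class
specificity = TN / (TN + FP)                                # Eq. 18, per class
print("Macro-averaged sensitivity:", sensitivity.mean())    # Eq. 19
print("Macro-averaged specificity:", specificity.mean())    # Eq. 20
print("FPR:", 1 - specificity.mean())
```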
Figures 11 and 12 present the learning curves and loss curves for the six pre-trained models proposed in this study. Additionally, Fig. 13 shows the confusion matrix, a reliable metric offering detailed insight into the testing accuracy for each class; it provides valuable information beyond overall accuracy and allows a deeper understanding of the model's performance across individual classes. Complementing this, Fig. 14 features ROC curves, a helpful tool for assessing the models' ability to discriminate between classes.

6.5 Discussion

This study introduced a proposed model to identify the reciters of the Holy Quran and evaluated it on a suggested dataset. Although the deep transfer learning technique has many advantages, no previous studies have used it for this task, so this study attempted to apply it. Consequently, six pre-trained models from the transfer learning approach were evaluated after each was implemented separately as a feature extractor in our proposed model. From Tables 4 and 5, it can be seen that NASNetMobile shows the lowest training, validation, and testing accuracies, along with the lowest sensitivity and specificity. Furthermore, there is a large gap between its training curve and validation curve, indicating that overfitting is occurring, as shown in Figs. 11a and 12a.
Table 5
Summary of Sensitivity, Specificity, and False Positive Rate (FPR) for each model

  Proposed model   Sensitivity (TPR)  Specificity  FPR
  NASNetMobile     87.95%             99.39%       0.57%
  NASNetLarge      98.50%             99.92%       0.08%
  EfficientNetB7   98.40%             99.91%       0.09%
  EfficientNetV2S  97.36%             99.86%       0.14%
  EfficientNetV2M  95.81%             99.77%       0.23%
  EfficientNetV2L  96.54%             99.81%       0.19%

Bold font in the original indicates the best results, obtained with the NASNetLarge deep transfer learning model
The EfficientNetV2 models attained higher training, validation, and testing accuracies, as well as higher sensitivity and specificity ratios. Figure 11d, e, and f for EfficientNetV2S, EfficientNetV2M, and EfficientNetV2L, respectively, show the associated learning curves, and Fig. 12d, e, and f show the associated loss curves. While EfficientNetV2S demonstrates higher accuracy, the fluctuations in Figs. 11d and 12d suggest that the model is precariously balanced between overfitting and underfitting. Additionally, Figs. 11e and 12e indicate a slight elevation of the validation curve compared to the training curve, implying a minor occurrence of overfitting for EfficientNetV2M. Furthermore, Figs. 11f and 12f reveal a notable gap between the validation and training curves, with the validation curve surpassing the training curve, indicating a substantial overfitting issue for EfficientNetV2L. EfficientNetB7 outperforms NASNetMobile and the EfficientNetV2 models in terms of accuracy, sensitivity, and specificity. However, Figs. 11c and 12c reveal fluctuations and an elevation of the validation curve relative to the training curve, suggesting a potential occurrence of overfitting in EfficientNetB7 despite its superior performance. On the other hand, the NASNetLarge model achieved the highest training, validation, and testing accuracies, along with the highest sensitivity and specificity.
The pre-trained NASNetLarge model applied in the proposed model performs consistently with a good fit. The training and validation losses fall to a point of stability, with a gap between the two loss curves so small that it is barely visible, indicating that there is no overfitting, as shown in Figs. 11b and 12b. From Fig. 13, which illustrates the confusion matrix for each model, specific observations can be made. Figure 13a shows that NASNetMobile has the lowest count of correct predictions, whereas the EfficientNetV2 models, illustrated in Fig. 13d, e, and f, exhibit a higher number of correct predictions. Figure 13c indicates that EfficientNetB7 achieves an even higher count of accurate predictions, and Fig. 13b reveals that the NASNetLarge model achieves the highest count of correct predictions across all classes among all pre-trained models. Turning to Fig. 14, the ROC curves for all classes in all models generally approach the top-left corner, indicating robust discrimination ability; however, NASNetMobile's curves are slightly farther from the top-left corner. Our proposed model achieved remarkable performance on the problem of identifying the reciter of the Holy Quran. This is due to the use of MFCCs to represent the audio samples, the use of a pre-trained model to provide more accurate features, and Batch Normalization to reduce overfitting. Some samples were incorrectly classified because all reciters of the Holy Quran must read the verses according to the same standard rules (Tajweed); applying Tajweed increases the possibility of intermodulation of the sound waves coming from different reciters. Additionally, each reciter has a "maqam" delivery style that is frequently shared with other reciters. Comparing the performance of the proposed model against the criteria used in existing applications was challenging, and the comparison proved difficult due to differences in the datasets used.

7 Conclusion

This research introduces a proposed dataset for twenty reciters of the Holy Quran, comprising 11,000 audio files that have been converted from an audio representation to a visual representation using the Mel-Frequency Cepstrum Coefficients. Furthermore, we propose the first Holy Quran reciter identification model that applies a transfer learning approach. Six pre-trained deep learning models, including NASNetMobile, NASNetLarge, EfficientNetB7, EfficientNetV2S, EfficientNetV2M, and EfficientNetV2L, were assessed after each was integrated separately as a feature extractor in our proposed model. On evaluating the proposed models on the suggested dataset, the NASNetLarge model achieved the highest accuracy of 98.50% when used for feature extraction.
Future plans include increasing the number of Holy Quran reciters and improving identification accuracy by adopting newer extraction methods and classification algorithms. Since the Holy Quran can be recited in ten different styles, determining the recitation style will also pose a challenge in the future.

Acknowledgements

I am deeply grateful to my supervisors for their guidance and support. To my family, thank you for your unwavering love and encouragement.

Declarations

Conflict of interest

The authors declare that there is neither funding nor conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Literature
3. Khan AU, Bhaiya LP, Banchhor SK (2012) Hindi speaking person identification using zero crossing rate. Int J Soft Comput Eng 2(3):101–104
6. Ghahremani P, BabaAli B, Povey D, Riedhammer K, Trmal J, Khudanpur S (2014) A pitch extraction algorithm tuned for automatic speech recognition. In: 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 2494–2498. IEEE. https://doi.org/10.1109/ICASSP.2014.6854049
15. Al-Ayyoub M, Damer NA, Hmeidi I (2018) Using deep learning for automatically determining correct application of basic quranic recitation rules. Int Arab J Inf Technol 15(3A):620–625
16. Bradbury J (2000) Linear predictive coding. McGraw-Hill
19. Lee H, Grosse R, Ranganath R, Ng AY (2009) Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In: Proceedings of the 26th annual international conference on machine learning, pp 609–616. https://doi.org/10.1145/1553374.1553453
22. Marlina L, Wardoyo C, Sanjaya WM, Anggraeni D, Dewi SF, Roziqin A, Maryanti S (2018) Makhraj recognition of Hijaiyah letter for children based on mel-frequency cepstrum coefficients (MFCC) and support vector machines (SVM) method. In: 2018 International conference on information and communications technology (ICOIACT), pp 935–940. IEEE. https://doi.org/10.1109/ICOIACT.2018.8350684
25. Anazi M, Shahin OR (2022) A machine learning model for the identification of the holy quran reciter utilizing k-nearest neighbor and artificial neural networks. Inf Sci Lett 11(4):1093–1102
26. Nahar KM, Al-Shannaq M, Manasrah A et al (2019) A holy quran reader/reciter identification system using support vector machine. Int J Mach Learn Comput 9(4):458–464
29. Chapaneri SV (2012) Spoken digits recognition using weighted MFCC and improved features for dynamic time warping. Int J Comput Appl 40(3):6–12
34. Tan M, Le Q (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In: International conference on machine learning, PMLR, pp 6105–6114
35. Tan M, Le Q (2021) EfficientNetV2: smaller models and faster training. In: International conference on machine learning, PMLR, pp 10096–10106
41. Henderson P, Ferrari V (2017) End-to-end training of object class detectors for mean average precision. In: Computer vision – ACCV 2016: 13th Asian conference on computer vision, Taipei, Taiwan, November 20–24, 2016, Revised Selected Papers, Part V, pp 198–213. Springer International Publishing. https://doi.org/10.48550/arXiv.1607.03476
45. Briggs WL, Henson VE (1995) The DFT: an owner's manual for the discrete Fourier transform. Soc Ind Appl Math
46. Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International conference on machine learning, PMLR, pp 448–456
48. Dietterich T (1995) Overfitting and undercomputing in machine learning. ACM Comput Surv 27(3):326–327
49. Sharma S, Sharma S, Athaiya A (2017) Activation functions in neural networks. Towards Data Sci 6(12):310–316
Metadata
Title
Quran reciter identification using NASNetLarge
Authors
Hebat-Allah Saber
Ahmed Younes
Mohamed Osman
Islam Elkabani
Publication date
05-02-2024
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 12/2024
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-023-09392-1
