
Open Access 19.11.2021 | Original Article

Static–dynamic features and hybrid deep learning models based spoof detection system for ASV

Authors: Aakshi Mittal, Mohit Dua

Published in: Complex & Intelligent Systems | Issue 2/2022


Abstract

Spoof detection is essential for improving the performance of current Automatic Speaker Verification (ASV) systems. Strengthening both the frontend and the backend can build robust ASV systems. First, this paper discusses a performance comparison of static and static–dynamic Constant Q Cepstral Coefficients (CQCC) frontend features, using a Long Short Term Memory (LSTM) with Time Distributed Wrappers model at the backend. Second, it performs a comparative analysis of ASV systems built with three deep learning models at the backend, LSTM with Time Distributed Wrappers, LSTM and Convolutional Neural Network (CNN), all using static–dynamic CQCC features at the frontend. Third, it discusses the implementation of two spoof detection systems for ASV that use the same static–dynamic CQCC features at the frontend and different combinations of deep learning models at the backend. The first is a voting protocol based two-level spoof detection system that uses the CNN and LSTM models at the first level and the LSTM with Time Distributed Wrappers model at the second level. The second is a two-level spoof detection system with a user identification and verification protocol, which uses the LSTM model for user identification at the first level and the LSTM with Time Distributed Wrappers model for verification at the second level. For implementing the proposed work, a variation of the ASVspoof 2019 dataset has been used to bring all types of spoofing attacks, that is Speech Synthesis (SS), Voice Conversion (VC) and replay, into a single dataset. The results show that, at the frontend, static–dynamic CQCC features outperform static CQCC features and, at the backend, hybrid combinations of deep learning models increase the accuracy of spoof detection systems.

Introduction

Building a robust spoof detection system for Automatic Speaker Verification (ASV) is now an essential task, as attention and demand for voice protected authentication systems are increasing among users of smart devices. According to a survey, users are eagerly looking forward to using speech driven authentication systems [1]. An ASV system verifies whether the input speech signal is actually spoken by the authentic user or generated through tricks by an imposter to gain access to the legitimate user's account. With the availability of low cost voice sensors, and advanced research in mathematical and logical techniques for generating synthetic speech, the number of spoofing attack types is also increasing. Speech Synthesis (SS), Voice Conversion (VC), replay, mimicry and twins attacks are highly potent spoofing attacks on these types of systems. An SS attacked utterance is generated by a text to speech technique [2]. VC speech signals are generated by converting the imposter's voice into the legitimate user's voice with the help of transformation functions [3–5]. Replay attacks are one of the easiest forms of attack, in which the spoofed speech is a recorded voice signal of the targeted user [6]. For mimicking the legitimate user's voice, a professional manipulates his/her speech features. The twins attack is also a kind of mimicry attack [7, 8]: in some cases, twin siblings are able to get access to each other's voice locked accounts [5, 9]. SS and VC attacks can be injected into the system via the transmission channel; hence, these attacks are named Logical Access (LA) attacks [9]. Replay, mimicry and twins attacks are inserted into the system through the microphone; hence, these attacks are known as Physical Access (PA) attacks. The performance of ASV systems is greatly degraded in the presence of these spoofing attacks [10]. Various speech corpora enriched with different kinds of spoofing attacks have been proposed. For instance, the ASVspoof 2015 dataset includes SS and VC attacks [11], the ASVspoof 2017 dataset includes only replay attacks [12], the YOHO dataset includes mimicry attacks [13], etc. The recently proposed ASVspoof 2019 dataset includes SS, VC and replay attacks, however, in two separate sets. This paper presents an initiative of putting all kinds of attacks into a single dataset.
Along with considering the attacks, robust designs of the frontend and backend of an ASV system can become a preventive shield against spoofing attacks. The frontend of an ASV system uses a speech feature extraction technique to extract useful information from the recorded speech signal. Cepstral-domain features, such as Mel Frequency Cepstrum Coefficients (MFCC), Inverse Mel Frequency Cepstrum Coefficients (IMFCC) [14], Linear Frequency Cepstrum Coefficients (LFCC) and Constant Q Cepstrum Coefficients (CQCC), have performed remarkably well for spoof detection tasks, as well as for speech and speaker recognition tasks, since these techniques can model the human vocal tract and the human auditory system very well [15–17]. The human ear is known to be largely insensitive to the phase of sound; nevertheless, this phase information can be utilized for the frontend of speech driven devices [18, 19] by using the All Pole Group Delay Function (APGDF), the Modified Group Delay Function (MODGDF), etc. Both static and dynamic coefficients of speech features deliver context and speaker specific information, and these coefficients are passed to the backend spoof detection model. CQCC features were specially designed for spoof detection tasks in the ASV systems of [20, 21], where it is claimed that these features perform better than Instantaneous Frequency Cosine Coefficients (IFCC), MFCC and Epoch Features (EF). The work proposed in this paper also exploits a hybrid of static and dynamic CQCC features for developing the frontend, and it presents a performance comparison of static and static–dynamic CQCC features by using a Long Short Term Memory (LSTM) with Time Distributed Wrappers model at the backend.
Various machine learning techniques, such as the Gaussian Mixture Model (GMM), Hidden Markov Model (HMM) [22–24], Convolutional Neural Network (CNN) and Long Short Term Memory (LSTM), play a crucial role in classification tasks, including speech based systems [25, 26]. In an ASV system, the backend classification model takes the speech features as input and classifies the signal as spoofed or bonafide after analyzing the speaker specific information in them. In early research, the GMM was used effectively as the backend model [27]. As deep learning algorithms improve day by day, the ASV community has started to use CNN and LSTM models [28–30]. In various speech and speaker recognition tasks, LSTM-based deep learning models perform better than other models, although CNN models also give satisfactory results [31–33]. Furthermore, different arrangements of frontend and backend models can bring smoothness and accuracy to the spoof detection task.
The rest of the paper is organized as follows: the second section discusses the related work; the third section discusses the proposed method; the experimental setup details and results are presented in the fourth section; the fifth section explains the performance analysis of the proposed models and systems; the sixth section compares the proposed systems with existing systems; and the seventh section concludes the proposal and sheds some light on future directions.
Related work

This section discusses the related work in this area. The literature is rich in experiments with various audio feature extraction techniques at the frontend and different classification models at the backend. Valenti et al. [34] discuss an approach in which the end to end speech signal is passed to an evolving Recurrent Neural Network (RNN). The system used in their work is designed with an RNN and neuroevolution of augmenting topologies, and considers the replay attack in particular.
The review by Kamble et al. [35] presents a wide analysis of many existing ASV spoof detection systems from the perspective of the ASVspoof challenges. Lai et al. [36] proposed an Attentive Filtering Network and ResNet classifier based system to detect replay attacks, where the attention-based filtering approach is used to improve feature representations. Their work used the ASVspoof 2017 Version 2.0 dataset to attain a very low Equal Error Rate (EER), and the authors claimed an improvement of about 30% over the existing ASVspoof 2017 enhanced baseline system.
The ASVspoof 2019 challenge puts three different types of attacks in one dataset and presents baseline models with LFCC and CQCC features at the frontend and GMM at the backend [27]. Chettri et al. [10] trained various deep learning backend models and tested them with different feature extraction approaches at the frontend. These backend models were further combined to obtain three ensemble models, and all the systems were tested for physical access and logical access attacks.
Recently, Dua et al. [30] also proposed an ensemble approach using LSTM based deep learning models at the backend and three different feature extraction techniques, Constant Q Cepstral Coefficients (CQCC), Inverse Mel Frequency Cepstral Coefficients (IMFCC) and MFCC, at the frontend. The authors claimed that their proposed ensemble model with CQCC features outperforms several existing ASV systems.
Motivated by these works, the proposed work in this paper compares the performance of different deep learning models at the backend by using them with static–dynamic CQCC features at the frontend. The implemented work also uses a combination of LSTM and CNN models for development of the backend. In addition, two two-level spoof detection systems for ASV that use static–dynamic features at the frontend are implemented. The first system is a voting protocol based implementation that uses the CNN and LSTM models at the first level and the LSTM with Time Distributed Wrappers model at the second level. The second system uses the LSTM model for user identification at the first level and the LSTM with Time Distributed Wrappers model for verification at the second level. These systems can bring new insights into the development of spoof detection methods for ASV.

Proposed method

This section discusses the architecture of the proposed ASV system. Figure 1a shows the frontend and backend arrangement that has been used for the comparison of static CQCC and static–dynamic CQCC features in the implemented ASV system. Speech signals taken from the dataset are applied to the frontend, where static CQCC features are extracted with the general extraction process and static–dynamic hybrid features are extracted with the proposed methodology. These features, along with the labels from the dataset, are then applied to the backend model that runs the classification, and the classification results are used for the feature comparison. Figure 1b shows the frontend and backend arrangement that has been used for the comparison of various deep learning models while keeping static–dynamic CQCC features at the frontend. The frontend used in this arrangement is the best performing feature extraction technique from the feature comparison. Speech signals and labels are part of the same dataset in the whole arrangement. The backend here contains all the proposed models for spoof detection and a single model for the speaker identification task; all chosen backend models are trained and their performances are analyzed.

The systems of Fig. 2 are arrangements of the models from Fig. 1. Figure 2a shows the block diagram of the voting protocol based two-level spoof detection system. This system classifies the speech signal according to the voting protocol, implemented across level 1 and level 2: level 1 analyzes the input, which is further analyzed at level 2 as per the protocol to declare the decision. Figure 2b gives the block diagram of the two-level user identification and verification system. This two stage arrangement uses a speaker identification model at stage 1, the result of which is passed to stage 2. Stage 2 uses the user identification and verification protocol along with the chosen backend model to declare the classification result.
The pointwise contributions of the proposed work are given below, and the following subsections discuss each component in detail.
  • This paper promotes the development of a single countermeasure that is robust against every kind of spoofing attack. Therefore, an initiative of modifying the used dataset is taken: the AllSpoofsASV dataset (Fig. 1) is a generated variation of the standard dataset.
  • Selection of suitable features for the frontend is essential. This work tests whether static CQCC or a combination of static and dynamic CQCC speech features performs better at the frontend, where both feature sets use the LSTM with time distributed wrappers model at the backend.
  • Different deep learning models, LSTM, LSTM with time distributed wrappers and CNN based systems, are implemented with static–dynamic CQCC features to measure their performance individually.
  • One voting protocol based implementation that uses the CNN and LSTM models at the first level and the LSTM with Time Distributed Wrappers model at the second level is done. Another implementation, which uses the LSTM model for user identification at the first stage and the LSTM with Time Distributed Wrappers model for verification at the second stage, is also performed.

AllSpoofsASV dataset

A generated variant of the ASVspoof 2019 dataset is used for building the proposed ASV systems. The ASVspoof 2019 dataset is provided by the ASVspoof challenge community [37]. This dataset is designed to tackle SS, VC and replay attacks in ASV systems: the LA set includes SS and VC spoofed utterances, and the PA set includes replay attacked utterances [27]. All the audios are recorded in the English language and are 2–8 s in length; however, most audios in both sets are between 4 and 6 s long. The proposed system makes use of both the LA and PA sets by mixing them into a single set, the AllSpoofsASV dataset. Mixing the sets makes it possible to develop spoof detection systems in one run for all kinds of spoofing attacks included in the dataset. Table 1 shows the number of bonafide, SS spoofed, VC spoofed and replay spoofed utterances in the training, development and evaluation sets of the AllSpoofsASV dataset.
Table 1 AllSpoofsASV dataset

Sets          Bonafide   SS & VC spoofed   Replay spoofed
Training      7980       22,800            48,600
Development   7948       22,296            24,300
Evaluation    25,445     63,822            116,640

Feature extraction using CQCC features

Constant Q Cepstral Coefficients (CQCC) feature extraction is used for extracting useful information from the recorded speech signal during both the training and testing phases of an ASV system. In recent years, this technique has proved to be among the most promising for the development of robust and accurate ASV systems [20, 21]. The mathematical representation of the CQCC feature extraction approach is:
$$C_{\text{CQF}}(e) = \text{CQT}\big(p(n)\big)$$
(1)
$$C_{\text{CQCC}}(j) = \sum_{e=1}^{E} \log \left| C_{\text{CQF}}(e) \right|^{2} \cos\left\{ \frac{j\,(e-0.5)\,\pi}{E} \right\}$$
(2)
Here, Eq. (1) computes the Constant Q Transform (CQT) of the input speech signal \(p(n)\) as \({C}_{\mathrm{CQF}}\left(e\right)\), and Eq. (2) computes \(j\) CQCC coefficients \({C}_{\mathrm{CQCC}}\left(j\right)\), where \(E\) is the number of linearly spaced bins and \(e\) indexes those bins. The CQCC feature extraction process applies the Constant Q Transform (CQT) and then takes the log of the power spectrum [38]. Before calculating the Discrete Cosine Transform (DCT), it applies resampling [39, 40]. Finally, it sets the number of feature coefficients and returns the CQCC features.
The proposed system uses the find_CQCC_features () function for implementing CQCC feature extraction; a code sketch of this pipeline is given after the helper-function list below. This function applies the actual CQCC feature extraction process to a speech signal: it takes an audio file as input and returns a matrix of 90 × m_frames values with the 30 static, 30 delta (D) and 30 delta-delta (DD) features for m_frames audio frames, where m_frames depends on the length of the input audio. Firstly, it sets the initial values for the number of bins per octave b, the maximum frequency Nmax, the minimum frequency Nmin, the number of desired coefficients of any type n_coeff and the feature type f_type, which can be static (S), delta (D) or delta-delta (DD). Secondly, it calls the find_cqcc () function, which takes all these initialized values as input and outputs static, delta or delta-delta features. The algorithm in this function starts with the calculation of the gamma value, one of the parameters of the CQT application process. It then calculates the log power spectrum of the CQT output, which is resampled before the DCT is calculated. The functions performing these operations are discussed further in this section, with a description of their inputs, operations and outputs. The algorithm then keeps only the desired number of features, returning the static, delta or delta-delta coefficients according to the value of f_type. Finally, find_CQCC_features () combines all types of coefficients into one matrix and determines the number of frames. This function ensures a minimum of 400 frames in the output: if the number of frames is less than 400, zero padding is applied, and the final matrix is the desired CQCC feature matrix.
This whole process uses some inbuilt functions from different libraries of Python and MATLAB [41, 42]. In the proposed work, these functions are named according to their functionality and are described below. Function 1, given in the Appendix, gives the pseudo code for find_CQCC_features (), which calls find_cqcc () to compute CQCC features.
  • audioread (): This function takes an audio file (audio_file) as input and returns its time series y and sampling rate Ns. The number of values in the time series y depends on the length of the audio file, which in turn determines the number of frames.
  • zscore (): This function calculates the row wise z-score for each value of the input matrix. As the values coming out of the find_cqcc () function lie in a continuous range from small to large values, applying this function normalizes them. The general formula for the z-score is given by Eq. (3).
    $$z \mathrm{score}= \frac{\left(x-\mu \right)}{\sigma }$$
    (3)
Here, \(x\) is the element value to be normalized, \(\mu \) is the mean of the values of the entire row and \(\sigma \) is the standard deviation of those values.
  • length (): This function takes a matrix as input and outputs the number of columns in it.
  • zero_padding (): This function pads a given matrix with extra columns of zero values until the desired number of columns is reached.
  • cqt (): This function applies the Constant Q Transform (CQT) to the representative values of a speech signal. The CQT transforms the time domain signal into the frequency domain while maintaining a constant Q factor across the signal. gamma_value is a parameter of this function that is calculated using Eq. (4) from the number of bins b per octave in the speech signal.
    $$\mathrm{gamma}\_\mathrm{value}=228.7\left({2}^{\frac{1}{b}}-{2}^{-\frac{1}{b}}\right)$$
    (4)
  • log (): This function applies the logarithm operation to the input values. The logarithm is calculated for the squared spectrum that is output by the cqt () function.
  • resample (): This function converts the geometrically spaced bins provided by the CQT into linearly spaced bins, making the signal compatible with the Discrete Cosine Transform (DCT).
  • dct (): This function applies the DCT internally. The DCT is helpful in signal compression tasks and, here, converts the resampled log power spectrum into decorrelated cepstral coefficients.
  • cut (): This function cuts a matrix down to the desired number of rows.
  • delta (): This function calculates the temporal derivative of the applied feature values.
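As mentioned above, the following is a minimal Python sketch of this pipeline. The paper implements the frontend in Octave/MATLAB, so this librosa/SciPy version is an assumption-laden illustration of the described steps (CQT, log power spectrum, resampling, DCT, delta computation and zero padding) rather than the authors' exact implementation; the parameter defaults (b = 96 bins per octave, 7 octaves) are illustrative choices, and librosa manages the filter bandwidths internally, so the explicit gamma_value of Eq. (4) is not passed.

```python
import numpy as np
import librosa
from scipy.fftpack import dct
from scipy.signal import resample

def find_cqcc(y, ns, b=96, n_octaves=7, n_coeff=30):
    # cqt(): constant Q transform of the time series.
    spec = librosa.cqt(y, sr=ns, bins_per_octave=b, n_bins=b * n_octaves)
    # log(): log of the powered spectrum (small constant avoids log(0)).
    log_power = np.log(np.abs(spec) ** 2 + 1e-10)
    # resample(): uniform resampling of the log spectrum before the DCT,
    # a simplification of the geometric-to-linear bin conversion.
    lin_spec = resample(log_power, 2 * log_power.shape[0], axis=0)
    # dct() and cut(): keep only the desired number of coefficients.
    return dct(lin_spec, axis=0, norm='ortho')[:n_coeff, :]

def find_CQCC_features(audio_file, min_frames=400):
    y, ns = librosa.load(audio_file, sr=None)       # audioread() analogue
    static = find_cqcc(y, ns)                       # 30 static coefficients
    d = librosa.feature.delta(static)               # delta()
    dd = librosa.feature.delta(static, order=2)     # delta-delta
    feats = np.vstack([static, d, dd])              # 90 x m_frames matrix
    if feats.shape[1] < min_frames:                 # zero_padding() to 400 frames
        feats = np.pad(feats, ((0, 0), (0, min_frames - feats.shape[1])))
    return feats
```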

Backend classification using deep learning models

This section gives brief details of the deep learning models that are used at the backend of the different architectures proposed in this paper.

Long short term memory (LSTM) with time distributed wrappers (M1)

The proposed Long Short Term Memory (LSTM) network with time distributed wrappers, shown in Fig. 3, comprises three time distributed dense layers, each with the ReLU activation function. Time distribution wrapped layers are especially suitable for time varying data frames such as audio and video. The proposed LSTM model (M1) has 32, 16 and 10 units in its time distributed dense layers, in this order; the numbers of units are given to provide readers finer grained knowledge of the model structure, and the motivation for these numbers of neurons is taken from related work [30, 31]. After that, a 15% dropout is applied to disable the effect of some randomly selected neurons; the addition of a dropout layer prevents the model from overfitting. In the M1 model, this operation is followed by three LSTM layers with 10, 20 and 30 units, in this order. These layers are again followed by a 10% dropout, and the result of the dropout is passed to a dense layer with a sigmoid activation function.
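A hedged Keras sketch of this architecture is given below. It follows the layer sequence just described; the input orientation (400 frames of 90 CQCC coefficients) is an assumption consistent with the frontend description rather than a confirmed implementation detail.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM, TimeDistributed

def build_m1(n_frames=400, n_feats=90):
    model = Sequential([
        TimeDistributed(Dense(32, activation='relu'),
                        input_shape=(n_frames, n_feats)),
        TimeDistributed(Dense(16, activation='relu')),
        TimeDistributed(Dense(10, activation='relu')),
        Dropout(0.15),                      # 15% dropout against overfitting
        LSTM(10, return_sequences=True),    # three LSTM layers: 10, 20, 30 units
        LSTM(20, return_sequences=True),
        LSTM(30),
        Dropout(0.10),                      # 10% dropout
        Dense(1, activation='sigmoid'),     # bonafide-vs-spoofed score
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model
```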

Long short term memory (LSTM) (M2 & M4)

The proposed Long Short Term Memory (LSTM) network, shown in Fig. 4, takes its input on the first LSTM layer, which is followed by two more LSTM layers. These layers have 10, 20 and 30 LSTM units, in this order, chosen as per the results shown in [30, 31]. The output of these layers is passed, after a 10% dropout, to a dense layer of 24 units. The output of this dense layer is in turn passed, after another 10% dropout, to the last layer, a dense layer with a sigmoid activation function.
An LSTM model (M4) with a similar architecture has 20, 30 and 400 units, in this order, in its first three LSTM layers (Fig. 4); all the dropout and dense layers have the same specifications.
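Hedged sketches of M2 and M4 are shown below under the same input assumption as M1. The activation of the 24-unit dense layer is not stated in the paper (linear is assumed), and M4's softmax output over the n registered users is an assumption implied by its categorical cross entropy training described later.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM

def build_m2(n_frames=400, n_feats=90):
    # M2: LSTM layers of 10, 20, 30 units, then dropout, Dense(24), dropout, sigmoid.
    model = Sequential([
        LSTM(10, return_sequences=True, input_shape=(n_frames, n_feats)),
        LSTM(20, return_sequences=True),
        LSTM(30),
        Dropout(0.10),
        Dense(24),
        Dropout(0.10),
        Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model

def build_m4(n_users=8, n_frames=400, n_feats=90):
    # M4: same layout with 20, 30, 400 LSTM units and a multi-class output.
    model = Sequential([
        LSTM(20, return_sequences=True, input_shape=(n_frames, n_feats)),
        LSTM(30, return_sequences=True),
        LSTM(400),
        Dropout(0.10),
        Dense(24),
        Dropout(0.10),
        Dense(n_users, activation='softmax'),   # assumed softmax over users
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    return model
```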

Two-dimensional convolutional neural network (2D CNN) (M3)

As shown in Fig. 5, the first Two-Dimensional Convolutional layer (Conv2D) of the proposed Two-Dimensional Convolutional Neural Network (2D CNN) (M3) comprises 24 filters of 3 × 3 kernel size along with the ReLU activation function. After that, a batch normalization layer is added, which is itself followed by three blocks of Conv2D and Two-Dimensional (2D) max pooling layers. The Conv2D layers of these blocks have 16 filters of 5 × 5 kernel size, and the 2D max pooling layers have a 2 × 2 pool size. These blocks are followed by a flatten layer and then a dense layer of 10 units. After that, a 10% dropout is applied to avoid overfitting of the model. The last layer of this 2D CNN model is a dense layer with a sigmoid activation function.
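The following hedged sketch mirrors this description; treating the 90 × 400 feature matrix as a single-channel "image" and using ReLU inside the block convolutions are assumptions not stated in the paper.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (BatchNormalization, Conv2D, Dense,
                                     Dropout, Flatten, MaxPooling2D)

def build_m3(n_feats=90, n_frames=400):
    model = Sequential([
        Conv2D(24, (3, 3), activation='relu',
               input_shape=(n_feats, n_frames, 1)),   # feature "image", 1 channel
        BatchNormalization(),
        Conv2D(16, (5, 5), activation='relu'),        # block 1 (ReLU assumed)
        MaxPooling2D((2, 2)),
        Conv2D(16, (5, 5), activation='relu'),        # block 2
        MaxPooling2D((2, 2)),
        Conv2D(16, (5, 5), activation='relu'),        # block 3
        MaxPooling2D((2, 2)),
        Flatten(),
        Dense(10),
        Dropout(0.10),
        Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model
```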

Spoof detection systems

This section discusses the two spoof detection systems (System_1 and System_2) that are developed for the implementation of the proposed ASV system. Both System_1 and System_2 use the static–dynamic hybrid combination of CQCC features at the frontend and different arrangements of the M1, M2, M3 and M4 models at the backend.

Voting protocol based two-level ASV system (System_1)

The two-level ASV system with a voting protocol, i.e. System_1, focuses on the spoof detection task. It accepts the input speech signal if it is bonafide, and rejects it if it is spoofed by any of the SS, VC and replay attacks. Models M1, M2 and M3 provide the corresponding label, bonafide or spoofed, as output. Figure 6 shows the proposed System_1, which has models M2 and M3 at the first level and M1 at the second level, where F is treated as a global variable.
The purpose of putting models M2 and M3 at level one is that both of these models are equally good when evaluated for Equal Error Rate (EER), which adds fairness to the classification result of this level. M1 is the most powerful model; hence, it is put at the second level. Firstly, each input audio file is applied to the models M2 and M3. Then, the voting protocol is applied to their decisions. A find_binary () function maps these decisions to Boolean values, i.e. FALSE for a spoofed decision (due to any of the SS, VC and replay attacks) and TRUE for a bonafide decision made by the model. The voting protocol compares the outputs of the find_binary () function for both first level models. If the outputs from both models are the same, this is returned as the final classification result of the system. Otherwise, the audio file is tested on the model M1 at the second level, and its classification result, after being passed to the find_binary () function, is returned. In the end, the proposed system returns TRUE or FALSE for the input speech being bonafide or spoofed, respectively. Function 2, added in the Appendix, gives the pseudo code for the implemented voting protocol that uses the find_binary () function.
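A compact sketch of this protocol is shown below; Function 2 in the Appendix remains the authoritative pseudo code. It assumes each model exposes a predict () call returning a scalar bonafide score between zero and one (as produced by the sigmoid output layers) and an assumed decision threshold of 0.5.

```python
def find_binary(score, threshold=0.5):
    # TRUE = bonafide, FALSE = spoofed (threshold value is an assumption).
    return bool(score >= threshold)

def system_1(features, m1, m2, m3):
    f2 = find_binary(m2.predict(features))     # level 1, model M2
    f3 = find_binary(m3.predict(features))     # level 1, model M3
    if f2 == f3:                               # both level 1 models agree
        return f2
    return find_binary(m1.predict(features))   # level 2: M1 resolves the tie
```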

Two-level ASV system with user identification and verification (System_2)

System_2, shown in Fig. 7, also executes its process in two stages/levels. In the first stage, it identifies the user id for the applied speech signal. Then, in the second stage, the user's voice signal is verified as bonafide or spoofed. The system uses the user identification and verification protocol to accomplish this task, where F and I are treated as global variables.
As a result, the system identifies the validity of the claimer along with the genuineness of the applied speech signal. Firstly, the input audio signal is applied to the model M4 of the first stage. Model M4 predicts the identity of the user (Ui) out of the n already registered users. This predicted identity is supplied to stage 2, where the user identification and verification protocol is applied. At this stage there are n instances {(M1U1), (M1U2), ……, (M1Un)} of model M1, trained for the n users {U1, U2,….,Un}. Model M1 checks whether the speech signal is bonafide or spoofed at this stage, and the decision is mapped to an integer value in the variable A. The set_ternary () function maps to the integer value THREE if Ui and the claimed identity I are not the same, to ONE if the decision is bonafide and Ui and I are the same, and to TWO if the decision is spoofed. At the output, if A is ONE then the user is valid and the speech is bonafide; if A is TWO then the user is invalid and the speech is spoofed; and if A is THREE then the user is invalid. Function 3, appended in the Appendix, gives the pseudo code for the implementation of System_2.
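The sketch below illustrates this protocol under the same assumptions as before (a scalar bonafide score from each M1 instance, an assumed 0.5 threshold, and M4 returning class probabilities over the n registered users); Function 3 in the Appendix is the authoritative pseudo code.

```python
import numpy as np

def system_2(features, claimed_id, m4, m1_instances, threshold=0.5):
    # Stage 1: identify the speaker Ui among the n registered users.
    ui = int(np.argmax(m4.predict(features)))
    # Stage 2: user identification and verification protocol (set_ternary()).
    if ui != claimed_id:
        return 3                                 # A = THREE: invalid user
    score = m1_instances[ui].predict(features)   # user-specific M1 instance
    return 1 if score >= threshold else 2        # A = ONE: bonafide; TWO: spoofed
```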

Experimental setups

This section deals with the experimental details for the implementation of the proposed ASV system. The frontend feature extraction is implemented using Octave on the Linux operating system. The training, development and evaluation of the backend models are done with the Anaconda platform on the Windows operating system. All the used audios and labels are taken from the training, development and evaluation sets of the AllSpoofsASV dataset. While training the deep learning models, Python's inbuilt facilities for weight updates, i.e. the backpropagation algorithm and loss functions, are used. For the two class classification problems, binary cross entropy is used as the loss function; it produces a probability or score for an utterance between zero and one. The categorical cross entropy loss function is used for the multi class classification of user identities (specifically in the training of M4).
A learning rate is required for the iterative update of weights during the training process. In the proposed work, the ADAM (Adaptive Moment Estimation) optimizer is used to achieve an adaptive learning rate [43, 44]. It combines the advantages of the Adaptive Gradient Algorithm (AdaGrad) and Root Mean Square Propagation (RMSProp): AdaGrad defines a per parameter learning rate that improves performance on sparse gradients, whereas RMSProp uses an average of recent gradient values of the weights. The ADAM algorithm passes both the gradient and the squared gradient to an exponential moving average function. For heavy models and large datasets, it can solve practical problems efficiently [43–45]. The system arrangements for the different comparisons and analyses are discussed later in this section.
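For reference, the standard ADAM update of [44] can be written as follows, where \(g_t\) is the gradient at step \(t\), \(\beta_1\) and \(\beta_2\) are the decay rates of the two moving averages, \(\eta\) is the step size and \(\epsilon\) avoids division by zero:
$$m_{t} = \beta_{1} m_{t-1} + (1-\beta_{1})\,g_{t}, \qquad v_{t} = \beta_{2} v_{t-1} + (1-\beta_{2})\,g_{t}^{2}$$
$$\hat{m}_{t} = \frac{m_{t}}{1-\beta_{1}^{t}}, \qquad \hat{v}_{t} = \frac{v_{t}}{1-\beta_{2}^{t}}, \qquad w_{t} = w_{t-1} - \frac{\eta\,\hat{m}_{t}}{\sqrt{\hat{v}_{t}}+\epsilon}$$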
The performance of the proposed architectures and systems is evaluated with the help of two evaluation measures, Equal Error Rate (EER) and percentage accuracy. The spoof detection systems are evaluated using EER, and the user identification system is evaluated by percentage accuracy. EER is the value at which the False Acceptance Rate (FAR) and False Rejection Rate (FRR) are equal [27, 28], where FAR is the ratio of the number of spoofed utterances having a score greater than or equal to the threshold Ψ to the total number of spoofed utterances, and FRR is the ratio of the number of bonafide utterances having a score less than the threshold Ψ to the total number of bonafide utterances. The mathematical representations of FAR and FRR are given by Eqs. (5) and (6), respectively. EER calculates the FAR and FRR as functions of the threshold Ψ and is declared at the threshold for which these two values are equal.
$$\mathrm{FAR}=\frac{\text{Total count of spoofed utterances with score} \ge \Psi}{\text{Total count of spoofed utterances}}$$
(5)
$$\mathrm{FRR}=\frac{\text{Total count of bonafide utterances with score} < \Psi}{\text{Total count of bonafide utterances}}$$
(6)
Percentage accuracy is calculated from the number of correct predictions and the total number of input samples to be checked. The mathematical formula for percentage accuracy is given by Eq. (7).
$$\mathrm{Percentage\,Accuracy}=\frac{\mathrm{Count}\left(\mathrm{correct\,predictions}\right)}{\mathrm{Count}\left(\mathrm{input\,samples}\right)}\times 100$$
(7)
In this case, the number of correctly predicted user samples is divided by the total number of user input samples and multiplied by 100.
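As an illustration of Eqs. (5)–(7), the following hedged NumPy sketch sweeps the threshold Ψ over the observed scores and reports the EER at the point where FAR and FRR are closest; the exact tooling used for evaluation is not specified in the paper.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    # Candidate thresholds: every observed score.
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])    # Eq. (5)
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])  # Eq. (6)
    i = np.argmin(np.abs(far - frr))          # threshold where FAR ~ FRR
    return (far[i] + frr[i]) / 2

def percentage_accuracy(predictions, labels):
    return (predictions == labels).mean() * 100                         # Eq. (7)
```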

Frontend features extraction

For the spoof detection task, model M1 is first trained with only the 30 static CQCC features, calculated by making some modifications to the find_CQCC_features () function: the mean over the m_frames frames is taken for each of the 30 coefficients, so a vector of 1 × 30 dimensions is extracted in the static case, and model M1 is trained for up to five epochs with a batch size of 512. Secondly, model M1 is trained with the static–dynamic hybrid CQCC features calculated by find_CQCC_features (); all 30 static, 30 delta and 30 delta-delta CQCC features for all m_frames frames (without taking the mean) are used in this arrangement, so a matrix of 90 × m_frames dimensions is extracted for each audio. To keep the comparison fair, this arrangement has also been trained for up to five epochs with a batch size of 512.
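In terms of the find_CQCC_features () sketch given earlier, the two arrangements differ only in how the returned matrix is reduced; the snippet below is an assumed illustration of that difference (the file name is hypothetical).

```python
feats = find_CQCC_features("some_utterance.flac")   # 90 x max(m_frames, 400)
static_vector = feats[:30, :].mean(axis=1)          # 1 x 30 mean static features
hybrid_matrix = feats                               # full static-dynamic matrix
```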
The Equal Error Rate (EER) for both arrangements is computed to compare the performance of the feature sets. The comparative analysis on the development and evaluation data with both feature sets is shown in Table 2.
Table 2 Comparative analysis of different CQCC features

Features              Development set (EER)                                   Evaluation set (EER)
                      (D1)    (D2)    (D3)    (D4)    (D5)    Average (mean ± sd)
Static CQCC           0.114   0.113   0.112   0.112   0.111   0.112 ± 0.001    0.136
Static–Dynamic CQCC   0.017   0.018   0.019   0.018   0.018   0.018 ± 0.0006   0.032

Values in bold show the final and best performing results

Backend deep learning models with System_1

The proposed work compares the performance of all the backend deep learning models M1, M2 and M3, implemented individually, with the voting protocol based System_1, using static–dynamic CQCC features at the frontend and the AllSpoofsASV dataset. Model M1 is trained with a batch size of 512 for up to five epochs, model M2 is trained with a batch size of 512 for 20 epochs, and model M3 is trained with a batch size of 500 for 15 epochs. For the training of all three models, a patience of two is used as the early stopping criterion, the binary cross entropy loss function is used to measure the loss, and the ADAM optimizer is used for optimization in both systems [43, 44].
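A hedged sketch of such a training call for M1 is shown below; x_train, y_train, x_dev and y_dev are assumed arrays of feature matrices and binary labels, and build_m1 refers to the earlier model sketch.

```python
from tensorflow.keras.callbacks import EarlyStopping

m1 = build_m1()
m1.fit(x_train, y_train,
       batch_size=512, epochs=5,                  # stated M1 hyperparameters
       validation_data=(x_dev, y_dev),
       callbacks=[EarlyStopping(patience=2)])     # early stopping, patience 2
```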
As described earlier, the trained models M2 and M3 are used at level 1 and M1 is used at level 2 for the development of the voting protocol based spoof detection system System_1. The performance analysis of M1, M2, M3 and System_1 is done using the EER parameter. Table 3 shows the comparative EER values on the development and evaluation sets for all three backend models and System_1.
Table 3 Comparison of backend spoof detection models

Model   Development set (EER)                                   Evaluation set (EER)   System_1 (EER)
        (D1)    (D2)    (D3)    (D4)    (D5)    Average (mean ± sd)
M2      0.019   0.017   0.017   0.019   0.017   0.017 ± 0.0009   0.043                  0.029
M3      0.019   0.020   0.018   0.019   0.019   0.019 ± 0.0006   0.043
M1      0.019   0.018   0.017   0.017   0.017   0.017 ± 0.0008   0.032

Values in bold show the final and best performing results

Model M4

The user identification model M4 is trained for eight users (n = 8) with a batch size of 512 for up to 80 epochs using the categorical cross entropy loss function. Model M4 is tested using the percentage accuracy parameter; the percentage accuracy of the model on the development and evaluation sets is shown in Table 4.
Table 4 Performance analysis for LSTM (M4)

Model   %Accuracy (development set)                         Evaluation set %Accuracy
        (D1)   (D2)   (D3)   (D4)   (D5)   Average (mean ± sd)
M4      99.4   96.5   97.9   97.8   97.9   97.9 ± 0.91       97.1

System_1 and System_2

System_2 uses the trained model M4 for the user identification task at stage 1, and n instances of model M1 are used at stage 2. However, the training of model M1 in System_2 differs from that in System_1: in System_2, it is trained eight times, separately for each of the eight existing users, using the bonafide and spoofed utterances of each specific user. Firstly, user identification among the eight users is done at stage 1, and then the user identification and verification protocol is invoked for verification at stage 2. The performance of System_1 and System_2 on the spoof detection task is evaluated using the EER parameter for the development and evaluation sets, as shown in Table 5.
Table 5 Performance of proposed systems

System     Equal error rate (EER)
           Development set   Evaluation set
System_1   0.017             0.029
System_2   0.002             0.009

Results

This section presents the performance and comparison results of all the systems discussed in the third section. For obtaining the results, the proposed work uses the procedure adopted by the state-of-the-art works of [10, 15, 26]. As described earlier in the "AllSpoofsASV dataset" section, the dataset used by the proposed system is already divided into training, development and evaluation sets; therefore, it is not required to partition the dataset into ratios of training, development and evaluation samples. For the evaluation of ASV systems, EER is the evaluation protocol applied to the classification results of the spoof detection models [10, 15, 26]. The models in this work have been trained five times with the training set, and the development set is applied to each trained model. Network parameters have been tuned for all the systems to obtain stable parameters. The EER evaluation protocol is applied to the development results and the accuracy of the model is verified; the mean of all five development set results is shown in the presented tables. The evaluation set is applied to the model once it becomes stable after all training passes, and the EER is calculated for the classification result. The protocols of System_1 and System_2 are applied to the evaluation set performances of the models. For the speaker identification task, percentage accuracy is calculated as the evaluation measure on the development set results using a five-fold validation approach, and it is also computed on the evaluation set to check the performance.

Comparison of CQCC features

The models set up for feature comparison are trained five times, and the average, i.e. the mean ± standard deviation (SD), of the results is taken to conclude the EER. It can be observed in Table 2 that the combination of static and dynamic CQCC features performs better than static CQCC features alone. Hence, this combination is used in the development of the further proposed spoof detection systems.

Comparison of used deep learning models with System_1

These models are trained five times, and the EER evaluation measure is calculated on the development set for each training run. Table 3 presents the EER values for the five training and development passes (denoted by the sequence "Di" in Table 3) along with the average of the results, followed by the performance on the evaluation set and that of System_1. The results in Table 3 show that M1 outperforms the other two backend models for spoof detection when implemented individually; however, the voting protocol based System_1 outperforms all three backend models. The voting protocol is applied once the average performances of all the deep learning models have been concluded.

Performance of model M4

The average percentage accuracy of the model M4 is calculated for the evaluation set by averaging five runs, as shown in Table 4. The percentage accuracy, as described earlier, is calculated by Eq. (7) using the correct predictions and the total number of input samples. It can be observed from Table 4 that M4 performs satisfactorily.

Comparative analysis for System_1 and System_2

The performance of System_1 and System_2 on the spoof detection task is evaluated using the EER parameter for both the development and evaluation sets, as shown in Table 5. It can easily be observed from Table 5 that System_2 performs better than System_1. However, System_2 is limited to the private or local domain because it supports only a limited number of users: an increase in the number of users adds more complexity to the development of an ASV system, as a separately trained model M1 is required for each user, which is not practically feasible. Hence, System_1 performs satisfactorily, as it is applicable to the public domain.

Comparison of proposed system with existing systems

This section compares the performance of the proposed systems, System_1 and System_2, with some existing systems from the literature. Chettri et al. [10] designed three ensemble systems (E1, E2 and E3) made up of different classical and deep learning models, where the ensemble system E1 performs the best among them. Cai et al. [15] trained a ResNet deep learning model with CQCC, LFCC, IMFCC, Short Term Fourier Transform (STFT) grams and Group Delay (GD) gram features; however, it is trained only for the replay attack. Kumar et al. [26] trained a Time Delay Shallow Neural Network (TDSNN) with CQCC, IMFCC, Linear Frequency Band Cepstral Coefficients (LFBC) and LFCC features for SS, VC and replay attacks. The ASVspoof 2019 challenge provides a GMM model trained with LFCC and CQCC frontend features for SS, VC and replay attacks [27]. Jung et al. [46] trained a Deep Neural Network model with 7 spectrograms, i-vectors and raw waveforms, only for replay attack detection. Table 6 shows the comparison of these systems with the proposed systems of this paper. Although some systems from the literature appear strong for the detection of one particular attack type, the proposed systems perform well for the detection of all three kinds of spoofing attacks in one run.
Table 6 Comparison of proposed system with existing systems

Works                          Backend         Frontend features                                          SS, VC   Replay   Evaluation set EER
Chettri et al. [10]            Ensemble 1      MFCC, IMFCC, SCM, i-vectors, long term average spectrum   ✔        ✖        0.0264
                               Ensemble 2                                                                ✖        ✔        0.0611
Cai et al. [15]                ResNet Fusion   CQCC, LFCC, IMFCC, STFT gram, GD gram                     ✖        ✔        0.0066
ASVspoof 2019 Challenge [27]   GMM             CQCC                                                      ✔        ✖        0.0043
                               GMM             CQCC                                                      ✖        ✔        0.0987
                               GMM             LFCC                                                      ✔        ✖        0.0271
                               GMM             LFCC                                                      ✖        ✔        0.1196
Kumar et al. [26]              TDSNN           CQCC, IMFCC, LFBC, LFCC                                   ✔        ✖        0.057
                               TDSNN                                                                     ✖        ✔        0.064
Jung et al. [46]               DNN             7 spectrograms, i-vectors, raw waveforms                  ✖        ✔        0.0245
Proposed work                  System_1        Static–Dynamic Hybrid CQCC                                ✔        ✔        0.029
                               System_2                                                                  ✔        ✔        0.009

*✔ indicates that a particular attack is addressed and ✖ indicates that a particular attack is not addressed

Conclusion

Undoubtedly, ASV systems are highly exposed to spoofing attacks; nevertheless, their performance is good enough that industry is attracted to using them in practical applications. The initiative of designing a single dataset can provide new insights into the spoof detection task, and the AllSpoofsASV dataset, a variation of the ASVspoof 2019 dataset, is a small step in this direction. Combining different feature coefficients with hybrid deep learning models can help in the development of robust ASV systems. This paper shows that a combination of static and dynamic CQCC features performs better with LSTM models than static features alone. The comparison of results also shows that the LSTM with Time Distributed Wrappers model (M1) outperforms the LSTM (M2) and CNN (M3) models when evaluated by Equal Error Rate (EER). However, the two-level voting protocol based spoof detection system System_1, which uses M2 and M3 at level 1 and M1 at level 2, performs best of them all. As the LSTM model (M4) provides satisfactory performance, it can be used particularly for speaker identification combined with spoof detection. The two-level spoof detection system with user identification and verification, System_2, which uses M4 at stage 1 and M1 at stage 2, performs better than System_1; however, it is restricted to a small number of users, and using it in the public domain or in an organization with a larger and variable number of speakers would increase the complexity and storage requirements of the system. For future work, more attacks such as twins and mimicry can be added to the dataset, and more possible hybrid combinations of features and deep learning models can be explored. Considering the importance of spoof detection in ASV, more efficient and complex structures such as the VGG family of deep learning models can also be used as a future extension of the proposed work.

Declarations

Conflict of interest

The submitted work does not have any conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

References
1. Beranek B (2013) Voice biometrics: success stories, success factors and what's next. Biometr Technol Today 2013(7):9–11
2. Indumathi A, Chandra E (2012) Survey on speech synthesis. Signal Process Int J (SPIJ) 6(5):140
3. Lim R, Kwan E (2011) Voice conversion application (VOCAL). In: 2011 international conference on uncertainty reasoning and knowledge engineering, vol 1. IEEE, pp 259–262
4. Mohammadi SH, Kain A (2017) An overview of voice conversion systems. Speech Commun 88:65–82
5. Patil HA, Kamble MR (2018) A survey on replay attack detection for automatic speaker verification (ASV) system. In: 2018 Asia-Pacific signal and information processing association annual summit and conference (APSIPA ASC). IEEE, pp 1047–1053
6. Wu Z, Evans N, Kinnunen T, Yamagishi J, Alegre F, Li H (2015) Spoofing and countermeasures for speaker verification: a survey. Speech Commun 66:130–153
7. Hautamäki RG, Kinnunen T, Hautamäki V, Leino T, Laukkanen AM (2013) I-vectors meet imitators: on vulnerability of speaker verification systems against voice mimicry. In: Interspeech, pp 930–934
8. Hautamäki RG, Kinnunen T, Hautamäki V, Laukkanen AM (2014) Comparison of human listeners and speaker verification systems using voice mimicry data. Target 4000:5000
9. Lindberg J, Blomberg M (1999) Vulnerability in speaker verification—a study of technical impostor techniques. In: Sixth European conference on speech communication and technology
10. Chettri B, Stoller D, Morfi V, Ramírez MAM, Benetos E, Sturm BL (2019) Ensemble models for spoofing detection in automatic speaker verification. arXiv preprint arXiv:1904.04589
11. Sahidullah M, Delgado H, Todisco M, Yu H, Kinnunen T, Evans N, Tan ZH (2016) Integrated spoofing countermeasures and automatic speaker verification: an evaluation on ASVspoof 2015
12. Lavrentyeva G, Novoselov S, Malykh E, Kozlov A, Kudashev O, Shchemelinin V (2017) Audio replay attack detection with deep learning frameworks. In: Interspeech, pp 82–86
13. Campbell JP (1995) Testing with the YOHO CD-ROM voice verification corpus. In: 1995 international conference on acoustics, speech, and signal processing, vol 1. IEEE, pp 341–344
14. Chakroborty S, Saha G (2009) Improved text-independent speaker identification using fused MFCC & IMFCC feature sets based on Gaussian filter. Int J Signal Process 5(1):11–19
15. Cai W, Wu H, Cai D, Li M (2019) The DKU replay detection system for the ASVspoof 2019 challenge: on data augmentation, feature representation, classification, and fusion. arXiv preprint arXiv:1907.02663
16. Balamurali BT, Lin KE, Lui S, Chen JM, Herremans D (2019) Toward robust audio spoofing detection: a detailed comparison of traditional and learned features. IEEE Access 7:84229–84241
17. Dua M, Aggarwal RK, Biswas M (2017) Discriminative training using heterogeneous feature vector for Hindi automatic speech recognition system. In: International conference on computer and applications (ICCA), pp 158–162
18. Sahidullah M, Kinnunen T, Hanilçi C (2015) A comparison of features for synthetic speech detection. In: 16th annual conference of the International Speech Communication Association (INTERSPEECH 2015), pp 2087–2091
19. Pal M, Paul D, Saha G (2018) Synthetic speech detection using fundamental frequency variation and spectral features. Comput Speech Lang 48:31–50
20. Todisco M, Delgado H, Evans NW (2016) Articulation rate filtering of CQCC features for automatic speaker verification. In: Interspeech, pp 3628–3632
21. Jelil S, Das RK, Prasanna SM, Sinha R (2017) Spoof detection using source, instantaneous frequency and cepstral features. In: Interspeech, pp 22–26
22. Dua M, Aggarwal R, Kadyan V, Dua S (2012) Punjabi speech to text system for connected words, pp 206–209
23. Dua M, Aggarwal RK, Biswas M (2018) Discriminative training using noise robust integrated features and refined HMM modeling. J Intell Syst 29(1):327–344
24. Dua M, Aggarwal RK, Biswas M (2019) GFCC based discriminatively trained noise robust continuous ASR system for Hindi language. J Ambient Intell Hum Comput 10(2)
25. Dua M, Aggarwal RK, Biswas M (2019) Discriminatively trained continuous Hindi speech recognition system using interpolated recurrent neural network language modeling. Neural Comput Appl 31(10):6747–6755
26. Kumar MG, Kumar SR, Saranya MS, Bharathi B, Murthy HA (2019) Spoof detection using time-delay shallow neural network and feature switching. In: 2019 IEEE automatic speech recognition and understanding workshop (ASRU). IEEE, pp 1011–1017
28. Huang L, Pun CM (2019) Audio replay spoof attack detection using segment-based hybrid feature and DenseNet-LSTM network. In: ICASSP 2019—2019 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 2567–2571
29. Mobiny A, Najarian M (2018) Text-independent speaker verification using long short-term memory networks. arXiv preprint arXiv:1805.00604
30. Dua M, Jain C, Kumar S (2021) LSTM and CNN based ensemble approach for spoof detection task in automatic speaker verification systems. J Ambient Intell Human Comput
31. Mittal A, Dua M (2021) Automatic speaker verification system using three dimensional static and contextual variation-based features with two dimensional convolutional neural network. Int J Swarm Intell
32. Mittal A, Dua M (2021) Constant Q cepstral coefficients and long short-term memory model-based automatic speaker verification system. In: Proceedings of international conference on intelligent computing, information and control systems, pp 895–904
33. Chettri B, Mishra S, Sturm BL, Benetos E (2018) Analysing the predictions of a CNN-based replay spoofing detection system. In: 2018 IEEE spoken language technology workshop (SLT). IEEE, pp 92–97
34. Valenti G, Delgado H, Todisco M, Evans NW, Pilati L (2018) An end-to-end spoofing countermeasure for automatic speaker verification using evolving recurrent neural networks. In: Odyssey, pp 288–295
35. Kamble MR, Sailor HB, Patil HA, Li H (2019) Advances in anti-spoofing: from the perspective of ASVspoof challenges. APSIPA Trans Signal Inf Process 9
36. Lai CI, Abad A, Richmond K, Yamagishi J, Dehak N, King S (2019) Attentive filtering networks for audio replay attack detection. In: ICASSP 2019—2019 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 6316–6320
38. Brown JC, Puckette MS (1992) An efficient algorithm for the calculation of a constant Q transform. J Acoust Soc Am 92(5):2698–2701
39. Brown JC (1991) Calculation of a constant Q spectral transform. J Acoust Soc Am 89(1):425–434
40. Yang J, Das RK, Li H (2018) Extended constant-Q cepstral coefficients for detection of spoofing attacks. In: 2018 Asia-Pacific signal and information processing association annual summit and conference (APSIPA ASC). IEEE, pp 1024–1029
41. Glover JC, Lazzarini V, Timoney J (2011) Python for audio signal processing. In: Linux Audio Conference 2011, May 6–8, Maynooth, Ireland
42. Cheuk KW, Anderson H, Agres K, Herremans D (2019) nnAudio: an on-the-fly GPU audio to spectrogram conversion toolbox using 1D convolution neural networks. arXiv preprint arXiv:1912.12055
43. Dinkel H, Qian Y, Yu K (2018) Investigating raw wave deep neural networks for end-to-end speaker spoofing detection. IEEE/ACM Trans Audio Speech Lang Process 26(11):2002–2014
44. Kingma D, Ba J (2014) Adam: a method for stochastic optimization. In: Proc Int Conf Learn Representations, pp 1–13
46. Jung JW, Shim HJ, Heo HS, Yu HJ (2019) Replay attack detection with complementary high-resolution information using end-to-end DNN for the ASVspoof 2019 challenge. arXiv preprint arXiv:1904.10134
Metadata
Title
Static–dynamic features and hybrid deep learning models based spoof detection system for ASV
Authors
Aakshi Mittal
Mohit Dua
Publication date
19.11.2021
Publisher
Springer International Publishing
Published in
Complex & Intelligent Systems / Issue 2/2022
Print ISSN: 2199-4536
Elektronische ISSN: 2198-6053
DOI
https://doi.org/10.1007/s40747-021-00565-w
