Synthetic speech detection using phase information

doi:10.1016/j.specom.2016.04.001

Speech Communication

Volume 81, July 2016, Pages 30-41

https://doi.org/10.1016/j.specom.2016.04.001

Highlights

• Phase information based synthetic speech detectors (RPS, MGD) are analyzed.
• Training using real attack samples and copy-synthesized material is evaluated.
• Evaluation of the detectors against unknown attacks, including channel effect.
• Detectors work well for voice conversion and adapted synthetic speech impostors.

Abstract

Taking advantage of the fact that most speech processing techniques neglect phase information, we seek to detect phase perturbations in order to prevent synthetic impostors from attacking Speaker Verification systems. Two Synthetic Speech Detection (SSD) systems that use spectral phase related information are reviewed and evaluated in this work: one based on the Modified Group Delay (MGD), and the other based on the Relative Phase Shift (RPS). A classical magnitude-based MFCC system is also used as a baseline. Different training strategies are proposed and evaluated using both real spoofing samples and signals copy-synthesized from the natural ones, aiming to alleviate the difficulty of obtaining real attack data to train the systems. The recently published ASVSpoof2015 database is used for training and evaluation. Performance with completely unrelated data is also checked using synthetic speech from the Blizzard Challenge as evaluation material. The results show that phase information can be successfully used for the SSD task even with unknown attacks.

Introduction

In speech processing, in the synthesis and analysis areas alike, phase information has traditionally been discarded for most conventional applications. The spectral magnitude information is highly correlated with the perceptual features of speech, and there are well established techniques to process it. Phase information has subtler perceptual effects (Alsteris and Paliwal, 2007; Saratxaga et al., 2012), and tricky properties such as wrapping make it more difficult to model and process.
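The wrapping issue mentioned above can be illustrated with a minimal NumPy sketch. The delay value and the frequency grid are hypothetical choices made only for the illustration; they are not taken from the paper.

```python
import numpy as np

# A pure delay gives a phase that decreases linearly with frequency.
freqs = np.linspace(0.0, 4000.0, 200)   # Hz, hypothetical analysis grid
delay = 0.002                           # seconds, hypothetical delay
true_phase = -2.0 * np.pi * freqs * delay

# A spectral analysis only returns the principal value in (-pi, pi],
# so the smooth line above turns into a sawtooth with 2*pi jumps.
wrapped = np.angle(np.exp(1j * true_phase))

# np.unwrap removes the 2*pi jumps, which works here because the true
# phase changes by less than pi between neighbouring frequency bins.
unwrapped = np.unwrap(wrapped)
```

The sawtooth in `wrapped` is what makes raw phase awkward for statistical modeling: two nearly identical phase values can sit at opposite ends of the interval.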

This neglect of phase information in most speech processing techniques can be exploited to detect that speech has been processed, by tracing the unintended perturbations of the natural phase patterns that the processing leaves behind. One particular case where detecting manipulations of natural speech can be critical is the speaker verification field.

The first speaker verification (SV) systems tried to determine whether a voice truly belonged to the claimed speaker and not to someone else (Rosenberg, 1976). Improvements in SV systems led to high success rates on this naive speaker verification problem, but the parallel advance in speech manipulation techniques has posed a new threat to these systems: impostors forging speech signals that imitate a particular speaker's voice. This threat was first pointed out by Pellom and Hansen (1999) and Masuko et al. (2000), and it has received increasing attention in the literature as new voice adaptation and transformation techniques have made it feasible to mimic a speaker's voice with less and less material from the original speaker. A detailed survey is given in Wu et al. (2015a).

Nowadays, two main speech processing techniques allow the creation of synthetic spoofing signals. The first is statistical speech synthesis (Yoshimura et al., 1999; Tokuda et al., 2002) using voices adapted to a particular speaker (Yamagishi et al., 2009), even with minimum quality material (Yamagishi et al., 2010). The second is voice conversion (VC) (Jin et al., 2008; Kinnunen et al., 2012). Both techniques can be used to generate spoofing signals that successfully deceive state-of-the-art SV systems, with false acceptance rates (FAR) around 80% for synthetic speech and 5% for VC.

A number of countermeasures have been proposed for these attacks. In Satoh et al. (2001), a countermeasure based on the average inter-frame difference was proposed to discriminate between natural speech and synthetic speech from an HMM-based speech synthesis system. A similar countermeasure, which also uses an average pair-wise distance between consecutive frames, was proposed to detect voice-converted speech (Alegre et al., 2013a). Rather than capturing inter-frame distortions, Wu et al. (2013) and Alegre et al. (2013b) proposed modulation-based features and local binary pattern-based features that exploit long-term spectro-temporal information for synthetic speech detection. In Sizov et al. (2015), a countermeasure which uses the same front-end as automatic speaker verification (ASV) was proposed to discriminate natural and voice-converted speech. All these countermeasures derive features from magnitude spectra and work well for specific, previously known attack techniques.
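As a rough illustration of the inter-frame idea, the sketch below computes the mean Euclidean distance between consecutive feature frames. The function name and the choice of distance are assumptions made for this example; the exact measures used by Satoh et al. (2001) and Alegre et al. (2013a) are not given in this text and may differ.

```python
import numpy as np

def average_interframe_difference(features):
    """Mean Euclidean distance between consecutive frames.

    `features` is a (num_frames, num_coeffs) array, for example one row of
    cepstral coefficients per frame. Hypothetical helper for illustration.
    """
    features = np.asarray(features, dtype=float)
    if features.ndim != 2 or features.shape[0] < 2:
        raise ValueError("need a 2-D array with at least two frames")
    # Difference between each frame and the one that follows it.
    deltas = np.diff(features, axis=0)
    # One distance per consecutive pair, then the average over the utterance.
    return float(np.mean(np.linalg.norm(deltas, axis=1)))
```

A detector built on such a statistic would compare it against a threshold estimated from training data; the intuition is that overly smooth parameter trajectories produce unusually small values.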

Phase-based parameters are good candidates for detecting synthetic speech because many speech processing techniques neglect phase information. Phase information can be analyzed in many ways (instantaneous phase, short-term group delay (Banno et al., 1998), anticausal cepstrum (Drugman et al., 2011), and others), but not all of these parameters are suitable for the statistical modeling required by a classifier. Phase-based countermeasures proposed by the authors of this work have been used for both synthetic and voice-converted speech detection. In Wu et al. (2012), synthetic speech detectors (SSD) based on cosine normalized phase and modified group delay (MGD) (Yegnanarayana and Murthy, 1992) are evaluated with converted spoofing signals. In Wu et al. (2013), a modulation spectrum derived from the modified group delay spectrum was used for synthetic speech detection. These works have confirmed the effectiveness of phase information in detecting synthetic speech with a matched vocoder.

The Relative Phase Shift (RPS) representation (Saratxaga et al., 2009) of the harmonic phase has also been used, with good results, to build SSD systems aimed at detecting spoofing signals created with adapted synthetic voices (De Leon et al., 2011; De Leon et al., 2012). The initial works focused on evaluating the actual capability of the RPSs to detect the phase modifications caused by the synthetic generation of the spoofing signals. Consequently, synthesized impostors were used to model the spoofing attacks. This approach has two downsides: it requires adapting synthetic voices to generate the spoofing samples and, more importantly, training the synthetic models on particular attacks makes their performance attack-dependent, so they will not be able to detect spoofing signals created with another attacking technique.

Once the validity of the RPS-based SSD was demonstrated, the problem of avoiding attack dependence was addressed in Sanchez et al. (2014) and Sanchez et al. (2015b). In these works, the authors analyze the use of copy-synthesized signals to create the impostor models. This way, the models do not depend on the particular features of a specific synthesizer and can instead detect any signal created with a vocoder. Multi-vocoder models trained and tested with completely unrelated signals were evaluated with good results.

Recently, the use of phase for synthetic speech detection has been widely adopted, either alone or combined with other parameters, and using different classifiers. Many systems include group delay derived parameters such as MGD or the all-pole group delay function (APGD) (Sahidullah et al., 2015; Alam et al., 2015). Other reported phase parameters are cosine phase (Liu et al., 2015), relative phase (Wang et al., 2015), instantaneous frequency (i.e. the time derivative of the phase) (Patel and Patil, 2015), baseband phase difference (BPD) and phase at the GCI (pitch synchronous phase) (Xiao et al., 2015), or the RPS (Villalba et al., 2015; Sahidullah et al., 2015; Sanchez et al., 2015a).
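To make the parenthetical definition of instantaneous frequency concrete, here is a minimal NumPy sketch that differentiates the unwrapped phase of the analytic signal. It is a generic textbook construction, not the feature extraction used by Patel and Patil (2015), and both function names are hypothetical.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal via the FFT (zero the negative frequencies)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    spectrum = np.fft.fft(x)
    weights = np.zeros(n)
    weights[0] = 1.0                      # keep DC once
    if n % 2 == 0:
        weights[n // 2] = 1.0             # keep Nyquist once
        weights[1:n // 2] = 2.0           # double positive frequencies
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * weights)

def instantaneous_frequency(x, fs):
    """Time derivative of the unwrapped phase, in Hz (one value per sample pair)."""
    phase = np.unwrap(np.angle(analytic_signal(x)))
    return np.diff(phase) * fs / (2.0 * np.pi)
```

For a steady sinusoid the result is constant at the tone frequency; for speech it is normally computed per sub-band after band-pass filtering, which is outside the scope of this sketch.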

In this paper we review and evaluate two phase-based SSD systems known for their good performance in statistical modeling and classification: an MGD-based and an RPS-based SSD system, benchmarking them against a spectral magnitude-based (MFCC) baseline system. In particular, we analyze the optimal use of training material, comparing the strategy of using “real” spoofing signals with that of using signals copy-synthesized from the natural ones.

Recently, work in this area has been promoted by ASVSpoof2015, the Automatic Speaker Verification Spoofing and Countermeasures Challenge (Wu et al., 2014). The participants were invited to submit the results of independent SSD modules for evaluation. Spoofing detection systems were tested with a database (the so-called ASVSpoof database) containing different spoofing techniques such as speech synthesis and voice conversion. The performance of the different systems was assessed by the organizers using standard metrics. This database has been made available to the public, and we use it in this work to evaluate our SSD systems.

The performance of the systems with unknown signals is also evaluated using a completely unrelated set of signals from the Blizzard Challenge (Black and Tokuda, 2005). This is the most popular international event for TTS system evaluation, where independent participants build synthetic voices using a common speech corpus and submit samples to be evaluated. These samples are, undoubtedly, representative of the current technology in speech synthesis and, consequently, of the kind of spoofing techniques likely to be used.

Furthermore, tests with a completely unrelated database, such as the Blizzard Challenge data, introduce the channel-mismatch issue for spoofing detection. While in the ASVSpoof Challenge the same recording channel is assumed for every signal, the channel of the Blizzard Challenge data differs from that of the ASVSpoof data. The robustness of the different SSDs to the channel has been little studied in the literature and will be analyzed in this work for the proposed systems.

The paper is organized as follows. First, the phase representation and parameterization methods (RPS and MGD) are described. Then, in Section 3, the Synthetic Speech Detection system is described. Section 4 describes the databases used in both the training and test phases, and Section 5 details the evaluation experiments. Finally, some conclusions are drawn.

Section snippets

Phase representation and parameterization

We will evaluate two different phase-based systems: the Relative Phase Shift (RPS), based on the phase shift of the harmonic components of the speech signal, and the Modified Group Delay (MGD), which includes both magnitude and phase related information. Both systems are described below.
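The two parameterizations can be sketched as follows, using their standard definitions from the cited literature: the RPS of harmonic k is its instantaneous phase minus k times the phase of the fundamental, and the MGD divides the group delay numerator by a cepstrally smoothed spectrum raised to 2γ before compressing the result with exponent α. The function names, the default values of `alpha`, `gamma`, `n_fft` and `n_lifter`, and the simple liftering used for smoothing are assumptions made for this illustration, not the configuration used in the paper.

```python
import numpy as np

def relative_phase_shift(harmonic_phases):
    """RPS of each harmonic: phi_k - k * phi_1, wrapped to (-pi, pi].

    `harmonic_phases[k - 1]` holds the instantaneous phase of harmonic k
    at one analysis instant (index 0 is the fundamental).
    """
    phases = np.asarray(harmonic_phases, dtype=float)
    k = np.arange(1, len(phases) + 1)
    return np.angle(np.exp(1j * (phases - k * phases[0])))

def modified_group_delay(frame, n_fft=512, alpha=0.4, gamma=0.9, n_lifter=30):
    """Modified group delay spectrum of one (already windowed) frame."""
    frame = np.asarray(frame, dtype=float)
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, n_fft)        # spectrum of x[n]
    Y = np.fft.rfft(n * frame, n_fft)    # spectrum of n * x[n]

    # Cepstrally smoothed magnitude spectrum: keep low quefrencies only.
    cepstrum = np.fft.irfft(np.log(np.abs(X) + 1e-10), n_fft)
    lifter = np.zeros(n_fft)
    lifter[:n_lifter] = 1.0
    lifter[-(n_lifter - 1):] = 1.0       # mirrored half keeps the cepstrum symmetric
    smooth = np.exp(np.fft.rfft(cepstrum * lifter).real)

    tau = (X.real * Y.real + X.imag * Y.imag) / smooth ** (2.0 * gamma)
    return np.sign(tau) * np.abs(tau) ** alpha
```

The key property of the RPS is that the linear phase term caused by the analysis instant cancels out, so the same waveform shape gives the same RPS values wherever the frame is placed. With `alpha = gamma = 1` and no smoothing effect, the MGD reduces to the ordinary group delay, which is a convenient sanity check.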

Synthetic Speech Detectors (SSD)

In this work we will compare different Synthetic Speech Detector (SSD) systems. The purpose of an SSD system is to discriminate between natural speech signals and synthetically generated ones. SSD blocks are intended to work jointly with speaker verification (SV) systems, detecting synthetically generated, speaker-adapted impostor signals that could deceive the SV system. If the SSD system requires knowing the supposed speaker identity to perform the classification task (i.e. it uses

ASVSpoof database

This database was created for the Automatic Speaker Verification Spoofing and Countermeasures Challenge (Wu et al., 2014), and comprises natural and spoofed speech. It is fully described in Wu et al. (2015c) but a brief summary is provided here.

The natural speech information was collected from 106 speakers (61 female and 45 male). There are no remarkable channel or background noise effects. Taking these genuine human signals as a basis, 10 different spoofing algorithms (named S1 to S10) are

Experiments and results

We have evaluated the phase-based SSD systems in two experiments using two evaluation sets, as explained in the previous section. For both of them, the systems have been trained with the training and development sets of the ASVSpoof DB, including additional signals generated by copy-synthesis of the human subset, using the three vocoders explained in 4.1.3. In the first experiment, the test material belongs to the same database as the training material (the ASVSpoof DB) whereas in the second,

Conclusions

In this paper we have reviewed two phase-based methods to detect spoofing using synthetic speech: both are based on GMM models for natural and synthetic signals, but one of them uses Modified Group Delay parameters to train the models while the other uses DCT-mel-RPS parameters. We also use an MFCC-based system as a baseline. We have focused on attacks created with speaker adapted synthetic speech and voice conversion systems which use parameter manipulation followed by speech generation using

Acknowledgments

This work has been partially supported by the Basque Government (ElkarOla Project, KK-2015/00098) and the Spanish Ministry of Economy and Competitiveness (Restore project, TEC2015-67163-C2-1-R).

References (46)

L.D. Alsteris et al. Short-time phase spectrum in speech processing: a review and some experimental results. Digit. Signal Process. (2007)
T. Drugman et al. Causal-anticausal decomposition of speech using complex cepstrum for glottal source estimation. Speech Commun. (2011)
H. Kawahara et al. Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds. Speech Commun. (1999)
Z. Wu et al. Spoofing and countermeasures for speaker verification: a survey. Speech Commun. (2015)
J. Alam et al. Development of CRIM System for the Automatic Speaker Verification Spoofing and Countermeasures Challenge 2015
F. Alegre et al. Spoofing Countermeasures to Protect Automatic Speaker Verification from Voice Conversion
F. Alegre et al. A new speaker verification spoofing countermeasure based on local binary patterns
H. Banno et al. Efficient representation of short-time phase based on group delay
A.W. Black et al. The Blizzard Challenge 2005: Evaluating corpus-based speech synthesis on common datasets
P.L. De Leon et al. Detection of synthetic speech for the problem of imposture (2011)
P.L. De Leon et al. Evaluation of speaker verification security and detection of HMM-based synthetic speech. IEEE Trans. Audio Speech Lang. Process. (2012)
D. Erro et al. Harmonics plus noise model based vocoder for statistical parametric speech synthesis. IEEE J. Sel. Top. Signal Process. (2014)
R.M. Hegde et al. Significance of the modified group delay feature in speech recognition. IEEE Trans. Audio Speech Lang. Process. (2007)
Q. Jin et al. Is voice transformation a threat to speaker identification?
S. King. Measuring a decade of progress in Text-to-Speech. Loquens (2014)
S. King et al. The Blizzard Challenge 2012
T. Kinnunen et al. Vulnerability of speaker verification systems against voice conversion spoofing attacks: the case of telephone speech
Y. Liu et al. Simultaneous Utilization of Spectral Magnitude and Phase Information to Extract Supervectors for Speaker Verification Anti-Spoofing
T. Masuko et al. Imposture using synthetic speech against speaker verification based on spectrum and pitch
T.B. Patel et al. Combining Evidences from Mel Cepstral, Cochlear Filter Cepstral and Instantaneous Frequency Features for Detection of Natural vs. Spoofed Speech
B.L. Pellom et al. An experimental study of speaker verification sensitivity to computer voice-altered imposters
A.E. Rosenberg. Automatic speaker verification: a review. Proc. IEEE (1976)
M. Sahidullah et al. A Comparison of Features for Synthetic Speech Detection

Cited by (48)

Subband fusion of complex spectrogram for fake speech detection
2023, Speech Communication
The phase information was shown useful in fake speech detection. However, the most common reason why phase-based features are not widely used is phase wrapping. This makes the original phase hard to model directly. Therefore, it remains a challenge how to utilize the phase information effectively. To address this issue, this paper proposes a novel subband fusion of the complex spectrogram method for fake speech detection. The complex spectrogram is used as the input feature, containing both amplitude and phase spectrogram. In addition, subbands of the complex spectrogram are modeled separately. The idea is motivated by the fact that each frequency band has a different effect on the fake speech detection task. Finally, to make full use of the subbands, the subband results are fused. Experimental results on the ASVspoof 2019 LA dataset show that our proposed system achieves an equal error rate (EER) of 0.68% and a minimum tandem detection cost function (min t-DCF) of 0.0224.
Analysis of Instantaneous Frequency Components of Speech Signals for Epoch Extraction
2023, Computer Speech and Language
Citation Excerpt :
In Alsteris and Paliwal (2004) and Schluter and Ney (2001), it was shown that the phase spectrum -based feature extraction improved the performance of automatic speech recognition. In addition, several studies have reported that features representing the phase spectrum are useful in speaker recognition and detection of synthetic speech (Nakagawa et al., 2012; Wang et al., 2010; Bastys et al., 2010; Rajan et al., 2013; Saratxaga et al., 2016). More details of the application of the phase spectrum in speech processing can be found in recent review articles (Mowlaee et al., 2016; Gerkmann et al., 2015).
The major impulse-like excitation in the speech signal is due to abrupt closure of the vocal folds, which takes place at the glottal closure instant (GCI) or epoch in each cycle. GCIs are used in many areas of speech science and technology, such as in prosody modification, voice source analysis, formant extraction and speech synthesis. It is difficult to observe these discontinuities (corresponding to GCIs) in the speech signal because of the superimposed time-varying response of the vocal tract system. This paper examines the phase part of different frequency components of the speech signal to extract epochs. Three analysis methods to decompose the speech signal into different frequency components are considered. These methods are the short-time Fourier transform (STFT), narrow bandpass filtering (NBPF), and single frequency filtering (SFF). The locations of the discontinuities in the speech signal are obtained from the instantaneous frequency (IF) (i.e., the time derivative of the phase) of each of the frequency components. A method for automatic detection of epochs using the amplitude weighted IF is proposed. Performance of the proposed epoch detection method is compared with four state-of-the-art methods in clean and telephone quality speech. The performance of the proposed method is comparable with the performance of the existing epoch detection methods for clean speech but better for telephone quality speech.
Voice spoofing detector: A unified anti-spoofing framework
2022, Expert Systems with Applications
Citation Excerpt :
In De Leon, Pucher, Yamagishi, Hernaez, and Saratxaga (2012), relative phase shift (RPS) features were extracted from the speech segments of the audio signal and used with the GMM for speech synthesis detection. Similarly, RPS was used with the GMM for synthetic speech detection in Saratxaga, Sanchez, Wu, Hernaez, and Navas (2016). In Janicki (2017), long term prediction residual signals comprised of 23 different parameters were used with the SVM to classify the human and cloned speech.
Voice controlled systems (VCS) in Internet of Things (IoT), speaker verification systems, voice-based biometrics, and other voice-assistant-enabled systems are vulnerable to different spoofing attacks i.e., replay, cloning, cloned-replay, etc. VCS are not only susceptible to these attacks in a non-network environment, but they are also vulnerable to multi-order spoofing attacks in networked IoT. Additionally, deepfakes with artificially generated audio pose a great threat to the all systems having voice-interfaces. Most of the existing countermeasures against these voice spoofing attacks work for only one specific attack (e.g. voice replay) and fail to generalize this for other classes of spoofing attacks. Additionally, generalization is also crucial for cross-corpora evaluation. Thus, there exists a need to develop a unified voice anti-spoofing framework capable of detecting multiple spoofing attacks. This work presents a unified anti-spoofing framework that uses novel (ATCoP-GTCC) features to combat the variety of voice spoofing attacks. The proposed novel acoustic-ternary co-occurrence patterns (ATCoP) encode the co-occurrence of similar patterns between the center and neighboring samples. Our experiments demonstrate that ATCoP can better capture the microphone induced distortions in replays, unnatural prosody and algorithmic artifacts in cloned samples, and both the distortions and artifacts in cloned-replays including compression on multi-hop attacks in the spoofing samples. The performance of ATCoP could be further enhanced by the Gammatone cepstral coefficients. To evaluate the effectiveness of the proposed anti-spoofing system for multi-order replay and cloned-replay attacks detection, we created a diverse voice spoofing detection corpus (VSDC) containing multi-order replay and cloned-replay audios against the bonafide and cloned audio recordings, respectively. 
Experimental results obtained on VSDC, ASVspoof 2019, Google’s LJ Speech, and YouTube deepfakes datasets illustrate the effectiveness of the proposed system in terms of accurate detection for a variety of voice spoofing attacks.
Towards protecting cyber-physical and IoT systems from single- and multi-order voice spoofing attacks
2021, Applied Acoustics
Citation Excerpt :
Existing techniques [10–16] have used different magnitude- and phase-based features for voice cloning/synthesis detection. In [10,11], relative phase shift features were used, whereas, cochlear filter cepstral coefficients (CFCC) and CFCC-instantaneous frequency features were employed in [12] with the GMM for voice cloning detection. In [13], inverted constant-Q coefficients, inverted CQCC, inverted C-Q block coefficients, and inverted C-Q linear block coefficients were employed for speech synthesis detection.
Voice-controlled systems (VCSs), a new class of cyber-physical systems (CPS), and Internet of Things (IoT) devices are increasingly employing smart speakers such as Google Home and Amazon Alexa, and other voice assistants to enable management of various remote operations at home and offices. However, these smart speakers and hence VCSs are susceptible to various voice spoofing attacks i.e. replay, cloning, etc., in a non-network environment as well as in a multi-hop network setup. These diverse spoofing threats on VCSs require an urgent need to develop a robust spoofing countermeasure for VCSs capable of detecting a variety of voice spoofing attacks. This paper presents a spoofing countermeasure that uses novel acoustic ternary patterns (ATP) with Gammatone cepstral coefficients (GTCC) features to counter the voice spoofing attacks on VCSs in single- and multi-hop network environments. Our experimental analysis demonstrates that the proposed ATP features when combined with GTCC can effectively detect the distortions in replayed samples, unnatural prosody present in the cloned samples, and both distortions and unnatural patterns of stress and intonation in cloned-replay samples. The proposed ATP-GTCC features are used to train the SVM for development of a spoofing countermeasure to cater all possible forgeries. Experimental results based on highly diversified ASVspoof 2019 and VSDC datasets signify the effectiveness of the proposed countermeasure for reliable detection of 1st- and 2nd-order replay, cloning, and cloned-replay attacks.
Inter-component phase processing of quasipolyharmonic signals
2021, Applied Acoustics
The paper presents a generalization of theoretical and experimental research in the field of inter-component phase signal processing based on instantaneous phase estimates of multiple or rational frequency harmonic components. We propose to model harmonic phase of each component of quasipolyharmonic signal with consideration of relative delays that occur on different frequencies during the signal propagation. Based on the proposed harmonic phase model, it is argued the inter-component phase relations carry the information about parameters of these relative delays. We introduce the general expression for the inter-component phase relations estimates, showing their temporal constancy and invariance to the time–frequency shifts and fluctuations of the harmonic amplitudes. These properties correspond to the findings obtained for signal propagation experiments with prior knowledge of harmonic phases. Applicability of proposed estimates for processing of natural signals is justified by results of past speech processing research (including speaker identification and speech enhancement) and novel experiments on condition monitoring of industrial machines. By employing the proposed harmonic phase model, we discuss why the earlier research on speech structure using higher-order spectra techniques did not reveal the non-linear nature of speech. We carry out simple experiments on condition monitoring of industrial machines to demonstrate the potential of distinguishing between different configurations of shaft misalignment based on the distribution of standard deviation of inter-component phase relations.
One-Class Neural Network With Directed Statistics Pooling for Spoofing Speech Detection
2024, IEEE Transactions on Information Forensics and Security
