1 Introduction

Noise pollution has become an undeniable problem in today's society. Due to a constant increase in magnitude and severity [9], environmental noise impairs human health and well-being more than ever before [15]. Moreover, negative background auditory interferences, such as those produced by transportation noise [38], industrial noise [1], or urban noise [79], not only impair human cognitive [62, 73] and communicative [44, 45] skills, but also limit the performance of general audio and specific speech-driven applications, such as automatic speech recognition [40], speech emotion recognition [2, 61], and speaker verification [24, 28, 37, 56]. Hence, audio enhancement, which generally aims at extracting targeted signals, is broadly exploited to improve audio and speech quality in real-life applications [16, 26, 55]. Two of the main procedures for enhancing audio and speech are source separation and denoising: the former aims to extract a target audio from a mixture of multiple overlapping signals [66]; the latter attempts to suppress the background noise [49, 78]. With the advance of artificial intelligence, neural network based models for audio enhancement have been presented, proving particularly efficient in source separation [18, 20, 35, 36, 68] and denoising [5, 27, 32, 34, 47, 60, 71] tasks; indeed, the performance of classic algorithms is often surpassed by artificial neural networks [7].

Despite the ongoing efforts and the results already achieved [13, 14, 39], the enhancement of in-the-wild audio sources is still an open research topic. One of the remaining challenges in audio enhancement is the need for more robust methods. An audio enhancement model relies on its noise generalisation, which is limited by the size and diversity of the training noises; in real scenarios, however, audio may simultaneously be corrupted by multiple kinds of noise, including unseen ones [30]. Furthermore, the non-stationary nature of real-life noises introduces a level of uncertainty in realistic applications that cannot be handled by existing speech enhancement methods, which typically process a single noise recording at a time [5, 47, 49].

Another open challenge in the audio enhancement domain is the development of methods able to efficiently handle interference components while preserving the essential features of the signal. The currently available enhancement technology, often characterised by the application of aggressive methods for the estimation of noise and other interference components [48, 63, 76], is not yet able to preserve a signal's essential properties, such as the naturalness of speech. An efficient and precise distinction between target and background becomes particularly challenging when the interfering components present acoustic properties similar to those of the target signal to be retained. For instance, within the speech domain, separating out an undesired background speaker who speaks the same language as the target one can be particularly difficult, especially in noisy audio samples [17, 36]. Finally, a third challenge to be faced is the need for further progress towards intelligent audio enhancement technology equipped with an autonomous decision making system. This becomes crucial in specific circumstances where preserving environmental interferences, such as alarms, might be essential for security reasons, making the target noise to be removed very specific. In such a scenario, a selective noise suppression system should be capable of identifying and preserving the allowed noises, i. e., the “positive noises”, while suppressing the undesired ones, i. e., the “negative noises”. Although this mechanism is crucial in real-life scenarios, where noisy audio samples often contain important signals, such as alarms or other acoustic warnings aimed at preventing, e. g., traffic accidents, existing technology for selective noise suppression is still unable to handle problems of this level of complexity. The main limitation of prior approaches is that they mostly exploit spatial information where multi-channel recordings are available, e. g., binaural hearing aids using end-to-end (e2e) wireless technology [3, 72]. However, these approaches rely on assumptions regarding the spatial properties of the target signal and the different noise sources, which considerably limits their adaptability to unseen environments, making them almost inapplicable in real-world scenarios.

1.1 Contributions of the presented work

All in all, the future development of in-the-wild audio enhancement technology should mainly focus on three challenges: (i) robustness, developed by increasing a system’s ability to handle unseen noises; (ii) efficiency, promoted by refining a system’s capacity to preserve a signal’s essential properties; (iii) decision making, encouraged by improving a system’s capability to autonomously identify the important signals. In order to contribute to the alleviation of the described challenges, we introduce the Neuro-Holistic Audio-eNhancement System (N-HANS), a neural network-based toolkit for in-the-wild audio enhancement developed with TensorFlow in Python. The objectives of N-HANS are therefore three-fold: (i) successfully process unseen noises through a robust technology especially tailored for audio denoising; (ii) efficiently preserve signals’ essential properties through their accurate separation from similar interfering sources; (iii) properly identify and retain important signals through an intelligent selective noise suppression system capable of autonomously discriminating between positive and negative noises. Hence, the main contributions of this work can be summarised as follows:

  • We present, to the best of our knowledge, the first audio enhancement toolkit with the functionality of selective noise suppression.

  • We propose a neural network architecture named ±Auxiliary Network, which uses a novel fusion method to project information from auxiliary input references.

  • We present N-HANS, an open-source audio enhancement toolkit specially tailored for in-the-wild applications through a three-fold functionality: denoising, selective noise suppression, and source separation. Along with the toolkit, we also provide a user-friendly command line interface.

The rest of this manuscript is organised as follows. In Section 2, the related work is outlined. In Section 3, we give an overview of the N-HANS framework and introduce the system’s input processing. Section 4 presents the proposed ±auxiliary network, which serves as the back end of the system. Section 5 discusses the performed experiments and their evaluation. Section 6 illustrates the system’s performance by visualising a selection of audio examples processed by N-HANS. Finally, concluding remarks and future research directions are drawn in Section 7.

2 Related work

Although a variety of methods for speech enhancement have been presented (for an overview, cf. Table 1), the open-source toolkits currently available focus only on one specific task, i. e., either audio denoising or source separation, while methods providing both functionalities in the same tool have not yet been developed. Furthermore, the performance of many of them is limited to specific acoustic conditions, for instance, being able to handle only stationary noises. This is the case of VoiceBox, a speech processing toolkit that provides classic signal processing algorithms for a wide range of audio tasks, including denoising. Similarly, CtuCopy, based on the combination of Wiener filtering and spectral subtraction methods, was developed for audio feature extraction and speech denoising. Since these two tools use classical signal processing methods for denoising, they require accurate noise power estimation, which can only be assured under stationary noise conditions. Indeed, when processing non-stationary noises, these classic approaches are characterised by a decline in their performance.

Table 1 List of the most popular open-source toolkits for speech enhancement

With the ever-increasing use of artificial neural networks, promising denoising toolkits based on neural networks, such as SETK, SE Toolkit [25], SEDNN [75], SEGAN [47], and U-Net [5, 60], have been presented in the literature. However, these methods were specifically designed for audio denoising and are thus difficult to apply to source separation. Similarly, Untwist [50] and Asteroid [46] are two neural network-based toolkits for source separation: the former includes the most basic neural network architecture, i. e., the Multi-Layer Perceptron (MLP); the latter, recently proposed and considered ‘superior’ in the literature, integrates a variety of neural networks, such as Conv-Tasnet [36], Deep Clustering [18], and Chimera++ [69]. Nevertheless, neither of them provides denoising functionality. Finally, source separation methods based on non-negative matrix factorisation (NMF), such as OpenBlissart [70], have also been presented. Similarly, the Flexible Audio Source Separation Toolbox (FASST) [51] considers Gaussian mixture models (GMMs) and hidden Markov models (HMMs) for the NMF training. The NMF-based approach to source separation has also been extended to a denoising task in GCC-NMF, which applies the generalised cross correlation (GCC) spatial localisation method; yet, GCC-NMF can only be used for denoising, not for source separation applications.

To the best of our knowledge, N-HANS is the first publicly available neural network based tool offering both audio denoising and source separation functionalities in one toolkit framework. In addition, N-HANS provides a solution to selective noise suppression, i. e., suppressing only unwanted noises while preserving others; maintaining a natural audio surrounding can be particularly important when relevant signals are involved, e. g., alarms or other acoustic warnings. Furthermore, the performance of the currently available machine learning based speech enhancement tools, such as those indicated in Table 1, is limited by the diversity of speakers and noise types in the training set, which impairs their application in real-life scenarios, where unseen speakers can appear and multiple noise types exist simultaneously. In contrast, N-HANS, by leveraging auxiliary networks that learn to identify and generalise the characteristics of unseen speakers and speech surroundings, adapts better than existing methods to real-life scenarios.

3 System overview: methodology

N-HANS, embedded with two trained models sharing an identical architecture, faces the challenge of handling unseen noises by considering individual configurations, i. e., each model is conditioned on additional environmental backgrounds in order to adapt it to unseen noises from real life. In addition, through its audio source separation and selective noise suppression system, based on a ±Auxiliary (A) Network (cf. Figure 1), N-HANS recovers a target audio while removing the interfering sources. To the best of our knowledge, the presented fusion method, used to inject the context information into the conditional residual network, has not been proposed in previous research (for further details cf. Section 4). First, the log magnitude spectrum is extracted from the input contaminated audio and from the positive and negative recordings by taking the logarithmic absolute value of the Short-Time Fourier Transformation (STFT), computed using a 25 ms Hanning window shifted by 10 ms; these spectra are fed into the Enhanced and the ±Auxiliary Networks separately. Then, the +A Network processes the extracted positive spectrum to produce a positive embedding vector, while the −A Network processes the negative spectrum to emit a negative one. The positive and negative embedding vectors can be seen as representations of the characteristics of the unseen audio contents and are then injected into the enhancement network to emit the denoised or separated audio. Through positive and negative context awareness, N-HANS strengthens the system’s adaptability and applicability to different unseen noisy environments and audio sources, e. g., different speakers.

Fig. 1
figure 1

System framework. From left to right, the input (noisy audio) and the recordings: positive (+rec), negative (−rec); the Log Spectrum extraction block; the Networks: ±Auxiliary (A), Enhanced; and the system’s output (denoised audio). The +A Network processes the +rec to produce a positive embedding vector (+emb), i. e., the components to be preserved. The −A Network processes the −rec to obtain the negative embedding vector (−emb) that hints at the components to be suppressed. The Enhanced Network processes the noisy audio, as well as the positive and negative embeddings, in order to generate the desired output

3.1 Input audio processing

N-HANS processes contaminated audio conditioned on additional positive and negative recordings, which indicate the audio content to be preserved and suppressed, respectively. The audio files used in our experiments (cf. Section 5) are: LibriSpeech [43] and AudioSet [12] for denoising and selective noise suppression; and the VoxCeleb corpus [6, 41] for source separation. The sampling frequency of all audio files is 16 kHz; with the 25 ms window, each frame thus consists of 400 samples, resulting in a feature vector of 201 frequency bins. In Table 2, an overview of N-HANS’s input and output information for the considered tasks is given.

Table 2 N-HANS overview. The input, i. e., the raw input and the positive (+) and negative (−) recordings, as well as the output, are indicated for the three considered tasks: denoising, source separation, and selective noise suppression
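To make the feature extraction described above concrete, the following minimal sketch reproduces the log magnitude spectrum (25 ms Hann window, 10 ms shift, 201 frequency bins at 16 kHz) using scipy; the function name and the small epsilon added before the logarithm are illustrative assumptions and not the toolkit’s actual implementation.

```python
import numpy as np
from scipy.signal import stft

def log_magnitude_spectrum(audio, sr=16000, eps=1e-8):
    """Log magnitude spectrum: 25 ms Hann window (400 samples), 10 ms shift (160 samples)."""
    _, _, spec = stft(audio, fs=sr, window='hann',
                      nperseg=400, noverlap=400 - 160, nfft=400)
    # spec has shape (201, frames); transpose to (frames, 201) and compress the dynamics
    return np.log(np.abs(spec) + eps).T
```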

The three inputs, i. e., the raw input (original audio file to be enhanced), the positive recording (containing interferences to be preserved), and the negative recording (containing interferences to be suppressed), are processed by the Enhanced and ±Auxiliary Networks (cf. Section 4). The contaminated segment M, consisting of N successive frames (N = 35 in the experiments) from the log magnitude spectrum of the contaminated audio, leads to \(M \in \mathbb {R}^{N \times F}\). The positive context C+ is a segment of L frames of the log magnitude spectrum extracted from the positive recording. The negative context C− is retrieved from the log magnitude spectrum of the negative recording using the same process, leading to \(C_{+}, C_{-} \in \mathbb {R}^{L \times F}\) (L = 200 in the experiments). The positive and the negative contexts, containing the information to be preserved and suppressed, respectively, in the raw input, are used to create the positive and negative embeddings involved in the enhancement process (cf. Section 4). In order to aggregate the acoustic characteristics of the audio content to be preserved and suppressed, sufficient acoustic information should be considered: the larger the size of the positive and negative contexts, the more information is supplied to the system. The target segment \(T \in \mathbb {R}^{N \times F}\), with the same size as the contaminated segment, represents the ideal output segment: denoised audio for the denoising and selective noise suppression tasks, and a separated source for the source separation task. The centre frames of the contaminated and target segments are indicated as \(M^{c}, T^{c} \in \mathbb {R}^{1 \times F}\), respectively.
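As a minimal illustration of how these segments and contexts could be sliced from the log spectra, consider the sketch below; the helper name and the choice of taking the contexts from the beginning of the recordings (as done at test time) are assumptions for illustration only.

```python
import numpy as np

N, L, F = 35, 200, 201   # segment frames, context frames, frequency bins

def slice_example(noisy_spec, target_spec, pos_spec, neg_spec, start):
    """Slice one training example from (frames, F) log magnitude spectra."""
    M   = noisy_spec[start:start + N]     # contaminated segment, shape (N, F)
    T_c = target_spec[start + N // 2]     # target centre frame,  shape (F,)
    C_p = pos_spec[:L]                    # positive context,     shape (L, F)
    C_n = neg_spec[:L]                    # negative context,     shape (L, F)
    return M, C_p, C_n, T_c
```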

4 Approach: ±auxiliary networks

The architecture of the proposed ±Auxiliary (A) Networks is based on stacks of residual blocks [17], as depicted in Fig. 2. Residual networks (ResNets), which introduce skip connections into the conventional neural network framework, resulting in a smoother loss landscape and enabling substantially deeper architectures [31], have proven successful in both the computer vision and audio domains [17, 23, 67]. A basic residual block contains two convolutional layers, with batch normalisation [21, 52] followed by a rectified linear unit (ReLU) [42] applied between them. The residual block’s input is then added to the output of the second convolutional layer after channel conversion via a 1 × 1 convolution. Again, batch normalisation and ReLU activation are applied to produce the block’s output.
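A basic residual block of this kind could be sketched in TensorFlow/Keras as follows; kernel sizes and the exact normalisation placement follow the description above and are assumptions, not the toolkit’s source code.

```python
import tensorflow as tf

def residual_block(x, filters, kernel_size=(3, 3), strides=(1, 1)):
    """Basic residual block: conv -> BN -> ReLU -> conv, plus a 1x1-converted skip path."""
    h = tf.keras.layers.Conv2D(filters, kernel_size, strides, padding='same')(x)
    h = tf.keras.layers.BatchNormalization()(h)
    h = tf.keras.layers.ReLU()(h)
    h = tf.keras.layers.Conv2D(filters, kernel_size, (1, 1), padding='same')(h)
    # channel (and stride) conversion of the block input via a 1x1 convolution
    s = tf.keras.layers.Conv2D(filters, (1, 1), strides, padding='same')(x)
    y = tf.keras.layers.Add()([h, s])
    y = tf.keras.layers.BatchNormalization()(y)
    return tf.keras.layers.ReLU()(y)
```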

Fig. 2
figure 2

Architecture of the ±Auxiliary (A) Networks. The +A and −A Networks process the positive and negative contexts (C+ and C−) via a sequence of 4 residual blocks to produce positive and negative embeddings (e+ and e−). To estimate the contamination frame (CF), the Enhanced Network processes the contaminated segment M (noisy or overlapping segment) through a sequence of 8 residual blocks, each additionally conditioned on e+ and e−

The N-HANS architecture consists of three subnetworks, each containing a sequence of residual blocks. One embedding network, i. e., the +A Network, processes the positive context in order to emit the positive embedding. Similarly, another embedding network with the same architecture (the −A Network) processes the negative context to emit the negative embedding. Then, the enhanced network processes the two embeddings and the contaminated segment, thereby emitting a contamination frame (CF), which estimates the audio components that need to be eliminated from the centre frame of the contaminated segment. Finally, the estimated target frame, i. e., the difference between the centre frame of the contaminated segment and the estimated contamination frame, is computed. To minimise the mean squared error between the estimated target frame and the true target frame, i. e., the centre frame of the target segment, the model is trained using stochastic gradient descent (cf. Section 4.2).

4.1 Auxiliary embedding network

In order to enable individual handling of the positive and negative contexts, these are separately processed by two embedding networks which share an identical structure but have different trainable parameters. Each of the two embedding networks, made up of a sequence of four residual blocks, takes an audio context as input and emits an embedding vector that contains valuable acoustic information obtained from the context segment. The specifications of each embedding network are given in Table 3.

Table 3 Specifications of the auxiliary embedding network

The output feature map of the last residual block in each embedding network is averaged across all locations (time steps and frequency bins), leading to a positive and negative embedding vector (cf. (1) and (2), respectively). The positive embedding vector is defined as

$$ e^{+} = \textbf{avg}(f^{+A}(C^{+})), $$
(1)

while the negative embedding vector is defined as

$$ e^{-} = \textbf{avg}(f^{-A}(C^{-})), $$
(2)

both with a fixed length of 512. f+A and f−A denote the operations of the residual block sequences in the positive and negative networks (+A and −A), which through their own learnt parameters separately process the positive and negative contexts (C+ and C−) to produce the positive and negative embeddings: \(e^{+}, e^{-} \in \mathbb {R}^{512}\). The two embeddings are subsequently injected into the enhanced network to assist the audio denoising, source separation, and selective noise suppression tasks.
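The following sketch illustrates the idea behind (1) and (2): a stack of blocks maps the context to a feature map whose 512 channels are averaged over all time-frequency locations. The plain convolutional stand-in blocks and the channel numbers are illustrative assumptions; the actual residual blocks are specified in Table 3.

```python
import tensorflow as tf

def auxiliary_embedding_network(context, filters=(64, 128, 256, 512)):
    """Map a context tensor of shape (batch, L, F, 1) to a 512-dim embedding."""
    h = context
    for f in filters:                      # stand-in for the four residual blocks
        h = tf.keras.layers.Conv2D(f, (3, 3), strides=(2, 2), padding='same')(h)
        h = tf.keras.layers.BatchNormalization()(h)
        h = tf.keras.layers.ReLU()(h)
    # average the last feature map over time steps and frequency bins, cf. (1) and (2)
    return tf.keras.layers.GlobalAveragePooling2D()(h)   # shape (batch, 512)
```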

4.2 Enhanced network

The enhanced network, which processes the contaminated segment and the positive and negative embeddings, comprises a sequence of 8 conditional residual blocks, each with its own kernel size, stride, and number of channels (cf. Table 4). Each conditional residual block, made up of two convolutional layers, processes the block input \(M_{in} \in \mathbb {R}^{T \times F \times C_{in}}\) (cf. Figure 3). In the first convolutional layer, the learnt positive and negative embeddings are each projected to a vector whose length equals the number of output feature maps of the layer by applying a trainable fully-connected layer. The projected embeddings are then added to every location of the feature maps, leading to

$$ M_{1} = \textbf{conv}(M_{in}) + e_{1}^{+} + e_{1}^{-}, $$
(3)

which has the shape of T × F × C1, and

$$ e_{1}^{+} = e^{+} W_{1}^{+} + b_{1}^{+}, $$
(4)
$$ e_{1}^{-} = e^{-} W_{1}^{-} + b_{1}^{-}, $$
(5)

denote the linearly projected embedding vectors with the length of C1. \(W_{1}^{+}\), \(b_{1}^{+}\), and \(W_{1}^{-}\), \(b_{1}^{-}\) are trainable parameters. The projected embedding vectors are extended to the size of the convolution output by using array broadcasting.

Fig. 3
figure 3

Conditional residual block: the learnt positive and negative embeddings (e+ and e−) are injected into the two convolutional layers of the enhanced network. The block’s input (Min), the output of the skip connection path (Msc), the first convolutional layer output (M1), the second convolutional layer output (M2), and the block’s output (Mout) are also indicated

Table 4 Specifications of the enhanced network

Further, for the second convolutional layer, M1 is processed similarly, resulting in

$$ M_{2} = \textbf{conv}(M_{1}) + e_{2}^{+} + e_{2}^{-}, $$
(6)

with the shape of T × F × C2, where

$$ e_{2}^{+} = e^{+} W_{2}^{+} + b_{2}^{+}, $$
(7)
$$ e_{2}^{-} = e^{-} W_{2}^{-} + b_{2}^{-} $$
(8)

are projected embedding vectors with the length of C2. By doing this, all convolutional layers in the enhanced network are conditioned on the information from the positive and negative contexts, allowing the model to better estimate the components that need to be preserved and suppressed in the contaminated segment. Besides, in the skip connection path, the input of the conditional residual block is converted to have the same number of channels as the output of the second convolutional layer through a 1 × 1 convolution, leading to

$$ M_{sc} = \textbf{conv}_{1 \times 1}(M_{in}), $$
(9)

which has the shape of T × F × C2 and is added to the main path to achieve the block output

$$ M_{out} = M_{2} + M_{sc}. $$
(10)

Again, batch normalisation is applied for each convolutional layer of the conditional residual block, followed by ReLU activation functions (cf. Figure 3).
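The fusion mechanism of (3)–(10) can be sketched as follows: each embedding is linearly projected to the layer’s channel dimension and broadcast-added over all time-frequency locations of the convolution output. Creating layers inside the function, the kernel sizes, and the exact normalisation placement are simplifications for illustration; a real implementation would reuse the layer objects across calls.

```python
import tensorflow as tf

def conditional_residual_block(m_in, e_pos, e_neg, filters, strides=(1, 1)):
    """Conditional residual block: both convolutions are conditioned on e+ and e-."""
    def inject(feature_map, channels):
        # e_l^{+/-} = e^{+/-} W_l^{+/-} + b_l^{+/-}, broadcast over time and frequency
        ep = tf.keras.layers.Dense(channels)(e_pos)[:, None, None, :]
        en = tf.keras.layers.Dense(channels)(e_neg)[:, None, None, :]
        return feature_map + ep + en

    h = tf.keras.layers.Conv2D(filters, (3, 3), strides, padding='same')(m_in)
    h = inject(h, filters)                                    # M_1, cf. (3)
    h = tf.keras.layers.ReLU()(tf.keras.layers.BatchNormalization()(h))
    h = tf.keras.layers.Conv2D(filters, (3, 3), (1, 1), padding='same')(h)
    h = inject(h, filters)                                    # M_2, cf. (6)
    m_sc = tf.keras.layers.Conv2D(filters, (1, 1), strides, padding='same')(m_in)
    out = h + m_sc                                            # M_out, cf. (9)-(10)
    return tf.keras.layers.ReLU()(tf.keras.layers.BatchNormalization()(out))
```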

The output of the last conditional residual block is additionally convolved along the time axis and then flattened into a vector. The flattened vector is projected to a length of F (F = 201 in the experiments) through a fully-connected layer, representing the estimated contamination frame:

$$ \hat{S}^{c} = \textbf{conv}_{\textbf{T}}(f^{enh}(M, e^{+}, e^{-}))W_{o} + b_{o}, $$
(11)

where fenh denotes the operation of the conditional residual blocks in the enhanced network, and convT indicates the convolution along the time direction only. Wo and bo are the fully connected layer’s learnable parameters. We subtract the estimated contamination frame from the centre frame of the contaminated spectrum to obtain the estimated target frame:

$$ \hat{T}^{c} = M^{c} - \hat{S}^{c}. $$
(12)

During the training phase, we optimise the network parameters using stochastic gradient descent (SGD) with a learning rate of 0.1, to minimise the weighted mean squared error (MSE) between the estimated target frame and the true centre frame of the target spectrum:

$$ L = || (\hat{T}^{c}(f) - T^{c}(f)) \times w(f) ||^{2}, $$
(13)

where f ∈ [1,F] stands for each frequency bin in the target frame. Further, w(f) is defined as

$$ w(f) = 2 - \frac{f}{F}, $$
(14)

and hence, the low frequencies are given more weight in order to better follow speech characteristics. At evaluation time, the same positive and negative contexts are used to process all contaminated segments that belong to a given audio sample from the test set. Thus, each estimated target centre frame takes into account the entire information of the positive and negative contexts. Concatenating the target centre frames leads to the estimated target spectrum, and the inverse short-time Fourier transform (iSTFT) is then used to reconstruct the target audio using the phase of the contaminated audio.
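For illustration, the training objective of (13)–(14) and the reconstruction step described above could look as in the sketch below; the loss is averaged over the batch (a constant factor away from the summed form of (13)), and the STFT parameters mirror Section 3. This is a sketch, not the toolkit’s training code.

```python
import numpy as np
import scipy.signal
import tensorflow as tf

def weighted_mse(t_true, t_est, F=201):
    """Weighted MSE of (13): frequency bins are weighted by w(f) = 2 - f/F, cf. (14)."""
    f = tf.range(1, F + 1, dtype=tf.float32)
    w = 2.0 - f / float(F)
    err = tf.square((t_est - t_true) * w)
    return tf.reduce_mean(tf.reduce_sum(err, axis=-1))   # sum over f, mean over batch

optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)   # SGD as stated above

def reconstruct(est_log_mag, noisy_stft, sr=16000):
    """Rebuild the waveform from the estimated log magnitude, reusing the noisy phase."""
    est_stft = np.exp(est_log_mag) * np.exp(1j * np.angle(noisy_stft))
    _, audio = scipy.signal.istft(est_stft.T, fs=sr, window='hann',
                                  nperseg=400, noverlap=240, nfft=400)
    return audio
```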

5 Usability: experimental results & evaluation

N-HANS, initially developed using Python 3 and TensorFlow 1.14, has also been made compatible with TensorFlow 2 following the official code migration guidance provided by the platform. Its source code is freely available for developers in a public GitHub repository, and for users who want to apply N-HANS directly, trained models are also accessible via a command line interface. Although N-HANS was implemented to process 16 kHz Waveform Audio File Format (WAV) files for input and output, i. e., one of the most standard and broadly used audio formats, input files in other formats or with other sample rates are also handled through an embedded format conversion based on pysox [4], which internally transfers them into WAV format. Both single audio files and multiple audio files organised into a directory can be provided as input to N-HANS, which achieves optimal performance with GPU acceleration but is also capable of running on CPU only. To test the N-HANS functionalities, a series of experiments, presented in the following, were conducted.
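The format conversion mentioned above could, for instance, be performed with pysox as in the following sketch; the function name and parameter choices are illustrative and do not reproduce the toolkit’s internal code.

```python
import sox

def to_16k_wav(input_path, output_path):
    """Convert an arbitrary audio file to 16 kHz, mono, 16-bit WAV using pysox."""
    tfm = sox.Transformer()
    tfm.convert(samplerate=16000, n_channels=1, bitdepth=16)
    tfm.build(input_path, output_path)
```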

5.1 Speech denoising & selective noise suppression

5.1.1 Dataset & evaluation metrics

To evaluate the performance of N-HANS in the denoising and selective noise suppression tasks, the LibriSpeech [43] and AudioSet [12] databases were considered. The LibriSpeech database provides large-scale clean speech utterances, consisting of approximately 1 000 hours of read speech derived from over 8 000 public domain audiobooks, and comes with its own train, development, and test splits. The AudioSet corpus contains more than two million human-labelled 10-second environmental sound clips drawn from YouTube videos. Following AudioSet’s ontology and excluding the noise recordings labelled as ‘Human sounds’, we considered 16 198 samples for training, 636 for development, and 714 for testing. A variety of evaluation metrics, including log spectral distortion (LSD), signal-to-distortion ratio (SDR), perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), Mel cepstral distortion (MCD), and segmental SNR (SSNR), which are widely used in prior work [22], were taken into account to assess the performance of N-HANS under several Signal-to-Noise Ratio (SNR) conditions.
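Two of the listed metrics can, for example, be computed with the third-party pesq and pystoi Python packages, as sketched below; the paper does not state which implementations were used, so this is only one possible way to obtain PESQ and STOI scores.

```python
from pesq import pesq      # ITU-T P.862 implementation from the 'pesq' package
from pystoi import stoi    # STOI implementation from the 'pystoi' package

def quality_scores(clean, enhanced, fs=16000):
    """Wide-band PESQ and STOI for a pair of time-aligned waveforms."""
    return {'PESQ': pesq(fs, clean, enhanced, 'wb'),
            'STOI': stoi(clean, enhanced, fs, extended=False)}
```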

As selective noise suppression has not yet been explored in the literature, it is not possible to compare the N-HANS performance on selective noise suppression with previous work. Therefore, to ensure a fair comparison, we implemented a baseline that uses the same model as the one proposed in N-HANS, but is conditioned only on the negative noise contexts. It leverages only the negative embedding subnetwork to learn a negative noise embedding, which is subsequently fed into the denoising subnetwork to assist in identifying the noise to be suppressed.

5.1.2 Data processing

For selective noise suppression, to create a large and diverse dataset of in-the-wild speech corrupted by two different types of daily-life noise, we mixed each clean spoken utterance from LibriSpeech with two randomly selected environmental recordings from AudioSet. The two environmental recordings were considered as positive and negative noise, respectively. The positive noise, negative noise, and spoken utterance were truncated, by removing the exceeding signal tails, in order to set them to the same length. Subsequently, to create contaminated audio for training, the positive and negative noises were mixed with each utterance using two SNRs randomly selected from {−3, 0, 1, 3, 5, 8} dB: one for the positive noise, i. e., SNR(+); the other for the negative one, i. e., SNR(−). Afterwards, the contaminated segments were randomly selected from the log magnitude spectrum of the contaminated audio, and the positive and negative contexts were created from the parts that did not appear in the contaminated segment.
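The mixing step can be illustrated as follows: each noise is scaled so that the speech-to-noise power ratio matches the drawn SNR and is then added to the utterance. Applying the SNR to each noise individually, as done here, is an assumption about the exact mixing convention; the helper names are also illustrative.

```python
import numpy as np

def scaled_noise(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals `snr_db`."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    return noise * np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))

def contaminate(speech, pos_noise, neg_noise, snrs=(-3, 0, 1, 3, 5, 8)):
    """Mix an utterance with a positive and a negative noise at random SNRs."""
    snr_pos, snr_neg = np.random.choice(snrs), np.random.choice(snrs)
    return (speech
            + scaled_noise(speech, pos_noise, snr_pos)
            + scaled_noise(speech, neg_noise, snr_neg))
```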

For testing, each pair of positive and negative noises was mixed with each utterance by considering all possible permutations of SNR pairs from {0, 3, 5, 8} dB. To encourage the model’s robustness, a larger variety of SNRs was considered in the training process. Positive and negative contexts were chosen from the beginning of the positive and negative noises, respectively. The test and validation sets were created once and were consistent across all experiments.

5.1.3 Results on selective noise suppression

The experimental results show that the baseline model, trained on exactly the same data as our proposed architecture, is outperformed by N-HANS on the selective noise suppression task. In Table 5, the baseline results (given in parentheses) are indicated for all evaluation metrics and all SNR combinations. The performance gains over the baseline are attributed to the introduction of the complementary auxiliary network that learns the positive noise embedding. In contrast, the baseline model, which lacks the positive embedding, is agnostic to the noise sources that should be preserved; thus, it ends up removing some of them. This comparison makes evident the importance of introducing a positive embedding to guarantee efficient selective noise suppression. The best N-HANS performance on this task is achieved in conditions in which the speech surroundings contain more energy for the positive noise than for the negative one, i. e., a lower SNR for the positive noise, or a higher SNR for the negative one. Indeed, considering the lowest SNR for the positive noise (i. e., 0 dB), a higher SNR for the negative one generally yielded a better performance for all evaluation metrics; cf. the results for SNR(+) = 0 dB and SNR(−) = 8 dB in Table 5. This is probably due to the fact that the intensity difference between positive and negative noises provides an additional cue for discriminating them; thus, a weaker negative noise is more easily suppressed by the system while the positive one is consistently preserved.

Table 5 Test results for the selective noise suppression task. SNR(+) and SNR(−) represent the Signal-to-Noise Ratio (SNR) applied to the positive and negative noises, respectively

5.1.4 Results on speech denoising

By supplying a silent audio segment as the positive recording, the N-HANS selective noise suppression system works as an environment-aware speech denoising system. The system processes an in-the-wild speech audio sample and attempts to remove the noise to the greatest possible extent, based on the identification of the speech surroundings, which are indicated by the negative recording. Considering the same test set evaluation metrics, the experimental results achieved for the LibriSpeech and AudioSet corpora indicate that N-HANS produces an audio output of reliable quality, in comparison to other systems [22, 74], in terms of speech distortion, as indicated by the levels of LSD, SDR, and MCD (cf. Table 6). Furthermore, our system yielded high STOI results for all evaluated conditions, even for the lowest SNR (STOI = 0.81 at 0 dB; cf. Table 6), which indicates the high speech intelligibility of the output.

Table 6 Test results for the speech denoising task with N-HANS trained on the LibriSpeech and AudioSet corpora considering the Evaluation Metrics: LSD, SDR, PESQ, STOI, MCD, and SSNR (cf. the caption of Table 5)

In order to position the proposed method w. r. t. existing approaches, we compare the denoising performance of N-HANS with several recently presented state-of-the-art methods, including SEGAN [47], Wavenet [49], MMSE-GAN [57], and DCUnet-20 [5]. The comparison is based on two publicly available databases: the Diverse Environments Multichannel Acoustic Noise Database (DEMAND) [59] and the Voice Bank corpus [64]; the training and test splits follow Choi et al. [5]. The results are compared in terms of PESQ, SSNR, and three additional evaluation metrics [19]: CSIG, i. e., the Mean Opinion Score (MOS) predictor of signal distortion; CBAK, i. e., the MOS predictor of background-noise intrusiveness; and COVL, i. e., the MOS predictor of the overall signal quality. The comparison of results is given in Table 7.

Table 7 Test results for the speech denoising task with SEGAN, Wavenet, MMSE-GAN, DCUnet-20, and N-HANS considering the Evaluation Metrics: CSIG, CBAK, COVL, PESQ, and SSNR (cf. the caption of Table 5)

Since N-HANS is primarily trained for selective noise suppression, the functionality of ‘conventional’ denoising, i. e., removing all surrounding noises while extracting only clean speech, is a side-product of supplying a silent segment as the positive recording. Besides, our denoising model is originally trained on LibriSpeech and AudioSet, hence, its test performance on the Voice Bank and DEMAND corpora is not optimised. Nevertheless, even under such non-optimised conditions, N-HANS still achieves comparable results w. r. t. state-of-the-art methods, as shown in Table 7. The proposed denoising model performs slightly better than SEGAN in terms of CSIG and COVL, indicating that our method achieves better signal quality and less distortion. In contrast, concerning the other three evaluation metrics, N-HANS seems to under-perform in removing the noises of the DEMAND corpus. In order to carry out a fair comparison, we also trained N-HANS on the training set of the DEMAND and Voice Bank corpora, which improved its performance on the test set for all considered evaluation metrics. Despite this general improvement, our denoising method cannot reach the performance of DCUnet-20, which is due to the fact that DCUnet is trained using the wSDR loss function [5], specially designed for enhancing the hearing experience. However, to benefit from this kind of loss, the model needs a sufficiently long audio sample as input, which in practice worsens the model’s real-time factor (RTF) at inference. Although applying wSDR and employing longer inputs have the potential to further improve the evaluation results of N-HANS, we consider that using the weighted MSE (cf. (13)) ensures real-time processing, which enables the use of the proposed model in many, if not most, realistic applications while keeping reliable denoising performance.

5.2 Source separation

5.2.1 Dataset & evaluation metrics

In order to evaluate the performance of N-HANS in the source separation task, the outcomes of our system were compared to two recently proposed state-of-the-art baselines for speech source separation [18, 35], which are not conditioned on any additional recordings. The experiments were conducted on the large and diverse VoxCeleb dataset [6, 41], which provides more than 2 000 hours of single-channel recordings, encompassing more than one million utterances (4–12 seconds each) extracted from YouTube interviews and covering more than 7 000 speakers of different nationalities. Since the dataset comes in two versions, i. e., VoxCeleb1 and VoxCeleb2, each with its own training and test partitioning (consisting of distinct speakers), we considered as training and test sets the union of the two corresponding sets from both versions. To assess the system performance, following previous work [65], the three objective evaluation metrics signal-to-distortion ratio (SDR), signal-to-artefacts ratio (SAR), and signal-to-interference ratio (SIR) were considered.
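The paper computes SDR, SAR, and SIR with the BSSEval toolbox [58] (cf. Section 5.2.3); as a rough illustration of how such scores can be obtained in Python, the snippet below uses the independent mir_eval reimplementation, which is an assumption on our part and not necessarily numerically identical to BSSEval.

```python
import numpy as np
import mir_eval

def separation_scores(reference_sources, estimated_sources):
    """SDR, SIR, and SAR for stacked sources of shape (n_sources, n_samples)."""
    sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(
        np.asarray(reference_sources), np.asarray(estimated_sources))
    return {'SDR': sdr, 'SIR': sir, 'SAR': sar, 'permutation': perm}
```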

5.2.2 Data processing

In order to improve the separation quality, we enlarged the size of the training set by randomly creating the model inputs. At each iteration, we randomly selected two speakers, i. e., the target speaker and the interference speaker, and one utterance from each of them, i. e., the target utterance and the interference utterance. To create a mixture utterance, the two utterances were truncated to the same length and subsequently mixed using a random SNR selected from a wide range, i. e., {−5, 0, 5, 10, 15, 20, 25} dB. The positive and negative contexts, which we also refer to as target and interference contexts, were created from the parts that did not appear in the mixture utterance. For creating the test set, target and interference utterances were mixed using a more restricted range of SNRs, i. e., {−5, −3, −1, 0, 1, 3, 5} dB; these SNR values were selected in order to ensure a fair comparison between our algorithm and previous work [18]. In order to encourage the model’s capability of handling real-life scenarios, a wider range of SNRs was used for training. The positive (target) and negative (interference) contexts were chosen from the beginning of the target and interference utterances, respectively. Following the procedure considered for selective noise suppression and speech denoising, the test and validation sets for source separation were created once and were consistently used across all experiments. Note that the VoxCeleb corpus is considerably more challenging than the corpora used in previous works [18, 33, 36], since it contains much larger and more diverse daily-life environments, thus promoting a more realistic understanding of the performance of the model in real-life audio applications. Experiments were performed for female and male speakers (both as target and interference) separately and together, i. e., considering two female speakers (f+f), two male speakers (m+m), and speakers of different genders (f+m); overall results including all speakers (all) are also reported in Table 8.

Table 8 Test results for the speech separation task considering the evaluation metrics signal-to-distortion ratio (SDR), signal-to-artifacts ratio (SAR), and signal-to-interference ratio (SIR), for the baseline with Deep Clustering (DC), Conv-Tasnet, and N-HANS methods

5.2.3 Results

The performance of N-HANS as a speech separation system was compared with the outcomes of two baseline models re-implemented on the VoxCeleb dataset [18]: one based on Deep Clustering (DC) [10, 11]; the other based on Conv-Tasnet [36]. In separating the speech signals, concerning signal-to-distortion ratio, signal-to-artefacts ratio, and signal-to-interference ratio (evaluation metrics computed using the BSSEval toolbox [58]), N-HANS outperforms the DC baseline by a large margin: a two-tailed t-test yielded p < .0004 for the three metrics across the evaluated groups (cf. SDR, SAR, and SIR for DC and N-HANS in Table 8). Compared with the Conv-Tasnet baseline, N-HANS also presents a significant improvement for SDR and SIR: p < .008; although N-HANS outperforms Conv-Tasnet in all the evaluated conditions concerning SAR, the improvement for this evaluation metric is not statistically significant: p = .558 (cf. the results for all the conditions of SAR for Conv-Tasnet and N-HANS in Table 8). The presented results show that, although DC and Conv-Tasnet had achieved good separation outcomes on the WSJ0 and TIMIT corpora [10, 11], their performance decreased noticeably when processing the VoxCeleb corpus, which presents a higher complexity. Indeed, the superior performance of N-HANS on this challenging dataset indicates the robustness of the method presented herein.

Concerning the speakers’ gender, both models, i. e., the baseline (DC) and N-HANS, perform better on speakers of different genders than on speakers of the same gender (cf. f+m in Table 8). This phenomenon can be attributed to the fact that speech signals from speakers of the same gender share similar acoustic properties, which makes the mixed spectrum more challenging to separate. Comparing the outcomes of the experiments performed on speakers of the same gender with those on speakers of different genders, we observe a larger average performance gap between our proposed model and the baseline methods for all three evaluation metrics. We conclude from this observation that, especially in the challenging same-gender condition, conditioning our model on the additional context recordings provides valuable information assisting the separation process. In addition, our source separation system overcomes the label permutation problem [29], an issue that has only recently been addressed [10, 29, 77]. The enhanced network, by learning from the additional target and interference recordings, receives indications of the speaker labels, i. e., ‘target’ and ‘interference’, which facilitates the separation of a mixture utterance and, at the same time, prevents the label permutation problem.

6 Performance visualisation

To illustrate the performance of N-HANS for its three functionalities, i. e., selective noise suppression, speech denoising, and speech source separation, we depict its processing procedure for some noisy speech samples in Figs. 4, 5, and 6, respectively.

Fig. 4
figure 4

Spectrograms illustrating the audio components involved in the N-HANS selective noise suppression system, i. e., the clean spoken utterance (speech), the contaminated audio (noisy), the ideal result (target), the negative and positive noises, and the achieved outcome (denoised)

Fig. 5
figure 5

Spectrograms illustrating the audio components involved in the N-HANS denoising system, i. e., the clean spoken utterance (speech), the contaminated audio (noisy), the interfering noise, and the achieved outcome (denoised)

Fig. 6
figure 6

Spectrograms illustrating the audio components involved in the N-HANS source separation system, i. e., the mixture of the two speakers, and the target and interference speakers before (above) and after (below) being separated by the system

6.1 Selective noise suppression

In each sub-figure of Fig. 4, an example of the clean spoken utterance, the noisy background (mixture of positive and negative noises), the target (mixture of the spoken utterance and the positive noise), the positive and negative noises, and the denoised sample (the system output) are presented. The N-HANS selective noise suppression system takes the noisy spectrum as input and removes only the negative noise; thus, the output is expected to be closest to the target spectrum, which includes only speech and positive noise. For a consistent negative noise concentrated in a narrow frequency range, e. g., that shown in Fig. 4a, our system is able to filter out the noise and maximally retain the speech components and the positive noise. Furthermore, processing noise that jumps across a wide range of the frequency axis, as shown in Fig. 4b, is usually a big challenge for most denoising systems; N-HANS also shows good performance under such conditions. Finally, the system’s capability to recover speech signals under strong noise conditions is displayed in Fig. 4c, where the speech components masked by the negative noise in the noisy spectrum reappear in the denoised output.

6.2 Denoising

In each sub-figure of Fig. 5, an example of the clean utterance, the noisy utterance, the background noise, and the denoised sample (the system output) are displayed to represent the processing procedure of the N-HANS speech denoising system. In Fig. 5a, a spoken utterance severely covered by a strong industrial noise at an SNR of 0 dB is shown. Despite the difficulty of visually recognising the speech content in the noisy spectrum, the system is able to recover the main voice components, thereby enhancing the speech quality of the noisy audio sample. When processing non-continuous noises, i. e., those characterised by specific and isolated impulses, such as those shown in Fig. 5b and c, the denoising system is capable of removing the noise based on the additional noise recordings. Note that when the surrounding environment contains a noise type with acoustic properties similar to those of the voice (cf. Figure 5c), the system might, in order to suppress the noise as much as possible, distort the estimated speech spectrum to some extent. Yet, such distortions have very limited influence on normal human hearing perception. In addition, although the examples given for selective noise suppression referred only to narrow-band and non-stationary noises (cf. Fig. 4), the high performance of N-HANS in the suppression of wide-band stationary noise (cf. Figure 5) indicates that wide-band stationary noises, characterised by rich contexts, promote the system’s ability to capture the noise to be suppressed, a principle that holds for selective noise suppression too.

6.3 Speech separation

In each sub-figure of Fig. 6, an example of the mixture utterance, the target and interference utterances, and the resulting separated outputs are displayed in order to represent the N-HANS source separation performance. For each example, the mixture speech was composed of two utterances, each produced by a different speaker from the VoxCeleb test set. The source separation system takes the mixture as input and produces the separated target and interference speech, which are depicted in the right column. Figure 6a shows the separation performance for two speakers of different genders, while Fig. 6b and c do so for two speakers of the same gender. For the three mixture conditions, the system successfully separates the target from the interference speaker, as shown by the comparison between the separated target and the separated interference w. r. t. their original spectra, i. e., the target and the interference (cf. the lower plots w. r. t. the upper plots of each sub-figure). This is particularly clear in Fig. 6a, i. e., in the separation of speakers of different genders. For instance, although the target speech in the mixture utterance is particularly disturbed by the interference speech at 0.8 s and 1.5 s, the system is able to suppress the interference components to the maximum extent (cf. Fig. 6a). In addition, the target utterance presents fine spectral detail at low frequencies (under 1 kHz), which is smeared by the interference speech in the mixture spectrum. The source separation system can jointly estimate the amounts of speech components in each time-frequency bin in order to recover the target speech with high clarity in the low-frequency range.

7 Conclusions and outlook

We have presented N-HANS, an open-source toolkit for audio denoising, source separation, and selective noise suppression, based on our proposed ±Auxiliary Network. To the best of the authors’ knowledge, it is the first toolkit to provide selective noise suppression. Conditioned on reference recordings, N-HANS is capable of adapting to different unseen environments and audio sources, such as speakers. N-HANS can perform audio enhancement as a front end to interface with other audio-related tools such as openXBOW [54] and auDeep [8], both of which have been broadly applied for audio feature extraction. Future work on audio enhancement should focus on improving speech intelligibility in extremely low SNR cases and on overcoming the distortions that are occasionally introduced in the audio. For audio separation, more work will be needed to extend the system to an arbitrary number of audio sources, including, for instance, music source separation.