Real-time joint noise suppression and bandwidth extension of noisy reverberant wideband speech

Esteban Gómez*† and Tom Bäckström*

*Department of Information and Communications Engineering, Aalto University, Espoo, Finland
†Voicemod S.L., Valencia, Spain

{esteban.gomezmellado, tom.backstrom}@aalto.fi

Abstract

Artificially extending the bandwidth of speech in band-limited scenarios that use sample rates of 16 kHz (known as wideband) or lower, such as VoIP or some Bluetooth applications, can significantly improve its perceptual quality. Typically, clean speech is assumed as input to estimate the missing spectral information. However, this assumption falls short when the speech has been contaminated by noise or reverb, resulting in audible artifacts. We propose a low-complexity multitasking neural network capable of performing noise suppression and bandwidth extension from 16 kHz to 48 kHz (fullband) in real time on a CPU, mitigating such issues even if the noise cannot be completely removed from the input. Instead of employing a monolithic model, we adopt a modular approach and complexity reduction methods that result in a more compact model than the sum of its parts while improving its performance.

Keywords: bandwidth extension, noise suppression, real-time, deep learning, multitasking


Audio examples: Noisy dry speech

In the following examples, the input to the model corresponds to noisy dry speech. The network's task is to simultaneously suppress the noise and to predict the missing spectral content without extending the bandwidth of any residual noise if present.
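
To make "missing spectral content" concrete, the following minimal sketch shows that plain resampling from 16 kHz to 48 kHz only changes the sample rate and leaves the band above 8 kHz empty, which is the gap a bandwidth extension model has to fill. It is not part of the proposed network. The two-tone test signal, the ideal FFT-based resampler, and the helper names `fft_resample` and `band_energy` are assumptions made purely for illustration.

```python
import numpy as np

FS_WB = 16_000  # wideband sample rate (Hz)
FS_FB = 48_000  # fullband sample rate (Hz)


def fft_resample(x, n_out):
    """Ideal band-limited resampling of a periodic signal.

    Hypothetical helper: truncates or zero-pads the spectrum, so no new
    frequency content is ever created.
    """
    n_in = len(x)
    spec = np.fft.rfft(x)
    out_spec = np.zeros(n_out // 2 + 1, dtype=complex)
    k = min(len(spec), len(out_spec))
    out_spec[:k] = spec[:k]
    # Rescale so that tone amplitudes are preserved across sample rates.
    return np.fft.irfft(out_spec, n_out) * (n_out / n_in)


def band_energy(x, fs, f_lo, f_hi):
    """Spectral energy of x in the band [f_lo, f_hi) Hz."""
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return power[(freqs >= f_lo) & (freqs < f_hi)].sum()


# One second of a synthetic "fullband" signal: a 1 kHz and a 12 kHz tone.
t = np.arange(FS_FB) / FS_FB
fullband = np.sin(2 * np.pi * 1000 * t) + 0.5 * np.sin(2 * np.pi * 12000 * t)

# Band-limit to wideband (16 kHz), then naively upsample back to 48 kHz.
wideband = fft_resample(fullband, FS_WB)
upsampled = fft_resample(wideband, FS_FB)

low_ref = band_energy(fullband, FS_FB, 0, 8000)
low_up = band_energy(upsampled, FS_FB, 0, 8000)
high_ref = band_energy(fullband, FS_FB, 8000, 24000)
high_up = band_energy(upsampled, FS_FB, 8000, 24000)

print(f"energy below 8 kHz kept: {low_up / low_ref:.3f}")
print(f"energy above 8 kHz kept: {high_up / high_ref:.3e}")
```

In this toy setting the content below 8 kHz survives the round trip, while the 12 kHz tone is lost and is not recovered by upsampling alone. Real speech and residual noise are far less tidy than two sinusoids, so this only illustrates the band that is missing, not how the model estimates it.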

Example #1
Noisy input (16kHz)
Ground truth output (48kHz)
NS + BWE
NSL + BWE
NSL + BWEF
NS + BWEL
NSF + BWEL
NSL + BWEL

Example #2
Noisy input (16kHz)
Ground truth output (48kHz)
NS + BWE
NSL + BWE
NSL + BWEF
NS + BWEL
NSF + BWEL
NSL + BWEL

Example #3
Noisy input (16kHz)
Ground truth output (48kHz)
NS + BWE
NSL + BWE
NSL + BWEF
NS + BWEL
NSF + BWEL
NSL + BWEL

Audio examples: Noisy reverberant speech

In the following examples, the input to the model corresponds to noisy reverberant speech. In this case, along with suppressing the noise, the bandwidth must be extended for both the speech features and the room characteristics, without propagating spurious extensions of any residual noise if present.

Example #4
Noisy input (16kHz)
Ground truth output (48kHz)
NS + BWE
NSL + BWE
NSL + BWEF
NS + BWEL
NSF + BWEL
NSL + BWEL

Example #5
Noisy input (16kHz)
Ground truth output (48kHz)
NS + BWE
NSL + BWE
NSL + BWEF
NS + BWEL
NSF + BWEL
NSL + BWEL

Example #6
Noisy input (16kHz)
Ground truth output (48kHz)
NS + BWE
NSL + BWE
NSL + BWEF
NS + BWEL
NSF + BWEL
NSL + BWEL

Acknowledgment

The calculations presented in this publication were carried out using the computer resources of the Aalto University School of Science “Science-IT” project.