Low-complexity Real-time Neural Network for Blind Bandwidth Extension of Wideband Speech

Esteban Gómez*†, Mohammad Hassan Vali and Tom Bäckström

*Department of Information and Communications Engineering, Aalto University, Espoo, Finland
†Voicemod S.L., Valencia, Spain

{esteban.gomezmellado, mohammad.vali, tom.backstrom}@aalto.fi

[Figure: BBWEXNet overview]

Abstract

Speech is streamed at sample rates of 16 kHz or lower in many applications (e.g. VoIP, Bluetooth headsets). Extending its bandwidth can produce significant quality improvements. We introduce BBWEXNet, a lightweight neural network that performs blind bandwidth extension of speech from 16 kHz (wideband) to 48 kHz (fullband) in real time on a CPU. Our low-latency approach allows the model to run with a maximum algorithmic delay of 16 ms, enabling end-to-end communication in streaming services and in scenarios where the GPU is busy or unavailable. We propose a series of optimizations that take advantage of the U-Net architecture and of vector quantization methods commonly used in speech coding. The resulting model performs comparably to previous real-time solutions while approximately halving the memory footprint and computational cost. Moreover, we show that the model complexity can be reduced further with only a marginal impact on the perceived output quality.
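To put the figures in the abstract in concrete terms, the short sketch below converts the 16 ms maximum algorithmic delay into sample counts at the input and output rates. It is only illustrative arithmetic: the helper name `samples_for_delay` is hypothetical, and the abstract does not say how the delay is split between frame size and lookahead, so nothing here should be read as the model's actual framing.

```python
# Illustrative arithmetic only; names are hypothetical, not from the paper.

INPUT_RATE_HZ = 16_000   # wideband input rate stated in the abstract
OUTPUT_RATE_HZ = 48_000  # fullband output rate stated in the abstract
MAX_DELAY_MS = 16        # maximum algorithmic delay stated in the abstract


def samples_for_delay(delay_ms: float, sample_rate_hz: int) -> int:
    """Return the number of samples spanned by delay_ms at sample_rate_hz."""
    return round(delay_ms * sample_rate_hz / 1000)


# Ratio between output and input sample rates.
upsampling_factor = OUTPUT_RATE_HZ // INPUT_RATE_HZ

# The same 16 ms expressed in samples at each rate.
input_samples = samples_for_delay(MAX_DELAY_MS, INPUT_RATE_HZ)
output_samples = samples_for_delay(MAX_DELAY_MS, OUTPUT_RATE_HZ)

print(f"upsampling factor: {upsampling_factor}")
print(f"{MAX_DELAY_MS} ms at {INPUT_RATE_HZ} Hz: {input_samples} samples")
print(f"{MAX_DELAY_MS} ms at {OUTPUT_RATE_HZ} Hz: {output_samples} samples")
```

In other words, a 16 ms delay budget covers 256 input samples at 16 kHz, and the same time span holds three times as many samples (768) at 48 kHz.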

Keywords: bandwidth extension, speech processing, real-time, deep learning


Audio examples

Unprocessed (16k)
Baseline (16k→48k)
Baseline Shuffle (16k→48k)
BBWEXNet NSVQ6 (16k→48k)
BBWEXNet NSVQ8 (16k→48k)
BBWEXNet NSVQ6 Shuffle (16k→48k)
BBWEXNet NSVQ8 Shuffle (16k→48k)
BBWEXNet NSVQ6 Shuffle (16k→32k)
BBWEXNet NSVQ8 Shuffle (16k→32k)
Ground truth (48k)

Acknowledgment

The calculations presented in this publication were carried out using the computer resources of the Aalto University School of Science “Science-IT” project.