Demo - Robust real-time bandwidth extension for headsets with a microphone conversion target

Esteban Gómez and Tom Bäckström

Department of Information and Communications Engineering, Aalto University, Espoo, Finland

{esteban.gomezmellado, tom.backstrom}@aalto.fi

[Figure: network structure]

Abstract

Headsets are widely used in video conferences and online events, offering a compact combination of headphones, a microphone, and an analog-to-digital converter (ADC) in a single product. However, their microphones often have lower perceptual audio quality than studio microphones: their typically smaller diaphragms cannot faithfully capture the lower frequency range of human speech, and their ADCs are often limited to sample rates of 16 kHz (also known as wideband) or lower. This limitation is particularly prevalent in Bluetooth headsets using the Hands-Free Profile (HFP) or Headset Profile (HSP). Artificially extending the input's bandwidth can improve perceptual audio quality; however, in practical applications, reverberation and background noise can significantly degrade the performance of bandwidth extension algorithms. Furthermore, the maximum achievable quality in the extended bandwidth is limited by the source characteristics.

We propose a multitasking neural network for speech enhancement that effectively reduces noise and reverberation in a wideband input while generating a clean, full-band audio output (48 kHz). Unlike previous approaches, our solution uses anechoic studio microphone recordings as a high-quality target, which allows simultaneous reconstruction and enhancement of both the low frequencies affected by diaphragm size limitations and the high frequencies lost to ADC constraints. Our network is designed to operate in real-time, low-complexity setups, allowing it to run on consumer laptops using only the CPU or to be deployed on-device in hardware solutions. Model code and dataset recordings are freely available to encourage further research.

Keywords: Bandwidth extension, Dereverberation, Noise suppression, Real-time, Low-complexity

🎧 We highly recommend using wired headphones to better discern the differences between the audio samples provided below. Additionally, please make sure that your system playback is set to a sample rate of at least 48 kHz.

Bandwidth extension

In the following examples, the model input consists of wideband clean dry speech. The network's task is to extend the bandwidth and modify the frequency response to match that of the target microphone.
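As a rough illustration of what extending the bandwidth means here, the sketch below works out the band that a 16 kHz wideband input cannot represent but a 48 kHz full-band output can. The two sample rates come from the abstract; the function and constant names are hypothetical and are not taken from the authors' code.

```python
def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency representable at a given sample rate."""
    return sample_rate_hz / 2


def missing_band_hz(input_rate_hz: int, output_rate_hz: int) -> tuple:
    """Band (low, high) that the output can represent but the input cannot."""
    return nyquist_hz(input_rate_hz), nyquist_hz(output_rate_hz)


WIDEBAND_RATE_HZ = 16_000  # input sample rate stated in the abstract
FULLBAND_RATE_HZ = 48_000  # output sample rate stated in the abstract

low, high = missing_band_hz(WIDEBAND_RATE_HZ, FULLBAND_RATE_HZ)
upsampling_factor = FULLBAND_RATE_HZ // WIDEBAND_RATE_HZ

print(f"Band to be generated: {low:.0f} Hz to {high:.0f} Hz")
print(f"Output has {upsampling_factor}x as many samples per second as the input")
```

This covers only the high-frequency part of the task. The conversion to the target microphone's response also reshapes frequencies that the input already contains, including the low range.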

Source headset microphone: IGH1000 → Target studio microphone: C414

Input

Predicted (no conversion)

Predicted (converted)

Target

Source headset microphone: IGH1000 → Target studio microphone: SM57

Input

Predicted (no conversion)

Predicted (converted)

Target

Source headset microphone: NSX10 → Target studio microphone: C414

Input

Predicted (no conversion)

Predicted (converted)

Target

Source headset microphone: NSX10 → Target studio microphone: SM57

Input

Predicted (no conversion)

Predicted (converted)

Target

Bandwidth extension + dereverberation

In the following examples, the model input consists of wideband clean reverberant speech. The network's task is to attenuate the reverberation, extend the bandwidth and modify the frequency response to match that of the target microphone.
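One common way to obtain a reverberant signal for experimentation is to convolve dry speech with a room impulse response. The sketch below illustrates that idea in plain Python, assuming a toy impulse response made of exponentially decaying noise. This page does not state how the reverberant inputs were produced, so synthetic_rir, convolve, the sine-wave stand-in for speech, and the 50 ms decay time are illustrative assumptions only.

```python
import math
import random


def synthetic_rir(sample_rate_hz, rt60_s, seed=0):
    """Toy room impulse response: noise decaying by 60 dB over rt60_s (hypothetical)."""
    rng = random.Random(seed)
    n = max(1, round(sample_rate_hz * rt60_s))
    decay = math.log(1000.0) / n  # amplitude factor of 1000 corresponds to 60 dB
    rir = [rng.gauss(0.0, 1.0) * math.exp(-decay * i) for i in range(n)]
    rir[0] = 1.0  # direct path
    return rir


def convolve(signal, kernel):
    """Direct-form linear convolution (slow, but dependency-free)."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


SAMPLE_RATE_HZ = 16_000  # wideband rate stated in the abstract

# 10 ms of a 220 Hz sine as a stand-in for dry speech.
dry = [math.sin(2 * math.pi * 220 * t / SAMPLE_RATE_HZ) for t in range(160)]
rir = synthetic_rir(SAMPLE_RATE_HZ, rt60_s=0.05)
wet = convolve(dry, rir)  # reverberant version, longer than dry by the RIR tail
```

The output is longer than the dry signal by the length of the impulse response minus one sample, which is the reverberation tail that a dereverberation model is asked to attenuate.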

Source headset microphone: IGH1000 → Target studio microphone: C414

Input

Predicted (no conversion)

Predicted (converted)

Target

Source headset microphone: IGH1000 → Target studio microphone: SM57

Input

Predicted (no conversion)

Predicted (converted)

Target

Source headset microphone: NSX10 → Target studio microphone: C414

Input

Predicted (no conversion)

Predicted (converted)

Target

Source headset microphone: NSX10 → Target studio microphone: SM57

Input

Predicted (no conversion)

Predicted (converted)

Target

Bandwidth extension + noise suppression

In the following examples, the model input consists of wideband noisy dry speech. The network's task is to remove the noise, extend the bandwidth and modify the frequency response to match that of the target microphone.
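A noisy signal for experimentation is often built by adding noise to clean speech at a chosen signal-to-noise ratio (SNR). The sketch below shows that mixing step in plain Python. The page does not specify the noise types or SNR values used for these samples, so the 10 dB value, the Gaussian noise, the sine-wave stand-in for speech, and the helper names are assumptions made for illustration.

```python
import math
import random


def power(x):
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that speech-to-noise power ratio equals snr_db, then add it."""
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)], gain


SAMPLE_RATE_HZ = 16_000  # wideband rate stated in the abstract

rng = random.Random(0)
# 100 ms of a 220 Hz sine as a stand-in for clean speech.
speech = [math.sin(2 * math.pi * 220 * t / SAMPLE_RATE_HZ) for t in range(1600)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1600)]

noisy, gain = mix_at_snr(speech, noise, snr_db=10.0)

# Verify the SNR actually achieved by the scaled noise.
achieved_snr_db = 10 * math.log10(power(speech) / power([gain * n for n in noise]))
```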

Source headset microphone: IGH1000 → Target studio microphone: C414

Input

Predicted (no conversion)

Predicted (converted)

Target

Source headset microphone: IGH1000 → Target studio microphone: SM57

Input

Predicted (no conversion)

Predicted (converted)

Target

Source headset microphone: NSX10 → Target studio microphone: C414

Input

Predicted (no conversion)

Predicted (converted)

Target

Source headset microphone: NSX10 → Target studio microphone: SM57

Input

Predicted (no conversion)

Predicted (converted)

Target

Bandwidth extension + dereverberation + noise suppression

In the following examples, the model input consists of wideband noisy reverberant speech. The network's task is to remove the noise, attenuate the reverberation, extend the bandwidth and modify the frequency response to match that of the target microphone.