Headsets are widely used in video conferences and online events, combining headphones, a microphone, and an analog-to-digital converter (ADC) in a single compact product.
However, their microphones often have lower perceptual audio quality than studio microphones: their typically smaller diaphragms cannot faithfully capture the lower frequency
range of human speech, and their ADCs are often limited to sample rates of 16 kHz (also known as wideband) or lower.
This limitation is particularly prevalent in Bluetooth headsets using Hands-Free Profile (HFP) or Headset Profile (HSP).
Artificially extending the input's bandwidth can improve perceptual audio quality; however, in practical applications, reverberation and background noise can significantly degrade
the performance of bandwidth extension algorithms. Furthermore, the maximum achievable quality in the extended bandwidth is limited by the source characteristics.
We propose a multitasking neural network for speech enhancement that reduces noise and reverberation in a wideband input while generating a clean, full-band (48 kHz) audio
output. Unlike previous approaches, our solution uses anechoic studio microphone recordings as a high-quality target, allowing simultaneous
reconstruction and enhancement of both the low frequencies, which are affected by diaphragm size limitations, and the high frequencies, which are lost to ADC constraints. Our network is designed for
real-time, low-complexity operation, so it can run on consumer laptops using only the CPU or be deployed on-device in hardware products. Model code and dataset
recordings are freely available to encourage further research.
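To make the sample-rate constraint concrete, the following minimal Python sketch illustrates why a 16 kHz stream cannot carry content above 8 kHz, and how much of a 48 kHz output therefore has to be generated rather than recovered. It is an illustration of the sampling theorem only and is not part of the proposed model; the function names `nyquist_hz` and `sample_cosine` and the chosen test tones (10 kHz and 6 kHz) are assumptions made for this example.

```python
import math


def nyquist_hz(sample_rate_hz):
    """Highest frequency representable at the given sample rate."""
    return sample_rate_hz / 2.0


def sample_cosine(freq_hz, sample_rate_hz, num_samples):
    """Sample a unit-amplitude cosine at the given sample rate."""
    return [
        math.cos(2.0 * math.pi * freq_hz * n / sample_rate_hz)
        for n in range(num_samples)
    ]


WIDEBAND_HZ = 16_000  # headset ADC rate named in the text
FULLBAND_HZ = 48_000  # model output rate named in the text

# A 10 kHz tone lies above the 8 kHz Nyquist limit of a 16 kHz stream.
# Its samples coincide with those of a 6 kHz tone (aliasing), so the two
# cannot be told apart after sampling. The tone frequencies are
# illustrative choices, not values from the source.
above_limit = sample_cosine(10_000, WIDEBAND_HZ, 64)
alias = sample_cosine(6_000, WIDEBAND_HZ, 64)
max_diff = max(abs(a - b) for a, b in zip(above_limit, alias))

# Portion of the full-band spectrum that the wideband input cannot contain.
missing_band_hz = nyquist_hz(FULLBAND_HZ) - nyquist_hz(WIDEBAND_HZ)
missing_fraction = missing_band_hz / nyquist_hz(FULLBAND_HZ)

print(f"wideband Nyquist limit: {nyquist_hz(WIDEBAND_HZ):.0f} Hz")
print(f"full-band Nyquist limit: {nyquist_hz(FULLBAND_HZ):.0f} Hz")
print(f"largest sample difference between the two tones: {max_diff:.2e}")
print(f"share of output spectrum absent from the input: {missing_fraction:.2f}")
```

Under these assumptions, the band from 8 kHz to 24 kHz, two thirds of the full-band spectrum, is absent from a wideband input, which is why conventional resampling alone cannot restore it.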
🎧 We highly recommend using wired headphones to better discern the differences between the audio samples provided below. Additionally, please make sure that your system playback is set to a sample rate of at least 48 kHz.
In the following examples, the model input consists of wideband clean dry speech. The network's task is to extend the bandwidth and modify the frequency response to match that of the target microphone.
Source headset microphone: IGH1000 → Target studio microphone: C414
Input
Predicted (no conversion)
Predicted (converted)
Target
Source headset microphone: IGH1000 → Target studio microphone: SM57
Input
Predicted (no conversion)
Predicted (converted)
Target
Source headset microphone: NSX10 → Target studio microphone: C414
Input
Predicted (no conversion)
Predicted (converted)
Target
Source headset microphone: NSX10 → Target studio microphone: SM57
In the following examples, the model input consists of wideband clean reverberant speech. The network's task is to attenuate the reverberation, extend the bandwidth and modify the frequency response to match that of the target microphone.
Source headset microphone: IGH1000 → Target studio microphone: C414
Input
Predicted (no conversion)
Predicted (converted)
Target
Source headset microphone: IGH1000 → Target studio microphone: SM57
Input
Predicted (no conversion)
Predicted (converted)
Target
Source headset microphone: NSX10 → Target studio microphone: C414
Input
Predicted (no conversion)
Predicted (converted)
Target
Source headset microphone: NSX10 → Target studio microphone: SM57
In the following examples, the model input consists of wideband noisy dry speech. The network's task is to remove the noise, extend the bandwidth and modify the frequency response to match that of the target microphone.
Source headset microphone: IGH1000 → Target studio microphone: C414
Input
Predicted (no conversion)
Predicted (converted)
Target
Source headset microphone: IGH1000 → Target studio microphone: SM57
Input
Predicted (no conversion)
Predicted (converted)
Target
Source headset microphone: NSX10 → Target studio microphone: C414
Input
Predicted (no conversion)
Predicted (converted)
Target
Source headset microphone: NSX10 → Target studio microphone: SM57
Input
Predicted (no conversion)
Predicted (converted)
Target
Bandwidth extension + dereverberation + noise suppression
In the following examples, the model input consists of wideband noisy reverberant speech. The network's task is to remove the noise, attenuate the reverberation, extend the bandwidth, and modify the frequency response to match that of the target microphone.
Source headset microphone: IGH1000 → Target studio microphone: C414
Input
Predicted (no conversion)
Predicted (converted)
Target
Source headset microphone: IGH1000 → Target studio microphone: SM57
Input
Predicted (no conversion)
Predicted (converted)
Target
Source headset microphone: NSX10 → Target studio microphone: C414
Input
Predicted (no conversion)
Predicted (converted)
Target
Source headset microphone: NSX10 → Target studio microphone: SM57