USED: Universal Speaker Extraction and Diarization


ArXiv: arXiv:2309.10674

Authors

  • Junyi Ao (The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China)
  • Mehmet Sinan Yildirim (Department of ECE, National University of Singapore, Singapore)
  • Meng Ge (Department of ECE, National University of Singapore, Singapore)
  • Shuai Wang (Shenzhen Research Institute of Big Data, Shenzhen, China)
  • Ruijie Tao (Department of ECE, National University of Singapore, Singapore)
  • Yanmin Qian (Shanghai Jiao Tong University, Shanghai, China)
  • Liqun Deng (Huawei Noah's Ark Lab)
  • Longshuai Xiao (AI and All-scenario Intelligence Development, Huawei)
  • Haizhou Li (The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China)

Contact Email: junyiao1@link.cuhk.edu.cn

Abstract

Speaker extraction and diarization are two crucial enabling techniques for speech applications. Speaker extraction aims to extract a target speaker's voice from a multi-talker mixture, while speaker diarization demarcates speech segments by speaker, identifying "who spoke when". Previous studies have typically treated the two tasks independently. However, the two tasks share a similar objective: to disentangle the speakers, in the spectral domain for the former and in the temporal domain for the latter. It is logical to believe that the speaker turns obtained from speaker diarization can benefit speaker extraction, while the extracted speech offers more accurate speaker turns than the mixture speech. In this paper, we propose a unified framework called Universal Speaker Extraction and Diarization (USED). We extend the existing speaker extraction model to simultaneously extract the waveforms of all speakers. We also employ a scenario-aware differentiated loss function to address the problem of sparsely overlapped speech in real-world conversations. We show that the USED model significantly outperforms the baselines for both speaker extraction and diarization tasks, in both highly overlapped and sparsely overlapped scenarios.

Contents

1 Noisy LibriMix (Min Mode)
1.1 Libri2Mix
1.2 Libri3Mix
2 Noisy LibriMix (Max Mode)
2.1 Libri2Mix
2.2 Libri3Mix
3 Noisy SparseLibriMix
3.1 Overlap Ratio 0
3.2 Overlap Ratio 0.4
3.3 Overlap Ratio 0.8

Audio Samples

Noisy Libri2Mix (Min Mode)

Mixture Speech and Target Speeches

Mixture Speech
Ground Truth of Speaker 1 Ground Truth of Speaker 2

Baseline: SpEx+

Speaker 1 (SI-SDRi = 13.58 dB) Speaker 2 (SI-SDRi = 1.00 dB)

USED

Speaker 1 (SI-SDRi = 13.94 dB) Speaker 2 (SI-SDRi = 13.00 dB)

Noisy Libri3Mix (Min Mode)

Mixture Speech and Target Speeches

Mixture Speech
Ground Truth of Speaker 1 Ground Truth of Speaker 2 Ground Truth of Speaker 3

Baseline: SpEx+

Speaker 1 (SI-SDRi = 10.94 dB) Speaker 2 (SI-SDRi = 13.35 dB) Speaker 3 (SI-SDRi = 15.14 dB)

USED

Speaker 1 (SI-SDRi = 16.60 dB) Speaker 2 (SI-SDRi = 13.77 dB) Speaker 3 (SI-SDRi = 16.49 dB)

Noisy Libri2Mix (Max Mode)

Mixture Speech and Target Speeches

Mixture Speech
Ground Truth of Speaker 1 Ground Truth of Speaker 2

Baseline: SpEx+

Speaker 1 (SI-SDRi = 10.77 dB) Speaker 2 (SI-SDRi = 8.86 dB)

USED

Speaker 1 (SI-SDRi = 11.30 dB) Speaker 2 (SI-SDRi = 16.22 dB)

Noisy Libri3Mix (Max Mode)

Mixture Speech and Target Speeches

Mixture Speech
Ground Truth of Speaker 1 Ground Truth of Speaker 2 Ground Truth of Speaker 3

Baseline: SpEx+

Speaker 1 (SI-SDRi = 22.17 dB) Speaker 2 (SI-SDRi = 5.97 dB) Speaker 3 (SI-SDRi = 15.99 dB)

USED

Speaker 1 (SI-SDRi = 23.86 dB) Speaker 2 (SI-SDRi = 12.38 dB) Speaker 3 (SI-SDRi = 16.91 dB)

Noisy SparseLibri3Mix

Overlap Ratio 0

Mixture Speech and Target Speeches
Mixture Speech
Ground Truth of Speaker 1 Ground Truth of Speaker 2 Ground Truth of Speaker 3

Baseline: SpEx+
Speaker 1 (SI-SDRi = 22.19 dB) Speaker 2 (SI-SDRi = 11.86 dB) Speaker 3 (SI-SDRi = 6.33 dB)

USED
Speaker 1 (SI-SDRi = 22.71 dB) Speaker 2 (SI-SDRi = 15.62 dB) Speaker 3 (SI-SDRi = 17.16 dB)

Overlap Ratio 0.4

Mixture Speech and Target Speeches
Mixture Speech
Ground Truth of Speaker 1 Ground Truth of Speaker 2 Ground Truth of Speaker 3

Baseline: SpEx+
Speaker 1 (SI-SDRi = 13.95 dB) Speaker 2 (SI-SDRi = 0.88 dB) Speaker 3 (SI-SDRi = 4.59 dB)

USED
Speaker 1 (SI-SDRi = 20.13 dB) Speaker 2 (SI-SDRi = 12.27 dB) Speaker 3 (SI-SDRi = 13.12 dB)

Overlap Ratio 0.8

Mixture Speech and Target Speeches
Mixture Speech
Ground Truth of Speaker 1 Ground Truth of Speaker 2 Ground Truth of Speaker 3

Baseline: SpEx+
Speaker 1 (SI-SDRi = 13.58 dB) Speaker 2 (SI-SDRi = 0.94 dB) Speaker 3 (SI-SDRi = 6.59 dB)

USED
Speaker 1 (SI-SDRi = 18.28 dB) Speaker 2 (SI-SDRi = 11.92 dB) Speaker 3 (SI-SDRi = 13.25 dB)
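
Note on the metric: every sample above is labelled with SI-SDRi, the improvement in scale-invariant signal-to-distortion ratio of the extracted speech over the unprocessed mixture, measured against the ground-truth speech. The sketch below shows one common way to compute it. It is an illustration only and not the evaluation code behind the numbers on this page; the function names `si_sdr` and `si_sdr_improvement`, the `eps` constant, and the toy sine-wave signals are all assumptions made for the example.

```python
import math
from typing import Sequence


def _zero_mean(x: Sequence[float]) -> list:
    """Remove the mean so a DC offset does not affect the score."""
    mean = sum(x) / len(x)
    return [v - mean for v in x]


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(p * q for p, q in zip(a, b))


def si_sdr(estimate: Sequence[float], reference: Sequence[float],
           eps: float = 1e-8) -> float:
    """Scale-invariant SDR in dB of `estimate` with respect to `reference`."""
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    est = _zero_mean(estimate)
    ref = _zero_mean(reference)
    # Project the estimate onto the reference; the projection is the
    # "target" part, and whatever is left over counts as distortion.
    scale = _dot(est, ref) / (_dot(ref, ref) + eps)
    target = [scale * r for r in ref]
    noise = [e - t for e, t in zip(est, target)]
    return 10.0 * math.log10((_dot(target, target) + eps)
                             / (_dot(noise, noise) + eps))


def si_sdr_improvement(estimate: Sequence[float], mixture: Sequence[float],
                       reference: Sequence[float]) -> float:
    """SI-SDRi: SI-SDR of the estimate minus SI-SDR of the raw mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


# Toy usage with synthetic signals (hypothetical data, not LibriMix audio).
reference = [math.sin(0.05 * n) for n in range(2000)]
interferer = [math.sin(0.31 * n + 1.0) for n in range(2000)]
mixture = [r + i for r, i in zip(reference, interferer)]
estimate = [r + 0.1 * i for r, i in zip(reference, interferer)]
print(f"SI-SDRi = {si_sdr_improvement(estimate, mixture, reference):.2f} dB")
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the extracted waveform does not change the score, so a system cannot gain by simply making its output louder or quieter.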

*This page is modified from https://speechresearch.github.io/fastspeech2/.