A unified system for voice cloning and voice conversion through diffusion probabilistic modeling (INTERSPEECH 2022 review)

Under review as a conference paper at INTERSPEECH 2022

Abstract

Text-to-speech and voice conversion are two common speech generation tasks typically solved using different models. In this paper, we present a novel approach to voice cloning and any-to-any voice conversion relying on a single diffusion probabilistic model with two encoders, each operating on its own input domain, and a shared decoder. Extensive human evaluation shows that the proposed model can copy a target speaker's voice by means of speaker adaptation better than other known multimodal systems of this kind, and that the quality of the speech synthesized by our system in both voice cloning and voice conversion modes is comparable with that of recently proposed algorithms for the corresponding single tasks. Besides, it takes as few as 3 minutes of GPU time to adapt our model to a new speaker with only 15 seconds of untranscribed audio, which makes it attractive for practical applications.

1 Comparison with multimodal systems

In this section we provide the audio samples used in the AB test comparing multimodal systems.

1.1 NAUTILUS

VCC2TF1 VCC2TF2 VCC2TM1 VCC2TM2
GT:

Cloning (on 5 minutes)

VCC2TF1 VCC2TF2 VCC2TM1 VCC2TM2
Text: and the whole ship creaking, groaning, and jumping like a manufactory
Text: The proper course to pursue is to offer your name and address
Nautilus:
Ours:

Conversion

VCC2TF1 VCC2TF2 VCC2TM1 VCC2TM2
Source:
Nautilus:
Ours:
Source:
Nautilus:
Ours:

1.2 A Unified Speaker Adaptation Method

Cloning (on 10 samples)

p254 p236 p264 p345
GT:
Unified:
Ours:

2 Comparison with single-task systems

2.1 Voice Conversion

In this section we provide a subset of the audio samples used in the Mean Opinion Score evaluation for the voice conversion task. We compare the baselines with our unified diffusion-based model in a few-shot (FSL) scenario for speakers from VCTK unseen during training. All models were trained on the LibriTTS dataset.
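For readers unfamiliar with the metric, the following is a minimal sketch of how a Mean Opinion Score is commonly summarized: the mean of listener ratings together with a normal-approximation 95% confidence interval. The 1 to 5 rating scale, the function name `mean_opinion_score`, and the ratings themselves are assumptions made for illustration; this page does not state how the scores were aggregated.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (mean, half_width) for a list of listener ratings.

    half_width is the half-width of a normal-approximation
    confidence interval (z=1.96 corresponds to roughly 95%).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate spread")
    mean = statistics.fmean(ratings)
    stdev = statistics.stdev(ratings)  # sample standard deviation
    half_width = z * stdev / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical ratings on an assumed 1-5 scale; not data from the evaluation.
ratings = [4, 5, 3, 4, 4, 5, 4, 3]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

With only a handful of ratings the normal approximation is rough; a t-based interval or bootstrap would be a reasonable alternative.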

Choose a pair of voices:
Source Target

FS-PPG-VC BNE-PPG-VC Ours
FSL:

2.2 Voice Cloning

2.2.1 VCTK

In this section we provide a subset of the audio samples used in the Mean Opinion Score evaluation for the voice cloning task on speakers from VCTK.

Choose a speaker:
p231

Text: When a man looks for something beyond his reach, his friends say he is looking for the pot of gold at the end of the rainbow.

StyleSpeech FastSpeech Tacotron-SMA Grad-TTS Ours
FSL:

Text: Throughout the centuries people have explained the rainbow in various ways.

StyleSpeech FastSpeech Tacotron-SMA Grad-TTS Ours
FSL:

2.2.2 Internal speakers

Voice cloning results for several internal speakers, provided as an additional test.

Choose a speaker:
scarlett

Text: When a man looks for something beyond his reach, his friends say he is looking for the pot of gold at the end of the rainbow.

StyleSpeech FastSpeech Tacotron-SMA Grad-TTS Ours
FSL:

Text: Does Jane know about your new job? No, and don't you dare tell her! She will be furious!

StyleSpeech FastSpeech Tacotron-SMA Grad-TTS Ours
FSL:

Text: Throughout the centuries people have explained the rainbow in various ways.

StyleSpeech FastSpeech Tacotron-SMA Grad-TTS Ours
FSL:

2.2.3 Voice Cloning dependence on amount of data

We studied the influence of the amount of adaptation data on the quality of the speech synthesized by our system in few-shot mode. The adaptation data was reduced from 60 seconds to 30, 15 and 5 seconds, which correspond to 6, 3 and 1 audio samples, respectively.
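The mapping from time budget to number of samples can be sketched as follows. The stated figures (30, 15 and 5 seconds for 6, 3 and 1 samples) suggest clips of roughly 5 seconds each; that clip length, the implied 12 clips for 60 seconds, the nesting of the subsets, and the file names are all assumptions made for illustration, not details given on this page.

```python
def adaptation_subsets(clips, budgets_s=(5, 15, 30, 60), clip_len_s=5):
    """Map each time budget (seconds) to the first n clips that fill it.

    Assumes every clip lasts about clip_len_s seconds, so a budget of
    b seconds uses b // clip_len_s clips. Subsets are nested: each
    larger budget contains all clips of the smaller ones.
    """
    subsets = {}
    for budget in budgets_s:
        n_clips = budget // clip_len_s
        if n_clips > len(clips):
            raise ValueError(f"not enough clips for a {budget}s budget")
        subsets[budget] = clips[:n_clips]
    return subsets


# Hypothetical file names standing in for one speaker's adaptation audio.
clips = [f"clip_{i:02d}.wav" for i in range(12)]
subsets = adaptation_subsets(clips)
for budget, chosen in subsets.items():
    print(f"{budget:>2}s -> {len(chosen)} sample(s)")
```

Nesting the subsets keeps the comparison across budgets controlled, since every larger condition only adds audio to the smaller one.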

Choose a speaker:
scarlett

Text: When a man looks for something beyond his reach, his friends say he is looking for the pot of gold at the end of the rainbow.

5s 15s 30s 60s
FSL:

Text: Throughout the centuries people have explained the rainbow in various ways.

5s 15s 30s 60s
FSL:




March 2022