A unified system for voice cloning and voice conversion through diffusion probabilistic modeling (INTERSPEECH 2022 review)

Under review as a conference paper at INTERSPEECH 2022

Abstract

Text-to-speech and voice conversion are two common speech generation tasks typically solved using different models. In this paper, we present a novel approach to voice cloning and any-to-any voice conversion relying on a single diffusion probabilistic model with two encoders, each operating on its own input domain, and a shared decoder. Extensive human evaluation shows that the proposed model can copy a target speaker's voice by means of speaker adaptation better than other known multimodal systems of this kind, and that the quality of the speech synthesized by our system in both voice cloning and voice conversion modes is comparable with that of recently proposed algorithms for the corresponding single tasks. Moreover, it takes as little as 3 minutes of GPU time to adapt our model to a new speaker with only 15 seconds of untranscribed audio, which makes it attractive for practical applications.

1 Comparison with multimodal systems

In this section we provide the audio samples used in the AB test comparing multimodal systems.

1.1 NAUTILUS


	VCC2TF1	VCC2TF2	VCC2TM1	VCC2TM2
GT:

Cloning (on 5 minutes)
	VCC2TF1	VCC2TF2	VCC2TM1	VCC2TM2
	Text: and the whole ship creaking, groaning, and jumping like a manufactory		Text: The proper course to pursue is to offer your name and address
Nautilus:
Ours:

Conversion
	VCC2TF1	VCC2TF2	VCC2TM1	VCC2TM2
Source:
Nautilus:
Ours:
Source:
Nautilus:
Ours:

1.2 A Unified Speaker Adaptation Method

Cloning (on 10 samples)
	p254	p236	p264	p345
GT:
Unified:
Ours:

2 Comparison with single-task systems

2.1 Voice Conversion

In this section we provide a subset of the audio samples used in the Mean Opinion Score evaluation for the voice conversion task. We compare the baselines with our unified diffusion-based model in the few-shot (FSL) scenario for VCTK speakers unseen during training. All models were trained on the LibriTTS dataset.
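For readers unfamiliar with the metric, a Mean Opinion Score is commonly reported as the average of listener ratings together with a confidence interval. The sketch below is a minimal illustration of that calculation, not the evaluation code used for this paper; the 1 to 5 rating scale, the normal-approximation interval, the function name `mean_opinion_score` and the example ratings are all assumptions made for illustration.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (mean, half_width) for a normal-approximation 95% interval.

    `ratings` is a list of listener scores; the 1-5 scale is an assumption.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate an interval")
    mean = statistics.fmean(ratings)
    # Standard error of the mean, based on the sample standard deviation.
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical ratings for one system; not data from the paper.
example_ratings = [4, 5, 3, 4, 4, 5, 4, 3]
mos, ci = mean_opinion_score(example_ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

With few raters, a Student's t quantile would be a more careful choice than the fixed z value used here.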


Source

Target

FS-PPG-VC

BNE-PPG-VC

Ours

FSL:

2.2 Voice Cloning

2.2.1 VCTK

In this section we provide a subset of the audio samples used in the Mean Opinion Score evaluation for the voice cloning task on speakers from VCTK.

Speaker: p231

Text: When a man looks for something beyond his reach, his friends say he is looking for the pot of gold at the end of the rainbow.

StyleSpeech

FastSpeech

Tacotron-SMA

Grad-TTS

Ours

FSL:

Text: Throughout the centuries people have explained the rainbow in various ways.

StyleSpeech

FastSpeech

Tacotron-SMA

Grad-TTS

Ours

FSL:

2.2.2 Internal speakers

Voice cloning results for several internal speakers, provided as an additional test.

Speaker: scarlett

Text: When a man looks for something beyond his reach, his friends say he is looking for the pot of gold at the end of the rainbow.

StyleSpeech

FastSpeech

Tacotron-SMA

Grad-TTS

Ours

FSL:

Text: Does Jane know about your new job? No, and don't you dare tell her! She will be furious!

StyleSpeech

FastSpeech

Tacotron-SMA

Grad-TTS

Ours

FSL:

Text: Throughout the centuries people have explained the rainbow in various ways.

StyleSpeech

FastSpeech

Tacotron-SMA

Grad-TTS

Ours

FSL:

2.2.3 Voice Cloning dependence on amount of data

We studied the influence of the amount of adaptation data on the quality of the speech synthesized by our system in few-shot mode. The adaptation set was reduced from 60 seconds to 30, 15 and 5 seconds, which correspond to 6, 3 and 1 audio samples, respectively.
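The stated correspondence (30, 15 and 5 seconds for 6, 3 and 1 samples) implies clips of roughly 5 seconds each. As a rough sketch of how such reduced adaptation sets could be assembled, the snippet below keeps clips in order until a duration budget is reached. The function name `select_adaptation_clips`, the greedy in-order selection and the uniform 5-second clip length are assumptions for illustration and are not taken from the paper.

```python
def select_adaptation_clips(durations, budget_seconds):
    """Keep clips in their given order while the total stays within budget.

    Returns the indices of the selected clips.
    """
    selected, total = [], 0.0
    for index, duration in enumerate(durations):
        if total + duration > budget_seconds:
            break
        selected.append(index)
        total += duration
    return selected


# Hypothetical pool of adaptation clips, each assumed to last 5 seconds.
clip_durations = [5.0] * 12

for budget in (30, 15, 5):
    count = len(select_adaptation_clips(clip_durations, budget))
    print(f"{budget} s budget -> {count} clip(s)")
```

Real clips vary in length, so a budget in seconds and a count of samples would only match approximately.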
