A unified system for voice cloning and voice conversion through diffusion probabilistic modeling (INTERSPEECH 2022 review)
Under review as a conference paper at INTERSPEECH 2022
Abstract
Text-to-speech and voice conversion are two common speech generation tasks that are typically solved with different models. In this paper, we present a novel approach to voice cloning and any-to-any voice conversion that relies on a single diffusion probabilistic model with two encoders, each operating on its own input domain, and a shared decoder. Extensive human evaluation shows that the proposed model copies a target speaker's voice by means of speaker adaptation better than other known multimodal systems of this kind, and that the quality of the speech synthesized by our system in both voice cloning and voice conversion modes is comparable with that of recently proposed algorithms for the corresponding single tasks. Moreover, it takes as little as 3 minutes of GPU time to adapt our model to a new speaker with only 15 seconds of untranscribed audio, which makes it attractive for practical applications.
1 Comparison with multimodal systems
In this section we provide the audio samples used in the AB test comparing our model with other multimodal systems.
1.1 NAUTILUS
VCC2TF1 | VCC2TF2 | VCC2TM1 | VCC2TM2 | |
GT: | ||||
Cloning (on 5 minutes)
VCC2TF1 | VCC2TF2 | VCC2TM1 | VCC2TM2 | |
Text: and the whole ship creaking, groaning, and jumping like a manufactory | Text: The proper course to pursue is to offer your name and address | |||
Nautilus: | ||||
Ours: | ||||
Conversion
VCC2TF1 | VCC2TF2 | VCC2TM1 | VCC2TM2 | |
Source: | ||||
Nautilus: | ||||
Ours: | ||||
Source: | ||||
Nautilus: | ||||
Ours: |
1.2 A Unified Speaker Adaptation Method
Cloning (on 10 samples)
p254 | p236 | p264 | p345 | |
GT: | ||||
Unified: | ||||
Ours: | ||||
2 Comparison with single-task systems
2.1 Voice Conversion
In this section we provide a subset of the audio samples used in the Mean Opinion Score evaluation for the voice conversion task. We compare the baselines with our unified diffusion-based model in the few-shot (FSL) scenario for speakers from VCTK unseen during training. All models were trained on the LibriTTS dataset.
Source | Target | |||
FS-PPG-VC | BNE-PPG-VC | Ours | ||
FSL: | ||||
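A Mean Opinion Score is conventionally reported as the average of listener ratings together with a confidence interval. The following is a minimal sketch of that computation, assuming ratings on a 1 to 5 scale and a normal-approximation 95% interval. The function name `mos_with_ci` and the example ratings are hypothetical and are not taken from the paper or its evaluation data.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) for a normal-approximation confidence interval.

    `ratings` is a sequence of listener scores (assumed here to be on a 1-5 scale).
    `z` is the normal quantile; 1.96 corresponds to a 95% interval.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate the interval")
    mean = statistics.fmean(ratings)
    # Standard error of the mean, using the sample standard deviation.
    std_err = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, z * std_err


# Hypothetical ratings for one system; NOT results from the paper.
example_ratings = [4, 5, 3, 4, 4, 5, 4, 3]
mos, half_width = mos_with_ci(example_ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

With few ratings per system, a Student's t quantile in place of `z` would give a more conservative interval.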
2.2 Voice Cloning
2.2.1 VCTK
In this section we provide a subset of the audio samples used in the Mean Opinion Score evaluation for the voice cloning task on speakers from VCTK.
p231 | ||||||||
Text: When a man looks for something beyond his reach, his friends say he is looking for the pot of gold at the end of the rainbow. | ||||||||
StyleSpeech | FastSpeech | Tacotron-SMA | Grad-TTS | Ours | ||||
FSL: | ||||||||
Text: Throughout the centuries people have explained the rainbow in various ways. | ||||||||
StyleSpeech | FastSpeech | Tacotron-SMA | Grad-TTS | Ours | ||||
FSL: | ||||||||
2.2.2 Internal speakers
Voice cloning results for some internal speakers, used as an additional test.
scarlett | ||||||||
Text: When a man looks for something beyond his reach, his friends say he is looking for the pot of gold at the end of the rainbow. | ||||||||
StyleSpeech | FastSpeech | Tacotron-SMA | Grad-TTS | Ours | ||||
FSL: | ||||||||
Text: Does Jane know about your new job? No, and don't you dare tell her! She will be furious! | ||||||||
StyleSpeech | FastSpeech | Tacotron-SMA | Grad-TTS | Ours | ||||
FSL: | ||||||||
Text: Throughout the centuries people have explained the rainbow in various ways. | ||||||||
StyleSpeech | FastSpeech | Tacotron-SMA | Grad-TTS | Ours | ||||
FSL: | ||||||||
2.2.3 Voice Cloning dependence on amount of data
We studied how the amount of adaptation data influences the quality of the speech synthesized by our system in few-shot mode. The adaptation data was reduced from 60 seconds to 30, 15 and 5 seconds, which correspond to 6, 3 and 1 audio samples, respectively.
scarlett | ||||||||
Text: When a man looks for something beyond his reach, his friends say he is looking for the pot of gold at the end of the rainbow. | ||||||||
5s | 15s | 30s | 60s | |||||
FSL: | ||||||||
Text: Throughout the centuries people have explained the rainbow in various ways. | ||||||||
5s | 15s | 30s | 60s | |||||
FSL: | ||||||||
March 2022