Multi-speaker Tacotron: d-vectors
The d-vector implementation comes from:
https://github.com/CorentinJ/Real-Time-Voice-Cloning
- Trained on VoxCeleb 1 and 2, and LibriSpeech-other
- Silent regions removed with VAD
- 256 dimensions
- Pretrained model: trained for 1.56M steps (20 days on a single GPU) with a batch size of 64
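To illustrate the silence-removal step listed above, here is a minimal sketch that drops low-energy frames before speaker-embedding extraction. It is a crude energy-threshold stand-in, not the VAD used by the linked repository; the function name `trim_silence`, the frame length, and the threshold are all hypothetical choices made for illustration.

```python
import math


def trim_silence(samples, frame_len=160, threshold=0.01):
    """Keep only frames whose RMS energy reaches `threshold`.

    Assumption: `samples` is a list of floats in [-1, 1]. The frame length
    and threshold are illustrative values, not settings from the repository.
    """
    voiced = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        # Root-mean-square energy of the current frame.
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms >= threshold:
            voiced.extend(frame)
    return voiced
```

A real VAD models speech versus non-speech rather than raw energy, so this sketch only conveys the idea of discarding silent regions before the encoder sees the audio.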
Cosine Similarities
Cosine similarities are computed from x-vectors extracted from the synthesized speech.
System | Seen (training) | Unseen (dev) | Unseen (test) |
LDE-3 | 0.842 | 0.492 | 0.549 |
d-vector | 0.838 | 0.466 | 0.551 |
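For reference, a cosine-similarity score like those in the table can be computed as sketched below. This assumes each synthesized utterance's speaker embedding is compared against a reference embedding of the target speaker and that the scores are averaged per set; the page does not state the exact pairing or averaging, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_similarity(pairs):
    """Average cosine similarity over (synthesized, reference) embedding pairs.

    Assumption: one pair per synthesized utterance; the averaging scheme is
    illustrative and not confirmed by the page.
    """
    scores = [cosine_similarity(synth, ref) for synth, ref in pairs]
    return sum(scores) / len(scores)
```

Values closer to 1 indicate that the synthesized speech sits closer to the target speaker in embedding space.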
Audio Samples
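System | Seen Speakers (training) | Unseen Speakers (dev) | Unseen Speakers (test) |
| p225, p234, p245, p334 | p360, p304, p343, p264 | p363, p252, p339, p351 |
nat | [4 audio samples] | [4 audio samples] | [4 audio samples] |
copy synth | [4 audio samples] | [4 audio samples] | [4 audio samples] |
x-vector | [4 audio samples] | [4 audio samples] | [4 audio samples] |
LDE-3 | [4 audio samples] | [4 audio samples] | [4 audio samples] |
d-vector | [4 audio samples] | [4 audio samples] | [4 audio samples] |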