Submitted to ICASSP 2020.
Our multi-speaker Tacotron was pre-trained on the Nancy dataset (from Blizzard 2011) and then warm-start trained on VCTK. Zero-shot speaker adaptation was accomplished by transfer learning: speaker embeddings were extracted from speaker verification systems trained separately on VoxCeleb, and these embeddings were included as input during TTS model training. During synthesis, embeddings from unseen speakers are fed to the TTS model so that it adapts to their voices without any model fine-tuning.
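The conditioning step described above can be illustrated with a minimal sketch. It assumes two things the text does not state: that "normalized" embeddings are scaled to unit L2 length, and that the embedding is concatenated to every encoder frame. The function names `l2_normalize` and `condition_on_speaker` are hypothetical, and the vectors are toy values rather than real speaker embeddings.

```python
import math
from typing import List


def l2_normalize(vec: List[float], eps: float = 1e-12) -> List[float]:
    """Scale a vector to unit Euclidean length (assumed form of normalization)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def condition_on_speaker(
    encoder_frames: List[List[float]],
    speaker_embedding: List[float],
    normalize: bool = True,
) -> List[List[float]]:
    """Append the same speaker embedding to every encoder frame.

    Concatenation is an assumption made for illustration; the TTS model
    itself is not modified or fine-tuned, only its input changes.
    """
    emb = l2_normalize(speaker_embedding) if normalize else list(speaker_embedding)
    return [frame + emb for frame in encoder_frames]


# Toy example: three 2-dimensional encoder frames and a 2-dimensional
# embedding standing in for one extracted from an unseen speaker.
frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
unseen_embedding = [3.0, 4.0]
conditioned = condition_on_speaker(frames, unseen_embedding)
```

Because only the embedding changes between speakers, swapping in a vector from an unseen speaker is enough to change the conditioning at synthesis time in this sketch.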
Embedding | Dimension | Pooling | Objective | Normalized? | EER | DCF(min,0.01) |
i-VecN | 400 | m | EM | ✔ | 5.329 | 0.493 |
x-Vec | 512 | m,s | S | | 3.298 | 0.343 |
x-VecN | 512 | m,s | S | ✔ | 3.213 | 0.342 |
LDE-1 | 512 | m | S | | 3.415 | 0.366 |
LDE-1N | 512 | m | S | ✔ | 3.446 | 0.365 |
LDE-2 | 512 | m | AS(2) | | 3.674 | 0.364 |
LDE-2N | 512 | m | AS(2) | ✔ | 3.664 | 0.386 |
LDE-3 | 512 | m | AS(3) | | 3.033 | 0.314 |
LDE-3N | 512 | m | AS(3) | ✔ | 3.171 | 0.327 |
LDE-4 | 512 | m | AS(4) | | 3.112 | 0.315 |
LDE-4N | 512 | m | AS(4) | ✔ | 3.271 | 0.327 |
LDE-5 | 256 | m | AS(2) | | 3.287 | 0.343 |
LDE-5N | 256 | m | AS(2) | ✔ | 3.367 | 0.351 |
LDE-6 | 200 | m | AS(2) | | 3.266 | 0.396 |
LDE-6N | 200 | m | AS(2) | ✔ | 3.266 | 0.396 |
LDE-7 | 512 | m,s | AS(2) | | 3.091 | 0.303 |
LDE-7N | 512 | m,s | AS(2) | ✔ | 3.171 | 0.328 |
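For readers unfamiliar with the two metrics in the table, the following sketch shows one common way to compute an equal error rate and a normalized minimum detection cost from verification trial scores. It is not the evaluation code behind the table. It assumes that 0.01 in the column header is the target prior and that miss and false-alarm costs are both 1; the scores are toy values, and the result is a fraction (the table's EER values appear to be percentages).

```python
from typing import List, Tuple


def error_rates(
    target: List[float], nontarget: List[float], threshold: float
) -> Tuple[float, float]:
    """Return (false-alarm rate, miss rate) when accepting scores >= threshold."""
    miss = sum(s < threshold for s in target) / len(target)
    false_alarm = sum(s >= threshold for s in nontarget) / len(nontarget)
    return false_alarm, miss


def candidate_thresholds(target: List[float], nontarget: List[float]) -> List[float]:
    """Every observed score, plus one threshold that rejects all trials."""
    scores = sorted(set(target) | set(nontarget))
    return scores + [scores[-1] + 1.0]


def equal_error_rate(target: List[float], nontarget: List[float]) -> float:
    """Approximate the EER as the point where the two error rates are closest."""
    best_gap, eer = float("inf"), 1.0
    for thr in candidate_thresholds(target, nontarget):
        fa, miss = error_rates(target, nontarget, thr)
        if abs(fa - miss) < best_gap:
            best_gap, eer = abs(fa - miss), (fa + miss) / 2
    return eer


def min_dcf(
    target: List[float],
    nontarget: List[float],
    p_target: float = 0.01,  # assumed meaning of "0.01" in the table header
    c_miss: float = 1.0,     # assumed cost of a miss
    c_fa: float = 1.0,       # assumed cost of a false alarm
) -> float:
    """Minimum detection cost over all thresholds, normalized by the trivial system."""
    default_cost = min(c_miss * p_target, c_fa * (1.0 - p_target))
    costs = []
    for thr in candidate_thresholds(target, nontarget):
        fa, miss = error_rates(target, nontarget, thr)
        costs.append(c_miss * p_target * miss + c_fa * (1.0 - p_target) * fa)
    return min(costs) / default_cost


# Toy trial scores: higher means "more likely the same speaker".
target_scores = [0.9, 0.8, 0.7, 0.4]
nontarget_scores = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(target_scores, nontarget_scores)
dcf = min_dcf(target_scores, nontarget_scores)
```

With a low target prior, false alarms dominate the cost, so the threshold that minimizes the detection cost is generally stricter than the one at the equal error point.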
Audio samples (audio players not reproduced here).
Systems: nat, copy synth, x-vector, LDE-1, LDE-1N, LDE-2, LDE-2N, LDE-3, LDE-3N, LDE-4, LDE-4N, LDE-5, LDE-5N, LDE-6, LDE-6N, LDE-7, LDE-7N.
Speakers, grouped as Seen Speakers (training), Unseen Speakers (dev), and Unseen Speakers (test): p225, p234, p245, p334, p360, p304, p343, p264, p363, p252, p339, p351.