Zero-Shot Multi-Speaker Text-to-Speech with State-of-the-Art Neural Speaker Embeddings

Authors: Erica Cooper, Cheng-I Lai, Yusuke Yasuda, Fuming Fang, Xin Wang, Nanxin Chen, Junichi Yamagishi

Submitted to ICASSP 2020.

Our multi-speaker Tacotron was pre-trained on the Nancy corpus (from the Blizzard Challenge 2011) and then warm-start trained on VCTK. Zero-shot speaker adaptation was accomplished by transfer learning: speaker embeddings were extracted from speaker verification systems trained separately on VoxCeleb, and were provided as an additional input during TTS model training. At synthesis time, an embedding extracted from an unseen speaker is fed to the TTS model to adapt to that speaker's voice, without any model fine-tuning.
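As a rough illustration of this pipeline (not the project's actual training code), the sketch below conditions a Tacotron-style text encoder on an utterance-level embedding produced by a frozen speaker encoder. The SpeakerEncoder and TextEncoder classes here are deliberately tiny stand-ins for the real VoxCeleb-trained ASV model and the Tacotron encoder; only the conditioning mechanism is the point.

```python
# Minimal sketch (PyTorch) of zero-shot speaker conditioning, assuming
# mel-spectrogram input. SpeakerEncoder and TextEncoder are simplified
# stand-ins, not the actual models used in this work.
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Stand-in ASV encoder: projects mel frames and mean-pools over time."""
    def __init__(self, n_mels=80, emb_dim=512):
        super().__init__()
        self.proj = nn.Linear(n_mels, emb_dim)

    def forward(self, mels):                  # mels: (batch, frames, n_mels)
        x = torch.tanh(self.proj(mels))
        return x.mean(dim=1)                  # utterance-level: (batch, emb_dim)

class TextEncoder(nn.Module):
    """Stand-in for the Tacotron text encoder."""
    def __init__(self, n_symbols=100, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(n_symbols, hidden)

    def forward(self, text_ids):              # text_ids: (batch, chars)
        return self.embed(text_ids)           # (batch, chars, hidden)

spk_enc = SpeakerEncoder().eval()             # trained separately, then frozen
txt_enc = TextEncoder()

ref_mels = torch.randn(1, 200, 80)            # reference audio, unseen speaker
text_ids = torch.randint(0, 100, (1, 30))     # symbol sequence to synthesize

with torch.no_grad():                         # no fine-tuning at synthesis time
    spk_emb = spk_enc(ref_mels)               # (1, 512)

enc_out = txt_enc(text_ids)                   # (1, 30, 256)
# Broadcast the utterance-level embedding across time and concatenate, so the
# decoder attends over speaker-conditioned encoder states.
cond = torch.cat(
    [enc_out, spk_emb.unsqueeze(1).expand(-1, enc_out.size(1), -1)], dim=-1)
print(cond.shape)                             # torch.Size([1, 30, 768])
```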

Speaker Embeddings

We experimented with several automatic speaker verification (ASV) models for extracting speaker embeddings; the table below summarizes each system. LDE-3 was our best system in terms of speaker similarity for unseen speakers.

Embedding   Dimension   Pooling   Objective   Normalized?   EER (%)   DCF(min,0.01)
i-VecN      400         m         EM          yes           5.329     0.493
x-Vec       512         m,s       S           no            3.298     0.343
x-VecN      512         m,s       S           yes           3.213     0.342
LDE-1       512         m         S           no            3.415     0.366
LDE-1N      512         m         S           yes           3.446     0.365
LDE-2       512         m         AS(2)       no            3.674     0.364
LDE-2N      512         m         AS(2)       yes           3.664     0.386
LDE-3       512         m         AS(3)       no            3.033     0.314
LDE-3N      512         m         AS(3)       yes           3.171     0.327
LDE-4       512         m         AS(4)       no            3.112     0.315
LDE-4N      512         m         AS(4)       yes           3.271     0.327
LDE-5       256         m         AS(2)       no            3.287     0.343
LDE-5N      256         m         AS(2)       yes           3.367     0.351
LDE-6       200         m         AS(2)       no            3.266     0.396
LDE-6N      200         m         AS(2)       yes           3.266     0.396
LDE-7       512         m,s       AS(2)       no            3.091     0.303
LDE-7N      512         m,s       AS(2)       yes           3.171     0.328

Pooling: m = mean, s = standard deviation. Objectives: EM = expectation-maximization (i-vector training), S = softmax, AS(m) = angular softmax with margin m. A trailing N in a system name marks the normalized variant of that embedding. Lower is better for both EER (equal error rate) and DCF(min,0.01) (minimum detection cost at a target prior of 0.01).
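The LDE systems replace plain temporal average pooling with learnable dictionary encoding: each frame-level feature is softly assigned to a set of learned dictionary components, and the weighted residuals are aggregated per component into a fixed-size utterance representation. Below is a minimal sketch of such a pooling layer under the general LDE formulation; the feature dimension and component count are illustrative, and in practice the pooled vector is further projected down to the embedding dimensions listed above.

```python
# Minimal sketch of LDE-style pooling; dimensions are illustrative, not the
# exact configuration of the systems in the table.
import torch
import torch.nn as nn

class LDEPooling(nn.Module):
    """Learnable dictionary encoding: soft-assign frames to C learned
    components and aggregate the weighted residuals per component."""
    def __init__(self, feat_dim=128, n_components=64):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(n_components, feat_dim))  # dictionary means
        self.s = nn.Parameter(torch.ones(n_components))              # per-component scale

    def forward(self, x):                     # x: (batch, frames, feat_dim)
        # Residuals between every frame and every component: (B, T, C, D)
        r = x.unsqueeze(2) - self.mu.unsqueeze(0).unsqueeze(0)
        # Soft assignment weights over components: (B, T, C)
        w = torch.softmax(-self.s * r.pow(2).sum(-1), dim=-1)
        # Per-component aggregated residual, normalized by total weight: (B, C, D)
        e = (w.unsqueeze(-1) * r).sum(1) / (w.sum(1).unsqueeze(-1) + 1e-8)
        # Flattened utterance-level vector, projected to the final embedding
        # size (512, 256, or 200 above) in a real system.
        return e.flatten(1)                   # (B, C * D)

pool = LDEPooling()
frames = torch.randn(2, 300, 128)             # frame-level features from a CNN front end
print(pool(frames).shape)                     # torch.Size([2, 8192])
```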

Audio Samples

Samples are provided for natural speech (nat), copy synthesis, and each embedding system (x-vector, and LDE-1 through LDE-7 with and without normalization), covering seen training speakers (p225, p234, p245, p334), unseen development speakers (p360, p304, p343, p264), and unseen test speakers (p363, p252, p339, p351).

d-vectors