Audio Samples for Multi-Speaker Tacotron

Zero-Shot Multi-Speaker Text-to-Speech with State-of-the-Art Neural Speaker Embeddings

Authors: Erica Cooper, Cheng-I Lai, Yusuke Yasuda, Fuming Fang, Xin Wang, Nanxin Chen, Junichi Yamagishi

Submitted to ICASSP 2020.

Our multi-speaker Tacotron was pre-trained on the Nancy dataset (from the Blizzard Challenge 2011) and then warm-start trained on VCTK. Zero-shot speaker adaptation was accomplished by transfer learning: speaker embeddings were extracted from speaker verification systems trained separately on VoxCeleb, and these embeddings were included as input during TTS model training. During synthesis, embeddings from unseen speakers are input to the TTS model so that it adapts to their voices without any model fine-tuning.
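How an unseen speaker's embedding might condition the synthesizer can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the embedding is tiled across time and concatenated to each encoder output frame, which is one common way to condition Tacotron-style models. The function name `condition_on_speaker` and the toy dimensions are hypothetical.

```python
from typing import List, Sequence


def condition_on_speaker(
    encoder_frames: Sequence[Sequence[float]],
    speaker_embedding: Sequence[float],
) -> List[List[float]]:
    """Append the same speaker embedding to every encoder output frame.

    Assumption (not stated in the source): conditioning is done by
    concatenation along the feature axis.
    """
    return [list(frame) + list(speaker_embedding) for frame in encoder_frames]


# Toy example: 3 encoder frames of size 4 and a 2-dimensional embedding.
encoder_frames = [[0.1, 0.2, 0.3, 0.4]] * 3
unseen_speaker_embedding = [0.5, -0.5]
conditioned = condition_on_speaker(encoder_frames, unseen_speaker_embedding)
print(len(conditioned), len(conditioned[0]))
```

Because the TTS model only sees the embedding as an extra input, swapping in an embedding from a new speaker requires no change to the model weights, which is the sense in which the adaptation is zero-shot.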

Speaker Embeddings

We tried several automatic speaker verification (ASV) models for extracting speaker embeddings. Information about each system is in the table below. LDE-3 was our best system in terms of speaker similarity for unseen speakers.

Embedding	Dimension	Pooling	Objective	Normalized?	EER (%)	DCF(min,0.01)
i-Vec^N	400	m	EM	✔	5.329	0.493
x-Vec	512	m,s	S		3.298	0.343
x-Vec^N	512	m,s	S	✔	3.213	0.342
LDE-1	512	m	S		3.415	0.366
LDE-1^N	512	m	S	✔	3.446	0.365
LDE-2	512	m	AS(2)		3.674	0.364
LDE-2^N	512	m	AS(2)	✔	3.664	0.386
LDE-3	512	m	AS(3)		3.033	0.314
LDE-3^N	512	m	AS(3)	✔	3.171	0.327
LDE-4	512	m	AS(4)		3.112	0.315
LDE-4^N	512	m	AS(4)	✔	3.271	0.327
LDE-5	256	m	AS(2)		3.287	0.343
LDE-5^N	256	m	AS(2)	✔	3.367	0.351
LDE-6	200	m	AS(2)		3.266	0.396
LDE-6^N	200	m	AS(2)	✔	3.266	0.396
LDE-7	512	m,s	AS(2)		3.091	0.303
LDE-7^N	512	m,s	AS(2)	✔	3.171	0.328
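For readers unfamiliar with the EER column, the sketch below shows one simple way an equal error rate can be computed from verification trial scores. It is an illustration only, not the scoring tool used for the table: the function name `equal_error_rate` and the toy scores are hypothetical, and the function returns a fraction, whereas the table values appear to be percentages.

```python
from typing import Sequence


def equal_error_rate(
    target_scores: Sequence[float],
    nontarget_scores: Sequence[float],
) -> float:
    """Approximate the EER by sweeping a threshold over all observed scores.

    Returns the average of the false-acceptance and false-rejection rates
    at the threshold where the two rates are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap = float("inf")
    eer = 0.0
    for threshold in thresholds:
        # False rejection: same-speaker trials scoring below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False acceptance: different-speaker trials at or above the threshold.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2
    return eer


# Toy scores (made up for illustration): higher means "more likely same speaker".
toy_targets = [0.9, 0.8, 0.7, 0.4]
toy_nontargets = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(toy_targets, toy_nontargets))
```

A lower EER means the embedding separates same-speaker and different-speaker trials more cleanly.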

Audio Samples

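The samples table has one row per system and one column per speaker, with the speakers grouped as Seen Speakers (training), Unseen Speakers (dev), and Unseen Speakers (test).

Speakers: p225, p234, p245, p334, p360, p304, p343, p264, p363, p252, p339, p351

Systems: nat, copy synth, x-vector, LDE-1, LDE-1^N, LDE-2, LDE-2^N, LDE-3, LDE-3^N, LDE-4, LDE-4^N, LDE-5, LDE-5^N, LDE-6, LDE-6^N, LDE-7, LDE-7^N, d-vectors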
