d-vectors

Multi-speaker Tacotron: d-vectors

Trained on VoxCeleb 1 and 2, and LibriSpeech-other
Silent regions removed with VAD
256 dimensions
Pretrained model trained for 1.56M steps (20 days on a single GPU) with a batch size of 64
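
The silence-removal step listed above can be illustrated with a minimal energy-based sketch. This is an assumption for illustration only: the source does not say which VAD was used, and the function names, the frame length of 400 samples, and the -40 dB threshold are all hypothetical choices, not values from the original system.

```python
import math

def frame_energies_db(samples, frame_len):
    """Return the mean power in dB of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        # Small floor avoids log10(0) on fully silent frames.
        energies.append(10.0 * math.log10(power + 1e-12))
    return energies

def remove_silence(samples, frame_len=400, threshold_db=-40.0):
    """Keep only frames whose energy exceeds threshold_db.

    Hypothetical defaults; trailing samples that do not fill a
    whole frame are dropped.
    """
    kept = []
    for i, energy in enumerate(frame_energies_db(samples, frame_len)):
        if energy > threshold_db:
            kept.extend(samples[i * frame_len:(i + 1) * frame_len])
    return kept

# Toy signal: one silent frame, one frame of a 220 Hz tone, one silent frame.
tone = [0.5 * math.sin(2 * math.pi * 220 * n / 16000) for n in range(400)]
signal = [0.0] * 400 + tone + [0.0] * 400
voiced = remove_silence(signal)
```

In this toy case only the middle frame survives, so `voiced` holds the 400 tone samples. A production VAD would typically use overlapping frames and smoothing, which this sketch leaves out.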

Based on x-vectors extracted from synthesized speech.

System: nat, copy synth, x-vector, LDE-3, d-vector

Speaker groups: Seen Speakers (training), Unseen Speakers (dev), Unseen Speakers (test)

Speakers: p225, p234, p245, p334, p360, p304, p343, p264, p363, p252, p339, p351