This is an extension of our previous work, "Zero-Shot Multi-Speaker
Text-to-Speech with State-of-the-Art Neural Speaker Embeddings"
(ICASSP 2020). We explore several approaches to speaker space
augmentation in order to improve speaker similarity. The first is
artificial speaker augmentation: speeding up and slowing down the
existing VCTK data to create artificial "speakers". The second is to
include lower-quality ASR data to increase the number of speakers
seen during training. We also experiment with adding channel labels
to counteract the lower quality of the data in the ASR corpora, as
well as adding dialect embeddings to better model speakers' accents.
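As a rough illustration of the speed-based augmentation idea, the sketch below resamples a waveform so it plays back faster or slower, shifting both tempo and pitch; each perturbation factor then acts as a new artificial speaker. The linear-interpolation resampling and the example factors (0.9, 1.1) are assumptions for illustration, not necessarily the exact setup used in the paper.

```python
import numpy as np

def speed_perturb(wav: np.ndarray, factor: float) -> np.ndarray:
    """Resample a waveform so it plays back `factor` times faster.

    Playing the resampled signal at the original sampling rate shifts
    both tempo and pitch, turning one speaker's recordings into those
    of an artificial "speaker". (Simple linear interpolation is used
    here for clarity; a production pipeline would use a proper
    band-limited resampler.)
    """
    n_out = int(round(len(wav) / factor))
    # Fractional positions in the original signal for each output sample.
    src_idx = np.linspace(0, len(wav) - 1, num=n_out)
    return np.interp(src_idx, np.arange(len(wav)), wav)

# Illustrative perturbation factors; each yields one extra artificial
# speaker per original VCTK speaker.
factors = [0.9, 1.1]
```

A factor above 1.0 shortens the utterance (faster, higher-pitched speech); a factor below 1.0 lengthens it.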
Audio Samples
The large number of audio files may take some time to load; please wait a moment if they don't play right away.