Can Speaker Augmentation Improve Multi-Speaker End-to-End TTS?

Authors: Erica Cooper, Cheng-I Lai, Yusuke Yasuda, Junichi Yamagishi

Accepted to Interspeech 2020.

This is an extension of our previous work, "Zero-Shot Multi-Speaker Text-to-Speech with State-of-the-Art Neural Speaker Embeddings" (ICASSP 2020). We try a number of different approaches to speaker-space augmentation to improve speaker similarity. The first is artificial speaker augmentation: speeding up and slowing down the existing VCTK data to create artificial "speakers". The second is including lower-quality ASR data to increase the number of speakers seen during training. We also experiment with adding channel labels to counteract the lower audio quality of the ASR corpora, as well as dialect embeddings to better model speakers' accents.
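For concreteness, here is a minimal sketch of how speed-based artificial "speakers" could be generated offline with librosa and soundfile; the file names, speed factors, and helper function are illustrative assumptions, not the paper's actual pipeline (which may instead rely on tools such as sox or VTLP-style warping).

```python
import librosa
import soundfile as sf

def make_artificial_speaker(in_wav, out_wav, speed_factor):
    """Create an artificial 'speaker' by playing a recording faster or slower.

    Resampling and then writing at the original sample rate changes both the
    speaking rate and the pitch, which is what makes the result sound like a
    different speaker rather than the same speaker talking faster.
    """
    y, sr = librosa.load(in_wav, sr=None)
    # Resample so that playback at the original rate is sped up (factor > 1)
    # or slowed down (factor < 1).
    y_aug = librosa.resample(y, orig_sr=sr, target_sr=int(sr / speed_factor))
    sf.write(out_wav, y_aug, sr)

# Hypothetical usage: derive two extra artificial speakers from one VCTK file.
make_artificial_speaker("p225_001.wav", "p225_fast_001.wav", speed_factor=1.1)
make_artificial_speaker("p225_001.wav", "p225_slow_001.wav", speed_factor=0.9)
```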

Audio Samples

The large number of audio files takes some time to load; please wait a moment if they don't play right away.
Each row of the table below corresponds to one system and contains audio samples for four seen speakers from training (p225, p234, p245, p334), four unseen dev-set speakers (p360, p304, p343, p264), and four unseen test-set speakers (p363, p252, p339, p351).
natural
copy synthesis
phone baseline
phone VTLP
phone 5c
phone 5c+CL
phone 5c+CL+DE1
phone 5c+CL+DE2
phone 5c+CL+DE3
phone 5c+CL+DE4
phone 5c+CL+DE5
char baseline
char VTLP
char 5c
char 5c+CL
char 5c+CL+DE1
char 5c+CL+DE2
char 5c+CL+DE3
char 5c+CL+DE4
char 5c+CL+DE5