A pre-print version of this paper can be found at https://arxiv.org/abs/1906.07414
All samples are synthesized using a speaker-independent WaveNet vocoder, which was not trained on the target speaker's speech.
1st sentence
Reference samples
| | WaveNet | Recording |
|---|---|---|
| Reference | ► Play | ► Play |
Text-to-speech speaker adaptation samples
| Strategy | 5 utterances, supervised | 5 utterances, unsupervised | 250 utterances, supervised | 250 utterances, unsupervised |
|---|---|---|---|---|
| A1B | ► Play | ► Play | ► Play | ► Play |
| A3a | ► Play | ► Play | ► Play | ► Play |
| BaB | ► Play | ► Play | ► Play | ► Play |
| Baa | ► Play | ► Play | ► Play | ► Play |
| BaBall | ► Play | ► Play | ► Play | ► Play |
| Baaall | ► Play | ► Play | ► Play | ► Play |
2nd sentence
Reference samples
| | WaveNet | Recording |
|---|---|---|
| Reference | ► Play | ► Play |
Text-to-speech speaker adaptation samples
| Strategy | 5 utterances, supervised | 5 utterances, unsupervised | 250 utterances, supervised | 250 utterances, unsupervised |
|---|---|---|---|---|
| A1B | ► Play | ► Play | ► Play | ► Play |
| A3a | ► Play | ► Play | ► Play | ► Play |
| BaB | ► Play | ► Play | ► Play | ► Play |
| Baa | ► Play | ► Play | ► Play | ► Play |
| BaBall | ► Play | ► Play | ► Play | ► Play |
| Baaall | ► Play | ► Play | ► Play | ► Play |
Adapt with 5 utterances (~30 seconds)
| Strategy | Description | Supervised | Unsupervised |
|---|---|---|---|
| A1B | Fine-tune a single speaker bias | ► Play | ► Play |
| BaB | Fine-tune multiple speaker biases | ► Play | ► Play |
| BaBall | Fine-tune acoustic decoder network | ► Play | ► Play |
Most strategies synthesize speech using a fine-tuned WaveNet vocoder. WO-i is the only one that uses the WORLD vocoder, and it is included for reference purposes.
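To make the A1B entry in the table above ("Fine-tune a single speaker bias") more concrete, here is a minimal toy sketch of bias-only adaptation. It is not the paper's model. It assumes a single linear layer with an additive per-speaker bias, and every name in it (`forward`, `mse`, `adapt_speaker_bias`, `W`, `new_speaker_offset`) is hypothetical. The only point it illustrates is that the pretrained weights stay frozen while a small bias vector is fitted to a handful of adaptation examples.

```python
# Toy illustration only: one linear layer with an additive per-speaker bias.
# All names here are hypothetical and not taken from the paper.

def forward(W, speaker_bias, x):
    """Compute y = W @ x + speaker_bias for one input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(W, speaker_bias)]


def mse(W, speaker_bias, data):
    """Mean squared error over (x, target) pairs."""
    total, count = 0.0, 0
    for x, target in data:
        y = forward(W, speaker_bias, x)
        total += sum((yi - ti) ** 2 for yi, ti in zip(y, target))
        count += len(target)
    return total / count


def adapt_speaker_bias(W, speaker_bias, data, lr=0.1, steps=200):
    """Gradient descent on the speaker bias only; W is never modified."""
    bias = list(speaker_bias)
    for _ in range(steps):
        grad = [0.0] * len(bias)
        for x, target in data:
            y = forward(W, bias, x)
            # Gradient of the squared error, averaged over the examples
            for i, (yi, ti) in enumerate(zip(y, target)):
                grad[i] += 2.0 * (yi - ti) / len(data)
        bias = [b - lr * g for b, g in zip(bias, grad)]
    return bias


# Frozen "pretrained" weights and five adaptation examples (toy data)
W = [[1.0, 0.0], [0.0, 1.0]]
new_speaker_offset = [0.5, -0.3]
inputs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, -1.0], [-1.0, 0.5]]
data = [(x, [xi + o for xi, o in zip(x, new_speaker_offset)]) for x in inputs]

initial_bias = [0.0, 0.0]
adapted_bias = adapt_speaker_bias(W, initial_bias, data)
```

In this sketch only two numbers are updated, which is why such a strategy can plausibly work with very little adaptation data. Fine-tuning a whole network, as in the BaBall row, would instead update every weight.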