A pre-print version of this paper can be found at https://arxiv.org/abs/1906.07414
All samples are synthesized using a speaker-independent WaveNet vocoder, which was not trained on the target speaker's speech.
1st sentence
Reference samples
| | WaveNet | Recording |
|---|---|---|
| Reference | ► Play | ► Play |
Text-to-speech speaker adaptation samples
| Strategy | 5 utterances (supervised) | 5 utterances (unsupervised) | 250 utterances (supervised) | 250 utterances (unsupervised) |
|---|---|---|---|---|
| A1B | ► Play | ► Play | ► Play | ► Play |
| A3a | ► Play | ► Play | ► Play | ► Play |
| BaB | ► Play | ► Play | ► Play | ► Play |
| Baa | ► Play | ► Play | ► Play | ► Play |
| BaBall | ► Play | ► Play | ► Play | ► Play |
| Baaall | ► Play | ► Play | ► Play | ► Play |
2nd sentence
Reference samples
| | WaveNet | Recording |
|---|---|---|
| Reference | ► Play | ► Play |
Text-to-speech speaker adaptation samples
| Strategy | 5 utterances (supervised) | 5 utterances (unsupervised) | 250 utterances (supervised) | 250 utterances (unsupervised) |
|---|---|---|---|---|
| A1B | ► Play | ► Play | ► Play | ► Play |
| A3a | ► Play | ► Play | ► Play | ► Play |
| BaB | ► Play | ► Play | ► Play | ► Play |
| Baa | ► Play | ► Play | ► Play | ► Play |
| BaBall | ► Play | ► Play | ► Play | ► Play |
| Baaall | ► Play | ► Play | ► Play | ► Play |
Adapt with 5 utterances (~30 seconds)
| Strategy | Description | Supervised | Unsupervised |
|---|---|---|---|
| A1B | Fine-tune a single speaker bias | ► Play | ► Play |
| BaB | Fine-tune multiple speaker biases | ► Play | ► Play |
| BaBall | Fine-tune acoustic decoder network | ► Play | ► Play |
Most strategies synthesize speech using a fine-tuned WaveNet vocoder. WO-i is the only one that uses the WORLD vocoder, for reference purposes.