Speech samples for "A Unified Speaker Adaptation Method for Speech Synthesis using Transcribed and Untranscribed Speech with Backpropagation"

Authors: Hieu-Thi Luong, Junichi Yamagishi

All samples are synthesized with a speaker-independent WaveNet vocoder that was not trained on the target speaker's speech.

All strategies

1st sentence

Reference samples

WaveNet Recording
Reference ► Play ► Play

Text-to-speech speaker adaptation samples

          5 utterances                  250 utterances
          Supervised     Unsupervised   Supervised     Unsupervised
A1B       ► Play         ► Play         ► Play         ► Play
A3a       ► Play         ► Play         ► Play         ► Play
BaB       ► Play         ► Play         ► Play         ► Play
Baa       ► Play         ► Play         ► Play         ► Play
BaBall    ► Play         ► Play         ► Play         ► Play
Baaall    ► Play         ► Play         ► Play         ► Play

2nd sentence

Reference samples

WaveNet Recording
Reference ► Play ► Play

Text-to-speech speaker adaptation samples

          5 utterances                  250 utterances
          Supervised     Unsupervised   Supervised     Unsupervised
A1B       ► Play         ► Play         ► Play         ► Play
A3a       ► Play         ► Play         ► Play         ► Play
BaB       ► Play         ► Play         ► Play         ► Play
Baa       ► Play         ► Play         ► Play         ► Play
BaBall    ► Play         ► Play         ► Play         ► Play
Baaall    ► Play         ► Play         ► Play         ► Play

Selected strategies

3rd sentence

Reference samples

WaveNet Recording
Reference ► Play ► Play

Text-to-speech speaker adaptation samples

Adapt with 5 utterances (~30 seconds)

Supervised Unsupervised
A1B Fine-tune a single speaker bias
► Play ► Play
BaB Fine-tune multiple speaker biases
► Play ► Play
BaBall Fine-tune acoustic decoder network
► Play ► Play

Adapt with 250 utterances (~25 minutes)

Supervised Unsupervised
A1B Fine-tune a single speaker bias
► Play ► Play
BaB Fine-tune multiple speaker biases
► Play ► Play
BaBall Fine-tune acoustic decoder network
► Play ► Play

Adapt with 1000 utterances (~100 minutes)*

Supervised Unsupervised
A1B Fine-tune a single speaker bias
► Play ► Play
BaB Fine-tune multiple speaker biases
► Play ► Play
BaBall Fine-tune acoustic decoder network
► Play ► Play

*Not evaluated in the subjective test
