Speech samples for "A Unified Speaker Adaptation Method for Speech Synthesis using Transcribed and Untranscribed Speech with Backpropagation"

Authors: Hieu-Thi Luong, Junichi Yamagishi

A pre-print version of this paper can be found at https://arxiv.org/abs/1906.07414

Japanese speech samples

All samples are synthesized using a speaker-independent WaveNet vocoder that was not trained on the target speaker's speech.

All strategies

1st sentence

Reference samples

WaveNet Recording
Reference ► Play ► Play

Text-to-speech speaker adaptation samples

5 utterances 250 utterances
Supervised Unsupervised Supervised Unsupervised
A1B ► Play ► Play ► Play ► Play
A3a ► Play ► Play ► Play ► Play
BaB ► Play ► Play ► Play ► Play
Baa ► Play ► Play ► Play ► Play
BaBall ► Play ► Play ► Play ► Play
Baaall ► Play ► Play ► Play ► Play

2nd sentence

Reference samples

WaveNet Recording
Reference ► Play ► Play

Text-to-speech speaker adaptation samples

5 utterances 250 utterances
Supervised Unsupervised Supervised Unsupervised
A1B ► Play ► Play ► Play ► Play
A3a ► Play ► Play ► Play ► Play
BaB ► Play ► Play ► Play ► Play
Baa ► Play ► Play ► Play ► Play
BaBall ► Play ► Play ► Play ► Play
Baaall ► Play ► Play ► Play ► Play

Selected strategies

3rd sentence

Reference samples

WaveNet Recording
Reference ► Play ► Play

Text-to-speech speaker adaptation samples

Adapt with 5 utterances (~30 seconds)

Supervised Unsupervised
A1B Fine-tune a single speaker bias
► Play ► Play
BaB Fine-tune multiple speaker biases
► Play ► Play
BaBall Fine-tune acoustic decoder network
► Play ► Play

Adapt with 250 utterances (~25 minutes)

Supervised Unsupervised
A1B Fine-tune a single speaker bias
► Play ► Play
BaB Fine-tune multiple speaker biases
► Play ► Play
BaBall Fine-tune acoustic decoder network
► Play ► Play

Adapt with 1000 utterances (~100 minutes)*

Supervised Unsupervised
A1B Fine-tune a single speaker bias
► Play ► Play
BaB Fine-tune multiple speaker biases
► Play ► Play
BaBall Fine-tune acoustic decoder network
► Play ► Play

*Not evaluated in the subjective test

English speech samples (VCTK Corpus)

Most strategies synthesize speech using a fine-tuned WaveNet vocoder. WO-i is the only one that uses the WORLD vocoder; it is included for reference.

p254 Male | English

10 utterances 320 utterances
Supervised Unsupervised Supervised Unsupervised
NAT ► Play
MU-A1B ► Play ► Play
WO-i ► Play ► Play
BaB ► Play ► Play ► Play ► Play
BaBall ► Play ► Play ► Play ► Play

p236 Female | English

10 utterances 320 utterances
Supervised Unsupervised Supervised Unsupervised
NAT ► Play
MU-A1B ► Play ► Play
WO-i ► Play ► Play
BaB ► Play ► Play ► Play ► Play
BaBall ► Play ► Play ► Play ► Play

p264 Female | Scottish

10 utterances 320 utterances
Supervised Unsupervised Supervised Unsupervised
NAT ► Play
MU-A1B ► Play ► Play
WO-i ► Play ► Play
BaB ► Play ► Play ► Play ► Play
BaBall ► Play ► Play ► Play ► Play

p345 Male | American

10 utterances 320 utterances
Supervised Unsupervised Supervised Unsupervised
NAT ► Play
MU-A1B ► Play ► Play
WO-i ► Play ► Play
BaB ► Play ► Play ► Play ► Play
BaBall ► Play ► Play ► Play ► Play
