Speech samples for "A Unified Speaker Adaptation Method for Speech Synthesis using Transcribed and Untranscribed Speech with Backpropagation"

A pre-print version of this paper can be found at https://arxiv.org/abs/1906.07414

Japanese speech samples

All samples are synthesized using a speaker-independent WaveNet vocoder, which was not trained on the target speaker's speech.

1st sentence

Reference samples

	WaveNet	Recording
Reference	► Play	► Play

Text-to-speech speaker adaptation samples

2nd sentence

Reference samples

	WaveNet	Recording
Reference	► Play	► Play

Text-to-speech speaker adaptation samples

3rd sentence

Reference samples

	WaveNet	Recording
Reference	► Play	► Play

Text-to-speech speaker adaptation samples

Adapt with 5 utterances (~30 seconds)

	Supervised	Unsupervised
A1B	Fine-tune a single speaker bias
A1B	► Play	► Play
BaB	Fine-tune multiple speaker biases
BaB	► Play	► Play
BaB^all	Fine-tune acoustic decoder network
BaB^all	► Play	► Play

Adapt with 250 utterances (~25 minutes)

	Supervised	Unsupervised
A1B	Fine-tune a single speaker bias
A1B	► Play	► Play
BaB	Fine-tune multiple speaker biases
BaB	► Play	► Play
BaB^all	Fine-tune acoustic decoder network
BaB^all	► Play	► Play

Adapt with 1000 utterances (~100 minutes)*

	Supervised	Unsupervised
A1B	Fine-tune a single speaker bias
A1B	► Play	► Play
BaB	Fine-tune multiple speaker biases
BaB	► Play	► Play
BaB^all	Fine-tune acoustic decoder network
BaB^all	► Play	► Play

*Not evaluated in the subjective test
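
The three strategies listed in the tables above differ in which parameters are fine-tuned. A minimal sketch of that selection follows, assuming a hypothetical acoustic model whose parameters are identified by name. The parameter names, the choice of the first bias for A1B, and the helper `trainable_parameters` are illustrative assumptions and are not taken from the paper or its code.

```python
# Hypothetical parameter names, for illustration only.
# The actual model layout is not described on this page.
PARAMETER_NAMES = [
    "encoder.layer0.weight",
    "decoder.layer0.weight",
    "decoder.layer0.speaker_bias",
    "decoder.layer1.weight",
    "decoder.layer1.speaker_bias",
]


def trainable_parameters(strategy, names=PARAMETER_NAMES):
    """Return the parameter names that a strategy would fine-tune.

    All other parameters are assumed to stay frozen during adaptation.
    """
    speaker_biases = [n for n in names if n.endswith("speaker_bias")]
    if strategy == "A1B":
        # Single speaker bias; picking the first one is an assumption.
        return speaker_biases[:1]
    if strategy == "BaB":
        # Every speaker bias in the model.
        return speaker_biases
    if strategy == "BaB^all":
        # The whole acoustic decoder; assumed here to include its biases.
        return [n for n in names if n.startswith("decoder.")]
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    for strategy in ("A1B", "BaB", "BaB^all"):
        print(strategy, trainable_parameters(strategy))
```

In this sketch the number of adapted parameters grows from A1B to BaB to BaB^all, which is one way to read the ordering of the rows in each table.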

Most strategies synthesize speech using a fine-tuned WaveNet vocoder. WO-i is the only one that uses the WORLD vocoder, and it is included for reference purposes.

p254 Male | English

p236 Female | English

p264 Female | Scottish

p345 Male | American