Speech samples for "Bootstrapping non-parallel voice conversion from speaker-adaptive text-to-speech"

Authors: Hieu-Thi Luong, Junichi Yamagishi

Speech samples for the paper "Bootstrapping non-parallel voice conversion from speaker-adaptive text-to-speech", accepted at IEEE ASRU 2019.

A pre-print version of the paper can be found at https://arxiv.org/abs/1909.06532

All samples are synthesized using a WaveNet vocoder fine-tuned to the target speaker.

VCC2018: SPOKE Task

We reenacted the Voice Conversion Challenge 2018 SPOKE task to test our method. The base model is trained on the VCTK open speech corpus.

Cross-language speaker adaptation

A single bilingual speaker (English and Japanese) is used as the target speaker for this task.

The purpose of this task is to adapt the voice conversion model to a speaker who does not speak the target language:

EE-E: the TTS model is trained using speech and transcripts of multiple English speakers, the VC model is adapted using English speech of the target, and the model is then used to convert utterances of unseen English source speakers.

EJ-E: the TTS model is trained using speech and transcripts of multiple English speakers, the VC model is adapted using Japanese speech of the target, and the model is then used to convert utterances of unseen English source speakers.

Audio samples (1st and 2nd sentences): utterances from source speakers VCC2SM3 and VCC2SF3, converted with the EE-E and EJ-E systems.

Voice conversion for low-resource language

A single bilingual speaker (English and Japanese) is used as the target speaker for this task.

The purpose of this task is to create a voice conversion model for a low-resource language, for which no text transcripts are available, by using a base model trained on a resource-rich language (for which speech and transcripts from multiple speakers are required).

In our experiment, Japanese plays the role of the low-resource language while English is the resource-rich language. Interestingly, the converted Japanese speech has an English "accent", especially in the case of the EE-J model.

JJ-J: the TTS model is trained using speech and transcripts of multiple Japanese speakers, the VC model is adapted using Japanese speech of the target, and the model is then used to convert utterances of unseen Japanese source speakers.

EJ-J: the TTS model is trained using speech and transcripts of multiple English speakers, the VC model is adapted using Japanese speech of the target, and the model is then used to convert utterances of unseen Japanese source speakers.

EE-J: the TTS model is trained using speech and transcripts of multiple English speakers, the VC model is adapted using English speech of the target, and the model is then used to convert utterances of unseen Japanese source speakers.
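The condition labels above and in the previous section follow an "XY-Z" pattern that can be read off the descriptions on this page. The short Python snippet below simply restates that reading; it is purely illustrative and is not part of the released code or the paper.

    # Reading the "XY-Z" condition labels used on this page (illustrative only):
    #   X = language of the multi-speaker TTS training data (speech + transcripts)
    #   Y = language of the target speaker's adaptation speech
    #   Z = language of the source utterances being converted
    LANG = {"E": "English", "J": "Japanese"}

    def describe(label):
        """e.g. describe("EJ-J") -> 'TTS: English, adaptation: Japanese, conversion: Japanese'"""
        tts, adapt = label[0], label[1]
        convert = label.split("-")[1]
        return f"TTS: {LANG[tts]}, adaptation: {LANG[adapt]}, conversion: {LANG[convert]}"

    for cond in ["EE-E", "EJ-E", "JJ-J", "EJ-J", "EE-J"]:
        print(cond, "->", describe(cond))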

*Due to license restrictions we cannot provide speech samples of the Japanese source speaker used in the paper. For this demonstration we instead use speech from two speakers of the JVS corpus as source speakers. The adapted model and the target speaker are the same as those used in the paper.
