Speech samples for "NAUTILUS: a Versatile Voice Cloning System"

Authors: Hieu-Thi Luong, Junichi Yamagishi

A pre-print version of this paper can be found at https://arxiv.org/abs/2005.11004

You may be interested in our follow-up work on cross-lingual text-to-speech and voice conversion.

Scenario A

In this scenario, we reenacted the SPOKE task of the Voice Conversion Challenge 2018 to test our method. Same-gender and cross-gender voice conversions are treated as two separate entities.

Female target speakers		VCC2TF1	VCC2TF2
		► Play	► Play
System	Input	► Play	► Play
XV	"and the whole ship creaking, groaning, and jumping like a manufactory"	► Play	► Play
N10⁼	► Play VCC2SF4_30015.wav	► Play	► Play
N10^×	► Play VCC2SM3_30015.wav	► Play	► Play
NAUTILUS/TTS_u	"and the whole ship creaking, groaning, and jumping like a manufactory"	► Play	► Play
NAUTILUS/VCA_u⁼	► Play VCC2SF4_30015.wav	► Play	► Play
NAUTILUS/VCA_u^×	► Play VCC2SM3_30015.wav	► Play	► Play

Male target speakers		VCC2TM1	VCC2TM2
		► Play	► Play
System	Input	► Play	► Play
XV	"The proper course to pursue is to offer your name and address"	► Play	► Play
N10⁼	► Play VCC2SM4_30004.wav	► Play	► Play
N10^×	► Play VCC2SF3_30004.wav	► Play	► Play
NAUTILUS/TTS_u	"The proper course to pursue is to offer your name and address"	► Play	► Play
NAUTILUS/VCA_u⁼	► Play VCC2SM4_30004.wav	► Play	► Play
NAUTILUS/VCA_u^×	► Play VCC2SF3_30004.wav	► Play	► Play

* Speech samples of N13 and N17 can be found in the results of VCC2018.

Scenario B

In this scenario, we use two American speakers as standard "easy" target speakers and two Mandarin-accented speakers as unique "difficult" targets to evaluate the robustness of voice cloning methods.

American-accent target speakers (L1)		p294	p345
		► Play	► Play
System	Input	► Play	► Play
XV	"People look, but no one ever finds it"	► Play	► Play
FT		► Play	► Play
NAUTILUS/TTS_u		► Play	► Play
NAUTILUS/TTS_s		► Play	► Play
NAUTILUS/VCM_u	► Play p299_010.wav	► Play	► Play
NAUTILUS/VCM_s	► Play p299_010.wav	► Play	► Play

Mandarin-accent target speakers (L2)		MF6	MM6
		► Play	► Play
System	Input	► Play	► Play
XV	"When the sunlight strikes raindrops in the air, they act as a prism and form a rainbow."	► Play	► Play
FT		► Play	► Play
NAUTILUS/TTS_u		► Play	► Play
NAUTILUS/TTS_s		► Play	► Play
NAUTILUS/VCM_u	► Play p311_006.wav	► Play	► Play
NAUTILUS/VCM_s	► Play p311_006.wav	► Play	► Play

Analysis

In this section, we present speech samples generated from several slightly different setups of the unsupervised voice cloning procedure. See our paper for the configuration of each setup. The experimental conditions are the same as in Scenario B.

p345	TTS_u	VCM_u
► Play	"To the Hebrews it was a token that there would be no more universal floods."	p311_014.wav ► Play
► Play		p311_014.wav ► Play
N	► Play	► Play
A	► Play	► Play
B	► Play	► Play
C	► Play	► Play
D	► Play	► Play

p294	TTS_u	VCM_u
► Play	"Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob."	p299_003.wav ► Play
► Play		p299_003.wav ► Play
N	► Play	► Play
A	► Play	► Play
B	► Play	► Play
C	► Play	► Play
D	► Play	► Play
