Speech samples for "NAUTILUS: a Versatile Voice Cloning System"

Authors: Hieu-Thi Luong, Junichi Yamagishi

A preprint version of this paper can be found at https://arxiv.org/abs/2005.11004


You may be interested in our follow-up work on cross-lingual text-to-speech and voice conversion.

Scenario A

In this scenario, we reenacted the SPOKE task of the Voice Conversion Challenge 2018 to test our method. Same-gender and cross-gender voice conversions are treated as two separate cases.

Female target speakers VCC2TF1 VCC2TF2
► Play ► Play
System Input ► Play ► Play
XV "and the whole ship creaking, groaning, and jumping like a manufactory" ► Play ► Play
N10= ► Play VCC2SF4_30015.wav ► Play ► Play
N10× ► Play VCC2SM3_30015.wav ► Play ► Play
NAUTILUS/TTSu "and the whole ship creaking, groaning, and jumping like a manufactory" ► Play ► Play
NAUTILUS/VCAu= ► Play VCC2SF4_30015.wav ► Play ► Play
NAUTILUS/VCAu× ► Play VCC2SM3_30015.wav ► Play ► Play


Male target speakers VCC2TM1 VCC2TM2
► Play ► Play
System Input ► Play ► Play
XV "The proper course to pursue is to offer your name and address" ► Play ► Play
N10= ► Play VCC2SM4_30004.wav ► Play ► Play
N10× ► Play VCC2SF3_30004.wav ► Play ► Play
NAUTILUS/TTSu "The proper course to pursue is to offer your name and address" ► Play ► Play
NAUTILUS/VCAu= ► Play VCC2SM4_30004.wav ► Play ► Play
NAUTILUS/VCAu× ► Play VCC2SF3_30004.wav ► Play ► Play

* Speech samples of N13 and N17 can be found in the results of VCC2018.

Scenario B

In this scenario, we use two American speakers as standard "easy" target speakers and two Mandarin-accented speakers as unique "difficult" targets to evaluate the robustness of voice cloning methods.

American-accent target speakers (L1) p294 p345
► Play ► Play
System Input ► Play ► Play
XV "People look, but no one ever finds it" ► Play ► Play
FT ► Play ► Play
NAUTILUS/TTSu ► Play ► Play
NAUTILUS/TTSs ► Play ► Play
NAUTILUS/VCMu ► Play p299_010.wav ► Play ► Play
NAUTILUS/VCMs ► Play ► Play

Mandarin-accent target speakers (L2) MF6 MM6
► Play ► Play
System Input ► Play ► Play
XV "When the sunlight strikes raindrops in the air, they act as a prism and form a rainbow." ► Play ► Play
FT ► Play ► Play
NAUTILUS/TTSu ► Play ► Play
NAUTILUS/TTSs ► Play ► Play
NAUTILUS/VCMu ► Play p311_006.wav ► Play ► Play
NAUTILUS/VCMs ► Play ► Play

Analysis

In this section, we present speech samples generated from several slightly different setups of the unsupervised voice cloning procedure. See our paper for the configuration of each setup. The experimental conditions are the same as in Scenario B.

p345 TTSu VCMu
► Play "To the Hebrews it was a token that there would be no more universal floods." p311_014.wav
► Play
► Play
N ► Play ► Play
A ► Play ► Play
B ► Play ► Play
C ► Play ► Play
D ► Play ► Play

p294 TTSu VCMu
► Play "Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob." p299_003.wav
► Play
► Play
N ► Play ► Play
A ► Play ► Play
B ► Play ► Play
C ► Play ► Play
D ► Play ► Play
