A pre-print version of this paper is available at https://arxiv.org/abs/2005.11004
You may also be interested in our follow-up work on cross-lingual text-to-speech and voice conversion.
In this scenario, we reenacted the SPOKE task of the Voice Conversion Challenge 2018 to test our method. Same-gender and cross-gender voice conversion are treated as two separate entities.
Female target speakers | | VCC2TF1 | VCC2TF2 |
---|---|---|---|
| | ► Play | ► Play |
System | Input | ► Play | ► Play |
XV | "and the whole ship creaking, groaning, and jumping like a manufactory" | ► Play | ► Play |
N10= | ► Play VCC2SF4_30015.wav | ► Play | ► Play |
N10× | ► Play VCC2SM3_30015.wav | ► Play | ► Play |
NAUTILUS/TTSu | "and the whole ship creaking, groaning, and jumping like a manufactory" | ► Play | ► Play |
NAUTILUS/VCAu= | ► Play VCC2SF4_30015.wav | ► Play | ► Play |
NAUTILUS/VCAu× | ► Play VCC2SM3_30015.wav | ► Play | ► Play |
Male target speakers | | VCC2TM1 | VCC2TM2 |
---|---|---|---|
| | ► Play | ► Play |
System | Input | ► Play | ► Play |
XV | "The proper course to pursue is to offer your name and address" | ► Play | ► Play |
N10= | ► Play VCC2SM4_30004.wav | ► Play | ► Play |
N10× | ► Play VCC2SF3_30004.wav | ► Play | ► Play |
NAUTILUS/TTSu | "The proper course to pursue is to offer your name and address" | ► Play | ► Play |
NAUTILUS/VCAu= | ► Play VCC2SM4_30004.wav | ► Play | ► Play |
NAUTILUS/VCAu× | ► Play VCC2SF3_30004.wav | ► Play | ► Play |
* Speech samples of N13 and N17 can be found in the results of VCC2018.
In this scenario, we use two American-accent speakers as standard "easy" targets and two Mandarin-accent speakers as unique "difficult" targets to evaluate the robustness of voice cloning methods.
American-accent target speakers (L1) | | p294 | p345 |
---|---|---|---|
| | ► Play | ► Play |
System | Input | ► Play | ► Play |
XV | "People look, but no one ever finds it" | ► Play | ► Play |
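FT | "People look, but no one ever finds it" | ► Play | ► Play |
NAUTILUS/TTSu | "People look, but no one ever finds it" | ► Play | ► Play |
NAUTILUS/TTSs | "People look, but no one ever finds it" | ► Play | ► Play |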
NAUTILUS/VCMu | ► Play p299_010.wav | ► Play | ► Play |
NAUTILUS/VCMs | ► Play p299_010.wav | ► Play | ► Play |
Mandarin-accent target speakers (L2) | | MF6 | MM6 |
---|---|---|---|
| | ► Play | ► Play |
System | Input | ► Play | ► Play |
XV | "When the sunlight strikes raindrops in the air, they act as a prism and form a rainbow." | ► Play | ► Play |
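FT | "When the sunlight strikes raindrops in the air, they act as a prism and form a rainbow." | ► Play | ► Play |
NAUTILUS/TTSu | "When the sunlight strikes raindrops in the air, they act as a prism and form a rainbow." | ► Play | ► Play |
NAUTILUS/TTSs | "When the sunlight strikes raindrops in the air, they act as a prism and form a rainbow." | ► Play | ► Play |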
NAUTILUS/VCMu | ► Play p311_006.wav | ► Play | ► Play |
NAUTILUS/VCMs | ► Play p311_006.wav | ► Play | ► Play |
In this section, we present speech samples generated with several slightly different setups of the unsupervised voice cloning procedure. See our paper for the configuration of each setup. The experimental conditions are the same as in scenario B.