NEURAL SOURCE-FILTER-BASED WAVEFORM MODEL FOR STATISTICAL PARAMETRIC SPEECH SYNTHESIS

Authors: Xin Wang, Shinji Takaki, Junichi Yamagishi
Preprint paper, code, scripts, and pretrained models (including NSF and WaveNet)
Date published: 29 Oct 2018 (last updated 25 Dec 2018)

English samples can be generated using the scripts and pretrained models at the GitHub link above.
The English models, including WaveNet, were trained using the same data configuration as in another work of ours.
The NSF trained on CMU-arctic SLT differs slightly from the NSF in the paper.


Japanese samples (16kHz, synthetic speech only)

Comparison of NSF, WaveNet, WORLD vocoder

_AOZORAR_03372_T01
WORLD vocoder | WAD (WaveNet, mu-law, 10-bit) | WAC (single Gaussian WaveNet) | NSF (Neural source-filter model)
Copy-synthesis (given natural acoustic features):
Text-to-speech (text-analyzer + acoustic models + waveform models):

_NIKKEIR_00257_T01
WORLD vocoder | WAD (WaveNet, mu-law, 10-bit) | WAC (single Gaussian WaveNet) | NSF (Neural source-filter model)
Copy-synthesis (given natural acoustic features):
Text-to-speech (text-analyzer + acoustic models + waveform models):

_AOZORAR_09534_T01
WORLD vocoder | WAD (WaveNet, mu-law, 10-bit) | WAC (single Gaussian WaveNet) | NSF (Neural source-filter model)
Copy-synthesis (given natural acoustic features):
Text-to-speech (text-analyzer + acoustic models + waveform models):

_NIKKEIR_03132_T01
WORLD vocoder | WAD (WaveNet, mu-law, 10-bit) | WAC (single Gaussian WaveNet) | NSF (Neural source-filter model)
Copy-synthesis (given natural acoustic features):
Text-to-speech (text-analyzer + acoustic models + waveform models):

Ablation test on NSF (16kHz; see the paper for the notation)

_AOZORAR_03372_T01
L1 | L2 | L3 | L4 | L5
Copy-synthesis (given natural acoustic features):
S1 | S2 | S3 | N1 | N2
Copy-synthesis (given natural acoustic features):

_NIKKEIR_00257_T01
L1 | L2 | L3 | L4 | L5
Copy-synthesis (given natural acoustic features):
S1 | S2 | S3 | N1 | N2
Copy-synthesis (given natural acoustic features):

HTS Engine + NSF

phrase01
HTS Engine | HTS Engine + NSF
Text-to-speech (HTS-Engine: OpenJTalk + HMM acoustic models)

phrase02
HTS Engine | HTS Engine + NSF
Text-to-speech (HTS-Engine: OpenJTalk + HMM acoustic models)

phrase03
HTS Engine | HTS Engine + NSF
Text-to-speech (HTS-Engine: OpenJTalk + HMM acoustic models)

phrase04
HTS Engine | HTS Engine + NSF
Text-to-speech (HTS-Engine: OpenJTalk + HMM acoustic models)

phrase05
HTS Engine | HTS Engine + NSF
Text-to-speech (HTS-Engine: OpenJTalk + HMM acoustic models)


English samples (CMU-slt)

Comparison with WORLD vocoder and WaveNet (16kHz)

1: Author of the danger trail, Philip Steels, etc.
Natural:
WORLD vocoder | WaveNet (mu-law, 10-bit) | Neural source-filter model
Given natural acoustic features:

2: To my surprise he began to show actual enthusiasm in my favor.
Natural:
WORLD vocoder | WaveNet (mu-law, 10-bit) | Neural source-filter model
Given natural acoustic features:

3: In a flash Philip followed its direction.
Natural:
WORLD vocoder | WaveNet (mu-law, 10-bit) | Neural source-filter model
Given natural acoustic features:

4: Much, replied Jeanne, as tersely.
Natural:
WORLD vocoder | WaveNet (mu-law, 10-bit) | Neural source-filter model
Given natural acoustic features:

5: I suppose you picked that lingo up among the Indians.
Natural:
WORLD vocoder | WaveNet (mu-law, 10-bit) | Neural source-filter model
Given natural acoustic features:

6: It was Jeanne singing softly over beyond the rocks.
Natural:
WORLD vocoder | WaveNet (mu-law, 10-bit) | Neural source-filter model
Given natural acoustic features:




Ablation test (16kHz)

On CMU-arctic SLT, we trained several variants of the proposed model, including one new variant (NSF-MSE) that is not reported in the paper. NSF-S3 sounds acceptable on CMU-arctic SLT, possibly because the natural F0 contours of SLT are relatively flat: the log-F0 GV (global variance) of SLT is 21 on the test set.
On the Japanese data, the log-F0 GV is 61, and the perceived pitch of the waveforms generated by NSF-S3 trembles more.
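
The comparison above depends on how much the log-F0 contour varies within an utterance. The following is a minimal sketch of one common way to measure that, not the script used to obtain the values 21 and 61: the scale and exact definition behind those numbers are not stated on this page. The sketch assumes that GV is the per-utterance variance of natural-log F0 over voiced frames, averaged over utterances, and that unvoiced frames are marked with F0 = 0. The function names and the toy F0 values are hypothetical.

```python
import numpy as np


def log_f0_gv(f0_hz):
    """Variance of log-F0 over the voiced frames of one utterance.

    Assumption: unvoiced frames are stored as 0 Hz and are excluded.
    """
    f0 = np.asarray(f0_hz, dtype=np.float64)
    voiced = f0 > 0.0
    if voiced.sum() < 2:
        # Too few voiced frames to estimate a variance.
        return float("nan")
    return float(np.var(np.log(f0[voiced])))


def mean_log_f0_gv(utterances):
    """Average the per-utterance GV, skipping utterances with no estimate."""
    values = [log_f0_gv(u) for u in utterances]
    return float(np.nanmean(values))


# Toy F0 tracks in Hz (hypothetical values, 0 = unvoiced frame).
flat_contour = [0.0, 200.0, 202.0, 198.0, 0.0, 201.0]
lively_contour = [0.0, 150.0, 220.0, 300.0, 0.0, 180.0]

print("flat contour GV:  ", log_f0_gv(flat_contour))
print("lively contour GV:", log_f0_gv(lively_contour))
```

Under this definition a flatter contour gives a smaller GV, which matches the direction of the comparison between SLT and the Japanese data.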

1: Author of the danger trail, Philip Steels, etc.
Natural:
NSF | NSF-L3 | NSF-S3 | NSF-N2
Given natural acoustic features:
NSF-MSE | NSF-FFCond | NSF-noF0inCond | NSF-noNoiseOnSine
Given natural acoustic features:

2: To my surprise he began to show actual enthusiasm in my favor.
Natural:
NSF | NSF-L3 | NSF-S3 | NSF-N2
Given natural acoustic features:
NSF-MSE | NSF-FFCond | NSF-noF0inCond | NSF-noNoiseOnSine
Given natural acoustic features:

3: In a flash Philip followed its direction.
Natural:
NSF | NSF-L3 | NSF-S3 | NSF-N2
Given natural acoustic features:
NSF-MSE | NSF-FFCond | NSF-noF0inCond | NSF-noNoiseOnSine
Given natural acoustic features:

4: Much, replied Jeanne, as tersely.
Natural:
NSF | NSF-L3 | NSF-S3 | NSF-N2
Given natural acoustic features:
NSF-MSE | NSF-FFCond | NSF-noF0inCond | NSF-noNoiseOnSine
Given natural acoustic features:

5: I suppose you picked that lingo up among the Indians.
Natural:
NSF | NSF-L3 | NSF-S3 | NSF-N2
Given natural acoustic features:
NSF-MSE | NSF-FFCond | NSF-noF0inCond | NSF-noNoiseOnSine
Given natural acoustic features:

6: It was Jeanne singing softly over beyond the rocks.
Natural:
NSF | NSF-L3 | NSF-S3 | NSF-N2
Given natural acoustic features:
NSF-MSE | NSF-FFCond | NSF-noF0inCond | NSF-noNoiseOnSine
Given natural acoustic features:




NSF 32kHz samples

1: Author of the danger trail, Philip Steels, etc.
Natural:
Neural source-filter model
Given natural acoustic features:

2: To my surprise he began to show actual enthusiasm in my favor.
Natural:
Neural source-filter model
Given natural acoustic features:

3: In a flash Philip followed its direction.
Natural:
Neural source-filter model
Given natural acoustic features:

4: Much, replied Jeanne, as tersely.
Natural:
Neural source-filter model
Given natural acoustic features:

5: I suppose you picked that lingo up among the Indians.
Natural:
Neural source-filter model
Given natural acoustic features:

6: It was Jeanne singing softly over beyond the rocks.
Natural:
Neural source-filter model
Given natural acoustic features:




Some related works on waveform modeling


Acknowledgement
WORLD: Dr. Morise from Yamanashi University, Japan, https://github.com/mmorise/World
These synthetic speech samples were constructed using the CMU Arctic database. The CMU_ARCTIC databases were constructed at the Language Technologies Institute at Carnegie Mellon University. See http://festvox.org/cmu_arctic/ for more details.
Creative Commons License