First NSF
Messages
Paper:
Wang, X., Takaki, S. & Yamagishi, J. Neural source-filter-based waveform model for statistical parametric speech synthesis. in Proc. ICASSP 5916–5920 (2019). DOI:10.1109/ICASSP.2019.8682298
BibTeX:
@inproceedings{wang2018neural,
  author = {Wang, Xin and Takaki, Shinji and Yamagishi, Junichi},
  booktitle = {Proc. ICASSP},
  doi = {10.1109/ICASSP.2019.8682298},
  pages = {5916--5920},
  publisher = {IEEE},
  title = {{Neural Source-filter-based Waveform Model for Statistical Parametric Speech Synthesis}},
  url = {https://ieeexplore.ieee.org/document/8682298/},
  year = {2019}
}
Experiments were based on the ATR-Ximera F009 voice (Japanese, commercial database).
Another experiment was based on the CMU_ARCTIC database; those samples are also provided on this page.
Code is available: you need both the CURRENNT toolkit and the accompanying scripts. A subfolder in the script repository is dedicated to this project.
Slides for the ICASSP 2019 presentation can be found on this page; the PDF and PPTX files can also be downloaded directly.
Note that:
Copy-synthesis refers to waveform generation given natural acoustic features, i.e. features extracted from natural speech.
Text-to-speech refers to waveform generation given acoustic features predicted from the text input.
Audio samples (Japanese)
Natural waveform samples cannot be released online due to licensing restrictions.
Main test
_AOZORAR_03372_T01
 | WORLD vocoder | WAD (WaveNet, mu-law, 10-bit) | WAC (single Gaussian WaveNet) | NSF (Neural source-filter model)
---|---|---|---|---
Copy-synthesis | (audio) | (audio) | (audio) | (audio)
Text-to-speech | (audio) | (audio) | (audio) | (audio)
The same table is repeated for three further samples.
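The WAD system in the table above is described as a WaveNet with 10-bit mu-law output. As a rough illustration of what that quantization means, the sketch below applies standard mu-law companding with 1024 levels. It assumes mu = 1023 and input samples in [-1, 1]; the function names are hypothetical and this is not taken from the authors' code.

```python
import math

MU = 1023  # assumption: 10 bits -> 2**10 = 1024 classes, so mu = 1023


def mulaw_encode(x, mu=MU):
    """Compand a sample in [-1, 1] and quantize it to an integer in [0, mu]."""
    x = max(-1.0, min(1.0, x))  # clip to the valid range
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int(round((y + 1.0) / 2.0 * mu))


def mulaw_decode(q, mu=MU):
    """Map an integer class in [0, mu] back to a sample in [-1, 1]."""
    y = 2.0 * q / mu - 1.0
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)
```

A discrete-output WaveNet predicts a distribution over these integer classes at each time step, so the reconstruction error of the round trip `mulaw_decode(mulaw_encode(x))` bounds how faithfully any waveform can be represented.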
Ablation test
Please check the original paper for the meaning of each system ID.
_NIKKEIR_00257_T01
 | L1 | L2 | L3 | L4 | L5
---|---|---|---|---|---
Copy-synthesis | (audio) | (audio) | (audio) | (audio) | (audio)
 | S1 | S2 | S3 | N1 | N2
Copy-synthesis | (audio) | (audio) | (audio) | (audio) | (audio)
HTS Engine+NSF
HTS-Engine: OpenJTalk + HMM acoustic models + vocoder.
HTS-Engine+NSF: OpenJTalk + HMM acoustic models + NSF.
phrase01
 | HTS Engine | HTS Engine + NSF
---|---|---
Text-to-speech | (audio) | (audio)
The same table is repeated for four further phrases.
Audio samples (English)
The models were trained using the CMU_ARCTIC SLT voice.
Main test
1: Author of the danger trail, Philip Steels, etc.
Natural: (audio)
 | WORLD vocoder | WaveNet (mu-law, 10-bit) | Neural source-filter model
---|---|---|---
Copy-synthesis | (audio) | (audio) | (audio)
2: To my surprise he began to show actual enthusiasm in my favor.
Natural: (audio)
 | WORLD vocoder | WaveNet (mu-law, 10-bit) | Neural source-filter model
---|---|---|---
Copy-synthesis | (audio) | (audio) | (audio)
3: In a flash Philip followed its direction.
Natural: (audio)
 | WORLD vocoder | WaveNet (mu-law, 10-bit) | Neural source-filter model
---|---|---|---
Copy-synthesis | (audio) | (audio) | (audio)
4: Much, replied Jeanne, as tersely.
Natural: (audio)
 | WORLD vocoder | WaveNet (mu-law, 10-bit) | Neural source-filter model
---|---|---|---
Copy-synthesis | (audio) | (audio) | (audio)
5: I suppose you picked that lingo up among the Indians.
Natural: (audio)
 | WORLD vocoder | WaveNet (mu-law, 10-bit) | Neural source-filter model
---|---|---|---
Copy-synthesis | (audio) | (audio) | (audio)
6: It was Jeanne singing softly over beyond the rocks.
Natural: (audio)
 | WORLD vocoder | WaveNet (mu-law, 10-bit) | Neural source-filter model
---|---|---|---
Copy-synthesis | (audio) | (audio) | (audio)
Ablation test
NSF-L3: uses a single spectral amplitude distance (only the STFT loss Ls1 in the paper)
NSF-S3: uses noise excitation only
NSF-N2: sets b=0 in each transformation layer of the filter module
NSF-MSE: trained by minimizing the mean squared error of the waveform values
NSF-FFCond: uses a simple feedforward layer as the condition module
NSF-noF0inCond: F0 is not fed to the filter module
NSF-noNoiseOnSine: the sine waveform (in voiced regions) does not contain additive noise
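To make the excitation-related variants above more concrete, here is a minimal sketch of a sine-plus-noise source signal with switches that mimic NSF-S3 (noise only) and NSF-noNoiseOnSine (no additive noise on the sine). The function name, the flags, the constants `alpha` and `sigma`, the noise scaling in unvoiced regions, and the omission of a random initial phase are all illustrative assumptions; please check the paper and the released code for the actual definition.

```python
import math
import random


def sine_excitation(f0, sample_rate=16000, alpha=0.1, sigma=0.003,
                    noise_only=False, noise_on_sine=True, seed=0):
    """Build an excitation signal from per-sample F0 values (Hz, 0 = unvoiced).

    noise_only=True     -> noise everywhere (in the spirit of NSF-S3)
    noise_on_sine=False -> clean sine in voiced regions (NSF-noNoiseOnSine)
    """
    rng = random.Random(seed)
    phase = 0.0
    out = []
    for f in f0:
        noise = rng.gauss(0.0, sigma)            # Gaussian noise sample
        phase += 2.0 * math.pi * f / sample_rate  # accumulate instantaneous phase
        if f > 0 and not noise_only:
            sample = alpha * math.sin(phase)      # voiced: sine that follows F0
            if noise_on_sine:
                sample += noise
        else:
            # unvoiced (or noise-only): noise rescaled to a level comparable
            # to the sine amplitude (assumed scaling)
            sample = alpha / (3.0 * sigma) * noise
        out.append(sample)
    return out
```

For example, `sine_excitation([0.0] * 100 + [200.0] * 400)` yields 100 noise samples followed by 400 samples of a noisy 200 Hz sine; passing `noise_only=True` or `noise_on_sine=False` gives the two ablated sources for comparison.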
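1: Author of the danger trail, Philip Steels, etc.
Natural: (audio)
 | NSF | NSF-L3 | NSF-S3 | NSF-N2
---|---|---|---|---
Copy-synthesis | (audio) | (audio) | (audio) | (audio)
 | NSF-MSE | NSF-FFCond | NSF-noF0inCond | NSF-noNoiseOnSine
Copy-synthesis | (audio) | (audio) | (audio) | (audio)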
2: To my surprise he began to show actual enthusiasm in my favor.
Natural: (audio)
 | NSF | NSF-L3 | NSF-S3 | NSF-N2
---|---|---|---|---
Copy-synthesis | (audio) | (audio) | (audio) | (audio)
 | NSF-MSE | NSF-FFCond | NSF-noF0inCond | NSF-noNoiseOnSine
Copy-synthesis | (audio) | (audio) | (audio) | (audio)
3: In a flash Philip followed its direction.
Natural: (audio)
 | NSF | NSF-L3 | NSF-S3 | NSF-N2
---|---|---|---|---
Copy-synthesis | (audio) | (audio) | (audio) | (audio)
 | NSF-MSE | NSF-FFCond | NSF-noF0inCond | NSF-noNoiseOnSine
Copy-synthesis | (audio) | (audio) | (audio) | (audio)
4: Much, replied Jeanne, as tersely.
Natural: (audio)
 | NSF | NSF-L3 | NSF-S3 | NSF-N2
---|---|---|---|---
Copy-synthesis | (audio) | (audio) | (audio) | (audio)
 | NSF-MSE | NSF-FFCond | NSF-noF0inCond | NSF-noNoiseOnSine
Copy-synthesis | (audio) | (audio) | (audio) | (audio)
5: I suppose you picked that lingo up among the Indians.
Natural: (audio)
 | NSF | NSF-L3 | NSF-S3 | NSF-N2
---|---|---|---|---
Copy-synthesis | (audio) | (audio) | (audio) | (audio)
 | NSF-MSE | NSF-FFCond | NSF-noF0inCond | NSF-noNoiseOnSine
Copy-synthesis | (audio) | (audio) | (audio) | (audio)
6: It was Jeanne singing softly over beyond the rocks.
Natural: (audio)
 | NSF | NSF-L3 | NSF-S3 | NSF-N2
---|---|---|---|---
Copy-synthesis | (audio) | (audio) | (audio) | (audio)
 | NSF-MSE | NSF-FFCond | NSF-noF0inCond | NSF-noNoiseOnSine
Copy-synthesis | (audio) | (audio) | (audio) | (audio)