.. samples-xin documentation master file, created by
   sphinx-quickstart on Sun Apr 25 22:58:24 2021.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

.. _label-nsf-v1:

First NSF
*********

Messages
--------

* Paper: Wang, X., Takaki, S. & Yamagishi, J. Neural source-filter-based waveform model for statistical parametric speech synthesis. in Proc. ICASSP 5916–5920 (2019). `DOI:10.1109/ICASSP.2019.8682298 `__

* BibTex::

    @inproceedings{wang2018neural,
      author = {Wang, Xin and Takaki, Shinji and Yamagishi, Junichi},
      booktitle = {Proc. ICASSP},
      doi = {10.1109/ICASSP.2019.8682298},
      pages = {5916--5920},
      publisher = {IEEE},
      title = {{Neural Source-filter-based Waveform Model for Statistical Parametric Speech Synthesis}},
      url = {https://ieeexplore.ieee.org/document/8682298/},
      year = {2019}
    }

* The main experiments were based on the ATR-Ximera F009 voice (Japanese, commercial database)

* Another experiment was based on the `CMU_ARCTIC database `_, and those samples are also uploaded

* Code is available. You need both the `CURRENNT toolkit `_ and `scripts `_. `This subfolder `_ in the script repository is made for this project

* Slides for the ICASSP 2019 presentation can be found on `this page `_. You can also directly download the `PDF <./docs/ICASSP2019_SLP-L8.6_Wang_Xin_v2_public.pdf>`_ and `pptx <./docs/ICASSP2019_SLP-L8.6_Wang_Xin_v2_public.pptx.tar.gz>`_

* Note that

  * Copy-synthesis refers to waveform generation given natural acoustic features
  * Text-to-speech refers to waveform generation given acoustic features predicted from the text input

|

Audio samples (Japanese)
------------------------

Natural waveform samples cannot be released online due to license restrictions.

Main test
=========

Each utterance is provided under two conditions (copy-synthesis and text-to-speech) for four systems:

* WORLD vocoder
* WAD (WaveNet, mu-law, 10-bit)
* WAC (single Gaussian WaveNet)
* NSF (Neural source-filter model)

Utterances:

* ``_AOZORAR_03372_T01``
* ``_NIKKEIR_00257_T01``
* ``_AOZORAR_09534_T01``
* ``_NIKKEIR_03132_T01``

Ablation test
=============

Ablation test on NSF (16 kHz, copy-synthesis only). Please check the original paper for the meaning of each system ID.

* Systems: L1, L2, L3, L4, L5, S1, S2, S3, N1, N2
* Utterances: ``_AOZORAR_03372_T01``, ``_NIKKEIR_00257_T01``

HTS Engine+NSF
==============

* HTS-Engine: OpenJTalk + HMM acoustic models + vocoder.
* HTS-Engine+NSF: OpenJTalk + HMM acoustic models + NSF.

Text-to-speech samples from both systems (HTS Engine, HTS Engine + NSF) are provided for five phrases:

* phrase01
* phrase02
* phrase03
* phrase04
* phrase05

|

Audio samples (English)
-----------------------

The models were trained using the CMU_ARCTIC SLT voice.

Main test
=========

For each sentence, the natural recording is provided alongside copy-synthesis samples from three systems:

* WORLD vocoder
* WaveNet (mu-law, 10-bit)
* Neural source-filter model

Sentences:

1. Author of the danger trail, Philip Steels, etc.
2. To my surprise he began to show actual enthusiasm in my favor.
3. In a flash Philip followed its direction.
4. Much, replied Jeanne, as tersely.
5. I suppose you picked that lingo up among the Indians.
6. It was Jeanne singing softly over beyond the rocks.

Ablation test
=============

* NSF-L3: using a single spectral amplitude distance (only STFT loss Ls1 in the paper)
* NSF-S3: using noise excitation only
* NSF-N2: setting b=0 in each transformation layer of the filter module
* NSF-MSE: trained by minimizing the mean-square-error of waveform values
* NSF-FFCond: using a simple feedforward layer as the condition module
* NSF-noF0inCond: F0 is not fed to the filter module
* NSF-noNoiseOnSine: sine waveform (in voiced regions) doesn't contain additive noise

For each sentence, the natural recording is provided alongside copy-synthesis samples from NSF, NSF-L3, NSF-S3, NSF-N2, NSF-MSE, NSF-FFCond, NSF-noF0inCond, and NSF-noNoiseOnSine.

Sentences:

1. Author of the danger trail, Philip Steels, etc.
2. To my surprise he began to show actual enthusiasm in my favor.
3. In a flash Philip followed its direction.
4. Much, replied Jeanne, as tersely.
5. I suppose you picked that lingo up among the Indians.
6. It was Jeanne singing softly over beyond the rocks.
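
Several ablation variants in the English list differ only in the source excitation (NSF-S3, NSF-noNoiseOnSine) or in the training criterion (NSF-L3, NSF-MSE). The NumPy sketch below illustrates those differences. It is not the CURRENNT implementation: the function names, the sine amplitude, the noise level, and the framing parameters are all assumptions chosen for illustration, so please check the paper for the actual definitions.

.. code-block:: python

    import numpy as np


    def source_excitation(f0, sample_rate=16000, sine_amp=0.1, noise_std=0.003,
                          mode="sine+noise", seed=0):
        """Build an excitation signal from a per-sample F0 contour (0 = unvoiced).

        mode="sine+noise":    sine plus additive noise in voiced regions
        mode="sine-no-noise": clean sine in voiced regions (cf. NSF-noNoiseOnSine)
        mode="noise-only":    noise everywhere, F0 ignored (cf. NSF-S3)
        Unvoiced samples carry noise only in every mode.
        sine_amp and noise_std are illustrative values (assumptions).
        """
        if mode not in ("sine+noise", "sine-no-noise", "noise-only"):
            raise ValueError(f"unknown mode: {mode}")
        f0 = np.asarray(f0, dtype=np.float64)
        rng = np.random.default_rng(seed)
        noise = rng.normal(0.0, noise_std, size=f0.shape)
        if mode == "noise-only":
            return noise
        # The instantaneous phase is the running sum of F0 / sample_rate cycles.
        sine = sine_amp * np.sin(2.0 * np.pi * np.cumsum(f0 / sample_rate))
        voiced_part = sine if mode == "sine-no-noise" else sine + noise
        return np.where(f0 > 0, voiced_part, noise)


    def log_spectral_amplitude_distance(natural, generated, frame_len=320,
                                        hop=80, fft_size=512, eps=1e-5):
        """Mean squared difference of log power spectra over STFT frames.

        A single-resolution sketch of a spectral amplitude distance; the
        framing parameters are assumptions. Inputs must be 1-D arrays of
        equal length, at least frame_len samples long.
        """
        window = np.hanning(frame_len)

        def log_power(x):
            x = np.asarray(x, dtype=np.float64)
            n_frames = 1 + (len(x) - frame_len) // hop
            starts = hop * np.arange(n_frames)[:, None]
            frames = x[starts + np.arange(frame_len)[None, :]] * window
            spectrum = np.fft.rfft(frames, n=fft_size, axis=-1)
            return np.log(np.abs(spectrum) ** 2 + eps)

        diff = log_power(natural) - log_power(generated)
        return 0.5 * float(np.mean(diff ** 2))


    def waveform_mse(natural, generated):
        """Mean squared error on raw waveform values (cf. NSF-MSE)."""
        natural = np.asarray(natural, dtype=np.float64)
        generated = np.asarray(generated, dtype=np.float64)
        return float(np.mean((natural - generated) ** 2))


    if __name__ == "__main__":
        # 0.5 s voiced at 100 Hz followed by 0.5 s unvoiced, at 16 kHz.
        f0 = np.concatenate([np.full(8000, 100.0), np.zeros(8000)])
        clean = source_excitation(f0, mode="sine-no-noise")
        print("spectral distance:", log_spectral_amplitude_distance(clean, -clean))
        print("waveform MSE:     ", waveform_mse(clean, -clean))

The usage example compares a signal with its sign-flipped copy. The spectral amplitude distance ignores phase and stays essentially at zero, whereas the waveform MSE is clearly non-zero. This is one way to see why the two criteria can drive training differently.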

.. toctree::
   :hidden:
   :maxdepth: 1