First NSF

Messages

  • Paper:

    Wang, X., Takaki, S. & Yamagishi, J. Neural source-filter-based waveform model for statistical parametric speech synthesis. in Proc. ICASSP 5916–5920 (2019). DOI:10.1109/ICASSP.2019.8682298

  • BibTeX:

    @inproceedings{wang2018neural,
    author = {Wang, Xin and Takaki, Shinji and Yamagishi, Junichi},
    booktitle = {Proc. ICASSP},
    doi = {10.1109/ICASSP.2019.8682298},
    pages = {5916--5920},
    publisher = {IEEE},
    title = {{Neural Source-filter-based Waveform Model for Statistical Parametric Speech Synthesis}},
    url = {https://ieeexplore.ieee.org/document/8682298/},
    year = {2019}
    }
    
  • The main experiments used the ATR-Ximera F009 voice (Japanese, commercial database).

  • An additional experiment used the CMU_ARCTIC database; those samples are also provided on this page.

  • Code is available. You need both the CURRENNT toolkit and the scripts; a subfolder in the script repository is dedicated to this project.

  • Slides for the ICASSP 2019 presentation can be found on this page. You can also download the PDF and PPTX files directly.

  • Note that:

    Copy-synthesis refers to waveform generation from natural acoustic features.

    Text-to-speech refers to waveform generation from acoustic features predicted from the text input.


Audio samples (Japanese)

Natural waveform samples cannot be released online because of licensing restrictions.

Main test

_AOZORAR_03372_T01
WORLD vocoder | WAD (WaveNet, mu-law, 10-bit) | WAC (single Gaussian WaveNet) | NSF (Neural source-filter model)
Copy-synthesis
Text-to-speech
_NIKKEIR_00257_T01
WORLD vocoder | WAD (WaveNet, mu-law, 10-bit) | WAC (single Gaussian WaveNet) | NSF (Neural source-filter model)
Copy-synthesis
Text-to-speech
_AOZORAR_09534_T01
WORLD vocoder | WAD (WaveNet, mu-law, 10-bit) | WAC (single Gaussian WaveNet) | NSF (Neural source-filter model)
Copy-synthesis
Text-to-speech
_NIKKEIR_03132_T01
WORLD vocoder | WAD (WaveNet, mu-law, 10-bit) | WAC (single Gaussian WaveNet) | NSF (Neural source-filter model)
Copy-synthesis
Text-to-speech
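
For readers unfamiliar with the "mu-law, 10-bit" label on the WaveNet baseline, the following is a minimal sketch of standard mu-law companding with 1024 quantization levels. It is not taken from the paper's code: the function names `mulaw_encode` and `mulaw_decode` are hypothetical, and the exact quantization details used for the baseline may differ.

```python
import numpy as np

def mulaw_encode(x, bits=10):
    """Compand a waveform in [-1, 1] and quantize it to 2**bits integer classes."""
    mu = 2 ** bits - 1  # 1023 for 10 bits
    x = np.clip(np.asarray(x, dtype=np.float64), -1.0, 1.0)
    # Standard mu-law compression; the result stays in [-1, 1].
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    # Map [-1, 1] to integer class indices in [0, mu].
    return np.round((y + 1.0) / 2.0 * mu).astype(np.int64)

def mulaw_decode(q, bits=10):
    """Invert mulaw_encode up to the quantization error."""
    mu = 2 ** bits - 1
    y = 2.0 * np.asarray(q, dtype=np.float64) / mu - 1.0
    # Inverse companding: (1 + mu)**|y| - 1, scaled back by mu.
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

# Example: a 100 Hz tone sampled at 16 kHz (illustrative values only).
t = np.arange(1600) / 16000.0
wave = 0.5 * np.sin(2.0 * np.pi * 100.0 * t)
classes = mulaw_encode(wave)
restored = mulaw_decode(classes)
```

A WaveNet of this kind predicts a categorical distribution over the integer classes, which is why the waveform is quantized before training and expanded again after sampling.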

Ablation test

Please see the original paper for the meaning of each system ID.

_NIKKEIR_00257_T01
L1 | L2 | L3 | L4 | L5
Copy-synthesis
S1 | S2 | S3 | N1 | N2
Copy-synthesis

HTS Engine + NSF

HTS Engine: OpenJTalk + HMM acoustic models + vocoder.

HTS Engine + NSF: OpenJTalk + HMM acoustic models + NSF.

phrase01
HTS Engine | HTS Engine + NSF
Text-to-speech
phrase02
HTS Engine | HTS Engine + NSF
Text-to-speech
phrase03
HTS Engine | HTS Engine + NSF
Text-to-speech
phrase04
HTS Engine | HTS Engine + NSF
Text-to-speech
phrase05
HTS Engine | HTS Engine + NSF
Text-to-speech

Audio samples (English)

The models were trained on the CMU_ARCTIC SLT voice.

Main test

1: Author of the danger trail, Philip Steels, etc.
Natural:
WORLD vocoder | WaveNet (mu-law, 10-bit) | Neural source-filter model
Copy-synthesis

2: To my surprise he began to show actual enthusiasm in my favor.
Natural:
WORLD vocoder | WaveNet (mu-law, 10-bit) | Neural source-filter model
Copy-synthesis

3: In a flash Philip followed its direction.
Natural:
WORLD vocoder | WaveNet (mu-law, 10-bit) | Neural source-filter model
Copy-synthesis

4: Much, replied Jeanne, as tersely.
Natural:
WORLD vocoder | WaveNet (mu-law, 10-bit) | Neural source-filter model
Copy-synthesis

5: I suppose you picked that lingo up among the Indians.
Natural:
WORLD vocoder | WaveNet (mu-law, 10-bit) | Neural source-filter model
Copy-synthesis

6: It was Jeanne singing softly over beyond the rocks.
Natural:
WORLD vocoder | WaveNet (mu-law, 10-bit) | Neural source-filter model
Copy-synthesis

Ablation test

  • NSF-L3: using a single spectral amplitude distance (only the STFT loss Ls1 in the paper)

  • NSF-S3: using noise excitation only

  • NSF-N2: setting b=0 in each transformation layer of the filter module

  • NSF-MSE: trained by minimizing the mean-square-error of waveform values

  • NSF-FFCond: using a simple feedforward layer as the condition module

  • NSF-noF0inCond: F0 is not fed to the filter module

  • NSF-noNoiseOnSine: the sine waveform (in voiced regions) contains no additive noise
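
To make the excitation-related variants above more concrete, here is a minimal NumPy sketch of a sine-based excitation signal driven by F0, with optional additive noise in voiced regions and a noise-only mode. It is an illustration, not the released implementation: the function name `sine_excitation`, the flags `add_noise_to_sine` and `noise_only`, the default amplitude and noise level, and the scaling of the unvoiced noise are all assumptions chosen for this example, and the random initial phase is omitted. The two flags loosely mirror NSF-noNoiseOnSine and NSF-S3.

```python
import numpy as np

def sine_excitation(f0, sr=16000, amp=0.1, noise_std=0.003,
                    add_noise_to_sine=True, noise_only=False, seed=0):
    """Build an excitation signal from per-sample F0 in Hz (0 marks unvoiced).

    All default values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    f0 = np.asarray(f0, dtype=np.float64)
    noise = rng.normal(0.0, noise_std, size=f0.shape)
    # Scale the noise so that it has roughly the same range as the sine (assumed scaling).
    scaled_noise = amp / (3.0 * noise_std) * noise
    if noise_only:
        # Noise excitation everywhere, with F0 ignored.
        return scaled_noise
    # The instantaneous phase is the running sum of the normalized frequency.
    phase = 2.0 * np.pi * np.cumsum(f0 / sr)
    sine = amp * np.sin(phase)
    if add_noise_to_sine:
        sine = sine + noise
    # Sine (optionally noisy) in voiced regions, scaled noise in unvoiced regions.
    return np.where(f0 > 0, sine, scaled_noise)

# Example: 100 unvoiced samples followed by 0.1 s of a 200 Hz voiced segment.
f0 = np.concatenate([np.zeros(100), np.full(1600, 200.0)])
full = sine_excitation(f0)
clean = sine_excitation(f0, add_noise_to_sine=False)
noise_exc = sine_excitation(f0, noise_only=True)
```

With the same seed, `full` and `clean` differ only in voiced regions, which makes it easy to listen to or plot the effect of the additive noise in isolation.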

1: Author of the danger trail, Philip Steels, etc.
Natural:
NSF | NSF-L3 | NSF-S3 | NSF-N2
Copy-synthesis
NSF-MSE | NSF-FFCond | NSF-noF0inCond | NSF-noNoiseOnSine
Copy-synthesis

2: To my surprise he began to show actual enthusiasm in my favor.
Natural:
NSF | NSF-L3 | NSF-S3 | NSF-N2
Copy-synthesis
NSF-MSE | NSF-FFCond | NSF-noF0inCond | NSF-noNoiseOnSine
Copy-synthesis

3: In a flash Philip followed its direction.
Natural:
NSF | NSF-L3 | NSF-S3 | NSF-N2
Copy-synthesis
NSF-MSE | NSF-FFCond | NSF-noF0inCond | NSF-noNoiseOnSine
Copy-synthesis

4: Much, replied Jeanne, as tersely.
Natural:
NSF | NSF-L3 | NSF-S3 | NSF-N2
Copy-synthesis
NSF-MSE | NSF-FFCond | NSF-noF0inCond | NSF-noNoiseOnSine
Copy-synthesis

5: I suppose you picked that lingo up among the Indians.
Natural:
NSF | NSF-L3 | NSF-S3 | NSF-N2
Copy-synthesis
NSF-MSE | NSF-FFCond | NSF-noF0inCond | NSF-noNoiseOnSine
Copy-synthesis

6: It was Jeanne singing softly over beyond the rocks.
Natural:
NSF | NSF-L3 | NSF-S3 | NSF-N2
Copy-synthesis
NSF-MSE | NSF-FFCond | NSF-noF0inCond | NSF-noNoiseOnSine
Copy-synthesis