Hn-NSF and s-NSF

Messages

  • Paper:

    Wang, X., Takaki, S. & Yamagishi, J. Neural Source-Filter Waveform Models for Statistical Parametric Speech Synthesis. IEEE/ACM Trans. Audio, Speech, Lang. Process. 28, 402–415 (2020), DOI:10.1109/TASLP.2019.2956145

  • BibTeX:

    @article{wangNSFall,
    author = {Wang, Xin and Takaki, Shinji and Yamagishi, Junichi},
    doi = {10.1109/TASLP.2019.2956145},
    journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
    pages = {402--415},
    title = {{Neural Source-Filter Waveform Models for Statistical Parametric Speech Synthesis}},
    url = {https://ieeexplore.ieee.org/document/8915761/},
    volume = {28},
    year = {2020}
    }
    
  • Experiments were based on the ATR-Ximera F009 voice (Japanese, commercial database).

  • Code is available. You need both the CURRENNT toolkit and the scripts; a subfolder in the script repository is dedicated to this project.

  • A new implementation based on PyTorch is also available.

  • Note that:

    Copy-synthesis refers to waveform generation given natural acoustic features.

    Text-to-speech refers to waveform generation given acoustic features predicted from the text input.


Samples (main test)

Natural waveform samples cannot be released online because of license restrictions.

Utterance: _NIKKEIR_03132_T01
Mel-spec. + F0 (15-hr data) | WaveNet | hn-NSF | b-NSF | s-NSF
Copy-synthesis:
Text-to-speech:

MGC coef. + F0 (15-hr data) | WaveNet | hn-NSF | b-NSF | s-NSF
Copy-synthesis:
Text-to-speech:

Mel-spec. + F0 (1.6-hr data) | WaveNet | hn-NSF | b-NSF | s-NSF
Copy-synthesis:
Text-to-speech:

MGC coef. + F0 (1.6-hr data) | WaveNet | hn-NSF | b-NSF | s-NSF
Copy-synthesis:
Text-to-speech:

Utterance: _NIKKEIR_00257_T01
Mel-spec. + F0 (15-hr data) | WaveNet | hn-NSF | b-NSF | s-NSF
Copy-synthesis:
Text-to-speech:

MGC coef. + F0 (15-hr data) | WaveNet | hn-NSF | b-NSF | s-NSF
Copy-synthesis:
Text-to-speech:

Mel-spec. + F0 (1.6-hr data) | WaveNet | hn-NSF | b-NSF | s-NSF
Copy-synthesis:
Text-to-speech:

MGC coef. + F0 (1.6-hr data) | WaveNet | hn-NSF | b-NSF | s-NSF
Copy-synthesis:
Text-to-speech:



Utterance: _BTEC_00312_T01
Mel-spec. + F0 (15-hr data) | WaveNet | hn-NSF | b-NSF | s-NSF
Copy-synthesis:
Text-to-speech:

MGC coef. + F0 (15-hr data) | WaveNet | hn-NSF | b-NSF | s-NSF
Copy-synthesis:
Text-to-speech:

Mel-spec. + F0 (1.6-hr data) | WaveNet | hn-NSF | b-NSF | s-NSF
Copy-synthesis:
Text-to-speech:

MGC coef. + F0 (1.6-hr data) | WaveNet | hn-NSF | b-NSF | s-NSF
Copy-synthesis:
Text-to-speech:



Utterance: _AOZORAR_09534_T01
Mel-spec. + F0 (15-hr data) | WaveNet | hn-NSF | b-NSF | s-NSF
Copy-synthesis:
Text-to-speech:

MGC coef. + F0 (15-hr data) | WaveNet | hn-NSF | b-NSF | s-NSF
Copy-synthesis:
Text-to-speech:

Mel-spec. + F0 (1.6-hr data) | WaveNet | hn-NSF | b-NSF | s-NSF
Copy-synthesis:
Text-to-speech:

MGC coef. + F0 (1.6-hr data) | WaveNet | hn-NSF | b-NSF | s-NSF
Copy-synthesis:
Text-to-speech:



Utterance: _AOZORAR_03372_T01
Mel-spec. + F0 (15-hr data) | WaveNet | hn-NSF | b-NSF | s-NSF
Copy-synthesis:
Text-to-speech:

MGC coef. + F0 (15-hr data) | WaveNet | hn-NSF | b-NSF | s-NSF
Copy-synthesis:
Text-to-speech:

Mel-spec. + F0 (1.6-hr data) | WaveNet | hn-NSF | b-NSF | s-NSF
Copy-synthesis:
Text-to-speech:

MGC coef. + F0 (1.6-hr data) | WaveNet | hn-NSF | b-NSF | s-NSF
Copy-synthesis:
Text-to-speech:

Samples (ablation test)

Ablation test on b-NSF (16 kHz; please check the notation in the paper)

Utterance: _AOZORAR_03372_T01
b-NSF (trained on 5-hr. MGC+F0)
Copy-synthesis
L1 | L2 | L3
Copy-synthesis
S1 | S2
Copy-synthesis
N1 | N2
Copy-synthesis



Utterance: _NIKKEIR_03132_T01
b-NSF (trained on 5-hr. MGC+F0)
Copy-synthesis
L1 | L2 | L3
Copy-synthesis
S1 | S2
Copy-synthesis
N1 | N2
Copy-synthesis



Utterance: _NIKKEIR_00257_T01
b-NSF (trained on 5-hr. MGC+F0)
Copy-synthesis
L1 | L2 | L3
Copy-synthesis
S1 | S2
Copy-synthesis
N1 | N2
Copy-synthesis



Utterance: _AOZORAR_09534_T01
b-NSF (trained on 5-hr. MGC+F0)
Copy-synthesis
L1 | L2 | L3
Copy-synthesis
S1 | S2
Copy-synthesis
N1 | N2
Copy-synthesis