Hn-sinc-NSF

Messages

  • Paper:

    Wang, X. & Yamagishi, J. Neural Harmonic-plus-Noise Waveform Model with Trainable Maximum Voice Frequency for Text-to-Speech Synthesis. in Proc. SSW 1–6 (ISCA, 2019). doi:10.21437/SSW.2019-1

  • BibTeX:

    @inproceedings{Wang2019,
    address = {ISCA},
    author = {Wang, Xin and Yamagishi, Junichi},
    booktitle = {Proc. SSW},
    doi = {10.21437/SSW.2019-1},
    pages = {1--6},
    publisher = {ISCA},
    title = {{Neural Harmonic-plus-Noise Waveform Model with Trainable Maximum Voice Frequency for Text-to-Speech Synthesis}},
    url = {http://www.isca-speech.org/archive/SSW{\_}2019/abstracts/SSW10{\_}O{\_}1-1.html},
    year = {2019}
    }
    
  • Experiments were based on the ATR-Ximera F009 voice (Japanese, commercial database).

  • Code is available. You need both the CURRENNT toolkit and the scripts; this subfolder in the script repository is for this project.

  • A new implementation based on PyTorch is also available.

  • Slides for the SSW 2019 presentation can be found on this page. You may also download the PDF directly.

  • Note that:

    Copy-synthesis refers to waveform generation given natural acoustic features.

    Text-to-speech refers to waveform generation given acoustic features predicted from the text input.


Audio samples

Natural waveform samples cannot be released online because of license restrictions on the database.

Utterance: _AOZORAR_09534_T01
Mel-spec. + F0 (15-hr data) | WaveNet | hn-NSF | h-sinc1-NSF | h-sinc2-NSF | h-sinc3-NSF
Copy-synthesis:
Text-to-speech:



Utterance: _AOZORAR_03372_T01
Mel-spec. + F0 (15-hr data) | WaveNet | hn-NSF | h-sinc1-NSF | h-sinc2-NSF | h-sinc3-NSF
Copy-synthesis:
Text-to-speech:

Utterance: _NIKKEIR_03132_T01
Mel-spec. + F0 (15-hr data) | WaveNet | hn-NSF | h-sinc1-NSF | h-sinc2-NSF | h-sinc3-NSF
Copy-synthesis:
Text-to-speech:


Utterance: _NIKKEIR_00257_T01
Mel-spec. + F0 (15-hr data) | WaveNet | hn-NSF | h-sinc1-NSF | h-sinc2-NSF | h-sinc3-NSF
Copy-synthesis:
Text-to-speech:



Utterance: _BTEC_00312_T01
Mel-spec. + F0 (15-hr data) | WaveNet | hn-NSF | h-sinc1-NSF | h-sinc2-NSF | h-sinc3-NSF
Copy-synthesis:
Text-to-speech: