First NSF

Messages

Paper:

Wang, X., Takaki, S. & Yamagishi, J. Neural source-filter-based waveform model for statistical parametric speech synthesis. in Proc. ICASSP 5916–5920 (2019). DOI:10.1109/ICASSP.2019.8682298
BibTeX:
@inproceedings{wang2018neural,
author = {Wang, Xin and Takaki, Shinji and Yamagishi, Junichi},
booktitle = {Proc. ICASSP},
doi = {10.1109/ICASSP.2019.8682298},
pages = {5916--5920},
publisher = {IEEE},
title = {{Neural Source-filter-based Waveform Model for Statistical Parametric Speech Synthesis}},
url = {https://ieeexplore.ieee.org/document/8682298/},
year = {2019}
}
Experiments were based on the ATR-Ximera F009 voice (Japanese, commercial database).

Another experiment was based on the CMU_ARCTIC database; those samples are also uploaded.

Code is available. You need both the CURRENNT toolkit and the scripts. A subfolder in the script repository is dedicated to this project.

Slides for the ICASSP 2019 presentation can be found on this page. You can also download the PDF and pptx files directly.

Note that:

Copy-synthesis refers to waveform generation given natural acoustic features.

Text-to-speech refers to waveform generation given acoustic features predicted from the text input.

Audio samples (Japanese)

Natural waveform samples cannot be released online because of licensing restrictions.

Main test

_AOZORAR_03372_T01

	WORLD vocoder	WAD (WaveNet, mu-law, 10-bit)	WAC (single Gaussian WaveNet)	NSF (Neural source-filter model)

Copy-synthesis
Text-to-speech

_NIKKEIR_00257_T01

	WORLD vocoder	WAD (WaveNet, mu-law, 10-bit)	WAC (single Gaussian WaveNet)	NSF (Neural source-filter model)

Copy-synthesis
Text-to-speech

_AOZORAR_09534_T01

	WORLD vocoder	WAD (WaveNet, mu-law, 10-bit)	WAC (single Gaussian WaveNet)	NSF (Neural source-filter model)

Copy-synthesis
Text-to-speech

_NIKKEIR_03132_T01

	WORLD vocoder	WAD (WaveNet, mu-law, 10-bit)	WAC (single Gaussian WaveNet)	NSF (Neural source-filter model)

Copy-synthesis
Text-to-speech

Ablation test

Please check the original paper for the meaning of each system ID.

_NIKKEIR_00257_T01

	L1	L2	L3	L4	L5

Copy-synthesis

	S1	S2	S3	N1	N2
Copy-synthesis

HTS Engine+NSF

HTS-Engine: OpenJTalk + HMM acoustic models + vocoder.

HTS-Engine+NSF: OpenJTalk + HMM acoustic models + NSF.

phrase01

	HTS Engine	HTS Engine + NSF

Text-to-speech

phrase02

	HTS Engine	HTS Engine + NSF

Text-to-speech

phrase03

	HTS Engine	HTS Engine + NSF

Text-to-speech

phrase04

	HTS Engine	HTS Engine + NSF

Text-to-speech

phrase05

	HTS Engine	HTS Engine + NSF

Text-to-speech

Audio samples (English)

The models were trained using the CMU_ARCTIC SLT voice.

Main test

1: Author of the danger trail, Philip Steels, etc.

	WORLD vocoder	WaveNet (mu-law, 10-bit)	Neural source-filter model

Natural:

Copy-synthesis

2: To my surprise he began to show actual enthusiasm in my favor.

	WORLD vocoder	WaveNet (mu-law, 10-bit)	Neural source-filter model

Natural:

Copy-synthesis

3: In a flash Philip followed its direction.

	WORLD vocoder	WaveNet (mu-law, 10-bit)	Neural source-filter model

Natural:

Copy-synthesis

4: Much, replied Jeanne, as tersely.

	WORLD vocoder	WaveNet (mu-law, 10-bit)	Neural source-filter model

Natural:

Copy-synthesis

5: I suppose you picked that lingo up among the Indians.

	WORLD vocoder	WaveNet (mu-law, 10-bit)	Neural source-filter model

Natural:

Copy-synthesis

6: It was Jeanne singing softly over beyond the rocks.

	WORLD vocoder	WaveNet (mu-law, 10-bit)	Neural source-filter model

Natural:

Copy-synthesis

Ablation test

NSF-L3: using a single spectral amplitude distance (only the STFT loss Ls1 in the paper)
NSF-S3: using noise excitation only
NSF-N2: setting b=0 in each transformation layer of the filter module
NSF-MSE: trained by minimizing the mean squared error of waveform values
NSF-FFCond: using a simple feedforward layer as the condition module
NSF-noF0inCond: F0 is not fed to the filter module
NSF-noNoiseOnSine: the sine waveform (in voiced regions) does not contain additive noise
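
To make some of these ablation settings concrete, the following is a minimal NumPy sketch, not the CURRENNT implementation. It shows how a sine-based excitation with additive noise differs from the noise-only (NSF-S3) and noise-free sine (NSF-noNoiseOnSine) variants, and how a spectral amplitude distance (as in NSF-L3) differs from a waveform mean squared error (NSF-MSE). The function names, the `noise_only` and `noise_on_sine` flags, and the values of `alpha`, `sigma`, `n_fft`, `hop`, and `eps` are assumptions chosen for illustration. The F0 contour is assumed to be already upsampled to the waveform sampling rate, with 0 marking unvoiced samples.

```python
import numpy as np


def make_excitation(f0, sample_rate=16000, alpha=0.1, sigma=0.003,
                    noise_only=False, noise_on_sine=True, seed=0):
    """Build a source excitation from a sample-level F0 contour (Hz, 0 = unvoiced).

    All default values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    f0 = np.asarray(f0, dtype=np.float64)
    noise = rng.normal(0.0, sigma, size=f0.shape)
    # Noise rescaled so that its standard deviation is alpha / 3.
    scaled_noise = alpha / (3.0 * sigma) * noise
    if noise_only:
        # NSF-S3-like setting: ignore F0 and use noise everywhere.
        return scaled_noise
    # Accumulate instantaneous frequency into phase, with a random initial phase.
    phase0 = rng.uniform(-np.pi, np.pi)
    phase = 2.0 * np.pi * np.cumsum(f0) / sample_rate + phase0
    sine = alpha * np.sin(phase)
    if noise_on_sine:
        # Default setting: small additive noise on the sine in voiced regions.
        sine = sine + noise
    # Sine in voiced samples, rescaled noise in unvoiced samples.
    return np.where(f0 > 0, sine, scaled_noise)


def spectral_amplitude_distance(x, y, n_fft=512, hop=80, eps=1e-5):
    """Mean squared difference of log power spectra for one STFT configuration."""
    def log_power_spec(w):
        w = np.asarray(w, dtype=np.float64)
        n_frames = 1 + (len(w) - n_fft) // hop
        idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
        frames = w[idx] * np.hanning(n_fft)
        spec = np.fft.rfft(frames, axis=1)
        return np.log(np.abs(spec) ** 2 + eps)

    diff = log_power_spec(x) - log_power_spec(y)
    return 0.5 * np.mean(diff ** 2)


def waveform_mse(x, y):
    """Sample-by-sample mean squared error between two waveforms."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    return np.mean((x - y) ** 2)


# Example: 0.5 s unvoiced followed by 0.5 s voiced at 200 Hz.
f0 = np.concatenate([np.zeros(8000), np.full(8000, 200.0)])
excitation = make_excitation(f0)
noise_excitation = make_excitation(f0, noise_only=True)
print(spectral_amplitude_distance(excitation, noise_excitation))
print(waveform_mse(excitation, noise_excitation))
```

One property worth noting in this sketch: the spectral amplitude distance ignores phase, so a waveform and its sign-flipped copy have (near) zero spectral distance but a non-zero waveform MSE.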

1: Author of the danger trail, Philip Steels, etc.