Hn-NSF and s-NSF

Messages

Paper:

Wang, X., Takaki, S. & Yamagishi, J. Neural Source-Filter Waveform Models for Statistical Parametric Speech Synthesis. IEEE/ACM Trans. Audio, Speech, Lang. Process. 28, 402–415 (2020), DOI:10.1109/TASLP.2019.2956145
BibTeX:
@article{wangNSFall,
author = {Wang, Xin and Takaki, Shinji and Yamagishi, Junichi},
doi = {10.1109/TASLP.2019.2956145},
journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
pages = {402--415},
title = {{Neural Source-Filter Waveform Models for Statistical Parametric Speech Synthesis}},
url = {https://ieeexplore.ieee.org/document/8915761/},
volume = {28},
year = {2020}
}

Experiments were based on the ATR-Ximera F009 voice (Japanese, commercial database).

Code is available. You need both the CURRENNT toolkit and the accompanying scripts; a subfolder in the script repository is dedicated to this project.

A new implementation based on PyTorch is also available.

Note that:

Copy-synthesis refers to waveform generation given natural acoustic features, i.e., features extracted from recorded speech.

Text-to-speech refers to waveform generation given acoustic features predicted from the text input.
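The difference between the two conditions can be sketched as follows. This is only an illustrative toy, not the CURRENNT or PyTorch code: the function names, the type aliases, and the two stand-in models (`toy_acoustic_model`, `toy_waveform_model`) are all hypothetical. The point is that the same waveform model is used in both cases and only the origin of the acoustic features changes.

```python
from typing import Callable, List

# Hypothetical type aliases: one feature vector per frame, one float per sample.
Features = List[List[float]]
Waveform = List[float]


def copy_synthesis(natural_features: Features,
                   waveform_model: Callable[[Features], Waveform]) -> Waveform:
    """Generate a waveform from features extracted from natural speech."""
    return waveform_model(natural_features)


def text_to_speech(text: str,
                   acoustic_model: Callable[[str], Features],
                   waveform_model: Callable[[Features], Waveform]) -> Waveform:
    """Generate a waveform from features predicted from the text input."""
    predicted_features = acoustic_model(text)
    return waveform_model(predicted_features)


def toy_acoustic_model(text: str) -> Features:
    # Stand-in for an acoustic model: one 1-dimensional frame per character.
    return [[float(ord(ch) % 7)] for ch in text]


def toy_waveform_model(features: Features, samples_per_frame: int = 4) -> Waveform:
    # Stand-in for a waveform model (WaveNet, an NSF model, ...):
    # simply repeats the first feature value of every frame.
    return [frame[0] for frame in features for _ in range(samples_per_frame)]


if __name__ == "__main__":
    natural = [[1.0], [2.0]]  # pretend these were extracted from a recording
    print(len(copy_synthesis(natural, toy_waveform_model)))
    print(len(text_to_speech("ab", toy_acoustic_model, toy_waveform_model)))
```

Under this view, copy-synthesis samples show the waveform model on its own, while text-to-speech samples also include the errors of the acoustic model.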

Samples (main test)

Natural waveform samples cannot be released online because of licensing restrictions.

Utterance: _NIKKEIR_03132_T01

Mel-spec. + F0 (15-hr data)	WaveNet	hn-NSF	b-NSF	s-NSF

Copy-synthesis:
Text-to-speech:

MGC coef. + F0 (15-hr data)	WaveNet	hn-NSF	b-NSF	s-NSF

Copy-synthesis:
Text-to-speech:

Mel-spec. + F0 (1.6-hr data)	WaveNet	hn-NSF	b-NSF	s-NSF

Copy-synthesis:
Text-to-speech:

MGC coef. + F0 (1.6-hr data)	WaveNet	hn-NSF	b-NSF	s-NSF

Copy-synthesis:
Text-to-speech:

Utterance: _NIKKEIR_00257_T01

Mel-spec. + F0 (15-hr data)	WaveNet	hn-NSF	b-NSF	s-NSF

Copy-synthesis:
Text-to-speech:

MGC coef. + F0 (15-hr data)	WaveNet	hn-NSF	b-NSF	s-NSF

Copy-synthesis:
Text-to-speech:

Mel-spec. + F0 (1.6-hr data)	WaveNet	hn-NSF	b-NSF	s-NSF

Copy-synthesis:
Text-to-speech:

MGC coef. + F0 (1.6-hr data)	WaveNet	hn-NSF	b-NSF	s-NSF

Copy-synthesis:
Text-to-speech:

Utterance: _BTEC_00312_T01

Mel-spec. + F0 (15-hr data)	WaveNet	hn-NSF	b-NSF	s-NSF

Copy-synthesis:
Text-to-speech:

MGC coef. + F0 (15-hr data)	WaveNet	hn-NSF	b-NSF	s-NSF

Copy-synthesis:
Text-to-speech:

Mel-spec. + F0 (1.6-hr data)	WaveNet	hn-NSF	b-NSF	s-NSF

Copy-synthesis:
Text-to-speech:

MGC coef. + F0 (1.6-hr data)	WaveNet	hn-NSF	b-NSF	s-NSF

Copy-synthesis:
Text-to-speech:

Utterance: _AOZORAR_09534_T01

Mel-spec. + F0 (15-hr data)	WaveNet	hn-NSF	b-NSF	s-NSF

Copy-synthesis:
Text-to-speech:

MGC coef. + F0 (15-hr data)	WaveNet	hn-NSF	b-NSF	s-NSF

Copy-synthesis:
Text-to-speech:

Mel-spec. + F0 (1.6-hr data)	WaveNet	hn-NSF	b-NSF	s-NSF

Copy-synthesis:
Text-to-speech:

MGC coef. + F0 (1.6-hr data)	WaveNet	hn-NSF	b-NSF	s-NSF

Copy-synthesis:
Text-to-speech:

Utterance: _AOZORAR_03372_T01

Mel-spec. + F0 (15-hr data)	WaveNet	hn-NSF	b-NSF	s-NSF

Copy-synthesis:
Text-to-speech:

MGC coef. + F0 (15-hr data)	WaveNet	hn-NSF	b-NSF	s-NSF

Copy-synthesis:
Text-to-speech:

Mel-spec. + F0 (1.6-hr data)	WaveNet	hn-NSF	b-NSF	s-NSF

Copy-synthesis:
Text-to-speech:

MGC coef. + F0 (1.6-hr data)	WaveNet	hn-NSF	b-NSF	s-NSF

Copy-synthesis:
Text-to-speech:

Samples (ablation test)

Ablation test on b-NSF (16 kHz; please see the paper for the notation).

Utterance: _AOZORAR_03372_T01

	b-NSF (trained on 5-hr. MGC+F0)

Copy-synthesis:

	L1	L2	L3

Copy-synthesis:

	S1	S2

Copy-synthesis:

	N1	N2

Copy-synthesis:

Utterance: _NIKKEIR_03132_T01

	b-NSF (trained on 5-hr. MGC+F0)

Copy-synthesis:

	L1	L2	L3

Copy-synthesis:

	S1	S2

Copy-synthesis:

	N1	N2

Copy-synthesis:

Utterance: _NIKKEIR_00257_T01

	b-NSF (trained on 5-hr. MGC+F0)

Copy-synthesis:

	L1	L2	L3

Copy-synthesis:

	S1	S2

Copy-synthesis:

	N1	N2

Copy-synthesis:

Utterance: _AOZORAR_09534_T01

	b-NSF (trained on 5-hr. MGC+F0)

Copy-synthesis:

	L1	L2	L3

Copy-synthesis:

	S1	S2

Copy-synthesis:

	N1	N2

Copy-synthesis: