Speech samples for our pipeline rakugo speech synthesis models

This system uses a traditional pipeline speech synthesis model (a shallow autoregressive neural acoustic model, SAR [1]) to model rakugo speech.

A training set (695 sentences), a validation set (randomly selected 695 sentences), and a test set (289 sentences comprising one story) were prepared from 12 stories (6 hours 24 minutes including inter-sentence pauses) in our rakugo speech database.

Input acoustic features are a 60-dimensional mel cepstrum, mel fo, voiced/unvoiced information, and 25-dimensional band aperiodicity. Input linguistic features are quinphones and, optionally, manually labeled context features. All acoustic features are extracted from 48 kHz/16-bit waveforms every 5 ms. The context features are constant within a sentence.
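The feature layout above can be checked with a little arithmetic. The following is a minimal sketch, not the authors' extraction code: it only derives the frame hop and per-frame dimensionality from the numbers in the text. The `hz_to_mel` function is an assumption; the text does not say which mel warping is applied to fo, and a commonly used formula is shown purely for illustration. All names are hypothetical.

```python
import math

SAMPLE_RATE_HZ = 48_000  # analysis sampling rate stated in the text
FRAME_SHIFT_MS = 5       # frame shift stated in the text

# Per-frame acoustic feature layout (dimensions taken from the text).
FEATURE_DIMS = {
    "mel_cepstrum": 60,
    "mel_fo": 1,
    "voiced_unvoiced": 1,
    "band_aperiodicity": 25,
}


def samples_per_frame(sample_rate_hz=SAMPLE_RATE_HZ, frame_shift_ms=FRAME_SHIFT_MS):
    """Number of waveform samples covered by one frame shift."""
    return sample_rate_hz * frame_shift_ms // 1000


def frames_per_second(frame_shift_ms=FRAME_SHIFT_MS):
    """Number of feature frames produced per second of speech."""
    return 1000 // frame_shift_ms


def total_feature_dim(dims=FEATURE_DIMS):
    """Total acoustic feature dimensionality per frame."""
    return sum(dims.values())


def hz_to_mel(fo_hz):
    """ASSUMED mel warping (a common formula); the text does not specify one."""
    return 1127.0 * math.log(1.0 + fo_hz / 700.0)
```

Under these numbers, each frame advances 240 samples of the 48 kHz waveform, there are 200 frames per second, and each frame carries 87 acoustic values.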

The estimated mel cepstrum and mel fo are converted into a 16 kHz/16-bit waveform using a WaveNet vocoder.
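A sample-level vocoder needs its frame-level conditioning features aligned with the output samples. The sketch below illustrates the timing relationship only: with a 5 ms frame shift and a 16 kHz output, each frame spans 80 samples. Upsampling by simple repetition is an assumption made for illustration; the text does not describe how the vocoder used here upsamples its conditioning input. The function name is hypothetical.

```python
OUTPUT_SAMPLE_RATE_HZ = 16_000  # vocoder output rate stated in the text
FRAME_SHIFT_MS = 5              # frame shift stated in the text


def upsample_by_repetition(frames,
                           sample_rate_hz=OUTPUT_SAMPLE_RATE_HZ,
                           frame_shift_ms=FRAME_SHIFT_MS):
    """Repeat each frame-level feature vector once per output sample.

    ASSUMPTION: plain repetition; a real vocoder may use learned or
    interpolating upsampling instead.
    """
    hop = sample_rate_hz * frame_shift_ms // 1000  # 80 samples per frame here
    return [frame for frame in frames for _ in range(hop)]
```

For example, three frames of 61 values each (60 mel-cepstral coefficients plus mel fo) become 240 sample-aligned conditioning vectors.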

[1] Xin Wang, Shinji Takaki, and Junichi Yamagishi, “An autoregressive recurrent mixture density network for parametric speech synthesis,” Proc. ICASSP, pp. 4895–4899, 2017.


Systems

System      Quinphone  Manually labeled context features
QP          Yes        -
QP-ATTR     Yes        ATTR (related to character) only
QP-context  Yes        All
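The difference between the three systems can be sketched as a choice of which context features are appended to the quinphone input. This is an illustrative sketch under stated assumptions, not the authors' implementation: the vector encodings, the function names, and every context label other than ATTR are hypothetical. Because the context features are constant within a sentence, the same context vector is appended to every phone of that sentence.

```python
def select_context(system, context):
    """Pick the sentence-level context values used by a given system.

    `context` maps a label name to a list of numbers. Only "ATTR" is named
    in the text; any other key is a hypothetical placeholder.
    """
    if system == "QP":
        return []
    if system == "QP-ATTR":
        return list(context["ATTR"])
    if system == "QP-context":
        # Concatenate all labels in sorted key order for a stable layout.
        return [value for name in sorted(context) for value in context[name]]
    raise ValueError(f"unknown system: {system}")


def build_inputs(system, quinphone_vectors, context):
    """Append the same sentence-level context to every phone-level vector."""
    ctx = select_context(system, context)
    return [list(qp) + ctx for qp in quinphone_vectors]
```

With two phones and a context of `{"ATTR": [1.0, 0.0], "OTHER": [0.5]}`, QP leaves the quinphone vectors unchanged, QP-ATTR appends two values to each, and QP-context appends three.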

Samples

Legal wife (W): いいも悪いも、あたしのほうからあの子のことが心配ですから、ね。泊まってあげてくださいなとお願い申し上げるんで。
Husband (H): ああそう。いやあーお前が言うならじゃあー泊まってきましょう、ハッ。アやっぱり仲がいいてえのは、いいね、うん穏やかでああそうかいそうかい。
H: じゃああたしは今晩あれのところへ、エー泊まってきますがただねえ、エーだいぶ夜も遅いから提灯持ちに、伴を連れてきたいが、誰か起きてるのかい?
W: それがみーんな先に休んでおりましてあの、飯炊きの権助シキャ起きてないんでございますが。

Natural

QP

QP-ATTR

QP-context