This system uses a traditional pipeline speech synthesis model (shallow autoregressive neural acoustic model; SAR [1]) to model rakugo speech.
The training set (695 sentences), validation set (695 randomly selected sentences), and test set (289 sentences that make up one story) were prepared from 12 stories (6 hours 24 minutes, including inter-sentence pauses) in our rakugo speech database.
The input acoustic features are a 60-dimensional mel cepstrum, mel fo, voiced/unvoiced information, and 25-dimensional band aperiodicity. The input linguistic features are quinphones and, optionally, manually labeled context features. All acoustic features are extracted from 48 kHz/16-bit waveforms every 5 ms. The context features are constant within a sentence.
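The frame arithmetic implied by these settings can be sketched as follows. This is a minimal illustration, not the authors' extraction code: it assumes that mel fo and the voiced/unvoiced information each occupy one value per frame and that no dynamic (delta) features are appended, neither of which is stated here. All names are hypothetical.

```python
# Settings taken from the text: 48 kHz waveforms, 5 ms frame shift.
SAMPLE_RATE_HZ = 48_000
FRAME_SHIFT_MS = 5

# Per-frame dimensions. The 60 and 25 come from the text; the two
# scalar entries are ASSUMPTIONS (one value per frame each).
ACOUSTIC_DIMS = {
    "mel_cepstrum": 60,
    "mel_fo": 1,             # assumed scalar
    "voiced_unvoiced": 1,    # assumed binary flag
    "band_aperiodicity": 25,
}


def samples_per_frame(sample_rate_hz: int, frame_shift_ms: int) -> int:
    """Number of waveform samples covered by one frame shift."""
    return sample_rate_hz * frame_shift_ms // 1000


def num_frames(duration_s: float, frame_shift_ms: int) -> int:
    """Number of whole frame shifts that fit in an utterance."""
    return int(duration_s * 1000 // frame_shift_ms)


frame_dim = sum(ACOUSTIC_DIMS.values())

print(samples_per_frame(SAMPLE_RATE_HZ, FRAME_SHIFT_MS))  # 240
print(frame_dim)                                          # 87
print(num_frames(2.0, FRAME_SHIFT_MS))                    # 400
```

Under these assumptions, each frame is an 87-dimensional vector covering 240 waveform samples, and a 2-second sentence yields 400 frames.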
The estimated mel cepstrum and mel fo are converted into a 16 kHz/16-bit waveform using a WaveNet vocoder.
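A vocoder conditioned on frame-level features has to align 5 ms frames with 16 kHz output samples. The sketch below shows one simple way to do that, repeating each frame for every sample it covers. The repetition scheme, the function names, and the treatment of mel fo as one value per frame are assumptions for illustration; the text does not describe how the WaveNet vocoder used here handles its conditioning.

```python
# Settings taken from the text: 16 kHz output, 5 ms frame shift.
VOCODER_SAMPLE_RATE_HZ = 16_000
FRAME_SHIFT_MS = 5
HOP = VOCODER_SAMPLE_RATE_HZ * FRAME_SHIFT_MS // 1000  # 80 samples per frame


def build_conditioning(mel_cepstrum, mel_fo):
    """Concatenate per-frame mel cepstrum vectors with per-frame mel fo values."""
    if len(mel_cepstrum) != len(mel_fo):
        raise ValueError("mel cepstrum and mel fo must have the same number of frames")
    return [list(mc) + [fo] for mc, fo in zip(mel_cepstrum, mel_fo)]


def upsample_by_repetition(frames, hop=HOP):
    """Repeat each frame `hop` times so there is one vector per output sample."""
    return [frame for frame in frames for _ in range(hop)]


# Dummy values for three frames (placeholders, not real features).
mel_cepstrum = [[0.0] * 60 for _ in range(3)]
mel_fo = [5.0, 5.1, 0.0]

conditioning = upsample_by_repetition(build_conditioning(mel_cepstrum, mel_fo))
print(len(conditioning), len(conditioning[0]))  # 240 61
```

Three frames expand to 240 sample-level conditioning vectors of 61 dimensions each (60 mel-cepstral coefficients plus one mel fo value).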
[1] Xin Wang, Shinji Takaki, and Junichi Yamagishi, “An autoregressive recurrent mixture density network for parametric speech synthesis,” Proc. ICASSP, pp. 4895–4899, 2017.
System | Quinphone | Manually labeled context features |
QP | ✓ | - |
QP-ATTR | ✓ | ATTR (related to character) only |
QP-context | ✓ | All |
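The three configurations in the table above differ only in which manually labeled context features accompany the quinphones. A minimal sketch of that selection is given below. The dictionary layout, the function names, and the example label values are hypothetical; only the label name ATTR and the three system names come from the table.

```python
# Context-feature policy per system, transcribed from the table above.
CONTEXT_POLICY = {
    "QP": "none",
    "QP-ATTR": "attr_only",
    "QP-context": "all",
}


def select_context(system, sentence_context):
    """Return the subset of sentence-level context labels used by a system."""
    policy = CONTEXT_POLICY[system]
    if policy == "none":
        return {}
    if policy == "attr_only":
        return {k: v for k, v in sentence_context.items() if k == "ATTR"}
    return dict(sentence_context)


def build_inputs(system, quinphones, sentence_context):
    """Pair every quinphone with the same context, since the context
    features are constant within a sentence."""
    context = select_context(system, sentence_context)
    return [(qp, context) for qp in quinphones]


# Hypothetical labels: "EXAMPLE_OTHER" stands in for the remaining,
# unnamed context features; the values are placeholders.
sentence_context = {"ATTR": "wife", "EXAMPLE_OTHER": "placeholder"}
quinphones = ["q1", "q2"]  # placeholder quinphone identifiers

for system in CONTEXT_POLICY:
    print(system, build_inputs(system, quinphones, sentence_context)[0])
```

Keeping the policy in one lookup table means the three systems share the same input-building code and differ only in the selected labels.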
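Legal wife (W): いいも悪いも、あたしのほうからあの子のことが心配ですから、ね。泊まってあげてくださいなとお願い申し上げるんで。 (Right or wrong, I am the one who is worried about that girl, you see. So I am asking you to please go and stay the night with her.)
Husband (H): ああそう。いやあーお前が言うならじゃあー泊まってきましょう、ハッ。アやっぱり仲がいいてえのは、いいね、うん穏やかでああそうかいそうかい。 (Oh, is that so. Well, if you say so, then I will go and stay over, ha. Ah, it really is nice when people get along, yes, so peaceful. I see, I see.)
H: じゃああたしは今晩あれのところへ、エー泊まってきますがただねえ、エーだいぶ夜も遅いから提灯持ちに、伴を連れてきたいが、誰か起きてるのかい? (Well then, I will go and stay at her place tonight, but, er, it is quite late, so I would like to take an attendant along to carry the lantern. Is anyone still up?)
W: それがみーんな先に休んでおりましてあの、飯炊きの権助シキャ起きてないんでございますが。 (Well, everyone has already gone to bed and, um, only Gonsuke, the servant who cooks the rice, is still awake.) |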
Natural | QP | QP-ATTR | QP-context |