Rakugo speech synthesis: A challenging example of speech synthesis that entertains the audience

By Shuhei KATO1,2, Yusuke YASUDA1,2, Xin WANG2, Erica COOPER2, Shinji TAKAKI3*, and Junichi YAMAGISHI2,4

1 The Graduate University for Advanced Sciences (SOKENDAI), Japan.
2 National Institute of Informatics, Japan.
3 Nagoya Institute of Technology, Japan.
4 University of Edinburgh, UK.

* Formerly a member of Yamagishi Laboratory, NII.

Special thanks to Yanagiya Sanza, a star rakugo performer, for contributing recordings of his professional rakugo performances and speech.

If you have a question or comment about this research, please send an email to skato@nii.ac.jp


Can current speech synthesis (text-to-speech: TTS) entertain the audience? To realize TTS that entertains the audience, we have been developing rakugo speech synthesis.

Our papers and demos

What is rakugo?

Overview of rakugo

Rakugo is a traditional Japanese form of verbal entertainment that resembles a combination of one-person stand-up comedy and comic storytelling. It has over 300 years of history (dating from the middle of the Edo era) and is characterized by its unique performance style. A rakugo performer performs alone on a stage, either improvising or reciting from memory. He/she plays multiple characters, and their conversations or dialogues move the story forward. Almost no narrative sentences exist in the main part of a rakugo story.

Rakugo is popular even now

Rakugo is generally divided into Edo (Tokyo) rakugo and Kamigata (Osaka and Kyoto) rakugo. In Tokyo, about 600 professional performers are active. Tokyo has four major yoses, which are theaters that stage mainly rakugo every day, even on January 1. Some TV and radio programs are broadcast every week. Thousands of CDs and DVDs of rakugo performances are on sale, and many online videos are also available.

Four major yoses in Tokyo.

Rakugo performance

A performer sits alone on a zabuton (cushion) on a stage. He/she uses no props other than a sensu (folding fan) and a tenugui (hand towel).

Shumputei Shotaro is performing rakugo.
This photo is transformed from “DP3M2471” by akira kawamura licensed under CC BY 2.0.

Structure of a rakugo story

A rakugo story has five parts: maeoki (greeting), makura (introduction), main part, ochi (punch line), and musubi (conclusion). The maeoki is not necessary, and the musubi can appear when a performer ends his/her performance in the middle of the story due to time limitations. The makura is often improvised, but unlike in stand-up comedy, performers basically do not have conversations with the audience. The ochi is the most important part of rakugo because the word “rakugo” (落語) is derived from “a story with ochi (落ち).”

Dialects used in traditional rakugo stories

Rakugo stories are generally divided into standards, which were established by the 1920s, and modern stories, which were created after the 1930s. The Japanese language used in standards is slightly old-fashioned. As a result, automatic analysis or tagging is practically impossible, which makes processing the input text of rakugo speech difficult.

Characters appearing in standards speak different Japanese dialects, sociolects, or idiolects according to their gender, age, or social rank. In other words, the dialects, sociolects, or idiolects help us recognize who is speaking at any given moment.

Why have we been working on rakugo TTS?

Speech synthesis as an entertainer

Speech is a medium that conveys its content together with the speaker's emotion, personality, intention, etc. to listeners. Needless to say, information transfer is the most important role of speech. At last, some of the best modern TTS systems can generate speech as natural as human speech. In other words, (some) TTS can now speak as naturally as a human does.

However, information transfer is not the only role of speech. For example, verbal entertainment, including rakugo, entertains audiences through the medium of speech. In other words, speech can stir listeners' emotions. Then, can current TTS perform as well as a professional entertainer? We think not. Most current TTS systems, such as those installed in voice assistants, speak monotonously or with limited speaking styles. Although many TTS researchers have been working actively, current TTS systems are still far from professional entertainers. We are working toward building TTS that entertains.

Rakugo TTS vs. audiobook TTS

Some of you may wonder why we are not working on audiobook TTS. We must explain the differences between rakugo and audiobooks.

Differences between rakugo and audiobooks

  1. The main part of a rakugo story consists of conversations or dialogues between characters. Unlike audiobooks, the main part has almost no narrative passages.
  2. Rakugo speech is more casually pronounced because it is produced improvisationally or from memory. This affects the technical difficulty of modeling speech.
  3. Rakugo is inherently entertainment. Any rakugo TTS has to entertain the audience. A rakugo TTS that cannot entertain the audience is not a rakugo TTS.

We believe high-quality rakugo TTS will help the development of spoken dialogue systems because of points 1 and 2. Following point 3, we wonder what will happen when we have a rakugo TTS of the same quality as a human professional!