An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios

Authors: Cheng Gong, Erica Cooper, Xin Wang, Chunyu Qiang, Mengzhe Geng, Dan Wells,
Longbiao Wang, Jianwu Dang, Marc Tessier, Aidan Pine, Korin Richmond, Junichi Yamagishi

College of Intelligence and Computing, Tianjin University, China

National Institute of Informatics, Japan

Centre for Speech Technology Research, University of Edinburgh, United Kingdom

National Research Council Canada

National Institute of Information and Communications Technology, Japan

Abstract. Self-supervised learning (SSL) representations from massively multilingual models offer a promising solution for low-resource language speech tasks. Despite advancements, language adaptation in TTS systems remains an open problem. This paper explores the language adaptation ability of ZMM-TTS, a recent SSL-based multilingual TTS system. We conducted experiments on 12 languages using limited data with various fine-tuning configurations. We demonstrate that the similarity in phonetics between the pretraining and target languages, as well as the language category, affects the target language's adaptation performance. Additionally, we find that the fine-tuning dataset size and number of speakers influence adaptability. Surprisingly, we also observed that using paired data for fine-tuning is not always optimal compared to audio-only data. Beyond speech intelligibility, our analysis covers speaker similarity, language identification, and predicted MOS.

This page is for research demonstration purposes only.


Note

Fine-tuning methods

  • Paired-data fine-tuning: We used paired {text, audio} data and fine-tuned both the txt2vec and vec2wav models.
  • Audio-only fine-tuning: We used audio-only data to fine-tune the vec2wav model; during testing, txt2vec processes the input in a zero-shot manner.
  • Zero-shot: Without using any data for fine-tuning, both txt2vec and vec2wav were tested directly in zero-shot inference. (A code sketch of these three configurations follows below.)
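
The sketch below illustrates how the three configurations differ in which modules are trainable. It is a minimal, hypothetical illustration in PyTorch: TxtToVec, VecToWav, and configure_finetuning are placeholder names standing in for the actual ZMM-TTS components, not the authors' code.

    # A minimal, hypothetical sketch (PyTorch); not the actual ZMM-TTS code.
    import torch
    from torch import nn

    class TxtToVec(nn.Module):
        """Placeholder for txt2vec: text embeddings -> SSL speech representations."""
        def __init__(self):
            super().__init__()
            self.proj = nn.Linear(256, 768)

        def forward(self, text_emb):
            return self.proj(text_emb)

    class VecToWav(nn.Module):
        """Placeholder for vec2wav: SSL representations -> waveform samples."""
        def __init__(self):
            super().__init__()
            self.proj = nn.Linear(768, 1)

        def forward(self, vec):
            return self.proj(vec)

    def configure_finetuning(txt2vec: nn.Module, vec2wav: nn.Module, method: str) -> None:
        """Freeze or unfreeze modules according to the fine-tuning method.

        'paired'     -- {text, audio} pairs update both txt2vec and vec2wav.
        'audio_only' -- audio alone updates vec2wav; txt2vec stays frozen
                        and runs zero-shot at test time.
        'zero_shot'  -- nothing is trained; both models are used as-is.
        """
        assert method in {"paired", "audio_only", "zero_shot"}
        for p in txt2vec.parameters():
            p.requires_grad = (method == "paired")
        for p in vec2wav.parameters():
            p.requires_grad = (method in {"paired", "audio_only"})

    # Example: audio-only fine-tuning trains vec2wav while txt2vec stays frozen.
    txt2vec, vec2wav = TxtToVec(), VecToWav()
    configure_finetuning(txt2vec, vec2wav, "audio_only")
    trainable = [p for p in vec2wav.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=1e-4)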

To analyze the impact of the number of speakers and the total number of utterances in the fine-tuning dataset on the final adaptation performance, we employed various fine-tuning data configurations, as shown in the following table.

Table 1: Fine-tuning dataset configurations. S, M, and L denote small, medium, and large. For the same fine-tuning data size, we use superscripts to distinguish between fine-tuning approaches: {S1, S2, …, L4} represents audio-only fine-tuning, while {S1′, S2′, …, L4′} represents paired-data fine-tuning. In the subsequent sections, we use 0 to represent zero-shot inference.

Name                       S1   S2   S3   S4   M1   M2   M3   M4   L1   L2   L3   L4
Speakers (Spk)              2    4   10   20    2    4   10   20    2    4   10   20
Utterances/speaker (Utt)   12    6    2    1   25   12    5    2   50   25   10    5
Total utterances           24   24   20   20   50   48   48   40  100  100  100  100
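
For scripting experiments, the same configurations can be kept as a small lookup table. The snippet below is only a convenience transcription of Table 1 into Python; DATA_CONFIGS is a hypothetical name, and the values are copied verbatim from the table.

    # Fine-tuning dataset configurations transcribed from Table 1
    # (hypothetical variable name; values copied verbatim):
    # name -> (speakers, utterances per speaker, total utterances).
    DATA_CONFIGS = {
        "S1": (2, 12, 24),  "S2": (4, 6, 24),    "S3": (10, 2, 20),   "S4": (20, 1, 20),
        "M1": (2, 25, 50),  "M2": (4, 12, 48),   "M3": (10, 5, 48),   "M4": (20, 2, 40),
        "L1": (2, 50, 100), "L2": (4, 25, 100),  "L3": (10, 10, 100), "L4": (20, 5, 100),
    }

    for name, (spk, utt, total) in sorted(DATA_CONFIGS.items()):
        print(f"{name}: {spk} speakers x ~{utt} utterances each = {total} utterances")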

Samples across different languages

Here we show only the samples selected according to the best CER among the 25 configurations; results for all configurations can be found in all audios. Each language is represented by its ISO 639 code. (A sketch of this best-CER selection step is given after the table below.)

Language  Fine-tuning method  Test text  GT audio  Synthesized audio
nld L1 Text1: nu zet ik mij neer om u te schrijven in geheel andere stemming onder den drang eener sterke behoefte om aan eene vertrouwde borst mijne bezwaren mijne bange vermoedens uit te storten
Text2: eckbert had mij zelfs met geen blik verwaardigd toen hij vertrokken was herkreeg ik mijne bewustheid en onder het smartelijke het beschamende van mijn toestand wierp ik mij op de causeuse neer en barstte in tranen los
ita L1′ Text1: poche ore dopo viene la servetta per far la spesa giornaliera e rimettere in ordine la casa è una toscaninà tozza ma svelta muso duro e linguacciuta ben alzato
Text2: i trattenevo da lui qualche ora la conversazione però languiva poiché egli dopo avermi accolto con un sorriso mesto e muto di riconoscenza spesso richiudeva gli occhi
cmn S2′ Text1: 相关%公司%股票%走势%农产品$
Text2: 目前%用户%对%空气%质量%满意度%普遍%较低$
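
To make the selection criterion concrete, the snippet below picks, for each language, the configuration with the lowest character error rate (CER). This is an illustrative sketch only: the file name cer_results.csv and the columns language, config, and cer are assumptions, not the project's actual result files.

    import pandas as pd

    # Hypothetical results file: one row per (language, configuration).
    # The file name and the columns 'language', 'config', 'cer' are assumptions.
    results = pd.read_csv("cer_results.csv")

    # For each language, keep the row whose configuration gives the lowest CER.
    best = results.loc[results.groupby("language")["cer"].idxmin()]
    print(best[["language", "config", "cer"]])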

Total objective evaluation results

The complete objective evaluation results can be found here.