An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios
Abstract. Self-supervised learning (SSL) representations from massively multilingual models offer a promising solution for low-resource language speech tasks. Despite advancements, language adaptation in TTS systems remains an open problem. This paper explores the language adaptation ability of ZMM-TTS, a recent SSL-based multilingual TTS system. We conducted experiments on 12 languages using limited data with various finetuning configurations. We demonstrate that the similarity in phonetics between the pretraining and target languages, as well as the language category, affects the target language’s adaptation performance. Additionally, we find that the fine-tuning dataset size and number of speakers influence adaptability. Surprisingly, we also observed that using paired data for fine-tuning is not always optimal compared to audio-only data. Beyond speech intelligibility, our analysis covers speaker similarity, language identification, and predicted MOS.
This page is for research demonstration purposes only.
Note:
- Ground truth Italian and Dutch speech samples are from Multilingual LibriSpeech (MLS), which is licensed under the CC BY 4.0 license.
- Ground truth Chinese speech samples are from AISHELL-3, which is licensed under the Apache License v2.0.
- Speech samples for the other languages are from GlobalPhone, which cannot be redistributed publicly. Hence, this page only lists samples from the two databases above.
Fine-tuning methods
- Paired-data fine-tuning: We used paired data {text, audio} and performed fine-tuning on both the txt2vec and vec2wav models.
- Audio-only fine-tuning: We used audio-only data for fine-tuning the vec2wav model, and during testing, txt2vec processes the input in a zero-shot manner.
- Zero-shot: Without using any data for fine-tuning, both txt2vec and vec2wav were tested directly in zero-shot inference.
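The three settings above differ only in which of the two ZMM-TTS stages receives updates. The following is a minimal sketch of that distinction; the `Stage` class and `train_step` method are illustrative stand-ins, not the actual ZMM-TTS training API.

```python
# Toy illustration of which modules each adaptation setting updates.
# Stage stands in for a ZMM-TTS component (txt2vec or vec2wav); the
# names and interfaces here are assumptions, not the real codebase.

class Stage:
    """Records whether the module received any fine-tuning updates."""
    def __init__(self, name):
        self.name = name
        self.updated = False

    def train_step(self, *inputs):
        self.updated = True  # a real stage would take a gradient step here


def finetune(txt2vec, vec2wav, mode):
    """Apply one fine-tuning step according to the chosen setting."""
    if mode == "paired":        # {text, audio} pairs adapt both stages
        txt2vec.train_step("text", "ssl_units")
        vec2wav.train_step("ssl_units", "audio")
    elif mode == "audio_only":  # audio alone can only adapt vec2wav;
        vec2wav.train_step("ssl_units", "audio")  # txt2vec stays zero-shot
    elif mode == "zero_shot":   # no adaptation at all
        pass
    return txt2vec.updated, vec2wav.updated
```

At test time all three settings run the same inference pipeline; only the weights differ.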
To analyze how the number of speakers and the total number of utterances in the fine-tuning dataset affect the final adaptation performance, we employed various fine-tuning data configurations, as shown in the following table.
Table 1: Fine-tuning dataset configurations. S, M, and L denote small, medium, and large. For the same fine-tuning data size, we use superscripts to distinguish between fine-tuning approaches: {S1, S2, · · · , L4} represents audio-only fine-tuning, while {S1′, S2′, · · · , L4′} represents paired-data fine-tuning. In the subsequent sections, we use 0 to represent zero-shot inference.
Name | S1 | S2 | S3 | S4 | M1 | M2 | M3 | M4 | L1 | L2 | L3 | L4 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Speakers | 2 | 4 | 10 | 20 | 2 | 4 | 10 | 20 | 2 | 4 | 10 | 20 |
Utterances per speaker | 12 | 6 | 2 | 1 | 25 | 12 | 5 | 2 | 50 | 25 | 10 | 5 |
Total utterances | 24 | 24 | 20 | 20 | 50 | 48 | 48 | 40 | 100 | 100 | 100 | 100 |
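Each configuration in Table 1 is a subset drawn by fixing a number of speakers and a number of utterances per speaker. A minimal sketch of such subsampling follows; the function and the corpus layout are illustrative assumptions, not the actual experimental code.

```python
# Sketch of drawing a fine-tuning subset like S1 (2 speakers, 12
# utterances each, 24 in total). The corpus format (speaker id -> list
# of utterance ids) and the function name are hypothetical.
import random


def sample_subset(corpus, n_speakers, utts_per_speaker, seed=0):
    """Pick n_speakers speakers, then utts_per_speaker utterances each."""
    rng = random.Random(seed)  # fixed seed for a reproducible subset
    speakers = rng.sample(sorted(corpus), n_speakers)
    return {spk: rng.sample(corpus[spk], utts_per_speaker)
            for spk in speakers}


# Example: the S1 configuration (2 speakers x 12 utterances = 24 total)
corpus = {f"spk{i:02d}": [f"spk{i:02d}_utt{j:03d}" for j in range(100)]
          for i in range(30)}
subset = sample_subset(corpus, n_speakers=2, utts_per_speaker=12)
total = sum(len(utts) for utts in subset.values())  # 24 utterances
```

Larger configurations (M, L) simply change the two parameters while keeping the sampling procedure the same.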
Samples across different languages
Here we show only the samples selected according to the best CER among the 25 configurations (24 fine-tuning configurations plus zero-shot inference). Results for all configurations can be found under "all audios". Each language is represented by its ISO 639 code.
Language | Fine-tune method | Test text | GT audios | Synthesized audios |
---|---|---|---|---|
nld | L1 | Text1: nu zet ik mij neer om u te schrijven in geheel andere stemming onder den drang eener sterke behoefte om aan eene vertrouwde borst mijne bezwaren mijne bange vermoedens uit te storten | | |
| | Text2: eckbert had mij zelfs met geen blik verwaardigd toen hij vertrokken was herkreeg ik mijne bewustheid en onder het smartelijke het beschamende van mijn toestand wierp ik mij op de causeuse neer en barstte in tranen los | | |
ita | L1′ | Text1: poche ore dopo viene la servetta per far la spesa giornaliera e rimettere in ordine la casa è una toscaninà tozza ma svelta muso duro e linguacciuta ben alzato | | |
| | Text2: i trattenevo da lui qualche ora la conversazione però languiva poiché egli dopo avermi accolto con un sorriso mesto e muto di riconoscenza spesso richiudeva gli occhi | | |
cmn | S2′ | Text1: 相关%公司%股票%走势%农产品$ | | |
| | Text2: 目前%用户%对%空气%质量%满意度%普遍%较低$ | | |