An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios
Abstract. Self-supervised learning (SSL) representations from massively multilingual models offer a promising solution for low-resource language speech tasks. Despite advancements, language adaptation in TTS systems remains an open problem. This paper explores the language adaptation ability of ZMM-TTS, a recent SSL-based multilingual TTS system. We conducted experiments on 12 languages using limited data with various finetuning configurations. We demonstrate that the similarity in phonetics between the pretraining and target languages, as well as the language category, affects the target language’s adaptation performance. Additionally, we find that the fine-tuning dataset size and number of speakers influence adaptability. Surprisingly, we also observed that using paired data for fine-tuning is not always optimal compared to audio-only data. Beyond speech intelligibility, our analysis covers speaker similarity, language identification, and predicted MOS.
This page is for research demonstration purposes only.
Note:
- Ground truth Italian and Dutch speech samples are from Multilingual LibriSpeech (MLS), which is licensed under the CC BY 4.0 license.
- Ground truth Chinese speech samples are from AISHELL-3, which is licensed under the Apache License v2.0.
- Speech samples for the other languages are from GlobalPhone, which cannot be redistributed publicly. Hence, this page only lists samples from the two databases above.
Fine-tuning methods
- Paired-data fine-tuning: We used paired data {text, audio} and performed fine-tuning on both the txt2vec and vec2wav models.
- Audio-only fine-tuning: We used audio-only data for fine-tuning the vec2wav model, and during testing, txt2vec processes the input in a zero-shot manner.
- Zero-shot: Without using any data for fine-tuning, both txt2vec and vec2wav were tested directly in zero-shot inference.
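The three settings above differ only in which of the two ZMM-TTS stages receives updates. The following is a minimal sketch of that distinction; the `Stage` class and `train_step` method are illustrative stand-ins, not the actual ZMM-TTS training API.

```python
# Toy illustration of which modules each adaptation setting updates.
# Stage stands in for a ZMM-TTS component (txt2vec or vec2wav); the
# names and interfaces here are assumptions, not the real codebase.

class Stage:
    """Records whether the module received any fine-tuning updates."""
    def __init__(self, name):
        self.name = name
        self.updated = False

    def train_step(self, *inputs):
        self.updated = True  # a real stage would take a gradient step here


def finetune(txt2vec, vec2wav, mode):
    """Apply one fine-tuning step according to the chosen setting."""
    if mode == "paired":        # {text, audio} pairs adapt both stages
        txt2vec.train_step("text", "ssl_units")
        vec2wav.train_step("ssl_units", "audio")
    elif mode == "audio_only":  # audio alone can only adapt vec2wav;
        vec2wav.train_step("ssl_units", "audio")  # txt2vec stays zero-shot
    elif mode == "zero_shot":   # no adaptation at all
        pass
    return txt2vec.updated, vec2wav.updated
```

At test time all three settings run the same inference pipeline; only the weights differ.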
To analyze how the number of speakers and the total number of utterances in the fine-tuning dataset affect the final adaptation performance, we employed various fine-tuning data configurations, as shown in the following table.
Table 1: Fine-tuning dataset configurations. S, M, and L denote small, medium, and large. For the same fine-tuning data size, we use superscripts to distinguish between fine-tuning approaches: {S1, S2, · · · , L4} represents audio-only fine-tuning, while {S1′, S2′, · · · , L4′} represents paired-data fine-tuning. In the subsequent sections, we use 0 to represent zero-shot inference.
Name | S1 | S2 | S3 | S4 | M1 | M2 | M3 | M4 | L1 | L2 | L3 | L4 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Speakers | 2 | 4 | 10 | 20 | 2 | 4 | 10 | 20 | 2 | 4 | 10 | 20 |
Utterances per speaker | 12 | 6 | 2 | 1 | 25 | 12 | 5 | 2 | 50 | 25 | 10 | 5 |
Total utterances | 24 | 24 | 20 | 20 | 50 | 48 | 48 | 40 | 100 | 100 | 100 | 100 |
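Each configuration in Table 1 is a subset drawn by fixing a number of speakers and a number of utterances per speaker. A minimal sketch of such subsampling follows; the function and the corpus layout are illustrative assumptions, not the actual experimental code.

```python
# Sketch of drawing a fine-tuning subset like S1 (2 speakers, 12
# utterances each, 24 in total). The corpus format (speaker id -> list
# of utterance ids) and the function name are hypothetical.
import random


def sample_subset(corpus, n_speakers, utts_per_speaker, seed=0):
    """Pick n_speakers speakers, then utts_per_speaker utterances each."""
    rng = random.Random(seed)  # fixed seed for a reproducible subset
    speakers = rng.sample(sorted(corpus), n_speakers)
    return {spk: rng.sample(corpus[spk], utts_per_speaker)
            for spk in speakers}


# Example: the S1 configuration (2 speakers x 12 utterances = 24 total)
corpus = {f"spk{i:02d}": [f"spk{i:02d}_utt{j:03d}" for j in range(100)]
          for i in range(30)}
subset = sample_subset(corpus, n_speakers=2, utts_per_speaker=12)
total = sum(len(utts) for utts in subset.values())  # 24 utterances
```

Larger configurations (M, L) simply change the two parameters while keeping the sampling procedure the same.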
Samples across different languages
Here we show only the samples selected according to the best CER among the 25 configurations (24 fine-tuning configurations plus zero-shot inference). Results for all configurations can be found under "all audios". Each language is represented by its ISO 639 code.
Language | Fine-tune method | Test text | GT audios | Synthesized audios |
---|---|---|---|---|
nld | L1 | Text1: nu zet ik mij neer om u te schrijven in geheel andere stemming onder den drang eener sterke behoefte om aan eene vertrouwde borst mijne bezwaren mijne bange vermoedens uit te storten | | |
| | Text2: eckbert had mij zelfs met geen blik verwaardigd toen hij vertrokken was herkreeg ik mijne bewustheid en onder het smartelijke het beschamende van mijn toestand wierp ik mij op de causeuse neer en barstte in tranen los | | |
ita | L1′ | Text1: poche ore dopo viene la servetta per far la spesa giornaliera e rimettere in ordine la casa è una toscaninà tozza ma svelta muso duro e linguacciuta ben alzato | | |
| | Text2: i trattenevo da lui qualche ora la conversazione però languiva poiché egli dopo avermi accolto con un sorriso mesto e muto di riconoscenza spesso richiudeva gli occhi | | |
cmn | S2′ | Text1: 相关%公司%股票%走势%农产品$ | | |
| | Text2: 目前%用户%对%空气%质量%满意度%普遍%较低$ | | |