Highlights of main outcomes

The VoicePersonae project has focused its research on the following four themes:
Theme 1: Increasing the accuracy of machine learning techniques for vocal identity
Theme 2: Improving the security and robustness of voice biometrics
Theme 3: Realization of new technologies for voice privacy protection
Theme 4: Application of research results to other modalities

In addition, we have been working diligently to drive the field forward by managing international challenges.

Theme 1: Voice identity modeling

Speech synthesis and voice conversion

There are two types of technologies related to the accurate reproduction of individual voices: voice conversion, which converts the voice of one speaker into that of another, and text-to-speech synthesis, which synthesizes the voices of multiple speakers from text. By integrating these methods, it became possible to consider text-to-speech synthesis and voice conversion in a unified framework rather than tackle the two research topics separately. It also become possible to mutually use training data in the databases. In addition, instead of merely improving the existing methods, it is potentially possible to develop a unified methodology for all kinds of speech generation tasks. The results were summarized in a journal paper, which was published on the top journal called IEEE/ACM Transactions on Audio, Speech, and Language Processing (IEEE/ACM TASLP).

Selected References:

Mingyang Zhang, Xin Wang, Fuming Fang, Haizhou Li, Junichi Yamagishi (2019). Joint Training Framework for Text-to-Speech and Voice Conversion Using Multi-Source Tacotron and WaveNet. Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15-19 September 2019.

Hieu-Thi Luong, Junichi Yamagishi (2020). NAUTILUS: A Versatile Voice Cloning System. IEEE ACM Trans. Audio Speech Lang. Process..

Speech enhancement and synthesis

The speech waveform generation task is also closely related to other tasks, such as far-end speech enhancement, which extracts only clean speech from noisy speech (speech enhancement), and near-end listening enhancement, which transforms speech itself so speech sounds more intelligible in noisy conditions (speech intelligibility enhancement). Integration of speech synthesis techniques with these fields was also completed. We obtained good results on integrating speech synthesis and speech enhancement. One of the methods to control the speech style of text-to-speech systems using neural networks is known as the “Style Token Model”. We proposed a “Noise Token Model” that learns noise type as a latent variable, and this approach shows its superiority to conventional methods. Also, we have achieved results in speech intelligibility enhancement, which is not a simple conversion task since there is no correct teacher data and hence there has been no significant progress in this technology so far. We proposed iMetricGAN in which a complex and non-differentiable speech intelligibility index is regarded as the output value of a discriminator under the framework of generative adversarial network, and the discriminator is made to approximate the speech intelligibility index. This technology shows very promising results at the Hurricane Challenge 2.0 in 2020. Its paper was accepted to IEEE/ACM TASLP.

Selected References:

Haoyu Li, Junichi Yamagishi (2020). Noise Tokens: Learning Neural Noise Templates for Environment-Aware Speech Enhancement. Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25-29 October 2020.

Haoyu Li, Junichi Yamagishi (2021). Multi-Metric Optimization Using Generative Adversarial Networks for Near-End Speech Intelligibility Enhancement. IEEE ACM Trans. Audio Speech Lang. Process..

Neural vocoders: Fusion of DSP and DNN

In speech synthesis and voice conversion, a technique called vocoder is usually used to convert predicted acoustic features into a speech waveform. In speech enhancement, vocoders are also used to generate a clean speech waveform. In other words, the vocoder is a key technology that spans multiple fields in the speech waveform generation task. Before the spread of neural networks, the STRAIGHT vocoder based on signal processing by Kawahara et al. was often used, and after the spread of neural networks, Wavenet, a huge network by Google DeepMind, was often used. Both have their own merits and demerits; STRAIGHT does not achieve as high of a quality of synthesized speech, and Wavenet requires massive computational resources and time. Therefore, we proposed a neural vocoder method, “Neural Source Filter Model”, which is a tight integration of signal processing and deep learning, and introduces a neural network into the part of the classical source filter vocoder. In this method, the signal processing is performed as a part of the operation of deep learning, and the parameters of the filter part are estimated from the speech waveform by the stochastic gradient descent method, while the source part is driven by the fundamental frequency like the signal processing vocoder. Such a method combining signal processing and neural network has a very high. We made a series of presentations on this method and the first paper has a high citation count of 74 even though this technology was proposed only two years ago in 2019. We also released the code for our neural vocoders, which is not only widely used for research purposes but also adopted in the free singing voice synthesis software “NEUTRINO.”

Selected References:

Xin Wang, Shinji Takaki, Junichi Yamagishi (2020). Neural Source-Filter Waveform Models for Statistical Parametric Speech Synthesis. IEEE ACM Trans. Audio Speech Lang. Process..

Yi Zhao, Xin Wang, Lauri Juvela, Junichi Yamagishi (2020). Transferring Neural Speech Waveform Synthesizers to Musical Instrument Sounds Generation. 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020, Barcelona, Spain, May 4-8, 2020.

Fusion of speaker recognition and synthesis

We also tried to integrate speech generation tasks with speaker recognition, which recognizes the individuality of a voice and performs personal authentication, and language recognition, which recognizes language from speech. We first showed that speaker embedding vectors, which are used in speaker recognition models, can be used in speech synthesis to control the speaker identity of synthesized speech. In addition, we showed that we can control the dialect of speech synthesis by introducing an intermediate representation from a dialect recognition model into speech synthesis.

Selected References:

Erica Cooper, Cheng-I Lai, Yusuke Yasuda, Fuming Fang, Xin Wang, Nanxin Chen, Junichi Yamagishi (2020). Zero-Shot Multi-Speaker Text-To-Speech with State-Of-The-Art Neural Speaker Embeddings. 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020, Barcelona, Spain, May 4-8, 2020.

Rakugo speech synthesis

As a challenging application of speech modeling technology focusing on the individuality of the voice, we focused on Rakugo, a traditional Japanese performing art, and studied how to learn and reproduce the storytelling performance by speech synthesis. Rakugo stories mainly consist of conversations among characters, and these characters are all played by a single Rakugo storyteller, who changes his voice tone appropriately so that the listener can understand which character is speaking. With the cooperation of the Edo Rakugo storyteller, we constructed a corpus of Rakugo speech and built a Rakugo synthesis model. This unique research result was published in a high-impact journal “IEEE Access”.

Selected References:

Shuhei Kato, Yusuke Yasuda, Xin Wang, Erica Cooper, Shinji Takaki, Junichi Yamagishi (2020). Modeling of Rakugo Speech and Its Limitations: Toward Speech Synthesis That Entertains Audiences. IEEE Access.

Theme 2: Voice biometric technology

Speech liveness detection and ASVspoof

Although the technology of reproducing an individual’s voice is expected to bring new value in entertainment, it may cause security problems in speaker recognition systems if it is misused. In addition, there is a possibility of this technology being used for telephone fraud and information manipulation. To solve this problem, we first constructed a large-scale speech database for liveness detection, and held a worldwide ASVspoof challenge 2019 to evaluate the performance of liveness detection algorithms on the common database. A paper summarizing the ranking results of the challenge participants was also presented at Interspeech 2019. The number of citations for this is 151. From the analysis results, we see that the top teams out of the 50 teams that participated in the challenge achieved very good results, confirming that they can achieve accurate liveness detection even if speech synthesis is greatly advanced (as long as when the training database is properly constructed). Through further detailed analysis, we have identified the necessary conditions for building a highly discriminative liveness detection model, and we have published a practice guideline and an open-source program that summarizes the essence of our findings to the community. The data and findings from ASVspoof are expected to be widely used in research and industry in the future.

Selected References:

Andreas Nautsch, Xin Wang, Nicholas Evans, Tomi H. Kinnunen, Ville Vestman, Massimiliano Todisco, Héctor Delgado, Md. Sahidullah, Junichi Yamagishi, Kong Aik Lee (2021). ASVspoof 2019: Spoofing Countermeasures for the Detection of Synthesized, Converted and Replayed Speech. IEEE Trans. Biom. Behav. Identity Sci..

Explainability of liveness detection algorithms

Through the analysis of ASVspoof challenge, it was found that in order to discriminate various types of synthetic speech with a high accuracy, it is necessary to construct a liveness detection model from various features. Therefore we’re currently developing a biometric liveness detection algorithm that can provide evidence of the information and features used in the detection of liveness and hence improve the explainability and interpretability. Examples include visualizing the attention of a Graph Attention Network and a new model that identifies tampered and synthetic regions to provide evidence for liveness detection. We will continue to improve our method and develop it as XAI liveness detection.

Selected References:

Hemlata Tak, Jee-weon Jung, Jose Patino, Massimiliano Todisco, Nicholas Evans (2021). Graph Attention Networks for Anti-Spoofing. Proc. Interspeech 2021.

Lin Zhang, Xin Wang, Erica Cooper, Junichi Yamagishi, Jose Patino, Nicholas Evans (2021). An Initial Investigation for Detecting Partially Spoofed Audio. Proc. Interspeech 2021.

The ASVspoof challenge 2021

To drive the field and accelerate research, the ASVspoof challenge was held again in 2021. Since the huge database we built in 2019 is still the most advanced fake speech dataset after two years, we released additional test scenarios to evaluate the performance over the telephone and performance for compressed streaming audio. Its analysis is currently in progress.

Selected References:

Junichi Yamagishi, Xin Wang, Massimiliano Todisco, Md. Sahidullah, Jose Patino, Andreas Nautsch, Xuechen Liu, Kong Aik Lee, Tomi Kinnunen, Nicholas Evans, Héctor Delgado (2021). ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection. Proc. 2021 Automatic Speaker Verification and Spoofing Countermeasures Challenge (ASVspoof 2021 Workshop).

Theme 3: Voice privacy

Speaker anonymization methods

As it is becoming easier and easier to create personal speech synthesis systems from speech data on social media, there is a need for techniques to anonymize speaker information contained in speech, as well as other information that is appropriate to be protected. This is a very new research topic. Therefore, we first proposed a new method of speaker anonymization by combining speech synthesis technology and speaker recognition technology. Although there are various characteristics of speech that could potentially be anonymized, we proposed a speaker anonymization method that aims to change speaker identity while preserving the naturalness of speech and other speaker attributes such as age and gender that can be perceived from speech. We decompose speech into three types of information: intonation, phonetic information, and a speaker embedding vector which is then averaged using k-nearest speakers. A neural source filter model is then used to resynthesize the speech waveform, which enables high quality speech generation.

Selected References:

Fuming Fang, Xin Wang, Junichi Yamagishi, Isao Echizen, Massimiliano Todisco, Nicholas Evans, Jean-François Bonastre (2019). Speaker Anonymization Using X-vector and Neural Waveform Models. Proc. 10th ISCA Workshop on Speech Synthesis (SSW 10).

Masking process of non-speaker features

In addition to speaker identity, there are many other types of information that should be considered when protecting privacy, such as the speaker’s gender, their dialect (due to race or environment, such as black, white, ethnicity, etc.), and information about the speaker’s speech impairment. It is desirable to be able to selectively mask other features according to the user’s wishes. We investigated the possibility of masking the speech content using a neural network structure like the speaker anonymization method described above. We first decomposed the speech signal into a series of local latent variables representing the speech content and a series of global latent variables representing the speaker information, and then constructed a network that re-synthesized the speech waveform. We then mask the speech content by replacing a part of the local latent variable series representing the speech content with another latent variable extracted from babble noise. By using this method, it became possible to mask a part of the speech without using special sounds such as beeps, while maintaining the original speaker identity.

Selected References:

Jennifer Williams, Junichi Yamagishi, Paul-Gauthier Noé, Cassia Valentini-Botinhao, Jean-François Bonastre (2021). Revisiting Speech Content Privacy. Proceedings of the Symposium of the Security & Privacy in Speech Communication.

Evaluation metrics for anonymization methods

As speaker anonymization is a very new research topic, we need to study and propose evaluation methods and metrics for speaker anonymization, too. Since the development of evaluation metrics is an important process that determines the direction of the field, it was discussed carefully. Specifically, our project members and EU H2020 COMPRISE members held discussions over a period of six months, from October 2019 to March 2020, and came to the conclusion that speaker anonymization methods should meet at least the following conditions:

Anonymized voice does not affect other processing (downstream tasks).
The speaker’s identity is not recognizable from the anonymized voice.
Anonymized voice is not re-identifiable regardless of the attacker’s knowledge or data holdings.
In multi-person conversations, it is possible to distinguish speakers appropriately even after anonymization.

We consider that speaker anonymization is not an end in itself, but that there are other tasks or applications that users want to perform. Such a task is generally called a downstream task, and the condition (1) indicates that the speaker anonymization method should not affect such a downstream task, which may be either a human listener or a machine. Next, (2) is the condition that the speaker-anonymized speech is not recognized as the person himself/herself by ordinary speaker recognition techniques. Condition (3) considers the risk that the voice after speaker anonymization will be identified again by another user with malicious intent. It is appropriate to estimate the risk of re-identification based on the worst-case scenario, although the knowledge and data possessed by such a malicious user may vary. In a conversation among multiple people, it is also necessary to make it possible for the listener to confirm the change of speakers even if the individuals cannot be identified. In other words, it is necessary to be able to distinguish speakers appropriately even after anonymization. This is condition (4). However, there were no evaluation metrics for (3) and (4). Therefore, we proposed a new metric called ZEBRA, which evaluates based on the worst-case risk of re-identification of speech after speaker anonymization, and another new metric that considers how similar the post-anonymized speech of multiple speakers is to each other.

Selected References:

Andreas Nautsch, Jose Patino, Natalia A. Tomashenko, Junichi Yamagishi, Paul-Gauthier Noé, Jean-François Bonastre, Massimiliano Todisco, Nicholas Evans (2020). The Privacy ZEBRA: Zero Evidence Biometric Recognition Assessment. Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25-29 October 2020.

Paul-Gauthier Noé, Andreas Nautsch, Nicholas Evans, Jose Patino, Jean-François Bonastre, Natalia Tomashenko, Driss Matrouf (2022). Towards a unified assessment framework of speech pseudonymisation. Computer Speech & Language.

Voice Privacy challenge

We ran an international challenge on voice privacy protection, similar to the Voice Conversion Challenge and the ASVspoof challenge, in order to drive the field and accelerate research. The “Voice Privacy Challenge” was implemented in 2020. Specifically, we have conducted a mutual evaluation of speaker anonymization methods proposed by more than ten universities, companies, and research organizations As a result of the intercomparison of these different anonymization methods, it was confirmed that all of them can significantly reduce the possibility of personal identification, but at the same time, the usefulness of the anonymized voice for downstream tasks is also reduced, and the best method does not exist yet. However, it is a very significant finding in that it clarified the points that needs to be improved in the future.

Selected References:

Natalia A. Tomashenko, Brij Mohan Lal Srivastava, Xin Wang, Emmanuel Vincent, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Jose Patino, Jean-François Bonastre, Paul-Gauthier Noé, Massimiliano Todisco (2020). Introducing the VoicePrivacy Initiative. Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25-29 October 2020.

Theme 4: Extension to other modalities

Deepfake detection

We apply our research results in the field of speech to other modalities and show that our research results are also effective in the fields of image processing and natural language processing. We first constructed a model that automatically detects fake face images, which are automatically generated by the deepfake and Face2Face technologies. Those fake images and videos have become a social problem mainly in Europe and the United States and have caused actual damage in Japan. It is demonstrated that our proposed neural network technology called the capsule network can identify fake faces with high accuracy on a dataset called FaceForensics++ and presented it at the international conference. It is one of the first deepfake detection models in the world. This work was published in 2018 and has 147 citations. In addition, we proposed a new network that not only performs the authenticity test on videos but also identifies the tampered pixel region at the same time. Image segmentation techniques were used to identify those regions. By indicating the tampered pixel regions, we can show the evidence that the image is fake, which improves explainability. The experimental results also show that the identification of tampered pixel regions is effective against unknown fake video generation methods and has various other advantages. This work was presented at IEEE BTAS 2019, and this paper has a high citation count of 125.

Selected References:

Huy Hong Nguyen, Junichi Yamagishi, Isao Echizen (2019). Capsule-forensics: Using Capsule Networks to Detect Forged Images and Videos. IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2019, Brighton, United Kingdom, May 12-17, 2019.

Huy Hong Nguyen, Fuming Fang, Junichi Yamagishi, Isao Echizen (2019). Multi-task Learning for Detecting and Segmenting Manipulated Facial Images and Videos. 10th IEEE International Conference on Biometrics Theory, Applications and Systems, BTAS 2019, Tampa, FL, USA, September 23-26, 2019.

Multi-face deepfake detection

Deepfake detection so far has been based on the assumption that face detection has already been performed, and hence the input to the discriminant model has always been a single face image. When multiple faces appear in an image, all of them are detected in advance and judged for authenticity sequentially, which is very inefficient. Hence we constructed a new high-quality deepfake detection database called OpenForensics, which contains a large number of images with an arbitrary number of faces and proposed various end-to-end deep networks that simultaneously and efficiently detect the authenticity of multiple faces and identify the tampering area. This research result was accepted to ICCV2021.

Selected References:

Trung-Nghia Le, Huy Hong Nguyen, Junichi Yamagishi, Isao Echizen (2021). OpenForensics: Large-Scale Challenging Dataset for Multi-Face Forgery Detection and Segmentation In-the-Wild. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).

MasterFace generation and detection

Speaker verification and face recognition systems share many common technologies. Therefore, our knowledge on vulnerability research of speaker recognition can be applied to face recognition systems. Specifically, we investigated whether it is possible to automatically generate a master face (corresponding to a so-called “master key”), whose features match those of many registered users, by automatically updating latent variables in publicly available deep generative models such as StyleGAN using the hill-climbing algorithm. We announced that such master faces can be generated for some face recognition systems at the famous biometrics conference IJCB2020. We found that the number of false acceptance cases increased as we optimized the generated face images. To deal with this vulnerability, it is necessary to introduce a defense model that automatically discriminates GAN images.

Selected References:

Huy Hong Nguyen, Junichi Yamagishi, Isao Echizen, Sébastien Marcel (2020). Generating Master Faces for Use in Performing Wolf Attacks on Face Recognition Systems. 2020 IEEE International Joint Conference on Biometrics, IJCB 2020, Houston, TX, USA, September 28 - October 1, 2020.

Extensions to text and writing

In the field of natural language processing, huge neural language models such as OpenAI’s Generative Pre-trained Transformer (GPT) can automatically generate fluent sentences. For example, the Grover model, which automatically generates news articles from the title, date, and publisher name of the article, has been proposed and its accuracy has been confirmed. Therefore, we expanded the synthetic media detection of this project to the natural language processing field. First, we focused on “word-of-mouth” reviews on shopping sites. These kinds of reviews are known to have problems such as being fake. If the neural language models are used to generate such fake reviews, the problem will worsen our daily lives. Therefore, we investigated the possibility of discriminating between computer-generated and human-written reviews and combined multiple discriminative models. Although the detection accuracy is lower than that of voice or images, the results show that the combination of multiple detection models can detect artificially generated reviews to some extent. This paper was also cited as one of the safety research papers when OpenAI released GPT-2 to the public .

Selected References:

David Ifeoluwa Adelani, Haotian Mai, Fuming Fang, Huy Hong Nguyen, Junichi Yamagishi, Isao Echizen (2020). Generating Sentiment-Preserving Fake Online Reviews Using Neural Language Models and Their Human- and Machine-Based Detection. Advanced Information Networking and Applications - Proceedings of the 34th International Conference on Advanced Information Networking and Applications, AINA-2020, Caserta, Italy, 15-17 April.

Automated fact checking

A more essential task is to verify whether the meaning of the sentences is factual or not, i.e., fact-checking. Currently, fact-checking is often done manually, but attempts to automate fact-checking using machine learning have begun. Automated fact checking consists of several components and it first searches for scientific papers or information sources related to the claim to be verified. Next, it selects sentences that can be used as evidence. Finally, it outputs a judgment about whether the claim is true or not. This automatic fact-checking is a complementary technology to fake synthetic media detection technology. We began this new topic and confirmed that the performance of automatic fact-checking can be improved by simultaneously optimizing the sentence selection and the true/false answers, which are currently optimized separately. The results were presented at ACL finding 2021.

Selected References:

Canasai Kruengkrai, Junichi Yamagishi, Xin Wang (2021). A Multi-Level Attention Model for Evidence-Based Fact Checking. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021.