The VoicePersonae project has focused its research on the following four themes:
Theme 1: Increasing the accuracy of machine learning techniques for vocal identity
Theme 2: Improving the security and robustness of voice biometrics
Theme 3: Realization of new technologies for voice privacy protection
Theme 4: Application of research results to other modalities
In addition, we have worked to drive the field forward by organizing international challenges.
There are two types of technologies related to the accurate reproduction of individual voices: voice conversion, which converts the voice of one speaker into that of another, and text-to-speech synthesis, which synthesizes the voices of multiple speakers from text. By integrating these methods, it became possible to treat text-to-speech synthesis and voice conversion in a unified framework rather than tackling the two research topics separately, and to share training data across the respective databases. Moreover, instead of merely improving existing methods, such integration potentially opens the way to a unified methodology for all kinds of speech generation tasks. The results were summarized in a journal paper published in the top journal IEEE/ACM Transactions on Audio, Speech, and Language Processing (IEEE/ACM TASLP).
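As a rough illustration of this unified view (a minimal sketch with hypothetical module names and sizes, not the architecture in our papers), the two tasks can share a single acoustic decoder while only the input-side encoder differs, which also makes it natural to pool their training data:

```python
# Minimal sketch (assumptions, not the published model): TTS and VC share one
# decoder; only the input-side encoder differs, so training data can be pooled.
import torch
import torch.nn as nn

class TextEncoder(nn.Module):          # text -> shared latent sequence
    def __init__(self, vocab=100, dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, tokens):
        h, _ = self.rnn(self.emb(tokens))
        return h

class SpeechEncoder(nn.Module):        # source-speaker acoustics -> same latent space
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True)

    def forward(self, mels):
        h, _ = self.rnn(mels)
        return h

class SharedDecoder(nn.Module):        # shared latents + target-speaker embedding -> mel
    def __init__(self, dim=256, spk_dim=64, n_mels=80):
        super().__init__()
        self.rnn = nn.GRU(dim + spk_dim, dim, batch_first=True)
        self.out = nn.Linear(dim, n_mels)

    def forward(self, latents, spk_emb):
        spk = spk_emb.unsqueeze(1).expand(-1, latents.size(1), -1)
        h, _ = self.rnn(torch.cat([latents, spk], dim=-1))
        return self.out(h)

# TTS: SharedDecoder()(TextEncoder()(tokens), target_spk_emb)
# VC : SharedDecoder()(SpeechEncoder()(source_mels), target_spk_emb)
```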
Selected References:
The speech waveform generation task is also closely related to other tasks, such as far-end speech enhancement, which extracts clean speech from noisy speech (speech enhancement), and near-end listening enhancement, which transforms the speech itself so that it sounds more intelligible in noisy conditions (speech intelligibility enhancement). We also integrated speech synthesis techniques with these fields and obtained good results. One method for controlling the speaking style of neural text-to-speech systems is known as the “Style Token Model”; we proposed a “Noise Token Model” that learns the noise type as a latent variable and showed that it outperforms conventional methods. We also achieved results in speech intelligibility enhancement, which is not a simple conversion task because no correct teacher data exist, and hence there had been little progress in this technology so far. We proposed iMetricGAN, in which a complex, non-differentiable speech intelligibility index is treated as the output of a discriminator within the generative adversarial network framework, so that the discriminator learns to approximate the intelligibility index. This technology showed very promising results at the Hurricane Challenge 2.0 in 2020, and the corresponding paper was accepted to IEEE/ACM TASLP.
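The core of iMetricGAN can be sketched as follows (a toy example with hypothetical shapes and a placeholder metric; the real system uses an actual intelligibility index computed outside the network and proper speech features): the discriminator is trained to regress the metric score, and the generator is then trained through the discriminator to push the predicted score toward its maximum.

```python
# Minimal sketch of the iMetricGAN training idea (hypothetical shapes/helpers):
# the non-differentiable intelligibility metric is approximated by a learned
# discriminator D, which then provides gradients for the generator G.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(257, 512), nn.ReLU(), nn.Linear(512, 257))          # gain estimator
D = nn.Sequential(nn.Linear(257 * 2, 512), nn.LeakyReLU(), nn.Linear(512, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

def intelligibility(speech, noise):
    """Placeholder for a real, non-differentiable metric, normalised to [0, 1];
    in practice it is computed outside the computation graph."""
    return torch.rand(speech.size(0), 1)

loader = [(torch.rand(8, 257), torch.rand(8, 257)) for _ in range(4)]   # dummy (speech, noise) frames

for speech, noise in loader:
    enhanced = speech * G(speech)                 # G predicts per-frame gains

    # --- D step: regress the true metric score of the enhanced speech ---
    score = intelligibility(enhanced.detach(), noise)
    d_loss = ((D(torch.cat([enhanced.detach(), noise], -1)) - score) ** 2).mean()
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- G step: push D's predicted score toward the maximum value 1 ---
    g_loss = ((D(torch.cat([enhanced, noise], -1)) - 1.0) ** 2).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```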
Selected References:
In speech synthesis and voice conversion, a technique called a vocoder is usually used to convert predicted acoustic features into a speech waveform. In speech enhancement, vocoders are also used to generate clean speech waveforms. In other words, the vocoder is a key technology that spans multiple fields within the speech waveform generation task. Before the spread of neural networks, the STRAIGHT vocoder by Kawahara et al., based on signal processing, was widely used; after their spread, WaveNet, a huge network by Google DeepMind, became common. Both have their merits and demerits: STRAIGHT does not achieve very high quality of synthesized speech, while WaveNet requires massive computational resources and time. We therefore proposed a neural vocoder, the “Neural Source Filter (NSF) Model”, which tightly integrates signal processing and deep learning by introducing a neural network into part of the classical source-filter vocoder. In this method, the signal processing is performed as part of the deep learning computation: the parameters of the filter part are estimated from the speech waveform by stochastic gradient descent, while the source part is driven by the fundamental frequency, as in a signal processing vocoder. Such a method, combining signal processing and neural networks, has very high practical value, producing high-quality speech at a fraction of WaveNet's computational cost. We made a series of presentations on this method, and the first paper has already been cited 74 times even though the technology was proposed only two years ago, in 2019. We also released the code for our neural vocoders, which is not only widely used for research purposes but has also been adopted in the free singing voice synthesis software “NEUTRINO.”
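The following is a heavily simplified sketch of the source-filter idea (toy shapes, no harmonic or noise branches, not the released implementation): a sine-based excitation is generated from the F0 contour, and a small trainable filter network shapes it into the output waveform.

```python
# Minimal sketch of the neural source-filter idea (simplified; not the full model):
# the source is a sine wave driven by F0, the filter is a small trainable network.
import torch
import torch.nn as nn

class TinyNSF(nn.Module):
    def __init__(self, sr=16000, hop=80, hidden=64):
        super().__init__()
        self.sr, self.hop = sr, hop
        self.filter = nn.Sequential(                 # "neural filter" applied to the waveform
            nn.Conv1d(1, hidden, 3, padding=1), nn.Tanh(),
            nn.Conv1d(hidden, 1, 3, padding=1),
        )

    def source(self, f0):
        """Sine-based excitation from a frame-level F0 contour (Hz)."""
        f0_up = f0.repeat_interleave(self.hop, dim=-1)           # upsample to sample rate
        phase = 2 * torch.pi * torch.cumsum(f0_up / self.sr, dim=-1)
        sine = torch.sin(phase)
        return torch.where(f0_up > 0, sine, torch.randn_like(sine) * 0.1)  # noise when unvoiced

    def forward(self, f0):
        src = self.source(f0).unsqueeze(1)           # (batch, 1, samples)
        return self.filter(src).squeeze(1)           # filter parameters are learned by SGD

f0 = torch.full((1, 100), 220.0)                     # 100 frames at 220 Hz
wav = TinyNSF()(f0)                                  # (1, 8000) waveform
```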
Selected References:
We also worked to integrate speech generation tasks with speaker recognition, which recognizes the individuality of a voice for personal authentication, and language recognition, which identifies the language spoken. We first showed that speaker embedding vectors used in speaker recognition models can be employed in speech synthesis to control the speaker identity of synthesized speech. In addition, we showed that the dialect of synthesized speech can be controlled by introducing an intermediate representation from a dialect recognition model into speech synthesis.
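A minimal, self-contained sketch of this conditioning idea is shown below (stand-in encoders instead of real speaker-recognition and dialect-recognition models; all dimensions are hypothetical).

```python
# Minimal sketch (not the published systems): embedding vectors produced by
# recognition models condition the synthesizer, so the same decoder can be
# steered toward a target speaker and a target dialect.
import torch
import torch.nn as nn

spk_encoder = nn.GRU(80, 64, batch_first=True)     # stand-in for an x-vector extractor
dia_encoder = nn.GRU(80, 16, batch_first=True)     # stand-in for a dialect-ID model
decoder     = nn.GRU(256 + 64 + 16, 256, batch_first=True)

ref_mels, latents = torch.rand(1, 200, 80), torch.rand(1, 120, 256)
_, spk = spk_encoder(ref_mels)                     # (1, 1, 64): speaker identity
_, dia = dia_encoder(ref_mels)                     # (1, 1, 16): dialect representation
cond = torch.cat([spk, dia], dim=-1).transpose(0, 1).expand(-1, latents.size(1), -1)
mel_out, _ = decoder(torch.cat([latents, cond], dim=-1))
```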
Selected References:
As a challenging application of speech modeling technology focusing on the individuality of the voice, we turned to Rakugo, a traditional Japanese performing art, and studied how to learn and reproduce its storytelling performance with speech synthesis. Rakugo stories mainly consist of conversations among characters, all of whom are played by a single Rakugo storyteller, who changes his voice tone appropriately so that the listener can tell which character is speaking. With the cooperation of an Edo Rakugo storyteller, we constructed a corpus of Rakugo speech and built a Rakugo synthesis model. This unique research result was published in the high-impact journal IEEE Access.
Selected References:
Although the technology for reproducing an individual’s voice is expected to bring new value in entertainment, it may cause security problems in speaker recognition systems if misused, and could also be abused for telephone fraud and information manipulation. To address this problem, we first constructed a large-scale speech database for liveness detection and held the worldwide ASVspoof 2019 challenge to evaluate the performance of liveness detection algorithms on this common database. A paper summarizing the ranking results of the challenge participants was presented at Interspeech 2019 and has been cited 151 times. The analysis showed that the top teams among the 50 participants achieved very good results, confirming that accurate liveness detection is possible even as speech synthesis advances greatly, as long as the training database is properly constructed. Through further detailed analysis, we identified the necessary conditions for building a highly discriminative liveness detection model, and we published a practice guideline and an open-source program that summarize the essence of our findings for the community. The data and findings from ASVspoof are expected to be widely used in research and industry in the future.
Selected References:
The analysis of the ASVspoof challenge revealed that, in order to discriminate various types of synthetic speech with high accuracy, a liveness detection model must be built from diverse features. We are therefore currently developing a liveness detection algorithm that can provide evidence about the information and features used in its decisions, thereby improving explainability and interpretability. Examples include visualizing the attention of a Graph Attention Network and a new model that identifies tampered and synthetic regions to provide evidence for liveness detection. We will continue to improve these methods and develop them into XAI liveness detection.
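The sketch below illustrates only the attention-visualization idea (a single toy graph-attention layer over a fully connected graph of spectro-temporal segments; not our actual detector): the layer returns its attention map so that the segments driving the decision can be displayed as evidence.

```python
# Minimal sketch (illustrative only): a graph-attention layer that also returns
# its attention weights, so the segments the detector relies on can be
# visualised as evidence for the liveness decision.
import torch
import torch.nn as nn

class TinyGATLayer(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.att = nn.Linear(2 * dim, 1)

    def forward(self, nodes):                        # nodes: (n_nodes, dim), fully connected graph
        h = self.proj(nodes)
        n = h.size(0)
        pairs = torch.cat([h.unsqueeze(1).expand(-1, n, -1),
                           h.unsqueeze(0).expand(n, -1, -1)], dim=-1)
        att = torch.softmax(self.att(pairs).squeeze(-1), dim=-1)   # (n, n) attention map
        return att @ h, att

feats = torch.rand(20, 64)                           # e.g. 20 spectro-temporal segments
out, attention = TinyGATLayer()(feats)
# 'attention' highlights which segments drive the spoof / bona-fide decision.
```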
Selected References:
To drive the field and accelerate research, the ASVspoof challenge was held again in 2021. Since the large database we built in 2019 remained the most advanced fake-speech dataset two years later, we released additional test scenarios to evaluate performance over telephone channels and on compressed streaming audio. The analysis is currently in progress.
Selected References:
As it becomes easier and easier to build personal speech synthesis systems from speech data found on social media, techniques are needed to anonymize the speaker information contained in speech, as well as other information that should be protected. This is a very new research topic. We therefore first proposed a new speaker anonymization method that combines speech synthesis technology and speaker recognition technology. Although various characteristics of speech could potentially be anonymized, our method aims to change the speaker identity while preserving the naturalness of the speech and other perceivable speaker attributes such as age and gender. We decompose speech into three types of information: intonation, phonetic information, and a speaker embedding vector, the last of which is replaced by an average over the k nearest speakers. A neural source-filter model is then used to resynthesize the speech waveform, enabling high-quality speech generation.
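A minimal sketch of the anonymization step for the speaker embedding is shown below (hypothetical dimensions and pool; the intonation and phonetic streams are left unchanged).

```python
# Minimal sketch of the anonymization recipe described above (hypothetical
# extractors and pool): the original speaker embedding is replaced by the mean
# of nearby embeddings from an external speaker pool before resynthesis.
import numpy as np

def anonymize_embedding(spk_emb, pool, k=10):
    """Replace spk_emb by the mean of its k nearest embeddings in 'pool'."""
    dists = np.linalg.norm(pool - spk_emb, axis=1)
    nearest = pool[np.argsort(dists)[:k]]
    return nearest.mean(axis=0)

# Hypothetical shapes: a 512-dim embedding and a pool of 200 external speakers.
pool = np.random.randn(200, 512)
original = np.random.randn(512)
pseudo_speaker = anonymize_embedding(original, pool)
# The intonation, phonetic features and 'pseudo_speaker' are then fed to the
# neural source-filter model (see the sketch above) to resynthesize the waveform.
```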
Selected References:
In addition to speaker identity, many other types of information should be considered when protecting privacy, such as the speaker’s gender, their dialect or accent (which may reflect ethnicity or environment), and information about any speech impairment. It is desirable to be able to selectively mask such features according to the user’s wishes. We investigated the possibility of masking the speech content itself using a neural network structure similar to the speaker anonymization method described above. We first decomposed the speech signal into a sequence of local latent variables representing the speech content and global latent variables representing the speaker information, and then constructed a network that resynthesizes the speech waveform. We then mask part of the speech content by replacing the corresponding portion of the local latent sequence with latent variables extracted from babble noise. With this method, it became possible to mask a part of the speech without using special sounds such as beeps, while maintaining the original speaker identity.
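The masking operation itself can be sketched as follows (illustrative shapes only; the encoders and decoder are omitted):

```python
# Minimal sketch: frames of the local latent sequence that carry the content to
# be hidden are overwritten with latents extracted from babble noise, while the
# global speaker latent is left untouched.
import torch

local_latents  = torch.rand(1, 300, 128)   # content: one vector per frame
babble_latents = torch.rand(1, 300, 128)   # latents extracted from babble noise
speaker_latent = torch.rand(1, 64)         # global: speaker identity

mask_from, mask_to = 120, 180              # frames containing sensitive words
local_latents[:, mask_from:mask_to] = babble_latents[:, mask_from:mask_to]
# The decoder then resynthesizes speech from (local_latents, speaker_latent):
# the masked span becomes unintelligible, but the voice still sounds like the
# original speaker and no beep is inserted.
```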
Selected References:
As speaker anonymization is a very new research topic, we also needed to study and propose evaluation methods and metrics for it. Since the development of evaluation metrics is an important process that determines the direction of the field, it was discussed carefully. Specifically, our project members and members of the EU H2020 COMPRISE project held discussions over a period of six months, from October 2019 to March 2020, and concluded that speaker anonymization methods should meet at least the following conditions:
(1) the anonymized speech must remain useful for the downstream tasks that users actually want to perform;
(2) the anonymized speech must not be recognized as the original speaker by ordinary speaker recognition techniques;
(3) the risk of re-identification by a malicious party must remain low even under worst-case assumptions; and
(4) different speakers must remain distinguishable from one another after anonymization.
We consider that speaker anonymization is not an end in itself; rather, there are other tasks or applications that users want to perform. Such a task is generally called a downstream task, and condition (1) states that speaker anonymization should not degrade such downstream tasks, whether the consumer is a human listener or a machine. Condition (2) requires that speaker-anonymized speech not be recognized as the original speaker by ordinary speaker recognition techniques. Condition (3) addresses the risk that the anonymized voice will be re-identified by a malicious party; although the knowledge and data possessed by such an attacker may vary, it is appropriate to estimate the risk of re-identification under the worst-case scenario. In a conversation among multiple people, it is also necessary for listeners to be able to notice a change of speaker even if the individuals cannot be identified; in other words, speakers must remain appropriately distinguishable after anonymization, which is condition (4). However, no evaluation metrics existed for (3) and (4). We therefore proposed a new metric called ZEBRA, which evaluates anonymized speech based on the worst-case risk of re-identification, and another new metric that considers how similar the anonymized speech of different speakers is to each other.
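As a very rough, simplified illustration of the worst-case viewpoint behind ZEBRA (not the actual metric definition), one can look at the log-likelihood ratios produced by an attacker’s speaker verifier on anonymized trials and take the largest absolute value as the least-protected case:

```python
# Very simplified sketch of the worst-case idea (illustrative only): the trial
# with the largest absolute attacker LLR bounds the worst-case disclosure;
# a value of 0 would mean "zero evidence" for everyone.
import numpy as np

llrs = np.array([-0.4, 0.1, 2.3, -1.2, 0.05])   # hypothetical per-trial attacker LLRs
worst_case = np.abs(llrs).max()                  # disclosure for the least-protected trial
average    = np.abs(llrs).mean()                 # informal population-level tendency
print(f"worst-case disclosure ~ {worst_case:.2f}, average ~ {average:.2f}")
```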
Selected References:
We also ran an international challenge on voice privacy protection, similar to the Voice Conversion Challenge and the ASVspoof challenge, in order to drive the field and accelerate research. The “VoicePrivacy Challenge” was held in 2020. Specifically, we conducted a mutual evaluation of speaker anonymization methods proposed by more than ten universities, companies, and research organizations. The intercomparison of these different anonymization methods confirmed that all of them can significantly reduce the possibility of personal identification, but at the same time the usefulness of the anonymized voice for downstream tasks is also reduced, and no single best method exists yet. Nevertheless, this is a very significant finding in that it clarified the points that need to be improved in the future.
Selected References:
We have applied our research results in the speech field to other modalities and shown that they are also effective in image processing and natural language processing. We first constructed a model that automatically detects fake face images generated by deepfake and Face2Face technologies. Such fake images and videos have become a social problem, mainly in Europe and the United States, and have also caused actual damage in Japan. We demonstrated that our proposed neural network, a capsule network, can identify fake faces with high accuracy on the FaceForensics++ dataset, and presented the results at an international conference; it is one of the first deepfake detection models in the world. This work was published in 2018 and has been cited 147 times. In addition, we proposed a new network that not only performs an authenticity test on videos but also identifies the tampered pixel regions at the same time, using image segmentation techniques. By indicating the tampered pixel regions, we can present evidence that an image is fake, which improves explainability. The experimental results also show that identifying tampered pixel regions is effective against unknown fake video generation methods and offers various other advantages. This work was presented at IEEE BTAS 2019, and the paper has been cited 125 times.
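The combination of authenticity classification and tampered-region localization can be sketched as a simple multi-task network (toy architecture, not the published model): one shared encoder feeds a real/fake head and a per-pixel segmentation head whose mask serves as visual evidence.

```python
# Minimal sketch: one shared encoder feeds a real/fake classification head and a
# segmentation head that marks tampered pixels as evidence for the decision.
import torch
import torch.nn as nn

class DetectAndLocate(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.classify = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 2))
        self.segment = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),   # per-pixel tampering logit
        )

    def forward(self, img):
        feat = self.encoder(img)
        return self.classify(feat), self.segment(feat)

logits, mask = DetectAndLocate()(torch.rand(1, 3, 256, 256))
# logits: real vs. fake; mask: (1, 1, 256, 256) map of suspected tampered pixels.
```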
Selected References:
Deepfake detection so far has assumed that face detection has already been performed, so the input to the discriminative model has always been a single face image. When multiple faces appear in an image, all of them must be detected in advance and judged for authenticity one by one, which is very inefficient. We therefore constructed a new high-quality deepfake detection database called OpenForensics, which contains a large number of images with arbitrary numbers of faces, and proposed various end-to-end deep networks that simultaneously and efficiently judge the authenticity of multiple faces and identify the tampered areas. This research was accepted to ICCV 2021.
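As an illustration of the task formulation only (not the models evaluated in the paper), an off-the-shelf instance-segmentation network from torchvision can be configured with “real face” and “forged face” classes so that a single forward pass handles an arbitrary number of faces:

```python
# Illustrative sketch of the OpenForensics-style task: one model finds every
# face, labels it as genuine or forged, and outputs its manipulated-region mask.
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# classes: 0 = background, 1 = real face, 2 = forged face
model = maskrcnn_resnet50_fpn(weights=None, num_classes=3).eval()
with torch.no_grad():
    preds = model([torch.rand(3, 512, 512)])      # arbitrary number of faces per image
# preds[0] contains 'boxes', 'labels', 'scores' and per-instance 'masks'.
```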
Selected References:
Speaker verification and face recognition systems share many common technologies, so our knowledge from vulnerability research on speaker recognition can be applied to face recognition systems. Specifically, we investigated whether it is possible to automatically generate a master face (analogous to a so-called “master key”), whose features match those of many registered users, by iteratively updating the latent variables of publicly available deep generative models such as StyleGAN with a hill-climbing algorithm. At the well-known biometrics conference IJCB 2020, we reported that such master faces can indeed be generated for some face recognition systems, and we found that the number of false acceptances increased as the generated face images were optimized. To address this vulnerability, it is necessary to introduce a defense model that automatically discriminates GAN-generated images.
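The attack loop can be sketched as follows (toy stand-ins for StyleGAN and the face-recognition encoder; thresholds and dimensions are hypothetical): a latent vector is randomly perturbed, and the perturbation is kept whenever the generated face is falsely accepted by more enrolled templates.

```python
# Minimal sketch of the hill-climbing idea with toy stand-ins: a latent vector
# is perturbed and kept whenever the generated face matches more enrolled users.
import numpy as np

rng = np.random.default_rng(0)
enrolled = rng.standard_normal((1000, 128))                 # stand-in enrolled templates
enrolled /= np.linalg.norm(enrolled, axis=1, keepdims=True)

def face_embedding(latent):
    """Stand-in for generator + face-recognition encoder (StyleGAN -> embedding)."""
    emb = np.tanh(latent[:128])
    return emb / np.linalg.norm(emb)

def match_rate(latent, threshold=0.1):
    sims = enrolled @ face_embedding(latent)
    return (sims > threshold).mean()                        # fraction of falsely accepted users

latent, best = rng.standard_normal(512), 0.0
for _ in range(2000):                                       # hill climbing in latent space
    candidate = latent + 0.05 * rng.standard_normal(512)
    score = match_rate(candidate)
    if score > best:
        latent, best = candidate, score
print(f"master-face candidate matches {best:.1%} of enrolled templates")
```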
Selected References:
In the field of natural language processing, huge neural language models such as OpenAI’s Generative Pre-trained Transformer (GPT) can automatically generate fluent sentences. For example, the Grover model, which automatically generates news articles from the title, date, and publisher name of an article, has been proposed and its accuracy confirmed. We therefore extended the synthetic media detection work of this project to natural language processing. First, we focused on “word-of-mouth” reviews on shopping sites, which are already known to suffer from fake entries; if neural language models are used to generate such fake reviews, the problem will only worsen. We therefore investigated the possibility of discriminating between computer-generated and human-written reviews by combining multiple discriminative models. Although the detection accuracy is lower than for voice or images, the results show that a combination of multiple detection models can detect artificially generated reviews to some extent. This paper was also cited as one of the safety research papers when OpenAI released GPT-2 to the public.
Selected References:
A more essential task is to verify whether the content of a sentence is factual, i.e., fact-checking. Currently, fact-checking is often done manually, but attempts to automate it using machine learning have begun. Automated fact-checking consists of several components: it first searches for scientific papers or other information sources related to the claim to be verified, then selects sentences that can serve as evidence, and finally outputs a judgment about whether the claim is true. This automatic fact-checking is a technology complementary to fake synthetic media detection. We began work on this new topic and confirmed that the performance of automatic fact-checking can be improved by jointly optimizing evidence sentence selection and the true/false verdict, which are currently optimized separately. The results were presented at Findings of ACL 2021.
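A minimal sketch of the joint optimization idea is given below (hypothetical dimensions; claim and sentence encoders are assumed to exist upstream): a shared representation feeds both an evidence-selection head and a verdict head, so a single loss can train them together.

```python
# Minimal sketch (simplified, not the published model): candidate evidence
# sentences are scored, pooled by those scores, and the verdict is predicted
# from the pooled evidence, so both heads are optimised jointly.
import torch
import torch.nn as nn

class JointFactChecker(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.select = nn.Linear(dim * 2, 1)              # evidence-selection head
        self.verdict = nn.Linear(dim * 2, 3)             # SUPPORTS / REFUTES / NOT ENOUGH INFO

    def forward(self, claim_vec, sentence_vecs):
        pairs = torch.cat([claim_vec.expand_as(sentence_vecs), sentence_vecs], dim=-1)
        weights = torch.softmax(self.select(pairs).squeeze(-1), dim=0)   # evidence scores
        pooled = (weights.unsqueeze(-1) * sentence_vecs).sum(dim=0)
        label = self.verdict(torch.cat([claim_vec.squeeze(0), pooled], dim=-1))
        return weights, label

claim = torch.rand(1, 256)                               # encoded claim (e.g. from a BERT-style encoder)
sentences = torch.rand(12, 256)                          # encoded candidate evidence sentences
weights, label_logits = JointFactChecker()(claim, sentences)
# One loss over (weights, label_logits) trains selection and verdict together.
```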
Selected References: