Audiovisual speaker conversion: jointly and simultaneously transforming facial expression and acoustic characteristics

Authors: Fuming Fang, Xin Wang, Junichi Yamagishi, Isao Echizen

Emotion type | Source speaker | Baseline | Proposed method | Target speaker
Neutral
Normal happiness
Strong happiness
Normal sadness
Strong sadness
Normal anger
Strong anger







  • Spectrogram samples
The horizontal axis is time and the vertical axis is frequency, ranging from 0 to 8000 Hz. In most cases, the proposed method appears to predict a better spectrogram than the baseline.
Neutral


Happiness


Strong happiness


Sadness


Strong sadness


Anger


Strong anger
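
The page does not say how the spectrograms were computed. As a rough illustration of the axes described above, the following is a minimal sketch that assumes a 16 kHz sampling rate (which would place the top of the frequency axis at 8000 Hz), a 25 ms Hann window, and a 10 ms hop. All of these values, the function names, and the synthetic test tone are illustrative assumptions, not the authors' settings. A naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math

SAMPLE_RATE = 16000  # Hz; assumed, so that the Nyquist frequency is 8000 Hz
FRAME_LEN = 400      # samples per frame (25 ms at 16 kHz); assumed
HOP_LEN = 160        # samples between frame starts (10 ms at 16 kHz); assumed


def magnitude_spectrogram(signal, frame_len=FRAME_LEN, hop_len=HOP_LEN):
    """Return spec[frame][bin]: log-magnitude in dB of a Hann-windowed DFT.

    Frames run along the time (horizontal) axis and bins along the
    frequency (vertical) axis.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    # Precomputed complex exponentials exp(-2*pi*i*m/N) for the naive DFT.
    twiddle = [cmath.exp(-2j * math.pi * m / frame_len)
               for m in range(frame_len)]
    n_bins = frame_len // 2 + 1  # keep non-negative frequencies only
    spec = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        row = []
        for k in range(n_bins):
            coeff = sum(frame[n] * twiddle[(k * n) % frame_len]
                        for n in range(frame_len))
            # Small offset avoids log10(0) for silent bins.
            row.append(20.0 * math.log10(abs(coeff) + 1e-10))
        spec.append(row)
    return spec


def bin_frequencies(frame_len=FRAME_LEN, sample_rate=SAMPLE_RATE):
    """Centre frequency in Hz of each bin, from 0 up to sample_rate / 2."""
    return [k * sample_rate / frame_len for k in range(frame_len // 2 + 1)]


# Synthetic input: 0.1 s of a 1 kHz sine tone (a stand-in for real speech).
tone = [math.sin(2 * math.pi * 1000.0 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE // 10)]
spec = magnitude_spectrogram(tone)
freqs = bin_frequencies()
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(f"frequency axis: {freqs[0]:.0f} to {freqs[-1]:.0f} Hz")
print(f"strongest bin in first frame: {freqs[peak_bin]:.0f} Hz")
```

Under these assumptions the frequency axis spans exactly the 0 to 8000 Hz range mentioned above. In practice an FFT-based routine would replace the naive DFT for speed.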