Datasets
| Dataset | Avg. Duration | Actors | Emotions | Views | Intensities | Resolution | Clips |
|---|---|---|---|---|---|---|---|
| SAVEE | 7 min 21 s | 4 | 7 | 1 | 1 | 1280×1024 | 480 |
| RAVDESS | 3 min 42 s | 24 | 8 | 1 | 2 | 1920×1080 | 7356 |
| GRID | 18 min 54 s | 34 | – | 1 | – | 720×576 | 34000 |
| Lombard | 4 min 1 s | 54 | – | 2 | – | 854×480 | 10800 |
| CREMA-D | – | 91 | 6 | 1 | 3 (one of the 12 sentences only) | 1280×720 | 7442 |
| MEAD | 38 min 57 s | 60 | 8 | 7 | 3 | 1920×1080 | 281400 |
| Mocap | – | 1 | 4 | 1 | 1 | 3D blendshapes | 865 (English), 925 (Chinese) |
Papers
| Name | Affiliation | Year | Summary | Link |
|---|---|---|---|---|
| Expressive Speech-driven Facial Animation with Controllable Emotions (3D) | Tsinghua University, University of Wellington | 2023 | – | link |
| SPACEx: Speech-driven Portrait Animation with Controllable Expression (emotion-feature control, adjustable intensity; worth referencing) | NVIDIA | 2022 | Datasets: VoxCeleb2, RAVDESS, MEAD<br>Model: Speech2Landmarks + pose generation + Landmarks2Latents + FiLM (emotion control; see the FiLM sketch after this table) + face-vid2vid generator (worth borrowing)<br>Input: image + speech + emotion features | link, Demo1, Demo2, Demo3 |
| Expressive Talking Head Generation with Granular Audio Visual Control (emotion-video control) | Baidu, CUHK | 2022 | Datasets: VoxCeleb2, MEAD<br>Model: ID encoder + pose encoder + emotion encoder + content encoder + audio encoder + generator G; image segmentation + landmark detection + masking<br>Loss: contrastive loss Lc enforcing consistency between video- and audio-feature distributions (see the contrastive sketch after this table) + L1 + Lvgg + LGAN<br>Input: image + pose video + emotion video + content video/audio | link |
| EAMM: One-Shot Emotional Talking Face via Audio-Based Emotion-Aware Motion Model (emotion-video control; worth referencing) | Nanjing University, CUHK, University of Sydney, Monash University, SenseTime, Tsinghua University | 2022 | Datasets: LRW, MEAD (2020), RAVDESS, CFD, CREMA-D<br>Model: keypoint detector Ek; Audio2Facial-Dynamics (A2FD) module: image encoder EI, audio encoder Ea, pose-sequence encoder Ep, LSTM decoder, flow estimator F, image generator G (First Order Motion Model); Implicit Emotion Displacement Learner: emotion extractor Ee, displacement predictor Pd<br>Loss: A2FD module: keypoint loss Lkp + perceptual loss Lper; Implicit Emotion Displacement Learner: Lkp<br>Input: image + speech + preset head-pose sequence + emotion video | link, code |
| Audio-Driven Emotional Video Portraits (emotional-speech control; worth referencing) | Nanjing University, SenseTime, CUHK, Nanyang Technological University, Tsinghua University | 2021 | Datasets: MEAD (emotion), LRW (pretraining the content module)<br>Model: Cross-Reconstructed Emotion Disentanglement + Audio-to-Landmark + 3D-Aware Keypoint Alignment + Edge-to-Video<br>Loss: (1) cross-reconstructed emotion disentanglement: cross-reconstruction + self-reconstruction + classification + content losses (see the disentanglement sketch after this table); (2) Edge-to-Video translation network<br>Input: emotional speech + driving video | link, code |
| Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation (3D, text-driven blendshapes, emotion labels) | NetEase Fuxi AI Lab, University of Sydney | 2021 | Dataset: Mocap<br>Model: speaker-independent: Ghed, Gupp, Gmou; speaker-specific: Gldmk, Gvid; face and background mask generation<br>Loss: speaker-independent: L1 + Ladv + Lssim; speaker-specific: Ladv + Lperc + Limg + Lface<br>Input: text + emotion label + driving video | link, demo |
| 3D-TalkEmo: Learning to Synthesize 3D Emotional Talking Head (3D) | Prof. Xia, Institute of Computing Technology (CAS) | 2021 | Datasets: RAVDESS, LBG<br>Model: DeepSpeech + VOCA, 3DMM + FaceScape model<br>Input: speech + 3D mesh | link |
| Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis (3D, emotion-video control) | Tsinghua University, HiSilicon | 2021 | Dataset: Ted-HD (834 clips)<br>Model: deep 3D reconstruction + DeepSpeech + ResNet encoder-decoder + U-Net + pix2pixHD<br>Input: image + style video + speech | link |
| Speech Driven Talking Face Generation from a Single Image and an Emotion Condition (emotion-label control) | University of Rochester | 2021 | Dataset: CREMA-D<br>Model: image encoder + speech encoder + noise encoder + emotion encoder<br>Loss: mouth-region mask loss + perceptual loss + frame-discriminator loss + emotion-discriminator loss<br>Input: image + speech + one-hot emotion label (see the conditioning sketch after this table) | link, code |
| MEAD: A Large-scale Audio-visual Dataset for Emotional Talking-face Generation (dataset) | SenseTime, Carnegie Mellon University, Chinese Academy of Sciences, Nanyang Technological University | 2020 | Publicly released dataset | link, code |
| Realistic Speech-Driven Facial Animation with GANs (speech-driven; datasets include emotion) | Samsung AI Centre, Cambridge, UK | 2019 | Datasets: GRID, TCD-TIMIT, CREMA-D<br>Model: identity encoder + content encoder + noise generator + frame decoder<br>Loss: frame-discriminator + sequence-discriminator + synchronization-discriminator losses (see the multi-discriminator sketch after this table)<br>Input: speech + image | link |
| Speech-Driven Expressive Talking Lips with Conditional Sequential Generative Adversarial Networks | University of Texas | 2018 | Learns the relationship between emotion and lip movements with a conditional sequential GAN | link |
| ExprGAN: Facial Expression Editing with Controllable Expression Intensity | University of Maryland | 2017 | – | link |
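
A few hedged PyTorch sketches of mechanisms named in the table follow; all class names, dimensions, and weights are illustrative assumptions, not the papers' implementations. First, the FiLM emotion control from the SPACEx row: a minimal sketch assuming the emotion signal is a fixed-size embedding that modulates per-frame features.

```python
import torch
import torch.nn as nn

class FiLMEmotionControl(nn.Module):
    """Feature-wise Linear Modulation (FiLM): an emotion embedding is mapped
    to a per-channel scale (gamma) and shift (beta) applied to the features.
    Sketch of SPACEx-style conditioning; dims are illustrative assumptions."""

    def __init__(self, emotion_dim: int = 128, feature_dim: int = 256):
        super().__init__()
        self.to_gamma_beta = nn.Linear(emotion_dim, 2 * feature_dim)

    def forward(self, features: torch.Tensor, emotion: torch.Tensor) -> torch.Tensor:
        # features: (batch, frames, feature_dim); emotion: (batch, emotion_dim)
        gamma, beta = self.to_gamma_beta(emotion).chunk(2, dim=-1)
        # Broadcast the per-channel modulation across the time axis.
        return gamma.unsqueeze(1) * features + beta.unsqueeze(1)

# Example: 2 clips, 50 frames each. Scaling the emotion embedding toward zero
# weakens the modulation, one plausible route to the advertised intensity control.
out = FiLMEmotionControl()(torch.randn(2, 50, 256), torch.randn(2, 128))
```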
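
Next, the contrastive loss Lc from the Granular Audio Visual Control row, which pushes video and audio features of the same clip toward a shared distribution. This is an InfoNCE-style stand-in; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def contrastive_sync_loss(video_feat: torch.Tensor,
                          audio_feat: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Pulls paired video/audio features (same clip, same batch index)
    together and pushes mismatched in-batch pairs apart."""
    v = F.normalize(video_feat, dim=-1)   # (batch, dim)
    a = F.normalize(audio_feat, dim=-1)   # (batch, dim)
    logits = v @ a.t() / temperature      # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)  # diagonal = positives
    # Symmetric cross-entropy: video->audio and audio->video retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```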
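
The EVP row's cross-reconstructed emotion disentanglement can be pictured with the toy module below: for clips x_ij carrying content i spoken with emotion j, decoding the content code of x12 together with the emotion code of x21 should rebuild x11. Only the reconstruction terms are sketched (EVP adds classification and content losses); the encoders and the MSE objective are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionDisentangler(nn.Module):
    """Toy content/emotion encoders with a shared decoder."""
    def __init__(self, in_dim: int = 80, code_dim: int = 64):
        super().__init__()
        self.content_enc = nn.Sequential(nn.Linear(in_dim, code_dim), nn.ReLU())
        self.emotion_enc = nn.Sequential(nn.Linear(in_dim, code_dim), nn.ReLU())
        self.decoder = nn.Linear(2 * code_dim, in_dim)

    def recon(self, content_src: torch.Tensor, emotion_src: torch.Tensor) -> torch.Tensor:
        # Decode content taken from one clip with emotion taken from another.
        code = torch.cat([self.content_enc(content_src),
                          self.emotion_enc(emotion_src)], dim=-1)
        return self.decoder(code)

def cross_recon_loss(m: EmotionDisentangler, x11, x12, x21, x22):
    # x_ij: features of a clip with content i and emotion j.
    cross = (F.mse_loss(m.recon(x12, x21), x11) +  # content 1 + emotion 1
             F.mse_loss(m.recon(x21, x12), x22))   # content 2 + emotion 2
    self_ = (F.mse_loss(m.recon(x11, x11), x11) +
             F.mse_loss(m.recon(x22, x22), x22))
    return cross + self_
```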
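
For the Rochester row's one-hot emotion label, the simplest conditioning is to embed the label and concatenate it with the other encoder outputs before decoding; a sketch under that assumption (the paper's actual fusion may differ).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_EMOTIONS = 6  # CREMA-D's six categorical emotions

class EmotionConditioner(nn.Module):
    """Embeds a one-hot emotion label and fuses it with content features by
    concatenation; a generic sketch, not the paper's exact encoder."""
    def __init__(self, emb_dim: int = 32, content_dim: int = 256):
        super().__init__()
        self.embed = nn.Linear(NUM_EMOTIONS, emb_dim)
        self.fuse = nn.Linear(content_dim + emb_dim, content_dim)

    def forward(self, content: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
        # content: (batch, content_dim); label: (batch,) integer emotion ids
        one_hot = F.one_hot(label, NUM_EMOTIONS).float()
        return self.fuse(torch.cat([content, self.embed(one_hot)], dim=-1))
```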
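
Finally, the Samsung GAN row trains against three discriminators (frame, sequence, synchronization). A hedged sketch of how their adversarial terms could be combined on the generator side, assuming non-saturating BCE losses; the weights are placeholders, not the paper's values.

```python
import torch
import torch.nn.functional as F

def generator_adv_loss(d_frame_logits: torch.Tensor,
                       d_seq_logits: torch.Tensor,
                       d_sync_logits: torch.Tensor,
                       w_frame: float = 1.0,
                       w_seq: float = 0.2,
                       w_sync: float = 0.8) -> torch.Tensor:
    """Sums non-saturating adversarial terms from the frame (per-image
    realism), sequence (temporal realism), and sync (audio-visual alignment)
    discriminators. Weights are illustrative placeholders."""
    def ns(logits: torch.Tensor) -> torch.Tensor:
        # The generator wants each discriminator to label its output "real".
        return F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    return (w_frame * ns(d_frame_logits) +
            w_seq * ns(d_seq_logits) +
            w_sync * ns(d_sync_logits))
```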