JoyVASA

Developed by jdh-algo
JoyVASA is an audio-driven facial animation generation method based on diffusion models, capable of generating facial dynamics and head movements with support for multilingual input.
Downloads: 95
Release Time: 11/13/2024

Model Overview

JoyVASA generates high-quality facial animation from audio through a decoupled facial representation framework and a diffusion transformer, and is applicable to both human portraits and animal faces.
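A minimal, heavily simplified sketch of how such a two-stage pipeline fits together is given below. The function names (encode_audio, diffuse_motion, render_frames) and the toy numerics are illustrative assumptions, not JoyVASA's actual API, model, or training setup; the sketch only mirrors the stated idea of an audio-conditioned, identity-agnostic motion generator whose output is applied to a static facial representation.

```python
# Conceptual sketch (NumPy only) of an audio-to-motion animation pipeline.
# All names and numbers are illustrative assumptions, not the JoyVASA API.
import numpy as np

rng = np.random.default_rng(0)

def encode_audio(waveform: np.ndarray, hop: int = 320) -> np.ndarray:
    """Toy per-frame audio feature (log frame energy), standing in for a
    pretrained speech encoder."""
    usable = len(waveform) - len(waveform) % hop
    frames = waveform[:usable].reshape(-1, hop)
    return np.log1p((frames ** 2).mean(axis=1, keepdims=True))   # shape (T, 1)

def diffuse_motion(audio_feats: np.ndarray, motion_dim: int = 8,
                   steps: int = 20) -> np.ndarray:
    """Toy denoising loop: starts from Gaussian noise and is pulled toward an
    audio-conditioned target, mimicking a diffusion model that generates an
    identity-agnostic motion sequence from audio alone."""
    x = rng.normal(size=(len(audio_feats), motion_dim))           # pure noise
    target = audio_feats @ rng.normal(size=(1, motion_dim))       # stand-in conditioning
    for _ in range(steps):
        x = x + 0.1 * (target - x)                                # crude denoising step
    return x                                                      # shape (T, motion_dim)

def render_frames(static_identity: np.ndarray, motion_seq: np.ndarray) -> np.ndarray:
    """Apply the generated motion to a static facial representation. Because
    motion is decoupled from identity, the same sequence could drive any
    portrait, human or animal."""
    mix = rng.normal(size=(motion_seq.shape[1], static_identity.shape[0]))
    return static_identity[None, :] + motion_seq @ mix            # (T, identity_dim)

if __name__ == "__main__":
    waveform = rng.normal(size=16000 * 3)     # 3 s of placeholder audio at 16 kHz
    identity = rng.normal(size=32)            # static appearance code from a reference image
    motion = diffuse_motion(encode_audio(waveform))
    frames = render_frames(identity, motion)
    print(frames.shape)                       # (150, 32): one latent frame per audio frame
```

In this toy version the motion generator never sees the identity code, which is the point of the decoupling: the audio drives the motion, and the same motion can be rendered onto any static facial representation.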

Model Features

Decoupled Facial Representation
Separates dynamic facial expressions from static 3D facial representations, supporting longer video generation
Identity-agnostic Motion Generation
The diffusion transformer directly generates motion sequences from audio, unaffected by character identity
Cross-species Support
Generates animations not only for human portraits but also for animal faces
Multilingual Support
Trained on a mixed dataset of private Chinese data and public English datasets

Model Capabilities

Audio-driven facial animation generation
3D facial representation rendering
Cross-species facial animation
Long video sequence generation

Use Cases

Digital Entertainment
Virtual Host Animation
Generates facial expressions and head movements synchronized with speech for virtual hosts
Natural and smooth facial animation effects
Education
Animal Character Teaching
Generates vivid facial animations for animal characters in educational content
Enhances the fun and interactivity of educational materials