🚀 EMOVA-Qwen-2.5-7B-HF
EMOVA-Qwen-2.5-7B-HF is an end-to-end omni-modal LLM that handles multiple modalities, including text, vision, and speech, and generates corresponding responses with emotional control. It offers state-of-the-art omni-modality performance, supports emotional spoken dialogue, and comes in diverse configurations to meet different computational needs.
✨ Features
- State-of-the-art omni-modality performance: EMOVA achieves results comparable to the state of the art on both vision-language and speech benchmarks simultaneously. Our best-performing model, EMOVA-72B, even surpasses commercial models including GPT-4o and Gemini Pro 1.5.
- Emotional spoken dialogue: A semantic-acoustic disentangled speech tokenizer and a lightweight style control module are adopted for seamless omni-modal alignment and diverse speech style controllability. EMOVA supports bilingual (Chinese and English) spoken dialogue with 24 speech style controls (i.e., 2 speakers, 3 pitches, and 4 emotions); see the sketch after this list.
- Diverse configurations: We open-source 3 configurations, EMOVA-3B/7B/72B, to support omni-modal usage under different computational budgets. Check our Model Zoo and find the model that best fits your computational budget!
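For reference, the 24 speech styles are simply the Cartesian product of the available controls. Below is a minimal sketch of that grid; the label names are illustrative assumptions (only "female" appears in the usage example further down), not the exact identifiers used by the speech tokenizer.

```python
from itertools import product

# Assumed label names for illustration of the documented 2 x 3 x 4 style grid.
speakers = ["female", "male"]
pitches = ["normal", "low", "high"]
emotions = ["neutral", "happy", "sad", "angry"]

styles = list(product(speakers, pitches, emotions))
print(len(styles))  # 24 speech style combinations (2 speakers x 3 pitches x 4 emotions)
```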
📚 Documentation
Model Information
| Property | Details |
| --- | --- |
| Library Name | transformers |
| Tags | Omni-modal-LLM, Multi-modal-LLM, Emotional-spoken-dialogue |
| License | apache-2.0 |
| Datasets | Emova-ollm/emova-alignment-7m, Emova-ollm/emova-sft-4m, Emova-ollm/emova-sft-speech-231k |
| Language | en, zh |
| Base Model | Emova-ollm/qwen2vit600m, Emova-ollm/Qwen2.5-7B-Instruct_add_speech_token_4096_nostrip |
| Model Name | emova-qwen-2-5-7b-hf |
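To confirm the composition above before downloading the full weights, the checkpoint's remote configuration can be inspected. This is a minimal sketch, assuming Hub access and that the repository exposes its configuration through `AutoConfig` (this step is not part of the original card):

```python
from transformers import AutoConfig

# trust_remote_code is needed because EMOVA ships custom modeling code with the checkpoint.
config = AutoConfig.from_pretrained("Emova-ollm/emova-qwen-2-5-7b-hf", trust_remote_code=True)
print(config)  # prints the composed configuration fetched from the Hub
```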
Model Performance on Datasets
- AI2D: Accuracy of 81.7
- ChartQA: Accuracy of 84.9
- DocVQA: Accuracy of 94.2
- InfoVQA: Accuracy of 75.1
- MathVerse: Accuracy of 40.9
- MathVista: Accuracy of 65.5
- MMBench: Accuracy of 83.0
- MME: Score of 2317
- MMVet: Accuracy of 59.4
- OCRBench: Score of 814
- RealWorldQA: Accuracy of 67.5
- Seed-Bench-Image: Accuracy of 75.5
- Science-QA: Accuracy of 96.4
- TextVQA: Accuracy of 78.0
- Automatic Speech Recognition (LibriSpeech clean): Test WER of 4.1
Comparative Performance
| Benchmarks | EMOVA-3B | EMOVA-7B | EMOVA-72B | GPT-4o | VITA 8x7B | VITA 1.5 | Baichuan-Omni |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MME | 2175 | 2317 | 2402 | 2310 | 2097 | 2311 | 2187 |
| MMBench | 79.2 | 83.0 | 86.4 | 83.4 | 71.8 | 76.6 | 76.2 |
| SEED-Image | 74.9 | 75.5 | 76.6 | 77.1 | 72.6 | 74.2 | 74.1 |
| MM-Vet | 57.3 | 59.4 | 64.8 | - | 41.6 | 51.1 | 65.4 |
| RealWorldQA | 62.6 | 67.5 | 71.0 | 75.4 | 59.0 | 66.8 | 62.6 |
| TextVQA | 77.2 | 78.0 | 81.4 | - | 71.8 | 74.9 | 74.3 |
| ChartQA | 81.5 | 84.9 | 88.7 | 85.7 | 76.6 | 79.6 | 79.6 |
| DocVQA | 93.5 | 94.2 | 95.9 | 92.8 | - | - | - |
| InfoVQA | 71.2 | 75.1 | 83.2 | - | - | - | - |
| OCRBench | 803 | 814 | 843 | 736 | 678 | 752 | 700 |
| ScienceQA-Img | 92.7 | 96.4 | 98.2 | - | - | - | - |
| AI2D | 78.6 | 81.7 | 85.8 | 84.6 | 73.1 | 79.3 | - |
| MathVista | 62.6 | 65.5 | 69.9 | 63.8 | 44.9 | 66.2 | 51.9 |
| MathVerse | 31.4 | 40.9 | 50.0 | - | - | - | - |
| LibriSpeech (WER↓) | 5.4 | 4.1 | 2.9 | - | 3.4 | 8.1 | - |
💻 Usage Examples
Basic Usage
```python
from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch

# Load the EMOVA model and processor (custom code shipped with the checkpoint).
model = AutoModel.from_pretrained(
    "Emova-ollm/emova-qwen-2-5-7b-hf",
    torch_dtype=torch.bfloat16,
    attn_implementation='flash_attention_2',  # requires the flash-attn package
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()
processor = AutoProcessor.from_pretrained("Emova-ollm/emova-qwen-2-5-7b-hf", trust_remote_code=True)

# Load the speech tokenizer and attach it to the processor for speech input/output.
speech_tokenizer = AutoModel.from_pretrained("Emova-ollm/emova_speech_tokenizer_hf", torch_dtype=torch.float32, trust_remote_code=True).eval().cuda()
processor.set_speech_tokenizer(speech_tokenizer)

# Example 1: image-text conversation.
inputs = dict(
    text=[
        {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
        {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What's shown in this image?"}]},
        {"role": "assistant", "content": [{"type": "text", "text": "This image shows a red stop sign."}]},
        {"role": "user", "content": [{"type": "text", "text": "Describe the image in more details."}]},
    ],
    images=Image.open('path/to/image')
)

# Example 2: speech-only conversation (the user turn is provided as audio).
inputs = dict(
    text=[{"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]}],
    audios='path/to/audio'
)

# Example 3: omni-modal conversation with both image and speech inputs.
inputs = dict(
    text=[{"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]}],
    images=Image.open('path/to/image'),
    audios='path/to/audio'
)
# Note: only the last `inputs` assignment takes effect; keep exactly one of the three examples.

# Preprocess and generate.
has_speech = 'audios' in inputs.keys()
inputs = processor(**inputs, return_tensors="pt")
inputs = inputs.to(model.device)

gen_kwargs = {"max_new_tokens": 4096, "do_sample": False}
speech_kwargs = {"speaker": "female", "output_wav_prefix": "output"} if has_speech else {}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
outputs = outputs[:, inputs['input_ids'].shape[1]:]  # keep only the newly generated tokens
print(processor.batch_decode(outputs, skip_special_tokens=True, **speech_kwargs))
```
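When `audios` is present, `speech_kwargs` is forwarded to `processor.batch_decode`, which presumably synthesizes the spoken reply and saves wav files named with `output_wav_prefix`. The same model and processor should also handle plain text-only chat; the following is a minimal sketch under the assumption that the processor accepts a conversation with neither `images` nor `audios` (this variant is not shown in the example above):

```python
# Hypothetical text-only turn; reuses the model and processor loaded above.
inputs = processor(
    text=[
        {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
        {"role": "user", "content": [{"type": "text", "text": "Briefly introduce yourself."}]},
    ],
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
outputs = outputs[:, inputs["input_ids"].shape[1]:]  # strip the prompt tokens
print(processor.batch_decode(outputs, skip_special_tokens=True))
```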
📄 License
This project is licensed under the apache-2.0 license.
📖 Citation
@article{chen2024emova,
title={Emova: Empowering language models to see, hear and speak with vivid emotions},
author={Chen, Kai and Gou, Yunhao and Huang, Runhui and Liu, Zhili and Tan, Daxin and Xu, Jing and Wang, Chunwei and Zhu, Yi and Zeng, Yihan and Yang, Kuo and others},
journal={arXiv preprint arXiv:2409.18042},
year={2024}
}