
EMOVA-Qwen-2.5-7B-HF

Developed by Emova-ollm
EMOVA is an end-to-end omni-modal large language model that can see, hear, and speak, enabling multimodal understanding and generation without relying on external models.
Release Time: 3/11/2025

Model Overview

EMOVA is an omni-modal large language model that accepts text, visual, and speech inputs and generates text and speech responses with emotional control. It features advanced vision-language understanding, emotional spoken dialogue, and structured data understanding in voice conversations.
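
As an illustration only, the sketch below shows how such a checkpoint might be loaded with the standard Hugging Face transformers pattern; the repo id, processor class, precision, and device placement are assumptions for this example, not official usage instructions.

```python
# Minimal loading sketch, assuming the Hugging Face repo id below and that the
# checkpoint ships custom modeling code (hence trust_remote_code=True).
# The exact processor/generation interface may differ; consult the model card.
import torch
from transformers import AutoModel, AutoProcessor

model_id = "Emova-ollm/emova-qwen-2-5-7b-hf"  # assumed repo id

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = (
    AutoModel.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # reduced precision so the 7B model fits on a single GPU
        trust_remote_code=True,
    )
    .eval()
    .to("cuda")
)
```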

Model Features

Omni-modal Performance
Achieves results comparable to leading models on both vision-language and speech benchmarks, supporting text, visual, and speech inputs and outputs.
Emotional Voice Dialogue
Utilizes a semantic-acoustic decoupled speech tokenizer and a lightweight style control module, supporting 24 voice style controls (2 speakers, 3 pitches, and 4 emotions; see the sketch after this list).
Diverse Configurations
Offers three model configurations (3B/7B/72B) to accommodate different computational budgets.
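
The 24 voice styles above come from combining 2 speakers, 3 pitches, and 4 emotions. The sketch below simply enumerates those combinations; the label strings are illustrative placeholders, not the model's official style tags.

```python
# Minimal sketch: enumerate the 24 voice style combinations
# (2 speakers x 3 pitches x 4 emotions = 24).
# Label strings are placeholders, not official EMOVA style tags.
from itertools import product

speakers = ["female", "male"]                    # 2 speakers (placeholder labels)
pitches = ["low", "normal", "high"]              # 3 pitches (placeholder labels)
emotions = ["neutral", "happy", "sad", "angry"]  # 4 emotions (placeholder labels)

styles = [
    {"speaker": s, "pitch": p, "emotion": e}
    for s, p, e in product(speakers, pitches, emotions)
]
assert len(styles) == 24  # matches the 24 voice style controls listed above
```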

Model Capabilities

Text Generation
Image Analysis
Speech Recognition
Speech Synthesis
Emotion Control
Multimodal Dialogue

Use Cases

Intelligent Assistant
Emotional Voice Assistant
As an intelligent assistant, it can understand and generate emotionally rich voice responses, enhancing user experience.
Supports 24 voice style controls for vivid voice interactions.
Visual-Language Understanding
Image Caption Generation
Analyzes image content and generates detailed textual descriptions.
Scores 94.2 on the DocVQA benchmark.
Speech Recognition and Synthesis
Speech-to-Text
Converts speech input into text output.
Achieves a WER of 4.1 on the LibriSpeech (clean) test set.
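
For context on the figure above, WER (word error rate) is the word-level edit distance between the recognized text and the reference transcript divided by the number of reference words, conventionally reported as a percentage. A minimal sketch of the computation:

```python
# Minimal WER sketch: word-level Levenshtein distance (substitutions, deletions,
# insertions) divided by the number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # ≈ 0.167 (16.7%)
```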