Ming-Lite-Omni Open-Source Multi-Modal Model - Efficiently Process Text, Images, Audio, and Videos, Excellent at Speech and Image Generation

Ming Lite Omni

Developed by inclusionAI

A lightweight unified multi-modal model that efficiently processes various modal data such as images, texts, audios, and videos, and performs excellently in speech and image generation.

Multimodal Fusion

Transformers

Open Source License:MIT #Unified processing of all modalities #Lightweight MoE architecture #Cross-modal generation

Downloads 4,215

Release Time : 5/2/2025

Model Overview

The Ming-Lite-Omni All-modal Model is a lightweight unified multi-modal model that can efficiently process various modal data such as images, texts, audios, and videos. It performs excellently in speech and image generation, providing a powerful solution for multi-modal perception and generation tasks.

Model Features

Unified all-modal perception

Based on the Ling MoE architecture large language model, it solves task conflicts through a specific modal routing mechanism, ensuring that tokens of different modalities can be efficiently integrated in a unified framework.

Unified perception and generation

It realizes the unified understanding and generation of multi-modal data, can accurately interpret multi-modal instructions and user intentions during the generation process, and improves the generation quality and the usability of multi-tasks.

Innovative generation ability

It has the ability to perceive all modal data and can simultaneously generate high-quality texts, natural and fluent speeches, and vivid and realistic images. It performs excellently in cross-modal tasks such as image perception, audio-visual interaction, and image generation.

Model Capabilities

Text generation

Image analysis

Video analysis

Speech recognition

Speech generation

Image generation

Multi-modal Q&A

Multi-round dialogue

Use Cases

Q&A tasks

Encyclopedic knowledge Q&A

Answer detailed questions about the living habits of parrots

Provide detailed introductions about habitats, diets, etc.

Visual Q&A

Image recognition Q&A

Identify the flower species in the image

Accurately identify forget-me-nots

Video content understanding

Understand the actions of the characters in the video

Identify that a woman is doing yoga on the roof

Speech processing

Automatic speech recognition

Convert speech to text

Perform excellently on multiple test sets

Speech-to-speech conversion

Process speech input and generate speech output

🚀 Ming-Lite-Omni

Ming-Lite-Omni is a unified multimodal model that can process images, text, audio, and video. It is derived from Ling-lite and features 2.8 billion activated parameters. This model can perform speech and image generation, and support context-aware chatting, text-to-speech conversion, and versatile image editing.

Property	Details
Base Model	inclusionAI/Ling-lite
License	MIT
Pipeline Tag	any-to-any
Library Name	transformers

📑 Technical Report ｜ 📖 Project Page ｜ 🤗 Hugging Face ｜ 🤖 ModelScope

🚀 Quick Start

Prerequisites

Download the model following Model Downloads.
Install the Python environment dependencies:

pip install -r requirements.txt
pip install data/matcha_tts-0.0.5.1-cp38-cp38-linux_x86_64.whl
pip install diffusers==0.33.0
pip install nvidia-cublas-cu12==12.4.5.8  # for H20

Note: The examples are tested on hardware of NVIDIA H800-80GB with CUDA 12.2. Loading inclusionAI/Ming-Lite-Omni in bfloat16 takes about 40890MB memory.

Example Code

import os
import torch
from transformers import AutoProcessor, GenerationConfig
from modeling_bailingmm import BailingMMNativeForConditionalGeneration

# build model
model = BailingMMNativeForConditionalGeneration.from_pretrained(
    "inclusionAI/Ming-Lite-Omni",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True
).to("cuda")

assets_path = YOUR_ASSETS_PATH

# build processor
processor = AutoProcessor.from_pretrained("inclusionAI/Ming-Lite-Omni", trust_remote_code=True)

✨ Features

Unified Omni-Modality Perception: Ming-lite-omni, built on Ling, an MoE architecture LLM, resolves task conflicts and ensures coherent integration of tokens from different modalities through modality-specific routers.
Unified Perception and Generation: Ming-lite-omni achieves unified understanding and generation, enabling the model to interpret multimodal instructions and user intent during generation, which helps enhance generation quality and improves usability across multiple tasks.
Innovative Generation Capabilities: Ming-lite-omni can perceive all modalities and generate high-quality text, real-time speech, and vivid images simultaneously, delivering exceptional cross-modal performance across diverse tasks including image perception, audio-visual interaction, and image generation.

📦 Installation

Please download the model from the following sources:

| **Model** | **Input modality** | **Output modality** | **Download** | | --- | :---: | :---: | --- | | Ming-Lite-Omni | Image, text, video, audio | Image, text, audio | [🤗 HuggingFace](https://huggingface.co/inclusionAI/Ming-Lite-Omni)
[🤖 ModelScope](https://www.modelscope.cn/models/inclusionAI/Ming-Lite-Omni) |

If you're in mainland China, we strongly recommend you to download our model from 🤖 ModelScope.

💻 Usage Examples

Basic Usage

# qa
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "请详细介绍鹦鹉的生活习性。"}
        ],
    },
]
# Output:

# 鹦鹉是一种非常聪明和社交性强的鸟类，它们的生活习性非常丰富和有趣。以下是一些关于鹦鹉生活习性的详细介绍：
# ### 1. **栖息地**
# 鹦鹉主要分布在热带和亚热带地区，包括非洲、亚洲、澳大利亚和南美洲。它们通常生活在森林、草原、沙漠和城市环境中。不同种类的鹦鹉对栖息地的要求有所不同，但大多数鹦鹉喜欢有丰富植被和水源的地方。
# ### 2. **饮食**
# 鹦鹉是杂食性动物，它们的饮食非常多样化。它们的食物包括种子、坚果、水果、蔬菜、花蜜和昆虫。鹦鹉的喙非常强壮，能够轻松地打开坚硬的果壳和坚果。一些鹦鹉还会吃泥土或沙子，以帮助消化和补充矿物质。
# ......

# image qa
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "image", "image": os.path.join(assets_path, "flowers.jpg")},
            {"type": "text", "text": "What kind of flower is this?"},
        ],
    },
]
# Output:

# The flowers in this image are forget-me-nots. These delicate blooms are known for their small, five-petaled flowers that come in various shades of blue, pink, and white.

Advanced Usage

Enable Thinking Before Response

cot_prompt = "SYSTEM: You are a helpful assistant. When the user asks a question, your response must include two parts: first, the reasoning process enclosed in <thinking>...</thinking> tags, then the final answer enclosed in <answer>...</answer> tags. The critical answer or key result should be placed within \\boxed{}.
"
# And your input message should be like this:
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "image", "image": os.path.join(assets_path, "reasoning.png")},
            {"type": "text", "text": cot_prompt + "In the rectangle $A B C D$ pictured, $M_{1}$ is the midpoint of $D C, M_{2}$ the midpoint of $A M_{1}, M_{3}$ the midpoint of $B M_{2}$ and $M_{4}$ the midpoint of $C M_{3}$. Determine the ratio of the area of the quadrilateral $M_{1} M_{2} M_{3} M_{4}$ to the area of the rectangle $A B C D$.
Choices:
(A) $\\frac{7}{16}$
(B) $\\frac{3}{16}$
(C) $\\frac{7}{32}$
(D) $\\frac{9}{32}$
(E) $\\frac{1}{5}$"},
        ],
    },
]
# Output:
# \<think\>
Okay, so I have this problem about a rectangle ABCD ... (thinking process omitted) ... So, the correct answer is C.
\</think\>
\<answer\>\\boxed{C}\</answer\>

Video QA

# video qa
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "video", "video": os.path.join(assets_path, "yoga.mp4")},
            {"type": "text", "text": "What is the woman doing?"},
        ],
    },
]
# Output:

# The image shows a woman performing a yoga pose on a rooftop. She's in a dynamic yoga pose, with her arms and legs extended in various positions.

Multi-Turn Chat

# multi-turn chat
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "中国的首都是哪里？"},
        ],
    },
    {
        "role": "ASSISTANT",
        "content": [
            {"type": "text", "text": "北京"},
        ],
    },
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "它的占地面积是多少？有多少常住人口？"},
        ],
    },
]
# Output:

# 北京市的总面积约为16,410.54平方公里，常住人口约为21,542,000人。

Preparation for Inference

text = processor.apply_chat_template(messages, add_generation_prompt=True)
image_inputs, video_inputs, audio_inputs = processor.process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    audios=audio_inputs,
    return_tensors="pt",
)
inputs = inputs.to(model.device)
for k in inputs.keys():
    if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats":
        inputs[k] = inputs[k].to(dtype=torch.bfloat16)

# call generate
generation_config = GenerationConfig.from_dict({'no_repeat_ngram_size': 10})
generated_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    use_cache=True,
    eos_token_id=processor.gen_terminator,
    generation_config=generation_config,
)
generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(output_text)

Audio Tasks

# ASR
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "Please recognize the language of this speech and transcribe it. Format: oral."},
            {"type": "audio", "audio": 'data/wavs/BAC009S0915W0292.wav'},
        ],
    },
]
# we use whisper encoder for ASR task, so need modify code above
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    audios=audio_inputs,
    return_tensors="pt",
    audio_kwargs={'use_whisper_encoder': True}
)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    use_cache=True,
    eos_token_id=processor.gen_terminator,
    generation_config=generation_config,
    use_whisper_encoder=True
)

# speech2speech
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "audio", "audio": 'data/wavs/speechQA_sample.wav'},
        ],
    },
]
generation_config = GenerationConfig.from_dict({
    'output_hidden_states': True,
    'return_dict_in_generate': True,
    'no_repeat_ngram_size': 10}
)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    use_cache=True,
    eos_token_id=processor.gen_terminator,
    generation_config=generation_config,
    us

📚 Documentation

Evaluation

Ming-lite-omni delivers exceptional cross-modal performance, as validated across image perception, audio-visual interaction, and image generation tasks.

Image benchmark

| Benchmarks | Ming-lite-omni | Qwen2.5-VL-7B-Instruct | InternVL2.5-8B-MPO | | --- | :---: | :---: | :---: | | AI2D | 83.1 | 84.4 | **84.5** | | HallusionBench | **55.0** | 55.8 | 51.7 | | MMBench_TEST_V11 | 80.8 | **82.8** | 82.0 | | MMMU | 56.3 | **56.6** | 54.8 | | MMStar | 64.7 | 65.3 | **65.2** | | MMVet | 71.3 | 71.6 | 68.1 | | MathVista | **71.6** | 68.1 | 67.9 | | OCRBench | **88.4** | 87.8 | 88.2 | | Average | 71.4 | **71.5** | 70.3 |

Encyclopedia Benchmarks

| Object Recognition | Ming-lite-omni | Qwen2.5-VL-7B-Instruct | | --- | :---: | :---: | | Plants | **54.96** | 47.8 | | Animals | **56.7** | 50.85 | | Vehicles | 41.91 | **42.29** | | Food & Ingredients | **62.28** | 54.09 | | Dishes | **44.3** | 39.07 | | General | 91.08 | **92.42** | | Average | **58.54** | 54.43 |

Video benchmark

| Benchmarks | Ming-lite-omni | Qwen2.5VL-7B-Instruct | | --- | :---: | :---: | | VideoMME | 67.0 | **67.3** | | MVBench | 67.7 | **67.4** | | Video-MMMU | 46.3 | **47.4** | | LongVideoBench | 56.6 | 54.7 | | Average | **59.4** | 59.2 |

**Note**: All models are evaluated based on 128 uniformly sampled frames.

Audio benchmark

SpeechQA

| Model | Average | AlpacaEval | CommonEval | SD-QA | MMSU | OpenBookQA | IFEval | AdvBench | | --- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | | Qwen2-Audio-chat | 3.545 | 3.69 | 3.40 | 35.35 | 35.43 | 49.01 | 22.57 | 98.85 | | Baichuan-Audio | 3.695 | 4.00 | 3.39 | 49.64 | 48.80 | 63.30 | 41.32 | 86.73 | | GLM-4-Voice | 3.77 | 4.06 | 3.48 | 43.31 | 40.11 | 52.97 | 24.91 | 88.08 | | Kimi-Audio | 4.215 | 4.46 | 3.97 | **63.12** | **62.17** | **83.52** | **61.10** | **100.00** | | Qwen2.5-Omni | 4.21 | 4.49 | 3.93 | 55.71 | 61.32 | 81.10 | 52.87 | 99.42 | | Ming-lite-omni | **4.34** | **4.63** | **4.06** | 58.84 | 47.53 | 61.98 | 58.36 | 99.04 |

ASR

| Model | aishell1 | aishell2_android | aishell2_ios | cv15_zh | fleurs_zh | wenetspeech_meeting | wenetspeech_net | librispeech_test_clean | librispeech_test_other | multilingual_librispeech | cv15_en | fleurs_en | voxpopuli_v1.0_en | | --- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | | Ming-lite-omni | 1.47 | **2.55** | **2.52** | 6.31 | 2.96 | 5.95 | 5.46 | 1.44 | 2.80 | **4.15** | **6.89** | **3.39** | **5.80** | | Qwen2.-Omni | 1.18 | 2.75 | 2.63 | **5.20** | 3.00 | **5.90** | 7.70 | 1.80 | 3.40 | 7.56 | 7.60 | 4.10 | **5.80** | | Qwen2-Audio | 1.53 | 2.92 | 2.92 | 6.90 | 7.50 | 7.16 | 8.42 | 1.60 | 3.60 | 5.40 | 8.60 | 6.90 | 6.84 | | Kimi-Audio | **0.60** | 2.64 | 2.56 | 7.21 | **2.69** | 6.28 | **5.37** | **1.28** | **2.42** | 5.88 | 10.31 | 4.44 | 7.97 |

Information-Seeking Benchmark

| Model | InfoSeek_H-mean | InfoSeek_unseen_question | InfoSeek_unseen_entity | | --- | :---: | :---: | :---: | | GPT-4o | **36.05** | - | - | | PaLI-X | 22.06 | 23.5 | 20.8 | | Qwen2.5-vl-32B | 19.35 | 20.55 | 18.28 | | Ming-lite-omni | 27.7 | **30.4** | **25.4** |

OCR

| Model | Ming-lite-omni | Qwen2.5-VL-7B-Instruct | | --- | :---: | :---: | | ChartQA_TEST | 85.1 | **87.3** | | DocVQA_TEST | 93 | **95.7** | | OCRBenchV2_en/zh | 53.3/52 | **56.3/57.2** | | OmniDocBench↓ | 34/ **34.4** | **30.8**/39.8 | | TextVQA_VAL | 82.8 | **84.9** |

GUI

| Model | Ming-lite-omni | InternVL3 8B | Qwen2.5-VL-7B-Instruct | | --- | :---: | :---: | :---: | | ScreenSpot | **82.1** | 79.5 | 78.9* | | ScreenSpot-V2 | **84.1** | 81.4 | - | | AITZ(EM) | **66.6** | - | 57.6* |

**Note**: * denotes the reproduced results.

Unified Generation Benchmark

| Model | single_object | two_object | counting | colors | position | color_attr | GENEVAL | DPGBench | FID↓ | | --- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | | Ming-lite-omni | **0.9875** | **0.7727** | **0.6812** | 0.7872 | 0.31 | 0.29 | **0.64** | 81.72 | **4.85** | | Metaquery-XL | - | - | - | - | - | - | 0.61 | **82.05** | 6.02 | | SDv2.1 | 0.98 | 0.51 | 0.44 | **0.85** | 0.07 | 0.17 | 0.50 | 68.09 | 26.96 | | Emu3-Gen | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 | 0.54 | 80.60 | - | | SDXL | 0.98 | 0.74 | 0.39 | **0.85** | 0.15 | 0.23 | 0.55 | 74.65 | 8.76 | | Janus | 0.97 | 0.68 | 0.30 | 0.84 | **0.46** | **0.42** | 0.61 | 79.68 | 10.10 | | JanusFlow | - | - | - | - | - | - | 0.63 | 80.09 | 9.51 |

Please refer to our technical report for more comprehensive evaluation results.

Use Cases

Additional demonstration cases are available on our project page.

📄 License

This project is licensed under the MIT License.

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご