Model Overview
Model Features
Model Capabilities
Use Cases
🚀 Ming-Lite-Omni
Ming-Lite-Omni is a unified multimodal model that can process images, text, audio, and video. It is derived from Ling-lite and features 2.8 billion activated parameters. This model can perform speech and image generation, and support context-aware chatting, text-to-speech conversion, and versatile image editing.
Property | Details |
---|---|
Base Model | inclusionAI/Ling-lite |
License | MIT |
Pipeline Tag | any-to-any |
Library Name | transformers |
📑 Technical Report | 📖 Project Page | 🤗 Hugging Face | 🤖 ModelScope
🚀 Quick Start
Prerequisites
- Download the model following Model Downloads.
- Install the Python environment dependencies:
pip install -r requirements.txt
pip install data/matcha_tts-0.0.5.1-cp38-cp38-linux_x86_64.whl
pip install diffusers==0.33.0
pip install nvidia-cublas-cu12==12.4.5.8 # for H20
Note: The examples are tested on hardware of NVIDIA H800-80GB with CUDA 12.2. Loading inclusionAI/Ming-Lite-Omni in bfloat16 takes about 40890MB memory.
Example Code
import os
import torch
from transformers import AutoProcessor, GenerationConfig
from modeling_bailingmm import BailingMMNativeForConditionalGeneration
# build model
model = BailingMMNativeForConditionalGeneration.from_pretrained(
"inclusionAI/Ming-Lite-Omni",
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True
).to("cuda")
assets_path = YOUR_ASSETS_PATH
# build processor
processor = AutoProcessor.from_pretrained("inclusionAI/Ming-Lite-Omni", trust_remote_code=True)
✨ Features
- Unified Omni-Modality Perception: Ming-lite-omni, built on Ling, an MoE architecture LLM, resolves task conflicts and ensures coherent integration of tokens from different modalities through modality-specific routers.
- Unified Perception and Generation: Ming-lite-omni achieves unified understanding and generation, enabling the model to interpret multimodal instructions and user intent during generation, which helps enhance generation quality and improves usability across multiple tasks.
- Innovative Generation Capabilities: Ming-lite-omni can perceive all modalities and generate high-quality text, real-time speech, and vivid images simultaneously, delivering exceptional cross-modal performance across diverse tasks including image perception, audio-visual interaction, and image generation.
📦 Installation
Please download the model from the following sources:
[🤖 ModelScope](https://www.modelscope.cn/models/inclusionAI/Ming-Lite-Omni) |
💻 Usage Examples
Basic Usage
# qa
messages = [
{
"role": "HUMAN",
"content": [
{"type": "text", "text": "请详细介绍鹦鹉的生活习性。"}
],
},
]
# Output:
# 鹦鹉是一种非常聪明和社交性强的鸟类,它们的生活习性非常丰富和有趣。以下是一些关于鹦鹉生活习性的详细介绍:
# ### 1. **栖息地**
# 鹦鹉主要分布在热带和亚热带地区,包括非洲、亚洲、澳大利亚和南美洲。它们通常生活在森林、草原、沙漠和城市环境中。不同种类的鹦鹉对栖息地的要求有所不同,但大多数鹦鹉喜欢有丰富植被和水源的地方。
# ### 2. **饮食**
# 鹦鹉是杂食性动物,它们的饮食非常多样化。它们的食物包括种子、坚果、水果、蔬菜、花蜜和昆虫。鹦鹉的喙非常强壮,能够轻松地打开坚硬的果壳和坚果。一些鹦鹉还会吃泥土或沙子,以帮助消化和补充矿物质。
# ......
# image qa
messages = [
{
"role": "HUMAN",
"content": [
{"type": "image", "image": os.path.join(assets_path, "flowers.jpg")},
{"type": "text", "text": "What kind of flower is this?"},
],
},
]
# Output:
# The flowers in this image are forget-me-nots. These delicate blooms are known for their small, five-petaled flowers that come in various shades of blue, pink, and white.
Advanced Usage
Enable Thinking Before Response
cot_prompt = "SYSTEM: You are a helpful assistant. When the user asks a question, your response must include two parts: first, the reasoning process enclosed in <thinking>...</thinking> tags, then the final answer enclosed in <answer>...</answer> tags. The critical answer or key result should be placed within \\boxed{}.
"
# And your input message should be like this:
messages = [
{
"role": "HUMAN",
"content": [
{"type": "image", "image": os.path.join(assets_path, "reasoning.png")},
{"type": "text", "text": cot_prompt + "In the rectangle $A B C D$ pictured, $M_{1}$ is the midpoint of $D C, M_{2}$ the midpoint of $A M_{1}, M_{3}$ the midpoint of $B M_{2}$ and $M_{4}$ the midpoint of $C M_{3}$. Determine the ratio of the area of the quadrilateral $M_{1} M_{2} M_{3} M_{4}$ to the area of the rectangle $A B C D$.
Choices:
(A) $\\frac{7}{16}$
(B) $\\frac{3}{16}$
(C) $\\frac{7}{32}$
(D) $\\frac{9}{32}$
(E) $\\frac{1}{5}$"},
],
},
]
# Output:
# \<think\>
Okay, so I have this problem about a rectangle ABCD ... (thinking process omitted) ... So, the correct answer is C.
\</think\>
\<answer\>\\boxed{C}\</answer\>
Video QA
# video qa
messages = [
{
"role": "HUMAN",
"content": [
{"type": "video", "video": os.path.join(assets_path, "yoga.mp4")},
{"type": "text", "text": "What is the woman doing?"},
],
},
]
# Output:
# The image shows a woman performing a yoga pose on a rooftop. She's in a dynamic yoga pose, with her arms and legs extended in various positions.
Multi-Turn Chat
# multi-turn chat
messages = [
{
"role": "HUMAN",
"content": [
{"type": "text", "text": "中国的首都是哪里?"},
],
},
{
"role": "ASSISTANT",
"content": [
{"type": "text", "text": "北京"},
],
},
{
"role": "HUMAN",
"content": [
{"type": "text", "text": "它的占地面积是多少?有多少常住人口?"},
],
},
]
# Output:
# 北京市的总面积约为16,410.54平方公里,常住人口约为21,542,000人。
Preparation for Inference
text = processor.apply_chat_template(messages, add_generation_prompt=True)
image_inputs, video_inputs, audio_inputs = processor.process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
audios=audio_inputs,
return_tensors="pt",
)
inputs = inputs.to(model.device)
for k in inputs.keys():
if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats":
inputs[k] = inputs[k].to(dtype=torch.bfloat16)
# call generate
generation_config = GenerationConfig.from_dict({'no_repeat_ngram_size': 10})
generated_ids = model.generate(
**inputs,
max_new_tokens=512,
use_cache=True,
eos_token_id=processor.gen_terminator,
generation_config=generation_config,
)
generated_ids_trimmed = [
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(output_text)
Audio Tasks
# ASR
messages = [
{
"role": "HUMAN",
"content": [
{"type": "text", "text": "Please recognize the language of this speech and transcribe it. Format: oral."},
{"type": "audio", "audio": 'data/wavs/BAC009S0915W0292.wav'},
],
},
]
# we use whisper encoder for ASR task, so need modify code above
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
audios=audio_inputs,
return_tensors="pt",
audio_kwargs={'use_whisper_encoder': True}
)
outputs = model.generate(
**inputs,
max_new_tokens=512,
use_cache=True,
eos_token_id=processor.gen_terminator,
generation_config=generation_config,
use_whisper_encoder=True
)
# speech2speech
messages = [
{
"role": "HUMAN",
"content": [
{"type": "audio", "audio": 'data/wavs/speechQA_sample.wav'},
],
},
]
generation_config = GenerationConfig.from_dict({
'output_hidden_states': True,
'return_dict_in_generate': True,
'no_repeat_ngram_size': 10}
)
outputs = model.generate(
**inputs,
max_new_tokens=512,
use_cache=True,
eos_token_id=processor.gen_terminator,
generation_config=generation_config,
us
📚 Documentation
Evaluation
Ming-lite-omni delivers exceptional cross-modal performance, as validated across image perception, audio-visual interaction, and image generation tasks.
Image benchmark
Encyclopedia Benchmarks
Video benchmark
Audio benchmark
SpeechQA
ASR
Information-Seeking Benchmark
OCR
GUI
Unified Generation Benchmark
Please refer to our technical report for more comprehensive evaluation results.
Use Cases
Additional demonstration cases are available on our project page.
📄 License
This project is licensed under the MIT License.







