🚀 FlashVL-2B-Dynamic-ISS
We're thrilled to present FlashVL, a method for optimizing Vision-Language Models (VLMs) for real-time applications, targeting ultra-low latency and high throughput without compromising accuracy. Through tailored architectural enhancements and efficient computational strategies, Flash-VL 2B cuts processing time to maximize throughput while maintaining competitive performance across multiple vision-language benchmarks.
Key Information
| Property | Details |
| --- | --- |
| License | Apache-2.0 |
| Datasets | lmms-lab/LLaVA-OneVision-Data, BAAI/Infinity-MM |
| Supported Languages | English, Chinese |
| Base Model | apple/aimv2-huge-patch14-448, Qwen/Qwen2-1.5B-Instruct |
| Pipeline Tag | image-text-to-text |
| Library Name | transformers |
[📜 FlashVL Paper](https://arxiv.org/abs/2505.09498)

🚀 Quick Start
Environment Setup
```bash
pip install torch==2.1.2
pip install transformers==4.50.0.dev0
```
How to use it?
```python
import torch
import requests
from io import BytesIO
from PIL import Image
from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

# Load the model, tokenizer, and image processor
model_path = "FlashVL/FlashVL-2B-Dynamic-ISS"
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map='cuda',
)
model.tokenizer = AutoTokenizer.from_pretrained(model_path)
model.im_trans = CLIPImageProcessor.from_pretrained(model_path)

# Download an example image
image_url = "https://s3plus.meituan.net/automl-datasets/mlm/0516.png"
response = requests.get(image_url)
pil_image = Image.open(BytesIO(response.content)).convert('RGB')

# Single-turn image QA
# ("生成图中菜品的菜谱" = "Generate a recipe for the dish in the image")
messages = [{'role': 'user', 'content': "生成图中菜品的菜谱"}]
answer = model.chat(pil_image, messages, do_sample=False, max_new_tokens=256)
print(answer)

# Multi-turn conversation: the assistant turn carries the earlier answer,
# and the new user turn asks a follow-up
# ("这是什么" = "What is this?"; "对图中菜品卡路里分析" = "Analyze the calories of the dish in the image")
messages = [
    {'role': 'user', 'content': '这是什么'},
    {'role': 'assistant', 'content': (
        '这是一道看起来像是银耳莲子汤的甜品。'
        '银耳是一种常见的食材,通常用于制作甜品和汤品,具有软糯的口感和清润的口感。'
        '莲子是莲子的干燥部分,常用于中医和食疗中,具有补脾止泻的功效。'
        '图片中还可以看到一些枸杞和核桃,枸杞富含维生素和抗氧化物质,核桃则提供丰富的蛋白质和健康脂肪。'
        '整体来看,这道甜品不仅美味,还具有一定的营养价值。'
    )},
    {'role': 'user', 'content': '对图中菜品卡路里分析'},
]
answer = model.chat(pil_image, messages, do_sample=False, max_new_tokens=256)
print(answer)

# Text-only chat (no image)
messages = [{'role': 'user', 'content': "who are you"}]
answer = model.chat(None, messages, do_sample=False, max_new_tokens=256)
print(answer)
```
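For convenience, here is a minimal variation of the snippet above that queries a local image with sampling enabled. The file path `food.jpg` is a placeholder, and passing `do_sample=True` through `model.chat` is an assumption based on the generation keyword arguments shown above.

```python
from PIL import Image

# Load an image from disk instead of downloading one (placeholder path).
local_image = Image.open("food.jpg").convert('RGB')

# Single-turn English question; do_sample=True is assumed to be forwarded
# to the underlying generate() call, as with do_sample=False above.
messages = [{'role': 'user', 'content': "Describe this dish and estimate its calories."}]
answer = model.chat(local_image, messages, do_sample=True, max_new_tokens=256)
print(answer)
```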
✨ Features
FlashVL targets ultra-low latency and high throughput without sacrificing accuracy. Our approach combines tailored architectural choices, token compression mechanisms, data curation, training schemes, and a novel image processing technique called implicit semantic stitching (ISS) that effectively balances computational load and model performance.
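The exact token compression and implicit semantic stitching designs are described in the paper. Purely as an illustration of the general idea behind visual token compression (not the Flash-VL implementation), the sketch below averages each 2×2 block of vision tokens to cut the visual sequence length by 4×.

```python
import torch
import torch.nn.functional as F

def compress_vision_tokens(tokens: torch.Tensor, grid: int) -> torch.Tensor:
    """Toy 2x2 average pooling over a square grid of vision tokens.

    tokens: (batch, grid * grid, dim) features from a vision encoder.
    Returns (batch, (grid // 2) ** 2, dim). Illustrative only; this is not
    the compression mechanism used by Flash-VL.
    """
    b, n, d = tokens.shape
    assert n == grid * grid and grid % 2 == 0
    x = tokens.view(b, grid, grid, d).permute(0, 3, 1, 2)  # (b, d, grid, grid)
    x = F.avg_pool2d(x, kernel_size=2)                     # (b, d, grid/2, grid/2)
    return x.flatten(2).permute(0, 2, 1)                   # (b, n/4, d)

# Example: 1024 tokens from a 32x32 grid are reduced to 256 tokens.
dummy = torch.randn(1, 32 * 32, 1536)
print(compress_vision_tokens(dummy, grid=32).shape)  # torch.Size([1, 256, 1536])
```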
📚 Documentation
Evaluation
| Benchmark | Qwen2-VL-2B | Aquila-VL-2B | InternVL2.5-2B | Flash-VL-2B (Static) | Flash-VL-2B (Dynamic) | Flash-VL-2B (Dynamic-ISS) |
| --- | --- | --- | --- | --- | --- | --- |
| MMMU (val) | 41.9 | 44.4 | 41.8 | 43.6 | 42.9 | 42.9 |
| MMBench (EN) | 74.9 | 78.6 | 74.7 | 78.4 | 78.4 | 79.1 |
| MMBench (CN) | 73.5 | 76.3 | 71.6 | 74.7 | 74.9 | 76.7 |
| MMStar | 48.0 | 54.9 | 54.1 | 53.8 | 54.4 | 54.1 |
| MathVista (testmini) | 43.0 | 59.4 | 50.9 | 59.3 | 58.1 | 61.5 |
| AI2D (test) | 74.1 | 75.0 | 75.1 | 74.2 | 74.1 | 74.4 |
| MMVet | 49.5 | 40.9 | 61.7 | 47.3 | 52.7 | 50.7 |
| HallusionBench | 39.2 | 38.5 | 42.7 | 43.5 | 45.5 | 49.0 |
| OCRBench | 794 | 773 | 800 | 764 | 831 | 843 |
| MME | 1872 | 1813 | 2091 | 1715 | 1866 | 1850 |
| SEEDBench | 71.5 | 78.9 | 73.2 | 73.6 | 73.6 | 74.5 |
| Average | 60.2 | 62.6 | 63.6 | 62.4 | 64.0 | 64.8 |
We use [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) to evaluate the Flash-VL 2B models.
📄 License
This project is released under the Apache-2.0 license.
📜 Citation
If you find this project useful in your research, please consider citing:
```bibtex
@misc{zhang2025flashvl2boptimizingvisionlanguage,
  title={Flash-VL 2B: Optimizing Vision-Language Model Performance for Ultra-Low Latency and High Throughput},
  author={Bo Zhang and Shuo Li and Runhe Tian and Yang Yang and Jixin Tang and Jinhao Zhou and Lin Ma},
  year={2025},
  eprint={2505.09498},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.09498},
}
```