🚀 FlashVL-2B-Dynamic-ISS
We're thrilled to present FlashVL, a method for optimizing Vision-Language Models (VLMs) for real-time applications, targeting ultra-low latency and high throughput without compromising accuracy. Through tailored architectural enhancements and efficient computational strategies, Flash-VL 2B cuts processing time to maximize throughput while maintaining competitive performance across multiple vision-language benchmarks.
Key Information
| Property | Details |
| --- | --- |
| License | Apache-2.0 |
| Datasets | lmms-lab/LLaVA-OneVision-Data, BAAI/Infinity-MM |
| Supported Languages | English, Chinese |
| Base Model | apple/aimv2-huge-patch14-448, Qwen/Qwen2-1.5B-Instruct |
| Pipeline Tag | image-text-to-text |
| Library Name | transformers |
[📜 FlashVL Paper](https://arxiv.org/abs/2505.09498)

🚀 Quick Start
Environment Setup
```bash
pip install torch==2.1.2
pip install transformers==4.50.0.dev0
```
How to use it?
```python
import torch
import requests
from io import BytesIO
from PIL import Image
from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

# Load the model, tokenizer, and image processor
model_path = "FlashVL/FlashVL-2B-Dynamic-ISS"
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map='cuda',
)
model.tokenizer = AutoTokenizer.from_pretrained(model_path)
model.im_trans = CLIPImageProcessor.from_pretrained(model_path)

# Download an example image
image_url = "https://s3plus.meituan.net/automl-datasets/mlm/0516.png"
response = requests.get(image_url)
pil_image = Image.open(BytesIO(response.content)).convert('RGB')

# Single-turn image QA
# ("生成图中菜品的菜谱" = "Generate a recipe for the dish in the image")
messages = [{'role': 'user', 'content': "生成图中菜品的菜谱"}]
answer = model.chat(pil_image, messages, do_sample=False, max_new_tokens=256)
print(answer)

# Multi-turn conversation: the assistant turn carries the earlier answer,
# and the new user turn asks a follow-up
# ("这是什么" = "What is this?"; "对图中菜品卡路里分析" = "Analyze the calories of the dish in the image")
messages = [
    {'role': 'user', 'content': '这是什么'},
    {'role': 'assistant', 'content': (
        '这是一道看起来像是银耳莲子汤的甜品。'
        '银耳是一种常见的食材,通常用于制作甜品和汤品,具有软糯的口感和清润的口感。'
        '莲子是莲子的干燥部分,常用于中医和食疗中,具有补脾止泻的功效。'
        '图片中还可以看到一些枸杞和核桃,枸杞富含维生素和抗氧化物质,核桃则提供丰富的蛋白质和健康脂肪。'
        '整体来看,这道甜品不仅美味,还具有一定的营养价值。'
    )},
    {'role': 'user', 'content': '对图中菜品卡路里分析'},
]
answer = model.chat(pil_image, messages, do_sample=False, max_new_tokens=256)
print(answer)

# Text-only chat (no image)
messages = [{'role': 'user', 'content': "who are you"}]
answer = model.chat(None, messages, do_sample=False, max_new_tokens=256)
print(answer)
```
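For convenience, here is a minimal variation of the snippet above that queries a local image with sampling enabled. The file path `food.jpg` is a placeholder, and passing `do_sample=True` through `model.chat` is an assumption based on the generation keyword arguments shown above.

```python
from PIL import Image

# Load an image from disk instead of downloading one (placeholder path).
local_image = Image.open("food.jpg").convert('RGB')

# Single-turn English question; do_sample=True is assumed to be forwarded
# to the underlying generate() call, as with do_sample=False above.
messages = [{'role': 'user', 'content': "Describe this dish and estimate its calories."}]
answer = model.chat(local_image, messages, do_sample=True, max_new_tokens=256)
print(answer)
```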
✨ Features
FlashVL targets ultra-low latency and high throughput without sacrificing accuracy. Our approach combines tailored architectural choices, token compression mechanisms, data curation, training schemes, and a novel image processing technique called implicit semantic stitching (ISS) that effectively balances computational load and model performance.
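The exact token compression and implicit semantic stitching designs are described in the paper. Purely as an illustration of the general idea behind visual token compression (not the Flash-VL implementation), the sketch below averages each 2×2 block of vision tokens to cut the visual sequence length by 4×.

```python
import torch
import torch.nn.functional as F

def compress_vision_tokens(tokens: torch.Tensor, grid: int) -> torch.Tensor:
    """Toy 2x2 average pooling over a square grid of vision tokens.

    tokens: (batch, grid * grid, dim) features from a vision encoder.
    Returns (batch, (grid // 2) ** 2, dim). Illustrative only; this is not
    the compression mechanism used by Flash-VL.
    """
    b, n, d = tokens.shape
    assert n == grid * grid and grid % 2 == 0
    x = tokens.view(b, grid, grid, d).permute(0, 3, 1, 2)  # (b, d, grid, grid)
    x = F.avg_pool2d(x, kernel_size=2)                     # (b, d, grid/2, grid/2)
    return x.flatten(2).permute(0, 2, 1)                   # (b, n/4, d)

# Example: 1024 tokens from a 32x32 grid are reduced to 256 tokens.
dummy = torch.randn(1, 32 * 32, 1536)
print(compress_vision_tokens(dummy, grid=32).shape)  # torch.Size([1, 256, 1536])
```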
📚 Documentation
Evaluation
| Benchmark | Qwen2-VL-2B | Aquila-VL-2B | InternVL2.5-2B | Flash-VL-2B (Static) | Flash-VL-2B (Dynamic) | Flash-VL-2B (Dynamic-ISS) |
| --- | --- | --- | --- | --- | --- | --- |
| MMMU (val) | 41.9 | 44.4 | 41.8 | 43.6 | 42.9 | 42.9 |
| MMBench (EN) | 74.9 | 78.6 | 74.7 | 78.4 | 78.4 | 79.1 |
| MMBench (CN) | 73.5 | 76.3 | 71.6 | 74.7 | 74.9 | 76.7 |
| MMStar | 48.0 | 54.9 | 54.1 | 53.8 | 54.4 | 54.1 |
| MathVista (testmini) | 43.0 | 59.4 | 50.9 | 59.3 | 58.1 | 61.5 |
| AI2D (test) | 74.1 | 75.0 | 75.1 | 74.2 | 74.1 | 74.4 |
| MMVet | 49.5 | 40.9 | 61.7 | 47.3 | 52.7 | 50.7 |
| HallusionBench | 39.2 | 38.5 | 42.7 | 43.5 | 45.5 | 49.0 |
| OCRBench | 794 | 773 | 800 | 764 | 831 | 843 |
| MME | 1872 | 1813 | 2091 | 1715 | 1866 | 1850 |
| SEEDBench | 71.5 | 78.9 | 73.2 | 73.6 | 73.6 | 74.5 |
| Average | 60.2 | 62.6 | 63.6 | 62.4 | 64.0 | 64.8 |
We use [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) to evaluate the Flash-VL 2B models.
📄 License
This project is released under the Apache-2.0 license.
📜 Citation
If you find this project useful in your research, please consider citing:
```bibtex
@misc{zhang2025flashvl2boptimizingvisionlanguage,
  title={Flash-VL 2B: Optimizing Vision-Language Model Performance for Ultra-Low Latency and High Throughput},
  author={Bo Zhang and Shuo Li and Runhe Tian and Yang Yang and Jixin Tang and Jinhao Zhou and Lin Ma},
  year={2025},
  eprint={2505.09498},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.09498},
}
```