# Multimodal reasoning
## GLM-4.1V-9B-Thinking
MIT · Image-to-Text · Transformers · multilingual · THUDM

GLM-4.1V-9B-Thinking is an open-source vision-language model built on the GLM-4-9B-0414 foundation model. It focuses on improving reasoning in complex tasks and supports a 64k context length and 4K image resolution.

## Kimi-VL-A3B-Thinking-2506
MIT · Image-to-Text · Transformers · moonshotai

Kimi-VL-A3B-Thinking-2506 is an upgraded version of Kimi-VL-A3B-Thinking, with significant improvements in multimodal reasoning, visual perception and understanding, and video scene processing. It supports higher-resolution images and reasons more effectively while consuming fewer tokens.

## Magistral-Small-2506-Vision
Apache-2.0 · Image-to-Text · Safetensors · multilingual · OptimusePrime

Magistral-Small-2506-Vision is a reasoning-focused fine-tune of Mistral Small 3.1 trained with GRPO; it is an experimental checkpoint with vision capabilities.

## Stockmark-2-VL-100B-beta
Other · Image-to-Text · Transformers · multilingual · stockmark

Stockmark-2-VL-100B-beta is a Japanese-focused vision-language model with 100 billion parameters, equipped with chain-of-thought (CoT) reasoning and suited to document reading and comprehension.

## InternVL3-8B
Apache-2.0 · Multimodal Alignment · Transformers · unsloth

InternVL3-8B is an advanced multimodal large language model with strong multimodal perception and reasoning capabilities, able to process multimodal data such as images and videos.

## InternVL3-1B-GGUF
Apache-2.0 · Multimodal Fusion · Transformers · unsloth

InternVL3-1B is an advanced multimodal large language model that excels in multimodal perception and reasoning, and extends multimodal capabilities such as tool use and GUI agents.

## VisionReasoner-7B
Apache-2.0 · Image-to-Text · Transformers · English · Ricky06662

VisionReasoner-7B is an image-text-to-text model with a decoupled architecture consisting of a reasoning model and a segmentation model; it can interpret user intent and generate pixel-level masks.

## Qwen3-8B
Apache-2.0 · Large Language Model · Transformers · unsloth

Qwen3-8B is the latest large language model in the Qwen series. It supports multiple languages and performs strongly in reasoning and instruction following.

## InternVL3-38B-HF
Other · Image-to-Text · Transformers · OpenGVLab

InternVL3-38B is an advanced multimodal large language model (MLLM) with significant improvements in multimodal perception and reasoning, supporting tool use, GUI agents, industrial image analysis, and 3D visual perception.

## Synthia-S1-27b-bnb-4bit
Text-to-Image · Transformers · GusPuffy

Synthia-S1-27b is an advanced reasoning model developed by Tesslate AI, focused on logical reasoning, coding, and role-playing tasks; this repository provides a 4-bit (bitsandbytes) quantized build.

## InternVL3-14B-HF
Other · Image-to-Text · Transformers · OpenGVLab

InternVL3-14B is a powerful multimodal large language model that excels in multimodal perception and reasoning and accepts image, text, and video inputs.

## InternVL3-38B
Other · Text-to-Image · Transformers · FriendliAI

InternVL3-38B is an advanced multimodal large language model that excels in multimodal perception and reasoning, shows significant improvements over previous models, and extends capabilities such as tool use and GUI agents.

## InternVL3-8B
Other · Multimodal Fusion · Transformers · FriendliAI

InternVL3-8B is an advanced multimodal large language model with excellent perception and reasoning capabilities, performing well in areas such as tool use, GUI agents, and industrial image analysis.

## Gemma-3-27b-it-GGUF
Text-to-Image · Mungert

GGUF-quantized build of Gemma 3 (27B parameters, instruction-tuned), supporting image-text interaction tasks.

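Several entries above (Gemma-3-27b-it-GGUF, Synthia-S1-27b-bnb-4bit) ship 4-bit quantized weights. The core idea, shared-scale block quantization, can be sketched in plain Python; this is a simplified illustration, not the actual GGUF or bitsandbytes on-disk format, which packs two 4-bit values per byte in its own block layout.

```python
def quantize_q4(block):
    """Quantize a block of floats to signed 4-bit integers plus one scale.

    Simplified sketch: one shared scale per block, values clamped to
    the signed 4-bit range [-8, 7].
    """
    scale = max(abs(x) for x in block) / 7.0
    if scale == 0.0:
        scale = 1.0  # all-zero block: any scale reproduces it exactly
    q = [max(-8, min(7, round(x / scale))) for x in block]
    return q, scale

def dequantize_q4(q, scale):
    # Recover approximate floats from the 4-bit codes.
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 0.07]
q, scale = quantize_q4(weights)
restored = dequantize_q4(q, scale)
err = max(abs(a - b) for a, b in zip(weights, restored))
```

Rounding to the nearest code keeps the reconstruction error of each value within half a scale step, which is why 4-bit builds of large models remain usable despite the 8x compression versus float32.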
## R1-VL-7B
Apache-2.0 · Image-to-Text · Transformers · jingyiZ00

R1-VL-7B is a reasoning model based on Qwen2-VL-7B-Instruct, trained with Step-wise Group Relative Policy Optimization (StepGRPO) and focused on image-text-to-text tasks.

## Phi-3.5-Vision-Instruct
MIT · Image-to-Text · Transformers · FriendliAI

Phi-3.5-vision is a lightweight, state-of-the-art open multimodal model that supports a 128K context length and focuses on high-quality, reasoning-dense text and visual data.

## Spec-Vision-V1
MIT · Text-to-Image · Transformers · SVECTOR-CORPORATION

Spec-Vision-V1 is a lightweight, state-of-the-art open multimodal model designed for deep integration of visual and textual data, supporting a 128K context length.

## Mulberry-qwen2vl-7b
Apache-2.0 · Text-to-Image · Transformers · HuanjinYao

Mulberry-qwen2vl-7b is a step-by-step reasoning model trained on the Mulberry-260K SFT dataset, which was generated through collective knowledge search.

## Mulberry-llava-8b
Apache-2.0 · Image-to-Text · Transformers · HuanjinYao

Mulberry-llava-8b is an image-text-to-text model based on step-by-step reasoning, trained on the Mulberry-260K SFT dataset, with strong image understanding and text generation capabilities.

## Meditron-7b-llm-radiology
Apache-2.0 · Large Language Model · Transformers · nitinaggarwal12

An open-source model under the Apache-2.0 license; detailed information has not yet been provided.

## DNABERT-S
Apache-2.0 · Large Language Model · Transformers · zhihan1996

An open-source model under the Apache-2.0 license; see the model documentation for specific functionality.
