# Multimodal reasoning
## GLM-4.1V-9B-Thinking
MIT · Image-to-Text · Transformers · multilingual · THUDM

GLM-4.1V-9B-Thinking is an open-source vision-language model built on the GLM-4-9B-0414 foundation model. It focuses on improving reasoning in complex tasks and supports a 64k context length and 4K image resolution.

## Kimi-VL-A3B-Thinking-2506
MIT · Image-to-Text · Transformers · moonshotai

Kimi-VL-A3B-Thinking-2506 is an upgraded version of Kimi-VL-A3B-Thinking, with significant improvements in multimodal reasoning, visual perception and understanding, and video scene processing. It supports higher-resolution images and reasons more effectively while consuming fewer tokens.

## Magistral-Small-2506-Vision
Apache-2.0 · Image-to-Text · Safetensors · multilingual · OptimusePrime

Magistral-Small-2506-Vision is a reasoning-focused fine-tune of Mistral Small 3.1 trained with GRPO; it is an experimental checkpoint with vision capabilities.

## Stockmark-2-VL-100B-beta
Other · Image-to-Text · Transformers · multilingual · stockmark

Stockmark-2-VL-100B-beta is a Japanese-focused vision-language model with 100 billion parameters, equipped with chain-of-thought (CoT) reasoning and suited to document reading and comprehension.

## InternVL3-8B
Apache-2.0 · Multimodal Alignment · Transformers · unsloth

InternVL3-8B is an advanced multimodal large language model with strong multimodal perception and reasoning capabilities, able to process multimodal data such as images and videos.

## InternVL3-1B-GGUF
Apache-2.0 · Multimodal Fusion · Transformers · unsloth

InternVL3-1B is an advanced multimodal large language model that excels in multimodal perception and reasoning, and extends multimodal capabilities such as tool use and GUI agents.

## VisionReasoner-7B
Apache-2.0 · Image-to-Text · Transformers · English · Ricky06662

VisionReasoner-7B is an image-text-to-text model with a decoupled architecture consisting of a reasoning model and a segmentation model; it can interpret user intent and generate pixel-level masks.

## Qwen3-8B
Apache-2.0 · Large Language Model · Transformers · unsloth

Qwen3-8B is the latest large language model in the Qwen series. It supports multiple languages and performs strongly in reasoning and instruction following.

## InternVL3-38B-HF
Other · Image-to-Text · Transformers · OpenGVLab

InternVL3-38B is an advanced multimodal large language model (MLLM) with significant improvements in multimodal perception and reasoning, supporting tool use, GUI agents, industrial image analysis, and 3D visual perception.

## Synthia-S1-27b-bnb-4bit
Text-to-Image · Transformers · GusPuffy

Synthia-S1-27b is an advanced reasoning model developed by Tesslate AI, focused on logical reasoning, coding, and role-playing tasks; this repository provides a 4-bit (bitsandbytes) quantized build.

## InternVL3-14B-HF
Other · Image-to-Text · Transformers · OpenGVLab

InternVL3-14B is a powerful multimodal large language model that excels in multimodal perception and reasoning and accepts image, text, and video inputs.

## InternVL3-38B
Other · Text-to-Image · Transformers · FriendliAI

InternVL3-38B is an advanced multimodal large language model that excels in multimodal perception and reasoning, shows significant improvements over previous models, and extends capabilities such as tool use and GUI agents.

## InternVL3-8B
Other · Multimodal Fusion · Transformers · FriendliAI

InternVL3-8B is an advanced multimodal large language model with excellent perception and reasoning capabilities, performing well in areas such as tool use, GUI agents, and industrial image analysis.

## Gemma-3-27b-it-GGUF
Text-to-Image · Mungert

GGUF-quantized build of Gemma 3 (27B parameters, instruction-tuned), supporting image-text interaction tasks.

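Several entries above (Gemma-3-27b-it-GGUF, Synthia-S1-27b-bnb-4bit) ship 4-bit quantized weights. The core idea, shared-scale block quantization, can be sketched in plain Python; this is a simplified illustration, not the actual GGUF or bitsandbytes on-disk format, which packs two 4-bit values per byte in its own block layout.

```python
def quantize_q4(block):
    """Quantize a block of floats to signed 4-bit integers plus one scale.

    Simplified sketch: one shared scale per block, values clamped to
    the signed 4-bit range [-8, 7].
    """
    scale = max(abs(x) for x in block) / 7.0
    if scale == 0.0:
        scale = 1.0  # all-zero block: any scale reproduces it exactly
    q = [max(-8, min(7, round(x / scale))) for x in block]
    return q, scale

def dequantize_q4(q, scale):
    # Recover approximate floats from the 4-bit codes.
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 0.07]
q, scale = quantize_q4(weights)
restored = dequantize_q4(q, scale)
err = max(abs(a - b) for a, b in zip(weights, restored))
```

Rounding to the nearest code keeps the reconstruction error of each value within half a scale step, which is why 4-bit builds of large models remain usable despite the 8x compression versus float32.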
## R1-VL-7B
Apache-2.0 · Image-to-Text · Transformers · jingyiZ00

R1-VL-7B is a reasoning model based on Qwen2-VL-7B-Instruct, trained with Step-wise Group Relative Policy Optimization (StepGRPO) and focused on image-text-to-text tasks.

## Phi-3.5-Vision-Instruct
MIT · Image-to-Text · Transformers · FriendliAI

Phi-3.5-vision is a lightweight, state-of-the-art open multimodal model that supports a 128K context length and focuses on high-quality, reasoning-dense text and visual data.

## Spec-Vision-V1
MIT · Text-to-Image · Transformers · SVECTOR-CORPORATION

Spec-Vision-V1 is a lightweight, state-of-the-art open multimodal model designed for deep integration of visual and textual data, supporting a 128K context length.

## Mulberry-qwen2vl-7b
Apache-2.0 · Text-to-Image · Transformers · HuanjinYao

Mulberry-qwen2vl-7b is a step-by-step reasoning model trained on the Mulberry-260K SFT dataset, which was generated through collective knowledge search.

## Mulberry-llava-8b
Apache-2.0 · Image-to-Text · Transformers · HuanjinYao

Mulberry-llava-8b is an image-text-to-text model based on step-by-step reasoning, trained on the Mulberry-260K SFT dataset, with strong image understanding and text generation capabilities.

## Meditron-7b-llm-radiology
Apache-2.0 · Large Language Model · Transformers · nitinaggarwal12

An open-source model under the Apache-2.0 license; detailed information has not yet been provided.

## DNABERT-S
Apache-2.0 · Large Language Model · Transformers · zhihan1996

An open-source model under the Apache-2.0 license; see the model documentation for specific functionality.
