
GLM-4.1V-9B-Thinking

Developed by THUDM
GLM-4.1V-9B-Thinking is an open-source vision-language model built on the GLM-4-9B-0414 foundation model. It focuses on improving reasoning ability in complex tasks and supports a 64k context length and up to 4K image resolution.
Release Time: 6/28/2025

Model Overview

This model aims to explore the upper limit of reasoning ability in vision-language models. By introducing a 'thinking' paradigm and reinforcement learning, it achieves state-of-the-art performance among models at the 10-billion-parameter scale and supports bilingual use in Chinese and English.

Model Features

Powerful reasoning ability
Through a chain-of-thought reasoning paradigm, the model significantly improves the accuracy, richness, and interpretability of its answers and performs well on complex tasks.
Long context support
Supports a 64k context length, suitable for processing long documents and multi-turn conversations.
High-resolution image processing
Supports images of any aspect ratio at up to 4K resolution, enabling high-definition image processing (see the usage sketch after this list).
Bilingual support
Released as an open-source model with Chinese-English bilingual support, suitable for multilingual application scenarios.
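
In practice, these features can be exercised through the Hugging Face transformers library. The sketch below is a minimal, illustrative example, assuming a recent transformers release that ships GLM-4.1V support; the repository id THUDM/GLM-4.1V-9B-Thinking, the Glm4vForConditionalGeneration class name, and the image URL are assumptions that should be verified against the official model card.

# Minimal sketch: single-image question answering with GLM-4.1V-9B-Thinking.
# Assumes a transformers release with GLM-4.1V support; repo id, class name,
# and image URL are illustrative placeholders.
import torch
from transformers import AutoProcessor, Glm4vForConditionalGeneration

MODEL_ID = "THUDM/GLM-4.1V-9B-Thinking"  # assumed Hugging Face repo id

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Glm4vForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # bf16 keeps the 9B model within a single high-memory GPU
    device_map="auto",
)

# Chat-style input: one image (any aspect ratio, up to 4K) plus a text prompt.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/diagram.png"},  # placeholder image
            {"type": "text", "text": "Describe this image and explain what it shows."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# The "thinking" model emits its chain-of-thought before the final answer,
# so a generous generation budget is useful.
output_ids = model.generate(**inputs, max_new_tokens=4096)
answer = processor.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False
)
print(answer)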

Model Capabilities

Image description
Complex task reasoning
Long context understanding
Multimodal intelligent agent

Use Cases

Intelligent system
Complex problem solving
Uses the model's reasoning ability to solve complex multimodal problems.
Outperforms the 72-billion-parameter Qwen2.5-VL-72B on 18 benchmark tasks.
Long document understanding
Processes long documents and multi-turn conversations within the 64k context length (see the sketch after this list).
Image analysis
High-definition image description
Provides detailed descriptions of high-definition images at up to 4K resolution.
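
For the long-document and multi-turn use cases, the same chat-template interface carries the running conversation. The following sketch continues the example above and reuses its processor, model, messages, and answer objects; the file name report.txt and the follow-up prompt are illustrative placeholders.

# Sketch of a multi-turn follow-up that reuses processor, model, messages,
# and answer from the previous example. The 64k context window allows the
# full history, including a pasted long document, to stay in the prompt.
with open("report.txt", encoding="utf-8") as f:  # placeholder long document
    long_document = f.read()

# In practice you would likely strip the model's thinking trace from `answer`
# before adding it back to the conversation history.
messages.append({"role": "assistant", "content": [{"type": "text", "text": answer}]})
messages.append(
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Here is the full report:\n" + long_document
                        + "\nUsing the image and the report, summarize the key findings.",
            },
        ],
    }
)

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=4096)
print(processor.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False))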