MiniMax-VL-01

Developed by MiniMaxAI
MiniMax-VL-01 is a multimodal large language model built on the 'ViT-MLP-LLM' framework with dynamic-resolution image processing. It performs strongly across a range of vision-language tasks.
Downloads 237
Release Date: 1/12/2025

Model Overview

This model combines a Vision Transformer (ViT), an MLP projector, and a foundational large language model. It accepts dynamic-resolution image inputs ranging from 336×336 to 2016×2016 and performs strongly on multimodal benchmarks such as DocVQA and ChartQA.

Model Features

Dynamic resolution processing
Supports dynamic-resolution inputs from 336×336 to 2016×2016, keeping a thumbnail of the full image and encoding the image in segments
Large-scale training
Vision Transformer trained on 694 million image-caption pairs, with a total of 512 billion tokens processed during training
Multimodal capabilities
Combines visual and language understanding, excelling in complex multimodal tasks
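
To make the dynamic resolution feature more concrete, the following is a minimal sketch of how an image might be mapped onto a grid of segments plus one thumbnail. It is not the model's actual preprocessing code. The 336-pixel tile edge and the limit of 6 tiles per side are assumptions inferred from the 336×336 and 2016×2016 bounds, and the function names `plan_tiles` and `count_encoded_views` are hypothetical. The real grid-selection rule is not described on this page; rounding each side up to a whole number of tiles is a simplification.

```python
import math

TILE = 336                          # assumed tile edge (the minimum supported resolution)
MAX_TILES_PER_SIDE = 2016 // TILE   # assumed limit of 6, from the maximum resolution


def plan_tiles(width, height):
    """Return (cols, rows, boxes) for an image of the given pixel size.

    The image is assumed to be resized to (cols * TILE, rows * TILE) and
    then cut into non-overlapping TILE x TILE boxes.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # Round each side up to a whole number of tiles, clamped to 1..6.
    cols = min(max(math.ceil(width / TILE), 1), MAX_TILES_PER_SIDE)
    rows = min(max(math.ceil(height / TILE), 1), MAX_TILES_PER_SIDE)
    # Boxes are (left, top, right, bottom) in the resized image, row by row.
    boxes = [
        (c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
        for r in range(rows)
        for c in range(cols)
    ]
    return cols, rows, boxes


def count_encoded_views(width, height):
    """Number of images sent to the vision encoder: all tiles plus one thumbnail."""
    cols, rows, _ = plan_tiles(width, height)
    return cols * rows + 1


if __name__ == "__main__":
    cols, rows, boxes = plan_tiles(1000, 700)
    print(cols, rows, len(boxes), count_encoded_views(1000, 700))
```

Under these assumptions, a 1000×700 image is planned as a 3×3 grid, while anything at or above 2016 pixels on a side is capped at 6 tiles in that direction. Note that this simple rounding does not preserve aspect ratio, which a production pipeline would likely handle more carefully.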

Model Capabilities

Image understanding
Visual question answering
Document analysis
Chart comprehension
Mathematical reasoning
Scientific problem solving

Use Cases

Education
Scientific problem solving
Answering scientific questions containing charts and formulas
Excellent performance on MMMU and MMMU-Pro benchmarks
Document processing
Document QA
Extracting information from documents and answering questions
Achieved 96.4% accuracy on DocVQA benchmark
Data analysis
Chart comprehension
Analyzing and interpreting chart data
Achieved 91.7% accuracy on ChartQA benchmark