# Multilingual Visual Understanding
Internvl3 8B AWQ
Other
InternVL3-8B is a multimodal large language model developed by OpenGVLab with strong multimodal perception and reasoning capabilities. Its abilities extend to tool use, GUI agents, industrial image analysis, and 3D visual perception.
Image-to-Text
Transformers Other

OpenGVLab
1,441
3
Internvl3 2B Instruct
Apache-2.0
InternVL3-2B-Instruct is the supervised fine-tuned (SFT) version of InternVL3-2B. It has undergone native multimodal pretraining followed by SFT and offers strong multimodal perception and reasoning capabilities.
Image-to-Text
Transformers Other

OpenGVLab
1,345
4
Colqwen2.5 3b Multilingual V1.0
MIT
A multilingual visual retriever built on Qwen2.5-VL-3B-Instruct using the ColBERT late-interaction strategy. It performs strongly on the ViDoRe benchmark.
Text-to-Image Supports Multiple Languages
Metric-AI
2,475
7
Erax VL 2B V1.5 I1 GGUF
Apache-2.0
EraX-VL-2B-V1.5 is a multimodal foundation model supporting Vietnamese, English, and Chinese, with capabilities for image-to-text and image-text-to-text conversion.
Image-to-Text Supports Multiple Languages
mradermacher
467
0
Pix2struct Infographics Vqa Base
Apache-2.0
Pix2Struct is a vision-language understanding model pretrained for image-to-text conversion tasks, specifically optimized for high-resolution infographic visual question answering.
Image-to-Text
Transformers Supports Multiple Languages

google
74
8
Pix2struct Infographics Vqa Large
Apache-2.0
Pix2Struct is an image-encoder, text-decoder model trained through multi-task learning for vision-language understanding, specifically optimized for visual question answering on high-resolution infographics.
Image-to-Text
Transformers Supports Multiple Languages

google
108
10
Pix2struct Textcaps Large
Apache-2.0
Pix2Struct is a vision-language understanding model trained on image-to-text tasks. It supports multiple tasks, including image captioning and visual question answering.
Image-to-Text
Transformers Supports Multiple Languages

google
128
14
Pix2struct Textcaps Base
Apache-2.0
Pix2Struct is a vision-language understanding model that is pretrained and then fine-tuned on image-to-text tasks, and is particularly suited to image captioning.
Image-to-Text
Transformers Supports Multiple Languages

google
3,888
28