# Visual question answering

## Gemma 3 27B It Quantized.w4a16 (RedHatAI)

A quantized version of google/gemma-3-27b-it that takes image-and-text input and produces text output. Its weights are quantized to INT4 while activations stay at 16-bit precision (W4A16), enabling efficient inference with vLLM.

Tags: Image-to-Text, Transformers
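The vLLM mention above suggests a quick smoke test. Below is a minimal sketch using vLLM's offline chat API, assuming the repository ID `RedHatAI/gemma-3-27b-it-quantized.w4a16` and a recent vLLM build with Gemma 3 support; note a 27B model still needs substantial GPU memory even at W4A16.

```python
# Minimal sketch: offline inference against the W4A16 checkpoint with vLLM.
# The repo id is inferred from the listing; verify it on the hub first.
from vllm import LLM, SamplingParams

llm = LLM(model="RedHatAI/gemma-3-27b-it-quantized.w4a16")
params = SamplingParams(temperature=0.0, max_tokens=128)

# Chat-style prompt; vLLM applies the model's own chat template.
messages = [{"role": "user", "content": "Explain W4A16 quantization in one sentence."}]
outputs = llm.chat(messages, params)
print(outputs[0].outputs[0].text)
```

Image inputs go through vLLM's multimodal API rather than plain prompt strings; see the vLLM docs for the `multi_modal_data` format.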
## VisionReasoner 7B (Ricky06662)

License: Apache-2.0

VisionReasoner-7B is an image-text-to-text model with a decoupled architecture consisting of a reasoning model and a segmentation model. It can interpret user intent and generate pixel-level masks.

Tags: Image-to-Text, Transformers, English
## Gemma 3 27B It GPTQ 4b 128g (ISTA-DASLab)

An INT4 GPTQ-quantized version of gemma-3-27b-it (4-bit weights, group size 128) that cuts disk and GPU memory requirements by reducing the number of bits per parameter.

Tags: Image-to-Text, Transformers
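A back-of-envelope check of the memory claim, assuming every one of the 27B parameters is stored at the quantized width (real checkpoints usually keep embeddings and a few layers at higher precision):

```python
# Rough weight-only memory footprint for a 27B-parameter model.
PARAMS = 27e9

def weight_gb(bits_per_param: float) -> float:
    """Convert bits per parameter into total gigabytes of weights."""
    return PARAMS * bits_per_param / 8 / 1e9  # bits -> bytes -> GB

print(f"INT4: {weight_gb(4):.1f} GB")   # ~13.5 GB
print(f"BF16: {weight_gb(16):.1f} GB")  # ~54.0 GB
```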
## Gemma 3 4B It QAT Q4_0 GGUF (google)

Gemma 3 is Google's lightweight, state-of-the-art open multimodal model family, supporting text and image inputs with text output, a 128K-token context window, and over 140 languages. This release is a quantization-aware-trained (QAT) Q4_0 checkpoint in GGUF format.

Tags: Image-to-Text
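GGUF checkpoints are normally run with llama.cpp; here is a text-only sketch via the llama-cpp-python bindings. The repo ID and filename glob are assumptions based on the listing, and image input support depends on the llama.cpp build.

```python
# Sketch: downloading and running the QAT Q4_0 GGUF with llama-cpp-python.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="google/gemma-3-4b-it-qat-q4_0-gguf",  # assumed repo id; check the hub
    filename="*q4_0.gguf",                         # glob matched against repo files
    n_ctx=8192,                                    # the model supports up to 128K tokens
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "In one sentence, what does QAT Q4_0 mean?"}]
)
print(out["choices"][0]["message"]["content"])
```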
## SmolVLM2 2.2B Instruct (HuggingFaceTB)

License: Apache-2.0

SmolVLM2-2.2B is a lightweight multimodal model designed for analyzing video content. It processes video, image, and text inputs and generates text output.

Tags: Image-to-Text, Transformers, English
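A minimal image-question sketch with Transformers, assuming a recent release that provides `AutoModelForImageTextToText` and chat-template processing; the image URL is a placeholder.

```python
# Sketch: single-image VQA with SmolVLM2 via Transformers.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.bfloat16)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/cat.jpg"},  # placeholder URL
        {"type": "text", "text": "What is in this picture?"},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```

Video inputs follow the same chat-template pattern, with a `{"type": "video", "path": ...}` content entry in place of the image.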
## UForm Gen2 Qwen 500m (unum-cloud)

License: Apache-2.0

UForm-Gen is a small generative vision-language model used primarily for image captioning and visual question answering.

Tags: Image-to-Text, Transformers, English
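The checkpoint loads through Transformers' remote-code path; the sketch below follows the call pattern documented on the model card, but treat the exact generate arguments as assumptions and re-check there.

```python
# Rough sketch: image captioning with UForm-Gen2 (trust_remote_code model).
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "unum-cloud/uform-gen2-qwen-500m"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("photo.jpg")  # placeholder path
inputs = processor(text=["Describe the image."], images=[image], return_tensors="pt")

with torch.inference_mode():
    output = model.generate(**inputs, do_sample=False, max_new_tokens=128)

# Drop the prompt tokens, keep only the newly generated caption.
prompt_len = inputs["input_ids"].shape[1]
print(processor.batch_decode(output[:, prompt_len:], skip_special_tokens=True)[0])
```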
## GLaMM FullScope (MBZUAI)

License: Apache-2.0

GLaMM-FullScope is a multimodal large model that integrates all of GLaMM's capabilities, including grounded conversation generation, referring-expression segmentation, region-level captioning, image-level caption generation, and visual question answering.

Tags: Image-to-Text, Transformers
## Yi VL 6B (01-ai)

License: Apache-2.0

Yi-VL is an open-source multimodal vision-language model from 01.AI that supports Chinese-English image-text dialogue and performs strongly on the MMMU and CMMMU benchmarks.

Tags: Image-to-Text, PyTorch
## BLIP2 OPT 2.7B 8bit (Mediocreatmybest)

License: MIT

BLIP-2 is a vision-language pre-trained model that combines an image encoder with a large language model for image-to-text generation; this checkpoint is packaged for 8-bit loading.

Tags: Image-to-Text, Transformers, English
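A minimal 8-bit loading sketch with bitsandbytes. It targets the upstream `Salesforce/blip2-opt-2.7b` checkpoint, since the exact layout of this re-upload is not given in the listing; it requires `bitsandbytes` and `accelerate` to be installed.

```python
# Sketch: BLIP-2 OPT-2.7B in 8-bit, answering a question about an image.
import torch
from PIL import Image
from transformers import BitsAndBytesConfig, Blip2ForConditionalGeneration, Blip2Processor

model_id = "Salesforce/blip2-opt-2.7b"  # upstream checkpoint; the listed re-upload may differ
processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

image = Image.open("photo.jpg")  # placeholder path
prompt = "Question: what is shown in the image? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)

out = model.generate(**inputs, max_new_tokens=32)
print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())
```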
## BLIP2 Image To Text (paragon-AI)

License: MIT

BLIP-2 is a vision-language pre-trained model that bootstraps language-image pre-training from a frozen image encoder and a frozen large language model.

Tags: Image-to-Text, Transformers, English