# Vision-language model
## ViGoRL 7B Spatial
ViGoRL is a vision-language model fine-tuned with reinforcement learning to explicitly ground its textual reasoning steps to visual coordinates, enabling precise visual reasoning and localization.
Task: Text-to-Image · Library: Transformers · Author: gsarch · Downloads: 319 · Likes: 1
## V-JEPA 2 ViT-L FPC64 256
V-JEPA 2 is a video understanding model developed by Meta's FAIR team. It extends the V-JEPA pre-training objective and delivers state-of-the-art video understanding capabilities.
Task: Video Processing · Library: Transformers · License: MIT · Author: facebook · Downloads: 109 · Likes: 27
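As a video feature extractor, the checkpoint can be loaded through Transformers. A minimal sketch, assuming a recent transformers release with V-JEPA 2 support; the repo id is inferred from the listing:

```python
import torch
from transformers import AutoModel, AutoVideoProcessor

repo = "facebook/vjepa2-vitl-fpc64-256"  # repo id inferred from the listing
processor = AutoVideoProcessor.from_pretrained(repo)
model = AutoModel.from_pretrained(repo)

# Dummy clip of 64 RGB frames at 256x256; replace with real decoded video frames.
video = torch.randint(0, 256, (64, 3, 256, 256), dtype=torch.uint8)
inputs = processor(video, return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state  # patch-level video embeddings
```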
## Gemma 3 27B PT QAT Q4_0 GGUF
Gemma is a family of lightweight, state-of-the-art open models from Google, built on the same research and technology as the Gemini models. Gemma 3 is multimodal: it handles text and image inputs and generates text outputs. This checkpoint is the 27B pretrained (PT) model, quantization-aware trained (QAT) and packaged as a Q4_0 GGUF file.
Task: Image-to-Text · Author: google · Downloads: 633 · Likes: 24
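Because the checkpoint ships as a GGUF file, it runs under llama.cpp-compatible runtimes rather than Transformers. A minimal sketch using the llama-cpp-python bindings; the local file name is an assumption:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Path to the downloaded GGUF file (name assumed from the listing).
llm = Llama(model_path="gemma-3-27b-pt-qat-q4_0.gguf", n_ctx=4096)

out = llm("The three primary colors are", max_tokens=32)
print(out["choices"][0]["text"])
```

Since this is the pretrained rather than instruction-tuned variant, plain text completion as above is the appropriate usage.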
## Qwen2.5-VL 3B Finetuned Cheque
A vision-language model purpose-built to extract structured financial information from cheque images, emitting JSON output with key fields such as cheque number, payee, amount, and issue date.
Task: Image-to-Text · Library: Transformers · Language: English · Author: AJNG · Downloads: 170 · Likes: 1
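Downstream code would consume the model's JSON output. A minimal parsing sketch; the exact field names here are hypothetical and should be checked against the model card:

```python
import json

# Hypothetical raw model output following the fields described above.
raw = '{"cheque_number": "001234", "payee": "Jane Smith", "amount": "1250.00", "issue_date": "2024-03-15"}'

record = json.loads(raw)
print(record["payee"], record["amount"])  # -> Jane Smith 1250.00
```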
## InternLM-XComposer2 Enhanced
A large vision-language model built on InternLM2, with strong text-image comprehension and composition capabilities.
Task: Text-to-Image · License: Other · Author: Coobiw · Downloads: 14 · Likes: 0
## xGen-MM-Vid Phi-3-Mini R v1.5 (32 Tokens, 8 Frames)
xGen-MM-Vid (BLIP-3-Video) is an efficient, compact vision-language model with an explicit temporal encoder, designed specifically for understanding video content.
Task: Video-to-Text · Format: Safetensors · Language: English · Author: Salesforce · Downloads: 441 · Likes: 3
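Salesforce's xGen-MM checkpoints are typically loaded through Transformers with remote code enabled. A sketch assuming this video variant follows the same pattern; the repo id is inferred from the listing and should be verified against the hub:

```python
from transformers import AutoModelForVision2Seq, AutoTokenizer

repo = "Salesforce/xgen-mm-vid-phi3-mini-r-v1.5-32tokens-8frames"  # inferred; verify on the hub
model = AutoModelForVision2Seq.from_pretrained(repo, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True, use_fast=False)
```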
## Pixtral 12B
Pixtral is a multimodal model based on the Mistral architecture that accepts image and text inputs and generates text outputs.
Task: Image-to-Text · Library: Transformers · Author: saujasv · Downloads: 2,168 · Likes: 0
## FARE2 CLIP
A vision-language model initialized from OpenAI's CLIP and further trained with unsupervised adversarial fine-tuning for improved robustness.
Task: Text-to-Image · License: MIT · Author: chs20 · Downloads: 543 · Likes: 2
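Because the model keeps CLIP's interface, it can serve as a drop-in CLIP replacement for zero-shot classification. A minimal sketch, assuming the checkpoint uses the standard CLIP layout; the repo id is inferred from the listing:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

repo = "chs20/fare2-clip"  # repo id inferred from the listing
model = CLIPModel.from_pretrained(repo)
processor = CLIPProcessor.from_pretrained(repo)

image = Image.open("pet.jpg").convert("RGB")
inputs = processor(text=["a dog", "a cat"], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)  # image-text match scores
```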
## BLIP Image Captioning Base MOCHa
Official checkpoint of the BLIP base model fine-tuned on MS-COCO with the MOCHa reinforcement-learning framework to mitigate open-vocabulary caption hallucination.
Task: Image-to-Text · Library: Transformers · License: MIT · Author: moranyanuka · Downloads: 88 · Likes: 1
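The checkpoint keeps BLIP's standard captioning interface, so loading with Transformers follows the usual BLIP pattern; the repo id is inferred from the listing:

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

repo = "moranyanuka/blip-image-captioning-base-mocha"  # repo id inferred from the listing
processor = BlipProcessor.from_pretrained(repo)
model = BlipForConditionalGeneration.from_pretrained(repo)

image = Image.open("photo.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```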
## MiniGPT-4 LLaMA 7B
MiniGPT-4 is a multimodal model that combines visual and language capabilities, built on the Vicuna language model.
Task: Text-to-Image · Library: Transformers · Author: wangrongsheng · Downloads: 1,777 · Likes: 18
## LLaVA 13B v0 (4-bit, 128g)
LLaVA is a multimodal model combining vision and language, based on the LLaMA architecture, supporting image understanding and dialogue generation. This checkpoint is quantized to 4 bits with a group size of 128.
Task: Text-to-Image · Library: Transformers · Author: wojtab · Downloads: 167 · Likes: 79
## ViT-Base Patch16 224 IN21k + GPT-2, Fine-tuned for Pokémon Descriptions
A vision-language model pairing a ViT image encoder with a GPT-2 text decoder, fine-tuned specifically for generating Pokémon descriptions.
Task: Text Generation · Library: Transformers · Author: tkarr · Downloads: 29 · Likes: 0
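ViT-encoder/GPT-2-decoder captioners like this one typically load through Transformers' VisionEncoderDecoderModel. A minimal sketch; the repo id is inferred from the listing:

```python
from PIL import Image
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

repo = "tkarr/vit-base-patch16-224-in21k-gpt2-finetuned-to-pokemon-descriptions"  # inferred
model = VisionEncoderDecoderModel.from_pretrained(repo)
image_processor = ViTImageProcessor.from_pretrained(repo)
tokenizer = AutoTokenizer.from_pretrained(repo)

image = Image.open("pokemon.png").convert("RGB")
pixel_values = image_processor(image, return_tensors="pt").pixel_values
ids = model.generate(pixel_values, max_new_tokens=50)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```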
## CLIPSeg RD16
A CLIP-based zero-shot and one-shot image segmentation model that supports both text and image prompts.
Task: Image Segmentation · Library: Transformers · License: Apache-2.0 · Author: CIDAS · Downloads: 5,256 · Likes: 0
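CLIPSeg has first-class Transformers support; each text prompt yields one segmentation logit map. A minimal sketch with the repo id inferred from the listing:

```python
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

repo = "CIDAS/clipseg-rd16"  # repo id inferred from the listing
processor = CLIPSegProcessor.from_pretrained(repo)
model = CLIPSegForImageSegmentation.from_pretrained(repo)

image = Image.open("scene.jpg").convert("RGB")
prompts = ["a dog", "the sky"]
inputs = processor(text=prompts, images=[image] * len(prompts),
                   padding="max_length", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # one low-resolution logit map per prompt
masks = torch.sigmoid(logits)
```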
## GroupViT GCC-YFCC
GroupViT is a vision-language model capable of zero-shot semantic segmentation over any given vocabulary of categories.
Task: Text-to-Image · Library: Transformers · Author: nvidia · Downloads: 3,473 · Likes: 6
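GroupViT is supported directly in Transformers; its image-text similarity scores drive the zero-shot grouping. A minimal sketch with the repo id inferred from the listing:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, GroupViTModel

repo = "nvidia/groupvit-gcc-yfcc"  # repo id inferred from the listing
processor = AutoProcessor.from_pretrained(repo)
model = GroupViTModel.from_pretrained(repo)

image = Image.open("street.jpg").convert("RGB")
labels = ["a photo of a car", "a photo of a tree"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # per-label similarity scores
```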