# Vision-language model
## ViGoRL 7B Spatial
ViGoRL is a vision-language model fine-tuned with reinforcement learning to explicitly ground its textual reasoning steps to visual coordinates, enabling precise visual reasoning and localization.
Task: Text-to-Image · Library: Transformers · Author: gsarch · Downloads: 319 · Likes: 1
## V-JEPA 2 ViT-L FPC64 256
V-JEPA 2 is a video understanding model developed by Meta's FAIR team. It extends the V-JEPA pre-training objective and delivers state-of-the-art video understanding capabilities.
Task: Video Processing · Library: Transformers · License: MIT · Author: facebook · Downloads: 109 · Likes: 27
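As a video feature extractor, the checkpoint can be loaded through Transformers. A minimal sketch, assuming a recent transformers release with V-JEPA 2 support; the repo id is inferred from the listing:

```python
import torch
from transformers import AutoModel, AutoVideoProcessor

repo = "facebook/vjepa2-vitl-fpc64-256"  # repo id inferred from the listing
processor = AutoVideoProcessor.from_pretrained(repo)
model = AutoModel.from_pretrained(repo)

# Dummy clip of 64 RGB frames at 256x256; replace with real decoded video frames.
video = torch.randint(0, 256, (64, 3, 256, 256), dtype=torch.uint8)
inputs = processor(video, return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state  # patch-level video embeddings
```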
## Gemma 3 27B PT QAT Q4_0 GGUF
Gemma is a family of lightweight, state-of-the-art open models from Google, built on the same research and technology as the Gemini models. Gemma 3 is multimodal: it handles text and image inputs and generates text outputs. This checkpoint is the 27B pretrained (PT) model, quantization-aware trained (QAT) and packaged as a Q4_0 GGUF file.
Task: Image-to-Text · Author: google · Downloads: 633 · Likes: 24
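Because the checkpoint ships as a GGUF file, it runs under llama.cpp-compatible runtimes rather than Transformers. A minimal sketch using the llama-cpp-python bindings; the local file name is an assumption:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Path to the downloaded GGUF file (name assumed from the listing).
llm = Llama(model_path="gemma-3-27b-pt-qat-q4_0.gguf", n_ctx=4096)

out = llm("The three primary colors are", max_tokens=32)
print(out["choices"][0]["text"])
```

Since this is the pretrained rather than instruction-tuned variant, plain text completion as above is the appropriate usage.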
## Qwen2.5-VL 3B Finetuned Cheque
A vision-language model purpose-built to extract structured financial information from cheque images, emitting JSON output with key fields such as cheque number, payee, amount, and issue date.
Task: Image-to-Text · Library: Transformers · Language: English · Author: AJNG · Downloads: 170 · Likes: 1
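Downstream code would consume the model's JSON output. A minimal parsing sketch; the exact field names here are hypothetical and should be checked against the model card:

```python
import json

# Hypothetical raw model output following the fields described above.
raw = '{"cheque_number": "001234", "payee": "Jane Smith", "amount": "1250.00", "issue_date": "2024-03-15"}'

record = json.loads(raw)
print(record["payee"], record["amount"])  # -> Jane Smith 1250.00
```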
## InternLM-XComposer2 Enhanced
A large vision-language model built on InternLM2, with strong text-image comprehension and composition capabilities.
Task: Text-to-Image · License: Other · Author: Coobiw · Downloads: 14 · Likes: 0
## xGen-MM-Vid Phi-3-Mini R v1.5 (32 Tokens, 8 Frames)
xGen-MM-Vid (BLIP-3-Video) is an efficient, compact vision-language model with an explicit temporal encoder, designed specifically for understanding video content.
Task: Video-to-Text · Format: Safetensors · Language: English · Author: Salesforce · Downloads: 441 · Likes: 3
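Salesforce's xGen-MM checkpoints are typically loaded through Transformers with remote code enabled. A sketch assuming this video variant follows the same pattern; the repo id is inferred from the listing and should be verified against the hub:

```python
from transformers import AutoModelForVision2Seq, AutoTokenizer

repo = "Salesforce/xgen-mm-vid-phi3-mini-r-v1.5-32tokens-8frames"  # inferred; verify on the hub
model = AutoModelForVision2Seq.from_pretrained(repo, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True, use_fast=False)
```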
## Pixtral 12B
Pixtral is a multimodal model based on the Mistral architecture that accepts image and text inputs and generates text outputs.
Task: Image-to-Text · Library: Transformers · Author: saujasv · Downloads: 2,168 · Likes: 0
## FARE2 CLIP
A vision-language model initialized from OpenAI's CLIP and further trained with unsupervised adversarial fine-tuning for improved robustness.
Task: Text-to-Image · License: MIT · Author: chs20 · Downloads: 543 · Likes: 2
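Because the model keeps CLIP's interface, it can serve as a drop-in CLIP replacement for zero-shot classification. A minimal sketch, assuming the checkpoint uses the standard CLIP layout; the repo id is inferred from the listing:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

repo = "chs20/fare2-clip"  # repo id inferred from the listing
model = CLIPModel.from_pretrained(repo)
processor = CLIPProcessor.from_pretrained(repo)

image = Image.open("pet.jpg").convert("RGB")
inputs = processor(text=["a dog", "a cat"], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)  # image-text match scores
```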
## BLIP Image Captioning Base MOCHa
Official checkpoint of the BLIP base model fine-tuned on MS-COCO with the MOCHa reinforcement-learning framework to mitigate open-vocabulary caption hallucination.
Task: Image-to-Text · Library: Transformers · License: MIT · Author: moranyanuka · Downloads: 88 · Likes: 1
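The checkpoint keeps BLIP's standard captioning interface, so loading with Transformers follows the usual BLIP pattern; the repo id is inferred from the listing:

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

repo = "moranyanuka/blip-image-captioning-base-mocha"  # repo id inferred from the listing
processor = BlipProcessor.from_pretrained(repo)
model = BlipForConditionalGeneration.from_pretrained(repo)

image = Image.open("photo.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```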
## MiniGPT-4 LLaMA 7B
MiniGPT-4 is a multimodal model that combines visual and language capabilities, built on the Vicuna language model.
Task: Text-to-Image · Library: Transformers · Author: wangrongsheng · Downloads: 1,777 · Likes: 18
## LLaVA 13B v0 (4-bit, 128g)
LLaVA is a multimodal model combining vision and language, based on the LLaMA architecture, supporting image understanding and dialogue generation. This checkpoint is quantized to 4 bits with a group size of 128.
Task: Text-to-Image · Library: Transformers · Author: wojtab · Downloads: 167 · Likes: 79
## ViT-Base Patch16 224 IN21k + GPT-2, Fine-tuned for Pokémon Descriptions
A vision-language model pairing a ViT image encoder with a GPT-2 text decoder, fine-tuned specifically for generating Pokémon descriptions.
Task: Text Generation · Library: Transformers · Author: tkarr · Downloads: 29 · Likes: 0
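ViT-encoder/GPT-2-decoder captioners like this one typically load through Transformers' VisionEncoderDecoderModel. A minimal sketch; the repo id is inferred from the listing:

```python
from PIL import Image
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

repo = "tkarr/vit-base-patch16-224-in21k-gpt2-finetuned-to-pokemon-descriptions"  # inferred
model = VisionEncoderDecoderModel.from_pretrained(repo)
image_processor = ViTImageProcessor.from_pretrained(repo)
tokenizer = AutoTokenizer.from_pretrained(repo)

image = Image.open("pokemon.png").convert("RGB")
pixel_values = image_processor(image, return_tensors="pt").pixel_values
ids = model.generate(pixel_values, max_new_tokens=50)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```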
## CLIPSeg RD16
A CLIP-based zero-shot and one-shot image segmentation model that supports both text and image prompts.
Task: Image Segmentation · Library: Transformers · License: Apache-2.0 · Author: CIDAS · Downloads: 5,256 · Likes: 0
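CLIPSeg has first-class Transformers support; each text prompt yields one segmentation logit map. A minimal sketch with the repo id inferred from the listing:

```python
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

repo = "CIDAS/clipseg-rd16"  # repo id inferred from the listing
processor = CLIPSegProcessor.from_pretrained(repo)
model = CLIPSegForImageSegmentation.from_pretrained(repo)

image = Image.open("scene.jpg").convert("RGB")
prompts = ["a dog", "the sky"]
inputs = processor(text=prompts, images=[image] * len(prompts),
                   padding="max_length", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # one low-resolution logit map per prompt
masks = torch.sigmoid(logits)
```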
## GroupViT GCC-YFCC
GroupViT is a vision-language model capable of zero-shot semantic segmentation over any given vocabulary of categories.
Task: Text-to-Image · Library: Transformers · Author: nvidia · Downloads: 3,473 · Likes: 6
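GroupViT is supported directly in Transformers; its image-text similarity scores drive the zero-shot grouping. A minimal sketch with the repo id inferred from the listing:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, GroupViTModel

repo = "nvidia/groupvit-gcc-yfcc"  # repo id inferred from the listing
processor = AutoProcessor.from_pretrained(repo)
model = GroupViTModel.from_pretrained(repo)

image = Image.open("street.jpg").convert("RGB")
labels = ["a photo of a car", "a photo of a tree"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # per-label similarity scores
```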