# Multimodal Text Generation

Vintern 1B V3 5 GGUF Ext

Vintern-1B-v3_5 is a 1-billion-parameter vision-language model supporting image-text generation tasks.

Mistral Small 3.1 24B Instruct 2503 GGUF

This is a vision-enhanced version based on Mistral-Small-3.1-24B-Instruct-2503, supporting image-to-text generation tasks.

Gemma 3 4b It Int8 Asym Ov

A 4-billion-parameter Gemma 3 model optimized with OpenVINO, supporting text-to-text and image-text inference.

Gemma 3 4b It Llamafile

Gemma 3 is a lightweight open-source model series launched by Google, built on Gemini technology, supporting multimodal input and text output.

Gemma 3 1b Pt Qat Q4 0 Gguf

Gemma is a family of lightweight, cutting-edge open models from Google, built on the same research and technology as the Gemini models. The 1B version is a pretrained base model in GGUF format with Quantization-Aware Training (QAT).

Qwen2 VL 7B Latex OCR

A fine-tuned version of the Qwen2-VL-7B model, trained using Unsloth and the Hugging Face TRL library, with a reported 2x inference speed improvement.

Llava NeXT Video 34B DPO

Llama 2 is a series of open-source large language models developed by Meta, supporting various natural language processing tasks.

ko-deplot is a Korean visual question answering model based on Google's Pix2Struct architecture, fine-tuned from the Deplot model, supporting chart image question-answering tasks in Korean and English.

© 2025 AIbase