
llm-jp-3-vila-14b

Developed by llm-jp
A large-scale vision-language model developed by the llm-jp project at Japan's National Institute of Informatics. It supports Japanese and English, with strong image-understanding and text-generation capabilities.
Downloads 106
Release Date: 10/26/2024

Model Overview

This is a vision-language model that combines a visual encoder with a large language model. It understands image content and can generate relevant textual descriptions or answer questions about what it sees.
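To make the architecture concrete, here is a minimal, purely illustrative sketch of how such a model wires a visual encoder into an LLM through a projection layer. None of this is the model's actual code: the function names, the toy "pixel" input, and the arithmetic are all hypothetical stand-ins for the real components.

```python
# Illustrative sketch (NOT the actual model code) of a vision-language
# pipeline: image -> visual encoder -> projection -> LLM.
# All names and computations here are hypothetical.

def visual_encoder(image_pixels):
    """Stand-in for the visual encoder: maps an image to patch features.
    Here each group of 4 toy "pixels" becomes one patch feature."""
    return [sum(image_pixels[i:i + 4]) / 4.0
            for i in range(0, len(image_pixels), 4)]

def projection(patch_features, scale=0.5):
    """Stand-in for the projection layer that maps visual features
    into the LLM's embedding space."""
    return [f * scale for f in patch_features]

def llm_generate(visual_tokens, prompt_tokens):
    """Stand-in for the language model: consumes the projected image
    embeddings together with the text prompt."""
    return (f"description based on {len(visual_tokens)} visual tokens "
            f"and prompt {prompt_tokens!r}")

image = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]  # toy "pixels"
visual_tokens = projection(visual_encoder(image))
answer = llm_generate(visual_tokens, ["What", "is", "this?"])
print(answer)
```

The point of the sketch is the data flow: visual features are projected into the same embedding space as text tokens, so the LLM can attend over image and prompt jointly.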

Model Features

Multilingual Support
Supports both Japanese and English for vision-language understanding and generation
Three-Stage Training
Adopts a phased training strategy: first training only the projection layer, then jointly training the projection layer and the LLM, and finally fine-tuning
High-Performance Visual Encoder
Uses siglip-so400m-patch14-384 as the visual encoder, providing powerful image understanding capabilities
Leading Evaluation
Outperforms comparable models on multiple Japanese vision-language benchmarks
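The three-stage schedule above can be sketched as a freeze/unfreeze plan over the model's components. This is an assumption-laden toy: the stage names and module names are invented for illustration, and it assumes the visual encoder stays frozen throughout, which the description above does not specify.

```python
# Hypothetical sketch of the three-stage training schedule: which
# modules are trainable (True) vs frozen (False) at each stage.
# Stage and module names are illustrative, not the project's config.

ALL_MODULES = ["visual_encoder", "projection", "llm"]

STAGES = [
    {"name": "stage1_projection_only", "trainable": {"projection"}},
    {"name": "stage2_joint",           "trainable": {"projection", "llm"}},
    {"name": "stage3_finetune",        "trainable": {"projection", "llm"}},
]

def freeze_plan(stage):
    """Return a module -> trainable? map for one training stage."""
    return {m: m in stage["trainable"] for m in ALL_MODULES}

for stage in STAGES:
    print(stage["name"], freeze_plan(stage))
```

Training only the projection first aligns visual features with the frozen LLM's embedding space cheaply; later stages then adapt the LLM itself.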

Model Capabilities

Image content understanding
Image caption generation
Visual question answering
Multimodal dialogue

Use Cases

Content Understanding & Generation
Image Captioning
Generates detailed textual descriptions for images
Achieved 57.2% LLM score on the Heron benchmark
Visual Question Answering
Answers natural language questions about image content
Achieved 3.62/5.0 LLM score on JA-VG-VQA500 test
Multimodal Applications
Image-Text Dialogue
Engages in natural language dialogue based on image content
Achieved 3.69/5.0 LLM score on JA-VLM in-the-wild benchmark