# High-Precision Image Understanding
## LLaDA-V

LLaDA-V is a diffusion-based vision-language model that outperforms other diffusion multimodal large language models.

Image-to-Text · Safetensors · GSAI-ML
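The listing doesn't show how to load the checkpoint. As a rough sketch only: the related GSAI-ML LLaDA text checkpoints load through `transformers` with `trust_remote_code=True`, so the snippet below assumes LLaDA-V follows the same pattern; the repo id `GSAI-ML/LLaDA-V` is inferred from the listing, and the actual sampling loop comes from the project's own code, not `generate()`.

```python
# Hedged loading sketch for LLaDA-V. Assumptions: repo id
# "GSAI-ML/LLaDA-V" and that the checkpoint ships custom modeling
# code, as the related GSAI-ML LLaDA text checkpoints do.
import torch
from transformers import AutoModel, AutoTokenizer

repo = "GSAI-ML/LLaDA-V"  # assumed repo id, for illustration

tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModel.from_pretrained(
    repo,
    trust_remote_code=True,    # diffusion LMs rely on custom modeling code
    torch_dtype=torch.bfloat16,
).eval()

# Note: diffusion language models decode by iteratively denoising masked
# tokens rather than autoregressive generate(); see the model card for
# the project's own sampling loop.
```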
## InternVL3-8B-bf16

InternVL3-8B-bf16 is an MLX-format conversion of the InternVL3-8B vision-language model, supporting multilingual image-to-text tasks.

Image-to-Text · Transformers · License: Other · mlx-community
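MLX conversions from mlx-community typically run through the `mlx-vlm` package on Apple silicon. The sketch below assumes that standard `load`/`generate` flow applies here; the repo id `mlx-community/InternVL3-8B-bf16` is inferred from the listing, so check the model card for the exact usage.

```python
# Sketch of image-to-text inference via the mlx-vlm package
# (assumption: this MLX conversion loads through mlx-vlm's
# standard load/generate API).
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

repo = "mlx-community/InternVL3-8B-bf16"  # repo id inferred from the listing

model, processor = load(repo)
config = load_config(repo)

images = ["example.jpg"]  # local path or URL to the input image
prompt = apply_chat_template(
    processor, config, "Describe this image.", num_images=len(images)
)

text = generate(model, processor, prompt, images, verbose=False)
print(text)
```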
## Sarashina2-Vision-14B

Sarashina2-Vision-14B is a large Japanese vision-language model developed by SB Intuitions. It combines Sarashina2-13B with the image encoder of Qwen2-VL-7B and achieves strong results on multiple benchmarks.

Image-to-Text · Transformers · Multilingual · License: MIT · sbintuitions
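Since the card tags the model with Transformers, a minimal sketch follows, assuming the checkpoint uses the common `AutoProcessor`/`AutoModelForCausalLM` pattern with `trust_remote_code`; the repo id `sbintuitions/sarashina2-vision-14b` is inferred from the listing.

```python
# Hedged sketch of image-to-text with Sarashina2-Vision-14B
# (assumption: standard transformers AutoProcessor /
# AutoModelForCausalLM loading with trust_remote_code).
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

repo = "sbintuitions/sarashina2-vision-14b"  # repo id inferred from the listing

processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

image = Image.open("example.jpg")
messages = [{"role": "user", "content": "Describe this image in Japanese."}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens after the prompt.
print(processor.batch_decode(
    out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```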