# Visual question answering optimization
## Phi 4 Multimodal Instruct
License: MIT
Phi-4-multimodal-instruct is a lightweight open-source multimodal foundation model that builds on the language, vision, and speech research and datasets used for the Phi-3.5 and 4.0 models. It accepts text, image, and audio inputs and generates text outputs, with a 128K-token context length. A minimal VQA usage sketch follows this entry.
Task: Text-to-Audio · Library: Transformers · Languages: multiple

by microsoft · 584.02k downloads · 1,329 likes
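
Given this article's focus on visual question answering, here is a minimal sketch of asking Phi-4-multimodal-instruct a question about an image through the Transformers library. The `<|user|>…<|image_1|>…<|end|><|assistant|>` prompt pattern, the remote-code loading flags, and the placeholder image URL are assumptions based on the model card's documented usage at the time of writing; treat the current model card as authoritative.

```python
# Minimal VQA sketch for microsoft/Phi-4-multimodal-instruct.
# Assumes recent `transformers`, `Pillow`, and `requests`; prompt format per model card.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",            # needs `accelerate`; remove for plain CPU loading
    attn_implementation="eager",  # avoids a hard dependency on flash-attn
)

# Placeholder image URL; any RGB image works for a quick smoke test.
image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)
prompt = "<|user|><|image_1|>What is shown in this image?<|end|><|assistant|>"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```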
## Spec Vision V1
License: MIT
Spec-Vision-V1 is a lightweight, state-of-the-art open-source multimodal model designed for deep integration of visual and textual data, supporting a 128K context length.
Task: Text-to-Image · Library: Transformers · Languages: other

by SVECTOR-CORPORATION · 17 downloads · 1 like
## H2ovl Mississippi 2b
License: Apache-2.0
H2OVL-Mississippi-2B is a high-performance, general-purpose vision-language model developed by H2O.ai for a wide range of multimodal tasks. With 2 billion parameters, it performs strongly on image captioning, visual question answering (VQA), and document understanding. A minimal VQA query sketch follows this entry.
Task: Image-to-Text · Library: Transformers · Language: English

by h2oai · 91.28k downloads · 34 likes
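
As a rough illustration of pointing a small vision-language model at a document-style VQA prompt, the sketch below uses the generic `image-text-to-text` pipeline available in recent Transformers releases. The repo id, the image URL, and the assumption that this model works through the generic pipeline (rather than requiring its own remote-code chat interface) are all unverified; follow the model card for the supported invocation.

```python
# Hedged VQA sketch via the generic image-text-to-text pipeline (recent transformers).
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="h2oai/h2ovl-mississippi-2b",  # assumed repo id
    trust_remote_code=True,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/invoice.png"},  # placeholder
            {"type": "text", "text": "What is the total amount on this invoice?"},
        ],
    }
]

outputs = pipe(text=messages, max_new_tokens=64)
print(outputs[0]["generated_text"])  # the assistant turn holds the model's answer
```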
## Tvl Mini 0.1
License: Apache-2.0
Tvl Mini 0.1 is a LoRA fine-tune of the Qwen2-VL-2B model targeting Russian, supporting multimodal tasks. A loading sketch follows this entry.
Task: Image-to-Text · Library: Transformers · Languages: multiple

by 2Vasabi · 23 downloads · 2 likes
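
Because this entry describes a LoRA adaptation rather than a full checkpoint, one plausible way to use it is to attach the adapter to the Qwen2-VL-2B base with PEFT, as sketched below. Both repository ids are assumptions (the catalog does not state them), and if the published weights are already merged, the repo could instead be loaded directly with `Qwen2VLForConditionalGeneration.from_pretrained`.

```python
# Sketch: attach an assumed LoRA adapter to the Qwen2-VL-2B base via PEFT.
from peft import PeftModel
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

base_id = "Qwen/Qwen2-VL-2B-Instruct"  # assumed base checkpoint
adapter_id = "2Vasabi/tvl-mini-0.1"    # assumed adapter repo id

processor = AutoProcessor.from_pretrained(base_id)
base_model = Qwen2VLForConditionalGeneration.from_pretrained(
    base_id, torch_dtype="auto", device_map="auto"
)
model = PeftModel.from_pretrained(base_model, adapter_id)

# Optionally fold the adapter into the base weights for faster inference.
model = model.merge_and_unload()
```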
## Eilev Blip2 Flan T5 Xl
License: MIT
A vision-language model optimized for first-person (egocentric) video, trained with the EILEV method to elicit in-context learning capabilities.
Task: Image-to-Text · Library: Transformers · Language: English

by kpyu · 135 downloads · 1 like