The Best 68 Video-to-Text Tools in 2025

Llava Video 7B Qwen2
Apache-2.0
The LLaVA-Video model is a 7B-parameter multimodal model based on the Qwen2 language model, specializing in video understanding tasks and supporting 64-frame video input.
Video-to-Text Transformers English
lmms-lab
34.28k
91
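Since the entry above quotes a 64-frame input budget, here is a minimal sketch of uniform frame sampling, the usual way an arbitrary-length clip is squeezed into a fixed frame budget before being handed to a model like this. The OpenCV-based helper and its defaults are illustrative assumptions, not code from the LLaVA-Video repository.

```python
# Minimal sketch: uniformly sample a fixed number of frames from a clip.
# Requires opencv-python and numpy; the 64-frame default mirrors the entry above.
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 64) -> np.ndarray:
    """Return up to `num_frames` RGB frames spaced evenly across the clip."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Pick evenly spaced frame indices across the whole clip.
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)  # shape: (num_frames, H, W, 3), dtype uint8
```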
Llava NeXT Video 7B DPO Hf
LLaVA-NeXT-Video is an open-source multimodal chatbot optimized through mixed training on video and image data, possessing excellent video understanding capabilities.
Video-to-Text Transformers English
llava-hf
12.61k
9
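Because this is the `-hf` (Transformers-compatible) release, it can in principle be driven with the stock `LlavaNextVideoProcessor` / `LlavaNextVideoForConditionalGeneration` classes (available in transformers >= 4.42). The sketch below outlines that pattern; the repo id is inferred from the listing, the prompt and generation settings are illustrative, and it reuses the `sample_frames` helper sketched above.

```python
# Hedged sketch: caption a clip with the Transformers-compatible checkpoint.
# The chat message and max_new_tokens are illustrative, not from the model card.
import torch
from transformers import LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor

model_id = "llava-hf/LLaVA-NeXT-Video-7B-DPO-hf"  # repo id inferred from the listing
processor = LlavaNextVideoProcessor.from_pretrained(model_id)
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

video = sample_frames("clip.mp4", num_frames=32)  # (T, H, W, 3) frames, helper above

conversation = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe what happens in this video."},
        {"type": "video"},
    ],
}]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=prompt, videos=video, return_tensors="pt").to(
    model.device, torch.float16
)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```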
Internvideo2 5 Chat 8B
Apache-2.0
InternVideo2.5 is a video multimodal large language model enhanced with Long and Rich Context (LRC) modeling, built upon InternVL2.5. It significantly improves on existing MLLMs by enhancing the ability to perceive fine-grained details and capture long-term temporal structures.
Video-to-Text Transformers English
OpenGVLab
8,265
60
Cogvlm2 Llama3 Caption
Other
CogVLM2-Caption is a video caption generation model used to generate training data for the CogVideoX model.
Video-to-Text Transformers English
THUDM
7,493
95
Spacetimegpt
SpaceTimeGPT is a video description generation model capable of spatial and temporal reasoning, analyzing video frames and generating sentences that describe video events.
Video-to-Text Transformers English
Neleac
2,877
33
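The description above matches a standard vision-encoder/text-decoder captioning setup. A hedged sketch of that pattern follows, using the generic transformers `VisionEncoderDecoderModel` interface; the repo id is inferred from the listing, and the image-processor and tokenizer checkpoints chosen here are assumptions to verify against the model card.

```python
# Hedged sketch: encoder-decoder video captioning via VisionEncoderDecoderModel.
# The auxiliary image-processor/tokenizer checkpoints below are assumptions.
import torch
from transformers import AutoImageProcessor, AutoTokenizer, VisionEncoderDecoderModel

repo = "Neleac/SpaceTimeGPT"  # repo id inferred from the listing
model = VisionEncoderDecoderModel.from_pretrained(repo)
image_processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base")  # assumption
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumption: GPT-2-style decoder vocab

frames = sample_frames("clip.mp4", num_frames=8)   # helper sketched earlier in this list
pixel_values = image_processor(list(frames), return_tensors="pt").pixel_values

ids = model.generate(pixel_values, max_new_tokens=40)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```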
Video R1 7B
Apache-2.0
Video-R1-7B is a multimodal large language model optimized based on Qwen2.5-VL-7B-Instruct, focusing on video reasoning tasks, capable of understanding video content and answering related questions.
Video-to-Text Transformers English
Video-R1
2,129
9
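Since Video-R1-7B is described as a fine-tune of Qwen2.5-VL-7B-Instruct, video question answering with it should, assuming the fine-tune keeps the base architecture and processor, follow the standard Qwen2.5-VL inference pattern. The sketch below outlines that; the repo id, file path, frame rate, and question are illustrative.

```python
# Hedged sketch: video QA with a Qwen2.5-VL-style checkpoint. Assumes the fine-tune
# keeps the base architecture; repo id inferred from the listing.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

repo = "Video-R1/Video-R1-7B"  # assumption: org/name as shown in the listing
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(repo)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/clip.mp4", "fps": 1.0},
        {"type": "text", "text": "What happens in this video, and why?"},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
_, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], videos=video_inputs, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
answer = processor.batch_decode(
    out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```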
Internvl 2 5 HiCo R16
Apache-2.0
InternVideo2.5 is a video multimodal large language model (MLLM) built upon InternVL2.5, enhanced with Long and Rich Context (LRC) modeling, capable of perceiving fine-grained details and capturing long-term temporal structures.
Video-to-Text Transformers English
OpenGVLab
1,914
3
Videollm Online 8b V1plus
MIT
VideoLLM-online is a multimodal large language model based on Llama-3-8B-Instruct, focusing on online video understanding and video-text generation tasks.
Video-to-Text English
chenjoya
1,688
23
Videochat R1 7B
Apache-2.0
VideoChat-R1_7B is a multimodal video understanding model based on Qwen2.5-VL-7B-Instruct, capable of processing video and text inputs and generating text outputs.
Video-to-Text Transformers English
OpenGVLab
1,686
7
Qwen2.5 Vl 7b Cam Motion Preview
Other
A camera motion analysis model fine-tuned from Qwen2.5-VL-7B-Instruct, focusing on camera motion classification in videos and on video-text retrieval tasks.
Video-to-Text Transformers
chancharikm
1,456
10
Mambavision B 1K
Apache-2.0
MambaVision-B-1K is NVIDIA's hybrid Mamba-Transformer vision backbone pretrained on ImageNet-1K, combining Mamba-style state-space blocks with self-attention to serve as a general-purpose visual feature extractor.
Video-to-Text Transformers
nvidia
1,082
11
Longvu Llama3 2 3B
Apache-2.0
LongVU is a spatio-temporal adaptive compression technology for long video language understanding, designed to efficiently process long video content.
Video-to-Text PyTorch
Vision-CAIR
1,079
7
Videochat Flash Qwen2 5 2B Res448
Apache-2.0
VideoChat-Flash-2B is a multimodal model built upon UMT-L (300M) and Qwen2.5-1.5B, supporting video-to-text tasks with only 16 tokens per frame and extending the context window to 128k.
Video-to-Text Transformers English
OpenGVLab
904
18
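The 16-tokens-per-frame figure and the 128k context window quoted above imply a rough upper bound on how many frames fit in one sequence; the back-of-envelope calculation below makes that explicit (the reserve for text tokens is an illustrative assumption).

```python
# Back-of-envelope: frame budget implied by 16 tokens/frame in a 128k context.
context_window = 128_000   # tokens, per the entry above
tokens_per_frame = 16      # per the entry above
text_reserve = 2_000       # illustrative reserve for the prompt and the answer

max_frames = (context_window - text_reserve) // tokens_per_frame
print(max_frames)  # 7875 -> on the order of the ~10,000-frame figure quoted for
                   # the 7B VideoChat-Flash variant later in this list
```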
Vamba Qwen2 VL 7B
MIT
Vamba is a hybrid Mamba-Transformer architecture that achieves efficient long video understanding through cross-attention layers and Mamba-2 modules.
Video-to-Text Transformers
TIGER-Lab
806
16
Videochat R1 Thinking 7B
Apache-2.0
VideoChat-R1-thinking_7B is a multimodal model based on Qwen2.5-VL-7B-Instruct, focusing on video-text-to-text tasks.
Video-to-Text Transformers English
OpenGVLab
800
0
Videochat Flash Qwen2 7B Res448
Apache-2.0
VideoChat-Flash-7B is a multimodal model built upon UMT-L (300M) and Qwen2-7B, using only 16 tokens per frame and supporting input sequences of up to approximately 10,000 frames.
Video-to-Text Transformers English
OpenGVLab
661
12
Tarsier 7b
Tarsier-7b is an open-source large-scale video-language model from the Tarsier series, specializing in generating high-quality video descriptions with excellent general video understanding capabilities.
Video-to-Text Transformers
omni-research
635
23
Internvideo2 Stage2 6B
MIT
InternVideo2 is a multimodal video understanding model with 6B parameters, focusing on video content analysis and comprehension tasks.
Video-to-Text Safetensors
OpenGVLab
542
0
Internvideo2 Chat 8B
MIT
InternVideo2-Chat-8B is a video understanding model that combines a large language model (LLM) with VideoBLIP, built through a progressive learning scheme and capable of video semantic understanding and human-computer interaction.
Video-to-Text Transformers English
OpenGVLab
492
22
Llava Video 7B Qwen2 TPO
MIT
LLaVA-Video-7B-Qwen2-TPO is a video understanding model based on LLaVA-Video-7B-Qwen2 with temporal preference optimization, demonstrating excellent performance across multiple benchmarks.
Video-to-Text Transformers
ruili0
490
1
Longvu Llama3 2 1B
Apache-2.0
LongVU is a spatio-temporal adaptive compression technology designed for long video language understanding, aiming to efficiently process long video content and enhance language comprehension.
Video-to-Text PyTorch
Vision-CAIR
465
11
Video Blip Opt 2.7b Ego4d
MIT
VideoBLIP is an enhanced version of BLIP-2 capable of processing video data, using OPT-2.7b as the language model backbone.
Video-to-Text Transformers English
kpyu
429
16
Xgen Mm Vid Phi3 Mini R V1.5 128tokens 8frames
xGen-MM-Vid (BLIP-3-Video) is an efficient, compact vision-language model equipped with an explicit temporal encoder, designed specifically for video content understanding.
Video-to-Text Safetensors English
Salesforce
398
11
Videochat2 HD Stage4 Mistral 7B Hf
MIT
VideoChat2-HD-hf is a multimodal video understanding model based on Mistral-7B, focusing on video-to-text conversion tasks.
Video-to-Text Safetensors
OpenGVLab
393
3
Skycaptioner V1
Apache-2.0
SkyCaptioner-V1 is a model specifically designed for generating high-quality structured descriptions of video data. By integrating specialized sub-expert models, multimodal large language models, and manual annotations, it addresses the limitations of general description models in capturing professional film details.
Video-to-Text Transformers
Skywork
362
29
Sharecaptioner Video
An open-source video caption generator fine-tuned on GPT-4V-annotated data, supporting videos of various durations, aspect ratios, and resolutions.
Video-to-Text Transformers
Lin-Chen
264
17
Internvl 2 5 HiCo R64
Apache-2.0
A video multimodal large language model enhanced by Long and Rich Context (LRC) modeling, improving on existing MLLMs through better perception of fine-grained details and capture of long-term temporal structures.
Video-to-Text Transformers English
OpenGVLab
252
2
Longvu Qwen2 7B
Apache-2.0
LongVU is a multimodal model based on Qwen2-7B, focusing on long video language understanding tasks and employing spatio-temporal adaptive compression technology.
Video-to-Text
Vision-CAIR
230
69
Longva 7B TPO
MIT
LongVA-7B-TPO is a video-text model derived from LongVA-7B through temporal preference optimization, excelling in long video understanding tasks.
Video-to-Text Transformers
ruili0
225
1
Llavaction 0.5B
LLaVAction is a multimodal large language model for action recognition, based on the Qwen2 language model, trained on the EPIC-KITCHENS-100-MQA dataset.
Video-to-Text Transformers English
MLAdaptiveIntelligence
215
1
Llava NeXT Video 34B DPO
LLaVA-NeXT-Video-34B-DPO is the 34B model in the open-source LLaVA-NeXT-Video series, trained on mixed video and image data and further tuned with direct preference optimization (DPO) for improved video understanding.
Video-to-Text Transformers
lmms-lab
214
10
Videomind 2B
BSD-3-Clause
VideoMind is a multimodal agent framework that enhances video reasoning capabilities by simulating human thought processes (such as task decomposition, moment localization & verification, and answer synthesis).
Video-to-Text
yeliudev
207
1
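The roles named above (task decomposition, moment localization and verification, answer synthesis) describe a control flow rather than a single forward pass. The sketch below is purely conceptual pseudocode of that flow; every function in it is a hypothetical placeholder, not VideoMind's actual API.

```python
# Conceptual sketch of a plan -> ground -> verify -> answer loop as described above.
# All callables here are hypothetical placeholders, not VideoMind's API.
from typing import Callable, List, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)

def videomind_style_answer(
    video: str,
    question: str,
    plan: Callable[[str], List[str]],            # decompose the question into steps
    ground: Callable[[str, str], Span],          # localize a moment for a sub-question
    verify: Callable[[str, Span, str], bool],    # check the localized moment
    answer: Callable[[str, Span, str], str],     # answer from a (verified) span
) -> str:
    for sub_question in plan(question):
        span = ground(video, sub_question)
        if verify(video, span, sub_question):
            return answer(video, span, question)
    # Fall back to the whole video if no localized moment passes verification.
    return answer(video, (0.0, float("inf")), question)
```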
Internvideo2 Chat 8B HD
MIT
InternVideo2-Chat-8B-HD is a video understanding model that combines a large language model and VideoBLIP. It is constructed through a progressive learning scheme and can handle high-definition video input.
Video-to-Text Safetensors
OpenGVLab
190
16
Slowfast Video Mllm Qwen2 7b Convnext 576 Frame64 S1t4
A video multimodal large language model with a slow-fast architecture that balances temporal resolution against spatial detail, supporting 64-frame video understanding.
Video-to-Text Transformers
shi-labs
184
0
Timezero Charades 7B
TimeZero is a reasoning-guided large vision-language model (LVLM) specifically designed for temporal video grounding (TVG) tasks. It identifies temporal segments in videos corresponding to natural language queries through reinforcement learning methods.
Video-to-Text Transformers
wwwyyy
183
0
Videollama2.1 7B 16F Base
Apache-2.0
VideoLLaMA2.1 is an upgraded version of VideoLLaMA2, focusing on enhancing spatiotemporal modeling and audio understanding capabilities in large video-language models.
Video-to-Text Transformers English
DAMO-NLP-SG
179
1
Kangaroo
Apache-2.0
Kangaroo is a powerful multimodal large language model specifically designed for long video understanding, supporting bilingual dialogue (Chinese-English) and long video inputs.
Video-to-Text Transformers Supports Multiple Languages
KangarooGroup
163
12
Llavaction 7B
LLaVAction is a multimodal large language model evaluation and training framework for action recognition, based on the Qwen2 language model architecture, supporting first-person perspective video understanding.
Video-to-Text Transformers English
MLAdaptiveIntelligence
149
1
Timezero ActivityNet 7B
TimeZero is a reasoning-guided large-scale vision-language model (LVLM) specifically designed for temporal video grounding (TVG) tasks, achieving dynamic video-language relationship analysis through reinforcement learning methods.
Video-to-Text Transformers
wwwyyy
142
1
Tinyllava Video R1
Apache-2.0
TinyLLaVA-Video-R1 is a small-scale video reasoning model built on TinyLLaVA-Video, a base model with fully traceable training. Reinforcement learning significantly improves its reasoning and thinking abilities and elicits emergent 'aha moment' behavior.
Video-to-Text Transformers
Zhang199
123
2
Tarsier 34b
Apache-2.0
Tarsier-34b is an open-source large-scale video-language model focused on generating high-quality video captions, achieving leading results on multiple public benchmarks.
Video-to-Text Transformers
omni-research
103
17
TEMPURA Qwen2.5 VL 3B S2
TEMPURA is a vision-language model capable of reasoning causal event relationships and generating fine-grained timestamp descriptions for unedited videos.
Video-to-Text Transformers
andaba
102
1