The Best 68 Video-to-Text Tools in 2025

Llava Video 7B Qwen2
Apache-2.0
The LLaVA-Video model is a 7B-parameter multimodal model based on the Qwen2 language model, specializing in video understanding tasks and supporting 64-frame video input.
Video-to-Text Transformers English
lmms-lab
34.28k
91
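Since the entry above quotes a 64-frame input budget, here is a minimal sketch of uniform frame sampling, the usual way an arbitrary-length clip is squeezed into a fixed frame budget before being handed to a model like this. The OpenCV-based helper and its defaults are illustrative assumptions, not code from the LLaVA-Video repository.

```python
# Minimal sketch: uniformly sample a fixed number of frames from a clip.
# Requires opencv-python and numpy; the 64-frame default mirrors the entry above.
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 64) -> np.ndarray:
    """Return up to `num_frames` RGB frames spaced evenly across the clip."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Pick evenly spaced frame indices across the whole clip.
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)  # shape: (num_frames, H, W, 3), dtype uint8
```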
Llava NeXT Video 7B DPO Hf
LLaVA-NeXT-Video is an open-source multimodal chatbot optimized through mixed training on video and image data, possessing excellent video understanding capabilities.
Video-to-Text Transformers English
llava-hf
12.61k
9
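Because this is the `-hf` (Transformers-compatible) release, it can in principle be driven with the stock `LlavaNextVideoProcessor` / `LlavaNextVideoForConditionalGeneration` classes (available in transformers >= 4.42). The sketch below outlines that pattern; the repo id is inferred from the listing, the prompt and generation settings are illustrative, and it reuses the `sample_frames` helper sketched above.

```python
# Hedged sketch: caption a clip with the Transformers-compatible checkpoint.
# The chat message and max_new_tokens are illustrative, not from the model card.
import torch
from transformers import LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor

model_id = "llava-hf/LLaVA-NeXT-Video-7B-DPO-hf"  # repo id inferred from the listing
processor = LlavaNextVideoProcessor.from_pretrained(model_id)
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

video = sample_frames("clip.mp4", num_frames=32)  # (T, H, W, 3) frames, helper above

conversation = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe what happens in this video."},
        {"type": "video"},
    ],
}]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=prompt, videos=video, return_tensors="pt").to(
    model.device, torch.float16
)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```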
Internvideo2 5 Chat 8B
Apache-2.0
InternVideo2.5 is a video multimodal large language model enhanced with Long and Rich Context (LRC) modeling, built upon InternVL2.5. It significantly improves on existing MLLMs by enhancing the ability to perceive fine-grained details and capture long-term temporal structures.
Video-to-Text Transformers English
OpenGVLab
8,265
60
Cogvlm2 Llama3 Caption
Other
CogVLM2-Caption is a video caption generation model used to generate training data for the CogVideoX model.
Video-to-Text Transformers English
THUDM
7,493
95
Spacetimegpt
SpaceTimeGPT is a video description generation model capable of spatial and temporal reasoning, analyzing video frames and generating sentences that describe video events.
Video-to-Text Transformers English
Neleac
2,877
33
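The description above matches a standard vision-encoder/text-decoder captioning setup. A hedged sketch of that pattern follows, using the generic transformers `VisionEncoderDecoderModel` interface; the repo id is inferred from the listing, and the image-processor and tokenizer checkpoints chosen here are assumptions to verify against the model card.

```python
# Hedged sketch: encoder-decoder video captioning via VisionEncoderDecoderModel.
# The auxiliary image-processor/tokenizer checkpoints below are assumptions.
import torch
from transformers import AutoImageProcessor, AutoTokenizer, VisionEncoderDecoderModel

repo = "Neleac/SpaceTimeGPT"  # repo id inferred from the listing
model = VisionEncoderDecoderModel.from_pretrained(repo)
image_processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base")  # assumption
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumption: GPT-2-style decoder vocab

frames = sample_frames("clip.mp4", num_frames=8)   # helper sketched earlier in this list
pixel_values = image_processor(list(frames), return_tensors="pt").pixel_values

ids = model.generate(pixel_values, max_new_tokens=40)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```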
Video R1 7B
Apache-2.0
Video-R1-7B is a multimodal large language model optimized based on Qwen2.5-VL-7B-Instruct, focusing on video reasoning tasks, capable of understanding video content and answering related questions.
Video-to-Text Transformers English
Video-R1
2,129
9
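Since Video-R1-7B is described as a fine-tune of Qwen2.5-VL-7B-Instruct, video question answering with it should, assuming the fine-tune keeps the base architecture and processor, follow the standard Qwen2.5-VL inference pattern. The sketch below outlines that; the repo id, file path, frame rate, and question are illustrative.

```python
# Hedged sketch: video QA with a Qwen2.5-VL-style checkpoint. Assumes the fine-tune
# keeps the base architecture; repo id inferred from the listing.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

repo = "Video-R1/Video-R1-7B"  # assumption: org/name as shown in the listing
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(repo)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/clip.mp4", "fps": 1.0},
        {"type": "text", "text": "What happens in this video, and why?"},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
_, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], videos=video_inputs, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
answer = processor.batch_decode(
    out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```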
Internvl 2 5 HiCo R16
Apache-2.0
InternVideo2.5 is a video multimodal large language model (MLLM) built upon InternVL2.5, enhanced with Long and Rich Context (LRC) modeling, capable of perceiving fine-grained details and capturing long-term temporal structures.
Video-to-Text Transformers English
OpenGVLab
1,914
3
Videollm Online 8b V1plus
MIT
VideoLLM-online is a multimodal large language model based on Llama-3-8B-Instruct, focusing on online video understanding and video-text generation tasks.
Video-to-Text English
chenjoya
1,688
23
Videochat R1 7B
Apache-2.0
VideoChat-R1_7B is a multimodal video understanding model based on Qwen2.5-VL-7B-Instruct, capable of processing video and text inputs and generating text outputs.
Video-to-Text Transformers English
OpenGVLab
1,686
7
Qwen2.5 Vl 7b Cam Motion Preview
Other
A camera motion analysis model fine-tuned from Qwen2.5-VL-7B-Instruct, focusing on camera motion classification in videos and on video-text retrieval tasks.
Video-to-Text Transformers
chancharikm
1,456
10
Mambavision B 1K
Apache-2.0
MambaVision-B-1K is NVIDIA's hybrid Mamba-Transformer vision backbone pretrained on ImageNet-1K, combining Mamba-style state-space blocks with self-attention to serve as a general-purpose visual feature extractor.
Video-to-Text Transformers
nvidia
1,082
11
Longvu Llama3 2 3B
Apache-2.0
LongVU is a spatio-temporal adaptive compression technology for long video language understanding, designed to efficiently process long video content.
Video-to-Text PyTorch
Vision-CAIR
1,079
7
Videochat Flash Qwen2 5 2B Res448
Apache-2.0
VideoChat-Flash-2B is a multimodal model built upon UMT-L (300M) and Qwen2.5-1.5B, supporting video-to-text tasks with only 16 tokens per frame and extending the context window to 128k.
Video-to-Text Transformers English
OpenGVLab
904
18
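The 16-tokens-per-frame figure and the 128k context window quoted above imply a rough upper bound on how many frames fit in one sequence; the back-of-envelope calculation below makes that explicit (the reserve for text tokens is an illustrative assumption).

```python
# Back-of-envelope: frame budget implied by 16 tokens/frame in a 128k context.
context_window = 128_000   # tokens, per the entry above
tokens_per_frame = 16      # per the entry above
text_reserve = 2_000       # illustrative reserve for the prompt and the answer

max_frames = (context_window - text_reserve) // tokens_per_frame
print(max_frames)  # 7875 -> on the order of the ~10,000-frame figure quoted for
                   # the 7B VideoChat-Flash variant later in this list
```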
Vamba Qwen2 VL 7B
MIT
Vamba is a hybrid Mamba-Transformer architecture that achieves efficient long video understanding through cross-attention layers and Mamba-2 modules.
Video-to-Text Transformers
TIGER-Lab
806
16
Videochat R1 Thinking 7B
Apache-2.0
VideoChat-R1-thinking_7B is a multimodal model based on Qwen2.5-VL-7B-Instruct, focusing on video-text-to-text tasks.
Video-to-Text Transformers English
OpenGVLab
800
0
Videochat Flash Qwen2 7B Res448
Apache-2.0
VideoChat-Flash-7B is a multimodal model built upon UMT-L (300M) and Qwen2-7B, using only 16 tokens per frame and supporting input sequences of up to approximately 10,000 frames.
Video-to-Text Transformers English
OpenGVLab
661
12
Tarsier 7b
Tarsier-7b is an open-source large-scale video-language model from the Tarsier series, specializing in generating high-quality video descriptions with excellent general video understanding capabilities.
Video-to-Text Transformers
omni-research
635
23
Internvideo2 Stage2 6B
MIT
InternVideo2 is a multimodal video understanding model with 6B parameters, focusing on video content analysis and comprehension tasks.
Video-to-Text Safetensors
OpenGVLab
542
0
Internvideo2 Chat 8B
MIT
InternVideo2-Chat-8B is a video understanding model that combines a large language model (LLM) with VideoBLIP, built through a progressive learning scheme and capable of video semantic understanding and human-computer interaction.
Video-to-Text Transformers English
OpenGVLab
492
22
Llava Video 7B Qwen2 TPO
MIT
LLaVA-Video-7B-Qwen2-TPO is a video understanding model based on LLaVA-Video-7B-Qwen2 with temporal preference optimization, demonstrating excellent performance across multiple benchmarks.
Video-to-Text Transformers
ruili0
490
1
Longvu Llama3 2 1B
Apache-2.0
LongVU is a spatio-temporal adaptive compression technology designed for long video language understanding, aiming to efficiently process long video content and enhance language comprehension.
Video-to-Text PyTorch
Vision-CAIR
465
11
Video Blip Opt 2.7b Ego4d
MIT
VideoBLIP is an enhanced version of BLIP-2 capable of processing video data, using OPT-2.7b as the language model backbone.
Video-to-Text Transformers English
kpyu
429
16
Xgen Mm Vid Phi3 Mini R V1.5 128tokens 8frames
xGen-MM-Vid (BLIP-3-Video) is an efficient, compact vision-language model equipped with an explicit temporal encoder, designed specifically for video content understanding.
Video-to-Text Safetensors English
Salesforce
398
11
Videochat2 HD Stage4 Mistral 7B Hf
MIT
VideoChat2-HD-hf is a multimodal video understanding model based on Mistral-7B, focusing on video-to-text conversion tasks.
Video-to-Text Safetensors
OpenGVLab
393
3
Skycaptioner V1
Apache-2.0
SkyCaptioner-V1 is a model specifically designed for generating high-quality structured descriptions of video data. By integrating specialized sub-expert models, multimodal large language models, and manual annotations, it addresses the limitations of general description models in capturing professional film details.
Video-to-Text Transformers
Skywork
362
29
Sharecaptioner Video
An open-source video caption generator fine-tuned on GPT-4V-annotated data, supporting videos of various durations, aspect ratios, and resolutions.
Video-to-Text Transformers
Lin-Chen
264
17
Internvl 2 5 HiCo R64
Apache-2.0
A video multimodal large language model enhanced by Long and Rich Context (LRC) modeling, improving on existing MLLMs through better perception of fine-grained details and capture of long-term temporal structures.
Video-to-Text Transformers English
OpenGVLab
252
2
Longvu Qwen2 7B
Apache-2.0
LongVU is a multimodal model based on Qwen2-7B, focusing on long video language understanding tasks and employing spatio-temporal adaptive compression technology.
Video-to-Text
Vision-CAIR
230
69
Longva 7B TPO
MIT
LongVA-7B-TPO is a video-text model derived from LongVA-7B through temporal preference optimization, excelling in long video understanding tasks.
Video-to-Text Transformers
ruili0
225
1
Llavaction 0.5B
LLaVAction is a multimodal large language model for action recognition, based on the Qwen2 language model, trained on the EPIC-KITCHENS-100-MQA dataset.
Video-to-Text Transformers English
MLAdaptiveIntelligence
215
1
Llava NeXT Video 34B DPO
LLaVA-NeXT-Video-34B-DPO is the 34B model in the open-source LLaVA-NeXT-Video series, trained on mixed video and image data and further tuned with direct preference optimization (DPO) for improved video understanding.
Video-to-Text Transformers
lmms-lab
214
10
Videomind 2B
BSD-3-Clause
VideoMind is a multimodal agent framework that enhances video reasoning capabilities by simulating human thought processes (such as task decomposition, moment localization & verification, and answer synthesis).
Video-to-Text
yeliudev
207
1
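The roles named above (task decomposition, moment localization and verification, answer synthesis) describe a control flow rather than a single forward pass. The sketch below is purely conceptual pseudocode of that flow; every function in it is a hypothetical placeholder, not VideoMind's actual API.

```python
# Conceptual sketch of a plan -> ground -> verify -> answer loop as described above.
# All callables here are hypothetical placeholders, not VideoMind's API.
from typing import Callable, List, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)

def videomind_style_answer(
    video: str,
    question: str,
    plan: Callable[[str], List[str]],            # decompose the question into steps
    ground: Callable[[str, str], Span],          # localize a moment for a sub-question
    verify: Callable[[str, Span, str], bool],    # check the localized moment
    answer: Callable[[str, Span, str], str],     # answer from a (verified) span
) -> str:
    for sub_question in plan(question):
        span = ground(video, sub_question)
        if verify(video, span, sub_question):
            return answer(video, span, question)
    # Fall back to the whole video if no localized moment passes verification.
    return answer(video, (0.0, float("inf")), question)
```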
Internvideo2 Chat 8B HD
MIT
InternVideo2-Chat-8B-HD is a video understanding model that combines a large language model and VideoBLIP. It is constructed through a progressive learning scheme and can handle high-definition video input.
Video-to-Text Safetensors
OpenGVLab
190
16
Slowfast Video Mllm Qwen2 7b Convnext 576 Frame64 S1t4
A video multimodal large language model with a slow-fast architecture that balances temporal resolution against spatial detail, supporting 64-frame video understanding.
Video-to-Text Transformers
shi-labs
184
0
Timezero Charades 7B
TimeZero is a reasoning-guided large vision-language model (LVLM) specifically designed for temporal video grounding (TVG) tasks. It identifies temporal segments in videos corresponding to natural language queries through reinforcement learning methods.
Video-to-Text Transformers
wwwyyy
183
0
Videollama2.1 7B 16F Base
Apache-2.0
VideoLLaMA2.1 is an upgraded version of VideoLLaMA2, focusing on enhancing spatiotemporal modeling and audio understanding capabilities in large video-language models.
Video-to-Text Transformers English
DAMO-NLP-SG
179
1
Kangaroo
Apache-2.0
Kangaroo is a powerful multimodal large language model specifically designed for long video understanding, supporting bilingual dialogue (Chinese-English) and long video inputs.
Video-to-Text Transformers Supports Multiple Languages
KangarooGroup
163
12
Llavaction 7B
LLaVAction is a multimodal large language model evaluation and training framework for action recognition, based on the Qwen2 language model architecture, supporting first-person perspective video understanding.
Video-to-Text Transformers English
MLAdaptiveIntelligence
149
1
Timezero ActivityNet 7B
TimeZero is a reasoning-guided large-scale vision-language model (LVLM) specifically designed for temporal video grounding (TVG) tasks, achieving dynamic video-language relationship analysis through reinforcement learning methods.
Video-to-Text Transformers
wwwyyy
142
1
Tinyllava Video R1
Apache-2.0
TinyLLaVA-Video-R1 is a small-scale video reasoning model built on TinyLLaVA-Video, a base model with fully traceable training. Reinforcement learning significantly improves its reasoning and thinking abilities and elicits emergent 'aha moment' behavior.
Video-to-Text Transformers
Zhang199
123
2
Tarsier 34b
Apache-2.0
Tarsier-34b is an open-source large-scale video-language model focused on generating high-quality video captions, achieving leading results on multiple public benchmarks.
Video-to-Text Transformers
omni-research
103
17
TEMPURA Qwen2.5 VL 3B S2
TEMPURA is a vision-language model capable of reasoning causal event relationships and generating fine-grained timestamp descriptions for unedited videos.
Video-to-Text Transformers
andaba
102
1