VideoMind-2B
VideoMind is a multimodal agent framework that enhances video reasoning capabilities by simulating human thought processes (such as task decomposition, moment localization & verification, and answer synthesis).
Downloads: 207
Release time: 3/21/2025
Model Overview
VideoMind is a multimodal large language model focused on video-text-to-text tasks, enhancing video reasoning by simulating human thought processes.
Model Features
Multimodal Agent Framework
Enhances video reasoning by simulating human thought processes (e.g., task decomposition, moment localization & verification, and answer synthesis).
Role Specialization
The model includes four roles: planner, grounder, verifier, and answerer, each handling a distinct stage of the reasoning process.
Efficient Reasoning
Achieves fast role switching and efficient reasoning by implementing each role as a lightweight LoRA adapter on a shared base model (chain-of-LoRA); see the sketch below.
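The snippet below is a minimal sketch of this role-switching pattern. It mirrors the `build_model`, `load_adapter`, and `set_adapter` calls used in the Quick Start script further down; the `switch_role` helper is introduced here purely for illustration and assumes, as in that script, that `build_model` already attaches the grounder adapter.

```python
from videomind.model.builder import build_model

MODEL_PATH = 'yeliudev/VideoMind-2B'

# build_model returns the base model (with the *grounder* adapter) and its processor;
# *planner* and *verifier* are attached as additional named LoRA adapters
model, processor = build_model(MODEL_PATH)
model.load_adapter(f'{MODEL_PATH}/planner', adapter_name='planner')
model.load_adapter(f'{MODEL_PATH}/verifier', adapter_name='verifier')


def switch_role(model, role):
    """Activate the LoRA adapter of the requested role (illustrative helper)."""
    model.base_model.disable_adapter_layers()
    model.base_model.enable_adapter_layers()
    model.set_adapter(role)


switch_role(model, 'planner')   # decompose the question into sub-tasks
switch_role(model, 'grounder')  # localize candidate moments
switch_role(model, 'verifier')  # verify and re-rank the candidates
# the *answerer* is the base model itself, run with all adapters disabled
```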
Model Capabilities
Video Understanding
Video Moment Localization
Video Question Answering
Multimodal Reasoning
Use Cases
Video Analysis
Video Question Answering
Ask questions about video content and receive accurate answers.
The model can precisely locate the key moments in a video and generate answers grounded in them.
Video Moment Localization
Locate the timing of specific events in long videos.
The model can accurately identify and return the time segments in which the events occur.
VideoMind-2B
VideoMind is a multi-modal agent framework that enhances video reasoning by emulating human-like processes, such as breaking down tasks, localizing and verifying moments, and synthesizing answers.
🔖 Model Details
- Model type: Multi-modal Large Language Model
- Language(s): English
- License: BSD-3-Clause
🚀 Quick Start
Install the environment
- Clone the repository from GitHub.
git clone git@github.com:yeliudev/VideoMind.git
cd VideoMind
- Initialize conda environment.
conda create -n videomind python=3.11 -y
conda activate videomind
- Install dependencies.
pip install -r requirements.txt
For NPU users, please modify lines 18-25 of requirements.txt.
Quick Inference Demo
The script below showcases how to perform inference with VideoMind's different roles. Please refer to our GitHub Repository for more details about this model.
import torch
from videomind.constants import GROUNDER_PROMPT, PLANNER_PROMPT, VERIFIER_PROMPT
from videomind.dataset.utils import process_vision_info
from videomind.model.builder import build_model
from videomind.utils.io import get_duration
from videomind.utils.parser import parse_span
MODEL_PATH = 'yeliudev/VideoMind-2B'
video_path = '<path-to-video>'
question = '<question>'
# initialize role *grounder*
model, processor = build_model(MODEL_PATH)
device = next(model.parameters()).device
# initialize role *planner*
model.load_adapter(f'{MODEL_PATH}/planner', adapter_name='planner')
# initialize role *verifier*
model.load_adapter(f'{MODEL_PATH}/verifier', adapter_name='verifier')
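# NOTE: the *grounder* adapter is loaded inside build_model (it is never loaded explicitly here),
# while *planner* and *verifier* are added as extra named adapters; the *answerer* used at the
# end is the base model itself with all adapters disabled.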
# ==================== Planner ====================
messages = [{
    'role': 'user',
    'content': [{
        'type': 'video',
        'video': video_path,
        'min_pixels': 36 * 28 * 28,
        'max_pixels': 64 * 28 * 28,
        'max_frames': 100,
        'fps': 1.0
    }, {
        'type': 'text',
        'text': PLANNER_PROMPT.format(question)
    }]
}]
# preprocess inputs
text = processor.apply_chat_template(messages, add_generation_prompt=True)
images, videos = process_vision_info(messages)
data = processor(text=[text], images=images, videos=videos, return_tensors='pt').to(device)
# switch adapter to *planner*
model.base_model.disable_adapter_layers()
model.base_model.enable_adapter_layers()
model.set_adapter('planner')
# run inference
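# (do_sample=False gives deterministic, greedy decoding)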
output_ids = model.generate(**data, do_sample=False, temperature=None, top_p=None, top_k=None, max_new_tokens=256)
# decode output ids
output_ids = output_ids[0, data.input_ids.size(1):-1]
response = processor.decode(output_ids, clean_up_tokenization_spaces=False)
print(f'Planner Response: {response}')
# ==================== Grounder ====================
messages = [{
    'role': 'user',
    'content': [{
        'type': 'video',
        'video': video_path,
        'min_pixels': 36 * 28 * 28,
        'max_pixels': 64 * 28 * 28,
        'max_frames': 150,
        'fps': 1.0
    }, {
        'type': 'text',
        'text': GROUNDER_PROMPT.format(question)
    }]
}]
# preprocess inputs
text = processor.apply_chat_template(messages, add_generation_prompt=True)
images, videos = process_vision_info(messages)
data = processor(text=[text], images=images, videos=videos, return_tensors='pt').to(device)
# switch adapter to *grounder*
model.base_model.disable_adapter_layers()
model.base_model.enable_adapter_layers()
model.set_adapter('grounder')
# run inference
output_ids = model.generate(**data, do_sample=False, temperature=None, top_p=None, top_k=None, max_new_tokens=256)
# decode output ids
output_ids = output_ids[0, data.input_ids.size(1):-1]
response = processor.decode(output_ids, clean_up_tokenization_spaces=False)
print(f'Grounder Response: {response}')
duration = get_duration(video_path)
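# the grounder stores its regressed candidate moments in `model.reg`: the first two columns are
# the normalized start/end of each span and the last column is a confidence score, with the
# candidates ordered by the grounder's confidence (hence `pred[:5]` below gives the top-5)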
# 1. extract timestamps and confidences
blob = model.reg[0].cpu().float()
pred, conf = blob[:, :2] * duration, blob[:, -1].tolist()
# 2. clamp timestamps
pred = pred.clamp(min=0, max=duration)
# 3. sort each (start, end) pair so that start <= end
inds = (pred[:, 1] - pred[:, 0] < 0).nonzero()[:, 0]
pred[inds] = pred[inds].roll(1, dims=-1)
# 4. convert timestamps to list
pred = pred.tolist()
print(f'Grounder Regressed Timestamps: {pred}')
# ==================== Verifier ====================
# using top-5 predictions
probs = []
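# each candidate is verified independently: the verifier watches the candidate span plus half of
# its length of extra context on each side, segment start/end tokens are inserted to mark the
# candidate's boundaries inside the video token sequence, and the score is derived from the gap
# between the 'Yes' and 'No' logits of the next predicted token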
for cand in pred[:5]:
    s0, e0 = parse_span(cand, duration, 2)
    offset = (e0 - s0) / 2
    s1, e1 = parse_span([s0 - offset, e0 + offset], duration)
    # percentage of s0, e0 within s1, e1
    s = (s0 - s1) / (e1 - s1)
    e = (e0 - s1) / (e1 - s1)
    messages = [{
        'role': 'user',
        'content': [{
            'type': 'video',
            'video': video_path,
            'video_start': s1,
            'video_end': e1,
            'min_pixels': 36 * 28 * 28,
            'max_pixels': 64 * 28 * 28,
            'max_frames': 64,
            'fps': 2.0
        }, {
            'type': 'text',
            'text': VERIFIER_PROMPT.format(question)
        }]
    }]
    text = processor.apply_chat_template(messages, add_generation_prompt=True)
    images, videos = process_vision_info(messages)
    data = processor(text=[text], images=images, videos=videos, return_tensors='pt')
    # ===== insert segment start/end tokens =====
    video_grid_thw = data['video_grid_thw'][0]
    num_frames, window = int(video_grid_thw[0]), int(video_grid_thw[1] * video_grid_thw[2] / 4)
    assert num_frames * window * 4 == data['pixel_values_videos'].size(0)
    pos_s, pos_e = round(s * num_frames), round(e * num_frames)
    pos_s, pos_e = min(max(0, pos_s), num_frames), min(max(0, pos_e), num_frames)
    assert pos_s <= pos_e, (num_frames, s, e)
    base_idx = torch.nonzero(data['input_ids'][0] == model.config.vision_start_token_id).item()
    pos_s, pos_e = pos_s * window + base_idx + 1, pos_e * window + base_idx + 2
    input_ids = data['input_ids'][0].tolist()
    input_ids.insert(pos_s, model.config.seg_s_token_id)
    input_ids.insert(pos_e, model.config.seg_e_token_id)
    data['input_ids'] = torch.LongTensor([input_ids])
    data['attention_mask'] = torch.ones_like(data['input_ids'])
    # ===========================================
    data = data.to(device)
    # switch adapter to *verifier*
    model.base_model.disable_adapter_layers()
    model.base_model.enable_adapter_layers()
    model.set_adapter('verifier')
    # run inference
    with torch.inference_mode():
        logits = model(**data).logits[0, -1].softmax(dim=-1)
    # NOTE: magic numbers here
    # In Qwen2-VL vocab: 9454 -> Yes, 2753 -> No
    score = (logits[9454] - logits[2753]).sigmoid().item()
    probs.append(score)
# sort predictions by verifier's confidence
ranks = torch.Tensor(probs).argsort(descending=True).tolist()
pred = [pred[idx] for idx in ranks]
conf = [conf[idx] for idx in ranks]
print(f'Verifier Re-ranked Timestamps: {pred}')
# ==================== Answerer ====================
# select the best candidate moment
s, e = parse_span(pred[0], duration, 32)
messages = [{
    'role': 'user',
    'content': [{
        'type': 'video',
        'video': video_path,
        'video_start': s,
        'video_end': e,
        'min_pixels': 128 * 28 * 28,
        'max_pixels': 256 * 28 * 28,
        'max_frames': 32,
        'fps': 2.0
    }, {
        'type': 'text',
        'text': question
    }]
}]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
images, videos = process_vision_info(messages)
data = processor(text=[text], images=images, videos=videos, return_tensors='pt').to(device)
# remove all adapters as *answerer* is the base model itself
with model.disable_adapter():
    output_ids = model.generate(**data, do_sample=False, temperature=None, top_p=None, top_k=None, max_new_tokens=256)
# decode output ids
output_ids = output_ids[0, data.input_ids.size(1):-1]
response = processor.decode(output_ids, clean_up_tokenization_spaces=False)
print(f'Answerer Response: {response}')
📖 Citation
Please kindly cite our paper if you find this project helpful.
@article{liu2025videomind,
  title={VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning},
  author={Liu, Ye and Lin, Kevin Qinghong and Chen, Chang Wen and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2503.13444},
  year={2025}
}
Featured Recommended AI Models

Llava Video 7B Qwen2 (lmms-lab) · Apache-2.0 · Video-to-Text · Transformers · English · 34.28k downloads · 91 likes
The LLaVA-Video model is a 7B-parameter multimodal model based on the Qwen2 language model, specializing in video understanding tasks and supporting 64-frame video input.

Llava NeXT Video 7B DPO Hf (llava-hf) · Video-to-Text · Transformers · English · 12.61k downloads · 9 likes
LLaVA-NeXT-Video is an open-source multimodal chatbot optimized through mixed training on video and image data, possessing excellent video understanding capabilities.

Internvideo2 5 Chat 8B (OpenGVLab) · Apache-2.0 · Video-to-Text · Transformers · English · 8,265 downloads · 60 likes
InternVideo2.5 is a video multimodal large language model enhanced by Long and Rich Context (LRC) modeling, built upon InternVL2.5. It significantly improves existing MLLMs by enhancing the ability to perceive fine-grained details and capture long-term temporal structures.

Cogvlm2 Llama3 Caption (THUDM) · Other · Video-to-Text · Transformers · English · 7,493 downloads · 95 likes
CogVLM2-Caption is a video caption generation model used to generate training data for the CogVideoX model.

Spacetimegpt (Neleac) · Video-to-Text · Transformers · English · 2,877 downloads · 33 likes
SpaceTime GPT is a video description generation model capable of spatial and temporal reasoning, analyzing video frames and generating sentences describing video events.

Video R1 7B (Video-R1) · Apache-2.0 · Video-to-Text · Transformers · English · 2,129 downloads · 9 likes
Video-R1-7B is a multimodal large language model optimized based on Qwen2.5-VL-7B-Instruct, focusing on video reasoning tasks, capable of understanding video content and answering related questions.

Internvl 2 5 HiCo R16 (OpenGVLab) · Apache-2.0 · Video-to-Text · Transformers · English · 1,914 downloads · 3 likes
InternVideo2.5 is a video multimodal large language model (MLLM) built upon InternVL2.5, enhanced with Long and Rich Context (LRC) modeling, capable of perceiving fine-grained details and capturing long-term temporal structures.

Videollm Online 8b V1plus (chenjoya) · MIT · Video-to-Text · Safetensors · English · 1,688 downloads · 23 likes
VideoLLM-online is a multimodal large language model based on Llama-3-8B-Instruct, focusing on online video understanding and video-text generation tasks.

Videochat R1 7B (OpenGVLab) · Apache-2.0 · Video-to-Text · Transformers · English · 1,686 downloads · 7 likes
VideoChat-R1_7B is a multimodal video understanding model based on Qwen2.5-VL-7B-Instruct, capable of processing video and text inputs and generating text outputs.

Qwen2.5 Vl 7b Cam Motion Preview (chancharikm) · Other · Video-to-Text · Transformers · 1,456 downloads · 10 likes
A camera motion analysis model fine-tuned from Qwen2.5-VL-7B-Instruct, focusing on camera motion classification in videos and video-text retrieval tasks.