LLaVAction-7B Open-Source Action Recognition Model - Supports First-Person View Video Understanding

Developed by MLAdaptiveIntelligence

LLaVAction is a framework for evaluating and training multimodal large language models for action recognition. LLaVAction-7B is built on the Qwen2 language model and supports first-person (egocentric) video understanding.

Video-to-Text

Transformers

Language: English

Tags: first-person action understanding, 64-frame long video processing, multimodal Q&A

Release Time: 3/24/2025

Model Overview

The LLaVAction-7B model specializes in understanding human actions in first-person (egocentric) videos. It accepts up to 64 frames of video input and reports 59% accuracy on EgoSchema and 61.1% on MVBench.

Model Features

First-person perspective understanding

Optimized for first-person videos: recognizes actions and interactions from an egocentric viewpoint.

Long video processing capability

Supports processing of up to 64 frames of video input, enabling effective understanding of long video content

Multimodal fusion

Combines visual and linguistic information for video content understanding and question answering.

High-performance benchmark results

Reports 59% accuracy on EgoSchema and 61.1% on MVBench, among other video understanding benchmarks.

Model Capabilities

Video content understanding

Action recognition

Multimodal Q&A

Long video analysis

First-person perspective understanding

Use Cases

Smart home

Kitchen activity analysis

Analyzing users' cooking activities in the kitchen

Can accurately recognize actions like chopping and cooking

Behavioral research

Daily activity analysis

Studying patterns of human daily activities

Can identify and classify various daily activities

Assistive technology

Action guidance

Providing action guidance for users with special needs

Can understand and guide users to complete specific actions

🚀 LLaVAction-7B

LLaVAction: evaluating and training multi-modal large language models for action recognition

🚀 Quick Start

The LLaVAction-7B model recognizes actions in video. It is based on the Qwen2 language model and trained on EPIC-KITCHENS-100-MQA and LLaVA-Video-178K to improve its understanding of human egocentric actions.

✨ Features

Trained on Specific Datasets: The model is trained on EPIC-KITCHENS-100-MQA [dataset release pending] and LLaVA-Video-178K, which improves its ability to understand human egocentric actions from videos.
Based on Qwen2: Built on the Qwen2 language model with a context window of 32K tokens, supporting at most 64 frames.
Multiple Task Performance: Evaluated on several video understanding benchmarks; accuracies are listed under Performance on Datasets.
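
The 32K-token window and the 64-frame limit together bound how many tokens each frame can occupy. The sketch below is only back-of-the-envelope arithmetic: it assumes "32K" means 32 × 1024 tokens and that prompt text and generated tokens share the same window. The actual number of visual tokens per frame is not stated in this card, and `per_frame_token_budget` is a hypothetical helper, not part of the llavaction package.

```python
def per_frame_token_budget(context_tokens=32 * 1024, num_frames=64, text_tokens=0):
    """Upper bound on tokens per frame so that all frames plus text fit in the window.

    `text_tokens` covers everything that is not video: the prompt and any
    tokens reserved for generation (an assumption about how the window is shared).
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if text_tokens >= context_tokens:
        raise ValueError("text_tokens leaves no room for video frames")
    return (context_tokens - text_tokens) // num_frames


# With nothing reserved for text, each of the 64 frames gets at most 512 tokens.
print(per_frame_token_budget())  # 512

# Reserving room for a prompt and for generated tokens lowers the bound.
print(per_frame_token_budget(text_tokens=4096 + 256))
```

Treat the result as a ceiling rather than the model's real per-frame cost.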

📦 Installation

!pip install llavaction

💻 Usage Examples

Basic Usage

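import copy

import numpy as np
import torch
from decord import VideoReader, cpu

# Import paths assume the llavaction package mirrors the LLaVA-NeXT layout;
# check the repository if any of them fail.
from llavaction.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llavaction.conversation import conv_templates
from llavaction.mm_utils import tokenizer_image_token
from llavaction.model.builder import load_pretrained_model

# Your video (the model assumes an egocentric viewpoint)
video_path = "XXXX"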

# These are the prompts we trained with, but you can test others:
perspective_prompt = "You are seeing this video from egocentric view and you are the person. Your hands are sometimes interacting with objects. What action are you doing?"
task_prompt = "Describe in details what you see from the video frames."

def load_video(video_path, max_frames_num, fps=1, force_sample=False):
    """Return (frames, frame timestamps in seconds, video duration in seconds)."""
    if max_frames_num == 0:
        # Placeholder frame; return a 3-tuple so callers can always unpack.
        return np.zeros((1, 336, 336, 3)), [], 0.0
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    total_frame_num = len(vr)
    avg_fps = vr.get_avg_fps()
    video_time = total_frame_num / avg_fps
    # Keep roughly `fps` frames per second of video; the step is never below 1.
    step = max(1, round(avg_fps / fps))
    frame_idx = list(range(0, total_frame_num, step))
    if len(frame_idx) > max_frames_num or force_sample:
        # Too many frames (or sampling forced): pick max_frames_num frames uniformly.
        frame_idx = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int).tolist()
    # Computed outside the branch so frame_time is always defined.
    frame_time = [i / avg_fps for i in frame_idx]
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    return spare_frames, frame_time, video_time

pretrained = "MLAdaptiveIntelligence/LLaVAction-7B"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map)  # Add any other thing you want to pass in llava_model_args
model.eval()
max_frames_num = 64
video,frame_time,video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].to(device, dtype=torch.bfloat16)
video = [video]
conv_template = "qwen_1_5"  # Make sure you use correct chat template for different models
time_instruction = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. "
question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruction}\n{perspective_prompt} {task_prompt}"

conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)

cont = model.generate(
    input_ids,
    images=video,
    modalities=["video"],
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip()
print(text_outputs)
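
The frame selection inside `load_video` can be tried without a video file. The sketch below reproduces only the index logic with NumPy. `sample_frame_indices` is a hypothetical helper written for illustration, not a function of the llavaction package, and the frame count and frame rate in the usage line are made-up values.

```python
import numpy as np


def sample_frame_indices(total_frames, avg_fps, max_frames, target_fps=1.0, force_sample=False):
    """Return (frame indices, timestamps in seconds) for a clip."""
    # Step between kept frames so that roughly `target_fps` frames per second remain.
    step = max(1, round(avg_fps / target_fps))
    idx = list(range(0, total_frames, step))
    # Too many frames (or sampling forced): spread max_frames indices uniformly
    # over the whole clip, first and last frame included.
    if len(idx) > max_frames or force_sample:
        idx = np.linspace(0, total_frames - 1, max_frames, dtype=int).tolist()
    times = [i / avg_fps for i in idx]
    return idx, times


# A 100-second clip at 30 fps yields 100 candidates at 1 fps, which exceeds 64,
# so the function falls back to 64 uniformly spaced frames.
idx, times = sample_frame_indices(total_frames=3000, avg_fps=30.0, max_frames=64)
print(len(idx), idx[0], idx[-1])  # 64 0 2999
```

Short clips that already produce no more than `max_frames` candidates keep their 1-fps sampling unless `force_sample=True`, which is what the basic example passes.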

Advanced Usage

For more details on using the model, see the GitHub repository: https://github.com/AdaptiveMotorControlLab/LLaVAction
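
When experimenting with other prompts, it can help to isolate how the question string in the basic example is assembled. The sketch below is a minimal, dependency-free version. `build_question` is a hypothetical helper, and the default `image_token="<image>"` is an assumption about the value of `DEFAULT_IMAGE_TOKEN`; import the constant from the package rather than hard-coding it in real use.

```python
def build_question(video_time, num_frames, perspective_prompt, task_prompt, image_token="<image>"):
    """Assemble the user question: image placeholder, timing note, then the two prompts."""
    time_instruction = (
        f"The video lasts for {video_time:.2f} seconds, "
        f"and {num_frames} frames are uniformly sampled from it. "
    )
    return f"{image_token}\n{time_instruction}\n{perspective_prompt} {task_prompt}"


question = build_question(
    video_time=12.5,
    num_frames=64,
    perspective_prompt="What action are you doing?",
    task_prompt="Describe what you see.",
)
print(question)
```

The returned string plays the role of `question` in the basic example and would be passed to `conv.append_message` in the same way.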

📚 Documentation

Model Summary

The LLaVAction-7B model is trained on EPIC-KITCHENS-100-MQA and is based on the Qwen2 language model with a context window of 32K tokens. It supports at most 64 frames.

Project Page: https://mmathislab.github.io/llavaction/
Paper: https://arxiv.org/abs/2503.18712
Repository: https://github.com/AdaptiveMotorControlLab/LLaVAction
Point of Contact: Mackenzie Mathis
Languages: English

Model Performance

Property	Details
Model Type	LLaVAction-7B
Training Data	A mixture of LLaVA-Video-178K and EPIC-KITCHENS-100-MQA
Metrics	Accuracy on multiple datasets

Performance on Datasets

Dataset	Accuracy (%)
EgoSchema	59
MVBench	61.1
NextQA	82.8
PerceptionTest	70.2
LongVideoBench	58.6
VideoMME	63.9
VideoMME (w-subs)	71.4
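
For quick comparisons, the reported accuracies can be kept in a small dictionary. The sketch below only restates the numbers from the table above and derives two simple quantities from them; it does not run any evaluation.

```python
# Accuracy (%) as reported in the table above.
scores = {
    "EgoSchema": 59.0,
    "MVBench": 61.1,
    "NextQA": 82.8,
    "PerceptionTest": 70.2,
    "LongVideoBench": 58.6,
    "VideoMME": 63.9,
    "VideoMME (w-subs)": 71.4,
}

# Gain on VideoMME when subtitles are provided, in percentage points.
subtitle_gain = round(scores["VideoMME (w-subs)"] - scores["VideoMME"], 1)

# Benchmark with the highest reported accuracy.
best = max(scores, key=scores.get)

print(subtitle_gain, best)  # 7.5 NextQA
```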

🔧 Technical Details

Model

Architecture: SO400M + Qwen2
Initialized Model: lmms-lab/LLaVA-Video-7B-Qwen2
Data: A mixture of LLaVA-Video-178K and EPIC-KITCHENS-100-MQA, 2 epochs, full model
Precision: bfloat16
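
bfloat16 stores each parameter in 2 bytes, which gives a rough lower bound on the memory needed just to hold the weights. The sketch below assumes a nominal 7 billion parameters (taken from the "7B" in the model name; the exact count, including the vision encoder, is not stated here) and ignores activations, the KV cache, and framework overhead. `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params, bytes_per_param=2):
    """Memory needed to store the weights alone, in GiB (2 bytes per parameter for bfloat16)."""
    return num_params * bytes_per_param / 2**30


# Nominal 7e9 parameters in bfloat16.
print(f"{weight_memory_gib(7e9):.1f} GiB")  # 13.0 GiB
```

Actual GPU usage during inference will be higher, especially with 64 frames of visual tokens in the context.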

Hardware & Software

GPUs: 32 × NVIDIA GH200 (for training the whole model series)
Orchestration: HuggingFace Trainer
Neural networks: PyTorch

📄 License

This project is licensed under the CC BY-NC-SA 4.0 license.

📚 Citation

arXiv: arxiv.org/abs/2503.18712

@article{YeQi2025llavaction,
  title={LLaVAction: evaluating and training multi-modal large language models for action recognition},
  author={Ye, Shaokai and Qi, Haozhe and Mathis, Alexander and Mathis, Mackenzie W.},
  journal={arXiv preprint},
  year={2025}
}
