Open-source LLaVAction-0.5B Multimodal Large Model - Capable of Efficient Action Recognition

Developed by MLAdaptiveIntelligence

LLaVAction is a multimodal large language model for action recognition. It is built on the Qwen2 language model and trained on the EPIC-KITCHENS-100-MQA dataset.

Video-to-Text

Transformers

English

#First-person action recognition #Multimodal video question answering #Long video understanding

Release Time: 3/24/2025

Model Overview

This model focuses on video action recognition. It understands actions in first-person (egocentric) videos and is suited to analyzing footage similar to EPIC-KITCHENS-100.

Model Features

Multimodal understanding capability

Combines visual and linguistic information to understand video content and generate relevant descriptions

First-person perspective action recognition

Specifically designed to recognize hand-object interaction actions in first-person perspective videos

Large context window

Supports a 32K token context window, suitable for processing long video content
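
To make the context-window claim concrete, the sketch below estimates how much of a 32K window a set of sampled frames would occupy. The per-frame token count is a hypothetical placeholder (it depends on the vision encoder and any pooling), and reading "32K" as 32,768 tokens is also an assumption. The 64-frame and 4096-token figures are taken from the usage example on this page.

```python
CONTEXT_WINDOW = 32 * 1024  # "32K" from the model card, read here as 32,768 tokens (assumption)
TOKENS_PER_FRAME = 196      # hypothetical visual tokens per frame; measure this for your setup


def visual_tokens(num_frames: int, tokens_per_frame: int = TOKENS_PER_FRAME) -> int:
    """Tokens consumed by the sampled video frames."""
    return num_frames * tokens_per_frame


def max_frames_that_fit(context_window: int, tokens_per_frame: int, reserved_text_tokens: int) -> int:
    """Largest frame count that leaves reserved_text_tokens free for prompt and answer."""
    available = context_window - reserved_text_tokens
    return max(available // tokens_per_frame, 0)


used = visual_tokens(64)
print(f"64 frames use {used} of {CONTEXT_WINDOW} tokens")
print("frames that fit with 4096 tokens reserved:",
      max_frames_that_fit(CONTEXT_WINDOW, TOKENS_PER_FRAME, 4096))
```

Replacing the placeholder with a measured value (for example, by counting the visual embeddings your checkpoint produces for one frame) turns this into a quick check before raising the frame count.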

Model Capabilities

Video content understanding

Action recognition

Multimodal question answering

Video frame analysis

Temporal information processing

Use Cases

Smart home

Kitchen activity analysis

Identifies various operational activities of users in the kitchen

Recognizes common kitchen actions such as chopping and cooking

Behavioral research

Daily activity analysis

Studies human daily activity patterns and behavioral habits

🚀 LLaVAction-0.5B

LLaVAction is a project focused on evaluating and training multi-modal large language models for action recognition, offering a new solution for video action recognition tasks.

🚀 Quick Start

The LLaVAction-0.5B model is trained on EPIC-KITCHENS-100-MQA and is based on the Qwen2 language model, with a context window of 32K tokens.

Project Page: https://mmathislab.github.io/llavaction/
Paper: For more details, please check our paper (arxiv.org/abs/2503.18712)
Repository: https://github.com/AdaptiveMotorControlLab/LLaVAction
Point of Contact: Mackenzie Mathis
Languages: English

✨ Features

Multimodal Capability: Supports video-text-to-text tasks, enabling action recognition and description in videos.
Based on Qwen2: Uses the Qwen2 language model with a 32K token context window.
Trained on Specific Dataset: Trained on EPIC-KITCHENS-100-MQA for better performance in related scenarios.

💻 Usage Examples

Basic Usage

import copy

import numpy as np
import torch
from decord import VideoReader, cpu

# Assumption: the llavaction package mirrors the LLaVA-NeXT module layout.
# Check the repository if any of these import paths differ.
from llavaction.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llavaction.conversation import conv_templates
from llavaction.mm_utils import tokenizer_image_token
from llavaction.model.builder import load_pretrained_model

# Your video (an egocentric viewpoint is assumed)
video_path = "XXXX"

# These are the prompts we trained with, but you can test others:
perspective_prompt = "You are seeing this video from egocentric view and you are the person. Your hands are sometimes interacting with objects. What action are you doing?"
task_prompt = "Describe in details what you see from the video frames."


def load_video(video_path, max_frames_num, fps=1, force_sample=False):
    """Return (frames, frame_times_in_seconds, video_duration_in_seconds)."""
    if max_frames_num == 0:
        # Keep the return shape consistent with the normal path.
        return np.zeros((1, 336, 336, 3)), [], 0.0
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    total_frame_num = len(vr)
    avg_fps = vr.get_avg_fps()
    video_time = total_frame_num / avg_fps
    # Sample roughly `fps` frames per second; the step is at least one frame.
    step = max(1, round(avg_fps / fps))
    frame_idx = list(range(0, total_frame_num, step))
    if len(frame_idx) > max_frames_num or force_sample:
        # Too many frames (or sampling forced): pick max_frames_num uniformly.
        frame_idx = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int).tolist()
    # Computed outside the branch so it is defined on both paths.
    frame_time = [i / avg_fps for i in frame_idx]
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    return spare_frames, frame_time, video_time


pretrained = "MLAdaptiveIntelligence/LLaVAction-0.5B"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map)  # Add any other thing you want to pass in llava_model_args
model.eval()

# Load and preprocess the video frames
max_frames_num = 64
video, frame_time, video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].to(device, dtype=torch.bfloat16)
video = [video]

# Build the prompt with the chat template
conv_template = "qwen_1_5"  # Make sure you use correct chat template for different models
time_instruction = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. "
question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruction}\n{perspective_prompt} {task_prompt}"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

# Tokenize and generate (greedy decoding)
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
cont = model.generate(
    input_ids,
    images=video,
    modalities=["video"],
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip()
print(text_outputs)
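
The uniform sampling step in load_video can be tried without a video file or a GPU. The sketch below is a standard-library stand-in for the np.linspace call, not the project's own code: the helper names are hypothetical, and it floors with integer arithmetic, so an index may occasionally differ by one from NumPy's float-based result.

```python
def uniform_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spread evenly over [0, total_frames - 1]."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples == 1:
        return [0]
    last = total_frames - 1
    # Integer arithmetic keeps the first and last frame and floors the rest.
    return [(i * last) // (num_samples - 1) for i in range(num_samples)]


def frame_timestamps(indices: list[int], video_fps: float) -> list[float]:
    """Convert frame indices to seconds from the start of the video."""
    return [idx / video_fps for idx in indices]


indices = uniform_frame_indices(total_frames=100, num_samples=5)
times = frame_timestamps(indices, video_fps=25.0)
print(indices)
print(times)
```

When num_samples exceeds total_frames, some indices repeat. That matches the behavior of sampling a very short clip with a fixed frame budget.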

📚 Documentation

Training

See details in Ye et al. 2025: arxiv.org/abs/2503.18712

Model

Architecture: SO400M + Qwen2
Initialized Model: lmms-lab/llava-onevision-qwen2-0.5b-ov
Data: EPIC-KITCHENS-100-MQA, 2 epochs, full model
Precision: bfloat16

Hardware & Software

GPUs: 32 * Nvidia GH-200 (for whole model series training)
Orchestration: HuggingFace Trainer
Neural networks: PyTorch

📄 License

This project is licensed under the cc-by-nc-sa-4.0 license.

📄 Citation

arXiv: arxiv.org/abs/2503.18712

@article{YeQi2025llavaction,
  title={LLaVAction: evaluating and training multi-modal large language models for action recognition},
  author={Ye, Shaokai and Qi, Haozhe and Mathis, Alexander and Mathis, Mackenzie W.},
  journal={arXiv preprint},
  year={2025}
}

📋 Information Table

Property	Details
Model Type	LLaVAction-0.5B
Base Model	lmms-lab/llava-onevision-qwen2-0.5b-ov
Pipeline Tag	video-text-to-text
Tags	Action, Video, MQA, multimodal, MLLMs, LLaVAction
Metrics	accuracy
Library Name	transformers
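
The table lists accuracy as the metric. As a rough illustration only, the sketch below computes exact-match top-1 accuracy over predicted action labels. The evaluation protocol used for the model is described in the paper, not here, so the normalization rule and the example labels are assumptions made for this sketch.

```python
def normalize(label: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count as errors."""
    return " ".join(label.lower().split())


def top1_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical labels, for illustration only
preds = ["cut onion", "Open  fridge", "wash pan", "take knife"]
refs = ["cut onion", "open fridge", "wash pot", "take knife"]
print(top1_accuracy(preds, refs))
```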
