Open-source LLaVAction-0.5B multimodal large model for efficient action recognition

Llavaction 0.5B

Developed by MLAdaptiveIntelligence

LLaVAction is a multimodal large language model for action recognition. It is built on the Qwen2 language model and trained on the EPIC-KITCHENS-100-MQA dataset.

Video-to-text

Transformers

English #Egocentric action recognition #Multimodal video question answering #Long video understanding

Downloads: 215

Released: 3/24/2025

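Model overview

The model focuses on video action recognition. It interprets the actions shown in egocentric (first-person) video and is suited to analyzing video content similar to EPIC-KITCHENS-100.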

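Model features

Multimodal understanding

Combines visual and language information to understand video content and generate related descriptions

Egocentric action recognition

Specifically targets hand-object interactions in first-person video

Large context window

Supports a 32K-token context window, which suits long video content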

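Model capabilities

Video content understanding

Action recognition

Multimodal question answering

Video frame analysis

Temporal information processing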

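Use cases

Smart home

Kitchen activity analysis

Identifies the operations a user performs in the kitchen

Recognizes common kitchen actions such as chopping vegetables and cooking

Behavioral research

Daily activity analysis

Studies patterns and habits in everyday human activity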

🚀 LLaVAction-0.5B

LLaVAction-0.5B is a multimodal large language model for action recognition. It is trained on top of the Qwen2 language model and handles video-to-text tasks.

🚀 Quick start

The LLaVAction-0.5B model is based on the Qwen2 language model and trained on the EPIC-KITCHENS-100-MQA dataset, with a context window of 32K tokens.

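Project page: https://mmathislab.github.io/llavaction/
Paper: for more details, see our paper (arxiv.org/abs/2503.18712)
Code repository: https://github.com/AdaptiveMotorControlLab/LLaVAction
Point of contact: Mackenzie Mathis
Supported language: English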

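✨ Key features

Multimodal processing: takes video and text as input and produces text output.
Action recognition: focuses on action recognition and can describe the actions in a video in detail.
Built on a strong language model: based on Qwen2, with a 32K-token context window.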
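
The 32K-token context window is shared by the visual tokens for the sampled frames, the text prompt, and the generated answer. The sketch below shows the arithmetic, assuming "32K" means 32,768 tokens. The helper names and the values `tokens_per_frame=196` and `prompt_tokens=100` are hypothetical placeholders, not figures from the model card; measure the real values for your own setup.

```python
def max_frames_for_context(tokens_per_frame: int, prompt_tokens: int,
                           max_new_tokens: int,
                           context_window: int = 32 * 1024) -> int:
    """Largest number of frames whose visual tokens still fit in the window.

    All arguments are counts of tokens. tokens_per_frame is an assumption
    supplied by the caller; this card does not state it.
    """
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    # Budget left for visual tokens after the prompt and the generated text.
    remaining = context_window - prompt_tokens - max_new_tokens
    return max(0, remaining // tokens_per_frame)


def fits_context(num_frames: int, tokens_per_frame: int, prompt_tokens: int,
                 max_new_tokens: int, context_window: int = 32 * 1024) -> bool:
    """True if num_frames frames plus prompt and output fit in the window."""
    limit = max_frames_for_context(tokens_per_frame, prompt_tokens,
                                   max_new_tokens, context_window)
    return num_frames <= limit


# Hypothetical numbers: 196 visual tokens per frame, a 100-token prompt.
# max_new_tokens=4096 and 64 frames match the usage example on this page.
print(max_frames_for_context(196, 100, 4096))  # 145
print(fits_context(64, 196, 100, 4096))        # True
```

Under these assumed numbers, 64 frames leave ample headroom. A larger per-frame token count or a longer prompt shrinks the limit proportionally.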

💻 Usage example

Basic usage

# Notebook cell syntax; in a shell, run the same command without the leading "!".
!pip install llavaction

from llavaction.model.builder import load_pretrained_model
from llavaction.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llavaction.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llavaction.conversation import conv_templates, SeparatorStyle
from PIL import Image
import requests
import copy
import torch
import sys
import warnings
from decord import VideoReader, cpu
import numpy as np
warnings.filterwarnings("ignore")

#Your video (it assumes an egocentric view point)
video_path = "XXXX"

#These are the prompts we trained with, but you can test others:
perspective_prompt = "You are seeing this video from egocentric view and you are the person. Your hands are sometimes interacting with objects. What action are you doing?"
task_prompt = "Describe in details what you see from the video frames."


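def load_video(video_path, max_frames_num, fps=1, force_sample=False):
    """Sample frames from a video.

    Returns (frames, frame_times_in_seconds, video_duration_in_seconds).
    """
    if max_frames_num == 0:
        # Assumed fix: return a 3-tuple (placeholder frame, no timestamps, zero duration)
        # so callers can unpack it like the normal path.
        return np.zeros((1, 336, 336, 3)), [], 0.0
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    total_frame_num = len(vr)
    avg_fps = vr.get_avg_fps()
    video_time = total_frame_num / avg_fps
    # Stride in frames that yields roughly `fps` sampled frames per second (at least 1).
    step = max(1, round(avg_fps / fps))
    frame_idx = list(range(0, total_frame_num, step))
    if len(frame_idx) > max_frames_num or force_sample:
        # Too many frames, or sampling forced: pick max_frames_num indices uniformly.
        frame_idx = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int).tolist()
    # Compute timestamps on every path; the original only set frame_time inside the branch above.
    frame_time = [i / avg_fps for i in frame_idx]
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    return spare_frames, frame_time, video_time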

pretrained = "MLAdaptiveIntelligence/LLaVAction-0.5B"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map)  # Add any other thing you want to pass in llava_model_args
model.eval()
max_frames_num = 64
video,frame_time,video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().to(torch.bfloat16)
video = [video]
conv_template = "qwen_1_5"  # Make sure you use correct chat template for different models
time_instruction = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. "
question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruction}\n{perspective_prompt} {task_prompt}"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
cont = model.generate(
    input_ids,
    images=video,
    modalities= ["video"],
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip()
print(text_outputs)
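
The frame-selection step inside `load_video` can be tried without a video file. The sketch below reproduces only the index arithmetic with NumPy. The function name `sample_frame_indices` and its parameters are illustrative and are not part of the `llavaction` package.

```python
import numpy as np


def sample_frame_indices(total_frames, native_fps, target_fps=1.0,
                         max_frames=64, force_uniform=False):
    """Pick frame indices and their timestamps (seconds) for a video.

    First take one frame every round(native_fps / target_fps) frames. If that
    yields more than max_frames indices, or force_uniform is set, fall back to
    max_frames indices spread evenly from the first to the last frame.
    """
    step = max(1, round(native_fps / target_fps))
    indices = list(range(0, total_frames, step))
    if force_uniform or len(indices) > max_frames:
        indices = np.linspace(0, total_frames - 1, max_frames, dtype=int).tolist()
    times = [i / native_fps for i in indices]
    return indices, times


# A 10-second clip at 30 fps, sampled at 1 frame per second.
idx, times = sample_frame_indices(300, 30.0)
print(idx)    # [0, 30, 60, 90, 120, 150, 180, 210, 240, 270]

# Forcing uniform sampling down to 4 frames.
idx, times = sample_frame_indices(300, 30.0, max_frames=4, force_uniform=True)
print(idx)    # [0, 99, 199, 299]
```

The usage example calls `load_video` with `force_sample=True`, so it always takes the uniform branch and passes exactly `max_frames_num` frames to the model.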

🔧 Technical details

Training details

For training details, see Ye et al. (2025): arxiv.org/abs/2503.18712

Model information

Property	Details
Architecture	SO400M + Qwen2
Initialized from	lmms-lab/llava-onevision-qwen2-0.5b-ov
Training data	EPIC-KITCHENS-100-MQA, 2 epochs, full model
Precision	bfloat16
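
As a rough guide to what bfloat16 precision means for memory, the sketch below estimates the size of the weights alone at 2 bytes per parameter. The parameter count of 500,000,000 is an assumption read off the "0.5B" in the model name; the real total, including the SO400M vision encoder, may be higher, and activations and the KV cache are not counted.

```python
# Storage size in bytes for one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(num_params: int, dtype: str = "bfloat16") -> float:
    """Approximate memory for the model weights only, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024 ** 3


# Assumed nominal parameter count taken from the "0.5B" in the model name.
nominal_params = 500_000_000
print(f"{weight_memory_gib(nominal_params):.2f} GiB in bfloat16")
print(f"{weight_memory_gib(nominal_params, 'float32'):.2f} GiB in float32")
```

Loading in bfloat16, as the usage example does with `torch_dtype="bfloat16"`, halves the weight footprint relative to float32.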

Hardware and software

GPUs: 32 × Nvidia GH-200 (used to train the whole model series)
Orchestration: HuggingFace Trainer
Neural network framework: PyTorch

📄 License

This project is licensed under CC BY-NC-SA 4.0.

📚 Citation

arXiv: arxiv.org/abs/2503.18712

@article{YeQi2025llavaction,
  title={LLaVAction: evaluating and training multi-modal large language models for action recognition},
  author={Ye, Shaokai and Qi, Haozhe and Mathis, Alexander and Mathis, Mackenzie W.},
  journal={arXiv preprint},
  year={2025}
}