LLaVAction-7B: an open-source action recognition model with egocentric (first-person) video understanding

LLaVAction-7B

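Developed by MLAdaptiveIntelligence

LLaVAction is a framework for evaluating and training multimodal large language models for action recognition. The model is built on the Qwen2 language model architecture and supports egocentric (first-person) video understanding.

Video-to-text

Transformers

English #egocentric action understanding #64-frame long video processing #multimodal question answering

Downloads: 149

Released: March 24, 2025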

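Model overview

LLaVAction-7B focuses on understanding human actions in egocentric video. It accepts up to 64 video frames as input and reports accuracies between 58.6% and 82.8% across seven video understanding benchmarks.

Model features

Egocentric understanding

Optimized specifically for egocentric video, so it can interpret actions and hand-object interactions as seen by the person performing them.

Long video processing

Accepts up to 64 sampled frames per video, which lets it cover longer clips.

Multimodal fusion

Combines visual and language information for video understanding and question answering.

Benchmark performance

Reports 59% on EgoSchema and 61.1% on MVBench, among other video understanding benchmarks.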

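Model capabilities

Video content understanding

Action recognition

Multimodal question answering

Long video analysis

Egocentric understanding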

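Use cases

Smart home

Kitchen activity analysis

Analyze a user's cooking activities in the kitchen.

Recognizes actions such as chopping and cooking.

Behavioral research

Daily activity analysis

Study patterns in everyday human activity.

Recognizes and classifies a range of daily activities.

Assistive technology

Action guidance

Provide action guidance for users with special needs.

Interprets what the user is doing in order to guide them through a specific action.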

🚀 LLaVAction-7B

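LLaVAction-7B is a multimodal large language model for action recognition. It is trained from the Qwen2 language model, processes up to 64 video frames, and is evaluated on several multimodal benchmarks (accuracies are listed in the evaluation table).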

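🚀 Quick start

The LLaVAction-7B model is based on the Qwen2 language model and trained on the EPIC-KITCHENS-100-MQA dataset. It has a context window of 32K tokens and supports up to 64 video frames.

Project page: https://mmathislab.github.io/llavaction/
Paper: see our paper for more details (arxiv.org/abs/2503.18712)
Code repository: https://github.com/AdaptiveMotorControlLab/LLaVAction
Contact: Mackenzie Mathis
Supported language: English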

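✨ Key features

Based on the Qwen2 language model, with a 32K-token context window.
Processes up to 64 video frames.
Evaluated on several multimodal benchmarks (accuracies are listed in the evaluation table).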
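
The two limits above interact: every sampled frame consumes part of the 32K-token context. As a rough back-of-the-envelope sketch, the snippet below computes the upper bound on visual tokens per frame if the frames shared the context evenly. It assumes "32K" means 32,768 tokens. The function name and the text-token reservation are hypothetical, and the model's actual per-frame token count is not stated here.

```python
CONTEXT_WINDOW = 32 * 1024  # assumption: "32K" means 32,768 tokens
MAX_FRAMES = 64             # frame limit stated for LLaVAction-7B


def max_tokens_per_frame(num_frames: int, reserved_text_tokens: int = 0) -> int:
    """Upper bound on visual tokens per frame if frames share the context evenly.

    `reserved_text_tokens` is a hypothetical allowance for the prompt and the
    generated answer; it is not a value taken from the model card.
    """
    if not 1 <= num_frames <= MAX_FRAMES:
        raise ValueError(f"num_frames must be between 1 and {MAX_FRAMES}")
    available = CONTEXT_WINDOW - reserved_text_tokens
    if available < 0:
        raise ValueError("reserved_text_tokens exceeds the context window")
    return available // num_frames


print(max_tokens_per_frame(64))        # 512
print(max_tokens_per_frame(64, 4096))  # 448
```

With all 64 frames in use, no frame can occupy more than 512 tokens, and fewer once room is left for the prompt and the answer.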

📦 Installation

Install the llavaction library before use (the leading `!` is notebook syntax; omit it in a regular shell):

!pip install llavaction
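
Before running the example, it can help to confirm that the package is importable in the current environment. The following is a minimal sketch using only the standard library. The helper name `is_installed` is hypothetical, and the check assumes the import name matches the pip package name `llavaction`.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the top-level module `module_name` can be found."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("llavaction"):
        print("llavaction is available")
    else:
        print("llavaction not found; run: pip install llavaction")
```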

💻 Usage examples

Basic usage

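# Imports for the example. The llavaction module paths below are assumed to
# mirror the LLaVA-NeXT `llava` package layout; adjust them if your install differs.
import copy

import numpy as np
import torch
from decord import VideoReader, cpu
from llavaction.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llavaction.conversation import conv_templates
from llavaction.mm_utils import tokenizer_image_token
from llavaction.model.builder import load_pretrained_model

# Path to your video (the model assumes an egocentric viewpoint).
video_path = "XXXX"

# These are the prompts the model was trained with, but you can test others.
perspective_prompt = "You are seeing this video from egocentric view and you are the person. Your hands are sometimes interacting with objects. What action are you doing?"
task_prompt = "Describe in details what you see from the video frames."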

def load_video(video_path, max_frames_num, fps=1, force_sample=False):
    """Return (frames, frame_times_in_seconds, video_duration_in_seconds)."""
    if max_frames_num == 0:
        # Return the same 3-tuple as the normal path so callers can unpack it.
        return np.zeros((1, 336, 336, 3)), [], 0.0
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    total_frame_num = len(vr)
    avg_fps = vr.get_avg_fps()
    video_time = total_frame_num / avg_fps
    # Sample roughly `fps` frames per second of video (stride of at least 1).
    stride = max(1, round(avg_fps / fps))
    frame_idx = list(range(0, total_frame_num, stride))
    if len(frame_idx) > max_frames_num or force_sample:
        # Too many frames (or sampling forced): take max_frames_num evenly spaced frames.
        frame_idx = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int).tolist()
    # Timestamp (in seconds) of each sampled frame, defined on every code path.
    frame_time = [i / avg_fps for i in frame_idx]
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    return spare_frames, frame_time, video_time

pretrained = "MLAdaptiveIntelligence/LLaVAction-7B"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
# Any other llava_model_args can be passed as extra keyword arguments here.
tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map
)
model.eval()

# Sample 64 frames uniformly and convert them to a bfloat16 tensor on the GPU.
max_frames_num = 64
video, frame_time, video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].to(device, dtype=torch.bfloat16)
video = [video]

# Build the prompt. Make sure you use the correct chat template for the model.
conv_template = "qwen_1_5"
time_instruction = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. "
question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruction}\n{perspective_prompt} {task_prompt}"

conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)

# Greedy decoding (no sampling).
cont = model.generate(
    input_ids,
    images=video,
    modalities=["video"],
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip()
print(text_outputs)
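
The uniform sampling step in `load_video` is what caps the input at 64 frames. The sketch below reproduces the idea in plain Python so the chosen indices are easy to inspect. The function name `uniform_frame_indices` is hypothetical, and the integer arithmetic approximates `np.linspace(..., dtype=int)` rather than calling it, so results could differ in rare floating-point edge cases.

```python
def uniform_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick `num_samples` evenly spaced frame indices from 0 to total_frames - 1."""
    if total_frames < 1 or num_samples < 1:
        raise ValueError("total_frames and num_samples must be positive")
    if num_samples == 1:
        return [0]
    last = total_frames - 1
    # Floor of i * last / (num_samples - 1), computed with exact integer math.
    return [(i * last) // (num_samples - 1) for i in range(num_samples)]


print(uniform_frame_indices(10, 4))   # [0, 3, 6, 9]
print(uniform_frame_indices(100, 5))  # [0, 24, 49, 74, 99]
print(uniform_frame_indices(3, 5))    # [0, 0, 1, 1, 2]
```

The last call shows a consequence of forcing a fixed sample count: when a clip has fewer frames than requested, some frames are repeated.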

📚 Detailed documentation

Model

Architecture: SO400M + Qwen2
Initialized from: lmms-lab/LLaVA-Video-7B-Qwen2
Data: a mixture of the LLaVA-178K and EPIC-KITCHENS-100-MQA datasets, 2 epochs, full model training
Precision: bfloat16
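
The bfloat16 precision gives a quick way to estimate the memory needed just to hold the weights. The sketch below assumes a nominal 7 billion parameters, taken from the "7B" in the model name. The exact count, including the vision encoder, is not given here, and activations and the KV cache are ignored. The function name is hypothetical.

```python
# Bytes used to store one parameter in each floating-point format.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gib(num_params: float, dtype: str = "bfloat16") -> float:
    """Approximate memory (in GiB) needed to store the weights alone."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


nominal_params = 7e9  # assumption: "7B" read as 7 billion parameters
print(f"bfloat16: {weight_memory_gib(nominal_params):.1f} GiB")
print(f"float32:  {weight_memory_gib(nominal_params, 'float32'):.1f} GiB")
```

Under these assumptions the weights take about 13 GiB in bfloat16, half of what float32 would need.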

Hardware and software

GPU: 32 × Nvidia GH-200 (used for training the whole model series)
Orchestration: HuggingFace Trainer
Neural network framework: PyTorch

Evaluation metrics

Dataset	Accuracy (%)
EgoSchema	59
MVBench	61.1
NextQA	82.8
PercepTest	70.2
LongVideoBench	58.6
VideoMME	63.9
VideoMME (w-subs)	71.4
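
To compare the reported numbers programmatically, the table can be held in a small dictionary. This is only an illustrative sketch: the values are copied from the table above, and the variable names are hypothetical.

```python
# Accuracy (%) per benchmark, copied from the evaluation table.
ACCURACY = {
    "EgoSchema": 59.0,
    "MVBench": 61.1,
    "NextQA": 82.8,
    "PercepTest": 70.2,
    "LongVideoBench": 58.6,
    "VideoMME": 63.9,
    "VideoMME (w-subs)": 71.4,
}

best = max(ACCURACY, key=ACCURACY.get)
worst = min(ACCURACY, key=ACCURACY.get)
# Difference between VideoMME with and without subtitles, in percentage points.
subtitle_gain = round(ACCURACY["VideoMME (w-subs)"] - ACCURACY["VideoMME"], 1)

print(f"highest: {best} ({ACCURACY[best]})")
print(f"lowest: {worst} ({ACCURACY[worst]})")
print(f"VideoMME gain with subtitles: {subtitle_gain} points")
```

On these figures, NextQA is the strongest result, LongVideoBench the weakest, and subtitles add 7.5 points on VideoMME.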

🔧 Technical details

For the technical details of LLaVAction-7B, see the 2025 paper by Ye et al.: arxiv.org/abs/2503.18712.

📄 License

This project is released under the CC-BY-NC-SA-4.0 license.

📚 Citation

@article{YeQi2025llavaction,
  title={LLaVAction: evaluating and training multi-modal large language models for action recognition},
  author={Ye, Shaokai and Qi, Haozhe and Mathis, Alexander and Mathis, Mackenzie W.},
  journal={arXiv preprint arXiv:2503.18712},
  year={2025}
}