videomae-base-finetuned-ucfcrime-full: an open-source video classification model

Videomae Base Finetuned Ucfcrime Full

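Developed by archit11

A video classification model fine-tuned on the UCF-CRIME dataset with the VideoMAE framework, focused on vandalism detection

Video processing

Transformers

#VideoAnomalyDetection #VandalismRecognition #UCF-CRIME-FineTuning

Downloads: 85

Released: 3/17/2024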

Model overview

This model is a fine-tuned version of MCG-NJU/videomae-base on the UCF-CRIME dataset, intended mainly for detecting and classifying vandalism in video.

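Model features

Vandalism detection

Recognizes and classifies vandalism in video

Based on the VideoMAE framework

Pre-trained with the efficient VideoMAE self-supervised learning framework

Fine-tuned on UCF-CRIME

Fine-tuned on the public UCF-CRIME dataset, with a focus on anomalous behavior recognition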

Model capabilities

Video classification

Vandalism detection

Real-time video analysis

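Use cases

Security monitoring

Anomalous behavior detection in public places

Detects vandalism or other anomalous activity in public places

Smart home

Home security monitoring

Detects possible vandalism through home cameras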

🚀 videomae-base-finetuned-ucfcrime-full2

This model is a fine-tuned version of MCG-NJU/videomae-base on the UCF-CRIME dataset. Code: github. It achieves the following results on the evaluation set:

Loss: 2.5014
Accuracy: 0.225
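
For context, these figures can be compared with what uniform guessing would score. The sketch below assumes the 13 UCF-CRIME classes mentioned in the code comment of the usage example, and a balanced evaluation set; the class balance is an assumption, since this card does not describe the evaluation data.

```python
import math

NUM_CLASSES = 13  # assumption: taken from the code comment in the usage example


def chance_accuracy(num_classes: int) -> float:
    """Expected accuracy of uniform random guessing on balanced classes."""
    return 1.0 / num_classes


def uniform_cross_entropy(num_classes: int) -> float:
    """Cross-entropy loss of a model that assigns equal probability to every class."""
    return math.log(num_classes)


print(f"chance accuracy: {chance_accuracy(NUM_CLASSES):.3f}")
print(f"uniform-prediction loss: {uniform_cross_entropy(NUM_CLASSES):.3f}")
```

Under these assumptions, the reported accuracy of 0.225 is roughly three times the chance level of about 0.077, while the reported loss of 2.5014 is close to the uniform-prediction value of about 2.565.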

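🚀 Quick start

This model is fine-tuned from MCG-NJU/videomae-base on the UCF-CRIME dataset and can be used for video classification, in particular vandalism detection.

✨ Key features

Based on the VideoMAE architecture, suited to video classification.
Fine-tuned on the UCF-CRIME dataset.

📦 Installation

The documentation gives no installation steps; refer to the installation instructions for the original model, MCG-NJU/videomae-base.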
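
Since no steps are given, one way to see what is needed is to check the modules that the usage examples in this document import. The following is a minimal sketch; the module list is read off those examples, and `missing_modules` is a hypothetical helper name, not part of any library.

```python
import importlib.util

# Top-level modules imported by the usage examples in this document.
# Note: cv2 is usually installed through the opencv-python package.
REQUIRED_MODULES = ["cv2", "torch", "numpy", "transformers", "av", "huggingface_hub"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All modules used by the examples are available.")
```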

💻 Usage examples

Basic usage

Run inference on a phone camera stream (requires installing the ipwebcam app on the phone from its app store):

import cv2
import torch
import numpy as np
from transformers import AutoImageProcessor, VideoMAEForVideoClassification

np.random.seed(0)

def preprocess_frames(frames, image_processor):
    inputs = image_processor(frames, return_tensors="pt")
    inputs = {k: v.to(device) for k, v in inputs.items()}  # Move tensors to the same device as the model
    return inputs

# Initialize the video capture object, replace ip addr with the local ip of your phone  (will be shown in the ipwebcam app)
cap = cv2.VideoCapture('http://192.168.229.98:8080/video')

# Set the frame size (optional)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)

image_processor = AutoImageProcessor.from_pretrained("archit11/videomae-base-finetuned-ucfcrime-full")
model = VideoMAEForVideoClassification.from_pretrained("archit11/videomae-base-finetuned-ucfcrime-full")

# Move the model to the GPU when one is available, otherwise keep it on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

frame_buffer = []
buffer_size = 16
previous_labels = []
top_confidences = []  # Initialize top_confidences

while True:
    ret, frame = cap.read()

    if not ret:
        print("Failed to capture frame")
        break

    # Add the current frame to the buffer; OpenCV returns BGR, the image processor expects RGB
    frame_buffer.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

    # Check if we have enough frames for inference
    if len(frame_buffer) >= buffer_size:
        # Preprocess the frames
        inputs = preprocess_frames(frame_buffer, image_processor)

        with torch.no_grad():
            outputs = model(**inputs)
            logits = outputs.logits

        # Get the top 3 predicted labels and their confidence scores
        top_k = 3
        probs = torch.softmax(logits, dim=-1)
        top_probs, top_indices = torch.topk(probs, top_k)
        top_labels = [model.config.id2label[idx.item()] for idx in top_indices[0]]
        top_confidences = top_probs[0].tolist()  # Update top_confidences

        # Check if the predicted labels are different from the previous labels
        if top_labels != previous_labels:
            previous_labels = top_labels
            print("Predicted class:", top_labels[0])  # Print the predicted class for debugging

        # Clear the frame buffer and continue from the next frame
        frame_buffer.clear()

    # Draw the most recent predicted labels and confidence scores on every frame
    for i, (label, confidence) in enumerate(zip(previous_labels, top_confidences)):
        label_text = f"{label}: {confidence:.2f}"
        cv2.putText(frame, label_text, (10, 30 + i * 30), cv2.FONT_HERSHEY_SIMPLEX, 0.9, (0, 0, 255), 2)

    # Display the resulting frame (outside the inference branch, so the window updates on every frame)
    cv2.imshow('Video', frame)

    # Press 'q' to quit
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

# Release everything when done
cap.release()
cv2.destroyAllWindows()
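
The loop above classifies non-overlapping 16-frame clips, so the predicted label refreshes only once every 16 frames. One possible variation is a sliding window that keeps the most recent frames and signals inference every few frames. The `SlidingClipBuffer` class and its `stride` parameter are hypothetical names introduced for this sketch; they are not part of the model or of transformers.

```python
from collections import deque


class SlidingClipBuffer:
    """Keep the latest `clip_len` frames and signal when a new clip is due."""

    def __init__(self, clip_len=16, stride=4):
        self.frames = deque(maxlen=clip_len)  # oldest frames drop off automatically
        self.clip_len = clip_len
        self.stride = stride
        self._since_last = 0  # frames added since the last emitted clip

    def push(self, frame):
        """Add a frame; return a list of `clip_len` frames when a clip is due, else None."""
        self.frames.append(frame)
        self._since_last += 1
        if len(self.frames) == self.clip_len and self._since_last >= self.stride:
            self._since_last = 0
            return list(self.frames)
        return None


# Demo with integers standing in for video frames
buf = SlidingClipBuffer(clip_len=4, stride=2)
clips = [clip for clip in (buf.push(i) for i in range(8)) if clip is not None]
print(clips)  # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7]]
```

In the webcam loop, each returned clip could be passed to `preprocess_frames` in place of `frame_buffer`. A smaller stride gives more frequent label updates at the cost of more model calls.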

Advanced usage

A simple example that classifies a video file:

import av
import torch
import numpy as np

from transformers import AutoImageProcessor, VideoMAEForVideoClassification
from huggingface_hub import hf_hub_download

np.random.seed(0)


def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.
    Args:
        container (`av.container.input.InputContainer`): PyAV container.
        indices (`List[int]`): List of frame indices to decode.
    Returns:
        result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])


def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
    '''
    Sample a given number of frame indices from the video.
    Args:
        clip_len (`int`): Total number of frames to sample.
        frame_sample_rate (`int`): Sample every n-th frame.
        seg_len (`int`): Maximum allowed index of sample's last frame.
    Returns:
        indices (`List[int]`): List of sampled frame indices
    '''
    converted_len = int(clip_len * frame_sample_rate)
    end_idx = np.random.randint(converted_len, seg_len)
    start_idx = end_idx - converted_len
    indices = np.linspace(start_idx, end_idx, num=clip_len)
    indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
    return indices


# video clip consists of 300 frames (10 seconds at 30 FPS)
file_path = hf_hub_download(
    repo_id="nielsr/video-demo", filename="eating_spaghetti.mp4", repo_type="dataset"
)
# use any other video just replace `file_path` with the video path
container = av.open(file_path)

# sample 16 frames
indices = sample_frame_indices(clip_len=16, frame_sample_rate=1, seg_len=container.streams.video[0].frames)
video = read_video_pyav(container, indices)

image_processor = AutoImageProcessor.from_pretrained("archit11/videomae-base-finetuned-ucfcrime-full")
model = VideoMAEForVideoClassification.from_pretrained("archit11/videomae-base-finetuned-ucfcrime-full")

inputs = image_processor(list(video), return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# model predicts one of the 13 ucf-crime classes
predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
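
`sample_frame_indices` above draws a random window through `np.random.randint(converted_len, seg_len)`, which raises an error when the video has no more frames than the window needs. A deterministic alternative that spreads the clip over the whole video could look like the sketch below. `uniform_frame_indices` is a hypothetical helper written for illustration, not a transformers function.

```python
def uniform_frame_indices(clip_len, total_frames):
    """Return `clip_len` frame indices spread evenly over [0, total_frames - 1].

    Videos shorter than `clip_len` are handled by repeating indices instead of raising.
    """
    if clip_len <= 0 or total_frames <= 0:
        raise ValueError("clip_len and total_frames must be positive")
    if clip_len == 1:
        return [0]
    step = (total_frames - 1) / (clip_len - 1)
    return [round(i * step) for i in range(clip_len)]


print(uniform_frame_indices(16, 300))  # 16 indices from 0 to 299
print(uniform_frame_indices(4, 2))     # [0, 0, 1, 1]
```

When indices repeat, `read_video_pyav` as written returns fewer than `clip_len` frames, because it appends each decoded frame at most once. Short videos would therefore also need the decoded frames duplicated to match the index list.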

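📚 Documentation

Training and evaluation data

More information needed.

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

Learning rate: 5e-05
Training batch size: 8
Evaluation batch size: 8
Random seed: 42
Optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
Learning rate scheduler type: linear
Learning rate scheduler warmup ratio: 0.1
Training steps: 700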
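
To make the schedule concrete: with a warmup ratio of 0.1 over 700 steps, the learning rate would rise linearly to 5e-05 over the first 70 steps and then decay linearly to zero. The function below is a plain-Python sketch of that shape. It assumes the warmup length is the ratio times the total step count, and it is not the training code used for this model.

```python
def linear_warmup_decay_lr(step, base_lr=5e-05, total_steps=700, warmup_ratio=0.1):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    warmup_steps = round(total_steps * warmup_ratio)  # assumption: 70 steps here
    if warmup_steps > 0 and step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to base_lr
        return base_lr * step / warmup_steps
    # Decay phase: scale linearly from base_lr down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for s in (0, 35, 70, 385, 700):
    print(s, linear_warmup_decay_lr(s))
```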

Training results

Training loss	Epoch	Step	Validation loss	Accuracy
2.5836	0.13	88	2.4944	0.2080
2.3212	1.13	176	2.5855	0.1773
2.2333	2.13	264	2.6270	0.1046
1.985	3.13	352	2.4058	0.2109
2.194	4.13	440	2.3654	0.2235
1.9796	5.13	528	2.2609	0.2235
1.8786	6.13	616	2.2725	0.2341
1.71	7.12	700	2.2228	0.2226
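
The table can be scanned programmatically to find which evaluation step looked best. The sketch below copies the rows above into a list, then picks the step with the highest accuracy and the step with the lowest validation loss.

```python
# (training_loss, epoch, step, validation_loss, accuracy), copied from the table above
ROWS = [
    (2.5836, 0.13, 88, 2.4944, 0.2080),
    (2.3212, 1.13, 176, 2.5855, 0.1773),
    (2.2333, 2.13, 264, 2.6270, 0.1046),
    (1.985, 3.13, 352, 2.4058, 0.2109),
    (2.194, 4.13, 440, 2.3654, 0.2235),
    (1.9796, 5.13, 528, 2.2609, 0.2235),
    (1.8786, 6.13, 616, 2.2725, 0.2341),
    (1.71, 7.12, 700, 2.2228, 0.2226),
]

best_acc = max(ROWS, key=lambda row: row[4])   # row with the highest accuracy
best_loss = min(ROWS, key=lambda row: row[3])  # row with the lowest validation loss

print(f"highest accuracy {best_acc[4]:.4f} at step {best_acc[2]}")
print(f"lowest validation loss {best_loss[3]:.4f} at step {best_loss[2]}")
```

On these rows the highest accuracy (0.2341) is at step 616 and the lowest validation loss (2.2228) is at step 700.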