# 🚀 videomae-base-finetuned-ucfcrime-full2
This is a video classification model fine-tuned from [MCG-NJU/videomae-base](https://huggingface.co/MCG-NJU/videomae-base) on the [UCF-CRIME](https://paperswithcode.com/dataset/ucf-crime) dataset. It provides a solution for video-based crime detection and classification tasks, including vandalism. The code for this project can be found on [GitHub](https://github.com/archit-spec/majorproject).
## 🚀 Quick Start
This model is a fine-tuned version of [MCG-NJU/videomae-base](https://huggingface.co/MCG-NJU/videomae-base) on the [UCF-CRIME](https://paperswithcode.com/dataset/ucf-crime) dataset. It achieves the following results on the evaluation set:
- Loss: 2.5014
- Accuracy: 0.225
## ✨ Features
- **Fine-tuned Model**: Based on the pre-trained `videomae-base` model and fine-tuned on the UCF-CRIME dataset, making it suitable for video classification tasks in the field of crime detection (the full label set can be listed as shown below).
- **Evaluation Metric**: Evaluated using accuracy, providing a clear measure of model performance.
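To see exactly which classes the checkpoint predicts, you can inspect its `id2label` mapping. The snippet below is a minimal sketch that fetches only the model config (no weights) via `AutoConfig`:

```python
from transformers import AutoConfig

# Fetch just the config of the fine-tuned checkpoint and print its label set.
config = AutoConfig.from_pretrained("archit11/videomae-base-finetuned-ucfcrime-full")
for idx, label in sorted(config.id2label.items()):
    print(idx, label)
```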
## 📦 Installation
No specific installation steps are provided in the original README. To use this model, install the libraries required by the code examples: `transformers`, `torch`, `av` (PyAV), and `opencv-python` (which provides `cv2`). For example, to install the `transformers` library:

```bash
pip install transformers
```
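If you prefer a single command covering everything used in the examples below (these are the PyPI package names; versions are up to you):

```bash
pip install transformers torch av opencv-python
```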
## 💻 Usage Examples
### Basic Usage
```python
import av
import torch
import numpy as np
from transformers import AutoImageProcessor, VideoMAEForVideoClassification
from huggingface_hub import hf_hub_download

np.random.seed(0)


def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.
    Args:
        container (`av.container.input.InputContainer`): PyAV container.
        indices (`List[int]`): List of frame indices to decode.
    Returns:
        result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])


def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
    '''
    Sample a given number of frame indices from the video.
    Args:
        clip_len (`int`): Total number of frames to sample.
        frame_sample_rate (`int`): Sample every n-th frame.
        seg_len (`int`): Maximum allowed index of sample's last frame.
    Returns:
        indices (`List[int]`): List of sampled frame indices
    '''
    converted_len = int(clip_len * frame_sample_rate)
    end_idx = np.random.randint(converted_len, seg_len)
    start_idx = end_idx - converted_len
    indices = np.linspace(start_idx, end_idx, num=clip_len)
    indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
    return indices


# The demo video clip consists of 300 frames (10 seconds at 30 FPS);
# to use any other video, just replace `file_path` with the video path.
file_path = hf_hub_download(
    repo_id="nielsr/video-demo", filename="eating_spaghetti.mp4", repo_type="dataset"
)
container = av.open(file_path)

# sample 16 frames
indices = sample_frame_indices(clip_len=16, frame_sample_rate=1, seg_len=container.streams.video[0].frames)
video = read_video_pyav(container, indices)

image_processor = AutoImageProcessor.from_pretrained("archit11/videomae-base-finetuned-ucfcrime-full")
model = VideoMAEForVideoClassification.from_pretrained("archit11/videomae-base-finetuned-ucfcrime-full")

inputs = image_processor(list(video), return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# the model predicts one of the 13 UCF-Crime classes
predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
```
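If a single argmax label is not enough, the short continuation below (not part of the original example; it reuses `torch`, `model`, and `logits` from the snippet above) prints the three most likely classes with their softmax probabilities:

```python
# Continuation of the Basic Usage example: top-3 classes with probabilities.
probs = torch.softmax(logits, dim=-1)
top_probs, top_ids = torch.topk(probs, k=3)
for p, i in zip(top_probs[0], top_ids[0]):
    print(f"{model.config.id2label[i.item()]}: {p.item():.3f}")
```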
### Advanced Usage
```python
import cv2
import torch
import numpy as np
from transformers import AutoImageProcessor, VideoMAEForVideoClassification

np.random.seed(0)


def preprocess_frames(frames, image_processor):
    inputs = image_processor(frames, return_tensors="pt")
    inputs = {k: v.to(device) for k, v in inputs.items()}  # Move tensors to the model's device
    return inputs


# Initialize the video capture object; replace the IP address with your phone's
# local IP (shown in the IP Webcam app)
cap = cv2.VideoCapture('http://192.168.229.98:8080/video')

# Set the frame size (optional)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)

image_processor = AutoImageProcessor.from_pretrained("archit11/videomae-base-finetuned-ucfcrime-full")
model = VideoMAEForVideoClassification.from_pretrained("archit11/videomae-base-finetuned-ucfcrime-full")

# Move the model to the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

frame_buffer = []
buffer_size = 16
previous_labels = []
top_confidences = []  # Initialize top_confidences

while True:
    ret, frame = cap.read()
    if not ret:
        print("Failed to capture frame")
        break

    # Add the current frame to the buffer; OpenCV captures BGR, while the
    # image processor expects RGB, so convert before buffering
    frame_buffer.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

    # Check if we have enough frames for inference
    if len(frame_buffer) >= buffer_size:
        # Preprocess the frames
        inputs = preprocess_frames(frame_buffer, image_processor)
        with torch.no_grad():
            outputs = model(**inputs)
            logits = outputs.logits

        # Get the top 3 predicted labels and their confidence scores
        top_k = 3
        probs = torch.softmax(logits, dim=-1)
        top_probs, top_indices = torch.topk(probs, top_k)
        top_labels = [model.config.id2label[idx.item()] for idx in top_indices[0]]
        top_confidences = top_probs[0].tolist()  # Update top_confidences

        # Check if the predicted labels are different from the previous labels
        if top_labels != previous_labels:
            previous_labels = top_labels
            print("Predicted class:", top_labels[0])  # Print the predicted class for debugging

        # Clear the frame buffer and continue from the next frame
        frame_buffer.clear()

    # Display the predicted labels and confidence scores on the frame
    for i, (label, confidence) in enumerate(zip(previous_labels, top_confidences)):
        label_text = f"{label}: {confidence:.2f}"
        cv2.putText(frame, label_text, (10, 30 + i * 30), cv2.FONT_HERSHEY_SIMPLEX, 0.9, (0, 0, 255), 2)

    # Display the resulting frame
    cv2.imshow('Video', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

# Release everything when done
cap.release()
cv2.destroyAllWindows()
```
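Note that the script clears the frame buffer after each prediction, so labels refresh only once every 16 captured frames. A sliding-window variant (a sketch, not part of the original script; it reuses `frame_buffer`, `buffer_size`, `preprocess_frames`, `image_processor`, `model`, and `torch` from above) trades extra compute for smoother updates:

```python
# Hypothetical replacement for the `if len(frame_buffer) >= buffer_size:` block:
# run inference on the last 16 frames, then drop only `stride` frames so that
# consecutive windows overlap and labels refresh every `stride` captured frames.
stride = 4
if len(frame_buffer) >= buffer_size:
    inputs = preprocess_frames(frame_buffer[-buffer_size:], image_processor)
    with torch.no_grad():
        logits = model(**inputs).logits
    print("Predicted class:", model.config.id2label[logits.argmax(-1).item()])
    del frame_buffer[:stride]  # keep the overlap for the next window
```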
## 📚 Documentation
### Training and evaluation data
More information needed
### Training procedure
#### Training hyperparameters
The following hyperparameters were used during training (a rough `TrainingArguments` equivalent is sketched after the list):
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- training_steps: 700
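For reference, these settings map roughly onto `transformers.TrainingArguments` as sketched below. This is a hedged reconstruction, not the author's training script; `output_dir` is a placeholder, and the Adam betas/epsilon listed above are the optimizer defaults:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="videomae-base-finetuned-ucfcrime-full",  # placeholder
    learning_rate=5e-05,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    max_steps=700,  # the card reports 700 training steps
)
```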
#### Training results
| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:--------:|
| 2.5836        | 0.13  | 88   | 2.4944          | 0.2080   |
| 2.3212        | 1.13  | 176  | 2.5855          | 0.1773   |
| 2.2333        | 2.13  | 264  | 2.6270          | 0.1046   |
| 1.985         | 3.13  | 352  | 2.4058          | 0.2109   |
| 2.194         | 4.13  | 440  | 2.3654          | 0.2235   |
| 1.9796        | 5.13  | 528  | 2.2609          | 0.2235   |
| 1.8786        | 6.13  | 616  | 2.2725          | 0.2341   |
| 1.71          | 7.12  | 700  | 2.2228          | 0.2226   |
#### Framework versions
- Transformers 4.38.1
- Pytorch 2.1.2
- Datasets 2.1.0
- Tokenizers 0.15.2
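To approximate this environment, pinning the corresponding PyPI packages should work (a suggestion, not from the original card; `torch` wheels can require a platform-specific index URL):

```bash
pip install transformers==4.38.1 torch==2.1.2 datasets==2.1.0 tokenizers==0.15.2
```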
## 📄 License
This model is licensed under the CC-BY-NC-4.0 license.