VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
VideoLLaMA 2 is a cutting-edge video large language model (Video-LLM) that advances spatial-temporal modeling and audio understanding, offering powerful solutions for video and audio-visual question-answering tasks.
Metadata

| Property | Details |
| --- | --- |
| License | Apache-2.0 |
| Datasets | lmms-lab/ClothoAQA, Loie/VGGSound |
| Language | en |
| Metrics | accuracy |
| Pipeline Tag | visual-question-answering |
| Library Name | transformers |
| Tags | Audio-visual Question Answering, Audio Question Answering, multimodal large language model |
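The datasets listed in the table are Hugging Face Hub repositories. As a rough, non-authoritative sketch, they can be pulled with the datasets library; this assumes the lmms-lab/ClothoAQA repo loads with the default configuration, so check the dataset card if extra arguments are needed:

from datasets import load_dataset

# Assumption: the repo loads with its default configuration; consult the dataset
# card on the Hub if a specific config name or split is required.
clotho_aqa = load_dataset("lmms-lab/ClothoAQA")
print(clotho_aqa)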
If you like our project, please give us a star ⭐ on GitHub for the latest updates.
News
Model Zoo
Vision-Only Checkpoints
Audio-Visual Checkpoints
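The checkpoints are hosted on the Hugging Face Hub. As a minimal sketch, a local copy can be fetched with huggingface_hub; the repo id below is the audio-visual default used in the usage example further down, and other checkpoints from the model zoo follow the same pattern:

from huggingface_hub import snapshot_download

# Download the audio-visual checkpoint used as the default --model-path below.
local_dir = snapshot_download(repo_id="DAMO-NLP-SG/VideoLLaMA2.1-7B-AV")
print(local_dir)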
Main Results
Multi-Choice Video QA & Video Captioning

Open-Ended Video QA

Multi-Choice & Open-Ended Audio QA

Open-Ended Audio-Visual QA

Usage Examples
Basic Usage
import sys
sys.path.append('./')

import argparse

from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init


def inference(args):
    model_path = args.model_path
    model, processor, tokenizer = model_init(model_path)

    # Disable the tower that is not needed for the chosen modality.
    if args.modal_type == "a":
        model.model.vision_tower = None
    elif args.modal_type == "v":
        model.model.audio_tower = None
    elif args.modal_type == "av":
        pass
    else:
        raise NotImplementedError

    # Example 1: Audio-visual inference (video frames + audio track).
    audio_video_path = "assets/00003491.mp4"
    preprocess = processor['audio' if args.modal_type == "a" else "video"]
    if args.modal_type == "a":
        audio_video_tensor = preprocess(audio_video_path)
    else:
        audio_video_tensor = preprocess(audio_video_path, va=True if args.modal_type == "av" else False)
    question = "Please describe the video with audio information."

    # Example 2: Audio-only inference. Note that each example overwrites the
    # previous inputs, so keep only the block that matches your --modal-type.
    audio_video_path = "assets/bird-twitter-car.wav"
    preprocess = processor['audio' if args.modal_type == "a" else "video"]
    if args.modal_type == "a":
        audio_video_tensor = preprocess(audio_video_path)
    else:
        audio_video_tensor = preprocess(audio_video_path, va=True if args.modal_type == "av" else False)
    question = "Please describe the audio."

    # Example 3: Video-only inference.
    audio_video_path = "assets/output_v_1jgsRbGzCls.mp4"
    preprocess = processor['audio' if args.modal_type == "a" else "video"]
    if args.modal_type == "a":
        audio_video_tensor = preprocess(audio_video_path)
    else:
        audio_video_tensor = preprocess(audio_video_path, va=True if args.modal_type == "av" else False)
    question = "What activity are the people practicing in the video?"

    # Run inference on the last prepared input/question pair.
    output = mm_infer(
        audio_video_tensor,
        question,
        model=model,
        tokenizer=tokenizer,
        modal='audio' if args.modal_type == "a" else "video",
        do_sample=False,
    )
    print(output)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--model-path', help='Path or Hugging Face repo id of the checkpoint.', required=False, default='DAMO-NLP-SG/VideoLLaMA2.1-7B-AV')
    parser.add_argument('--modal-type', choices=["a", "v", "av"], help='a: audio only, v: video only, av: audio-visual.', required=True)
    args = parser.parse_args()
    inference(args)
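The inference() function can also be driven programmatically instead of via the command line. This is a minimal sketch under the assumption that the script above has been saved as a module you can import from (the file name is up to you and not defined by the project); it simply mirrors the Namespace that argparse would produce:

from argparse import Namespace

# Hypothetical programmatic invocation of the inference() function defined above.
# modal_type="av" runs joint audio-visual inference with the default checkpoint.
args = Namespace(model_path="DAMO-NLP-SG/VideoLLaMA2.1-7B-AV", modal_type="av")
inference(args)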
License

This project is released under the Apache-2.0 license.

Citation

If you find VideoLLaMA useful for your research and applications, please cite using this BibTeX:
@article{damonlpsg2024videollama2,
  title   = {VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs},
  author  = {Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong},
  journal = {arXiv preprint arXiv:2406.07476},
  year    = {2024},
  url     = {https://arxiv.org/abs/2406.07476}
}

@article{damonlpsg2023videollama,
  title   = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
  author  = {Zhang, Hang and Li, Xin and Bing, Lidong},
  journal = {arXiv preprint arXiv:2306.02858},
  year    = {2023},
  url     = {https://arxiv.org/abs/2306.02858}
}