Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input
Kangaroo has been released. For detailed information, please refer to our paper, blog, and GitHub repository.
Documentation
Abstract
We introduce Kangaroo, a powerful Multimodal Large Language Model tailored for long-context video understanding. Kangaroo demonstrates strong performance across diverse video understanding tasks, such as video captioning, question answering (QA), and conversation. Our key contributions in this work can be summarized as follows:
- Long-context Video Input: To enhance the model's ability to comprehend longer videos, we extend the maximum number of input frames to 160. We aggregate multiple videos with variable frame counts and aspect ratios into one sample and design a spatiotemporal patchify module to improve training efficiency (see the frame-sampling sketch after this list).
- Strong Performance: We evaluate our model on various video understanding benchmarks. The results show that our model achieves state-of-the-art performance on the majority of comprehensive benchmarks and remains competitive on the others. Notably, it outperforms most larger open-source models with over 30B parameters, as well as some proprietary models, on certain benchmarks.
- Video Annotation System: We develop a data curation and automatic annotation system to generate captions for open-source and internal videos. The resulting large-scale dataset is used for video-text pre-training. For the video instruction tuning stage, we construct an instruction tuning dataset from public and internal sources covering a variety of tasks.
- Bilingual Conversation: Our model supports Chinese, English, and mixed bilingual conversations, in both single-round and multi-round paradigms.
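To make the long-context input concrete, here is a minimal sketch of uniformly sampling frames up to a 160-frame budget. The function name, the decord backend, and the sampling strategy are our own illustration of the idea, not Kangaroo's actual preprocessing code:

import numpy as np
from decord import VideoReader, cpu  # hypothetical choice of video-decoding backend

def sample_frames(video_path, max_frames=160):
    """Uniformly sample up to `max_frames` frames from a video.

    Illustrates capping long videos at a fixed frame budget; the real
    Kangaroo pipeline may differ in detail.
    """
    vr = VideoReader(video_path, ctx=cpu(0))
    total = len(vr)
    # Keep every frame for short videos; otherwise sample uniformly across the clip.
    num = min(total, max_frames)
    indices = np.linspace(0, total - 1, num).astype(int)
    return vr.get_batch(indices).asnumpy()  # shape: (num, H, W, 3)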
Quick Start
Installation
Refer to our GitHub page for installation details.
Usage Examples
Basic Usage
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("KangarooGroup/kangaroo")
model = AutoModelForCausalLM.from_pretrained(
    "KangarooGroup/kangaroo",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model = model.to("cuda")

# Stop generation at either the regular EOS token or the chat end-of-turn token.
terminators = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]

video_path = "/path/to/video"

# Round 1: single-turn query about the video.
query = "Give a brief description of the video."
out, history = model.chat(video_path=video_path,
                          query=query,
                          tokenizer=tokenizer,
                          max_new_tokens=512,
                          eos_token_id=terminators,
                          do_sample=True,
                          temperature=0.6,
                          top_p=0.9)
print('Assistant: \n', out)

# Round 2: follow-up query; pass the returned history to continue the conversation.
query = "What happened at the end of the video?"
out, history = model.chat(video_path=video_path,
                          query=query,
                          history=history,
                          tokenizer=tokenizer,
                          max_new_tokens=512,
                          eos_token_id=terminators,
                          do_sample=True,
                          temperature=0.6,
                          top_p=0.9)
print('Assistant: \n', out)
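For deterministic, reproducible answers, you can disable sampling. This is a minimal variation of the example above, assuming model.chat forwards standard generation keyword arguments (as the sampling parameters above suggest):

# Greedy decoding for reproducible outputs (sampling disabled).
out, history = model.chat(video_path=video_path,
                          query=query,
                          tokenizer=tokenizer,
                          max_new_tokens=512,
                          eos_token_id=terminators,
                          do_sample=False)
print('Assistant: \n', out)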
License
This project is licensed under the Apache-2.0 license.
Citation
If you find this research useful, please cite it using the following BibTeX entry:
@misc{kangaroogroup,
  title={Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input},
  url={https://kangaroogroup.github.io/Kangaroo.github.io/},
  author={Jiajun Liu and Yibing Wang and Hanghang Ma and Xiaoping Wu and Xiaoqi Ma and Jie Hu},
  month={July},
  year={2024}
}