Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input
Kangaroo has been released. For detailed information, please refer to our paper, blog, and GitHub repository.
Documentation
Abstract
We introduce Kangaroo, a powerful Multimodal Large Language Model tailored for long-context video understanding. Kangaroo demonstrates strong performance across diverse video understanding tasks, such as video captioning, question answering (QA), and conversation. Our key contributions in this work can be summarized as follows:
- Long-context Video Input: To enhance the model's ability to comprehend longer videos, we extend the maximum number of input frames to 160. We aggregate multiple videos with variable frame counts and aspect ratios into one sample and design a spatiotemporal patchify module to improve training efficiency (see the frame-sampling sketch after this list).
- Strong Performance: We evaluate our model on various video understanding benchmarks. The results show that our model achieves state-of-the-art performance on the majority of comprehensive benchmarks and remains competitive on the others. Notably, it outperforms most larger open-source models with over 30B parameters, as well as some proprietary models, on certain benchmarks.
- Video Annotation System: We develop a data curation and automatic annotation system to generate captions for open-source and internal videos. The resulting large-scale dataset is used for video-text pre-training. For the video instruction tuning stage, we construct an instruction tuning dataset from public and internal sources covering a variety of tasks.
- Bilingual Conversation: Our model supports Chinese, English, and mixed bilingual conversations, in both single-round and multi-round paradigms.
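To make the long-context input concrete, here is a minimal sketch of uniformly sampling frames up to a 160-frame budget. The function name, the decord backend, and the sampling strategy are our own illustration of the idea, not Kangaroo's actual preprocessing code:

import numpy as np
from decord import VideoReader, cpu  # hypothetical choice of video-decoding backend

def sample_frames(video_path, max_frames=160):
    """Uniformly sample up to `max_frames` frames from a video.

    Illustrates capping long videos at a fixed frame budget; the real
    Kangaroo pipeline may differ in detail.
    """
    vr = VideoReader(video_path, ctx=cpu(0))
    total = len(vr)
    # Keep every frame for short videos; otherwise sample uniformly across the clip.
    num = min(total, max_frames)
    indices = np.linspace(0, total - 1, num).astype(int)
    return vr.get_batch(indices).asnumpy()  # shape: (num, H, W, 3)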
Quick Start
Installation
Refer to our GitHub page for installation details.
Usage Examples
Basic Usage
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("KangarooGroup/kangaroo")
model = AutoModelForCausalLM.from_pretrained(
    "KangarooGroup/kangaroo",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model = model.to("cuda")

# Stop generation at either the regular EOS token or the chat end-of-turn token.
terminators = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]

video_path = "/path/to/video"

# Round 1: single-turn query about the video.
query = "Give a brief description of the video."
out, history = model.chat(video_path=video_path,
                          query=query,
                          tokenizer=tokenizer,
                          max_new_tokens=512,
                          eos_token_id=terminators,
                          do_sample=True,
                          temperature=0.6,
                          top_p=0.9)
print('Assistant: \n', out)

# Round 2: follow-up query; pass the returned history to continue the conversation.
query = "What happened at the end of the video?"
out, history = model.chat(video_path=video_path,
                          query=query,
                          history=history,
                          tokenizer=tokenizer,
                          max_new_tokens=512,
                          eos_token_id=terminators,
                          do_sample=True,
                          temperature=0.6,
                          top_p=0.9)
print('Assistant: \n', out)
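For deterministic, reproducible answers, you can disable sampling. This is a minimal variation of the example above, assuming model.chat forwards standard generation keyword arguments (as the sampling parameters above suggest):

# Greedy decoding for reproducible outputs (sampling disabled).
out, history = model.chat(video_path=video_path,
                          query=query,
                          tokenizer=tokenizer,
                          max_new_tokens=512,
                          eos_token_id=terminators,
                          do_sample=False)
print('Assistant: \n', out)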
License
This project is licensed under the Apache-2.0 license.
Citation
If you find this research useful, please cite it using the following BibTeX entry:
@misc{kangaroogroup,
  title={Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input},
  url={https://kangaroogroup.github.io/Kangaroo.github.io/},
  author={Jiajun Liu and Yibing Wang and Hanghang Ma and Xiaoping Wu and Xiaoqi Ma and Jie Hu},
  month={July},
  year={2024}
}