Apollo Open-Source Multimodal Model - Free Support for Long Video Understanding, Temporal Reasoning, and Complex Q&A

Apollo LMMs Apollo 1.5B T32

Developed by GoodiesHere

Apollo is a series of large multimodal models focused on video understanding, excelling in tasks such as long video content comprehension, temporal reasoning, and complex video question answering.

Video-to-Text

Safetensors

Open Source License: Apache-2.0

#Long Video Understanding #Temporal Reasoning #Multimodal Dialogue

Downloads 37

Release Time: 12/18/2024

Model Overview

Apollo models balance speed and accuracy: they can process videos up to one hour long and, despite a smaller parameter count, perform competitively with larger models.

Model Features

Scalable Consistency

Designs validated on small models and datasets can be effectively transferred to larger scales, reducing computational and experimental costs

Efficient Video Sampling

FPS sampling and advanced token resampling strategies (e.g., Perceiver) enhance temporal awareness

Encoder Synergy

The combination of SigLIP-SO400M (image) and InternVideo2 (video) forms robust representations, outperforming single encoders in temporal tasks

ApolloBench

A streamlined evaluation benchmark (41x faster) focused on assessing real-world video understanding capabilities
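
The token resampling mentioned under Efficient Video Sampling can be pictured as cross-attention from a fixed number of query vectors to a variable number of visual tokens, so the output length stays constant however many frames are sampled. The following is a minimal pure-Python sketch of that idea only. It omits the learned projections, multiple heads, and stacked layers a real Perceiver-style resampler would have, and `resample_tokens` and its arguments are hypothetical names, not part of the Apollo codebase.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def resample_tokens(tokens, queries):
    """Cross-attend each query to all tokens (hypothetical helper).

    tokens:  list of N vectors (N varies with the number of sampled frames)
    queries: list of K vectors (K is fixed; learned in a real model)
    Returns K vectors, each an attention-weighted average of the tokens.
    """
    if not tokens or not queries:
        raise ValueError("tokens and queries must be non-empty")
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in queries:
        # Scaled dot-product score between this query and every token.
        scores = [scale * sum(q * t for q, t in zip(query, token)) for token in tokens]
        weights = softmax(scores)
        # Weighted average of the tokens, one output dimension at a time.
        outputs.append(
            [sum(w * token[d] for w, token in zip(weights, tokens)) for d in range(dim)]
        )
    return outputs
```

Whether 3 or 50 tokens go in, the function returns exactly `len(queries)` vectors, which is the property that keeps the language model's visual token count bounded.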

Model Capabilities

Long video content understanding

Temporal reasoning

Complex video question answering

Multimodal dialogue based on video content

Use Cases

Video Analysis

Video Content Description

Detailed description of video content up to one hour in length

Accurately captures key content and temporal relationships in videos

Video Question Answering

Answering complex questions about video content

Excellent performance in complex video QA tasks

🚀 Apollo: An Exploration of Video Understanding in Large Multimodal Models

Apollo is a family of Large Multimodal Models (LMMs) that advance the state of the art in video understanding. It supports tasks such as long-form video comprehension, temporal reasoning, complex video question answering, and multi-turn conversations grounded in video content.

Apollo models are highly effective at handling hour-long videos. Through strategic design, they strike a balance between speed and accuracy. With only 3B parameters, our models outperform most 7B competitors and can even rival 30B-scale models.

✨ Features

Scaling Consistency: Design decisions validated on smaller models and datasets can be effectively applied to larger scales, reducing computation and experimentation costs.
Efficient Video Sampling: fps sampling and advanced token resampling strategies (e.g., Perceiver) enhance temporal perception.
Encoder Synergies: Combining SigLIP-SO400M (image) with InternVideo2 (video) provides a robust representation, outperforming single encoders in temporal tasks.
ApolloBench: A streamlined evaluation benchmark (41x faster) that focuses on true video understanding capabilities.
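
To make the fps sampling point concrete, the sketch below contrasts sampling at a fixed rate with sampling a fixed number of frames uniformly. It is an illustration under assumptions, not Apollo's loader: the function names and the example rate and frame budget are hypothetical. With a fixed rate, the time between sampled frames is the same for every video; with a fixed budget, the gap widens as the video gets longer.

```python
def fps_sample(duration_s: float, fps: float) -> list[float]:
    """Timestamps in seconds at a fixed rate; spacing is always 1 / fps."""
    if duration_s <= 0 or fps <= 0:
        return []
    step = 1.0 / fps
    count = int(duration_s * fps)
    return [i * step for i in range(count)]


def uniform_sample(duration_s: float, num_frames: int) -> list[float]:
    """Timestamps in seconds for a fixed frame budget; spacing grows with length."""
    if duration_s <= 0 or num_frames <= 0:
        return []
    step = duration_s / num_frames
    return [i * step for i in range(num_frames)]


# Hypothetical settings: 2 frames per second versus a budget of 8 frames.
short_fps = fps_sample(10.0, fps=2.0)       # 20 frames, 0.5 s apart
long_fps = fps_sample(60.0, fps=2.0)        # 120 frames, still 0.5 s apart
short_uniform = uniform_sample(10.0, 8)     # 8 frames, 1.25 s apart
long_uniform = uniform_sample(60.0, 8)      # 8 frames, 7.5 s apart
```

The trade-off is that fixed-rate sampling produces more frames for longer videos, which is one reason it is paired with a token resampler that caps the number of visual tokens.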

🚀 Quick Start

📦 Installation

Run these commands from the root of a local clone of the Apollo repository:

pip install -e .
pip install flash-attn --no-build-isolation
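
After installing, a quick check that the modules used in the usage example are importable can save a failed model load later. This is a small sketch using only the standard library; the list of module names is taken from the imports in the example on this page plus `flash_attn`, and `missing_packages` is a hypothetical helper.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be found on this system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names assumed from the usage example's imports and the flash-attn install.
required = ["torch", "transformers", "huggingface_hub", "flash_attn", "apollo"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```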

💻 Usage Examples

Basic Usage

import torch
from transformers import AutoModelForCausalLM
from apollo.mm_utils import (
    KeywordsStoppingCriteria,
    tokenizer_mm_token,
    ApolloMMLoader
)
from apollo.conversations import conv_templates, SeparatorStyle
from huggingface_hub import snapshot_download

# Download the checkpoint from the Hugging Face Hub (cached after the first run).
model_url = "Apollo-LMMs/Apollo-3B-t32"
model_path = snapshot_download(model_url, repo_type="model")

# Load the model with its custom code; use the GPU when one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    low_cpu_mem_usage=True
).to(device=device, dtype=torch.bfloat16)

tokenizer = model.tokenizer
vision_processors = model.vision_tower.vision_processor
config = model.config
num_repeat_token = config.mm_connector_cfg['num_output_tokens']
mm_processor = ApolloMMLoader(
    vision_processors,
    config.clip_duration,
    frames_per_clip=4,
    clip_sampling_ratio=0.65,
    model_max_length=config.model_max_length,
    device=device,
    num_repeat_token=num_repeat_token
)

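# Load and preprocess the video; replace_string is the placeholder text
# that stands in for the video inside the prompt.
video_path = "path/to/video.mp4"
question = "Describe this video in detail"
mm_data, replace_string = mm_processor.load_video(video_path)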

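# Build the chat prompt: a user turn with the video placeholder and the
# question, followed by an empty assistant turn for the model to complete.
conv = conv_templates["qwen_2"].copy()
conv.append_message(conv.roles[0], replace_string + "\n\n" + question)
conv.append_message(conv.roles[1], None)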

prompt = conv.get_prompt()
input_ids = tokenizer_mm_token(prompt, tokenizer, return_tensors="pt").unsqueeze(0).to(device)

# Stop generating once the conversation separator is produced.
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
stopping_criteria = KeywordsStoppingCriteria([stop_str], tokenizer, input_ids)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        vision_input=[mm_data],
        data_types=['video'],
        do_sample=True,
        temperature=0.4,
        max_new_tokens=256,
        top_p=0.7,
        use_cache=True,
        num_beams=1,
        stopping_criteria=[stopping_criteria]
    )

# Decode the generated token IDs back to text.
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
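
The `generate` call above samples with `temperature=0.4` and `top_p=0.7`. As a rough guide to what those two settings do, here is a self-contained sketch of temperature scaling followed by nucleus (top-p) filtering over a toy list of logits. It is an illustration of the general technique, not the Transformers implementation, and `nucleus` and `top_p_sample` are hypothetical helper names.

```python
import math
import random


def nucleus(logits, temperature, top_p):
    """Return (kept_indices, probs): the smallest high-probability set whose
    cumulative probability reaches top_p after temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Visit tokens from most to least likely until the mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept, probs


def top_p_sample(logits, temperature=0.4, top_p=0.7, rng=random):
    """Sample one index from the nucleus, renormalizing over the kept tokens."""
    kept, probs = nucleus(logits, temperature, top_p)
    threshold = rng.random() * sum(probs[i] for i in kept)
    running = 0.0
    for i in kept:
        running += probs[i]
        if threshold < running:
            return i
    return kept[-1]
```

A low temperature sharpens the distribution, so for toy logits such as `[2.0, 1.0, 0.0]` at `temperature=0.4` the top token alone already exceeds the 0.7 threshold and is always chosen; at `temperature=1.0` the second token also enters the nucleus. Lower values of either setting make the output more deterministic.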

📚 Documentation

If you find this project useful, please consider citing:

@article{zohar2024apollo,
    title={Apollo: An Exploration of Video Understanding in Large Multimodal Models},
    author={Zohar, Orr and Wang, Xiaohan and Dubois, Yann and Mehta, Nikhil and Xiao, Tong and Hansen-Estruch, Philippe and Yu, Licheng and Wang, Xiaofang and Juefei-Xu, Felix and Zhang, Ning and Yeung-Levy, Serena and Xia, Xide},
    journal={arXiv preprint arXiv:2412.10360},
    year={2024}
}

For more details, visit the project website or check out the paper.

📄 License

This project is licensed under the Apache-2.0 License.

Additional Information

Property	Details
Tags	video, video-understanding, vision, multimodal, conversational, qwen, custom_code, instruction-tuning
Datasets	ApolloBench, Video-MME, MLVU, LongVideoBench, NExTQA, PerceptionTest
Inference	true
Pipeline Tag	video-text-to-text
