Vamba-Qwen2-VL-7B: Open-Source Model for Efficient Long Video Understanding

Developed by TIGER-Lab

Vamba is a hybrid Mamba-Transformer architecture that achieves efficient long video understanding through cross-attention layers and Mamba-2 modules.

Video-to-Text

Transformers

Open Source License: MIT

Tags: Long Video Understanding, Hybrid Mamba-Transformer, Efficient Video Processing

Downloads: 806

Release Date: March 13, 2025

Model Overview

Vamba is a hybrid architecture that combines the strengths of Mamba and the Transformer and is designed specifically for long video understanding. It reduces computational overhead by processing text tokens and video tokens with different mechanisms.

Model Features

Efficient Long Video Processing

Uses Mamba blocks instead of self-attention to process video token sequences, which avoids the quadratic cost of self-attention over video tokens.

Hybrid Architecture Design

Combines the self-attention mechanism of the Transformer with the efficient sequence processing of Mamba.

Differential Token Processing

Retains self-attention for text tokens only; video tokens are handled by cross-attention layers (text tokens as queries, video tokens as keys and values) and by Mamba blocks.
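
The efficiency gain from treating text and video tokens differently can be illustrated with a rough count of token-pair interactions per layer. This is a minimal sketch under a simplified cost model, which is an assumption and not a figure from the Vamba paper: self-attention costs the square of the sequence length, cross-attention costs queries times keys, and a Mamba-style scan costs one step per video token. The function names and token counts are hypothetical.

```python
def full_self_attention_cost(num_text: int, num_video: int) -> int:
    """Interactions when every token attends to every token (text + video)."""
    total = num_text + num_video
    return total * total


def hybrid_cost(num_text: int, num_video: int) -> int:
    """Interactions under the simplified hybrid model (an assumption)."""
    text_self_attention = num_text * num_text   # text tokens attend to text tokens
    cross_attention = num_text * num_video      # text queries over video keys/values
    video_scan = num_video                      # linear-time scan over video tokens
    return text_self_attention + cross_attention + video_scan


if __name__ == "__main__":
    t, v = 100, 10_000  # illustrative counts: short prompt, long video
    print("full self-attention:", full_self_attention_cost(t, v))
    print("hybrid:", hybrid_cost(t, v))
```

With 100 text tokens and 10,000 video tokens, the hybrid count in this model is roughly 100 times smaller, because the quadratic term over video tokens disappears.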

Model Capabilities

Long Video Understanding

Video Content Description

Image Content Description

Multimodal Reasoning

Use Cases

Video Content Analysis

Magic Trick Analysis
Task: Analyze and describe the magic tricks performed in the video.
Result: Accurately identifies and describes the magic actions.

Image Understanding

Image Content Description
Task: Provide a detailed description of the input image.
Result: Generates accurate image descriptions.

🚀 Vamba

This repository houses the model checkpoints for Vamba-Qwen2-VL-7B. Vamba is a hybrid Mamba-Transformer model. It utilizes cross-attention layers and Mamba-2 blocks to achieve efficient understanding of hour-long videos.

🌐 Homepage | 📖 arXiv | 💻 GitHub | 🤗 Model

✨ Features

Vamba Model Architecture

In transformer-based large multimodal models (LMMs), the main computational overhead stems from the quadratic complexity of self-attention over video tokens. To address this, we designed a hybrid Mamba-Transformer architecture that handles text and video tokens differently. Our approach splits the costly self-attention operation over the full video and text token sequence into two more efficient components. Because video tokens usually dominate the sequence while text tokens are few, we retain self-attention only for text tokens and remove it for video tokens. In its place, we add cross-attention layers that use text tokens as queries and video tokens as keys and values. Meanwhile, we use Mamba blocks to effectively process the video tokens.
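
The cross-attention step described above can be sketched in plain Python. This is a minimal single-head illustration that assumes tokens are already embedded as equal-length vectors; it omits the learned query/key/value projections, multiple heads, and the Mamba blocks, and the function names are hypothetical rather than taken from the Vamba codebase.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(
    text_queries: List[Vector],
    video_keys: List[Vector],
    video_values: List[Vector],
) -> List[Vector]:
    """Scaled dot-product attention: text tokens query the video tokens."""
    scale = 1.0 / math.sqrt(len(text_queries[0]))
    value_dim = len(video_values[0])
    outputs = []
    for query in text_queries:
        # One score per video token for this text token.
        scores = [
            scale * sum(q * k for q, k in zip(query, key))
            for key in video_keys
        ]
        weights = softmax(scores)
        # Weighted average of the video value vectors.
        outputs.append([
            sum(w * value[j] for w, value in zip(weights, video_values))
            for j in range(value_dim)
        ])
    return outputs
```

Each text token attends over all video tokens, so the cost grows with the product of the two sequence lengths rather than with the square of their sum.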

🚀 Quick Start

# Setup (run these in a shell first):
#   git clone https://github.com/TIGER-AI-Lab/Vamba
#   cd Vamba
#   export PYTHONPATH=.
from tools.vamba_chat import Vamba

# Load the Vamba-Qwen2-VL-7B checkpoint on the GPU.
model = Vamba(model_path="TIGER-Lab/Vamba-Qwen2-VL-7B", device="cuda")

# Video query: a video entry with sampling and resizing metadata,
# followed by a text prompt that starts with the <video> tag.
test_input = [
    {
        "type": "video",
        "content": "assets/magic.mp4",
        "metadata": {
            "video_num_frames": 128,
            "video_sample_type": "middle",
            "img_longest_edge": 640,
            "img_shortest_edge": 256,
        }
    },
    {
        "type": "text",
        "content": "<video> Describe the magic trick."
    }
]
print(model(test_input))

# Image query: same structure, with an image entry and the <image> tag.
test_input = [
    {
        "type": "image",
        "content": "assets/old_man.png",
        "metadata": {}
    },
    {
        "type": "text",
        "content": "<image> Describe this image."
    }
]
print(model(test_input))
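
When issuing many video queries, the input list shown above can be assembled by a small helper. `build_video_query` below is a hypothetical convenience function, not part of the Vamba repository; it only reproduces the dictionary layout and the metadata values from the Quick Start example as defaults.

```python
from typing import Any, Dict, List


def build_video_query(
    video_path: str,
    prompt: str,
    num_frames: int = 128,
    sample_type: str = "middle",
    longest_edge: int = 640,
    shortest_edge: int = 256,
) -> List[Dict[str, Any]]:
    """Build a [video, text] input list in the Quick Start layout.

    Hypothetical helper: the keys and defaults are copied from the
    Quick Start example, not from any documented Vamba API.
    """
    return [
        {
            "type": "video",
            "content": video_path,
            "metadata": {
                "video_num_frames": num_frames,
                "video_sample_type": sample_type,
                "img_longest_edge": longest_edge,
                "img_shortest_edge": shortest_edge,
            },
        },
        {
            "type": "text",
            # The prompt is prefixed with the <video> tag, as in the Quick Start.
            "content": f"<video> {prompt}",
        },
    ]
```

The result can be passed to the model directly, for example `model(build_video_query("assets/magic.mp4", "Describe the magic trick."))`, assuming `model` was created as in the Quick Start.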

📄 License

This project is licensed under the MIT license.

📚 Documentation

Citation

If you find our paper useful, please cite it as:

@misc{ren2025vambaunderstandinghourlongvideos,
      title={Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers}, 
      author={Weiming Ren and Wentao Ma and Huan Yang and Cong Wei and Ge Zhang and Wenhu Chen},
      year={2025},
      eprint={2503.11579},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.11579}, 
}
