Describe Anything: Detailed Localized Image and Video Captioning
NVIDIA, UC Berkeley, and UCSF present a model that generates detailed localized descriptions of images and videos, released for research and non-commercial use.
Authors: Long Lian, Yifan Ding, Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, Marco Pavone, Ming-Yu Liu, Trevor Darrell, Adam Yala, Yin Cui
[Paper] | [Code] | [Project Page] | [Video] | [HuggingFace Demo] | [Model/Benchmark/Datasets] | [Citation]
🚀 Quick Start
The Describe Anything Model 3B Video (DAM-3B-Video) generates detailed localized descriptions of images and videos. It is intended mainly for research and non-commercial use. You can try the model through the links above, such as the HuggingFace Demo for a quick test.
✨ Features
- Localized Description: DAM-3B-Video generates detailed descriptions for user-specified regions in images/videos, given as points, boxes, scribbles, or masks.
- Novel Architecture: It integrates full-image/video context with fine-grained local details via a novel focal prompt and a localized vision backbone enhanced with gated cross-attention.
📦 Installation
Installation instructions are not included in this README; see the [Code] repository above for setup.
💻 Usage Examples
The original README does not include code examples; the HuggingFace Demo above is the quickest way to try the model.
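The model card below specifies the expected input format: 3-channel RGB images/videos plus a 1-channel binary mask, at 384x384 resolution. As a hedged illustration of preparing such inputs, here is a minimal NumPy sketch; the function name and nearest-neighbor resizing strategy are illustrative assumptions, not the official preprocessing API.

```python
import numpy as np

# The model card states inputs are RGB (3 channels) plus a binary
# mask (1 channel) at 384x384. This helper is a hypothetical sketch,
# not the official DAM preprocessing pipeline.
TARGET = 384

def prepare_inputs(image, mask):
    """Resize an (H, W, 3) uint8 image and an (H, W) mask to
    384x384 via nearest-neighbor indexing, and binarize the mask."""
    h, w = image.shape[:2]
    ys = np.arange(TARGET) * h // TARGET   # row indices to sample
    xs = np.arange(TARGET) * w // TARGET   # column indices to sample
    image_r = image[ys][:, xs]             # (384, 384, 3)
    mask_r = (mask[ys][:, xs] > 0).astype(np.uint8)  # values in {0, 1}
    return image_r, mask_r

img = np.zeros((480, 640, 3), dtype=np.uint8)
msk = np.zeros((480, 640), dtype=np.uint8)
msk[100:200, 100:300] = 255                # a rectangular region of interest
img_r, msk_r = prepare_inputs(img, msk)
```

In practice a library resizer (e.g. PIL or OpenCV) would replace the index-based resampling; the point is only the target shapes and the binarized mask.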
📚 Documentation
Model Card for DAM-3B
Description
Describe Anything Model 3B Video (DAM-3B-Video) takes user-specified regions in the form of points/boxes/scribbles/masks within images/videos, and generates detailed localized descriptions. DAM integrates full-image/video context with fine-grained local details using a novel focal prompt and a localized vision backbone enhanced with gated cross-attention. The model is for research and development only and is ready for non-commercial use.
License
NVIDIA Noncommercial License
Intended Usage
This model is intended to demonstrate and facilitate understanding and use of the Describe Anything models. It should be used primarily for research and non-commercial purposes.
Model Architecture
| Property | Details |
|----------|---------|
| Architecture Type | Transformer |
| Network Architecture | ViT and Llama |
This model was developed based on VILA-1.5 and has 3B parameters.
Input
| Property | Details |
|----------|---------|
| Input Type(s) | Image, Video, Text, Binary Mask |
| Input Format(s) | RGB Image, RGB Video, Binary Mask |
| Input Parameters | 2D Image, 2D Video, 2D Binary Mask |
| Other Properties Related to Input | 3 channels for RGB image/video, 1 channel for binary mask; resolution 384x384 |
Output
| Property | Details |
|----------|---------|
| Output Type(s) | Text |
| Output Format | String |
| Output Parameters | 1D Text |
| Other Properties Related to Output | Detailed descriptions for the visual region |
Supported Hardware and OS
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Hopper
- NVIDIA Lovelace
Preferred/Supported Operating System(s):
Training Dataset
Describe Anything Training Datasets
Evaluation Dataset
We evaluate our models on our detailed localized captioning benchmark: DLC-Bench.
Inference
PyTorch
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns here.
🔧 Technical Details
The model uses a novel focal prompt and a localized vision backbone enhanced with gated cross-attention to integrate full-image/video context with fine-grained local details. It is developed based on VILA-1.5 and has 3B parameters.
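The gated cross-attention idea can be sketched in a minimal single-head NumPy example. This is an illustrative assumption about the mechanism, not DAM's actual layer (the real head count, dimensions, and gating parameterization are not given here): local-region tokens attend to full-image context tokens, and a tanh gate scales the contribution, so a gate initialized at zero leaves the pretrained backbone untouched at first.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(x, context, w_q, w_k, w_v, gate):
    """Single-head cross-attention from local tokens `x` (n_local, d)
    to global `context` tokens (n_ctx, d), added back through a
    residual connection scaled by tanh(gate). Illustrative sketch only."""
    q = x @ w_q                                    # queries from local tokens
    k = context @ w_k                              # keys from context tokens
    v = context @ w_v                              # values from context tokens
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) # scaled dot-product weights
    return x + np.tanh(gate) * (attn @ v)          # gated residual update

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(4, d))      # local-region tokens
ctx = rng.normal(size=(16, d))   # full-image context tokens
w = [rng.normal(size=(d, d)) * 0.1 for _ in range(3)]
out = gated_cross_attention(x, ctx, *w, gate=0.0)
# with gate = 0, tanh(0) = 0: the layer is an identity pass-through
```

Zero-initialized gates of this kind are a common way to graft new cross-attention layers onto a pretrained backbone without disrupting its initial behavior.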
📄 License
NVIDIA Noncommercial License
Citation
If you use our work or our implementation in this repo, or find them helpful, please consider citing it:
@article{lian2025describe,
title={Describe Anything: Detailed Localized Image and Video Captioning},
author={Long Lian and Yifan Ding and Yunhao Ge and Sifei Liu and Hanzi Mao and Boyi Li and Marco Pavone and Ming-Yu Liu and Trevor Darrell and Adam Yala and Yin Cui},
journal={arXiv preprint arXiv:2504.16072},
year={2025}
}