🚀 Describe Anything: Detailed Localized Image and Video Captioning
Describe Anything Model 3B (DAM-3B) can generate detailed localized descriptions of images based on user-specified regions. It is designed for research and non-commercial use.
🚀 Quick Start
This README provides a comprehensive introduction to the Describe Anything Model 3B (DAM-3B), including its description, license, usage, architecture, input/output details, training and evaluation datasets, and citation information.
✨ Features
- Localized Description: DAM-3B can generate detailed descriptions of user-specified regions in images.
- Novel Architecture: It integrates full-image context with fine-grained local details using a novel focal prompt and a localized vision backbone enhanced with gated cross-attention.
- Research-Oriented: The model is mainly for research and non-commercial use.
📦 Installation
See the [code repository](https://github.com/NVlabs/describe-anything) for installation and setup instructions.
💻 Usage Examples
The model card itself ships no official code samples; see the [GitHub repository](https://github.com/NVlabs/describe-anything) and the [HuggingFace demo](https://huggingface.co/spaces/nvidia/describe-anything-model-demo) for working examples.
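As an unofficial starting point, below is a minimal sketch of loading DAM-3B through the HuggingFace `transformers` remote-code interface. The `get_description` call is a hypothetical placeholder for the model's actual entry point, which is defined by its remote code; consult the repository's example scripts for the real API.

```python
# Hedged sketch: loading DAM-3B via transformers' remote-code mechanism.
# `get_description` below is a HYPOTHETICAL method name, not the confirmed API.
import torch
from PIL import Image
from transformers import AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# trust_remote_code=True loads the model's custom modeling code from the Hub.
model = AutoModel.from_pretrained(
    "nvidia/DAM-3B",
    trust_remote_code=True,
    torch_dtype=torch.float16,
).to(device)

image = Image.open("example.jpg").convert("RGB")   # 3-channel RGB image
mask = Image.open("region.png").convert("L")       # 1-channel binary region mask

# Hypothetical call; the real entry point lives in the repo's example scripts.
description = model.get_description(image, mask)
print(description)
```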
📚 Documentation
Model Card for DAM-3B
Description
Describe Anything Model 3B (DAM-3B) takes user-specified regions in the form of points, boxes, scribbles, or masks within images and generates detailed localized descriptions. DAM integrates full-image context with fine-grained local details using a novel focal prompt and a localized vision backbone enhanced with gated cross-attention. This model is for research and development only and is ready for non-commercial use.
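To make the gated cross-attention idea concrete, here is a conceptual sketch (not DAM's actual layer definition, and all dimensions are illustrative): focal-region tokens query global-image tokens, and a learned gate, initialized at zero, controls how much global context gets blended in.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Conceptual sketch of gated cross-attention: local (focal) tokens
    query global-image tokens; a tanh gate scales the mixed-in context.
    Hypothetical dimensions; this is not DAM's actual implementation."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # starts closed: no global context

    def forward(self, local_tokens: torch.Tensor, global_tokens: torch.Tensor) -> torch.Tensor:
        ctx, _ = self.attn(local_tokens, global_tokens, global_tokens)
        return local_tokens + torch.tanh(self.gate) * ctx

# Toy usage: 1 image, 196 focal tokens and 576 global tokens of width 768.
layer = GatedCrossAttention(dim=768)
local = torch.randn(1, 196, 768)
global_ = torch.randn(1, 576, 768)
print(layer(local, global_).shape)  # torch.Size([1, 196, 768])
```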
License
[NVIDIA Noncommercial License](https://huggingface.co/nvidia/DAM-3B/blob/main/LICENSE)
Intended Usage
This model is intended to demonstrate and facilitate understanding and usage of the Describe Anything models. It should primarily be used for research and non-commercial purposes.
Model Architecture
| Property | Details |
|----------|---------|
| Architecture Type | Transformer |
| Network Architecture | ViT and Llama |
| Development Basis | VILA-1.5 |
| Model Parameters | 3B |
Input
| Property | Details |
|----------|---------|
| Input Type(s) | Image, Text, Binary Mask |
| Input Format(s) | RGB Image, Binary Mask |
| Input Parameters | 2D Image, 2D Binary Mask |
| Other Properties Related to Input | 3 channels for RGB image, 1 channel for binary mask. Resolution is 384×384. |
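A hedged sketch of input preparation implied by this spec (the official pipeline may normalize, crop, or augment differently):

```python
import numpy as np
from PIL import Image

def preprocess(image_path: str, mask_path: str, size: int = 384):
    """Resize an RGB image and its binary region mask to the model's
    384x384 input resolution. Sketch based on the spec above only;
    the official preprocessing may differ."""
    image = Image.open(image_path).convert("RGB").resize((size, size))
    # Nearest-neighbor keeps the mask binary after resizing.
    mask = Image.open(mask_path).convert("L").resize((size, size), Image.NEAREST)

    image_arr = np.asarray(image, dtype=np.float32) / 255.0   # (384, 384, 3)
    mask_arr = (np.asarray(mask) > 127).astype(np.float32)    # (384, 384), values in {0, 1}
    return image_arr, mask_arr
```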
Output
| Property | Details |
|----------|---------|
| Output Type(s) | Text |
| Output Format | String |
| Output Parameters | 1D Text |
| Other Properties Related to Output | Detailed descriptions for the visual region. |
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Hopper
- NVIDIA Lovelace
Preferred/Supported Operating System(s):
- Linux
Training Dataset
Describe Anything Training Datasets
Evaluation Dataset
We evaluate our models on our detailed localized captioning benchmark: [DLC-Bench](https://huggingface.co/datasets/nvidia/DLC-Bench)
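If the benchmark follows the standard `datasets` layout (an assumption; check the dataset card for the actual configuration, splits, and field names), it can be pulled with:

```python
from datasets import load_dataset

# Assumes DLC-Bench loads with the default configuration;
# see the dataset card for the actual splits and field names.
bench = load_dataset("nvidia/DLC-Bench")
print(bench)
```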
Inference
PyTorch
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
Citation
If you use our work or our implementation in this repo, or find them helpful, please consider citing:
@article{lian2025describe,
title={Describe Anything: Detailed Localized Image and Video Captioning},
author={Long Lian and Yifan Ding and Yunhao Ge and Sifei Liu and Hanzi Mao and Boyi Li and Marco Pavone and Ming-Yu Liu and Trevor Darrell and Adam Yala and Yin Cui},
journal={arXiv preprint arXiv:2504.16072},
year={2025}
}
Authors
NVIDIA, UC Berkeley, UCSF
Long Lian, [Yifan Ding](https://research.nvidia.com/person/yifan-ding), Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, [Marco Pavone](https://research.nvidia.com/person/marco-pavone), Ming-Yu Liu, Trevor Darrell, Adam Yala, Yin Cui
Links
[[Paper](https://arxiv.org/abs/2504.16072)] | [[Code](https://github.com/NVlabs/describe-anything)] | [[Project Page](https://describe-anything.github.io/)] | [[Video](https://describe-anything.github.io/#video)] | [[HuggingFace Demo](https://huggingface.co/spaces/nvidia/describe-anything-model-demo)] | [[Model/Benchmark/Datasets](https://huggingface.co/collections/nvidia/describe-anything-680825bb8f5e41ff0785834c)] | [Citation]