DAM-3B-Video

Developed by NVIDIA
DAM-3B-Video is a 3-billion-parameter vision-language model capable of generating fine-grained local descriptions for user-specified image/video regions.
Downloads 426
Release Time: 4/21/2025

Model Overview

This model combines full-image or full-video context with fine-grained local detail. A focus prompt mechanism and a local visual backbone enhanced with gated cross-attention let it generate detailed descriptions of user-specified visual regions.
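
To make the idea of gated cross-attention concrete, here is a minimal sketch in plain Python. It is not NVIDIA's implementation: it assumes a common formulation in which local tokens attend to global tokens and the result is added back through a tanh-scaled scalar gate. The names `local_tokens`, `global_tokens`, and `gate` are hypothetical, and learned projections and multiple heads are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    attended = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        attended.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return attended


def gated_cross_attention(local_tokens, global_tokens, gate):
    """Add tanh(gate)-scaled global context to each local token (residual form)."""
    context = cross_attention(local_tokens, global_tokens, global_tokens)
    g = math.tanh(gate)
    return [
        [x + g * c for x, c in zip(token, ctx)]
        for token, ctx in zip(local_tokens, context)
    ]


# Hypothetical toy features: two tokens from the focused region,
# three tokens from the full frame.
local_tokens = [[1.0, 0.0], [0.0, 1.0]]
global_tokens = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]

unchanged = gated_cross_attention(local_tokens, global_tokens, gate=0.0)
mixed = gated_cross_attention(local_tokens, global_tokens, gate=1.0)
```

With `gate=0.0` the local features pass through untouched, because `tanh(0)` is zero. A gate of this kind lets a model start from a purely local view and blend in global context only as far as the gate opens.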

Model Features

Fine-grained local description
Generates detailed descriptions of image/video regions that users specify with points, boxes, scribbles, or masks
Focus prompt mechanism
The focus prompt mechanism helps the model concentrate on user-specified regions
Gated cross-attention enhancement
Uses a local visual backbone enhanced with gated cross-attention to integrate global context with local details
Multimodal input support
Supports various input forms including images, videos, text, and binary masks
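
The features above mention both box-style region prompts and binary mask inputs. How the model itself turns one into the other is not stated here, so the following is only an illustrative sketch of one plausible preprocessing step: rasterizing a box into a binary mask. The helper `box_to_mask` and its `(x0, y0, x1, y1)` convention with exclusive upper bounds are assumptions made for this example.

```python
def box_to_mask(height, width, box):
    """Return a height x width binary mask (nested lists of 0/1) for a box.

    box is (x0, y0, x1, y1) in pixel coordinates; x1 and y1 are exclusive.
    """
    x0, y0, x1, y1 = box
    # Clamp the box to the image bounds so out-of-range boxes stay valid.
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    return [
        [1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
        for y in range(height)
    ]


# Hypothetical 4x6 frame with a box covering columns 1-3 and rows 1-2.
mask = box_to_mask(4, 6, (1, 1, 4, 3))
for row in mask:
    print("".join(str(v) for v in row))
```

For a video, the same helper could be applied per frame to build one mask per frame, again as an assumption about the input layout rather than a documented requirement.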

Model Capabilities

Image region description generation
Video region description generation
Multimodal input processing
Fine-grained local feature recognition

Use Cases

Research applications
Computer vision research
Used for vision-language model research and development
Non-commercial applications
Educational demonstrations
Showcasing advanced vision-language understanding capabilities