Cosmos-1.0-Diffusion: A Suite of Diffusion-based World Foundation Models
Cosmos-1.0-Diffusion is a collection of diffusion-based world foundation models that can generate dynamic, high-quality videos from text, image, or video inputs, serving as a building block for various world generation applications.
Cosmos | Code | Paper | Paper Website
Quick Start
For detailed usage, please refer to Cosmos. Cosmos can also be used with Diffusers!
```python
import torch
from diffusers import CosmosTextToWorldPipeline
from diffusers.utils import export_to_video

# Load the 7B Text2World checkpoint in BF16 and move it to the GPU.
model_id = "nvidia/Cosmos-1.0-Diffusion-7B-Text2World"
pipe = CosmosTextToWorldPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "A sleek, humanoid robot stands in a vast warehouse filled with neatly stacked cardboard boxes on industrial shelves. The robot's metallic body gleams under the bright, even lighting, highlighting its futuristic design and intricate joints. A glowing blue light emanates from its chest, adding a touch of advanced technology. The background is dominated by rows of boxes, suggesting a highly organized storage system. The floor is lined with wooden pallets, enhancing the industrial setting. The camera remains static, capturing the robot's poised stance amidst the orderly environment, with a shallow depth of field that keeps the focus on the robot while subtly blurring the background for a cinematic effect."

# Generate the video frames and export them as an MP4 clip.
output = pipe(prompt=prompt).frames[0]
export_to_video(output, "output.mp4", fps=30)
```
Find more information in the diffusers documentation.
Features
- High-Performance Video Generation: Cosmos diffusion models can generate dynamic, high-quality videos from text, image, or video inputs, which can be used as building blocks for various world generation applications.
- Multiple Model Versions: In the Cosmos 1.0 release, the Cosmos Diffusion WFM family includes four models (7B Text2World, 14B Text2World, 7B Video2World, and 14B Video2World) to suit different application scenarios.
- Commercial Usability: Under the NVIDIA Open Model License, the models are commercially usable, and users can freely create and distribute derivative models.
Installation
No dedicated installation steps are provided here; see the Cosmos repository and the Diffusers documentation for environment setup.
Documentation
Model Overview
Description
Cosmos World Foundation Models is a family of highly performant pre-trained world foundation models designed for generating physics-aware videos and world states for physical AI development. The Cosmos diffusion models can generate dynamic, high-quality videos from text, image, or video inputs, which can be used as building blocks for various world generation applications. The models are commercially usable under the NVIDIA Open Model License Agreement.
Model Developer: NVIDIA
Model Versions
In the Cosmos 1.0 release, the Cosmos Diffusion WFM family includes the following models:
- Cosmos-1.0-Diffusion-7B-Text2World: Given a text description, it predicts an output video of 121 frames.
- Cosmos-1.0-Diffusion-14B-Text2World: Given a text description, it predicts an output video of 121 frames.
- Cosmos-1.0-Diffusion-7B-Video2World: Given a text description and an image as the first frame, it predicts the future 120 frames.
- Cosmos-1.0-Diffusion-14B-Video2World: Given a text description and an image as the first frame, it predicts the future 120 frames.
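The Video2World variants can also be driven through Diffusers. The snippet below is a minimal sketch rather than an official example: it assumes the CosmosVideoToWorldPipeline class and its image conditioning argument behave as in the Diffusers documentation, and the input file path is hypothetical.

```python
import torch
from diffusers import CosmosVideoToWorldPipeline
from diffusers.utils import export_to_video, load_image

# Load the 7B Video2World checkpoint in BF16 (assumed pipeline class and model id).
model_id = "nvidia/Cosmos-1.0-Diffusion-7B-Video2World"
pipe = CosmosVideoToWorldPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "The robot pivots slowly and walks down the warehouse aisle toward the camera."
image = load_image("first_frame.png")  # hypothetical path to the conditioning frame

# The image is used as the first frame; the model predicts the following 120 frames.
video = pipe(image=image, prompt=prompt).frames[0]
export_to_video(video, "video2world_output.mp4", fps=30)
```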
License
This model is released under the NVIDIA Open Model License. For a custom license, please contact cosmos-license@nvidia.com.
Under the NVIDIA Open Model License, NVIDIA confirms:
- Models are commercially usable.
- You are free to create and distribute Derivative Models.
- NVIDIA does not claim ownership to any outputs generated using the Models or Derivative Models.
Important Note: If you bypass, disable, reduce the efficacy of, or circumvent any technical limitation, safety guardrail or associated safety guardrail hyperparameter, encryption, security, digital rights management, or authentication mechanism contained in the Model, your rights under NVIDIA Open Model License Agreement will automatically terminate.
- Cosmos-1.0-Guardrail is the safety guardrail for this model.
Model Architecture
Cosmos-1.0-Diffusion-7B-Text2World is a diffusion transformer model designed for video denoising in the latent space. The network is composed of interleaved self-attention, cross-attention, and feedforward layers as its building blocks. The cross-attention layers allow the model to condition on input text throughout the denoising process. Before each layer, adaptive layer normalization is applied to embed the time information for denoising. When an image or video is provided as input, their latent frames are concatenated with the generated frames along the temporal dimension. Augment noise is added to conditional latent frames to bridge the training and inference gap.
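To make the block structure concrete, the following is a minimal, illustrative PyTorch sketch of one such transformer block (adaptive layer norm, self-attention over video tokens, cross-attention to text embeddings, feedforward). The class name, dimensions, and conditioning interface are assumptions for illustration, not the Cosmos implementation.

```python
import torch
import torch.nn as nn

class SketchDiTBlock(nn.Module):
    """Simplified stand-in for one interleaved attention/feedforward block."""

    def __init__(self, dim: int, num_heads: int, text_dim: int):
        super().__init__()
        # Adaptive layer norm: scale/shift are predicted from the denoising-time embedding.
        self.ada_ln = nn.Sequential(nn.SiLU(), nn.Linear(dim, 2 * dim))
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, kdim=text_dim, vdim=text_dim, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, text_emb, time_emb):
        # Embed time information via adaptive layer norm before the layer.
        scale, shift = self.ada_ln(time_emb).unsqueeze(1).chunk(2, dim=-1)
        h = self.norm1(x) * (1 + scale) + shift
        x = x + self.self_attn(h, h, h, need_weights=False)[0]                              # self-attention
        x = x + self.cross_attn(self.norm2(x), text_emb, text_emb, need_weights=False)[0]   # condition on text
        x = x + self.ffn(self.norm3(x))                                                     # feedforward
        return x

# Example shapes: a batch of 2 sequences of 16 video tokens with T5-sized text embeddings.
block = SketchDiTBlock(dim=512, num_heads=8, text_dim=1024)
out = block(torch.randn(2, 16, 512), torch.randn(2, 77, 1024), torch.randn(2, 512))
```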
Input/Output Specifications
Property | Details |
---|---|
Input Type(s) | Text |
Input Format(s) | String |
Input Parameters | One-dimensional (1D) |
Other Properties Related to Input | The input string should contain fewer than 300 words and should provide descriptive content for world generation, such as a scene description, key objects or characters, background, and any specific actions or motions to be depicted within the 5-second duration. |
Output Type(s) | Video |
Output Format(s) | mp4 |
Output Parameters | Three-dimensional (3D) |
Other Properties Related to Output | By default, the generated video is a 5-second clip with a resolution of 1280x704 pixels and a frame rate of 24 frames per second (fps). The video content visualizes the input text description as a short animated scene, capturing key elements within the specified time constraints. Aspect ratios and resolutions are configurable, with options including 1:1 (960x960 pixels), 4:3 (960x704 pixels), 3:4 (704x960 pixels), 16:9 (1280x704 pixels), and 9:16 (704x1280 pixels). The frame rate is also adjustable within a range of 12 to 40 fps. |
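When generating through Diffusers, these output properties map onto arguments of the pipeline call. The snippet below is a hedged sketch: the argument names (height, width, num_frames, fps) follow common Diffusers conventions but should be verified against the installed pipeline's docstring before use.

```python
# Assumed argument names; check the CosmosTextToWorldPipeline signature in your
# installed Diffusers version before relying on them.
output = pipe(
    prompt=prompt,
    height=1280,     # 9:16 portrait option from the table above (704x1280)
    width=704,
    num_frames=121,  # default Text2World clip length
    fps=24,          # frame-rate conditioning; the model supports 12-40 fps
).frames[0]
export_to_video(output, "portrait_output.mp4", fps=24)
```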
Software Integration
Runtime Engine(s):
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Blackwell
- NVIDIA Hopper
- NVIDIA Ampere
Note: We have only tested inference with BF16 precision.
Operating System(s):
- Linux (We have not tested on other operating systems.)
Evaluation
Please see our technical paper for detailed evaluations.
Inference Time and GPU Memory Usage
The numbers provided below may vary depending on system specs and are for reference only.
We report the maximum observed GPU memory usage during end-to-end inference. Additionally, we offer a series of model offloading strategies to help users manage GPU memory usage effectively.
For GPUs with limited memory (e.g., RTX 3090/4090 with 24 GB memory), we recommend fully offloading all models. For higher-end GPUs, users can select the most suitable offloading strategy considering the numbers provided below.
Offloading Strategy | 7B Text2World | 14B Text2World |
---|---|---|
Offload prompt upsampler | 74.0 GB | > 80.0 GB |
Offload prompt upsampler & guardrails | 57.1 GB | 70.5 GB |
Offload prompt upsampler & guardrails & T5 encoder | 38.5 GB | 51.9 GB |
Offload prompt upsampler & guardrails & T5 encoder & tokenizer | 38.3 GB | 51.7 GB |
Offload prompt upsampler & guardrails & T5 encoder & tokenizer & diffusion model | 24.4 GB | 39.0 GB |
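The offloading strategies above refer to the flags exposed by the original Cosmos inference scripts. When running through Diffusers instead, a comparable way to lower peak GPU memory is the standard pipeline offloading API, as sketched below (requires the accelerate package).

```python
# Keep sub-models on the CPU and move each one to the GPU only while it runs.
pipe = CosmosTextToWorldPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()
# pipe.enable_sequential_cpu_offload()  # more aggressive: lower memory, slower inference
```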
The table below presents the end-to-end inference runtime on a single H100 GPU, excluding model initialization time.
7B Text2World (offload prompt upsampler) | 14B Text2World (offload prompt upsampler, guardrails) |
---|---|
~380 seconds | ~590 seconds |
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, please see the subcards of Explainability, Bias, Safety & Security, and Privacy below. Please report security vulnerabilities or NVIDIA AI Concerns here.
Plus Plus (++) Promise
We value you, the datasets, the diversity they represent, and what we have been entrusted with. This model and its associated data have been:
- Verified to comply with current applicable disclosure laws, regulations, and industry standards.
- Verified to comply with applicable privacy labeling requirements.
- Annotated to describe the collector/source (NVIDIA or a third-party).
- Characterized for technical limitations.
- Reviewed to ensure proper disclosure is accessible to, maintained for, and in compliance with NVIDIA data subjects and their requests.
- Reviewed before release.
- Tagged for known restrictions and potential safety implications.
Bias
Field | Response |
---|---|
Participation considerations from adversely impacted groups (protected classes) in model design and testing | None |
Measures taken to mitigate against unwanted bias | None |
Technical Details
The Cosmos-1.0-Diffusion-7B-Text2World model is a diffusion transformer model designed for video denoising in the latent space. Its network structure consists of interleaved self-attention, cross-attention, and feedforward layers. The cross-attention layers enable the model to condition on input text during the denoising process. Adaptive layer normalization is applied before each layer to embed time information for denoising. When an image or video is used as input, their latent frames are concatenated with the generated frames along the temporal dimension. Augment noise is added to the conditional latent frames to bridge the gap between training and inference.

