Cosmos-1.0-Diffusion: A Suite of Diffusion-based World Foundation Models
Cosmos-1.0-Diffusion is a collection of diffusion-based world foundation models that can generate dynamic, high-quality videos from text, image, or video inputs, serving as a building block for various world generation applications.
Cosmos | Code | Paper | Paper Website
Quick Start
For detailed usage, please refer to Cosmos. Cosmos can also be used with Diffusers!
```python
import torch
from diffusers import CosmosTextToWorldPipeline
from diffusers.utils import export_to_video

# Load the 7B Text2World checkpoint in BF16 and move it to the GPU.
model_id = "nvidia/Cosmos-1.0-Diffusion-7B-Text2World"
pipe = CosmosTextToWorldPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "A sleek, humanoid robot stands in a vast warehouse filled with neatly stacked cardboard boxes on industrial shelves. The robot's metallic body gleams under the bright, even lighting, highlighting its futuristic design and intricate joints. A glowing blue light emanates from its chest, adding a touch of advanced technology. The background is dominated by rows of boxes, suggesting a highly organized storage system. The floor is lined with wooden pallets, enhancing the industrial setting. The camera remains static, capturing the robot's poised stance amidst the orderly environment, with a shallow depth of field that keeps the focus on the robot while subtly blurring the background for a cinematic effect."

# Generate the video frames and export them as an MP4 clip.
output = pipe(prompt=prompt).frames[0]
export_to_video(output, "output.mp4", fps=30)
```
Find more information in the diffusers documentation.
Features
- High-Performance Video Generation: Cosmos diffusion models can generate dynamic, high-quality videos from text, image, or video inputs, which can be used as building blocks for various world generation applications.
- Multiple Model Versions: In the Cosmos 1.0 release, the Cosmos Diffusion WFM family includes four models (7B Text2World, 14B Text2World, 7B Video2World, and 14B Video2World) to suit different application scenarios.
- Commercial Usability: Under the NVIDIA Open Model License, the models are commercially usable, and users can freely create and distribute derivative models.
Installation
No dedicated installation steps are provided here; see the Cosmos repository and the Diffusers documentation for environment setup.
Documentation
Model Overview
Description
Cosmos World Foundation Models is a family of highly performant pre-trained world foundation models designed for generating physics-aware videos and world states for physical AI development. The Cosmos diffusion models can generate dynamic, high-quality videos from text, image, or video inputs, which can be used as building blocks for various world generation applications. The models are commercially usable under the NVIDIA Open Model License Agreement.
Model Developer: NVIDIA
Model Versions
In the Cosmos 1.0 release, the Cosmos Diffusion WFM family includes the following models:
- Cosmos-1.0-Diffusion-7B-Text2World: Given a text description, it predicts an output video of 121 frames.
- Cosmos-1.0-Diffusion-14B-Text2World: Given a text description, it predicts an output video of 121 frames.
- Cosmos-1.0-Diffusion-7B-Video2World: Given a text description and an image as the first frame, it predicts the future 120 frames.
- Cosmos-1.0-Diffusion-14B-Video2World: Given a text description and an image as the first frame, it predicts the future 120 frames.
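The Video2World variants can also be driven through Diffusers. The snippet below is a minimal sketch rather than an official example: it assumes the CosmosVideoToWorldPipeline class and its image conditioning argument behave as in the Diffusers documentation, and the input file path is hypothetical.

```python
import torch
from diffusers import CosmosVideoToWorldPipeline
from diffusers.utils import export_to_video, load_image

# Load the 7B Video2World checkpoint in BF16 (assumed pipeline class and model id).
model_id = "nvidia/Cosmos-1.0-Diffusion-7B-Video2World"
pipe = CosmosVideoToWorldPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "The robot pivots slowly and walks down the warehouse aisle toward the camera."
image = load_image("first_frame.png")  # hypothetical path to the conditioning frame

# The image is used as the first frame; the model predicts the following 120 frames.
video = pipe(image=image, prompt=prompt).frames[0]
export_to_video(video, "video2world_output.mp4", fps=30)
```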
License
This model is released under the NVIDIA Open Model License. For a custom license, please contact cosmos-license@nvidia.com.
Under the NVIDIA Open Model License, NVIDIA confirms:
- Models are commercially usable.
- You are free to create and distribute Derivative Models.
- NVIDIA does not claim ownership to any outputs generated using the Models or Derivative Models.
Important Note: If you bypass, disable, reduce the efficacy of, or circumvent any technical limitation, safety guardrail or associated safety guardrail hyperparameter, encryption, security, digital rights management, or authentication mechanism contained in the Model, your rights under NVIDIA Open Model License Agreement will automatically terminate.
- Cosmos-1.0-Guardrail is the safety guardrail for this model.
Model Architecture
Cosmos-1.0-Diffusion-7B-Text2World is a diffusion transformer model designed for video denoising in the latent space. The network is composed of interleaved self-attention, cross-attention, and feedforward layers as its building blocks. The cross-attention layers allow the model to condition on input text throughout the denoising process. Before each layer, adaptive layer normalization is applied to embed the time information for denoising. When an image or video is provided as input, their latent frames are concatenated with the generated frames along the temporal dimension. Augment noise is added to conditional latent frames to bridge the training and inference gap.
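To make the block structure concrete, the following is a minimal, illustrative PyTorch sketch of one such transformer block (adaptive layer norm, self-attention over video tokens, cross-attention to text embeddings, feedforward). The class name, dimensions, and conditioning interface are assumptions for illustration, not the Cosmos implementation.

```python
import torch
import torch.nn as nn

class SketchDiTBlock(nn.Module):
    """Simplified stand-in for one interleaved attention/feedforward block."""

    def __init__(self, dim: int, num_heads: int, text_dim: int):
        super().__init__()
        # Adaptive layer norm: scale/shift are predicted from the denoising-time embedding.
        self.ada_ln = nn.Sequential(nn.SiLU(), nn.Linear(dim, 2 * dim))
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, kdim=text_dim, vdim=text_dim, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, text_emb, time_emb):
        # Embed time information via adaptive layer norm before the layer.
        scale, shift = self.ada_ln(time_emb).unsqueeze(1).chunk(2, dim=-1)
        h = self.norm1(x) * (1 + scale) + shift
        x = x + self.self_attn(h, h, h, need_weights=False)[0]                              # self-attention
        x = x + self.cross_attn(self.norm2(x), text_emb, text_emb, need_weights=False)[0]   # condition on text
        x = x + self.ffn(self.norm3(x))                                                     # feedforward
        return x

# Example shapes: a batch of 2 sequences of 16 video tokens with T5-sized text embeddings.
block = SketchDiTBlock(dim=512, num_heads=8, text_dim=1024)
out = block(torch.randn(2, 16, 512), torch.randn(2, 77, 1024), torch.randn(2, 512))
```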
Input/Output Specifications
Property | Details |
---|---|
Input Type(s) | Text |
Input Format(s) | String |
Input Parameters | One-dimensional (1D) |
Other Properties Related to Input | The input string should contain fewer than 300 words and should provide descriptive content for world generation, such as a scene description, key objects or characters, background, and any specific actions or motions to be depicted within the 5-second duration. |
Output Type(s) | Video |
Output Format(s) | mp4 |
Output Parameters | Three-dimensional (3D) |
Other Properties Related to Output | By default, the generated video is a 5-second clip with a resolution of 1280x704 pixels and a frame rate of 24 frames per second (fps). The video content visualizes the input text description as a short animated scene, capturing key elements within the specified time constraints. Aspect ratios and resolutions are configurable, with options including 1:1 (960x960 pixels), 4:3 (960x704 pixels), 3:4 (704x960 pixels), 16:9 (1280x704 pixels), and 9:16 (704x1280 pixels). The frame rate is also adjustable within a range of 12 to 40 fps. |
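When generating through Diffusers, these output properties map onto arguments of the pipeline call. The snippet below is a hedged sketch: the argument names (height, width, num_frames, fps) follow common Diffusers conventions but should be verified against the installed pipeline's docstring before use.

```python
# Assumed argument names; check the CosmosTextToWorldPipeline signature in your
# installed Diffusers version before relying on them.
output = pipe(
    prompt=prompt,
    height=1280,     # 9:16 portrait option from the table above (704x1280)
    width=704,
    num_frames=121,  # default Text2World clip length
    fps=24,          # frame-rate conditioning; the model supports 12-40 fps
).frames[0]
export_to_video(output, "portrait_output.mp4", fps=24)
```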
Software Integration
Runtime Engine(s):
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Blackwell
- NVIDIA Hopper
- NVIDIA Ampere
Note: We have only tested inference with BF16 precision.
Operating System(s):
- Linux (We have not tested on other operating systems.)
Evaluation
Please see our technical paper for detailed evaluations.
Inference Time and GPU Memory Usage
The numbers provided below may vary depending on system specs and are for reference only.
We report the maximum observed GPU memory usage during end-to-end inference. Additionally, we offer a series of model offloading strategies to help users manage GPU memory usage effectively.
For GPUs with limited memory (e.g., RTX 3090/4090 with 24 GB memory), we recommend fully offloading all models. For higher-end GPUs, users can select the most suitable offloading strategy considering the numbers provided below.
Offloading Strategy | 7B Text2World | 14B Text2World |
---|---|---|
Offload prompt upsampler | 74.0 GB | > 80.0 GB |
Offload prompt upsampler & guardrails | 57.1 GB | 70.5 GB |
Offload prompt upsampler & guardrails & T5 encoder | 38.5 GB | 51.9 GB |
Offload prompt upsampler & guardrails & T5 encoder & tokenizer | 38.3 GB | 51.7 GB |
Offload prompt upsampler & guardrails & T5 encoder & tokenizer & diffusion model | 24.4 GB | 39.0 GB |
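The offloading strategies above refer to the flags exposed by the original Cosmos inference scripts. When running through Diffusers instead, a comparable way to lower peak GPU memory is the standard pipeline offloading API, as sketched below (requires the accelerate package).

```python
# Keep sub-models on the CPU and move each one to the GPU only while it runs.
pipe = CosmosTextToWorldPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()
# pipe.enable_sequential_cpu_offload()  # more aggressive: lower memory, slower inference
```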
The table below presents the end-to-end inference runtime on a single H100 GPU, excluding model initialization time.
7B Text2World (offload prompt upsampler) | 14B Text2World (offload prompt upsampler, guardrails) |
---|---|
~380 seconds | ~590 seconds |
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, please see the subcards of Explainability, Bias, Safety & Security, and Privacy below. Please report security vulnerabilities or NVIDIA AI Concerns here.
Plus Plus (++) Promise
We value you, the datasets, the diversity they represent, and what we have been entrusted with. This model and its associated data have been:
- Verified to comply with current applicable disclosure laws, regulations, and industry standards.
- Verified to comply with applicable privacy labeling requirements.
- Annotated to describe the collector/source (NVIDIA or a third-party).
- Characterized for technical limitations.
- Reviewed to ensure proper disclosure is accessible to, maintained for, and in compliance with NVIDIA data subjects and their requests.
- Reviewed before release.
- Tagged for known restrictions and potential safety implications.
Bias
Field | Response |
---|---|
Participation considerations from adversely impacted groups (protected classes) in model design and testing | None |
Measures taken to mitigate against unwanted bias | None |
Technical Details
The Cosmos-1.0-Diffusion-7B-Text2World model is a diffusion transformer model designed for video denoising in the latent space. Its network structure consists of interleaved self-attention, cross-attention, and feedforward layers. The cross-attention layers enable the model to condition on input text during the denoising process. Adaptive layer normalization is applied before each layer to embed time information for denoising. When an image or video is used as input, their latent frames are concatenated with the generated frames along the temporal dimension. Augment noise is added to the conditional latent frames to bridge the gap between training and inference.

