🚀 Cosmos-Reason1: Physical AI Common Sense and Embodied Reasoning Models
Cosmos-Reason1 is a Physical AI model that understands physical common sense and generates appropriate embodied decisions through long chain-of-thought reasoning. It is ready for commercial use; its input, output, and software integration requirements are detailed below.
📚 Documentation
Model Overview
Description
Cosmos-Reason1 Models: These Physical AI models can understand physical common sense and generate appropriate embodied decisions in natural language via long chain-of-thought reasoning processes.
The Cosmos-Reason1 models are post-trained with physical common sense and embodied reasoning data using supervised fine-tuning and reinforcement learning. They can understand space, time, and fundamental physics, and serve as planning models to reason about the next steps of an embodied agent.
The models are available for commercial use.
Model Developer: NVIDIA
Model Versions
Cosmos-Reason1 includes the following model:
- Cosmos-Reason1-7B: Given a text prompt and an input video, it thinks and generates an answer based on the input text prompt and video.
License
This model is released under the NVIDIA Open Model License. For a custom license, please contact cosmos-license@nvidia.com.
Under the NVIDIA Open Model License, NVIDIA confirms:
- The models are commercially usable.
- You are free to create and distribute Derivative Models.
- NVIDIA does not claim ownership of any outputs generated using the Models or Derivative Models.
⚠️ Important Note
If You bypass, disable, reduce the efficacy of, or circumvent any technical limitation, safety guardrail or associated safety guardrail hyperparameter, encryption, security, digital rights management, or authentication mechanism (collectively “Guardrail”) contained in the Model without a substantially similar Guardrail appropriate for your use case, your rights under the NVIDIA Open Model License Agreement will automatically terminate.
Deployment Geography
Global
Use Case
Physical AI: understanding of space, time, and fundamental physics, and embodied reasoning, encompassing robotics and autonomous vehicles (AV).
Release Date
Model Architecture
| Property | Details |
|---|---|
| Architecture Type | Multi-modal LLM consisting of a Vision Transformer (ViT) vision encoder and a dense Transformer LLM |
| Network Architecture | Qwen2.5-VL-7B-Instruct |

Cosmos-Reason1-7B is post-trained based on Qwen2.5-VL-7B-Instruct and follows the same model architecture.
Input
| Property | Details |
|---|---|
| Input Type(s) | Text+Video/Image |
| Input Format(s) | Text: String; Video: mp4; Image: jpg |
| Input Parameters | Text: One-dimensional (1D); Video: Three-dimensional (3D); Image: Two-dimensional (2D) |
| Other Properties Related to Input | Use FPS = 4 for input video to match the training setup. Append `Answer the question in the following format: <think>\nyour reasoning\n</think>\n\n<answer>\nyour answer\n</answer>.` to the system prompt to encourage a long chain-of-thought reasoning response. |
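As a concrete illustration, a minimal sketch of assembling such a system prompt in Python (only the format suffix is prescribed above; the base instruction and variable names are illustrative):

```python
# Format suffix recommended above to elicit long chain-of-thought responses.
REASONING_FORMAT = (
    "Answer the question in the following format: "
    "<think>\nyour reasoning\n</think>\n\n<answer>\nyour answer\n</answer>."
)

# The "helpful assistant" preamble is an illustrative choice, not a requirement.
system_prompt = "You are a helpful assistant. " + REASONING_FORMAT
```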
Output
| Property | Details |
|---|---|
| Output Type(s) | Text |
| Output Format | String |
| Output Parameters | Text: One-dimensional (1D) |
| Other Properties Related to Output | We recommend 4096 or more output max tokens to avoid truncation of long chain-of-thought responses. Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. |
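As a rough sketch of acting on this recommendation with vLLM (assuming `outputs` comes from an `llm.generate(...)` call like the usage example below; `finish_reason` is standard vLLM output metadata):

```python
from vllm import SamplingParams

# 4096+ output tokens leaves room for the long <think> section.
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=4096)

# A finish_reason of "length" means generation stopped at max_tokens,
# i.e. the chain-of-thought response was likely truncated.
completion = outputs[0].outputs[0]
if completion.finish_reason == "length":
    print("Warning: response truncated; consider raising max_tokens.")
```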
Software Integration
| Property | Details |
|---|---|
| Runtime Engine(s) | vLLM |
| Supported Hardware Microarchitecture Compatibility | NVIDIA Blackwell; NVIDIA Hopper |
| Note | We have only tested doing inference with BF16 precision. |
| Operating System(s) | Linux (We have not tested on other operating systems.) |
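For example, a minimal sketch of pinning vLLM to the tested BF16 precision at load time (`dtype="bfloat16"` is a standard `vllm.LLM` argument; the model path matches the usage example below):

```python
from vllm import LLM

# Load the checkpoint in BF16, the only precision reported as tested.
llm = LLM(model="nvidia/Cosmos-Reason1-7B", dtype="bfloat16")
```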
Usage
See Cosmos-Reason1 for details.
- Post Training: Cosmos-Reason1 provides examples of supervised fine-tuning and reinforcement learning on embodied reasoning datasets.
Evaluation
Please see our technical paper for detailed evaluations on physical common sense and embodied reasoning. Part of the evaluation datasets are released under Cosmos-Reason1-Benchmark. The embodied reasoning datasets and benchmarks focus on the following areas: robotics (RoboVQA, BridgeDataV2, AgiBot, RoboFail), ego-centric human demonstration (HoloAssist), and Autonomous Vehicle (AV) driving video data. The AV dataset is collected and annotated by NVIDIA.
All datasets go through the data annotation process described in the technical paper to prepare training and evaluation data and annotations.
Data Collection Method
| Dataset | Collection Method |
|---|---|
| RoboVQA | Hybrid: Automatic/Sensors |
| BridgeDataV2 | Automatic/Sensors |
| AgiBot | Automatic/Sensors |
| RoboFail | Automatic/Sensors |
| HoloAssist | Human |
| AV | Automatic/Sensors |
Labeling Method
| Dataset | Labeling Method |
|---|---|
| RoboVQA | Hybrid: Human, Automated |
| BridgeDataV2 | Hybrid: Human, Automated |
| AgiBot | Hybrid: Human, Automated |
| RoboFail | Hybrid: Human, Automated |
| HoloAssist | Hybrid: Human, Automated |
| AV | Hybrid: Human, Automated |
Metrics
Dataset Format
Modality: Video (mp4) and Text
Dataset Quantification
We release the embodied reasoning data and benchmarks. Each data sample is a pair of video and text. The text annotations include understanding and reasoning annotations described in the Cosmos-Reason1 paper. Each video may have multiple text annotations. The quantity of the video and text pairs is described in the table below.
⚠️ Important Note
The AV data is currently unavailable and will be uploaded soon!
We release text annotations for all embodied reasoning datasets, and videos for the RoboVQA and AV datasets. For the other datasets, users may download the source videos from the original data source and locate the corresponding videos via the video names. The held-out RoboFail benchmark is released for measuring generalization capability.
Inference
Test Hardware: H100, A100, GB200
⚠️ Important Note
We suggest using `fps=4` for the input video and `max_tokens=4096` to avoid truncated responses.
💻 Usage Examples
Basic Usage
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

MODEL_PATH = "nvidia/Cosmos-Reason1-7B"

# Allow up to 10 images and 10 videos per prompt.
llm = LLM(
    model=MODEL_PATH,
    limit_mm_per_prompt={"image": 10, "video": 10},
)

# 4096+ max tokens leaves room for the long chain-of-thought response.
sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.05,
    max_tokens=4096,
)

# The system prompt requests the <think>/<answer> output format;
# the video is sampled at fps=4 to match the training setup.
video_messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant. Answer the question in the following format: <think>\nyour reasoning\n</think>\n\n<answer>\nyour answer\n</answer>.",
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Is it safe to turn right?"},
            {
                "type": "video",
                "video": "file:///path/to/your/video.mp4",
                "fps": 4,
            },
        ],
    },
]
messages = video_messages

# Render the chat template into a single prompt string.
processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Extract image/video inputs and per-video processor kwargs (e.g. fps).
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)

mm_data = {}
if image_inputs is not None:
    mm_data["image"] = image_inputs
if video_inputs is not None:
    mm_data["video"] = video_inputs

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
    "mm_processor_kwargs": video_kwargs,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text
print(generated_text)
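The generated text interleaves reasoning and the final answer in `<think>` and `<answer>` tags. A minimal sketch of separating the two (the regex-based `parse_response` helper is illustrative, not part of the model's API):

```python
import re

def parse_response(text: str) -> tuple[str, str]:
    """Split a Cosmos-Reason1 response into (reasoning, answer).

    Falls back to the raw text as the answer if the tags are absent.
    """
    think = re.search(r"<think>\s*(.*?)\s*</think>", text, re.DOTALL)
    answer = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
    return (
        think.group(1) if think else "",
        answer.group(1) if answer else text.strip(),
    )

reasoning, answer = parse_response(generated_text)
print("Reasoning:", reasoning)
print("Answer:", answer)
```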
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.
For more detailed information on ethical considerations for this model, please see the subcards of Explainability, Bias, Safety & Security, and Privacy below.
Please report security vulnerabilities or NVIDIA AI Concerns here.
Plus Plus (++) Promise
We value you, the datasets, the diversity...
📄 License
This model is released under the NVIDIA Open Model License.