Cosmos-Reason1: Physical AI Common Sense and Embodied Reasoning Models
Cosmos-Reason1 is a set of Physical AI models that can understand physical common sense and perform embodied reasoning through long chain-of-thought processes. It has wide applications in fields like robotics and autonomous vehicles.
🚀 Quick Start
For detailed usage, please refer to the [Cosmos-Reason1](https://github.com/nvidia-cosmos/cosmos-reason1) repository, which provides examples of supervised fine-tuning and reinforcement learning on embodied reasoning datasets.
✨ Features
- Unsloth Dynamic 2.0: Achieves superior accuracy and outperforms other leading quants.
- Multi-modal LLM: Consists of a Vision Transformer (ViT) vision encoder and a dense Transformer LLM, based on the Qwen2.5-VL-7B-Instruct architecture.
- Commercial Usability: Commercially usable under the NVIDIA Open Model License.
- Global Deployment: Can be deployed globally.
- Wide Use Cases: Applicable in Physical AI, including robotics and autonomous vehicles.
📦 Installation
No installation steps are provided in this card; see the [Cosmos-Reason1](https://github.com/nvidia-cosmos/cosmos-reason1) repository referenced under Quick Start.
💻 Usage Examples
No official code example is included in this card; the sketch below illustrates one possible inference setup.
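The following is a minimal sketch, assuming the vLLM runtime listed under Software Integration, the `qwen-vl-utils` helper package commonly used with Qwen2.5-VL-based models, and the `nvidia/Cosmos-Reason1-7B` checkpoint; the video path, question, and sampling settings are illustrative placeholders rather than an official recipe.

```python
# Illustrative sketch only: checkpoint name, video path, question, and sampling
# settings are assumptions, not an official example from the model card.
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info  # helper commonly used with Qwen2.5-VL models

MODEL_ID = "nvidia/Cosmos-Reason1-7B"  # assumed checkpoint name

# BF16 is the only precision the card reports as tested for inference.
llm = LLM(model=MODEL_ID, dtype="bfloat16", limit_mm_per_prompt={"video": 1})

# 4096+ max tokens is recommended so long chain-of-thought responses are not truncated.
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=4096)

# System prompt requesting the <think>/<answer> format described under Documentation.
system_prompt = (
    "You are a helpful assistant. Answer the question in the following format: "
    "<think>\nyour reasoning\n</think>\n\n<answer>\nyour answer\n</answer>."
)

messages = [
    {"role": "system", "content": system_prompt},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Is it safe for the robot to pick up the cup next?"},
            # FPS = 4 matches the input convention in the model card.
            {"type": "video", "video": "file:///path/to/video.mp4", "fps": 4},
        ],
    },
]

processor = AutoProcessor.from_pretrained(MODEL_ID)
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Decode video frames and collect per-video kwargs (e.g. fps) for vLLM's multimodal input.
_, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)

outputs = llm.generate(
    [{
        "prompt": prompt,
        "multi_modal_data": {"video": video_inputs},
        "mm_processor_kwargs": video_kwargs,
    }],
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)
```

The generated text is expected to contain the reasoning inside `<think>...</think>` followed by the final answer inside `<answer>...</answer>`, per the prompt format described under Documentation.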
📚 Documentation
Model Overview
- Model Developer: NVIDIA
- Model Versions: [Cosmos-Reason1-7B](https://huggingface.co/nvidia/Cosmos-Reason1-7B) can generate answers based on text prompts and input videos.
- License: Released under the NVIDIA Open Model License. For custom licenses, contact [cosmos-license@nvidia.com](mailto:cosmos-license@nvidia.com).
- Permissions: Commercially usable, free to create and distribute derivative models, and NVIDIA does not claim ownership of outputs.
- Important Note: Bypassing guardrails without appropriate substitutes will terminate your rights under the agreement.
- Deployment Geography: Global
- Use Case: Physical AI, including understanding of space, time, fundamental physics, and embodied reasoning in robotics and autonomous vehicles.
- Release Date:
  - GitHub: [05/17/2025](https://github.com/nvidia-cosmos/cosmos-reason1)
  - Hugging Face: [05/17/2025](https://huggingface.co/collections/nvidia/cosmos-reason1-67c9e926206426008f1da1b7)
Model Architecture
- Architecture Type: A multi-modal LLM with a Vision Transformer (ViT) vision encoder and a dense Transformer LLM.
- Network Architecture: Qwen2.5-VL-7B-Instruct. Cosmos-Reason1-7B is post-trained based on [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct).
Input
| Property | Details |
|----------|---------|
| Input Type(s) | Text+Video/Image |
| Input Format(s) | Text: String; Video: mp4; Image: jpg |
| Input Parameters | Text: One-dimensional (1D); Video: Three-dimensional (3D); Image: Two-dimensional (2D) |
| Other Properties Related to Input | Use FPS = 4 for input video. Append `<think>\nyour reasoning\n</think>\n\n<answer>\nyour answer\n</answer>` in the system prompt to elicit long chain-of-thought reasoning. |
Output
| Property | Details |
|----------|---------|
| Output Type(s) | Text |
| Output Format | String |
| Output Parameters | Text: One-dimensional (1D) |
| Other Properties Related to Output | Recommend using 4096 or more output max tokens. Designed to run on NVIDIA GPU-accelerated systems for faster training and inference. |
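Because the output is a single string that interleaves reasoning and the final answer in the `<think>`/`<answer>` format noted above, a small post-processing step can separate the two; the helper below is an illustrative sketch, not part of the official card.

```python
import re

def split_reasoning_and_answer(generated_text: str) -> tuple[str, str]:
    """Separate the chain-of-thought from the final answer in a <think>/<answer> response."""
    think = re.search(r"<think>\s*(.*?)\s*</think>", generated_text, flags=re.DOTALL)
    answer = re.search(r"<answer>\s*(.*?)\s*</answer>", generated_text, flags=re.DOTALL)
    reasoning = think.group(1) if think else ""
    # Fall back to the raw text if the expected tags are missing.
    final = answer.group(1) if answer else generated_text.strip()
    return reasoning, final

reasoning, final = split_reasoning_and_answer(
    "<think>\nThe pedestrian is still in the crosswalk.\n</think>\n\n"
    "<answer>\nNo, the vehicle should wait.\n</answer>"
)
print(final)  # -> "No, the vehicle should wait."
```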
Software Integration
- Runtime Engine(s): [vLLM](https://github.com/vllm-project/vllm)
- Supported Hardware Microarchitecture Compatibility: NVIDIA Blackwell, NVIDIA Hopper
- Note: Inference has only been tested with BF16 precision.
- Operating System(s): Linux
Evaluation
- Datasets: Part of the evaluation datasets are released under [Cosmos-Reason1-Benchmark](https://huggingface.co/datasets/nvidia/Cosmos-Reason1-Benchmark), covering robotics, egocentric human demonstration, and autonomous vehicle driving video data.
- Data Collection Method: Hybrid (Automatic/Sensors or Human) for different datasets.
- Labeling Method: Hybrid (Human, Automated) for all datasets.
- Metrics:
| | RoboVQA | AV | [BridgeDataV2](https://rail-berkeley.github.io/bridgedata/) | [Agibot](https://github.com/OpenDriveLab/AgiBot-World) | HoloAssist | [RoboFail](https://robot-reflect.github.io/) | Average |
|---|---|---|---|---|---|---|---|
| Accuracy | 87.3 | 70.8 | 63.7 | 48.9 | 62.7 | 57.2 | 65.1 |
Dataset Format
Modality: Video (mp4) and Text
Dataset Quantification
| | RoboVQA | AV | [BridgeDataV2](https://rail-berkeley.github.io/bridgedata/) | [Agibot](https://github.com/OpenDriveLab/AgiBot-World) | HoloAssist | [RoboFail](https://robot-reflect.github.io/) | Total Storage Size |
|---|---|---|---|---|---|---|---|
| SFT Data | 1.14m | 24.7k | 258k | 38.9k | 273k | N/A | 300.6GB |
| RL Data | 252 | 200 | 240 | 200 | 200 | N/A | 2.6GB |
| Benchmark Data | 110 | 100 | 100 | 100 | 100 | 100 | 1.5GB |
Inference
- Acceleration Engine: PyTorch, flash attention
- Test Hardware: H100, A100, GB200
- Requirement: Minimum of 2 GPU cards; multi-node setups require an InfiniBand / RoCE connection
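As a rough illustration of how these notes might map onto a runtime configuration, the sketch below instantiates vLLM with BF16 precision across two GPUs via tensor parallelism; the checkpoint name and arguments are assumptions, not an official configuration.

```python
from vllm import LLM

# Assumed configuration sketch: two GPUs on one node via tensor parallelism, BF16 weights.
llm = LLM(
    model="nvidia/Cosmos-Reason1-7B",  # assumed checkpoint name
    dtype="bfloat16",                  # inference is reported as tested only with BF16
    tensor_parallel_size=2,            # reflects the "minimum 2 GPU cards" requirement above
    limit_mm_per_prompt={"video": 1},  # one video per prompt in this sketch
)
```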
Ethical Considerations
- General Responsibility: NVIDIA believes in Trustworthy AI. Developers should ensure the model meets industry requirements and addresses misuse.
- User Responsibility: Users are responsible for model inputs, outputs, and safe integration, including implementing guardrails.
- Reporting: Report security vulnerabilities or NVIDIA AI concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
Plus Plus (++) Promise
The model and its associated data have been verified for compliance with laws, regulations, and standards, and have been annotated and reviewed.
Bias
No bias field/response details are provided in this card.
🔧 Technical Details
- The model is a multi-modal LLM with a Vision Transformer (ViT) for vision encoding and a dense Transformer for language processing, based on the Qwen2.5-VL-7B-Instruct architecture.
- It is post-trained on specific datasets from NVIDIA to enhance physical common sense and embodied reasoning capabilities.
- The input and output formats and requirements are designed to support its application in Physical AI scenarios.
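For a quick look at how the ViT vision encoder and the dense Transformer LLM are composed, one can inspect the Hugging Face configuration; the sketch below assumes the standard `transformers` Qwen2.5-VL config layout and the `nvidia/Cosmos-Reason1-7B` checkpoint name.

```python
from transformers import AutoConfig

# Assumed checkpoint name; the config layout follows the Qwen2.5-VL family in transformers.
cfg = AutoConfig.from_pretrained("nvidia/Cosmos-Reason1-7B")

print(cfg.model_type)                                          # language-model family, e.g. "qwen2_5_vl"
print(cfg.num_hidden_layers, cfg.hidden_size)                  # dense Transformer LLM dimensions
print(cfg.vision_config.depth, cfg.vision_config.hidden_size)  # ViT vision encoder dimensions
```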
📄 License
This model is released under the NVIDIA Open Model License. For a custom license, please contact [cosmos-license@nvidia.com](mailto:cosmos-license@nvidia.com).
⚠️ Important Note
If You bypass, disable, reduce the efficacy of, or circumvent any technical limitation, safety guardrail or associated safety guardrail hyperparameter, encryption, security, digital rights management, or authentication mechanism (collectively "Guardrail") contained in the Model without a substantially similar Guardrail appropriate for your use case, your rights under the NVIDIA Open Model License Agreement will automatically terminate.
💡 Usage Tip
We recommend using 4096 or more output max tokens to avoid truncation of long chain-of-thought responses, and FPS = 4 for input video to match the training setup.