Cosmos-Reason1: Physical AI Common Sense and Embodied Reasoning Models
Cosmos-Reason1 is a set of Physical AI models that can understand physical common sense and perform embodied reasoning through long chain-of-thought processes. It has wide applications in fields like robotics and autonomous vehicles.
🚀 Quick Start
For detailed usage, please refer to the [Cosmos-Reason1](https://github.com/nvidia-cosmos/cosmos-reason1) repository, which provides examples of supervised fine-tuning and reinforcement learning on embodied reasoning datasets.
✨ Features
- Unsloth Dynamic 2.0: Achieves superior accuracy and outperforms other leading quants.
- Multi-modal LLM: Consists of a Vision Transformer (ViT) vision encoder and a dense Transformer LLM, based on the Qwen2.5-VL-7B-Instruct architecture.
- Commercial Usability: Commercially usable under the NVIDIA Open Model License.
- Global Deployment: Can be deployed globally.
- Wide Use Cases: Applicable in Physical AI, including robotics and autonomous vehicles.
📦 Installation
No installation steps are provided in this card; see the [Cosmos-Reason1](https://github.com/nvidia-cosmos/cosmos-reason1) repository referenced under Quick Start.
💻 Usage Examples
No official code example is included in this card; the sketch below illustrates one possible inference setup.
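The following is a minimal sketch, assuming the vLLM runtime listed under Software Integration, the `qwen-vl-utils` helper package commonly used with Qwen2.5-VL-based models, and the `nvidia/Cosmos-Reason1-7B` checkpoint; the video path, question, and sampling settings are illustrative placeholders rather than an official recipe.

```python
# Illustrative sketch only: checkpoint name, video path, question, and sampling
# settings are assumptions, not an official example from the model card.
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info  # helper commonly used with Qwen2.5-VL models

MODEL_ID = "nvidia/Cosmos-Reason1-7B"  # assumed checkpoint name

# BF16 is the only precision the card reports as tested for inference.
llm = LLM(model=MODEL_ID, dtype="bfloat16", limit_mm_per_prompt={"video": 1})

# 4096+ max tokens is recommended so long chain-of-thought responses are not truncated.
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=4096)

# System prompt requesting the <think>/<answer> format described under Documentation.
system_prompt = (
    "You are a helpful assistant. Answer the question in the following format: "
    "<think>\nyour reasoning\n</think>\n\n<answer>\nyour answer\n</answer>."
)

messages = [
    {"role": "system", "content": system_prompt},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Is it safe for the robot to pick up the cup next?"},
            # FPS = 4 matches the input convention in the model card.
            {"type": "video", "video": "file:///path/to/video.mp4", "fps": 4},
        ],
    },
]

processor = AutoProcessor.from_pretrained(MODEL_ID)
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Decode video frames and collect per-video kwargs (e.g. fps) for vLLM's multimodal input.
_, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)

outputs = llm.generate(
    [{
        "prompt": prompt,
        "multi_modal_data": {"video": video_inputs},
        "mm_processor_kwargs": video_kwargs,
    }],
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)
```

The generated text is expected to contain the reasoning inside `<think>...</think>` followed by the final answer inside `<answer>...</answer>`, per the prompt format described under Documentation.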
📚 Documentation
Model Overview
- Model Developer: NVIDIA
- Model Versions: [Cosmos-Reason1-7B](https://huggingface.co/nvidia/Cosmos-Reason1-7B) can generate answers based on text prompts and input videos.
- License: Released under the NVIDIA Open Model License. For custom licenses, contact [cosmos-license@nvidia.com](mailto:cosmos-license@nvidia.com).
- Permissions: Commercially usable, free to create and distribute derivative models, and NVIDIA does not claim ownership of outputs.
- Important Note: Bypassing guardrails without appropriate substitutes will terminate your rights under the agreement.
- Deployment Geography: Global
- Use Case: Physical AI, including understanding of space, time, fundamental physics, and embodied reasoning in robotics and autonomous vehicles.
- Release Date:
  - GitHub: [05/17/2025](https://github.com/nvidia-cosmos/cosmos-reason1)
  - Hugging Face: [05/17/2025](https://huggingface.co/collections/nvidia/cosmos-reason1-67c9e926206426008f1da1b7)
Model Architecture
- Architecture Type: A multi-modal LLM with a Vision Transformer (ViT) vision encoder and a dense Transformer LLM.
- Network Architecture: Qwen2.5-VL-7B-Instruct. Cosmos-Reason1-7B is post-trained based on [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct).
Input
| Property | Details |
|----------|---------|
| Input Type(s) | Text+Video/Image |
| Input Format(s) | Text: String; Video: mp4; Image: jpg |
| Input Parameters | Text: One-dimensional (1D); Video: Three-dimensional (3D); Image: Two-dimensional (2D) |
| Other Properties Related to Input | Use FPS = 4 for input video. Append `<think>\nyour reasoning\n</think>\n\n<answer>\nyour answer\n</answer>` in the system prompt to elicit long chain-of-thought reasoning. |
Output
| Property | Details |
|----------|---------|
| Output Type(s) | Text |
| Output Format | String |
| Output Parameters | Text: One-dimensional (1D) |
| Other Properties Related to Output | Recommend using 4096 or more output max tokens. Designed to run on NVIDIA GPU-accelerated systems for faster training and inference. |
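Because the output is a single string that interleaves reasoning and the final answer in the `<think>`/`<answer>` format noted above, a small post-processing step can separate the two; the helper below is an illustrative sketch, not part of the official card.

```python
import re

def split_reasoning_and_answer(generated_text: str) -> tuple[str, str]:
    """Separate the chain-of-thought from the final answer in a <think>/<answer> response."""
    think = re.search(r"<think>\s*(.*?)\s*</think>", generated_text, flags=re.DOTALL)
    answer = re.search(r"<answer>\s*(.*?)\s*</answer>", generated_text, flags=re.DOTALL)
    reasoning = think.group(1) if think else ""
    # Fall back to the raw text if the expected tags are missing.
    final = answer.group(1) if answer else generated_text.strip()
    return reasoning, final

reasoning, final = split_reasoning_and_answer(
    "<think>\nThe pedestrian is still in the crosswalk.\n</think>\n\n"
    "<answer>\nNo, the vehicle should wait.\n</answer>"
)
print(final)  # -> "No, the vehicle should wait."
```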
Software Integration
- Runtime Engine(s): [vLLM](https://github.com/vllm-project/vllm)
- Supported Hardware Microarchitecture Compatibility: NVIDIA Blackwell, NVIDIA Hopper
- Note: Inference has only been tested with BF16 precision.
- Operating System(s): Linux
Evaluation
- Datasets: Part of the evaluation datasets are released under [Cosmos-Reason1-Benchmark](https://huggingface.co/datasets/nvidia/Cosmos-Reason1-Benchmark), covering robotics, egocentric human demonstration, and autonomous vehicle driving video data.
- Data Collection Method: Hybrid (Automatic/Sensors or Human) for different datasets.
- Labeling Method: Hybrid (Human, Automated) for all datasets.
- Metrics:
| | RoboVQA | AV | [BridgeDataV2](https://rail-berkeley.github.io/bridgedata/) | [Agibot](https://github.com/OpenDriveLab/AgiBot-World) | HoloAssist | [RoboFail](https://robot-reflect.github.io/) | Average |
|---|---|---|---|---|---|---|---|
| Accuracy | 87.3 | 70.8 | 63.7 | 48.9 | 62.7 | 57.2 | 65.1 |
Dataset Format
Modality: Video (mp4) and Text
Dataset Quantification
| | RoboVQA | AV | [BridgeDataV2](https://rail-berkeley.github.io/bridgedata/) | [Agibot](https://github.com/OpenDriveLab/AgiBot-World) | HoloAssist | [RoboFail](https://robot-reflect.github.io/) | Total Storage Size |
|---|---|---|---|---|---|---|---|
| SFT Data | 1.14m | 24.7k | 258k | 38.9k | 273k | N/A | 300.6GB |
| RL Data | 252 | 200 | 240 | 200 | 200 | N/A | 2.6GB |
| Benchmark Data | 110 | 100 | 100 | 100 | 100 | 100 | 1.5GB |
Inference
- Acceleration Engine: PyTorch, flash attention
- Test Hardware: H100, A100, GB200
- Requirement: Minimum of 2 GPU cards; multi-node setups require an InfiniBand / RoCE connection
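As a rough illustration of how these notes might map onto a runtime configuration, the sketch below instantiates vLLM with BF16 precision across two GPUs via tensor parallelism; the checkpoint name and arguments are assumptions, not an official configuration.

```python
from vllm import LLM

# Assumed configuration sketch: two GPUs on one node via tensor parallelism, BF16 weights.
llm = LLM(
    model="nvidia/Cosmos-Reason1-7B",  # assumed checkpoint name
    dtype="bfloat16",                  # inference is reported as tested only with BF16
    tensor_parallel_size=2,            # reflects the "minimum 2 GPU cards" requirement above
    limit_mm_per_prompt={"video": 1},  # one video per prompt in this sketch
)
```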
Ethical Considerations
- General Responsibility: NVIDIA believes in Trustworthy AI. Developers should ensure the model meets industry requirements and addresses misuse.
- User Responsibility: Users are responsible for model inputs, outputs, and safe integration, including implementing guardrails.
- Reporting: Report security vulnerabilities or NVIDIA AI concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
Plus Plus (++) Promise
The model and its associated data have been verified for compliance with laws, regulations, and standards, and have been annotated and reviewed.
Bias
No bias field/response details are provided in this card.
🔧 Technical Details
- The model is a multi-modal LLM with a Vision Transformer (ViT) for vision encoding and a dense Transformer for language processing, based on the Qwen2.5-VL-7B-Instruct architecture.
- It is post-trained on specific datasets from NVIDIA to enhance physical common sense and embodied reasoning capabilities.
- The input and output formats and requirements are designed to support its application in Physical AI scenarios.
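For a quick look at how the ViT vision encoder and the dense Transformer LLM are composed, one can inspect the Hugging Face configuration; the sketch below assumes the standard `transformers` Qwen2.5-VL config layout and the `nvidia/Cosmos-Reason1-7B` checkpoint name.

```python
from transformers import AutoConfig

# Assumed checkpoint name; the config layout follows the Qwen2.5-VL family in transformers.
cfg = AutoConfig.from_pretrained("nvidia/Cosmos-Reason1-7B")

print(cfg.model_type)                                          # language-model family, e.g. "qwen2_5_vl"
print(cfg.num_hidden_layers, cfg.hidden_size)                  # dense Transformer LLM dimensions
print(cfg.vision_config.depth, cfg.vision_config.hidden_size)  # ViT vision encoder dimensions
```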
📄 License
This model is released under the NVIDIA Open Model License. For a custom license, please contact [cosmos-license@nvidia.com](mailto:cosmos-license@nvidia.com).
⚠️ Important Note
If You bypass, disable, reduce the efficacy of, or circumvent any technical limitation, safety guardrail or associated safety guardrail hyperparameter, encryption, security, digital rights management, or authentication mechanism (collectively "Guardrail") contained in the Model without a substantially similar Guardrail appropriate for your use case, your rights under the NVIDIA Open Model License Agreement will automatically terminate.
💡 Usage Tip
We recommend using 4096 or more output max tokens to avoid truncation of long chain-of-thought responses, and FPS = 4 for input video to match the training setup.