🚀 Cosmos-Reason1: Physical AI Common Sense and Embodied Reasoning Models
Cosmos-Reason1 is a Physical AI model that understands physical common sense and generates appropriate embodied decisions through long chain-of-thought reasoning. It is ready for commercial use; its input, output, and software integration requirements are detailed below.
📚 Documentation
Model Overview
Description
Cosmos-Reason1 Models: These Physical AI models can understand physical common sense and generate appropriate embodied decisions in natural language via long chain-of-thought reasoning processes.
The Cosmos-Reason1 models are post-trained with physical common sense and embodied reasoning data using supervised fine-tuning and reinforcement learning. They can understand space, time, and fundamental physics, and serve as planning models to reason about the next steps of an embodied agent.
The models are available for commercial use.
Model Developer: NVIDIA
Model Versions
Cosmos-Reason1 includes the following model:
- Cosmos-Reason1-7B: Given a text prompt and an input video, it thinks and generates an answer based on the input text prompt and video.
License
This model is released under the NVIDIA Open Model License. For a custom license, please contact cosmos-license@nvidia.com.
Under the NVIDIA Open Model License, NVIDIA confirms:
- The models are commercially usable.
- You are free to create and distribute Derivative Models.
- NVIDIA does not claim ownership of any outputs generated using the Models or Derivative Models.
⚠️ Important Note
If You bypass, disable, reduce the efficacy of, or circumvent any technical limitation, safety guardrail or associated safety guardrail hyperparameter, encryption, security, digital rights management, or authentication mechanism (collectively “Guardrail”) contained in the Model without a substantially similar Guardrail appropriate for your use case, your rights under the NVIDIA Open Model License Agreement will automatically terminate.
Deployment Geography
Global
Use Case
Physical AI: understanding of space, time, and fundamental physics, and embodied reasoning, encompassing robotics and autonomous vehicles (AV).
Release Date
Model Architecture
| Property | Details |
|---|---|
| Architecture Type | Multi-modal LLM consisting of a Vision Transformer (ViT) vision encoder and a dense Transformer LLM |
| Network Architecture | Qwen2.5-VL-7B-Instruct |

Cosmos-Reason1-7B is post-trained based on Qwen2.5-VL-7B-Instruct and follows the same model architecture.
Input
| Property | Details |
|---|---|
| Input Type(s) | Text+Video/Image |
| Input Format(s) | Text: String; Video: mp4; Image: jpg |
| Input Parameters | Text: One-dimensional (1D); Video: Three-dimensional (3D); Image: Two-dimensional (2D) |
| Other Properties Related to Input | Use FPS = 4 for input video to match the training setup. Append `Answer the question in the following format: <think>\nyour reasoning\n</think>\n\n<answer>\nyour answer\n</answer>.` to the system prompt to encourage a long chain-of-thought reasoning response. |
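As a concrete illustration, a minimal sketch of assembling such a system prompt in Python (only the format suffix is prescribed above; the base instruction and variable names are illustrative):

```python
# Format suffix recommended above to elicit long chain-of-thought responses.
REASONING_FORMAT = (
    "Answer the question in the following format: "
    "<think>\nyour reasoning\n</think>\n\n<answer>\nyour answer\n</answer>."
)

# The "helpful assistant" preamble is an illustrative choice, not a requirement.
system_prompt = "You are a helpful assistant. " + REASONING_FORMAT
```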
Output
| Property | Details |
|---|---|
| Output Type(s) | Text |
| Output Format | String |
| Output Parameters | Text: One-dimensional (1D) |
| Other Properties Related to Output | We recommend 4096 or more output max tokens to avoid truncation of long chain-of-thought responses. Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. |
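As a rough sketch of acting on this recommendation with vLLM (assuming `outputs` comes from an `llm.generate(...)` call like the usage example below; `finish_reason` is standard vLLM output metadata):

```python
from vllm import SamplingParams

# 4096+ output tokens leaves room for the long <think> section.
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=4096)

# A finish_reason of "length" means generation stopped at max_tokens,
# i.e. the chain-of-thought response was likely truncated.
completion = outputs[0].outputs[0]
if completion.finish_reason == "length":
    print("Warning: response truncated; consider raising max_tokens.")
```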
Software Integration
| Property | Details |
|---|---|
| Runtime Engine(s) | vLLM |
| Supported Hardware Microarchitecture Compatibility | NVIDIA Blackwell; NVIDIA Hopper |
| Note | We have only tested doing inference with BF16 precision. |
| Operating System(s) | Linux (We have not tested on other operating systems.) |
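For example, a minimal sketch of pinning vLLM to the tested BF16 precision at load time (`dtype="bfloat16"` is a standard `vllm.LLM` argument; the model path matches the usage example below):

```python
from vllm import LLM

# Load the checkpoint in BF16, the only precision reported as tested.
llm = LLM(model="nvidia/Cosmos-Reason1-7B", dtype="bfloat16")
```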
Usage
See Cosmos-Reason1 for details.
- Post Training: Cosmos-Reason1 provides examples of supervised fine-tuning and reinforcement learning on embodied reasoning datasets.
Evaluation
Please see our technical paper for detailed evaluations on physical common sense and embodied reasoning. Part of the evaluation datasets are released under Cosmos-Reason1-Benchmark. The embodied reasoning datasets and benchmarks focus on the following areas: robotics (RoboVQA, BridgeDataV2, AgiBot, RoboFail), ego-centric human demonstration (HoloAssist), and Autonomous Vehicle (AV) driving video data. The AV dataset is collected and annotated by NVIDIA.
All datasets go through the data annotation process described in the technical paper to prepare training and evaluation data and annotations.
Data Collection Method
| Dataset | Collection Method |
|---|---|
| RoboVQA | Hybrid: Automatic/Sensors |
| BridgeDataV2 | Automatic/Sensors |
| AgiBot | Automatic/Sensors |
| RoboFail | Automatic/Sensors |
| HoloAssist | Human |
| AV | Automatic/Sensors |
Labeling Method
| Dataset | Labeling Method |
|---|---|
| RoboVQA | Hybrid: Human, Automated |
| BridgeDataV2 | Hybrid: Human, Automated |
| AgiBot | Hybrid: Human, Automated |
| RoboFail | Hybrid: Human, Automated |
| HoloAssist | Hybrid: Human, Automated |
| AV | Hybrid: Human, Automated |
Metrics
Dataset Format
Modality: Video (mp4) and Text
Dataset Quantification
We release the embodied reasoning data and benchmarks. Each data sample is a pair of video and text. The text annotations include understanding and reasoning annotations described in the Cosmos-Reason1 paper. Each video may have multiple text annotations. The quantity of the video and text pairs is described in the table below.
⚠️ Important Note
The AV data is currently unavailable and will be uploaded soon!
We release text annotations for all embodied reasoning datasets, and videos for the RoboVQA and AV datasets. For the other datasets, users may download the source videos from the original data source and locate the corresponding videos via the video names. The held-out RoboFail benchmark is released for measuring generalization capability.
Inference
Test Hardware: H100, A100, GB200
⚠️ Important Note
We suggest using `fps=4` for the input video and `max_tokens=4096` to avoid truncated responses.
💻 Usage Examples
Basic Usage
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

MODEL_PATH = "nvidia/Cosmos-Reason1-7B"

# Allow up to 10 images and 10 videos per prompt.
llm = LLM(
    model=MODEL_PATH,
    limit_mm_per_prompt={"image": 10, "video": 10},
)

# 4096+ max tokens leaves room for the long chain-of-thought response.
sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.05,
    max_tokens=4096,
)

# The system prompt requests the <think>/<answer> output format;
# the video is sampled at fps=4 to match the training setup.
video_messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant. Answer the question in the following format: <think>\nyour reasoning\n</think>\n\n<answer>\nyour answer\n</answer>.",
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Is it safe to turn right?"},
            {
                "type": "video",
                "video": "file:///path/to/your/video.mp4",
                "fps": 4,
            },
        ],
    },
]
messages = video_messages

# Render the chat template into a single prompt string.
processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Extract image/video inputs and per-video processor kwargs (e.g. fps).
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)

mm_data = {}
if image_inputs is not None:
    mm_data["image"] = image_inputs
if video_inputs is not None:
    mm_data["video"] = video_inputs

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
    "mm_processor_kwargs": video_kwargs,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text
print(generated_text)
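The generated text interleaves reasoning and the final answer in `<think>` and `<answer>` tags. A minimal sketch of separating the two (the regex-based `parse_response` helper is illustrative, not part of the model's API):

```python
import re

def parse_response(text: str) -> tuple[str, str]:
    """Split a Cosmos-Reason1 response into (reasoning, answer).

    Falls back to the raw text as the answer if the tags are absent.
    """
    think = re.search(r"<think>\s*(.*?)\s*</think>", text, re.DOTALL)
    answer = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
    return (
        think.group(1) if think else "",
        answer.group(1) if answer else text.strip(),
    )

reasoning, answer = parse_response(generated_text)
print("Reasoning:", reasoning)
print("Answer:", answer)
```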
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.
For more detailed information on ethical considerations for this model, please see the subcards of Explainability, Bias, Safety & Security, and Privacy below.
Please report security vulnerabilities or NVIDIA AI Concerns here.
Plus Plus (++) Promise
We value you, the datasets, the diversity...
📄 License
This model is released under the NVIDIA Open Model License.