Open-source FlowerVLA Vision-Language-Action Flow Model: A General Robot Manipulation Policy with Few Parameters

Flower Calvin D

Developed by mbreuss

FlowerVLA is a vision-language-action flow model pre-trained on the CALVIN D dataset. Its efficient flow-matching architecture yields a general-purpose robot manipulation policy with only about 1 billion parameters.

Multimodal Fusion

Safetensors

Language: English | Open Source License: MIT | #Robot Operation Control #Vision-Language-Action Flow #Efficient Parameter Architecture

Downloads 16

Release Date: 3/16/2025

Model Overview

FlowerVLA is a vision-language-action flow policy for robotic manipulation: given camera images and a language instruction, it generates the corresponding robot actions.

Model Features

Efficient Architecture

Employs a novel Transformer-based flow-matching architecture that yields an efficient, general-purpose VLA policy with only about 1 billion parameters

Multimodal Encoding

Uses half of Florence-2 for multimodal vision-language encoding, fusing visual and linguistic information

High Performance

Ranks first on the CALVIN D challenge with an average sequence length of 4.36

Model Capabilities

Vision-Language-Action Mapping

Robot Manipulation Control

Multimodal Information Processing

Use Cases

Robotics

Object Grasping

Identify and grasp specific objects based on language instructions

Achieves a 98.4% success rate on the first task of each CALVIN D evaluation sequence

Task Sequence Execution

Execute complex multi-step manipulation tasks

Completes an average of 4.36 tasks in a row on CALVIN D sequences of five chained instructions

🚀 FlowerVLA - Vision-Language-Action Flow Model for CALVIN D

A pretrained FlowerVLA model for robotic manipulation, trained on the CALVIN D dataset. Flower is an efficient Vision-Language-Action Flow policy for robot learning with only about 1B parameters.

🚀 Quick Start

Check out our full model implementation on GitHub (link: todo) and follow the instructions in the README to test the model on one of the environments.

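# `model` is a loaded FlowerVLA policy; both images are RGB tensors of
# shape (B, T, 3, H, W), as listed under Input/Output Specifications.
obs = {
    "rgb_obs": {
        "rgb_static": static_image,    # static camera view
        "rgb_gripper": gripper_image,  # gripper camera view
    }
}
goal = {"lang_text": "pick up the blue cube"}  # language instruction
action = model.step(obs, goal)  # delta end-effector action (see Outputs)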
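To make the call pattern above more concrete, here is a minimal sketch of a closed-loop rollout around `step(obs, goal)`. `DummyPolicy`, `rollout`, and `get_images` are hypothetical stand-ins introduced only for illustration; they are not part of the FlowerVLA codebase, and the dummy returns zeros where the real policy would return a predicted action.

```python
from typing import Any, Callable, Dict, List, Tuple

ACTION_DIM = 7  # the output spec lists a 7-dimensional delta EEF action


class DummyPolicy:
    """Hypothetical stand-in that only mimics the step(obs, goal) call."""

    def step(self, obs: Dict[str, Any], goal: Dict[str, str]) -> List[float]:
        # Check the nested keys used in the Quick Start snippet.
        rgb = obs["rgb_obs"]
        for key in ("rgb_static", "rgb_gripper"):
            if key not in rgb:
                raise KeyError(f"missing camera input: {key}")
        if not goal.get("lang_text"):
            raise ValueError("goal needs a non-empty 'lang_text'")
        return [0.0] * ACTION_DIM  # placeholder action, not a prediction


def rollout(
    policy: Any,
    get_images: Callable[[], Tuple[Any, Any]],
    instruction: str,
    num_steps: int,
) -> List[List[float]]:
    """Query the policy once per control step with fresh camera images."""
    goal = {"lang_text": instruction}
    actions = []
    for _ in range(num_steps):
        static_image, gripper_image = get_images()
        obs = {
            "rgb_obs": {
                "rgb_static": static_image,
                "rgb_gripper": gripper_image,
            }
        }
        actions.append(policy.step(obs, goal))
    return actions


# Strings stand in for image tensors here; a real setup would read cameras.
actions = rollout(
    DummyPolicy(),
    get_images=lambda: ("static", "gripper"),
    instruction="pick up the blue cube",
    num_steps=3,
)
print(len(actions), len(actions[0]))
```

In a real evaluation, each returned action would be sent to the environment before the next pair of images is captured.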

✨ Features

FlowerVLA is a novel architecture that:

Uses half of Florence-2 for multi-modal vision-language encoding
Employs a novel transformer-based flow matching architecture
Provides an efficient, versatile VLA policy with only ~1B parameters
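As a rough illustration of what flow matching means at inference time, the sketch below integrates a velocity field from a noise sample at t=0 to an action at t=1 with Euler steps. This is the generic technique, not FlowerVLA's implementation: `euler_sample`, `toy_velocity`, and `target_action` are hypothetical names, and the closed-form toy field stands in for the learned transformer, which would also be conditioned on the images and the instruction.

```python
import random
from typing import Callable, List

Vector = List[float]


def euler_sample(
    velocity: Callable[[Vector, float], Vector],
    noise: Vector,
    num_steps: int,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 (noise) to t=1 (sample)."""
    x = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Made-up 7-dimensional action used as the end point of the toy flow.
target_action = [0.02, -0.01, 0.0, 0.0, 0.0, 0.1, 1.0]


def toy_velocity(x: Vector, t: float) -> Vector:
    # On the straight path x_t = (1 - t) * noise + t * target, the velocity
    # (target - noise) can be rewritten as (target - x_t) / (1 - t).
    return [(a - xi) / (1.0 - t) for a, xi in zip(target_action, x)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target_action]
action = euler_sample(toy_velocity, noise, num_steps=4)
print([round(a, 3) for a in action])
```

Because the toy field follows a straight path, a handful of Euler steps already lands on the target; a learned field only approximates this, so the number of steps becomes a speed versus accuracy trade-off.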

📚 Documentation

Model Performance

This checkpoint contains weights for the CALVIN D challenge and currently ranks first with the following results:

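Train→Test	Method	1	2	3	4	5	Avg. Len.
CALVIN D	FlowerVLA	98.4%	94.0%	87.9%	81.7%	74.1%	4.36

Columns 1 to 5 give the success rate for completing at least that many consecutive tasks in a sequence; Avg. Len. is the average number of tasks completed in a row.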
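The average length can be checked directly from the five success rates. Assuming, as is standard for CALVIN-style evaluation, that column k reports the fraction of sequences with at least k consecutive successes, the expected number of tasks completed in a row is simply their sum. The variable names below are illustrative.

```python
# Success rates from the results table: entry k is the fraction of evaluation
# sequences in which at least k consecutive tasks were completed.
success_rates = [0.984, 0.940, 0.879, 0.817, 0.741]

# Expected number of tasks completed in a row: E[L] = sum over k of P(L >= k).
avg_len = sum(success_rates)
print(f"{avg_len:.2f}")  # prints 4.36

# Probability of completing exactly k tasks in a row, for k = 0..5.
at_least = [1.0] + success_rates + [0.0]
exactly = [at_least[k] - at_least[k + 1] for k in range(len(success_rates) + 1)]
print([round(p, 3) for p in exactly])
```

The second list shows where sequences stop: most of the probability mass sits at five completed tasks, and each earlier entry is the share of sequences that fail right after k successes.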

Input/Output Specifications

Inputs

RGB Static Camera: (B, T, 3, H, W) tensor
RGB Gripper Camera: (B, T, 3, H, W) tensor
Language Instructions: Text strings
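A small shape check can catch layout mistakes before the inputs reach the model. The sketch below works on plain shape tuples (for example `tensor.shape`). The helper names are hypothetical, and two of the rules are assumptions rather than documented requirements: that both cameras share B and T but may differ in H and W, and that there is one instruction per batch element.

```python
from typing import Sequence, Tuple


def check_rgb_shape(name: str, shape: Sequence[int]) -> Tuple[int, int]:
    """Validate a (B, T, 3, H, W) shape and return (B, T)."""
    if len(shape) != 5:
        raise ValueError(f"{name}: expected (B, T, 3, H, W), got {tuple(shape)}")
    batch, time, channels, height, width = shape
    if channels != 3:
        raise ValueError(f"{name}: expected 3 RGB channels, got {channels}")
    if min(batch, time, height, width) < 1:
        raise ValueError(f"{name}: all dimensions must be positive")
    return batch, time


def check_inputs(
    static_shape: Sequence[int],
    gripper_shape: Sequence[int],
    instructions: Sequence[str],
) -> Tuple[int, int]:
    """Check that both camera streams and the instructions are consistent."""
    bt_static = check_rgb_shape("rgb_static", static_shape)
    bt_gripper = check_rgb_shape("rgb_gripper", gripper_shape)
    if bt_static != bt_gripper:  # assumption: cameras share B and T
        raise ValueError(f"B/T mismatch: {bt_static} vs {bt_gripper}")
    if len(instructions) != bt_static[0]:  # assumption: one text per sample
        raise ValueError("need one instruction per batch element")
    return bt_static


# Illustrative sizes only; the image resolutions are not given in the source.
batch_and_time = check_inputs(
    static_shape=(2, 1, 3, 224, 224),
    gripper_shape=(2, 1, 3, 112, 112),
    instructions=["pick up the blue cube", "open the drawer"],
)
print(batch_and_time)
```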

Outputs

Action Space: (B, T, 7) tensor representing delta end-effector (EEF) actions
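To show how a single 7-dimensional action might be consumed, the sketch below splits one action vector into named parts. The component order (three position deltas, three rotation deltas, one gripper command) is an assumption based on the usual CALVIN relative-action convention; it is not stated in this model card, and `DeltaEEFAction` and `split_action` are hypothetical names.

```python
from typing import NamedTuple, Sequence, Tuple


class DeltaEEFAction(NamedTuple):
    delta_position: Tuple[float, ...]  # assumed order: dx, dy, dz
    delta_rotation: Tuple[float, ...]  # assumed order: droll, dpitch, dyaw
    gripper: float                     # assumed: open/close command


def split_action(action: Sequence[float]) -> DeltaEEFAction:
    """Split one 7-dim action vector into position, rotation, and gripper."""
    if len(action) != 7:
        raise ValueError(f"expected 7 action values, got {len(action)}")
    values = [float(a) for a in action]
    return DeltaEEFAction(
        delta_position=tuple(values[0:3]),
        delta_rotation=tuple(values[3:6]),
        gripper=values[6],
    )


# Made-up numbers standing in for one (b, t) slice of the (B, T, 7) output.
step_action = split_action([0.02, -0.01, 0.0, 0.0, 0.0, 0.1, 1.0])
print(step_action.delta_position, step_action.gripper)
```

With a (B, T, 7) output, this helper would be applied to each `action[b][t]` slice before sending commands to the robot.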

Training Details

Configuration

Optimizer: AdamW
Learning Rate: 2e-5
Weight Decay: 0.05
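For readers unfamiliar with AdamW, the sketch below spells out a single scalar update using the learning rate and weight decay listed above. The values of `beta1`, `beta2`, and `eps` are assumptions (common defaults); the model card does not state them, and `adamw_step` is a hypothetical helper, not the training code.

```python
import math
from typing import Tuple

LEARNING_RATE = 2e-5  # from the configuration above
WEIGHT_DECAY = 0.05   # from the configuration above


def adamw_step(
    param: float,
    grad: float,
    m: float,
    v: float,
    t: int,
    lr: float = LEARNING_RATE,
    weight_decay: float = WEIGHT_DECAY,
    beta1: float = 0.9,    # assumption: not given in the model card
    beta2: float = 0.999,  # assumption: not given in the model card
    eps: float = 1e-8,     # assumption: not given in the model card
) -> Tuple[float, float, float]:
    """One AdamW update for a scalar parameter; t is the 1-based step count."""
    # Decoupled weight decay: shrink the parameter directly, not via the gradient.
    param = param * (1.0 - lr * weight_decay)
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction for the zero-initialized averages.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


new_param, m, v = adamw_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(new_param)
```

In PyTorch the same settings would typically be passed as `torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.05)`.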

📄 License

This model is released under the MIT license.

Citation

@inproceedings{reuss2025flower,
  # Add citation when available
}
