Open-source FlowerVLA robot manipulation model, customized for robot learning, efficiently supporting manipulation training!

Flower Calvin Abcd

Developed by mbreuss

FlowerVLA is a robot manipulation model pre-trained on the CALVIN ABCD dataset. It is a vision-language-action (VLA) flow policy designed for robot learning, with only about 1 billion parameters.

Multimodal Fusion

Safetensors

Language: English
Open Source License: MIT
Tags: #Robot Manipulation Control #Vision-Language-Action Flow #1 Billion Parameter Lightweight

Release Time: 3/16/2025

Model Overview

FlowerVLA is an efficient vision-language-action flow policy. It uses half of Florence-2 for multimodal vision-language encoding and combines it with a novel Transformer-based flow matching architecture, yielding an efficient, general-purpose policy.

Model Features

Efficient Multimodal Encoding

Uses half of Florence-2 as the multimodal vision-language encoder, which keeps the encoder small.

Flow Matching Architecture

Generates actions with a novel Transformer-based flow matching architecture. The full policy has only about 1 billion parameters.

High Performance

Ranked first in the CALVIN ABCD challenge with an average length of 4.72.

Model Capabilities

Vision-Language-Action Encoding

Robot Manipulation

Multimodal Task Execution

Use Cases

Robotics

Object Picking

Picks specific objects based on language instructions, such as a blue cube.

Achieved a 99.1% success rate on the first instruction of the CALVIN ABCD evaluation sequences (see Model Performance).

🚀 FlowerVLA - Vision-Language-Action Flow Model for CALVIN ABCD

A pre-trained FlowerVLA model for robotic manipulation, trained on the CALVIN ABCD dataset. Flower is an efficient Vision-Language-Action Flow policy for robot learning with only 1B parameters.

🚀 Quick Start

Check out our full model implementation on GitHub (todo) and follow the instructions in the README to test the model on one of the environments.

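# static_image and gripper_image are (B, T, 3, H, W) RGB tensors
# (see Input/Output Specifications); model is a loaded FlowerVLA policy.
obs = {
    "rgb_obs": {
        "rgb_static": static_image,    # static camera
        "rgb_gripper": gripper_image,  # gripper camera
    }
}
goal = {"lang_text": "pick up the blue cube"}  # language instruction
action = model.step(obs, goal)  # (B, T, 7) delta end-effector actions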

✨ Features

FlowerVLA is a novel architecture that:

Uses half of Florence-2 for multi-modal vision-language encoding
Employs a novel transformer-based flow matching architecture
Provides an efficient, versatile VLA policy with only ~1B parameters

📚 Documentation

Model Performance

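This checkpoint contains the weights for the CALVIN ABCD challenge and currently ranks first on that benchmark. In the table below, columns 1 to 5 give the success rate for completing that many instructions in a row, and Avg. Len. is the average number of consecutive instructions completed (maximum 5):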

Train→Test	Method	1	2	3	4	5	Avg. Len.
{dataset_name}	FlowerVLA	99.1%	97.8%	95.2%	92.4%	87.8%	4.72
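
As a quick consistency check on the table, the average length can be recomputed from the five success rates. This is a minimal sketch that assumes the standard CALVIN long-horizon protocol, where column k is the fraction of rollouts that complete at least k instructions in a row; the function name is illustrative and not part of the FlowerVLA code base.

```python
# Success rates for completing at least k instructions in a row (k = 1..5),
# copied from the results table.
success_rates = [0.991, 0.978, 0.952, 0.924, 0.878]


def average_chain_length(rates):
    """Expected number of consecutively completed instructions.

    For a non-negative integer N, E[N] = sum over k of P(N >= k), so the
    average length is the sum of the per-column success rates.
    """
    return sum(rates)


avg_len = average_chain_length(success_rates)
print(f"{avg_len:.2f}")  # prints 4.72
```

The sum is 4.723, which rounds to the reported 4.72.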

Input/Output Specifications

Inputs

RGB Static Camera: (B, T, 3, H, W) tensor
RGB Gripper Camera: (B, T, 3, H, W) tensor
Language Instructions: Text strings

Outputs

Action Space: (B, T, 7) tensor representing delta end-effector (EEF) actions
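
The shapes above can be checked before calling the model. The following is a hedged sketch, not part of the FlowerVLA code base: `check_io_shapes` and `FakeTensor` are hypothetical names, and `FakeTensor` only stands in for any array type that exposes `.shape` (for example a torch tensor or a NumPy array). The image sizes are arbitrary, and the sketch does not assume that the observation T and the action T are equal, since the specification does not say so.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class FakeTensor:
    """Stand-in for a torch.Tensor or NumPy array; only .shape is used."""
    shape: Tuple[int, ...]


def check_io_shapes(obs, action, camera_keys=("rgb_static", "rgb_gripper")):
    """Validate shapes against the documented layout and return the batch size B."""
    batch = None
    for key in camera_keys:
        shape = tuple(obs["rgb_obs"][key].shape)
        # Each camera stream must be (B, T, 3, H, W): 5 dims, 3 colour channels.
        if len(shape) != 5 or shape[2] != 3:
            raise ValueError(f"{key}: expected (B, T, 3, H, W), got {shape}")
        if batch is None:
            batch = shape[0]
        elif shape[0] != batch:
            raise ValueError(f"{key}: batch size {shape[0]} != {batch}")
    a_shape = tuple(action.shape)
    # Actions must be (B, T, 7); T is not compared with the observation T.
    if len(a_shape) != 3 or a_shape[0] != batch or a_shape[2] != 7:
        raise ValueError(f"action: expected ({batch}, T, 7), got {a_shape}")
    return batch


# Arbitrary example sizes; the two cameras may use different resolutions.
obs = {
    "rgb_obs": {
        "rgb_static": FakeTensor((2, 1, 3, 224, 224)),
        "rgb_gripper": FakeTensor((2, 1, 3, 112, 112)),
    }
}
action = FakeTensor((2, 10, 7))
print(check_io_shapes(obs, action))  # prints 2
```

A check like this catches a channels-last image (B, T, H, W, 3) or a missing time dimension early, before the tensors reach the policy.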

Training Details

Configuration

Optimizer: AdamW
Learning Rate: 2e-5
Weight Decay: 0.05
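
To make the role of these two numbers concrete, here is a minimal pure-Python sketch of one AdamW update on a single scalar parameter. Only the learning rate and weight decay come from the configuration above; `beta1`, `beta2`, and `eps` are assumed common defaults, and the source does not state which values, learning-rate schedule, or parameter groups were actually used in training.

```python
import math


def adamw_step(theta, grad, m, v, t, lr=2e-5, weight_decay=0.05,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamW update for a scalar parameter.

    lr and weight_decay follow the listed configuration; beta1, beta2 and
    eps are assumptions (common defaults), not values from the model card.
    """
    # Decoupled weight decay: shrink the parameter directly, independent
    # of the gradient statistics.
    theta = theta * (1.0 - lr * weight_decay)
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction for the zero-initialised averages (t starts at 1).
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


theta, m, v = adamw_step(theta=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(f"{theta:.6f}")  # prints 0.999979
```

In this first step the decay term removes lr × weight_decay = 1e-6 of the parameter, while the gradient term moves it by roughly lr = 2e-5, so the decay is a small correction on top of the Adam step.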

📄 License

This model is released under the MIT license.

@inproceedings{ reuss2025flower, # Add citation when available }
