Cerule-v0.1 Open-source Vision-Language Model - Lightweight and Powerful, Focused on Image and Text Processing

Cerule V0.1

Developed by Tensoic

Cerule is a lightweight yet powerful vision-language model built on Google's Gemma-2b and SigLIP, focusing on image-text processing.

Image-to-Text

Transformers

English

#Lightweight Vision-Language Model #Multimodal Image Understanding #Efficient Fine-Tuning

Downloads 157

Release Time : 4/2/2024

Model Overview

Cerule is a compact but powerful vision-language model that combines Google's Gemma-2b and SigLIP, providing an efficient solution for image-text processing.

Model Features

Lightweight and Powerful

Built on Google's Gemma-2b and SigLIP, the model stays small while remaining capable

Rich Data

Pretraining used 650K images (a LAION subset) and finetuning used 695K images, which is intended to improve the model's generalization

Efficient Training

Training takes about 19 hours in total on 4 A100 80GB GPUs (roughly 6 hours of pretraining and 13 hours of finetuning)

Model Capabilities

Image description generation

Visual question answering

Image content analysis

Multimodal understanding

Use Cases

Image Understanding

Image Description

Generate detailed text descriptions for input images

In the example, multiple details of the astronaut image were successfully described

Character Recognition

Identify characters and their actions in the image

In the example, Mario, Luigi, and Yoshi were accurately identified

Humor/Creative Content Analysis

Unconventional Scene Understanding

Understand and describe humorous or unconventional image scenes

In the example, the humorous scene of 'extreme ironing' was correctly identified

🚀 Cerule - A Tiny Mighty Vision Model

Cerule is a compact yet powerful vision-language model built upon Google's Gemma-2b and SigLIP, offering remarkable performance in image-text-to-text tasks.

 ██████╗███████╗██████╗ ██╗   ██╗██╗     ███████╗
██╔════╝██╔════╝██╔══██╗██║   ██║██║     ██╔════╝
██║     █████╗  ██████╔╝██║   ██║██║     █████╗
██║     ██╔══╝  ██╔══██╗██║   ██║██║     ██╔══╝
╚██████╗███████╗██║  ██║╚██████╔╝███████╗███████╗
 ╚═════╝╚══════╝╚═╝  ╚═╝ ╚═════╝ ╚══════╝╚══════╝

We trained and released "Cerule", a small but powerful vision-language model based on Google's newly released Gemma-2b and Google's SigLIP.

- Pretraining stage: 650K images (a LAION subset)
- Finetuning stage: 695K images (SVIT-mix-665K - Bunny mix modified by BAAI)

Training ran on 4x A100 80GB GPUs and took approximately 6 hours for pretraining and about 13 hours for finetuning. We modified and adapted the training code from Bunny.
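To put the training figures above in perspective, the short Python sketch below converts them into GPU-hours and a rough overall throughput. The image counts, GPU count, and hours come from the text; the throughput numbers additionally assume a single pass over each dataset, which the text does not state, so treat them as a back-of-the-envelope estimate only.

```python
# Figures taken from the training description above.
NUM_GPUS = 4
PRETRAIN_HOURS = 6        # approximate wall-clock hours
FINETUNE_HOURS = 13       # approximate wall-clock hours
PRETRAIN_IMAGES = 650_000
FINETUNE_IMAGES = 695_000


def gpu_hours(wall_hours: float, num_gpus: int = NUM_GPUS) -> float:
    """Wall-clock hours multiplied by the number of GPUs."""
    return wall_hours * num_gpus


def images_per_second(num_images: int, wall_hours: float) -> float:
    """Aggregate throughput across all GPUs.

    ASSUMPTION: exactly one pass over the dataset (not stated in the source).
    """
    return num_images / (wall_hours * 3600)


total_wall_hours = PRETRAIN_HOURS + FINETUNE_HOURS
print(f"Total wall-clock time: ~{total_wall_hours} h")
print(f"Total compute: ~{gpu_hours(total_wall_hours)} GPU-hours")
print(f"Pretraining: ~{images_per_second(PRETRAIN_IMAGES, PRETRAIN_HOURS):.1f} images/s")
print(f"Finetuning:  ~{images_per_second(FINETUNE_IMAGES, FINETUNE_HOURS):.1f} images/s")
```

Under these figures, the whole run comes to roughly 76 A100 GPU-hours.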

✨ Features

Powerful Base: Built on Google's Gemma-2b and SigLIP, leveraging their advanced capabilities.
Efficient Training: Trained on a LAION subset and a modified SVIT mix in about 19 hours in total on 4x A100 80GB GPUs.
Versatile Inference: CLI-based inference is available through the provided repository.

💻 Usage Examples

Image Description Examples

Example 1 (astronaut image)
Prompt: Describe the image
Response: The image is a playful and surreal depiction of a man in a space suit, sitting on a chair and holding a green beer bottle. The man is wearing a white space suit, complete with a helmet and gloves. His feet are clad in black and white shoes, and he is placed on a sandy surface. The background features a large, blue planet, with a moon and a star visible in the sky.

Example 2 (Mario, Luigi, and Yoshi image)
Prompt: Who are the characters in the image?
Response: The image features three characters, two of them are Mario and Luigi, and the third one is Yoshi.
Prompt: Describe the actions of the characters
Response: The Mario and Luigi characters are holding their arms out, as if they are waving. Yoshi is standing on its own, with its arms folded.

Example 3 (extreme ironing image)
Prompt: What's funny about this image?
Response: The image is quite humorous as it depicts a man ironing clothes on the back of a yellow taxi cab. This is not a typical sight you'd expect to see in everyday life.

📚 Documentation

Training

The training code has been released: https://github.com/tensoic/Cerule

Inference

Clone the following repo and follow its instructions for CLI-based inference: https://github.com/tensoic/Cerule
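As a minimal sketch of the first step, the Python helper below clones the repository if it is not already present. The function names `clone_command` and `clone_repo` and the default `Cerule` destination folder are illustrative choices, not part of the project; the actual CLI inference command is documented in the repository's README and is not reproduced here.

```python
import subprocess
from pathlib import Path

REPO_URL = "https://github.com/tensoic/Cerule"  # URL given in the text above


def clone_command(dest: str = "Cerule") -> list[str]:
    """Build the git command that clones the repository into `dest`.

    `dest` is a hypothetical local folder name chosen for illustration.
    """
    return ["git", "clone", REPO_URL, dest]


def clone_repo(dest: str = "Cerule") -> Path:
    """Clone the repository unless `dest` already exists; return its path."""
    path = Path(dest)
    if not path.exists():
        # Requires git on PATH and network access; raises on failure.
        subprocess.run(clone_command(dest), check=True)
    return path
```

Calling `clone_repo()` leaves a local checkout in `Cerule/`; from there, follow the repository's own instructions to run inference.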

🔧 Technical Details

Training Data:
- Pretraining: 650K images (a LAION subset)
- Finetuning: 695K images (SVIT-mix-665K - Bunny mix modified by BAAI)
Training Setup: 4x A100 80GB GPUs, ~6 hours for pretraining and ~13 hours for finetuning
Code Adaptation: Modified and adapted training code from Bunny

📄 License

The model is subject to the Gemma (base model license) terms of use, and the underlying datasets (LAION and SVIT) are subject to their respective licenses. All code is licensed under Apache 2.0.

📝 Acknowledgments

We sincerely thank the amazing teams at Google, LLaVA, and BAAI, without whom this project would not have been possible!
