Cerule - A Tiny Mighty Vision Model
Cerule is a compact yet powerful vision language model built on Google's Gemma-2b and SigLIP, with strong performance on image-text-to-text tasks.
We trained and released "Cerule", a small but powerful vision language model based on Google's newly released Gemma-2b and Google's SigLIP.
- Pretraining stage: 650K images (a LAION subset)
- Finetuning stage: 695K images (SVIT-mix-665K - Bunny mix modified by BAAI)
The training setup was 4x A100 80GB GPUs, taking approximately 6 hours for pretraining and about 13 hours for finetuning. We modified and adapted the training code from Bunny.
Features
- Powerful Base: Built on Google's Gemma-2b and SigLIP, leveraging their advanced capabilities.
- Efficient Training: Trained on specific image subsets with a reasonable time cost on a 4x A100 80GB setup.
- Versatile Inference: Supports CLI-based inference through the provided repository.
Usage Examples
Image Description Examples
Each example pairs an image (not reproduced here) with a prompt and Cerule's response.
- Example 1
  - Prompt: Describe the image
  - Response: The image is a playful and surreal depiction of a man in a space suit, sitting on a chair and holding a green beer bottle. The man is wearing a white space suit, complete with a helmet and gloves. His feet are clad in black and white shoes, and he is placed on a sandy surface. The background features a large, blue planet, with a moon and a star visible in the sky.
- Example 2
  - Prompt: Who are the characters in the image?
  - Response: The image features three characters, two of them are Mario and Luigi, and the third one is Yoshi.
  - Prompt: Describe the actions of the characters
  - Response: The Mario and Luigi characters are holding their arms out, as if they are waving. Yoshi is standing on its own, with its arms folded.
- Example 3
  - Prompt: What's funny about this image?
  - Response: The image is quite humorous as it depicts a man ironing clothes on the back of a yellow taxi cab. This is not a typical sight you'd expect to see in everyday life.
Documentation
Training
The training code has been released:
https://github.com/tensoic/Cerule
Inference
Clone the following repository and follow its instructions for CLI-based inference:
https://github.com/tensoic/Cerule
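As a minimal sketch of the cloning step only, the snippet below wraps `git clone` in Python. The helper names `build_clone_command` and `clone_cerule` and the default destination directory are assumptions made for illustration and are not part of the Cerule repository. The actual inference command is documented in the repository and is not reproduced here.

```python
import subprocess
from pathlib import Path

# Repository URL taken from this README.
REPO_URL = "https://github.com/tensoic/Cerule"


def build_clone_command(dest="Cerule"):
    """Return the git command that clones the Cerule repository into `dest`."""
    return ["git", "clone", REPO_URL, str(dest)]


def clone_cerule(dest="Cerule"):
    """Clone the repository unless `dest` already exists (hypothetical helper).

    Requires git on PATH and network access when a clone is actually needed.
    """
    dest = Path(dest)
    if dest.exists():
        # Nothing to do: reuse the existing path.
        return dest
    subprocess.run(build_clone_command(dest), check=True)
    return dest
```

Calling `clone_cerule()` from a working directory produces a local `Cerule` checkout; from there, the repository's own instructions describe how to start the CLI.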
Technical Details
- Training Data:
  - Pretraining: 650K images (a LAION subset)
  - Finetuning: 695K images (SVIT-mix-665K - Bunny mix modified by BAAI)
- Training Setup: 4x A100 80GB GPUs, ~6 hours for pretraining and ~13 hours for finetuning
- Code Adaptation: Modified and adapted training code from Bunny
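To put the training cost in perspective, the reported wall-clock times can be converted to GPU-hours. This is a rough sketch: it treats the approximate figures above (about 6 and 13 hours on 4 GPUs) as exact and assumes all four GPUs were in use for the full duration of each stage.

```python
# Figures taken from the training setup above; both durations are approximate.
NUM_GPUS = 4
STAGE_HOURS = {"pretraining": 6, "finetuning": 13}


def gpu_hours(wall_clock_hours, num_gpus=NUM_GPUS):
    """Convert wall-clock hours on `num_gpus` devices into GPU-hours."""
    return wall_clock_hours * num_gpus


per_stage = {stage: gpu_hours(hours) for stage, hours in STAGE_HOURS.items()}
total = sum(per_stage.values())

for stage, cost in per_stage.items():
    print(f"{stage}: ~{cost} A100 GPU-hours")
print(f"total: ~{total} A100 GPU-hours")
```

Under these assumptions, pretraining comes to roughly 24 GPU-hours and finetuning to roughly 52, for a total of about 76 A100 GPU-hours.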
License
The model is subject to the Gemma terms of use (the base model license), and the underlying datasets (LAION and SVIT) are subject to their respective licenses. All code is licensed under Apache 2.0.
Acknowledgments
We sincerely thank the amazing teams at Google, LLaVA, and BAAI, without whom this project would not have been possible!