🚀 BAGEL - Unified Model for Multimodal Understanding and Generation
BAGEL is an open-source multimodal foundation model with 7B active parameters (14B total), trained on large-scale interleaved multimodal data. It outperforms many top-tier open-source VLMs on multimodal understanding and generation benchmarks, and also performs strongly on image editing.
🚀 Quick Start
This repository hosts the model weights for BAGEL. For installation, usage instructions, and further documentation, please visit our GitHub repository.

✨ Features
INT8 Quantization
This repository provides an INT8 quantization of ByteDance-Seed/BAGEL-7B-MoT.
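The card does not specify the exact quantization scheme used here. As a general illustration only (an assumed sketch, not this repo's actual method), symmetric per-tensor INT8 quantization maps each weight tensor to 8-bit integers plus one float scale:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    # Symmetric per-tensor quantization: map floats into [-127, 127].
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an approximation of the original weights.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Rounding error is bounded by half a quantization step (scale / 2).
err = float(np.abs(w - w_hat).max())
```

Storing `q` (1 byte per weight) instead of FP16/BF16 roughly halves memory at the cost of this bounded rounding error.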
Model Architecture
BAGEL adopts a Mixture-of-Transformer-Experts (MoT) architecture to maximize the model's capacity to learn from richly diverse multimodal information. It utilizes two separate encoders to capture pixel-level and semantic-level features of an image. The overall framework follows a Next Group of Token Prediction paradigm, where the model is trained to predict the next group of language or visual tokens as a compression target.
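The core MoT idea can be shown with a toy sketch (a hypothetical simplification, not BAGEL's actual code): each token is processed by the feed-forward expert for its modality, while in the real model attention is still computed over the joint sequence.

```python
import numpy as np

def mot_ffn_layer(tokens, modality, experts):
    # Toy Mixture-of-Transformer-Experts step: route each token to the
    # expert parameters for its modality (text vs. image). This omits the
    # shared self-attention the real architecture applies across the
    # interleaved sequence.
    out = np.empty_like(tokens)
    for name, ffn in experts.items():
        mask = modality == name
        out[mask] = ffn(tokens[mask])
    return out

d = 8
experts = {
    "text": lambda x: x * 2.0,   # stand-in for the text expert FFN
    "image": lambda x: x * 0.5,  # stand-in for the vision expert FFN
}
tokens = np.ones((6, d), dtype=np.float32)
modality = np.array(["text", "image", "text", "image", "image", "text"])
out = mot_ffn_layer(tokens, modality, experts)
```

Keeping separate expert parameters per modality lets each branch specialize without the interference a single shared FFN would incur.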

Training Strategy
BAGEL scales MoT's capacity through Pre-training, Continued Training, and Supervised Finetuning on trillions of interleaved multimodal tokens spanning language, image, video, and web data. It surpasses open models on standard understanding and generation benchmarks and demonstrates advanced in-context multimodal abilities like free-form image editing, future frame prediction, 3D manipulation, world navigation, and sequential reasoning.
Emerging Properties
As we scale up BAGEL's pretraining with more multimodal tokens, we observe consistent performance gains across understanding, generation, and editing tasks. Different capabilities emerge at distinct training stages: multimodal understanding and generation appear early, followed by basic editing, while complex, intelligent editing emerges later. This staged progression suggests an emergent pattern in which advanced multimodal reasoning builds on well-formed foundational skills. Ablation studies further show that combining VAE and ViT features significantly improves intelligent editing, underscoring the importance of visual-semantic context in enabling complex multimodal reasoning and supporting its role in the emergence of advanced capabilities.

📚 Documentation
Benchmarks
1. Visual Understanding
| Model | MME | MMBench (%) | MMMU (%) | MM-Vet (%) | MathVista (%) |
| --- | --- | --- | --- | --- | --- |
| Janus-Pro-7B | - | 79.2 | 41.0 | 50.0 | - |
| Qwen2.5-VL-7B | 2347 | 83.5 | 58.6 | 67.1 | 68.2 |
| BAGEL | 2388 | 85.0 | 55.3 | 67.2 | 73.1 |
2. Text-to-Image Generation (GenEval)

| Model | Overall |
| --- | --- |
| FLUX-1-dev | 0.82 |
| SD3-Medium | 0.74 |
| Janus-Pro-7B | 0.80 |
| BAGEL | 0.88 |
3. Image Editing
| Model | GEdit-Bench-EN (SC) | GEdit-Bench-EN (PQ) | GEdit-Bench-EN (O) | IntelligentBench (%) |
| --- | --- | --- | --- | --- |
| Step1X-Edit | 7.09 | 6.76 | 6.70 | 14.9 |
| Gemini-2-exp. | 6.73 | 6.61 | 6.32 | 57.6 |
| BAGEL | 7.36 | 6.83 | 6.52 | 44.0 |
| BAGEL+CoT | - | - | - | 55.3 |
📄 License
BAGEL is licensed under the Apache 2.0 license. It is finetuned from [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) and [siglip-so400m-14-384-flash-attn2](https://huggingface.co/HuggingFaceM4/siglip-so400m-14-384-flash-attn2), and uses the [FLUX.1-schnell VAE model](https://huggingface.co/black-forest-labs/FLUX.1-schnell), all under Apache 2.0.
📚 Citation
```bibtex
@article{deng2025bagel,
  title   = {Emerging Properties in Unified Multimodal Pretraining},
  author  = {Deng, Chaorui and Zhu, Deyao and Li, Kunchang and Gou, Chenhui and Li, Feng and Wang, Zeyu and Zhong, Shu and Yu, Weihao and Nie, Xiaonan and Song, Ziang and Shi, Guang and Fan, Haoqi},
  journal = {arXiv preprint arXiv:2505.14683},
  year    = {2025}
}
```
Information Table
| Property | Details |
| --- | --- |
| Model Type | INT8 quant of ByteDance-Seed/BAGEL-7B-MoT |
| Base Model | ByteDance-Seed/BAGEL-7B-MoT |
| Base Model Relation | quantized |
| Pipeline Tag | any-to-any |
| Library Name | bagel-mot |
| Tags | quantized, bagel, mot, int8 |