🚀 BAGEL - Unified Model for Multimodal Understanding and Generation
BAGEL is an open-source multimodal foundation model with 7B active parameters (14B total), trained on large-scale interleaved multimodal data. It outperforms many top-tier open-source VLMs on multimodal understanding and generation benchmarks, and also performs strongly on image editing.
🚀 Quick Start
This repository hosts the model weights for BAGEL. For installation, usage instructions, and further documentation, please visit our GitHub repository.

✨ Features
INT8 Quantization
This repository provides an INT8 quantization of ByteDance-Seed/BAGEL-7B-MoT.
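The card does not specify the exact quantization scheme used here. As a general illustration only (an assumed sketch, not this repo's actual method), symmetric per-tensor INT8 quantization maps each weight tensor to 8-bit integers plus one float scale:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    # Symmetric per-tensor quantization: map floats into [-127, 127].
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an approximation of the original weights.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Rounding error is bounded by half a quantization step (scale / 2).
err = float(np.abs(w - w_hat).max())
```

Storing `q` (1 byte per weight) instead of FP16/BF16 roughly halves memory at the cost of this bounded rounding error.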
Model Architecture
BAGEL adopts a Mixture-of-Transformer-Experts (MoT) architecture to maximize the model's capacity to learn from richly diverse multimodal information. It utilizes two separate encoders to capture pixel-level and semantic-level features of an image. The overall framework follows a Next Group of Token Prediction paradigm, where the model is trained to predict the next group of language or visual tokens as a compression target.
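The core MoT idea can be shown with a toy sketch (a hypothetical simplification, not BAGEL's actual code): each token is processed by the feed-forward expert for its modality, while in the real model attention is still computed over the joint sequence.

```python
import numpy as np

def mot_ffn_layer(tokens, modality, experts):
    # Toy Mixture-of-Transformer-Experts step: route each token to the
    # expert parameters for its modality (text vs. image). This omits the
    # shared self-attention the real architecture applies across the
    # interleaved sequence.
    out = np.empty_like(tokens)
    for name, ffn in experts.items():
        mask = modality == name
        out[mask] = ffn(tokens[mask])
    return out

d = 8
experts = {
    "text": lambda x: x * 2.0,   # stand-in for the text expert FFN
    "image": lambda x: x * 0.5,  # stand-in for the vision expert FFN
}
tokens = np.ones((6, d), dtype=np.float32)
modality = np.array(["text", "image", "text", "image", "image", "text"])
out = mot_ffn_layer(tokens, modality, experts)
```

Keeping separate expert parameters per modality lets each branch specialize without the interference a single shared FFN would incur.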

Training Strategy
BAGEL scales MoT's capacity through Pre-training, Continued Training, and Supervised Finetuning on trillions of interleaved multimodal tokens spanning language, image, video, and web data. It surpasses open models on standard understanding and generation benchmarks and demonstrates advanced in-context multimodal abilities like free-form image editing, future frame prediction, 3D manipulation, world navigation, and sequential reasoning.
Emerging Properties
As we scale up BAGEL's pretraining with more multimodal tokens, we observe consistent performance gains across understanding, generation, and editing tasks. Different capabilities emerge at distinct training stages: multimodal understanding and generation appear early, followed by basic editing, while complex, intelligent editing emerges later. This staged progression suggests an emergent pattern in which advanced multimodal reasoning builds on well-formed foundational skills. Ablation studies further show that combining VAE and ViT features significantly improves intelligent editing, underscoring the importance of visual-semantic context in enabling complex multimodal reasoning and supporting its role in the emergence of advanced capabilities.

📚 Documentation
Benchmarks
1. Visual Understanding
| Model | MME | MMBench (%) | MMMU (%) | MM-Vet (%) | MathVista (%) |
| --- | --- | --- | --- | --- | --- |
| Janus-Pro-7B | - | 79.2 | 41.0 | 50.0 | - |
| Qwen2.5-VL-7B | 2347 | 83.5 | 58.6 | 67.1 | 68.2 |
| BAGEL | 2388 | 85.0 | 55.3 | 67.2 | 73.1 |
2. Text-to-Image Generation (GenEval)

| Model | Overall |
| --- | --- |
| FLUX-1-dev | 0.82 |
| SD3-Medium | 0.74 |
| Janus-Pro-7B | 0.80 |
| BAGEL | 0.88 |
3. Image Editing
| Model | GEdit-Bench-EN (SC) | GEdit-Bench-EN (PQ) | GEdit-Bench-EN (O) | IntelligentBench (%) |
| --- | --- | --- | --- | --- |
| Step1X-Edit | 7.09 | 6.76 | 6.70 | 14.9 |
| Gemini-2-exp. | 6.73 | 6.61 | 6.32 | 57.6 |
| BAGEL | 7.36 | 6.83 | 6.52 | 44.0 |
| BAGEL+CoT | - | - | - | 55.3 |
📄 License
BAGEL is licensed under the Apache 2.0 license. It is finetuned from [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) and [siglip-so400m-14-384-flash-attn2](https://huggingface.co/HuggingFaceM4/siglip-so400m-14-384-flash-attn2), and uses the [FLUX.1-schnell VAE model](https://huggingface.co/black-forest-labs/FLUX.1-schnell), all under Apache 2.0.
📚 Citation
```bibtex
@article{deng2025bagel,
  title   = {Emerging Properties in Unified Multimodal Pretraining},
  author  = {Deng, Chaorui and Zhu, Deyao and Li, Kunchang and Gou, Chenhui and Li, Feng and Wang, Zeyu and Zhong, Shu and Yu, Weihao and Nie, Xiaonan and Song, Ziang and Shi, Guang and Fan, Haoqi},
  journal = {arXiv preprint arXiv:2505.14683},
  year    = {2025}
}
```
Information Table
| Property | Details |
| --- | --- |
| Model Type | INT8 quant of ByteDance-Seed/BAGEL-7B-MoT |
| Base Model | ByteDance-Seed/BAGEL-7B-MoT |
| Base Model Relation | quantized |
| Pipeline Tag | any-to-any |
| Library Name | bagel-mot |
| Tags | quantized, bagel, mot, int8 |