Janus-Pro-7B Open-Source Model - Unified Multimodal Understanding and Generation, Resolving Visual Encoding Conflicts

Janus Pro 7B

Developed by deepseek-ai

Janus-Pro is an innovative autoregressive framework that unifies multimodal understanding and generation capabilities. By decoupling visual encoding paths and employing a single Transformer architecture, it resolves conflicts in the roles of visual encoders between understanding and generation.

Text-to-Image

Transformers

Open Source License: MIT #Multimodal Unified Model #Autoregressive Image Generation #Decoupled Visual Encoding

Downloads 139.64k

Release Time: 1/26/2025

Model Overview

Janus-Pro is a unified multimodal large language model (MLLM) that handles both understanding and generation by decoupling visual encoding into separate pathways. Its performance matches or surpasses that of task-specific models, while remaining flexible and efficient.

Model Features

Decoupled Visual Encoding

Decouples visual encoding into independent paths, alleviating conflicts in the roles of visual encoders between understanding and generation, enhancing framework flexibility.

Unified Architecture

Employs a single unified Transformer architecture for multimodal understanding and generation, simplifying the model structure.

High Performance

Performance matches or surpasses specialized task models, making it a strong candidate for next-generation unified multimodal models.

Model Capabilities

Multimodal Understanding

Text-to-Image Generation

Image Analysis

Use Cases

Multimodal Applications

Image Generation

Generates high-quality images that align with the input text descriptions.

Multimodal Understanding

Understands joint inputs of images and text for complex multimodal reasoning.

Excels in multimodal tasks.

🚀 Janus-Pro

Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation, offering high flexibility and effectiveness.

🚀 Quick Start

Please refer to the GitHub Repository.

✨ Features

Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It addresses the limitations of previous approaches by decoupling visual encoding into separate pathways, while still utilizing a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder’s roles in understanding and generation, but also enhances the framework’s flexibility.

Janus-Pro surpasses previous unified models and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus-Pro make it a strong candidate for next-generation unified multimodal models.
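The decoupling described above can be pictured with a toy sketch. This is not the Janus-Pro implementation: the function names (`understanding_path`, `generation_path`, `shared_backbone`, `encode`), the 4-dimensional embeddings, and the 8-entry codebook are all hypothetical, chosen only to show two independent visual pathways feeding one shared sequence model.

```python
from typing import List, Sequence

EMBED_DIM = 4  # hypothetical width of the shared backbone's input embeddings


def understanding_path(pixels: Sequence[float]) -> List[List[float]]:
    """Toy stand-in for a semantic vision encoder: continuous features per pixel."""
    return [[p, p * p, 1.0 - p, 1.0] for p in pixels]


def generation_path(pixels: Sequence[float], codebook_size: int = 8) -> List[List[float]]:
    """Toy stand-in for a discrete image tokenizer: quantize each pixel to a
    code ID, then map the ID to a one-hot style embedding."""
    ids = [min(int(p * codebook_size), codebook_size - 1) for p in pixels]
    return [[float(i == (idx % EMBED_DIM)) for i in range(EMBED_DIM)] for idx in ids]


def shared_backbone(sequence: List[List[float]]) -> List[float]:
    """Toy stand-in for the single unified transformer: mean-pools the sequence."""
    n = len(sequence)
    return [sum(vec[d] for vec in sequence) / n for d in range(EMBED_DIM)]


def encode(pixels: Sequence[float], task: str) -> List[float]:
    """Route the image through the pathway that matches the task, then through
    the one shared backbone. Only the visual encoding differs between tasks."""
    if task == "understanding":
        sequence = understanding_path(pixels)
    elif task == "generation":
        sequence = generation_path(pixels)
    else:
        raise ValueError(f"unknown task: {task!r}")
    return shared_backbone(sequence)


if __name__ == "__main__":
    image = [0.0, 1.0]  # a two-pixel "image" with values in [0, 1]
    print(encode(image, "understanding"))
    print(encode(image, "generation"))
```

The point of the sketch is the routing: each task gets an encoder suited to it, so neither pathway has to compromise for the other, while everything downstream is shared.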

GitHub Repository

📚 Documentation

Model Summary

Janus-Pro is a unified understanding and generation MLLM that decouples visual encoding for multimodal understanding and generation. Janus-Pro is built on DeepSeek-LLM-1.5b-base or DeepSeek-LLM-7b-base, depending on the model size.

For multimodal understanding, it uses SigLIP-L as the vision encoder, which supports 384 x 384 image input. For image generation, Janus-Pro uses the tokenizer from here with a downsample rate of 16.
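As a rough illustration of what a downsample rate of 16 implies, the sketch below counts the discrete tokens needed for one image. It assumes the rate applies to each spatial dimension and that generated images are also 384 x 384; the text above states that resolution only for the understanding encoder. The helper name `image_token_count` is hypothetical.

```python
def image_token_count(height: int, width: int, downsample: int = 16) -> int:
    """Number of tokens in the grid left after spatial downsampling.

    Assumes the downsample rate applies independently to height and width.
    """
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsample rate")
    return (height // downsample) * (width // downsample)


# 384 / 16 = 24 tokens per side, so a 24 x 24 grid.
print(image_token_count(384, 384))  # 576
```

Under these assumptions, an autoregressive model would emit 576 image tokens per generated image.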

Property	Details
Pipeline Tag	any-to-any
Library Name	transformers
Tags	muiltimodal, text-to-image, unified-model

📄 License

This code repository is licensed under the MIT License. The use of Janus-Pro models is subject to the DeepSeek Model License.

📚 Citation

@article{chen2025janus,
  title={Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling},
  author={Chen, Xiaokang and Wu, Zhiyu and Liu, Xingchao and Pan, Zizheng and Liu, Wen and Xie, Zhenda and Yu, Xingkai and Ruan, Chong},
  journal={arXiv preprint arXiv:2501.17811},
  year={2025}
}

📞 Contact

If you have any questions, please raise an issue or contact us at service@deepseek.com.
