🚀 Janus-Pro
Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation, offering high flexibility and effectiveness.
🚀 Quick Start
Janus-Pro addresses the limitations of previous approaches by decoupling visual encoding into separate pathways while still using a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder's roles in understanding and generation, but also makes the framework more flexible. Janus-Pro surpasses previous unified models and matches or exceeds the performance of task-specific models. Its simplicity, flexibility, and effectiveness make it a strong candidate for next-generation unified multimodal models.
GitHub Repository
✨ Features
- Unified Framework: Janus-Pro unifies multimodal understanding and generation in a single autoregressive framework.
- Decoupled Visual Encoding: Splits visual encoding into separate pathways for understanding and generation, which adds flexibility and reduces the conflict between the two roles.
- High Performance: Surpasses previous unified models and matches or exceeds task-specific models.
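The decoupling idea can be illustrated with a toy sketch. This is not the Janus-Pro implementation: every function name below is hypothetical, the "encoders" are trivial stand-ins, and the "backbone" only counts tokens. It shows only the routing, where the task selects a visual pathway and a single shared model consumes the resulting sequence.

```python
from typing import Callable, Dict, List


def understanding_pathway(image: List[int]) -> List[str]:
    # Hypothetical stand-in for a semantic vision encoder.
    return [f"sem:{px}" for px in image]


def generation_pathway(image: List[int]) -> List[str]:
    # Hypothetical stand-in for a discrete image tokenizer.
    return [f"vq:{px // 16}" for px in image]


# One pathway per task; both feed the same backbone.
PATHWAYS: Dict[str, Callable[[List[int]], List[str]]] = {
    "understanding": understanding_pathway,
    "generation": generation_pathway,
}


def shared_backbone(sequence: List[str]) -> int:
    # Stand-in for the single autoregressive transformer: it receives one
    # token sequence regardless of which pathway produced the visual tokens.
    return len(sequence)


def run(task: str, text_tokens: List[str], image: List[int]) -> int:
    visual_tokens = PATHWAYS[task](image)
    return shared_backbone(text_tokens + visual_tokens)


print(run("understanding", ["what", "is", "this"], [10, 20]))
print(run("generation", ["a", "dog"], [32, 64]))
```

Swapping or retraining one pathway in this sketch leaves the other pathway and the backbone untouched, which is the flexibility the feature list refers to.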
📚 Documentation
Model Summary
Janus-Pro is a unified multimodal large language model (MLLM) for understanding and generation that decouples visual encoding for the two tasks. It is built on DeepSeek-LLM-1.5b-base/DeepSeek-LLM-7b-base.
For multimodal understanding, it uses SigLIP-L as the vision encoder, which supports 384 x 384 image input. For image generation, Janus-Pro uses the tokenizer from here with a downsample rate of 16.
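To make the downsample rate concrete, the short calculation below estimates how many discrete tokens represent one image on the tokenizer's grid. It assumes the generated image is also 384 x 384 and that each grid cell maps to exactly one token; neither assumption is stated in this card, and the helper name is illustrative.

```python
def image_token_count(height: int, width: int, downsample: int = 16) -> int:
    """Number of cells on the tokenizer grid for an image of the given size."""
    if height % downsample or width % downsample:
        raise ValueError("image sides must be divisible by the downsample rate")
    return (height // downsample) * (width // downsample)


# 384 / 16 = 24 cells per side, so 24 * 24 cells in total.
print(image_token_count(384, 384))  # 576
```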
| Property | Details |
|---|---|
| Model Type | Unified multimodal understanding and generation MLLM |
| Training Data | Not specified |
💻 Usage Examples
Basic Usage
Single Image Inference
Here is an example of visual understanding with a single image.
import torch
from transformers import JanusForConditionalGeneration, JanusProcessor

model_id = "deepseek-community/Janus-Pro-1B"

# One user turn containing an image (referenced by URL) and a text question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "What do you see in this image?"},
        ],
    },
]

processor = JanusProcessor.from_pretrained(model_id)
model = JanusForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Apply the chat template and tokenize in a single call.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    generation_mode="text",
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

# Sample up to 40 new tokens in text mode, then decode the first sequence.
output = model.generate(**inputs, max_new_tokens=40, generation_mode="text", do_sample=True)
text = processor.decode(output[0], skip_special_tokens=True)
print(text)
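If the decoded text begins with the prompt, as is typical when a decoder-only model returns the full sequence, the prompt tokens can be sliced off before decoding. The sketch below shows the slicing on plain Python lists with made-up token ids; the helper name is hypothetical. With the tensors from the example above, the equivalent would be `output[:, inputs["input_ids"].shape[1]:]`, assuming the output does echo the prompt.

```python
from typing import List


def strip_prompt(sequences: List[List[int]], prompt_len: int) -> List[List[int]]:
    """Drop the first prompt_len ids from each generated sequence."""
    return [seq[prompt_len:] for seq in sequences]


prompt_ids = [101, 7, 8, 9]            # made-up prompt token ids
generated = [prompt_ids + [42, 43, 2]]  # prompt followed by new tokens

print(strip_prompt(generated, len(prompt_ids)))  # [[42, 43, 2]]
```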
Advanced Usage
Text to Image generation
Janus-Pro can also generate images from text prompts by setting the generation mode to "image", as shown below.
import torch
from transformers import JanusForConditionalGeneration, JanusProcessor

model_id = "deepseek-community/Janus-Pro-1B"

processor = JanusProcessor.from_pretrained(model_id)
model = JanusForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "A dog running under the rain."},
        ],
    },
]

# Render the chat template to a prompt string, then tokenize it in image mode.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=prompt,
    generation_mode="image",
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

# Request two images for the same prompt.
model.generation_config.num_return_sequences = 2
outputs = model.generate(
    **inputs,
    generation_mode="image",
    do_sample=True,
    use_cache=True,
)

# Decode the generated image tokens to pixels and convert them to PIL images.
decoded_image = model.decode_image_tokens(outputs)
images = processor.postprocess(list(decoded_image.float()), return_tensors="PIL.Image.Image")

# Save each generated image as image0.png, image1.png, ...
for i, image in enumerate(images["pixel_values"]):
    image.save(f"image{i}.png")
📄 License
This code repository is licensed under the MIT License. Use of the Janus-Pro models is subject to the DeepSeek Model License.
🔗 Citation
@article{chen2025janus,
  title={Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling},
  author={Chen, Xiaokang and Wu, Zhiyu and Liu, Xingchao and Pan, Zizheng and Liu, Wen and Xie, Zhenda and Yu, Xingkai and Ruan, Chong},
  journal={arXiv preprint arXiv:2501.17811},
  year={2025}
}
📞 Contact
If you have any questions, please raise an issue or contact us at service@deepseek.com.