Janus-Pro-1B Open-source Multimodal AI Model - Unified Understanding and Generation, Enhancing Usage Flexibility

Janus Pro 1B

Developed by deepseek-community

Janus-Pro is a novel autoregressive framework that unifies multi-modal understanding and generation tasks and enhances flexibility by decoupling visual encoding.

Text-to-Image

Transformers

Open Source License: MIT

Tags: Multi-modal unified framework, Visual encoding decoupling, Bidirectional image-text generation

Downloads 4,636

Release Time: 3/1/2025

Model Overview

Janus-Pro is a unified multi-modal understanding and generation model that addresses the limitations of previous methods by decoupling visual encoding, and its performance is comparable to or even better than that of task-specific models.

Model Features

Unify multi-modal understanding and generation

Unifies multi-modal understanding and generation tasks within a single framework, addressing the limitations of previous methods.

Decouple visual encoding

Alleviates the role conflict of the visual encoder in understanding and generation tasks by decoupling visual encoding, enhancing the flexibility of the framework.

High performance

The performance is comparable to or even better than that of task-specific models, surpassing previous unified models.

Model Capabilities

Multi-modal understanding

Image generation

Text generation

Use Cases

Visual understanding

Image content description

Generate descriptive text based on the input image

Can accurately describe the image content

Image generation

Text-to-image generation

Generate an image based on a text prompt

Generate an image that matches the text description

🚀 Janus-Pro

Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation, offering high flexibility and effectiveness.

🚀 Quick Start

Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It addresses the limitations of previous approaches by decoupling visual encoding into separate pathways while still using a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder's roles in understanding and generation, but also enhances the framework's flexibility. Janus-Pro surpasses previous unified models and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus-Pro make it a strong candidate for next-generation unified multimodal models.

GitHub Repository

✨ Features

Unified Framework: Janus-Pro unifies multimodal understanding and generation in a single autoregressive framework.
Decoupled Visual Encoding: Decouples visual encoding into separate pathways, enhancing flexibility and reducing conflicts.
High Performance: Surpasses previous unified models and matches or exceeds the performance of task-specific models.
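
The decoupling idea in the list above can be illustrated with a toy routing sketch: each task gets its own visual pathway, and both feed one shared backbone. This is not Janus-Pro code. The class name `DecoupledModel`, the stub pathways, and the string-returning backbone are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Features = List[float]


@dataclass
class DecoupledModel:
    """Toy model: one visual pathway per task, one shared backbone."""

    pathways: Dict[str, Callable[[str], Features]]
    backbone: Callable[[Features], str]

    def run(self, image: str, task: str) -> str:
        if task not in self.pathways:
            raise ValueError(f"unknown task: {task!r}")
        features = self.pathways[task](image)  # task-specific visual encoding
        return self.backbone(features)         # shared processing for every task


# Stub components (hypothetical stand-ins, not real encoders)
model = DecoupledModel(
    pathways={
        "understanding": lambda image: [1.0, 2.0, 3.0],
        "generation": lambda image: [0.0, 1.0],
    },
    backbone=lambda features: f"backbone saw {len(features)} features",
)

result_understanding = model.run("cat.png", "understanding")
result_generation = model.run("cat.png", "generation")
print(result_understanding)
print(result_generation)
```

Because each pathway is a separate entry in `pathways`, one can be swapped without touching the other, which is the kind of flexibility the feature list describes.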

📚 Documentation

Model Summary

Janus-Pro is a unified understanding and generation MLLM that decouples visual encoding for multimodal understanding and generation. Janus-Pro is built on DeepSeek-LLM-1.5b-base or DeepSeek-LLM-7b-base, depending on the model size.

For multimodal understanding, it uses SigLIP-L as the vision encoder, which supports 384 x 384 image input. For image generation, Janus-Pro uses the tokenizer from here with a downsample rate of 16.
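
As a worked example of what a downsample rate of 16 implies, the sketch below computes the size of the token grid for an image. It assumes the generated image is also 384 x 384 (the text above states this resolution only for the understanding input) and that each grid cell corresponds to one token. The function name `image_token_grid` is hypothetical.

```python
def image_token_grid(height: int, width: int, downsample_rate: int = 16):
    """Return (rows, cols, total_tokens) for an image tokenized on a grid.

    Assumes one token per downsample_rate x downsample_rate patch.
    """
    if height % downsample_rate or width % downsample_rate:
        raise ValueError("image size must be a multiple of the downsample rate")
    rows = height // downsample_rate
    cols = width // downsample_rate
    return rows, cols, rows * cols


rows, cols, total = image_token_grid(384, 384)
print(rows, cols, total)  # 24 24 576
```

Under these assumptions, each image corresponds to a 24 x 24 grid, or 576 tokens.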

Property	Details
Model Type	Unified multimodal understanding and generation MLLM
Training Data	Not specified

💻 Usage Examples

Basic Usage

Single Image Inference

Here is an example of visual understanding with a single image.

import torch
from transformers import JanusForConditionalGeneration, JanusProcessor

model_id = "deepseek-community/Janus-Pro-1B"

# Build a chat-style message with one image (loaded from its URL) and one question
messages = [
    {
        "role": "user",
        "content": [
            {'type': 'image', 'url': 'http://images.cocodataset.org/val2017/000000039769.jpg'},
            {'type': 'text', 'text': "What do you see in this image?"}
        ]
    },
]

# Load processor and model
processor = JanusProcessor.from_pretrained(model_id)
model = JanusForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Set generation mode to 'text' to perform text generation
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    generation_mode="text",
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

# Sample up to 40 new tokens, then decode the first sequence to text
output = model.generate(**inputs, max_new_tokens=40, generation_mode='text', do_sample=True)
text = processor.decode(output[0], skip_special_tokens=True)
print(text)
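
The decoded text in the example above may include the prompt as well as the answer, assuming (as is usual for decoder-only models in Transformers) that `generate` returns the prompt tokens followed by the new tokens. A minimal sketch of dropping the prompt before decoding is shown here on toy token ids; the helper name `strip_prompt` is hypothetical.

```python
def strip_prompt(sequences, prompt_length):
    """Keep only the tokens that follow the first `prompt_length` positions.

    Works on any sliceable rows, such as lists of token ids.
    """
    if prompt_length < 0:
        raise ValueError("prompt_length must be non-negative")
    return [seq[prompt_length:] for seq in sequences]


# Toy token ids standing in for the output of `model.generate`:
# the first four ids play the role of the prompt.
sequences = [[101, 7, 8, 9, 42, 43], [101, 7, 8, 9, 50, 51]]
new_tokens = strip_prompt(sequences, prompt_length=4)
print(new_tokens)  # [[42, 43], [50, 51]]
```

With real tensors, the equivalent slice would be `output[:, inputs["input_ids"].shape[1]:]`, again under the assumption that the prompt occupies the leading positions of each returned sequence.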

Advanced Usage

Text to Image generation

Janus-Pro can also generate images from text prompts by setting the generation mode to "image", as shown below.

import torch
from transformers import JanusForConditionalGeneration, JanusProcessor

model_id = "deepseek-community/Janus-Pro-1B"

# Load processor and model
processor = JanusProcessor.from_pretrained(model_id)
model = JanusForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "A dog running under the rain."}
        ]
    }
]

# Apply chat template
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=prompt,
    generation_mode="image",
    return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

# Set number of images to generate
model.generation_config.num_return_sequences = 2

outputs = model.generate(
    **inputs,
    generation_mode="image",
    do_sample=True,
    use_cache=True
)

# Decode and save images
decoded_image = model.decode_image_tokens(outputs)
images = processor.postprocess(list(decoded_image.float()), return_tensors="PIL.Image.Image")

for i, image in enumerate(images["pixel_values"]):
    image.save(f"image{i}.png")

📄 License

This code repository is licensed under the MIT License. The use of Janus-Pro models is subject to the DeepSeek Model License.

🔗 Citation

@article{chen2025janus,
  title={Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling},
  author={Chen, Xiaokang and Wu, Zhiyu and Liu, Xingchao and Pan, Zizheng and Liu, Wen and Xie, Zhenda and Yu, Xingkai and Ruan, Chong},
  journal={arXiv preprint arXiv:2501.17811},
  year={2025}
}

📞 Contact

If you have any questions, please raise an issue or contact us at service@deepseek.com.
