SAIL-7B Open Source Multimodal Large Model - Seamlessly Integrates Visual and Language Processing Capabilities

SAIL 7B

Developed by ByteDance-Seed

SAIL is a single Transformer model specifically designed for vision and language, serving as a unified Multimodal Large Language Model (MLLM) that seamlessly integrates raw pixel encoding and language decoding within a single architecture.

Image-to-Text

Transformers

Open Source License: Apache-2.0 #Single Transformer Architecture #Multimodal Large Language Model #Native Visual Encoding

Release Time: 5/7/2025

Model Overview

SAIL is a multimodal large language model that does not rely on a pre-trained visual encoder, yet performs competitively across a wide range of vision-language tasks. Its visual representations are comparable to those of state-of-the-art vision models on tasks such as semantic segmentation.

Model Features

Single Transformer Architecture

Seamlessly integrates raw pixel encoding and language decoding within a single architecture, eliminating the need for pre-trained visual encoders.

Powerful Visual Representation Capabilities

Demonstrates outstanding performance across a wide range of vision-language tasks, comparable to state-of-the-art vision models in tasks such as semantic segmentation.

Multimodal Capabilities

Processes visual and linguistic information jointly, which makes it suitable for complex multimodal tasks.
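
To make "raw pixel encoding" concrete, the following is a minimal sketch of cutting an image into a grid of flattened pixel patches, which is the kind of input a single Transformer can consume without a separate vision encoder. It is an illustration only: `patchify`, the toy image, and the patch size are assumptions made for this example and are not SAIL's actual preprocessing code.

```python
from typing import List

Image = List[List[List[float]]]  # height x width x channels


def patchify(image: Image, patch_size: int) -> List[List[List[float]]]:
    """Split an H x W x C image into a grid of flattened patches.

    The result is indexed as [patch_row][patch_col] and each entry is a flat
    pixel vector of length patch_size * patch_size * C.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    grid = []
    for top in range(0, height, patch_size):
        row = []
        for left in range(0, width, patch_size):
            # Walk the patch pixel by pixel and concatenate the channel values.
            flat = [
                value
                for y in range(top, top + patch_size)
                for x in range(left, left + patch_size)
                for value in image[y][x]
            ]
            row.append(flat)
        grid.append(row)
    return grid


# Toy 4 x 4 image with 3 channels; the pixel at (y, x) holds [y, x, 0].
toy = [[[float(y), float(x), 0.0] for x in range(4)] for y in range(4)]
patches = patchify(toy, patch_size=2)
nh, nw = len(patches), len(patches[0])  # 2 x 2 patch grid
```

Each of the `nh * nw` flat vectors would then be projected into the model's embedding space and placed alongside the text tokens in one sequence.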

Model Capabilities

Vision-Language Understanding

Image-Text Generation

Multimodal Reasoning

Use Cases

Vision-Language Tasks

Image Caption Generation

Generates detailed textual descriptions based on input images.

Visual Question Answering

Answers complex questions about image content.

Semantic Segmentation

Image Semantic Segmentation

Performs semantic labeling of different parts within an image.

Performance is comparable to state-of-the-art vision models.

🚀 SAIL

SAIL is a Single Transformer model for vision and language. It's a unified multimodal large language model that integrates raw pixel encoding and language decoding in one architecture, achieving good performance in various vision-language tasks without pre-trained vision encoders.

🚀 Quick Start

The following example shows how to run SAIL on a single image and prompt.

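import copy

import torch
from transformers import DynamicCache, GenerationConfig

# Helper functions used below (get_transformer_and_tokenizer,
# convert_image_base64_to_patches, ...) come from the repository's example module.
from example import *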

NON_VISION_TOKEN_ID = -1
PATH_TO_MODEL = "path to model"
PATH_TO_TOKENIZER = "path to tokenizer"
IMAGE_PATH = "path to image"
PROMPT = "content of prompt"

model, tokenizer = get_transformer_and_tokenizer(
    PATH_TO_MODEL,
    PATH_TO_TOKENIZER
)
model = model.cuda()

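# Split the image into raw pixel patches of size model.config.vision_patch_size.
image_processor = lambda x: convert_image_base64_to_patches(load_image_to_base64(x), model.config.vision_patch_size, fix_res_size=None)
prompt_inp = tokenizer.bos_token + '[INST] {} [/INST]'.format(PROMPT)
image_path = IMAGE_PATH
image_patches = image_processor(image_path)
# nh and nw are the first two dimensions of the patch grid.
nh, nw = image_patches.shape[:2]
# Build the placeholder token string that stands in for the image in the text sequence.
image_tokens, image_tokens_len = prepare_image_textual_seq_norowsep(nh, nw, tokenizer, add_cls=False)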

input_tokens = image_tokens + prompt_inp
input_ids = tokenizer(input_tokens, add_special_tokens=False, return_tensors="pt").input_ids
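# Start with every position marked as non-vision (-1).
vision_patch_indices = torch.full_like(input_ids, fill_value=NON_VISION_TOKEN_ID)
# Flatten the patch grid to shape (nh * nw, values_per_patch).
vision_patches = image_patches.view(nh * nw, -1)
# Sanity checks: one placeholder token per patch, and an image prefix of the expected length.
assert (input_ids == tokenizer.vis_patch_tok_id).sum() == vision_patches.size(0)
assert (input_ids >= tokenizer.vis_beg_tok_id).sum() == image_tokens_len

# Point each patch placeholder at its row in vision_patches.
vision_patch_indices[input_ids==tokenizer.vis_patch_tok_id] = torch.arange(vision_patches.size(0))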
attention_mask = create_single_prefix_mask(image_tokens_len, input_ids.size(-1)).unsqueeze(0).unsqueeze(0)
position_ids = generate_mm_pos_ids_singleit(input_ids.squeeze(0).numpy().tolist(), tokenizer.vis_patch_tok_id, nh, nw).unsqueeze(1)

input_ids = input_ids.long().cuda()
vision_patch_indices = vision_patch_indices.long().cuda()
vision_patches = vision_patches.to(torch.bfloat16).cuda()
position_ids = position_ids.long().cuda()
attention_mask = attention_mask.cuda()

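# generate() receives this all-ones 2D mask; the 4D attention_mask above is only used for the prefix cache.
padding_attention_mask = torch.ones_like(input_ids).cuda()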

inputs = dict(
    input_ids = input_ids,
    position_ids = position_ids,
    attention_mask = padding_attention_mask,
    vision_patches = vision_patches,
    vision_patch_indices = vision_patch_indices,
    use_cache=True
)

cached_inputs = dict(
    input_ids = input_ids[:, :image_tokens_len],
    position_ids = position_ids[:, :, :image_tokens_len],
    attention_mask = attention_mask[:,:, :image_tokens_len, :image_tokens_len],
    vision_patches = vision_patches,
    vision_patch_indices = vision_patch_indices[:, :image_tokens_len],
    use_cache=True
)

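# Run the image prefix once to pre-fill the key/value cache.
prefix_cache = DynamicCache()
with torch.no_grad():
    prefix_cache = model.forward(**cached_inputs, past_key_values=prefix_cache).past_key_values

# Generate from a copy so that prefix_cache itself is not modified.
past_key_values = copy.deepcopy(prefix_cache)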
generate_config = GenerationConfig(
    max_new_tokens=1024,
    return_dict_in_generate=True,
    output_attentions=False
)
generated = model.generate(
    **inputs,
    past_key_values=past_key_values,
    generation_config=generate_config
)
# Keep only the newly generated tokens (drop the prompt portion).
generated_ids = generated['sequences'][:, input_ids.size(1):]
response = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(f"\nModel Response: ===\n{response}\n===")
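
The `vision_patch_indices` tensor in the example above maps each position in `input_ids` either to a row of `vision_patches` or to the sentinel `NON_VISION_TOKEN_ID`. The sketch below reproduces that bookkeeping with plain Python lists so the mapping is easy to inspect. The function name `build_vision_patch_indices` and the token ids 900 and 901 are hypothetical values chosen for illustration; the real ids come from the tokenizer.

```python
from typing import List

NON_VISION_TOKEN_ID = -1  # same sentinel value as in the Quick Start


def build_vision_patch_indices(input_ids: List[int], patch_token_id: int) -> List[int]:
    """Assign index k to the k-th patch placeholder and the sentinel to every other token."""
    indices = []
    next_patch = 0
    for token_id in input_ids:
        if token_id == patch_token_id:
            indices.append(next_patch)
            next_patch += 1
        else:
            indices.append(NON_VISION_TOKEN_ID)
    return indices


# Hypothetical ids: 900 = image-begin marker, 901 = patch placeholder, the rest are text tokens.
ids = [900, 901, 901, 901, 901, 5, 17, 42]
indices = build_vision_patch_indices(ids, patch_token_id=901)
print(indices)  # [-1, 0, 1, 2, 3, -1, -1, -1]
```

With this mapping the model can replace each placeholder embedding with the embedding of the corresponding pixel patch, while text positions are left untouched.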

✨ Features

SAIL is a unified multimodal large language model (MLLM). It seamlessly integrates raw pixel encoding and language decoding within a single architecture. Without relying on pre-trained vision encoders, it achieves competitive performance across a wide range of vision-language tasks and demonstrates strong visual representation, rivaling state-of-the-art vision models in tasks like semantic segmentation.
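
One common way for a single Transformer to handle both an image and text is a prefix attention mask: positions in the image prefix attend to each other in both directions, while later text positions attend causally. The Quick Start's call to `create_single_prefix_mask(image_tokens_len, ...)` suggests something along these lines, but that helper's implementation is not shown here, so the sketch below is an assumption about the general technique and not the repository's code. The name `sketch_prefix_mask` is hypothetical.

```python
from typing import List


def sketch_prefix_mask(prefix_len: int, seq_len: int) -> List[List[bool]]:
    """Return mask[q][k], which is True when query position q may attend to key position k."""
    if not 0 <= prefix_len <= seq_len:
        raise ValueError("prefix_len must be between 0 and seq_len")
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            causal = k <= q  # ordinary left-to-right attention
            both_in_prefix = q < prefix_len and k < prefix_len  # full attention inside the prefix
            row.append(causal or both_in_prefix)
        mask.append(row)
    return mask


# A 3-token image prefix followed by 2 text tokens.
mask = sketch_prefix_mask(prefix_len=3, seq_len=5)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xxx..
# xxx..
# xxx..
# xxxx.
# xxxxx
```

The top-left 3 x 3 block is fully connected, and the remaining rows follow the usual causal pattern, so text tokens see the whole image but never see future text.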

📦 Model

Property	Details
Model Type	SAIL-7B
HF Link	🤗 link

📄 License

This project is licensed under the Apache-2.0 license.

📚 Documentation

Citation

If you find this project useful in your research, please consider citing:

@article{lei2025sail,
  title={The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer},
  author={Lei, Weixian and Wang, Jiacong and Wang, Haochen and Li, Xiangtai and Liew, Jun Hao and Feng, Jiashi and Huang, Zilong},
  journal={arXiv preprint arXiv:2504.10462},
  year={2025}
}
