SAIL-7B: an open-source multimodal large model that unifies vision and language processing

SAIL 7B

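Developed by ByteDance-Seed

SAIL is a single-Transformer model for vision and language. As a unified multimodal large language model (MLLM), it integrates raw-pixel encoding and language decoding within one architecture.

Image-to-Text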

Transformers

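License: Apache-2.0 #single-Transformer architecture #multimodal large language model #native visual encoding

Downloads: 119

Released: 5/7/2025

Model Overview

SAIL is a multimodal large language model that does not rely on a pretrained vision encoder. It performs strongly across a wide range of vision-language tasks, and its visual representations are comparable to those of state-of-the-art vision models on tasks such as semantic segmentation.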

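Model Features

Single-Transformer architecture

Raw-pixel encoding and language decoding are integrated in one architecture, with no dependence on a pretrained vision encoder.

Strong visual representations

Performs strongly across a wide range of vision-language tasks and is comparable to state-of-the-art vision models on tasks such as semantic segmentation.

Multimodal capability

Processes visual and language information together, which makes it suitable for complex multimodal tasks.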

Model Capabilities

Vision-language understanding

Image-to-text generation

Multimodal reasoning

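Use Cases

Vision-language tasks

Image captioning

Generates a detailed text description of an input image.

Visual question answering

Answers complex questions about the content of an image.

Semantic segmentation

Image semantic segmentation

Assigns semantic labels to the different regions of an image.

Performance is comparable to state-of-the-art vision models.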

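🚀 SAIL

SAIL is a single-Transformer model for vision and language: a unified multimodal large language model that performs strongly across a variety of vision-language tasks.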

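✨ Key Features

SAIL is a single-Transformer model for vision and language. As a unified multimodal large language model (MLLM), it integrates raw-pixel encoding and language decoding within one architecture. Without relying on a pretrained vision encoder, SAIL achieves competitive performance across a wide range of vision-language tasks and shows strong visual representation ability, comparable to state-of-the-art vision models on tasks such as semantic segmentation.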
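To make "raw-pixel encoding" concrete, the sketch below splits an image into a grid of flattened pixel patches, the kind of input a single Transformer can consume without a separate vision encoder. It is only an illustration in plain Python: `image_to_patches` is a hypothetical helper, not part of SAIL, and the actual preprocessing in the SAIL code may differ in resizing, normalization, and memory layout.

```python
def image_to_patches(image, patch_size):
    """Split an H x W x C image (nested lists) into a grid of flattened patches.

    Returns a list of rows; each row is a list of patches; each patch is a
    flat list of patch_size * patch_size * C pixel values.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    grid = []
    for top in range(0, height, patch_size):
        row = []
        for left in range(0, width, patch_size):
            # Flatten the patch in row-major order, channels last.
            patch = [
                channel
                for y in range(top, top + patch_size)
                for x in range(left, left + patch_size)
                for channel in image[y][x]
            ]
            row.append(patch)
        grid.append(row)
    return grid


# A toy 4 x 6 RGB image whose pixel values encode their own position.
toy_image = [[[y, x, 0] for x in range(6)] for y in range(4)]
patches = image_to_patches(toy_image, patch_size=2)
nh, nw = len(patches), len(patches[0])
print(nh, nw, len(patches[0][0]))  # 2 3 12
```

Each of the `nh * nw` patches would then stand in for one image token in the input sequence.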

📦 Models

Model name	HF link
SAIL-7B	🤝 Link

🚀 Quick Start

We provide example code for running SAIL.

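import copy

import torch
from transformers import DynamicCache, GenerationConfig

# Helper functions such as get_transformer_and_tokenizer and
# convert_image_base64_to_patches come from the accompanying `example` module.
from example import *

NON_VISION_TOKEN_ID = -1  # marks token positions that are not image patches
PATH_TO_MODEL = "path to model"
PATH_TO_TOKENIZER = "path to tokenizer"
IMAGE_PATH = "path to image"
PROMPT = "content of prompt"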

model, tokenizer = get_transformer_and_tokenizer(
    PATH_TO_MODEL,
    PATH_TO_TOKENIZER
)
model = model.cuda()

# Load the image and split it into an (nh, nw) grid of raw-pixel patches.
image_patches = convert_image_base64_to_patches(
    load_image_to_base64(IMAGE_PATH),
    model.config.vision_patch_size,
    fix_res_size=None,
)
nh, nw = image_patches.shape[:2]

# Build the placeholder token sequence for the image, then the instruction prompt.
image_tokens, image_tokens_len = prepare_image_textual_seq_norowsep(nh, nw, tokenizer, add_cls=False)
prompt_inp = tokenizer.bos_token + '[INST] {} [/INST]'.format(PROMPT)

input_tokens = image_tokens + prompt_inp
input_ids = tokenizer(input_tokens, add_special_tokens=False, return_tensors="pt").input_ids
# vision_patch_indices maps each token position to a row of vision_patches;
# positions that are not image patches keep NON_VISION_TOKEN_ID.
vision_patch_indices = torch.full_like(input_ids, fill_value=NON_VISION_TOKEN_ID)
vision_patches = image_patches.view(nh * nw, -1)
assert (input_ids == tokenizer.vis_patch_tok_id).sum() == vision_patches.size(0)
assert (input_ids >= tokenizer.vis_beg_tok_id).sum() == image_tokens_len

vision_patch_indices[input_ids == tokenizer.vis_patch_tok_id] = torch.arange(vision_patches.size(0))
attention_mask = create_single_prefix_mask(image_tokens_len, input_ids.size(-1)).unsqueeze(0).unsqueeze(0)
position_ids = generate_mm_pos_ids_singleit(input_ids.squeeze(0).numpy().tolist(), tokenizer.vis_patch_tok_id, nh, nw).unsqueeze(1)

input_ids = input_ids.long().cuda()
vision_patch_indices = vision_patch_indices.long().cuda()
vision_patches = vision_patches.to(torch.bfloat16).cuda()
position_ids = position_ids.long().cuda()
attention_mask = attention_mask.cuda()

padding_attention_mask = torch.ones_like(input_ids).cuda()

inputs = dict(
    input_ids = input_ids,
    position_ids = position_ids,
    attention_mask = padding_attention_mask,
    vision_patches = vision_patches,
    vision_patch_indices = vision_patch_indices,
    use_cache=True
)

cached_inputs = dict(
    input_ids = input_ids[:, :image_tokens_len],
    position_ids = position_ids[:, :, :image_tokens_len],
    attention_mask = attention_mask[:,:, :image_tokens_len, :image_tokens_len],
    vision_patches = vision_patches,
    vision_patch_indices = vision_patch_indices[:, :image_tokens_len],
    use_cache=True
)

# Pre-fill the key/value cache by running only the image-token prefix through the model.
prefix_cache = DynamicCache()
with torch.no_grad():
    prefix_cache = model.forward(**cached_inputs, past_key_values=prefix_cache).past_key_values

past_key_values = copy.deepcopy(prefix_cache)
generate_config = GenerationConfig(
    max_new_tokens=1024,
    return_dict_in_generate=True,
    output_attentions=False
)
generated = model.generate(
    **inputs,
    past_key_values=past_key_values,
    generation_config=generate_config
)
# Keep only the newly generated tokens by dropping the input prefix.
generated_ids = generated['sequences'][:, input_ids.size(1):]
response = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(f"\nModel Response: ===\n{response}\n===")
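Two pieces of bookkeeping in the example above are easier to follow on a toy input: the `vision_patch_indices` mapping and the prefix attention mask. The plain-Python sketch below mirrors the indexing done with `torch.full_like` and `torch.arange`. `build_prefix_mask` only illustrates one plausible prefix-mask convention (full attention inside the image prefix, causal attention elsewhere); it is an assumption and may not match what `create_single_prefix_mask` returns. The function names and token ids (900, 901, 902) are hypothetical.

```python
NON_VISION_TOKEN_ID = -1  # same sentinel value as in the example above


def build_vision_patch_indices(input_ids, patch_token_id):
    """Map each token position to the index of its patch, or -1 for other tokens."""
    indices = []
    next_patch = 0
    for token_id in input_ids:
        if token_id == patch_token_id:
            indices.append(next_patch)
            next_patch += 1
        else:
            indices.append(NON_VISION_TOKEN_ID)
    return indices


def build_prefix_mask(prefix_len, total_len):
    """Boolean mask where mask[q][k] is True if query q may attend to key k.

    Assumed convention: positions inside the prefix see the whole prefix,
    and every position also follows the usual causal rule (k <= q).
    """
    return [
        [(k <= q) or (q < prefix_len and k < prefix_len) for k in range(total_len)]
        for q in range(total_len)
    ]


# Hypothetical token ids: 900 = image start, 901 = patch, 902 = image end.
toy_ids = [900, 901, 901, 901, 902, 11, 12]
print(build_vision_patch_indices(toy_ids, patch_token_id=901))
# [-1, 0, 1, 2, -1, -1, -1]

mask = build_prefix_mask(prefix_len=5, total_len=7)
print(mask[0])  # the first image token sees the 5-token prefix but not the text
```

Under this convention the image tokens attend to one another bidirectionally, while the text tokens that follow attend causally to everything before them.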

📄 License

This project is released under the Apache-2.0 license.

📚 Citation

If you find this project useful in your research, please consider citing:

@article{lei2025sail,
  title={The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer},
  author={Lei, Weixian and Wang, Jiacong and Wang, Haochen and Li, Xiangtai and Liew, Jun Hao and Feng, Jiashi and Huang, Zilong},
  journal={arXiv preprint arXiv:2504.10462},
  year={2025}
}