sam-hq-vit-huge open-source model: accurate, high-quality object masks, even for complex objects

Sam Hq Vit Huge

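Developed by syscv-community

SAM-HQ is an enhanced version of the Segment Anything Model (SAM) that produces higher-quality object masks and is particularly well suited to objects with complex structure.

Image segmentation

Transformers

License: Apache-2.0 #high-quality-segmentation #zero-shot-generalization #complex-boundary-handling

Downloads: 516

Release date: 5/5/2025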

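Model overview

SAM-HQ introduces a high-quality output token and global-local feature fusion, which markedly improve segmentation mask quality while preserving the promptable design, efficiency, and zero-shot generalization of the original SAM.

Model features

High-quality output token

A purpose-built learnable token that is injected into the mask decoder and is responsible for predicting more accurate segmentation masks.

Global-local feature fusion

Fuses the mask decoder features with early and final ViT features, combining high-level semantics with low-level boundary information to improve mask detail.

Efficient improvement

Adds less than 0.5% more parameters, and training takes only 4 hours on 8 GPUs, yet segmentation quality improves markedly.

Zero-shot generalization

Retains the zero-shot generalization of the original SAM, so it can be applied directly to unseen data.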

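Model capabilities

High-quality image segmentation

Prompt-based segmentation (points, boxes, and so on)

Automatic mask generation

Zero-shot transfer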

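Use cases

Image editing

Precise object extraction

Segments objects precisely from complex backgrounds, preserving fine details and thin structures

Preserves object boundary detail better than the original SAM

Automated annotation

High-quality data labeling

Automatically generates precise object masks for labeling training data

Reduces manual labeling effort and improves label quality

Medical image analysis

Medical structure segmentation

Segments fine structures in medical images

Suited to medical applications that require high-precision segmentation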

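🚀 High-Quality Segment Anything Model (SAM-HQ)

SAM-HQ is an enhanced version of the Segment Anything Model (SAM) that generates higher-quality object masks from input prompts such as points or boxes. SAM's mask predictions often fall short on objects with complex structure. SAM-HQ addresses this with very few additional parameters and little extra computation, markedly improving mask quality while keeping SAM's original design and generalization ability.

🚀 Quick start

Prompted mask generation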

import requests
import torch
from PIL import Image
from transformers import SamHQModel, SamHQProcessor

# Use the GPU when available; the model and its inputs must live on the same device.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SamHQModel.from_pretrained("syscv-community/sam-hq-vit-huge").to(device)
processor = SamHQProcessor.from_pretrained("syscv-community/sam-hq-vit-huge")

img_url = "https://raw.githubusercontent.com/SysCV/sam-hq/refs/heads/main/demo/input_imgs/example1.png"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
input_boxes = [[[306, 132, 925, 893]]]  # One bounding box for the image: [x_min, y_min, x_max, y_max]

inputs = processor(raw_image, input_boxes=input_boxes, return_tensors="pt").to(device)
with torch.no_grad():  # Inference only, so skip gradient tracking
    outputs = model(**inputs)
# Resize the low-resolution predicted masks back to the original image size.
masks = processor.image_processor.post_process_masks(outputs.pred_masks.cpu(), inputs["original_sizes"].cpu(), inputs["reshaped_input_sizes"].cpu())
scores = outputs.iou_scores

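Among the other arguments for mask generation, you can pass 2D coordinates for the approximate location of the object of interest, a bounding box that encloses it (given as the x, y coordinates of the box's top-left and bottom-right corners, matching the [x_min, y_min, x_max, y_max] layout used in the code here), or a segmentation mask. According to the official repository, at the time of writing the official model does not support text as input. For more details, see this notebook, which walks through using the model step by step with visual examples.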
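Because the box layout is easy to get wrong, the following is a minimal sketch of normalizing two opposite corners into the `[x_min, y_min, x_max, y_max]` layout and nesting the result the way `input_boxes` is written in the quick-start code. The helper names `to_xyxy_box` and `as_box_prompt` are hypothetical and are not part of Transformers.

```python
def to_xyxy_box(corner_a, corner_b):
    """Return [x_min, y_min, x_max, y_max] from two opposite corners given in any order."""
    (xa, ya), (xb, yb) = corner_a, corner_b
    box = [min(xa, xb), min(ya, yb), max(xa, xb), max(ya, yb)]
    if box[0] == box[2] or box[1] == box[3]:
        raise ValueError("box has zero width or height")
    return box


def as_box_prompt(boxes):
    """Nest the boxes for a single image as [image][box][coordinate]."""
    return [[list(box) for box in boxes]]


# The corners may be given in any order, for example top-right then bottom-left.
box = to_xyxy_box((925, 132), (306, 893))
input_boxes = as_box_prompt([box])
print(input_boxes)
```

The resulting `input_boxes` has the same nesting as the literal used in the quick-start snippet, so it could be passed to the processor in its place.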

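Automatic mask generation

The model can generate segmentation masks for an input image in a zero-shot fashion. It is prompted automatically with a grid of 1024 points, all of which are fed to the model.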
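To make the point-grid idea concrete, here is a small sketch that builds a uniform grid of prompt points and splits it into batches. It assumes the 1024 points form a 32 × 32 grid placed at cell centers in normalized coordinates; the text above only states the total, and the pipeline's internal layout may differ. The names `build_point_grid` and `batched` are hypothetical.

```python
def build_point_grid(points_per_side=32):
    """Return points_per_side**2 (x, y) points in [0, 1], one per grid-cell center."""
    offset = 1.0 / (2 * points_per_side)
    step = 1.0 / points_per_side
    coords = [offset + i * step for i in range(points_per_side)]
    return [(x, y) for y in coords for x in coords]


def batched(points, points_per_batch):
    """Yield successive chunks of at most points_per_batch points."""
    for start in range(0, len(points), points_per_batch):
        yield points[start:start + points_per_batch]


grid = build_point_grid(32)
batches = list(batched(grid, 256))
print(len(grid), len(batches))  # 1024 points split into 4 batches
```

A smaller `points_per_batch` means more forward passes with fewer prompts each, which is the trade-off that lets the same grid fit on devices with less memory.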

The following snippet shows how to run automatic mask generation (it can run on any device; just pass a suitable points_per_batch argument):

from transformers import pipeline
generator = pipeline("mask-generation", model="syscv-community/sam-hq-vit-huge", device=0, points_per_batch=256)
image_url = "https://raw.githubusercontent.com/SysCV/sam-hq/refs/heads/main/demo/input_imgs/example1.png"
outputs = generator(image_url, points_per_batch=256)

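Now display the masks over the image:

import matplotlib.pyplot as plt
import numpy as np
import requests
from PIL import Image

# Load the same image that was passed to the pipeline so the masks can be drawn over it.
raw_image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")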

def show_mask(mask, ax, random_color=False):
    if random_color:
        color = np.concatenate([np.random.random(3), np.array([0.6])], axis=0)
    else:
        color = np.array([30 / 255, 144 / 255, 255 / 255, 0.6])
    h, w = mask.shape[-2:]
    mask_image = mask.reshape(h, w, 1) * color.reshape(1, 1, -1)
    ax.imshow(mask_image)
    
plt.imshow(np.array(raw_image))
ax = plt.gca()
for mask in outputs["masks"]:
    show_mask(mask, ax=ax, random_color=True)
plt.axis("off")
plt.show()

Complete example with visualization

import numpy as np
import matplotlib.pyplot as plt
def show_mask(mask, ax, random_color=False):
    if random_color:
        color = np.concatenate([np.random.random(3), np.array([0.6])], axis=0)
    else:
        color = np.array([30/255, 144/255, 255/255, 0.6])
    h, w = mask.shape[-2:]
    mask_image = mask.reshape(h, w, 1) * color.reshape(1, 1, -1)
    ax.imshow(mask_image)
def show_box(box, ax):
    x0, y0 = box[0], box[1]
    w, h = box[2] - box[0], box[3] - box[1]
    ax.add_patch(plt.Rectangle((x0, y0), w, h, edgecolor='green', facecolor=(0,0,0,0), lw=2))  
def show_boxes_on_image(raw_image, boxes):
    plt.figure(figsize=(10,10))
    plt.imshow(raw_image)
    for box in boxes:
      show_box(box, plt.gca())
    plt.axis('on')
    plt.show()
def show_points_on_image(raw_image, input_points, input_labels=None):
    plt.figure(figsize=(10,10))
    plt.imshow(raw_image)
    input_points = np.array(input_points)
    if input_labels is None:
      labels = np.ones_like(input_points[:, 0])
    else:
      labels = np.array(input_labels)
    show_points(input_points, labels, plt.gca())
    plt.axis('on')
    plt.show()
def show_points_and_boxes_on_image(raw_image, boxes, input_points, input_labels=None):
    plt.figure(figsize=(10,10))
    plt.imshow(raw_image)
    input_points = np.array(input_points)
    if input_labels is None:
      labels = np.ones_like(input_points[:, 0])
    else:
      labels = np.array(input_labels)
    show_points(input_points, labels, plt.gca())
    for box in boxes:
      show_box(box, plt.gca())
    plt.axis('on')
    plt.show()
def show_points(coords, labels, ax, marker_size=375):
    pos_points = coords[labels==1]
    neg_points = coords[labels==0]
    ax.scatter(pos_points[:, 0], pos_points[:, 1], color='green', marker='*', s=marker_size, edgecolor='white', linewidth=1.25)
    ax.scatter(neg_points[:, 0], neg_points[:, 1], color='red', marker='*', s=marker_size, edgecolor='white', linewidth=1.25)
def show_masks_on_image(raw_image, masks, scores):
    if len(masks.shape) == 4:
      masks = masks.squeeze()
    if scores.shape[0] == 1:
      scores = scores.squeeze()
    nb_predictions = scores.shape[-1]
    fig, axes = plt.subplots(1, nb_predictions, figsize=(15, 15))
    for i, (mask, score) in enumerate(zip(masks, scores)):
      mask = mask.cpu().detach()
      axes[i].imshow(np.array(raw_image))
      show_mask(mask, axes[i])
      axes[i].title.set_text(f"Mask {i+1}, Score: {score.item():.3f}")
      axes[i].axis("off")
    plt.show()
def show_masks_on_single_image(raw_image, masks, scores):
    if len(masks.shape) == 4:
        masks = masks.squeeze()
    if scores.shape[0] == 1:
        scores = scores.squeeze()
    # Convert image to numpy array if it's not already
    image_np = np.array(raw_image)
    # Create a figure
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.imshow(image_np)
    # Overlay all masks on the same image
    for i, (mask, score) in enumerate(zip(masks, scores)):
        mask = mask.cpu().detach().numpy()  # Convert to NumPy
        show_mask(mask, ax)  # Assuming `show_mask` properly overlays the mask
    ax.set_title(f"Overlayed Masks with Scores")
    ax.axis("off")
    plt.show()

import torch
from transformers import SamHQModel, SamHQProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SamHQModel.from_pretrained("syscv-community/sam-hq-vit-huge").to(device)
processor = SamHQProcessor.from_pretrained("syscv-community/sam-hq-vit-huge")

from PIL import Image
import requests
img_url = "https://raw.githubusercontent.com/SysCV/sam-hq/refs/heads/main/demo/input_imgs/example1.png"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
plt.imshow(raw_image)

inputs = processor(raw_image, return_tensors="pt").to(device)
image_embeddings, intermediate_embeddings = model.get_image_embeddings(inputs["pixel_values"])

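input_boxes = [[[306, 132, 925, 893]]]
show_boxes_on_image(raw_image, input_boxes[0])

# Re-run the processor with the box prompt, then swap the pixel values for the
# precomputed embeddings so the image encoder does not run a second time.
inputs = processor(raw_image, input_boxes=input_boxes, return_tensors="pt").to(device)
inputs.pop("pixel_values", None)
inputs.update({"image_embeddings": image_embeddings})
inputs.update({"intermediate_embeddings": intermediate_embeddings})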
with torch.no_grad():
    outputs = model(**inputs)
masks = processor.image_processor.post_process_masks(outputs.pred_masks.cpu(), inputs["original_sizes"].cpu(), inputs["reshaped_input_sizes"].cpu())
scores = outputs.iou_scores

show_masks_on_single_image(raw_image, masks[0], scores)

show_masks_on_image(raw_image, masks[0], scores)
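When several candidate masks come back for one prompt, a common follow-up is to keep only the one with the highest predicted IoU score. The sketch below shows that selection on a plain list of floats; the helper name `best_mask_index` is hypothetical and the example scores are made up for illustration.

```python
def best_mask_index(scores):
    """Return the index of the highest predicted-IoU score."""
    if not scores:
        raise ValueError("scores is empty")
    return max(range(len(scores)), key=lambda i: scores[i])


example_scores = [0.91, 0.97, 0.88]  # made-up values, not real model output
idx = best_mask_index(example_scores)
print(idx)
```

With the tensors from the example above, and assuming a single image and a single prompt, the list could be obtained with `scores.squeeze().tolist()`, and the matching mask would then be `masks[0].squeeze()[idx]`.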

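✨ Key features

High-quality output: SAM-HQ produces high-quality segmentation masks even for objects with intricate boundaries and thin structures, cases where the original SAM often performs poorly.
Original design preserved: SAM-HQ keeps SAM's promptable design, efficiency, and zero-shot generalization while markedly improving mask quality.
Architectural innovations: SAM-HQ builds on SAM's pretrained weights and adds two key innovations, the high-quality output token and global-local feature fusion.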

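📚 Detailed documentation

Model details

SAM-HQ keeps SAM's pretrained weights and improves on the original SAM architecture through two key innovations:

High-quality output token: a learnable token injected into SAM's mask decoder that is responsible for predicting the high-quality mask. Unlike SAM's original output tokens, this token and its associated MLP layers are trained specifically to produce highly accurate segmentation masks.
Global-local feature fusion: rather than applying the high-quality output token to the mask decoder features alone, SAM-HQ first fuses those features with early and final ViT features to improve mask detail. This combines high-level semantic context with low-level boundary information for more accurate segmentation.

SAM-HQ is trained on HQSeg-44K, a carefully curated dataset of 44K fine-grained masks compiled from several sources with extremely accurate annotations. Training takes only 4 hours on 8 GPUs and introduces less than 0.5% additional parameters compared with the original SAM.

The model was evaluated on 10 diverse segmentation datasets spanning different downstream tasks, 8 of them under a zero-shot transfer protocol. The results show that SAM-HQ produces clearly better masks than the original SAM while retaining its zero-shot generalization.

SAM-HQ addresses two key problems of the original SAM:

Coarse mask boundaries: in many cases the original SAM produces masks with coarse boundaries and often misses thin object structures.
Incorrect predictions: in challenging cases the original SAM may produce wrong predictions, broken masks, or large errors.

These improvements make SAM-HQ particularly valuable for applications that need highly accurate image masks, such as automated annotation and image or video editing.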

📄 License

This project is released under the Apache-2.0 license.

📑 Citation

@misc{ke2023segmenthighquality,
      title={Segment Anything in High Quality}, 
      author={Lei Ke and Mingqiao Ye and Martin Danelljan and Yifan Liu and Yu-Wing Tai and Chi-Keung Tang and Fisher Yu},
      year={2023},
      eprint={2306.01567},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2306.01567}, 
}