Sam-hq-vit-huge Open Source Model - Accurately Generate High-Quality Object Masks, Applicable to Complex Objects!

Sam Hq Vit Huge

Developed by syscv-community

SAM-HQ is an enhanced version of the Segment Anything Model (SAM), capable of generating higher-quality object masks, especially suitable for handling objects with complex structures.

Image Segmentation

Transformers

Open Source License: Apache-2.0 #High-quality segmentation #Zero-shot generalization #Complex boundary handling

Downloads 516

Release Time: 5/5/2025

Model Overview

By introducing high-quality output tokens and global-local feature fusion techniques, SAM-HQ significantly improves the quality of segmentation masks while retaining the original SAM's promptable design, efficiency, and zero-shot generalization capabilities.

Model Features

High-Quality Output Tokens

Specially designed learnable tokens injected into the mask decoder to predict more accurate segmentation masks.

Global-Local Feature Fusion

Fuses mask decoder features with early and final ViT features, combining high-level semantics and low-level boundary information to improve mask details.

Efficient Improvements

Adds less than 0.5% more parameters than the original SAM and needs only 4 hours of training on 8 GPUs to significantly improve segmentation quality.

Zero-Shot Generalization

Maintains the original SAM's zero-shot generalization capability, allowing direct application to unseen data.

Model Capabilities

High-quality image segmentation

Prompt-based segmentation (points, boxes, etc.)

Automatic mask generation

Zero-shot transfer learning
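
For the box prompts listed above, the usage examples in this card pass boxes as nested lists of `[x0, y0, x1, y1]` pixel coordinates, with one list of boxes per image. The sketch below converts `[x, y, width, height]` boxes into that layout and clips them to the image bounds. The helper name `xywh_to_box_prompt` is hypothetical and not part of Transformers.

```python
def xywh_to_box_prompt(boxes_xywh, image_width, image_height):
    """Convert [x, y, w, h] boxes into a nested [[[x0, y0, x1, y1], ...]] prompt for one image."""
    prompt = []
    for x, y, w, h in boxes_xywh:
        if w <= 0 or h <= 0:
            raise ValueError(f"box must have positive size, got w={w}, h={h}")
        # Clip both corners so the box never leaves the image.
        x0 = min(max(x, 0), image_width)
        y0 = min(max(y, 0), image_height)
        x1 = min(max(x + w, 0), image_width)
        y1 = min(max(y + h, 0), image_height)
        prompt.append([x0, y0, x1, y1])
    return [prompt]  # outer list: a batch containing a single image
```

The result can be passed as `input_boxes` to the processor in the same way as the hand-written box in the usage examples.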

Use Cases

Image Editing

Precise Object Extraction

Accurately segments objects from complex backgrounds, preserving details and thin structures.

Compared to the original SAM, it better preserves object boundary details.
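
As an illustration of object extraction, a predicted mask can be used as an alpha channel so that everything outside the object becomes transparent. This is a minimal NumPy sketch that assumes the image is an `(H, W, 3)` uint8 array and the mask is an `(H, W)` boolean array. The function name `cut_out` is hypothetical.

```python
import numpy as np

def cut_out(image_rgb, mask):
    """Return an (H, W, 4) RGBA array where pixels outside the mask are fully transparent."""
    image_rgb = np.asarray(image_rgb, dtype=np.uint8)
    mask = np.asarray(mask, dtype=bool)
    if image_rgb.shape[:2] != mask.shape:
        raise ValueError("image and mask must share height and width")
    # Opaque (255) inside the object, transparent (0) elsewhere.
    alpha = np.where(mask, 255, 0).astype(np.uint8)
    return np.dstack([image_rgb, alpha])
```

The returned array can be turned back into an image with `PIL.Image.fromarray` and saved as a PNG to keep the transparency.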

Automated Annotation

High-Quality Data Annotation

Automatically generates precise object masks for training data annotation.

Reduces manual annotation workload and improves annotation quality.
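
For annotation pipelines, a binary mask usually has to be stored in a compact form. The card does not name a target format, so the sketch below assumes a COCO-style layout: pixel area, an `[x, y, w, h]` bounding box, and uncompressed run-length counts over the column-major flattening, starting with a run of zeros. The function name `mask_to_annotation` is hypothetical.

```python
import numpy as np

def mask_to_annotation(mask):
    """Summarise a binary mask as area, [x, y, w, h] box and uncompressed run-length counts."""
    mask = np.asarray(mask, dtype=bool)
    height, width = mask.shape
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        bbox = [0, 0, 0, 0]  # empty mask: no object pixels
    else:
        x0, y0 = int(xs.min()), int(ys.min())
        bbox = [x0, y0, int(xs.max()) - x0 + 1, int(ys.max()) - y0 + 1]

    # Run lengths over the column-major flattening; the first count is a run of zeros.
    flat = mask.flatten(order="F").astype(np.int8)
    change = np.flatnonzero(np.diff(flat)) + 1
    boundaries = np.concatenate([[0], change, [flat.size]])
    counts = np.diff(boundaries).tolist()
    if flat.size and flat[0] == 1:
        counts = [0] + counts  # mask starts with foreground, so the zero run is empty
    return {
        "area": int(mask.sum()),
        "bbox": bbox,
        "rle": {"size": [height, width], "counts": counts},
    }
```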

Medical Image Analysis

Medical Structure Segmentation

Segments fine structures in medical images.

Suitable for medical applications requiring high-precision segmentation.

🚀 Model Card for Segment Anything Model in High Quality (SAM-HQ)

SAM-HQ (Segment Anything in High Quality) is an enhanced version of the Segment Anything Model (SAM). It generates higher-quality object masks from input prompts such as points or boxes. Although SAM was trained on a large dataset, its mask prediction quality is limited, especially for objects with complex structures. SAM-HQ addresses this with minimal additional parameters and computation cost.

🚀 Quick Start

SAM-HQ is an advanced model for generating high-quality segmentation masks. It builds on the original SAM architecture and can be used for both prompted mask generation and automatic mask generation.

✨ Features

High-Quality Output Token: A learnable token in the mask decoder, trained to predict high-quality masks.
Global-Local Feature Fusion: Fuses features from different stages of the ViT for more accurate segmentation, combining high-level semantic context with low-level boundary information.
Improved Mask Quality: Produces better masks, especially for objects with complex boundaries and thin structures, while maintaining SAM's original promptable design, efficiency, and zero-shot generalizability.

📦 Installation

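The README gives no dedicated installation steps. The usage examples below import transformers, torch, PIL (Pillow), requests, numpy, and matplotlib, so those packages need to be available in your environment.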

💻 Usage Examples

Basic Usage

Prompted Mask Generation

import requests
import torch
from PIL import Image
from transformers import SamHQModel, SamHQProcessor

# Use a GPU when available; the model and its inputs must live on the same device.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SamHQModel.from_pretrained("syscv-community/sam-hq-vit-huge").to(device)
processor = SamHQProcessor.from_pretrained("syscv-community/sam-hq-vit-huge")

img_url = "https://raw.githubusercontent.com/SysCV/sam-hq/refs/heads/main/demo/input_imgs/example1.png"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
input_boxes = [[[306, 132, 925, 893]]]  # One box for the image, as [x0, y0, x1, y1] in pixels

inputs = processor(raw_image, input_boxes=input_boxes, return_tensors="pt").to(device)
with torch.no_grad():  # Inference only, so skip gradient tracking
    outputs = model(**inputs)
# Resize the predicted masks back to the original image size.
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(), inputs["original_sizes"].cpu(), inputs["reshaped_input_sizes"].cpu()
)
scores = outputs.iou_scores  # Predicted quality score for each candidate mask
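
The model can return several candidate masks per prompt, each with a predicted quality score in `outputs.iou_scores`. A minimal sketch for keeping only the highest-scoring candidate follows. It assumes the masks and scores have been converted to NumPy arrays and that any leading batch dimensions are of size one. The helper name `select_best_mask` is hypothetical.

```python
import numpy as np

def select_best_mask(masks, scores):
    """Return (best_mask, best_score) from candidate masks and their predicted scores."""
    masks = np.asarray(masks)
    scores = np.asarray(scores, dtype=float).reshape(-1)
    # Collapse any leading batch dimensions so masks has shape (N, H, W).
    masks = masks.reshape(-1, *masks.shape[-2:])
    if masks.shape[0] != scores.shape[0]:
        raise ValueError("number of masks and scores must match")
    best = int(np.argmax(scores))
    return masks[best].astype(bool), float(scores[best])
```

With the variables from the example above, a call might look like `select_best_mask(masks[0].numpy(), scores.detach().cpu().numpy())` for a single image with a single box.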

Automatic Mask Generation

from transformers import pipeline

# device=0 selects the first GPU.
generator = pipeline("mask-generation", model="syscv-community/sam-hq-vit-huge", device=0)
image_url = "https://raw.githubusercontent.com/SysCV/sam-hq/refs/heads/main/demo/input_imgs/example1.png"
# points_per_batch sets how many point prompts are processed at once; lower it if memory runs out.
outputs = generator(image_url, points_per_batch=256)
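
Automatic mask generation tends to return many masks of varying size and quality. The sketch below filters them by score and pixel area and sorts the survivors from largest to smallest. It assumes the pipeline result exposes `outputs["masks"]` and `outputs["scores"]`, which is worth checking against your installed Transformers version. The thresholds and the helper name `filter_masks` are illustrative.

```python
import numpy as np

def filter_masks(masks, scores, min_score=0.9, min_area=100):
    """Keep masks with score >= min_score and area >= min_area, largest first."""
    kept = []
    for mask, score in zip(masks, scores):
        mask = np.asarray(mask, dtype=bool)
        area = int(mask.sum())  # number of foreground pixels
        if float(score) >= min_score and area >= min_area:
            kept.append({"mask": mask, "score": float(score), "area": area})
    kept.sort(key=lambda item: item["area"], reverse=True)
    return kept
```

Under the stated assumption, `filter_masks(outputs["masks"], outputs["scores"])` would return a list of dictionaries ready for plotting or export.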

Advanced Usage

Complete Example with Visualization

import numpy as np
import matplotlib.pyplot as plt
def show_mask(mask, ax, random_color=False):
    if random_color:
        color = np.concatenate([np.random.random(3), np.array([0.6])], axis=0)
    else:
        color = np.array([30/255, 144/255, 255/255, 0.6])
    h, w = mask.shape[-2:]
    mask_image = mask.reshape(h, w, 1) * color.reshape(1, 1, -1)
    ax.imshow(mask_image)
def show_box(box, ax):
    x0, y0 = box[0], box[1]
    w, h = box[2] - box[0], box[3] - box[1]
    ax.add_patch(plt.Rectangle((x0, y0), w, h, edgecolor='green', facecolor=(0,0,0,0), lw=2))  
def show_boxes_on_image(raw_image, boxes):
    plt.figure(figsize=(10,10))
    plt.imshow(raw_image)
    for box in boxes:
      show_box(box, plt.gca())
    plt.axis('on')
    plt.show()
def show_points_on_image(raw_image, input_points, input_labels=None):
    plt.figure(figsize=(10,10))
    plt.imshow(raw_image)
    input_points = np.array(input_points)
    if input_labels is None:
      labels = np.ones_like(input_points[:, 0])
    else:
      labels = np.array(input_labels)
    show_points(input_points, labels, plt.gca())
    plt.axis('on')
    plt.show()
def show_points_and_boxes_on_image(raw_image, boxes, input_points, input_labels=None):
    plt.figure(figsize=(10,10))
    plt.imshow(raw_image)
    input_points = np.array(input_points)
    if input_labels is None:
      labels = np.ones_like(input_points[:, 0])
    else:
      labels = np.array(input_labels)
    show_points(input_points, labels, plt.gca())
    for box in boxes:
      show_box(box, plt.gca())
    plt.axis('on')
    plt.show()
def show_points(coords, labels, ax, marker_size=375):
    pos_points = coords[labels==1]
    neg_points = coords[labels==0]
    ax.scatter(pos_points[:, 0], pos_points[:, 1], color='green', marker='*', s=marker_size, edgecolor='white', linewidth=1.25)
    ax.scatter(neg_points[:, 0], neg_points[:, 1], color='red', marker='*', s=marker_size, edgecolor='white', linewidth=1.25)
def show_masks_on_image(raw_image, masks, scores):
    if len(masks.shape) == 4:
      masks = masks.squeeze()
    if scores.shape[0] == 1:
      scores = scores.squeeze()
    nb_predictions = scores.shape[-1]
    fig, axes = plt.subplots(1, nb_predictions, figsize=(15, 15))
    for i, (mask, score) in enumerate(zip(masks, scores)):
      mask = mask.cpu().detach()
      axes[i].imshow(np.array(raw_image))
      show_mask(mask, axes[i])
      axes[i].title.set_text(f"Mask {i+1}, Score: {score.item():.3f}")
      axes[i].axis("off")
    plt.show()
def show_masks_on_single_image(raw_image, masks, scores):
    if len(masks.shape) == 4:
        masks = masks.squeeze()
    if scores.shape[0] == 1:
        scores = scores.squeeze()
    # Convert image to numpy array if it's not already
    image_np = np.array(raw_image)
    # Create a figure
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.imshow(image_np)
    # Overlay all masks on the same image
    for i, (mask, score) in enumerate(zip(masks, scores)):
        mask = mask.cpu().detach().numpy()  # Convert to NumPy
        show_mask(mask, ax)  # Assuming `show_mask` properly overlays the mask
    ax.set_title("Overlayed Masks with Scores")
    ax.axis("off")
    plt.show()

import torch
from transformers import SamHQModel, SamHQProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SamHQModel.from_pretrained("syscv-community/sam-hq-vit-huge").to(device)
processor = SamHQProcessor.from_pretrained("syscv-community/sam-hq-vit-huge")

from PIL import Image
import requests
img_url = "https://raw.githubusercontent.com/SysCV/sam-hq/refs/heads/main/demo/input_imgs/example1.png"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
plt.imshow(raw_image)

inputs = processor(raw_image, return_tensors="pt").to(device)
image_embeddings, intermediate_embeddings = model.get_image_embeddings(inputs["pixel_values"])

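input_boxes = [[[306, 132, 925, 893]]]
show_boxes_on_image(raw_image, input_boxes[0])

# Run the processor again with the box prompt, then replace the pixel values with the
# cached embeddings so the image encoder does not run a second time.
inputs = processor(raw_image, input_boxes=input_boxes, return_tensors="pt").to(device)
inputs.pop("pixel_values", None)
inputs.update({"image_embeddings": image_embeddings})
inputs.update({"intermediate_embeddings": intermediate_embeddings})
with torch.no_grad():
    outputs = model(**inputs)
masks = processor.image_processor.post_process_masks(outputs.pred_masks.cpu(), inputs["original_sizes"].cpu(), inputs["reshaped_input_sizes"].cpu())
scores = outputs.iou_scores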

show_masks_on_single_image(raw_image, masks[0], scores)

show_masks_on_image(raw_image, masks[0], scores)

📚 Documentation

Model Details

SAM-HQ builds on the original SAM architecture with two key innovations while keeping SAM's pretrained weights:

High-Quality Output Token: A learnable token in the mask decoder, responsible for predicting high-quality masks. The token and its associated MLP layers are trained specifically to produce accurate segmentation masks.
Global-Local Feature Fusion: Instead of applying the HQ-Output Token to mask-decoder features alone, SAM-HQ first fuses these features with early and final ViT features to improve mask details.
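
The two ideas above can be pictured with a small NumPy sketch. It assumes the three feature maps have already been brought to a common `(C, H, W)` shape, that fusion is an element-wise sum, and that mask logits are a per-pixel dot product between the token embedding and the fused features. The learned convolutions and MLP of the real model are omitted, and all names and shapes here are illustrative rather than taken from the SAM-HQ code.

```python
import numpy as np

def fuse_features(decoder_feat, early_vit_feat, final_vit_feat):
    """Fuse three (C, H, W) feature maps by element-wise summation (assumed fusion rule)."""
    if not (decoder_feat.shape == early_vit_feat.shape == final_vit_feat.shape):
        raise ValueError("all feature maps must share the same (C, H, W) shape")
    return decoder_feat + early_vit_feat + final_vit_feat

def hq_mask_logits(hq_token, fused_feat):
    """Dot the (C,) token embedding with every pixel of the (C, H, W) fused features."""
    return np.einsum("c,chw->hw", hq_token, fused_feat)

# Toy shapes: 8 channels on a 4x4 grid.
rng = np.random.default_rng(0)
feats = [rng.standard_normal((8, 4, 4)) for _ in range(3)]
logits = hq_mask_logits(rng.standard_normal(8), fuse_features(*feats))
mask = logits > 0  # threshold the logits to obtain a binary mask
```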

SAM-HQ was trained on HQSeg-44K, a carefully curated dataset of 44K fine-grained masks. Training takes only 4 hours on 8 GPUs and adds less than 0.5% more parameters than the original SAM model.

The model has been evaluated on 10 diverse segmentation datasets. Results show that SAM-HQ produces better masks than the original SAM model while maintaining zero-shot generalization capabilities.
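
This card does not say which metrics were used in the evaluation. Intersection over union (IoU) is one common way to compare a predicted mask with a reference mask, and it is the quantity that the `iou_scores` output in the usage examples tries to predict. A minimal sketch, with the hypothetical helper name `mask_iou`:

```python
import numpy as np

def mask_iou(pred, target):
    """Intersection over union of two binary masks with the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    if pred.shape != target.shape:
        raise ValueError("masks must have the same shape")
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # two empty masks are treated as a perfect match
    return float(np.logical_and(pred, target).sum() / union)
```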

Citation

@misc{ke2023segmenthighquality,
      title={Segment Anything in High Quality}, 
      author={Lei Ke and Mingqiao Ye and Martin Danelljan and Yifan Liu and Yu-Wing Tai and Chi-Keung Tang and Fisher Yu},
      year={2023},
      eprint={2306.01567},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2306.01567}, 
}

📄 License

The model is licensed under the Apache-2.0 license.
