SynthPose VitPose Huge HF
SynthPose is a keypoint detection model built on a VitPose Huge backbone and fine-tuned on synthetic data to predict 52 human keypoints, making it well suited to kinematic analysis.
Downloads: 1,320
Released: 1/10/2025
Model Overview
The model uses a VitPose Huge backbone fine-tuned on synthetic data to predict 52 anatomical markers, including the COCO keypoints, which makes it particularly useful for motion capture and biomechanics scenarios.
Model Features
Dense keypoint prediction
Predicts 52 anatomical markers: the 17 standard COCO keypoints plus 35 additional markers for biomechanical analysis
Synthetic-data fine-tuning
Fine-tunes a pretrained model on synthetic data, improving prediction accuracy on the target keypoint set
Two-stage detection pipeline
First detects person bounding boxes, then predicts keypoints for each person, improving accuracy (demonstrated in the usage examples below)
Model Capabilities
Human keypoint detection
Kinematic analysis
Biomechanical marker prediction
Multi-person pose estimation
Use Cases
Motion Capture
Sports biomechanics analysis
Analyzes athletes' movement and posture, providing precise joint-angle and trajectory data (see the sketch after this list)
Outputs precise positions for 52 anatomical markers
Medical Rehabilitation
Rehabilitation training monitoring
Monitors changes in patients' movement and posture during rehabilitation exercises
Provides detailed joint-motion data for assessing treatment outcomes
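For instance, a joint angle can be estimated directly from three predicted markers. Below is a minimal sketch using hypothetical pixel coordinates for R_Hip, R_Knee, and R_Ankle; note that these are image-plane angles only, and rigorous kinematic analysis (as in OpenCapBench) works with 3D reconstructions:

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle (degrees) at vertex b, formed by the segments b->a and b->c."""
    a, b, c = np.asarray(a, float), np.asarray(b, float), np.asarray(c, float)
    v1, v2 = a - b, c - b
    cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0)))

# Hypothetical (x, y) pixel coordinates for R_Hip, R_Knee, R_Ankle:
print(joint_angle((310, 420), (318, 560), (305, 700)))  # ~171 deg, a nearly straight leg
```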
🚀 SynthPose (Transformers 🤗 VitPose Huge variant)
SynthPose is a keypoint detection model based on the VitPose Huge backbone. It uses synthetic data to fine-tune a pretrained 2D human pose model to predict a denser set of keypoints, enabling accurate kinematic analysis.
✨ Key Features
- Novel approach: SynthPose is a new method that fine-tunes pretrained 2D human pose models to predict a denser set of keypoints for accurate kinematic analysis.
- Rich keypoint set: The model predicts 52 keypoints; the first 17 are the COCO keypoints and the remaining 35 are anatomical markers.
- Two-stage detection: A two-stage pipeline first detects the people in an image, then detects keypoints for each detected person.
📚 Documentation
Model Background
SynthPose was proposed by Yoni Gozlan, Antoine Falisse, Scott Uhlrich, Anthony Gatti, Michael Black, and Akshay Chaudhari in the paper OpenCapBench: A Benchmark to Bridge Pose Estimation and Biomechanics. The model was contributed by Yoni Gozlan.
Intended Use Cases
This model uses a VitPose Huge backbone. SynthPose is a new approach that uses synthetic data to fine-tune pretrained 2D human pose models to predict an arbitrarily denser set of keypoints for accurate kinematic analysis. This particular variant was fine-tuned on a set of keypoints typically found in motion-capture setups, which also includes the COCO keypoints.
The model predicts the following 52 markers:
```python
{
    0: "Nose",
    1: "L_Eye",
    2: "R_Eye",
    3: "L_Ear",
    4: "R_Ear",
    5: "L_Shoulder",
    6: "R_Shoulder",
    7: "L_Elbow",
    8: "R_Elbow",
    9: "L_Wrist",
    10: "R_Wrist",
    11: "L_Hip",
    12: "R_Hip",
    13: "L_Knee",
    14: "R_Knee",
    15: "L_Ankle",
    16: "R_Ankle",
    17: "sternum",
    18: "rshoulder",
    19: "lshoulder",
    20: "r_lelbow",
    21: "l_lelbow",
    22: "r_melbow",
    23: "l_melbow",
    24: "r_lwrist",
    25: "l_lwrist",
    26: "r_mwrist",
    27: "l_mwrist",
    28: "r_ASIS",
    29: "l_ASIS",
    30: "r_PSIS",
    31: "l_PSIS",
    32: "r_knee",
    33: "l_knee",
    34: "r_mknee",
    35: "l_mknee",
    36: "r_ankle",
    37: "l_ankle",
    38: "r_mankle",
    39: "l_mankle",
    40: "r_5meta",
    41: "l_5meta",
    42: "r_toe",
    43: "l_toe",
    44: "r_big_toe",
    45: "l_big_toe",
    46: "l_calc",
    47: "r_calc",
    48: "C7",
    49: "L2",
    50: "T11",
    51: "T6",
}
```
The first 17 keypoints are the COCO keypoints, and the following 35 are anatomical markers.
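Downstream code often needs to handle the two groups separately. Here is a minimal sketch for splitting a per-person (52, 2) keypoint array into the two groups; the array itself would come from the inference pipeline shown in the usage examples below:

```python
import numpy as np

def split_markers(keypoints):
    """Split SynthPose output into COCO keypoints and anatomical markers.

    The first 17 rows follow the COCO keypoint order; the remaining 35 rows
    are the mocap-style anatomical markers listed above.
    """
    keypoints = np.asarray(keypoints)
    assert keypoints.shape[0] == 52
    return keypoints[:17], keypoints[17:]

# Hypothetical dummy input, just to show the resulting shapes:
coco_kpts, anat_markers = split_markers(np.zeros((52, 2)))
print(coco_kpts.shape, anat_markers.shape)  # (17, 2) (35, 2)
```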
💻 Usage Examples
Basic Usage
Here is how to load the model and run inference on an image:
```python
import torch
import requests
import numpy as np

from PIL import Image

from transformers import (
    AutoProcessor,
    RTDetrForObjectDetection,
    VitPoseForPoseEstimation,
)

device = "cuda" if torch.cuda.is_available() else "cpu"

url = "http://farm4.staticflickr.com/3300/3416216247_f9c6dfc939_z.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# ------------------------------------------------------------------------
# Stage 1. Detect humans on the image
# ------------------------------------------------------------------------

# You can use any object detector of your choice
person_image_processor = AutoProcessor.from_pretrained("PekingU/rtdetr_r50vd_coco_o365")
person_model = RTDetrForObjectDetection.from_pretrained("PekingU/rtdetr_r50vd_coco_o365", device_map=device)

inputs = person_image_processor(images=image, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = person_model(**inputs)

results = person_image_processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([(image.height, image.width)]), threshold=0.3
)
result = results[0]  # take first image results

# The "person" class has label 0 in the COCO dataset
person_boxes = result["boxes"][result["labels"] == 0]
person_boxes = person_boxes.cpu().numpy()

# Convert boxes from VOC (x1, y1, x2, y2) to COCO (x1, y1, w, h) format
person_boxes[:, 2] = person_boxes[:, 2] - person_boxes[:, 0]
person_boxes[:, 3] = person_boxes[:, 3] - person_boxes[:, 1]

# ------------------------------------------------------------------------
# Stage 2. Detect keypoints for each person found
# ------------------------------------------------------------------------

image_processor = AutoProcessor.from_pretrained("yonigozlan/synthpose-vitpose-huge-hf")
model = VitPoseForPoseEstimation.from_pretrained("yonigozlan/synthpose-vitpose-huge-hf", device_map=device)

inputs = image_processor(image, boxes=[person_boxes], return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)

pose_results = image_processor.post_process_pose_estimation(outputs, boxes=[person_boxes])
image_pose_result = pose_results[0]  # results for first image
```
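Each entry of `image_pose_result` corresponds to one detected person and is a dict of `keypoints`, `scores`, and `labels` tensors. A quick way to inspect the output (assuming the checkpoint's `id2label` config maps the 52 indices to the marker names listed above):

```python
for i, person_pose in enumerate(image_pose_result):
    print(f"Person #{i}")
    for keypoint, label, score in zip(
        person_pose["keypoints"], person_pose["labels"], person_pose["scores"]
    ):
        name = model.config.id2label[label.item()]
        x, y = keypoint
        print(f" - {name}: x={x.item():.2f}, y={y.item():.2f}, score={score.item():.2f}")
```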
Advanced Usage
Visualization with supervision
```python
import supervision as sv

xy = torch.stack([pose_result['keypoints'] for pose_result in image_pose_result]).cpu().numpy()
scores = torch.stack([pose_result['scores'] for pose_result in image_pose_result]).cpu().numpy()

key_points = sv.KeyPoints(
    xy=xy, confidence=scores
)

vertex_annotator = sv.VertexAnnotator(
    color=sv.Color.PINK,
    radius=2
)
annotated_frame = vertex_annotator.annotate(
    scene=image.copy(),
    key_points=key_points
)
annotated_frame
```
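The trailing `annotated_frame` expression only displays the image in a notebook; to persist it, the frame can be saved (assuming supervision returns the same image type as the `scene` it was given, a PIL image here):

```python
annotated_frame.save("synthpose_supervision.png")
```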
Advanced manual visualization
```python
import math
import cv2


def draw_points(image, keypoints, scores, pose_keypoint_color, keypoint_score_threshold, radius, show_keypoint_weight):
    if pose_keypoint_color is not None:
        assert len(pose_keypoint_color) == len(keypoints)
    for kid, (kpt, kpt_score) in enumerate(zip(keypoints, scores)):
        x_coord, y_coord = int(kpt[0]), int(kpt[1])
        if kpt_score > keypoint_score_threshold:
            color = tuple(int(c) for c in pose_keypoint_color[kid])
            if show_keypoint_weight:
                cv2.circle(image, (int(x_coord), int(y_coord)), radius, color, -1)
                transparency = max(0, min(1, kpt_score))
                cv2.addWeighted(image, transparency, image, 1 - transparency, 0, dst=image)
            else:
                cv2.circle(image, (int(x_coord), int(y_coord)), radius, color, -1)


def draw_links(image, keypoints, scores, keypoint_edges, link_colors, keypoint_score_threshold, thickness, show_keypoint_weight, stick_width=2):
    height, width, _ = image.shape
    if keypoint_edges is not None and link_colors is not None:
        assert len(link_colors) == len(keypoint_edges)
        for sk_id, sk in enumerate(keypoint_edges):
            x1, y1, score1 = (int(keypoints[sk[0], 0]), int(keypoints[sk[0], 1]), scores[sk[0]])
            x2, y2, score2 = (int(keypoints[sk[1], 0]), int(keypoints[sk[1], 1]), scores[sk[1]])
            if (
                x1 > 0
                and x1 < width
                and y1 > 0
                and y1 < height
                and x2 > 0
                and x2 < width
                and y2 > 0
                and y2 < height
                and score1 > keypoint_score_threshold
                and score2 > keypoint_score_threshold
            ):
                color = tuple(int(c) for c in link_colors[sk_id])
                if show_keypoint_weight:
                    X = (x1, x2)
                    Y = (y1, y2)
                    mean_x = np.mean(X)
                    mean_y = np.mean(Y)
                    length = ((Y[0] - Y[1]) ** 2 + (X[0] - X[1]) ** 2) ** 0.5
                    angle = math.degrees(math.atan2(Y[0] - Y[1], X[0] - X[1]))
                    polygon = cv2.ellipse2Poly(
                        (int(mean_x), int(mean_y)), (int(length / 2), int(stick_width)), int(angle), 0, 360, 1
                    )
                    cv2.fillConvexPoly(image, polygon, color)
                    # keypoints only carry (x, y), so use the link's scores for transparency
                    transparency = max(0, min(1, 0.5 * (score1 + score2)))
                    cv2.addWeighted(image, transparency, image, 1 - transparency, 0, dst=image)
                else:
                    cv2.line(image, (x1, y1), (x2, y2), color, thickness=thickness)


# Note: keypoint_edges and color palette are dataset-specific
keypoint_edges = model.config.edges

palette = np.array(
    [
        [255, 128, 0],
        [255, 153, 51],
        [255, 178, 102],
        [230, 230, 0],
        [255, 153, 255],
        [153, 204, 255],
        [255, 102, 255],
        [255, 51, 255],
        [102, 178, 255],
        [51, 153, 255],
        [255, 153, 153],
        [255, 102, 102],
        [255, 51, 51],
        [153, 255, 153],
        [102, 255, 102],
        [51, 255, 51],
        [0, 255, 0],
        [0, 0, 255],
        [255, 0, 0],
        [255, 255, 255],
    ]
)

link_colors = palette[[0, 0, 0, 0, 7, 7, 7, 9, 9, 9, 9, 9, 16, 16, 16, 16, 16, 16, 16]]
keypoint_colors = palette[[16, 16, 16, 16, 16, 9, 9, 9, 9, 9, 9, 0, 0, 0, 0, 0, 0] + [4] * (52 - 17)]

numpy_image = np.array(image)

for pose_result in image_pose_result:
    scores = np.array(pose_result["scores"])
    keypoints = np.array(pose_result["keypoints"])

    # draw each point on image
    draw_points(numpy_image, keypoints, scores, keypoint_colors, keypoint_score_threshold=0.3, radius=2, show_keypoint_weight=False)

    # draw links
    draw_links(numpy_image, keypoints, scores, keypoint_edges, link_colors, keypoint_score_threshold=0.3, thickness=1, show_keypoint_weight=False)

pose_image = Image.fromarray(numpy_image)
pose_image
```
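For downstream kinematic analysis it can be handy to dump the markers to a flat file. Below is a minimal sketch writing one row per marker to CSV; the column layout is an arbitrary choice for illustration, not an established mocap format, and `model.config.id2label` is assumed to carry the 52 marker names:

```python
import csv

with open("markers.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["person", "marker", "x", "y", "score"])
    for person_id, person_pose in enumerate(image_pose_result):
        for label, (x, y), score in zip(
            person_pose["labels"], person_pose["keypoints"], person_pose["scores"]
        ):
            writer.writerow([person_id, model.config.id2label[label.item()],
                             float(x), float(y), float(score)])
```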
📄 License
This project is licensed under the Apache-2.0 license.