SynthPose VitPose Huge HF
SynthPose is a keypoint detection model built on a VitPose Huge backbone and fine-tuned on synthetic data to predict 52 human keypoints, making it well suited to kinematic analysis.
Downloads: 1,320
Released: 1/10/2025
Model Overview
The model uses a VitPose Huge backbone fine-tuned on synthetic data to predict 52 anatomical markers, including the COCO keypoints, which makes it particularly useful for motion capture and biomechanics scenarios.
Model Features
Dense keypoint prediction
Predicts 52 anatomical markers: the 17 standard COCO keypoints plus 35 additional markers for biomechanical analysis
Synthetic-data fine-tuning
Fine-tunes a pretrained model on synthetic data, improving prediction accuracy on the target keypoint set
Two-stage detection pipeline
First detects person bounding boxes, then predicts keypoints for each person, improving accuracy (demonstrated in the usage examples below)
Model Capabilities
Human keypoint detection
Kinematic analysis
Biomechanical marker prediction
Multi-person pose estimation
Use Cases
Motion Capture
Sports biomechanics analysis
Analyzes athletes' movement and posture, providing precise joint-angle and trajectory data (see the sketch after this list)
Outputs precise positions for 52 anatomical markers
Medical Rehabilitation
Rehabilitation training monitoring
Monitors changes in patients' movement and posture during rehabilitation exercises
Provides detailed joint-motion data for assessing treatment outcomes
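For instance, a joint angle can be estimated directly from three predicted markers. Below is a minimal sketch using hypothetical pixel coordinates for R_Hip, R_Knee, and R_Ankle; note that these are image-plane angles only, and rigorous kinematic analysis (as in OpenCapBench) works with 3D reconstructions:

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle (degrees) at vertex b, formed by the segments b->a and b->c."""
    a, b, c = np.asarray(a, float), np.asarray(b, float), np.asarray(c, float)
    v1, v2 = a - b, c - b
    cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0)))

# Hypothetical (x, y) pixel coordinates for R_Hip, R_Knee, R_Ankle:
print(joint_angle((310, 420), (318, 560), (305, 700)))  # ~171 deg, a nearly straight leg
```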
🚀 SynthPose (Transformers 🤗 VitPose Huge variant)
SynthPose is a keypoint detection model based on the VitPose Huge backbone. It uses synthetic data to fine-tune a pretrained 2D human pose model to predict a denser set of keypoints, enabling accurate kinematic analysis.
✨ Key Features
- Novel approach: SynthPose is a new method that fine-tunes pretrained 2D human pose models to predict a denser set of keypoints for accurate kinematic analysis.
- Rich keypoint set: The model predicts 52 keypoints; the first 17 are the COCO keypoints and the remaining 35 are anatomical markers.
- Two-stage detection: A two-stage pipeline first detects the people in an image, then detects keypoints for each detected person.
📚 Documentation
Model Background
SynthPose was proposed by Yoni Gozlan, Antoine Falisse, Scott Uhlrich, Anthony Gatti, Michael Black, and Akshay Chaudhari in the paper OpenCapBench: A Benchmark to Bridge Pose Estimation and Biomechanics. The model was contributed by Yoni Gozlan.
Intended Use Cases
This model uses a VitPose Huge backbone. SynthPose is a new approach that uses synthetic data to fine-tune pretrained 2D human pose models to predict an arbitrarily denser set of keypoints for accurate kinematic analysis. This particular variant was fine-tuned on a set of keypoints typically found in motion-capture setups, which also includes the COCO keypoints.
The model predicts the following 52 markers:
```python
{
    0: "Nose",
    1: "L_Eye",
    2: "R_Eye",
    3: "L_Ear",
    4: "R_Ear",
    5: "L_Shoulder",
    6: "R_Shoulder",
    7: "L_Elbow",
    8: "R_Elbow",
    9: "L_Wrist",
    10: "R_Wrist",
    11: "L_Hip",
    12: "R_Hip",
    13: "L_Knee",
    14: "R_Knee",
    15: "L_Ankle",
    16: "R_Ankle",
    17: "sternum",
    18: "rshoulder",
    19: "lshoulder",
    20: "r_lelbow",
    21: "l_lelbow",
    22: "r_melbow",
    23: "l_melbow",
    24: "r_lwrist",
    25: "l_lwrist",
    26: "r_mwrist",
    27: "l_mwrist",
    28: "r_ASIS",
    29: "l_ASIS",
    30: "r_PSIS",
    31: "l_PSIS",
    32: "r_knee",
    33: "l_knee",
    34: "r_mknee",
    35: "l_mknee",
    36: "r_ankle",
    37: "l_ankle",
    38: "r_mankle",
    39: "l_mankle",
    40: "r_5meta",
    41: "l_5meta",
    42: "r_toe",
    43: "l_toe",
    44: "r_big_toe",
    45: "l_big_toe",
    46: "l_calc",
    47: "r_calc",
    48: "C7",
    49: "L2",
    50: "T11",
    51: "T6",
}
```
The first 17 keypoints are the COCO keypoints, and the following 35 are anatomical markers.
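Downstream code often needs to handle the two groups separately. Here is a minimal sketch for splitting a per-person (52, 2) keypoint array into the two groups; the array itself would come from the inference pipeline shown in the usage examples below:

```python
import numpy as np

def split_markers(keypoints):
    """Split SynthPose output into COCO keypoints and anatomical markers.

    The first 17 rows follow the COCO keypoint order; the remaining 35 rows
    are the mocap-style anatomical markers listed above.
    """
    keypoints = np.asarray(keypoints)
    assert keypoints.shape[0] == 52
    return keypoints[:17], keypoints[17:]

# Hypothetical dummy input, just to show the resulting shapes:
coco_kpts, anat_markers = split_markers(np.zeros((52, 2)))
print(coco_kpts.shape, anat_markers.shape)  # (17, 2) (35, 2)
```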
💻 Usage Examples
Basic Usage
Here is how to load the model and run inference on an image:
```python
import torch
import requests
import numpy as np

from PIL import Image

from transformers import (
    AutoProcessor,
    RTDetrForObjectDetection,
    VitPoseForPoseEstimation,
)

device = "cuda" if torch.cuda.is_available() else "cpu"

url = "http://farm4.staticflickr.com/3300/3416216247_f9c6dfc939_z.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# ------------------------------------------------------------------------
# Stage 1. Detect humans on the image
# ------------------------------------------------------------------------

# You can use any object detector of your choice
person_image_processor = AutoProcessor.from_pretrained("PekingU/rtdetr_r50vd_coco_o365")
person_model = RTDetrForObjectDetection.from_pretrained("PekingU/rtdetr_r50vd_coco_o365", device_map=device)

inputs = person_image_processor(images=image, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = person_model(**inputs)

results = person_image_processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([(image.height, image.width)]), threshold=0.3
)
result = results[0]  # take first image results

# The "person" class has label 0 in the COCO dataset
person_boxes = result["boxes"][result["labels"] == 0]
person_boxes = person_boxes.cpu().numpy()

# Convert boxes from VOC (x1, y1, x2, y2) to COCO (x1, y1, w, h) format
person_boxes[:, 2] = person_boxes[:, 2] - person_boxes[:, 0]
person_boxes[:, 3] = person_boxes[:, 3] - person_boxes[:, 1]

# ------------------------------------------------------------------------
# Stage 2. Detect keypoints for each person found
# ------------------------------------------------------------------------

image_processor = AutoProcessor.from_pretrained("yonigozlan/synthpose-vitpose-huge-hf")
model = VitPoseForPoseEstimation.from_pretrained("yonigozlan/synthpose-vitpose-huge-hf", device_map=device)

inputs = image_processor(image, boxes=[person_boxes], return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)

pose_results = image_processor.post_process_pose_estimation(outputs, boxes=[person_boxes])
image_pose_result = pose_results[0]  # results for first image
```
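Each entry of `image_pose_result` corresponds to one detected person and is a dict of `keypoints`, `scores`, and `labels` tensors. A quick way to inspect the output (assuming the checkpoint's `id2label` config maps the 52 indices to the marker names listed above):

```python
for i, person_pose in enumerate(image_pose_result):
    print(f"Person #{i}")
    for keypoint, label, score in zip(
        person_pose["keypoints"], person_pose["labels"], person_pose["scores"]
    ):
        name = model.config.id2label[label.item()]
        x, y = keypoint
        print(f" - {name}: x={x.item():.2f}, y={y.item():.2f}, score={score.item():.2f}")
```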
Advanced Usage
Visualization with supervision
```python
import supervision as sv

xy = torch.stack([pose_result['keypoints'] for pose_result in image_pose_result]).cpu().numpy()
scores = torch.stack([pose_result['scores'] for pose_result in image_pose_result]).cpu().numpy()

key_points = sv.KeyPoints(
    xy=xy, confidence=scores
)

vertex_annotator = sv.VertexAnnotator(
    color=sv.Color.PINK,
    radius=2
)
annotated_frame = vertex_annotator.annotate(
    scene=image.copy(),
    key_points=key_points
)
annotated_frame
```
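The trailing `annotated_frame` expression only displays the image in a notebook; to persist it, the frame can be saved (assuming supervision returns the same image type as the `scene` it was given, a PIL image here):

```python
annotated_frame.save("synthpose_supervision.png")
```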
Advanced manual visualization
```python
import math
import cv2


def draw_points(image, keypoints, scores, pose_keypoint_color, keypoint_score_threshold, radius, show_keypoint_weight):
    if pose_keypoint_color is not None:
        assert len(pose_keypoint_color) == len(keypoints)
    for kid, (kpt, kpt_score) in enumerate(zip(keypoints, scores)):
        x_coord, y_coord = int(kpt[0]), int(kpt[1])
        if kpt_score > keypoint_score_threshold:
            color = tuple(int(c) for c in pose_keypoint_color[kid])
            if show_keypoint_weight:
                cv2.circle(image, (int(x_coord), int(y_coord)), radius, color, -1)
                transparency = max(0, min(1, kpt_score))
                cv2.addWeighted(image, transparency, image, 1 - transparency, 0, dst=image)
            else:
                cv2.circle(image, (int(x_coord), int(y_coord)), radius, color, -1)


def draw_links(image, keypoints, scores, keypoint_edges, link_colors, keypoint_score_threshold, thickness, show_keypoint_weight, stick_width=2):
    height, width, _ = image.shape
    if keypoint_edges is not None and link_colors is not None:
        assert len(link_colors) == len(keypoint_edges)
        for sk_id, sk in enumerate(keypoint_edges):
            x1, y1, score1 = (int(keypoints[sk[0], 0]), int(keypoints[sk[0], 1]), scores[sk[0]])
            x2, y2, score2 = (int(keypoints[sk[1], 0]), int(keypoints[sk[1], 1]), scores[sk[1]])
            if (
                x1 > 0
                and x1 < width
                and y1 > 0
                and y1 < height
                and x2 > 0
                and x2 < width
                and y2 > 0
                and y2 < height
                and score1 > keypoint_score_threshold
                and score2 > keypoint_score_threshold
            ):
                color = tuple(int(c) for c in link_colors[sk_id])
                if show_keypoint_weight:
                    X = (x1, x2)
                    Y = (y1, y2)
                    mean_x = np.mean(X)
                    mean_y = np.mean(Y)
                    length = ((Y[0] - Y[1]) ** 2 + (X[0] - X[1]) ** 2) ** 0.5
                    angle = math.degrees(math.atan2(Y[0] - Y[1], X[0] - X[1]))
                    polygon = cv2.ellipse2Poly(
                        (int(mean_x), int(mean_y)), (int(length / 2), int(stick_width)), int(angle), 0, 360, 1
                    )
                    cv2.fillConvexPoly(image, polygon, color)
                    # keypoints only carry (x, y), so use the link's scores for transparency
                    transparency = max(0, min(1, 0.5 * (score1 + score2)))
                    cv2.addWeighted(image, transparency, image, 1 - transparency, 0, dst=image)
                else:
                    cv2.line(image, (x1, y1), (x2, y2), color, thickness=thickness)


# Note: keypoint_edges and color palette are dataset-specific
keypoint_edges = model.config.edges

palette = np.array(
    [
        [255, 128, 0],
        [255, 153, 51],
        [255, 178, 102],
        [230, 230, 0],
        [255, 153, 255],
        [153, 204, 255],
        [255, 102, 255],
        [255, 51, 255],
        [102, 178, 255],
        [51, 153, 255],
        [255, 153, 153],
        [255, 102, 102],
        [255, 51, 51],
        [153, 255, 153],
        [102, 255, 102],
        [51, 255, 51],
        [0, 255, 0],
        [0, 0, 255],
        [255, 0, 0],
        [255, 255, 255],
    ]
)

link_colors = palette[[0, 0, 0, 0, 7, 7, 7, 9, 9, 9, 9, 9, 16, 16, 16, 16, 16, 16, 16]]
keypoint_colors = palette[[16, 16, 16, 16, 16, 9, 9, 9, 9, 9, 9, 0, 0, 0, 0, 0, 0] + [4] * (52 - 17)]

numpy_image = np.array(image)

for pose_result in image_pose_result:
    scores = np.array(pose_result["scores"])
    keypoints = np.array(pose_result["keypoints"])

    # draw each point on image
    draw_points(numpy_image, keypoints, scores, keypoint_colors, keypoint_score_threshold=0.3, radius=2, show_keypoint_weight=False)

    # draw links
    draw_links(numpy_image, keypoints, scores, keypoint_edges, link_colors, keypoint_score_threshold=0.3, thickness=1, show_keypoint_weight=False)

pose_image = Image.fromarray(numpy_image)
pose_image
```
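For downstream kinematic analysis it can be handy to dump the markers to a flat file. Below is a minimal sketch writing one row per marker to CSV; the column layout is an arbitrary choice for illustration, not an established mocap format, and `model.config.id2label` is assumed to carry the 52 marker names:

```python
import csv

with open("markers.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["person", "marker", "x", "y", "score"])
    for person_id, person_pose in enumerate(image_pose_result):
        for label, (x, y), score in zip(
            person_pose["labels"], person_pose["keypoints"], person_pose["scores"]
        ):
            writer.writerow([person_id, model.config.id2label[label.item()],
                             float(x), float(y), float(score)])
```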
📄 License
This project is licensed under the Apache-2.0 license.