Synthpose Vitpose Huge Hf
SynthPose is a keypoint-detection model built on a ViTPose Huge backbone and fine-tuned on synthetic data to predict 52 human keypoints, making it well suited to kinematic analysis.
Model Overview
The model uses a ViTPose Huge backbone fine-tuned on synthetic data to predict 52 anatomical markers, including the COCO keypoints, making it particularly well suited to motion capture and biomechanical analysis.
Model Features
- Dense keypoint prediction: predicts 52 anatomical markers, comprising the 17 standard COCO keypoints plus 35 additional keypoints for biomechanical analysis.
- Synthetic-data fine-tuning: a pretrained model is fine-tuned on synthetic data, improving prediction accuracy for this specific keypoint set.
- Two-stage detection pipeline: person bounding boxes are detected first, then keypoints are predicted for each box, which improves accuracy.
Model Capabilities
- Human keypoint detection
- Kinematic analysis
- Biomechanical marker prediction
- Multi-person pose estimation
Use Cases
Motion Capture
- Sports biomechanics: analyze athletes' movement and posture, providing precise joint angles and motion trajectories; outputs the precise positions of 52 anatomical markers.
Medical Rehabilitation
- Rehabilitation monitoring: track changes in a patient's posture during rehabilitation exercises and provide detailed joint-motion data for outcome assessment.
🚀 SynthPose (Transformers 🤗 VitPose Huge variant)
SynthPose is a keypoint-detection model based on the VitPose Huge backbone. It uses synthetic data to fine-tune a pretrained 2D human pose model so that it predicts a denser set of keypoints, enabling accurate kinematic analysis.
✨ Main Features
- New approach: SynthPose is a new method that fine-tunes pretrained 2D human pose models to predict a denser set of keypoints for accurate kinematic analysis.
- Rich keypoint set: the model predicts 52 keypoints; the first 17 are the COCO keypoints and the remaining 35 are anatomical markers.
- Two-stage detection: a two-stage pipeline first detects the people in an image and then predicts keypoints for each detected person.
📚 Documentation
Model background
The SynthPose model was proposed by Yoni Gozlan, Antoine Falisse, Scott Uhlrich, Anthony Gatti, Michael Black and Akshay Chaudhari in the paper OpenCapBench: A Benchmark to Bridge Pose Estimation and Biomechanics. It was contributed by Yoni Gozlan.
Intended use cases
This model uses a VitPose Huge backbone. SynthPose is a new approach that uses synthetic data to fine-tune a pretrained 2D human pose model to predict an arbitrarily denser set of keypoints for accurate kinematic analysis. This particular variant was fine-tuned on a set of keypoints typically found in motion-capture setups, which also includes the COCO keypoints.
The model predicts the following 52 markers:
```python
{
    0: "Nose",
    1: "L_Eye",
    2: "R_Eye",
    3: "L_Ear",
    4: "R_Ear",
    5: "L_Shoulder",
    6: "R_Shoulder",
    7: "L_Elbow",
    8: "R_Elbow",
    9: "L_Wrist",
    10: "R_Wrist",
    11: "L_Hip",
    12: "R_Hip",
    13: "L_Knee",
    14: "R_Knee",
    15: "L_Ankle",
    16: "R_Ankle",
    17: "sternum",
    18: "rshoulder",
    19: "lshoulder",
    20: "r_lelbow",
    21: "l_lelbow",
    22: "r_melbow",
    23: "l_melbow",
    24: "r_lwrist",
    25: "l_lwrist",
    26: "r_mwrist",
    27: "l_mwrist",
    28: "r_ASIS",
    29: "l_ASIS",
    30: "r_PSIS",
    31: "l_PSIS",
    32: "r_knee",
    33: "l_knee",
    34: "r_mknee",
    35: "l_mknee",
    36: "r_ankle",
    37: "l_ankle",
    38: "r_mankle",
    39: "l_mankle",
    40: "r_5meta",
    41: "l_5meta",
    42: "r_toe",
    43: "l_toe",
    44: "r_big_toe",
    45: "l_big_toe",
    46: "l_calc",
    47: "r_calc",
    48: "C7",
    49: "L2",
    50: "T11",
    51: "T6",
}
```
The first 17 keypoints are the COCO keypoints, and the following 35 are anatomical markers.
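For programmatic post-processing it can help to have this mapping as ordered name lists. The snippet below is a minimal sketch (the `COCO_NAMES`, `ANATOMICAL_NAMES`, `MARKER_NAMES` and `split_keypoints` names are chosen here for illustration and are not part of the model API); it simply flattens the dictionary above into index order and splits a prediction into its two groups.

```python
# Minimal sketch: the index -> name mapping above flattened into index order,
# plus a helper to split a (52, 2) keypoint array into its two groups.
COCO_NAMES = [
    "Nose", "L_Eye", "R_Eye", "L_Ear", "R_Ear", "L_Shoulder", "R_Shoulder",
    "L_Elbow", "R_Elbow", "L_Wrist", "R_Wrist", "L_Hip", "R_Hip",
    "L_Knee", "R_Knee", "L_Ankle", "R_Ankle",
]
ANATOMICAL_NAMES = [
    "sternum", "rshoulder", "lshoulder", "r_lelbow", "l_lelbow", "r_melbow",
    "l_melbow", "r_lwrist", "l_lwrist", "r_mwrist", "l_mwrist", "r_ASIS",
    "l_ASIS", "r_PSIS", "l_PSIS", "r_knee", "l_knee", "r_mknee", "l_mknee",
    "r_ankle", "l_ankle", "r_mankle", "l_mankle", "r_5meta", "l_5meta",
    "r_toe", "l_toe", "r_big_toe", "l_big_toe", "l_calc", "r_calc",
    "C7", "L2", "T11", "T6",
]
MARKER_NAMES = COCO_NAMES + ANATOMICAL_NAMES  # indices 0..51

def split_keypoints(keypoints):
    """Split a (52, 2) keypoint array into (COCO keypoints, anatomical markers)."""
    return keypoints[:17], keypoints[17:]
```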
💻 Usage Examples
Basic usage
Here is how to load the model and run inference on an image:
```python
import torch
import requests
import numpy as np

from PIL import Image

from transformers import (
    AutoProcessor,
    RTDetrForObjectDetection,
    VitPoseForPoseEstimation,
)

device = "cuda" if torch.cuda.is_available() else "cpu"

url = "http://farm4.staticflickr.com/3300/3416216247_f9c6dfc939_z.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# ------------------------------------------------------------------------
# Stage 1. Detect humans on the image
# ------------------------------------------------------------------------

# You can use any person detector of your choice
person_image_processor = AutoProcessor.from_pretrained("PekingU/rtdetr_r50vd_coco_o365")
person_model = RTDetrForObjectDetection.from_pretrained("PekingU/rtdetr_r50vd_coco_o365", device_map=device)

inputs = person_image_processor(images=image, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = person_model(**inputs)

results = person_image_processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([(image.height, image.width)]), threshold=0.3
)
result = results[0]  # take first image results

# The "person" class corresponds to label index 0 in the COCO dataset
person_boxes = result["boxes"][result["labels"] == 0]
person_boxes = person_boxes.cpu().numpy()

# Convert boxes from VOC (x1, y1, x2, y2) to COCO (x1, y1, w, h) format
person_boxes[:, 2] = person_boxes[:, 2] - person_boxes[:, 0]
person_boxes[:, 3] = person_boxes[:, 3] - person_boxes[:, 1]

# ------------------------------------------------------------------------
# Stage 2. Detect keypoints for each person found
# ------------------------------------------------------------------------

image_processor = AutoProcessor.from_pretrained("yonigozlan/synthpose-vitpose-huge-hf")
model = VitPoseForPoseEstimation.from_pretrained("yonigozlan/synthpose-vitpose-huge-hf", device_map=device)

inputs = image_processor(image, boxes=[person_boxes], return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)

pose_results = image_processor.post_process_pose_estimation(outputs, boxes=[person_boxes])
image_pose_result = pose_results[0]  # results for first image
```
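At this point `image_pose_result` is a list with one entry per detected person; each entry is a dict whose `"keypoints"` and `"scores"` tensors are used by the visualization examples below. A minimal sketch for a quick sanity check of the first person's predictions:

```python
# Minimal sketch: print the predicted keypoints and confidences for the first person.
first_person = image_pose_result[0]
for idx, (xy, score) in enumerate(zip(first_person["keypoints"], first_person["scores"])):
    x, y = (float(v) for v in xy)
    print(f"keypoint {idx:2d}: x={x:7.1f}  y={y:7.1f}  score={float(score):.2f}")
```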
Advanced usage
Visualization with the supervision library
```python
import supervision as sv

xy = torch.stack([pose_result['keypoints'] for pose_result in image_pose_result]).cpu().numpy()
scores = torch.stack([pose_result['scores'] for pose_result in image_pose_result]).cpu().numpy()

key_points = sv.KeyPoints(
    xy=xy, confidence=scores
)

vertex_annotator = sv.VertexAnnotator(
    color=sv.Color.PINK,
    radius=2
)
annotated_frame = vertex_annotator.annotate(
    scene=image.copy(),
    key_points=key_points
)
annotated_frame
```
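The trailing `annotated_frame` expression displays the result in a notebook. In a plain script you can save it instead; the snippet below is a small sketch (the output filename is arbitrary) that hedges on whether the annotator returns a PIL image or a NumPy array.

```python
# Minimal sketch: persist the annotated image, whichever type the annotator returned.
if isinstance(annotated_frame, Image.Image):
    annotated_frame.save("synthpose_keypoints.png")
else:
    Image.fromarray(annotated_frame).save("synthpose_keypoints.png")
```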
Advanced manual visualization
```python
import math
import cv2

def draw_points(image, keypoints, scores, pose_keypoint_color, keypoint_score_threshold, radius, show_keypoint_weight):
    if pose_keypoint_color is not None:
        assert len(pose_keypoint_color) == len(keypoints)
    for kid, (kpt, kpt_score) in enumerate(zip(keypoints, scores)):
        x_coord, y_coord = int(kpt[0]), int(kpt[1])
        if kpt_score > keypoint_score_threshold:
            color = tuple(int(c) for c in pose_keypoint_color[kid])
            if show_keypoint_weight:
                cv2.circle(image, (int(x_coord), int(y_coord)), radius, color, -1)
                transparency = max(0, min(1, kpt_score))
                cv2.addWeighted(image, transparency, image, 1 - transparency, 0, dst=image)
            else:
                cv2.circle(image, (int(x_coord), int(y_coord)), radius, color, -1)

def draw_links(image, keypoints, scores, keypoint_edges, link_colors, keypoint_score_threshold, thickness, show_keypoint_weight, stick_width=2):
    height, width, _ = image.shape
    if keypoint_edges is not None and link_colors is not None:
        assert len(link_colors) == len(keypoint_edges)
        for sk_id, sk in enumerate(keypoint_edges):
            x1, y1, score1 = (int(keypoints[sk[0], 0]), int(keypoints[sk[0], 1]), scores[sk[0]])
            x2, y2, score2 = (int(keypoints[sk[1], 0]), int(keypoints[sk[1], 1]), scores[sk[1]])
            if (
                x1 > 0
                and x1 < width
                and y1 > 0
                and y1 < height
                and x2 > 0
                and x2 < width
                and y2 > 0
                and y2 < height
                and score1 > keypoint_score_threshold
                and score2 > keypoint_score_threshold
            ):
                color = tuple(int(c) for c in link_colors[sk_id])
                if show_keypoint_weight:
                    X = (x1, x2)
                    Y = (y1, y2)
                    mean_x = np.mean(X)
                    mean_y = np.mean(Y)
                    length = ((Y[0] - Y[1]) ** 2 + (X[0] - X[1]) ** 2) ** 0.5
                    angle = math.degrees(math.atan2(Y[0] - Y[1], X[0] - X[1]))
                    polygon = cv2.ellipse2Poly(
                        (int(mean_x), int(mean_y)), (int(length / 2), int(stick_width)), int(angle), 0, 360, 1
                    )
                    cv2.fillConvexPoly(image, polygon, color)
                    # use the link endpoints' scores for the blend weight (keypoints only has x, y columns)
                    transparency = max(0, min(1, 0.5 * (scores[sk[0]] + scores[sk[1]])))
                    cv2.addWeighted(image, transparency, image, 1 - transparency, 0, dst=image)
                else:
                    cv2.line(image, (x1, y1), (x2, y2), color, thickness=thickness)


# Note: keypoint_edges and color palette are dataset-specific
keypoint_edges = model.config.edges

palette = np.array(
    [
        [255, 128, 0],
        [255, 153, 51],
        [255, 178, 102],
        [230, 230, 0],
        [255, 153, 255],
        [153, 204, 255],
        [255, 102, 255],
        [255, 51, 255],
        [102, 178, 255],
        [51, 153, 255],
        [255, 153, 153],
        [255, 102, 102],
        [255, 51, 51],
        [153, 255, 153],
        [102, 255, 102],
        [51, 255, 51],
        [0, 255, 0],
        [0, 0, 255],
        [255, 0, 0],
        [255, 255, 255],
    ]
)

link_colors = palette[[0, 0, 0, 0, 7, 7, 7, 9, 9, 9, 9, 9, 16, 16, 16, 16, 16, 16, 16]]
keypoint_colors = palette[[16, 16, 16, 16, 16, 9, 9, 9, 9, 9, 9, 0, 0, 0, 0, 0, 0] + [4] * (52 - 17)]

numpy_image = np.array(image)

for pose_result in image_pose_result:
    scores = np.array(pose_result["scores"])
    keypoints = np.array(pose_result["keypoints"])

    # draw each point on image
    draw_points(numpy_image, keypoints, scores, keypoint_colors, keypoint_score_threshold=0.3, radius=2, show_keypoint_weight=False)

    # draw links
    draw_links(numpy_image, keypoints, scores, keypoint_edges, link_colors, keypoint_score_threshold=0.3, thickness=1, show_keypoint_weight=False)

pose_image = Image.fromarray(numpy_image)
pose_image
```
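Since the model targets kinematic analysis, a common next step is to turn keypoints into joint angles. The sketch below is only an illustration (the `joint_angle` helper and the index constants are names chosen here, and the angle is measured in the 2D image plane rather than in 3D): it computes the angle at the right knee from the R_Hip, R_Knee and R_Ankle keypoints of the first detected person, using the indices from the marker table above.

```python
# Minimal sketch: 2D angle at the right knee from three predicted keypoints.
# Indices follow the marker table above: 12 = R_Hip, 14 = R_Knee, 16 = R_Ankle.
R_HIP, R_KNEE, R_ANKLE = 12, 14, 16

def joint_angle(a, b, c):
    """Angle in degrees at point b, formed by the segments b->a and b->c."""
    a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
    v1, v2 = a - b, c - b
    cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0))))

kpts = image_pose_result[0]["keypoints"].cpu().numpy()
print(f"right knee angle: {joint_angle(kpts[R_HIP], kpts[R_KNEE], kpts[R_ANKLE]):.1f} degrees")
```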
📄 License
This project is licensed under the Apache-2.0 license.