# Vision Transformer
## Clip Vitl14 Test Time Registers
Builds on the OpenCLIP ViT-L/14 model and introduces test-time registers to improve the model's interpretability and downstream-task performance.
License: MIT · Task: Text-to-Image · Library: Transformers · Author: amildravid4292 · Downloads: 236 · Likes: 0
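
The register idea is to splice extra, otherwise-unused tokens into the ViT token sequence so that high-norm "artifact" activations land in them instead of corrupting the patch tokens; the test-time variant does this at inference without retraining. Below is a minimal, illustrative PyTorch sketch of the mechanism only, not the author's implementation; the function name, shapes, and zero-initialization are assumptions.

```python
import torch

def add_test_time_registers(tokens: torch.Tensor, num_registers: int = 4) -> torch.Tensor:
    """Append register tokens to a ViT token sequence at inference time.

    tokens: (batch, 1 + num_patches, dim) -- [CLS] followed by patch tokens.
    The registers here are zero-initialized placeholders; the actual method
    redirects specific "artifact" activations into them.
    """
    batch, _, dim = tokens.shape
    registers = tokens.new_zeros(batch, num_registers, dim)
    return torch.cat([tokens, registers], dim=1)

# After the transformer blocks, the register outputs are simply discarded:
# out = out[:, : 1 + num_patches]
```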
## Virtus
A Vision Transformer-based binary classifier for detecting deepfake images, reporting 99.2% accuracy.
License: MIT · Task: Image Classification · Library: Transformers · Author: agasta · Downloads: 970 · Likes: 1
## Coco Instance Eomt Large 1280
Reinterprets the Vision Transformer (ViT) as an image segmentation model, demonstrating ViT's potential for segmentation tasks.
License: MIT · Task: Image Segmentation · Author: tue-mps · Downloads: 105 · Likes: 0
## Ade20k Panoptic Eomt Giant 1280
Reinterprets the Vision Transformer (ViT) as an image segmentation model, revealing ViT's potential for segmentation tasks.
License: MIT · Task: Image Segmentation · Author: tue-mps · Downloads: 96 · Likes: 0
## Ade20k Panoptic Eomt Large 1280
An image segmentation model built on the Vision Transformer (ViT), revealing ViT's potential for segmentation tasks.
License: MIT · Task: Image Segmentation · Library: PyTorch · Author: tue-mps · Downloads: 129 · Likes: 0
## Ade20k Panoptic Eomt Large 640
Reinterprets the Vision Transformer (ViT) as an image segmentation model, demonstrating ViT's potential for segmentation tasks.
License: MIT · Task: Image Segmentation · Library: PyTorch · Author: tue-mps · Downloads: 105 · Likes: 0
## Ade20k Panoptic Eomt Giant 640
Reveals the potential of the Vision Transformer (ViT) for image segmentation by adapting its architecture specifically for segmentation.
License: MIT · Task: Image Segmentation · Author: tue-mps · Downloads: 116 · Likes: 0
## Coco Panoptic Eomt Giant 640
Reveals the potential of the Vision Transformer (ViT) for image segmentation tasks.
License: MIT · Task: Image Segmentation · Author: tue-mps · Downloads: 92 · Likes: 0
## Coco Panoptic Eomt Large 1280
Treats the Vision Transformer (ViT) as an image segmentation model and explores its potential for segmentation tasks.
License: MIT · Task: Image Segmentation · Author: tue-mps · Downloads: 119 · Likes: 0
## Ade20k Semantic Eomt Large 512
A Vision Transformer model for image segmentation, from the paper 'Your ViT is Actually an Image Segmentation Model'.
License: MIT · Task: Image Segmentation · Author: tue-mps · Downloads: 108 · Likes: 0
## Cityscapes Semantic Eomt Large 1024
Turns the Vision Transformer (ViT) into an efficient image segmentation model, revealing ViT's potential for segmentation tasks.
License: MIT · Task: Image Segmentation · Author: tue-mps · Downloads: 85 · Likes: 0
## Coco Panoptic Eomt Large 640
Adapts the Vision Transformer (ViT) architecture for segmentation, revealing ViT's potential for segmentation tasks.
License: MIT · Task: Image Segmentation · Author: tue-mps · Downloads: 217 · Likes: 0
## Coco Instance Eomt Large 640
Reinterprets the Vision Transformer (ViT) as an image segmentation model, demonstrating ViT's potential for segmentation tasks.
License: MIT · Task: Image Segmentation · Author: tue-mps · Downloads: 99 · Likes: 0
## Coco Panoptic Eomt Giant 1280
Rethinks the Vision Transformer (ViT) architecture to demonstrate its potential for image segmentation tasks.
License: MIT · Task: Image Segmentation · Library: PyTorch · Author: tue-mps · Downloads: 90 · Likes: 0
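
All of the tue-mps EoMT checkpoints above follow the same usage pattern. Below is a hedged sketch of panoptic inference with Hugging Face transformers, assuming the checkpoints are integrated as `EomtForUniversalSegmentation` (recent transformers releases) and that the repo id matches the listing name; both are assumptions.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, EomtForUniversalSegmentation

model_id = "tue-mps/coco_panoptic_eomt_large_640"  # repo id assumed from the listing name
processor = AutoImageProcessor.from_pretrained(model_id)
model = EomtForUniversalSegmentation.from_pretrained(model_id)

image = Image.open("street.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Merge the predicted masks and classes into a single panoptic map.
panoptic = processor.post_process_panoptic_segmentation(
    outputs, target_sizes=[(image.height, image.width)]
)[0]
print(panoptic["segmentation"].shape, len(panoptic["segments_info"]))
```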
## Smart Tv Hand Gestures Image Detection
A smart-TV gesture recognition model based on the Vision Transformer architecture, able to accurately classify 9 common hand gestures.
License: Apache-2.0 · Task: Image Classification · Library: Transformers · Author: dima806 · Downloads: 65 · Likes: 1
## Ai Vs Human Generated Image Detection
A Vision Transformer (ViT)-based image classifier that distinguishes AI-generated images from human-created ones, reporting 98% accuracy.
License: Apache-2.0 · Task: Image Classification · Library: Transformers · Author: dima806 · Downloads: 148 · Likes: 2
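
Both dima806 entries are standard ViT classification checkpoints, so the transformers image-classification pipeline should cover them. A sketch; the repo id below is an assumption derived from the listing name.

```python
from transformers import pipeline

# Repo id assumed from the listing name; any ViT classifier checkpoint works the same way.
detector = pipeline(
    "image-classification",
    model="dima806/ai_vs_human_generated_image_detection",
)
print(detector("sample.jpg"))  # hypothetical output: [{'label': ..., 'score': ...}, ...]
```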
## Vitpose Plus Huge
ViTPose++ is a vision-Transformer-based foundation model for human pose estimation, reaching 81.1 AP on the MS COCO keypoint test set.
License: Apache-2.0 · Task: Pose Estimation · Library: Transformers · Author: usyd-community · Downloads: 14.49k · Likes: 6
## Vitpose Plus Large
ViTPose++ is a vision-Transformer-based foundation model for human pose estimation, reaching 81.1 AP on the MS COCO keypoint test set.
License: Apache-2.0 · Task: Pose Estimation · Library: Transformers · Author: usyd-community · Downloads: 1,731 · Likes: 1
## Vitpose Plus Small
ViTPose++ is a vision-Transformer-based human pose estimation model, reaching 81.1 AP on the MS COCO keypoint benchmark.
License: Apache-2.0 · Task: Pose Estimation · Library: Transformers · Author: usyd-community · Downloads: 30.02k · Likes: 2
## Vitpose Plus Base
ViTPose++ is a vision-Transformer-based human pose estimation model that reaches 81.1 AP on the MS COCO keypoint benchmark with a simple design.
License: Apache-2.0 · Task: Pose Estimation · Library: Transformers · Language: English · Author: usyd-community · Downloads: 22.26k · Likes: 10
## Vitpose Base Coco Aic Mpii
ViTPose is a Vision Transformer-based human pose estimation model that achieves strong results on benchmarks such as MS COCO with a simple architectural design.
License: Apache-2.0 · Task: Pose Estimation · Library: Transformers · Language: English · Author: usyd-community · Downloads: 38 · Likes: 1
## Vitpose Base
A vision-Transformer-based human pose estimation model reaching 81.1 AP on the MS COCO keypoint test set.
License: Apache-2.0 · Task: Pose Estimation · Library: Transformers · Language: English · Author: usyd-community · Downloads: 761 · Likes: 9
## Vitpose Base Simple
ViTPose is a Vision Transformer-based human pose estimation model reaching 81.1 AP on the MS COCO keypoint test set, with a simple architecture, scalable model size, and flexible training.
License: Apache-2.0 · Task: Pose Estimation · Library: Transformers · Language: English · Author: usyd-community · Downloads: 51.40k · Likes: 20
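
The usyd-community ViTPose checkpoints are top-down pose estimators: they take person bounding boxes and return keypoints per box. A sketch with transformers, assuming the repo id matches the listing name; in practice the boxes would come from a person detector.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, VitPoseForPoseEstimation

model_id = "usyd-community/vitpose-base-simple"  # repo id assumed from the listing name
processor = AutoProcessor.from_pretrained(model_id)
model = VitPoseForPoseEstimation.from_pretrained(model_id)

image = Image.open("person.jpg")
# One (x, y, w, h) person box per image; the whole image is used as a stand-in here.
boxes = [[[0.0, 0.0, image.width, image.height]]]
inputs = processor(image, boxes=boxes, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Rescale heatmap peaks back to image coordinates, one result per box.
keypoints = processor.post_process_pose_estimation(outputs, boxes=boxes)[0]
```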
## Aimv2 3b Patch14 448.apple Pt
AIMv2 is a 3B-parameter image encoder packaged for the timm library, suited to image feature extraction.
Task: Image Classification · Library: Transformers · Author: timm · Downloads: 79 · Likes: 0
## Aimv2 3b Patch14 336.apple Pt
AIMv2 is an image encoder packaged for the timm library, suited to image feature extraction.
Task: Image Classification · Library: Transformers · Author: timm · Downloads: 35 · Likes: 0
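
As timm-packaged encoders, the AIMv2 checkpoints can be loaded for feature extraction by creating the model with `num_classes=0` (pooled embeddings, no classifier head). A sketch, with the timm model name assumed from the listing:

```python
import timm
import torch
from PIL import Image

# Model name assumed from the listing; num_classes=0 returns pooled features.
model = timm.create_model("aimv2_3b_patch14_336.apple_pt", pretrained=True, num_classes=0)
model.eval()

# Build the preprocessing transform the checkpoint was trained with.
cfg = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**cfg, is_training=False)

img = Image.open("photo.jpg").convert("RGB")
with torch.no_grad():
    embedding = model(transform(img).unsqueeze(0))  # shape: (1, embed_dim)
```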
## Dinov2 With Registers Giant
A DINOv2 vision transformer that adds register tokens to improve the attention maps, used for self-supervised image feature extraction.
License: Apache-2.0 · Task: Image Classification · Library: Transformers · Author: facebook · Downloads: 9,811 · Likes: 6
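
Feature extraction with this checkpoint follows the usual transformers backbone pattern; a sketch, assuming the repo id matches the listing name and the model is exposed through `AutoModel`:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

model_id = "facebook/dinov2-with-registers-giant"  # repo id assumed from the listing name
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

image = Image.open("photo.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Assumed token layout: [CLS], then register tokens, then patch tokens.
cls_embedding = outputs.last_hidden_state[:, 0]
```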
## Face Emotion
A facial expression classifier fine-tuned from Google's ViT base model on the FER2013 dataset, classifying images into four facial expressions.
License: MIT · Task: Face-related · Author: gerhardien · Downloads: 34 · Likes: 6
## Ai Image Detector
Detects whether an image is real or AI-generated, using the Vision Transformer (ViT) architecture for classification.
License: MIT · Task: Image Classification · Library: PyTorch · Language: English · Author: yaya36095 · Downloads: 626 · Likes: 1
## Vitpose Base Simple
ViTPose is a baseline human pose estimation model built on plain vision transformers, delivering high-performance keypoint detection with a simple architecture.
License: Apache-2.0 · Task: Pose Estimation · Library: Transformers · Language: English · Author: danelcsb · Downloads: 20 · Likes: 1
## Vit Base Patch16 Clip 224.metaclip 2pt5b
A vision model trained on the MetaCLIP-2.5B dataset, loadable from both the OpenCLIP and timm frameworks.
Task: Image Classification · Author: timm · Downloads: 889 · Likes: 1
## Vit Base Patch16 Clip 224.metaclip 400m
A vision model trained on the MetaCLIP-400M dataset, loadable from both the OpenCLIP and timm frameworks.
Task: Image Classification · Author: timm · Downloads: 1,206 · Likes: 1
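
"Dual-framework compatible" means the same hub weights load through either timm (vision tower only) or OpenCLIP (full image+text model). A sketch under those assumptions; the model and repo names are derived from the listing, and it assumes the hub repo ships an OpenCLIP config alongside the timm weights, as such dual-use repos typically do:

```python
import timm
import open_clip

# timm route: the vision tower alone, e.g. for feature extraction.
vision = timm.create_model(
    "vit_base_patch16_clip_224.metaclip_2pt5b", pretrained=True, num_classes=0
)

# OpenCLIP route: the full CLIP model from the same hub repo (repo id assumed).
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "hf-hub:timm/vit_base_patch16_clip_224.metaclip_2pt5b"
)
tokenizer = open_clip.get_tokenizer("hf-hub:timm/vit_base_patch16_clip_224.metaclip_2pt5b")
```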
## Arabic Large Nougat
An end-to-end structured optical character recognition (OCR) system designed for Arabic, converting book-page images into structured Markdown text.
License: GPL-3.0 · Task: Image-to-Text · Library: Transformers · Languages: Multilingual · Author: MohamedRashad · Downloads: 537 · Likes: 10
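
Nougat-style OCR models are vision-encoder-decoder checkpoints in transformers: encode the page image, then autoregressively decode Markdown. A sketch of page-to-Markdown inference, with the repo id assumed from the listing name:

```python
import torch
from PIL import Image
from transformers import NougatProcessor, VisionEncoderDecoderModel

model_id = "MohamedRashad/arabic-large-nougat"  # repo id assumed from the listing name
processor = NougatProcessor.from_pretrained(model_id)
model = VisionEncoderDecoderModel.from_pretrained(model_id)

page = Image.open("page.jpg")
pixel_values = processor(page, return_tensors="pt").pixel_values
with torch.no_grad():
    output_ids = model.generate(pixel_values, max_new_tokens=1024)

markdown = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
```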
## Hair Type Image Detection
An image classifier based on Google's Vision Transformer (ViT) architecture that identifies five hairstyle types (curly, dreadlocks, twists, straight, wavy) from face images, reporting 93% accuracy.
License: Apache-2.0 · Task: Image Classification · Author: dima806 · Downloads: 143 · Likes: 2
## Sapiens Depth 0.3b Bfloat16
Sapiens is a family of vision transformers pre-trained on 300 million human images at 1024×1024 resolution, targeting human-centric vision tasks.
Task: 3D Vision · Language: English · Author: facebook · Downloads: 22 · Likes: 0
## Sapiens Seg 1b Bfloat16
A Sapiens Vision Transformer pre-trained on 300 million high-resolution human images, specializing in human-centric vision tasks.
Task: Image Segmentation · Language: English · Author: facebook · Downloads: 42 · Likes: 0
## Sapiens Pretrain 1b Bfloat16
A Sapiens Vision Transformer pre-trained on 300 million human images at 1024×1024 resolution, supporting high-resolution inference and generalizing to in-the-wild scenarios.
Task: Image Classification · Language: English · Author: facebook · Downloads: 23 · Likes: 0
## Sapiens Depth 0.3b
A Sapiens Vision Transformer pre-trained on 300 million high-resolution human images, specializing in human-centric vision tasks.
Task: 3D Vision · Language: English · Author: facebook · Downloads: 24 · Likes: 0
## Sapiens Depth 0.6b
From the Sapiens family of Vision Transformers pre-trained on 300 million human images at 1024×1024 resolution, specializing in human-centric vision tasks.
Task: 3D Vision · Language: English · Author: facebook · Downloads: 19 · Likes: 1
## Sapiens Seg 1b
A Sapiens Vision Transformer pre-trained on 300 million human images, specializing in human-centric segmentation and supporting 1K high-resolution inference.
Task: Image Segmentation · Language: English · Author: facebook · Downloads: 146 · Likes: 4
## Sapiens Pretrain 0.6b
A Sapiens Vision Transformer pre-trained on 300 million human images at 1024×1024 resolution, excelling at human-centric vision tasks.
Task: Image Classification · Language: English · Author: facebook · Downloads: 13 · Likes: 0