# Vision Transformer
## Clip Vitl14 Test Time Registers
Builds on the OpenCLIP ViT-L/14 model and introduces test-time registers to improve the model's interpretability and downstream-task performance.
License: MIT · Task: Text-to-Image · Library: Transformers · Author: amildravid4292 · Downloads: 236 · Likes: 0
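
The register idea is to splice extra, otherwise-unused tokens into the ViT token sequence so that high-norm "artifact" activations land in them instead of corrupting the patch tokens; the test-time variant does this at inference without retraining. Below is a minimal, illustrative PyTorch sketch of the mechanism only, not the author's implementation; the function name, shapes, and zero-initialization are assumptions.

```python
import torch

def add_test_time_registers(tokens: torch.Tensor, num_registers: int = 4) -> torch.Tensor:
    """Append register tokens to a ViT token sequence at inference time.

    tokens: (batch, 1 + num_patches, dim) -- [CLS] followed by patch tokens.
    The registers here are zero-initialized placeholders; the actual method
    redirects specific "artifact" activations into them.
    """
    batch, _, dim = tokens.shape
    registers = tokens.new_zeros(batch, num_registers, dim)
    return torch.cat([tokens, registers], dim=1)

# After the transformer blocks, the register outputs are simply discarded:
# out = out[:, : 1 + num_patches]
```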
## Virtus
A Vision Transformer-based binary classifier for detecting deepfake images, reporting 99.2% accuracy.
License: MIT · Task: Image Classification · Library: Transformers · Author: agasta · Downloads: 970 · Likes: 1
## Coco Instance Eomt Large 1280
Reinterprets the Vision Transformer (ViT) as an image segmentation model, demonstrating ViT's potential for segmentation tasks.
License: MIT · Task: Image Segmentation · Author: tue-mps · Downloads: 105 · Likes: 0
## Ade20k Panoptic Eomt Giant 1280
Reinterprets the Vision Transformer (ViT) as an image segmentation model, revealing ViT's potential for segmentation tasks.
License: MIT · Task: Image Segmentation · Author: tue-mps · Downloads: 96 · Likes: 0
## Ade20k Panoptic Eomt Large 1280
An image segmentation model built on the Vision Transformer (ViT), revealing ViT's potential for segmentation tasks.
License: MIT · Task: Image Segmentation · Library: PyTorch · Author: tue-mps · Downloads: 129 · Likes: 0
## Ade20k Panoptic Eomt Large 640
Reinterprets the Vision Transformer (ViT) as an image segmentation model, demonstrating ViT's potential for segmentation tasks.
License: MIT · Task: Image Segmentation · Library: PyTorch · Author: tue-mps · Downloads: 105 · Likes: 0
## Ade20k Panoptic Eomt Giant 640
Reveals the potential of the Vision Transformer (ViT) for image segmentation by adapting its architecture specifically for segmentation.
License: MIT · Task: Image Segmentation · Author: tue-mps · Downloads: 116 · Likes: 0
## Coco Panoptic Eomt Giant 640
Reveals the potential of the Vision Transformer (ViT) for image segmentation tasks.
License: MIT · Task: Image Segmentation · Author: tue-mps · Downloads: 92 · Likes: 0
## Coco Panoptic Eomt Large 1280
Treats the Vision Transformer (ViT) as an image segmentation model and explores its potential for segmentation tasks.
License: MIT · Task: Image Segmentation · Author: tue-mps · Downloads: 119 · Likes: 0
## Ade20k Semantic Eomt Large 512
A Vision Transformer model for image segmentation, from the paper 'Your ViT is Actually an Image Segmentation Model'.
License: MIT · Task: Image Segmentation · Author: tue-mps · Downloads: 108 · Likes: 0
## Cityscapes Semantic Eomt Large 1024
Turns the Vision Transformer (ViT) into an efficient image segmentation model, revealing ViT's potential for segmentation tasks.
License: MIT · Task: Image Segmentation · Author: tue-mps · Downloads: 85 · Likes: 0
## Coco Panoptic Eomt Large 640
Adapts the Vision Transformer (ViT) architecture for segmentation, revealing ViT's potential for segmentation tasks.
License: MIT · Task: Image Segmentation · Author: tue-mps · Downloads: 217 · Likes: 0
## Coco Instance Eomt Large 640
Reinterprets the Vision Transformer (ViT) as an image segmentation model, demonstrating ViT's potential for segmentation tasks.
License: MIT · Task: Image Segmentation · Author: tue-mps · Downloads: 99 · Likes: 0
## Coco Panoptic Eomt Giant 1280
Rethinks the Vision Transformer (ViT) architecture to demonstrate its potential for image segmentation tasks.
License: MIT · Task: Image Segmentation · Library: PyTorch · Author: tue-mps · Downloads: 90 · Likes: 0
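
All of the tue-mps EoMT checkpoints above follow the same usage pattern. Below is a hedged sketch of panoptic inference with Hugging Face transformers, assuming the checkpoints are integrated as `EomtForUniversalSegmentation` (recent transformers releases) and that the repo id matches the listing name; both are assumptions.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, EomtForUniversalSegmentation

model_id = "tue-mps/coco_panoptic_eomt_large_640"  # repo id assumed from the listing name
processor = AutoImageProcessor.from_pretrained(model_id)
model = EomtForUniversalSegmentation.from_pretrained(model_id)

image = Image.open("street.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Merge the predicted masks and classes into a single panoptic map.
panoptic = processor.post_process_panoptic_segmentation(
    outputs, target_sizes=[(image.height, image.width)]
)[0]
print(panoptic["segmentation"].shape, len(panoptic["segments_info"]))
```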
## Smart Tv Hand Gestures Image Detection
A smart-TV gesture recognition model based on the Vision Transformer architecture, able to accurately classify 9 common hand gestures.
License: Apache-2.0 · Task: Image Classification · Library: Transformers · Author: dima806 · Downloads: 65 · Likes: 1
## Ai Vs Human Generated Image Detection
A Vision Transformer (ViT)-based image classifier that distinguishes AI-generated images from human-created ones, reporting 98% accuracy.
License: Apache-2.0 · Task: Image Classification · Library: Transformers · Author: dima806 · Downloads: 148 · Likes: 2
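
Both dima806 entries are standard ViT classification checkpoints, so the transformers image-classification pipeline should cover them. A sketch; the repo id below is an assumption derived from the listing name.

```python
from transformers import pipeline

# Repo id assumed from the listing name; any ViT classifier checkpoint works the same way.
detector = pipeline(
    "image-classification",
    model="dima806/ai_vs_human_generated_image_detection",
)
print(detector("sample.jpg"))  # hypothetical output: [{'label': ..., 'score': ...}, ...]
```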
## Vitpose Plus Huge
ViTPose++ is a vision-Transformer-based foundation model for human pose estimation, reaching 81.1 AP on the MS COCO keypoint test set.
License: Apache-2.0 · Task: Pose Estimation · Library: Transformers · Author: usyd-community · Downloads: 14.49k · Likes: 6
## Vitpose Plus Large
ViTPose++ is a vision-Transformer-based foundation model for human pose estimation, reaching 81.1 AP on the MS COCO keypoint test set.
License: Apache-2.0 · Task: Pose Estimation · Library: Transformers · Author: usyd-community · Downloads: 1,731 · Likes: 1
## Vitpose Plus Small
ViTPose++ is a vision-Transformer-based human pose estimation model, reaching 81.1 AP on the MS COCO keypoint benchmark.
License: Apache-2.0 · Task: Pose Estimation · Library: Transformers · Author: usyd-community · Downloads: 30.02k · Likes: 2
## Vitpose Plus Base
ViTPose++ is a vision-Transformer-based human pose estimation model that reaches 81.1 AP on the MS COCO keypoint benchmark with a simple design.
License: Apache-2.0 · Task: Pose Estimation · Library: Transformers · Language: English · Author: usyd-community · Downloads: 22.26k · Likes: 10
## Vitpose Base Coco Aic Mpii
ViTPose is a Vision Transformer-based human pose estimation model that achieves strong results on benchmarks such as MS COCO with a simple architectural design.
License: Apache-2.0 · Task: Pose Estimation · Library: Transformers · Language: English · Author: usyd-community · Downloads: 38 · Likes: 1
## Vitpose Base
A vision-Transformer-based human pose estimation model reaching 81.1 AP on the MS COCO keypoint test set.
License: Apache-2.0 · Task: Pose Estimation · Library: Transformers · Language: English · Author: usyd-community · Downloads: 761 · Likes: 9
## Vitpose Base Simple
ViTPose is a Vision Transformer-based human pose estimation model reaching 81.1 AP on the MS COCO keypoint test set, with a simple architecture, scalable model size, and flexible training.
License: Apache-2.0 · Task: Pose Estimation · Library: Transformers · Language: English · Author: usyd-community · Downloads: 51.40k · Likes: 20
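
The usyd-community ViTPose checkpoints are top-down pose estimators: they take person bounding boxes and return keypoints per box. A sketch with transformers, assuming the repo id matches the listing name; in practice the boxes would come from a person detector.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, VitPoseForPoseEstimation

model_id = "usyd-community/vitpose-base-simple"  # repo id assumed from the listing name
processor = AutoProcessor.from_pretrained(model_id)
model = VitPoseForPoseEstimation.from_pretrained(model_id)

image = Image.open("person.jpg")
# One (x, y, w, h) person box per image; the whole image is used as a stand-in here.
boxes = [[[0.0, 0.0, image.width, image.height]]]
inputs = processor(image, boxes=boxes, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Rescale heatmap peaks back to image coordinates, one result per box.
keypoints = processor.post_process_pose_estimation(outputs, boxes=boxes)[0]
```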
## Aimv2 3b Patch14 448.apple Pt
AIMv2 is a 3B-parameter image encoder packaged for the timm library, suited to image feature extraction.
Task: Image Classification · Library: Transformers · Author: timm · Downloads: 79 · Likes: 0
## Aimv2 3b Patch14 336.apple Pt
AIMv2 is an image encoder packaged for the timm library, suited to image feature extraction.
Task: Image Classification · Library: Transformers · Author: timm · Downloads: 35 · Likes: 0
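
As timm-packaged encoders, the AIMv2 checkpoints can be loaded for feature extraction by creating the model with `num_classes=0` (pooled embeddings, no classifier head). A sketch, with the timm model name assumed from the listing:

```python
import timm
import torch
from PIL import Image

# Model name assumed from the listing; num_classes=0 returns pooled features.
model = timm.create_model("aimv2_3b_patch14_336.apple_pt", pretrained=True, num_classes=0)
model.eval()

# Build the preprocessing transform the checkpoint was trained with.
cfg = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**cfg, is_training=False)

img = Image.open("photo.jpg").convert("RGB")
with torch.no_grad():
    embedding = model(transform(img).unsqueeze(0))  # shape: (1, embed_dim)
```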
## Dinov2 With Registers Giant
A DINOv2 vision transformer that adds register tokens to improve the attention maps, used for self-supervised image feature extraction.
License: Apache-2.0 · Task: Image Classification · Library: Transformers · Author: facebook · Downloads: 9,811 · Likes: 6
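
Feature extraction with this checkpoint follows the usual transformers backbone pattern; a sketch, assuming the repo id matches the listing name and the model is exposed through `AutoModel`:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

model_id = "facebook/dinov2-with-registers-giant"  # repo id assumed from the listing name
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

image = Image.open("photo.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Assumed token layout: [CLS], then register tokens, then patch tokens.
cls_embedding = outputs.last_hidden_state[:, 0]
```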
## Face Emotion
A facial expression classifier fine-tuned from Google's ViT base model on the FER2013 dataset, classifying images into four facial expressions.
License: MIT · Task: Face-related · Author: gerhardien · Downloads: 34 · Likes: 6
## Ai Image Detector
Detects whether an image is real or AI-generated, using the Vision Transformer (ViT) architecture for classification.
License: MIT · Task: Image Classification · Library: PyTorch · Language: English · Author: yaya36095 · Downloads: 626 · Likes: 1
## Vitpose Base Simple
ViTPose is a baseline human pose estimation model built on plain vision transformers, delivering high-performance keypoint detection with a simple architecture.
License: Apache-2.0 · Task: Pose Estimation · Library: Transformers · Language: English · Author: danelcsb · Downloads: 20 · Likes: 1
## Vit Base Patch16 Clip 224.metaclip 2pt5b
A vision model trained on the MetaCLIP-2.5B dataset, loadable from both the OpenCLIP and timm frameworks.
Task: Image Classification · Author: timm · Downloads: 889 · Likes: 1
## Vit Base Patch16 Clip 224.metaclip 400m
A vision model trained on the MetaCLIP-400M dataset, loadable from both the OpenCLIP and timm frameworks.
Task: Image Classification · Author: timm · Downloads: 1,206 · Likes: 1
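
"Dual-framework compatible" means the same hub weights load through either timm (vision tower only) or OpenCLIP (full image+text model). A sketch under those assumptions; the model and repo names are derived from the listing, and it assumes the hub repo ships an OpenCLIP config alongside the timm weights, as such dual-use repos typically do:

```python
import timm
import open_clip

# timm route: the vision tower alone, e.g. for feature extraction.
vision = timm.create_model(
    "vit_base_patch16_clip_224.metaclip_2pt5b", pretrained=True, num_classes=0
)

# OpenCLIP route: the full CLIP model from the same hub repo (repo id assumed).
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "hf-hub:timm/vit_base_patch16_clip_224.metaclip_2pt5b"
)
tokenizer = open_clip.get_tokenizer("hf-hub:timm/vit_base_patch16_clip_224.metaclip_2pt5b")
```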
## Arabic Large Nougat
An end-to-end structured optical character recognition (OCR) system designed for Arabic, converting book-page images into structured Markdown text.
License: GPL-3.0 · Task: Image-to-Text · Library: Transformers · Languages: Multilingual · Author: MohamedRashad · Downloads: 537 · Likes: 10
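
Nougat-style OCR models are vision-encoder-decoder checkpoints in transformers: encode the page image, then autoregressively decode Markdown. A sketch of page-to-Markdown inference, with the repo id assumed from the listing name:

```python
import torch
from PIL import Image
from transformers import NougatProcessor, VisionEncoderDecoderModel

model_id = "MohamedRashad/arabic-large-nougat"  # repo id assumed from the listing name
processor = NougatProcessor.from_pretrained(model_id)
model = VisionEncoderDecoderModel.from_pretrained(model_id)

page = Image.open("page.jpg")
pixel_values = processor(page, return_tensors="pt").pixel_values
with torch.no_grad():
    output_ids = model.generate(pixel_values, max_new_tokens=1024)

markdown = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
```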
## Hair Type Image Detection
An image classifier based on Google's Vision Transformer (ViT) architecture that identifies five hairstyle types (curly, dreadlocks, twists, straight, wavy) from face images, reporting 93% accuracy.
License: Apache-2.0 · Task: Image Classification · Author: dima806 · Downloads: 143 · Likes: 2
## Sapiens Depth 0.3b Bfloat16
Sapiens is a family of vision transformers pre-trained on 300 million human images at 1024×1024 resolution, targeting human-centric vision tasks.
Task: 3D Vision · Language: English · Author: facebook · Downloads: 22 · Likes: 0
## Sapiens Seg 1b Bfloat16
A Sapiens Vision Transformer pre-trained on 300 million high-resolution human images, specializing in human-centric vision tasks.
Task: Image Segmentation · Language: English · Author: facebook · Downloads: 42 · Likes: 0
## Sapiens Pretrain 1b Bfloat16
A Sapiens Vision Transformer pre-trained on 300 million human images at 1024×1024 resolution, supporting high-resolution inference and generalizing to in-the-wild scenarios.
Task: Image Classification · Language: English · Author: facebook · Downloads: 23 · Likes: 0
## Sapiens Depth 0.3b
A Sapiens Vision Transformer pre-trained on 300 million high-resolution human images, specializing in human-centric vision tasks.
Task: 3D Vision · Language: English · Author: facebook · Downloads: 24 · Likes: 0
## Sapiens Depth 0.6b
From the Sapiens family of Vision Transformers pre-trained on 300 million human images at 1024×1024 resolution, specializing in human-centric vision tasks.
Task: 3D Vision · Language: English · Author: facebook · Downloads: 19 · Likes: 1
## Sapiens Seg 1b
A Sapiens Vision Transformer pre-trained on 300 million human images, specializing in human-centric segmentation and supporting 1K high-resolution inference.
Task: Image Segmentation · Language: English · Author: facebook · Downloads: 146 · Likes: 4
## Sapiens Pretrain 0.6b
A Sapiens Vision Transformer pre-trained on 300 million human images at 1024×1024 resolution, excelling at human-centric vision tasks.
Task: Image Classification · Language: English · Author: facebook · Downloads: 13 · Likes: 0