# Large-scale Pretraining

## LHM 500M
Publisher: 3DAIGC | License: Apache-2.0 | Tags: 3D Vision, English | Downloads: 132 | Likes: 4

LHM is a feedforward model capable of reconstructing animatable 3D humans from a single image within seconds.

## De Wiki Mlm 13
Publisher: fpadovani | Tags: Large Language Model, Transformers | Downloads: 35 | Likes: 1

A language model fine-tuned on an unknown dataset, trained using the Transformers library.

## Siglip2 Giant Opt Patch16 384
Publisher: google | License: Apache-2.0 | Tags: Text-to-Image, Transformers | Downloads: 26.12k | Likes: 14

SigLIP 2 builds on the SigLIP pretraining objective, combining several techniques to improve semantic understanding, localization, and dense feature extraction.

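As a usage sketch (not part of the listing), a SigLIP-family checkpoint like this one can typically be driven through the Transformers zero-shot image classification pipeline; the repo id below is assumed from this entry.

```python
# Minimal sketch: zero-shot image classification with a SigLIP 2 checkpoint.
# Assumption: the Hub repo id is "google/siglip2-giant-opt-patch16-384".
from transformers import pipeline

classifier = pipeline(
    "zero-shot-image-classification",
    model="google/siglip2-giant-opt-patch16-384",
)
result = classifier(
    "path/to/photo.jpg",  # local path or URL
    candidate_labels=["a cat", "a dog", "a car"],
)
print(result)  # list of {"label": ..., "score": ...} dictionaries
```
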
## Owls 4B 180K
Publisher: espnet | Tags: Speech Recognition, Other | Downloads: 40 | Likes: 5

OWLS is a suite of Whisper-style models designed to help researchers understand the scaling properties of speech models, supporting multilingual speech recognition and translation.

## Llave 7B
Publisher: zhibinlan | License: Apache-2.0 | Tags: Multimodal Fusion, Transformers, English | Downloads: 1,389 | Likes: 5

LLaVE-7B is a 7-billion-parameter multimodal embedding model based on LLaVA-OneVision-7B, capable of producing embedding representations for text, images, multiple images, and videos.

## Mt0 Xxl Mt Q4 K M GGUF
Publisher: Markobes | License: Apache-2.0 | Tags: Large Language Model, Supports Multiple Languages | Downloads: 14 | Likes: 1

A multilingual text generation model converted from bigscience/mt0-xxl-mt to GGUF format via llama.cpp, supporting a wide range of language tasks.

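As a usage sketch (not part of the listing), a GGUF quantization like this one is typically loaded through llama.cpp or its Python bindings; the filename below is a placeholder, and it is assumed that the installed llama.cpp build supports mt0's encoder-decoder (T5-style) architecture.

```python
# Minimal sketch: running a local GGUF file with llama-cpp-python.
# Assumptions: the file was downloaded locally under the placeholder name below,
# and the llama.cpp build in use supports T5-style encoder-decoder models.
from llama_cpp import Llama

llm = Llama(model_path="mt0-xxl-mt-q4_k_m.gguf")  # placeholder path
output = llm("Translate to German: The weather is nice today.", max_tokens=64)
print(output["choices"][0]["text"])
```
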
## C RADIOv2 G
Publisher: nvidia | License: Other | Tags: Transformers | Downloads: 648 | Likes: 11

C-RADIOv2 is a visual feature extraction model developed by NVIDIA, available in multiple sizes and suited to image understanding and dense vision tasks.

## Videomaev2 Giant
Publisher: OpenGVLab | Tags: Video Processing | Downloads: 1,071 | Likes: 4

VideoMAEv2-giant is an ultra-large-scale video classification model based on self-supervised learning, employing a dual masking strategy for pretraining.

## Reloc3r 512
Publisher: siyan824 | Tags: Pose Estimation | Downloads: 840 | Likes: 4

Reloc3r is a concise and efficient camera pose estimation framework that combines a pretrained two-view relative camera pose regression network with a multi-view motion averaging module.

## Eva02 Large Patch14 Clip 336.merged2b
Publisher: timm | License: MIT | Tags: Text-to-Image | Downloads: 197 | Likes: 0

EVA02 CLIP is a large-scale vision-language model based on the CLIP architecture, supporting tasks such as zero-shot image classification.

## Vit So400m Patch14 Siglip Gap 384.webli
Publisher: timm | License: Apache-2.0 | Tags: Image Classification, Transformers | Downloads: 96 | Likes: 0

A SigLIP-based Vision Transformer that uses global average pooling over image features.

## Vit So400m Patch14 Siglip 224.webli
Publisher: timm | License: Apache-2.0 | Tags: Image Classification, Transformers | Downloads: 123 | Likes: 1

A SigLIP-based Vision Transformer containing only the image encoder and using the original attention pooling mechanism.

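As a usage sketch (not part of the listing), timm image encoders such as these can serve directly as feature extractors; the timm model name below is assumed to correspond to this entry.

```python
# Minimal sketch: pooled image embeddings from a timm SigLIP encoder.
# Assumption: the timm model name is "vit_so400m_patch14_siglip_224.webli".
import timm
import torch

model = timm.create_model(
    "vit_so400m_patch14_siglip_224.webli",
    pretrained=True,
    num_classes=0,  # strip the classifier head to get embeddings
)
model.eval()

# The transform below is what you would apply to a PIL image before inference.
cfg = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**cfg, is_training=False)

dummy = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image batch
with torch.no_grad():
    features = model(dummy)
print(features.shape)  # (1, embedding_dim)
```
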
## Vit Giant Patch14 Clip 224.laion2b
Publisher: timm | License: Apache-2.0 | Tags: Image Classification, Transformers | Downloads: 71 | Likes: 0

A CLIP-architecture Vision Transformer designed for image feature extraction, trained on the LAION-2B dataset.

## Vit Base Patch16 Clip 224.laion2b
Publisher: timm | License: Apache-2.0 | Tags: Image Classification, Transformers | Downloads: 4,460 | Likes: 0

A CLIP-architecture Vision Transformer containing only the image encoder, suitable for image feature extraction tasks.

## Convnext Large Mlp.clip Laion2b Ft Soup 320
Publisher: timm | License: Apache-2.0 | Tags: Image Classification, Transformers | Downloads: 173 | Likes: 0

A CLIP-architecture ConvNeXt-Large image encoder, fine-tuned on the LAION-2B dataset and supporting image feature extraction at 320x320 resolution.

## Convnext Large Mlp.clip Laion2b Augreg
Publisher: timm | License: Apache-2.0 | Tags: Image Classification, Transformers | Downloads: 107 | Likes: 0

A CLIP-framework ConvNeXt-Large image encoder trained on the LAION-2B dataset, supporting visual feature extraction.

## Aimv2 Large Patch14 224 Lit
Publisher: apple | Tags: Image-to-Text | Downloads: 222 | Likes: 6

AIMv2 is a series of vision models pretrained with multimodal autoregressive objectives, demonstrating outstanding performance across multiple multimodal understanding benchmarks.

## Camembertav2 Base
Publisher: almanach | License: MIT | Tags: Large Language Model, Transformers, French | Downloads: 2,972 | Likes: 19

CamemBERTav2 is a French language model pretrained on 275 billion tokens of French text, built on the DebertaV2 architecture, and performs strongly across a range of French NLP tasks.

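As a usage sketch (not part of the listing), an encoder like CamemBERTav2 can be exercised with the Transformers fill-mask pipeline; the repo id and the "[MASK]" token below are assumptions based on this entry and the DebertaV2 tokenizer family.

```python
# Minimal sketch: French masked-token prediction with CamemBERTav2.
# Assumptions: the Hub repo id is "almanach/camembertav2-base" and the mask
# token is "[MASK]".
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="almanach/camembertav2-base")
for prediction in fill_mask("Paris est la [MASK] de la France."):
    print(prediction["token_str"], round(prediction["score"], 3))
```
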
## Vit Base Patch16 Clip 224.metaclip 400m
Publisher: timm | Tags: Image Classification | Downloads: 1,206 | Likes: 1

A vision model trained on the MetaCLIP-400M dataset, compatible with both the OpenCLIP and timm frameworks.

## Resnet101 Clip.yfcc15m
Publisher: timm | License: MIT | Tags: Image Classification | Downloads: 134 | Likes: 0

A CLIP-style dual-modal model trained on the YFCC-15M dataset, compatible with both the open_clip and timm frameworks.

## Prolip ViT B 16 DC 1B 12 8B
Publisher: SanghyukChun | License: MIT | Tags: Text-to-Image, Safetensors | Downloads: 460 | Likes: 0

A Probabilistic Language-Image Pretraining (ProLIP) ViT-B/16 model pretrained on the DataComp-1B dataset.

## Rdt 1b
Publisher: robotics-diffusion-transformer | License: MIT | Tags: Multimodal Fusion, Transformers, English | Downloads: 2,644 | Likes: 80

A 1-billion-parameter imitation-learning diffusion Transformer pretrained on over 1M multi-robot operation trajectories, supporting multi-view vision-language-action prediction.

## Sam2 Hiera Large
Publisher: facebook | License: Apache-2.0 | Tags: Image Segmentation | Downloads: 155.85k | Likes: 68

A foundation model for promptable visual segmentation in images and videos, developed by FAIR.

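As a usage sketch (not part of the listing), this checkpoint can be driven through the `sam2` package's image predictor; the package, the repo id, and the point prompt below are assumptions.

```python
# Minimal sketch: point-prompted segmentation with the SAM 2 image predictor.
# Assumptions: the `sam2` package (facebookresearch/sam2) is installed and the
# Hub repo id is "facebook/sam2-hiera-large"; the image and click are placeholders.
import numpy as np
import torch
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

image = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for an RGB image array
with torch.inference_mode():
    predictor.set_image(image)
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[320, 240]]),  # one foreground click (x, y)
        point_labels=np.array([1]),           # 1 = foreground
    )
print(masks.shape, scores)
```
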
## Jina Embeddings V2 Base En Q5 K M GGUF
Publisher: djuna | License: Apache-2.0 | Tags: Text Embedding, English | Downloads: 85 | Likes: 2

Jina Embeddings V2 Base is an efficient English text embedding model focused on sentence similarity and feature extraction tasks.

## Depth Anything V2 Large Hf
Publisher: depth-anything | Tags: 3D Vision, Transformers | Downloads: 83.99k | Likes: 19

Depth Anything V2 is currently the most powerful monocular depth estimation (MDE) model, trained on 595,000 synthetically annotated images and over 62 million real unlabeled images, offering finer details and stronger robustness.

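As a usage sketch (not part of the listing), the "-hf" suffix indicates a Transformers-compatible checkpoint, so the depth-estimation pipeline should apply; the repo id below is assumed from this entry.

```python
# Minimal sketch: monocular depth estimation with the Transformers pipeline.
# Assumption: the Hub repo id is "depth-anything/Depth-Anything-V2-Large-hf".
from transformers import pipeline

depth = pipeline("depth-estimation", model="depth-anything/Depth-Anything-V2-Large-hf")
result = depth("path/to/photo.jpg")    # local path, URL, or PIL image
result["depth"].save("depth_map.png")  # predicted depth map as a PIL image
```
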
## Vit L 16 HTxt Recap CLIP
Publisher: UCSC-VLAA | Tags: Text-to-Image | Downloads: 538 | Likes: 17

A CLIP model trained on the Recap-DataComp-1B dataset using LLaMA-3-generated captions, suitable for zero-shot image classification tasks.

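As a usage sketch (not part of the listing), OpenCLIP checkpoints hosted on the Hub can usually be loaded with the "hf-hub:" prefix; the repo id below is assumed from this entry.

```python
# Minimal sketch: zero-shot classification with open_clip.
# Assumption: the Hub repo id is "UCSC-VLAA/ViT-L-16-HTxt-Recap-CLIP".
import torch
import open_clip
from PIL import Image

repo = "hf-hub:UCSC-VLAA/ViT-L-16-HTxt-Recap-CLIP"
model, _, preprocess = open_clip.create_model_and_transforms(repo)
tokenizer = open_clip.get_tokenizer(repo)

image = preprocess(Image.open("path/to/photo.jpg")).unsqueeze(0)
text = tokenizer(["a photo of a cat", "a photo of a dog"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(probs)  # per-label probabilities for the image
```
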
## Tic CLIP Basic Oracle
Publisher: apple | License: Other | Tags: Text-to-Image | Downloads: 37 | Likes: 0

TiC-CLIP is a vision-language model built on OpenCLIP, focused on continual learning over time, with training data spanning 2014 to 2022.

## Tookabert Base
Publisher: PartAI | License: Apache-2.0 | Tags: Large Language Model, Transformers, Other | Downloads: 127 | Likes: 24

TookaBERT is a family of encoder models trained on Persian, including base and large versions, suitable for a variety of natural language processing tasks.

## Llava Meta Llama 3 8B Instruct
Publisher: MBZUAI | Tags: Image-to-Text, Transformers | Downloads: 20 | Likes: 11

A multimodal model that combines Meta-Llama-3-8B-Instruct with LLaVA-v1.5, providing advanced vision-language understanding capabilities.

## Qwen Audio Nf4
Publisher: Ostixe360 | Tags: Audio-to-Text, Transformers, Supports Multiple Languages | Downloads: 134 | Likes: 1

Qwen-Audio-nf4 is an NF4-quantized version of Qwen-Audio, supporting multiple audio input types and text output.

## Pile T5 Xxl
Publisher: EleutherAI | Tags: Large Language Model, Transformers, English | Downloads: 44 | Likes: 28

Pile-T5 XXL is an encoder-decoder model trained on The Pile dataset using the T5x library, employing an MLM objective similar to the original T5 model and trained for 2 million steps (approximately 2 trillion tokens).

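As a usage sketch (not part of the listing), the checkpoint can be loaded as a seq2seq model through the Transformers Auto classes; the repo id and the sentinel-token format below are assumptions.

```python
# Minimal sketch: loading Pile-T5 XXL for span-infilling generation.
# Assumptions: the Hub repo id is "EleutherAI/pile-t5-xxl", the checkpoint loads
# through the Auto classes, and it uses T5-style "<extra_id_0>" sentinel tokens.
# device_map="auto" (requires accelerate) spreads the XXL weights across devices.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "EleutherAI/pile-t5-xxl"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("The capital of France is <extra_id_0>.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
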
## Infimm Zephyr
Publisher: Infi-MM | Tags: Image-to-Text, Transformers, English | Downloads: 23 | Likes: 10

InfiMM is a multimodal vision-language model inspired by the Flamingo architecture, integrating recent LLMs and suitable for a wide range of vision-language processing tasks.

## Kaori 70b V1
Publisher: KaeriJenti | Tags: Large Language Model, Transformers | Downloads: 907 | Likes: 2

kaori-70b-v1 is a large language model based on the LLaMA2 architecture, fine-tuned by the Kaeri and Jenti teams on the Open-Platypus, Dolphin, and OpenOrca datasets.

## Vit H 14 CLIPA 336 Datacomp1b
Publisher: UCSC-VLAA | License: Apache-2.0 | Tags: Text-to-Image | Downloads: 493 | Likes: 4

CLIPA-v2 is an efficient contrastive vision-language model focused on zero-shot image classification tasks.

## Sentence Transformers All Mini Lm L6 V2
Publisher: danielpark | License: Apache-2.0 | Tags: Text Embedding, English | Downloads: 78 | Likes: 1

A lightweight sentence embedding model built on the MiniLM architecture and optimized for efficient sentence-similarity computation.

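As a usage sketch (not part of the listing), this entry appears to mirror the upstream sentence-transformers checkpoint, which is used below; the upstream repo id is an assumption.

```python
# Minimal sketch: sentence embeddings and cosine similarity.
# Assumption: the upstream checkpoint "sentence-transformers/all-MiniLM-L6-v2"
# is equivalent to the mirror in this listing.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sentences = ["A cat sits on the mat.", "A kitten is resting on a rug."]
embeddings = model.encode(sentences)

print(util.cos_sim(embeddings[0], embeddings[1]))  # cosine similarity score
```
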
## Molm 700M 4B
Publisher: ibm-research | License: Apache-2.0 | Tags: Large Language Model, Transformers | Downloads: 36 | Likes: 6

MoLM is a series of language models based on the Mixture-of-Experts (MoE) architecture. The 700M-4B version has 4 billion parameters in total, with the computational cost of a 700-million-parameter dense model.

## Wav2vec2 Large Audioset
Publisher: ALM | Tags: Audio Classification, Transformers | Downloads: 43 | Likes: 0

An audio representation model based on the HuBERT architecture, pretrained on the complete AudioSet dataset and suitable for general audio tasks.

## Idefics 9b Instruct
Publisher: HuggingFaceM4 | License: Other | Tags: Image-to-Text, Transformers, English | Downloads: 28.34k | Likes: 104

IDEFICS is an open-source reproduction of DeepMind's proprietary vision-language model Flamingo. It is a multimodal model that accepts arbitrary sequences of images and text as input and generates text output.

## CLIP Giga Config Fixed
Publisher: Geonmo | License: MIT | Tags: Text-to-Image, Transformers | Downloads: 109 | Likes: 1

A large CLIP model trained on the LAION-2B dataset using the ViT-bigG-14 architecture, supporting cross-modal understanding between images and text.

## Video Blip Opt 2.7b Ego4d
Publisher: kpyu | License: MIT | Tags: Video-to-Text, Transformers, English | Downloads: 429 | Likes: 16

VideoBLIP is an enhanced version of BLIP-2 capable of processing video data, using OPT-2.7b as its language model backbone.