# Large-scale Pretraining

## LHM 500M
Publisher: 3DAIGC | License: Apache-2.0 | Tags: 3D Vision, English | Downloads: 132 | Likes: 4

LHM is a feedforward model capable of reconstructing animatable 3D humans from a single image within seconds.

## De Wiki Mlm 13
Publisher: fpadovani | Tags: Large Language Model, Transformers | Downloads: 35 | Likes: 1

A language model fine-tuned on an unknown dataset, trained using the Transformers library.

## Siglip2 Giant Opt Patch16 384
Publisher: google | License: Apache-2.0 | Tags: Text-to-Image, Transformers | Downloads: 26.12k | Likes: 14

SigLIP 2 builds on the SigLIP pretraining objective, combining several techniques to improve semantic understanding, localization, and dense feature extraction.

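As a usage sketch (not part of the listing), a SigLIP-family checkpoint like this one can typically be driven through the Transformers zero-shot image classification pipeline; the repo id below is assumed from this entry.

```python
# Minimal sketch: zero-shot image classification with a SigLIP 2 checkpoint.
# Assumption: the Hub repo id is "google/siglip2-giant-opt-patch16-384".
from transformers import pipeline

classifier = pipeline(
    "zero-shot-image-classification",
    model="google/siglip2-giant-opt-patch16-384",
)
result = classifier(
    "path/to/photo.jpg",  # local path or URL
    candidate_labels=["a cat", "a dog", "a car"],
)
print(result)  # list of {"label": ..., "score": ...} dictionaries
```
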
## Owls 4B 180K
Publisher: espnet | Tags: Speech Recognition, Other | Downloads: 40 | Likes: 5

OWLS is a suite of Whisper-style models designed to help researchers understand the scaling properties of speech models, supporting multilingual speech recognition and translation.

## Llave 7B
Publisher: zhibinlan | License: Apache-2.0 | Tags: Multimodal Fusion, Transformers, English | Downloads: 1,389 | Likes: 5

LLaVE-7B is a 7-billion-parameter multimodal embedding model based on LLaVA-OneVision-7B, capable of producing embedding representations for text, images, multiple images, and videos.

## Mt0 Xxl Mt Q4 K M GGUF
Publisher: Markobes | License: Apache-2.0 | Tags: Large Language Model, Supports Multiple Languages | Downloads: 14 | Likes: 1

A multilingual text generation model converted from bigscience/mt0-xxl-mt to GGUF format via llama.cpp, supporting a wide range of language tasks.

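As a usage sketch (not part of the listing), a GGUF quantization like this one is typically loaded through llama.cpp or its Python bindings; the filename below is a placeholder, and it is assumed that the installed llama.cpp build supports mt0's encoder-decoder (T5-style) architecture.

```python
# Minimal sketch: running a local GGUF file with llama-cpp-python.
# Assumptions: the file was downloaded locally under the placeholder name below,
# and the llama.cpp build in use supports T5-style encoder-decoder models.
from llama_cpp import Llama

llm = Llama(model_path="mt0-xxl-mt-q4_k_m.gguf")  # placeholder path
output = llm("Translate to German: The weather is nice today.", max_tokens=64)
print(output["choices"][0]["text"])
```
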
## C RADIOv2 G
Publisher: nvidia | License: Other | Tags: Transformers | Downloads: 648 | Likes: 11

C-RADIOv2 is a visual feature extraction model developed by NVIDIA, available in multiple sizes and suited to image understanding and dense vision tasks.

## Videomaev2 Giant
Publisher: OpenGVLab | Tags: Video Processing | Downloads: 1,071 | Likes: 4

VideoMAEv2-giant is an ultra-large-scale video classification model based on self-supervised learning, employing a dual masking strategy for pretraining.

## Reloc3r 512
Publisher: siyan824 | Tags: Pose Estimation | Downloads: 840 | Likes: 4

Reloc3r is a concise and efficient camera pose estimation framework that combines a pretrained two-view relative camera pose regression network with a multi-view motion averaging module.

## Eva02 Large Patch14 Clip 336.merged2b
Publisher: timm | License: MIT | Tags: Text-to-Image | Downloads: 197 | Likes: 0

EVA02 CLIP is a large-scale vision-language model based on the CLIP architecture, supporting tasks such as zero-shot image classification.

## Vit So400m Patch14 Siglip Gap 384.webli
Publisher: timm | License: Apache-2.0 | Tags: Image Classification, Transformers | Downloads: 96 | Likes: 0

A SigLIP-based Vision Transformer that uses global average pooling over image features.

## Vit So400m Patch14 Siglip 224.webli
Publisher: timm | License: Apache-2.0 | Tags: Image Classification, Transformers | Downloads: 123 | Likes: 1

A SigLIP-based Vision Transformer containing only the image encoder and using the original attention pooling mechanism.

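As a usage sketch (not part of the listing), timm image encoders such as these can serve directly as feature extractors; the timm model name below is assumed to correspond to this entry.

```python
# Minimal sketch: pooled image embeddings from a timm SigLIP encoder.
# Assumption: the timm model name is "vit_so400m_patch14_siglip_224.webli".
import timm
import torch

model = timm.create_model(
    "vit_so400m_patch14_siglip_224.webli",
    pretrained=True,
    num_classes=0,  # strip the classifier head to get embeddings
)
model.eval()

# The transform below is what you would apply to a PIL image before inference.
cfg = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**cfg, is_training=False)

dummy = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image batch
with torch.no_grad():
    features = model(dummy)
print(features.shape)  # (1, embedding_dim)
```
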
## Vit Giant Patch14 Clip 224.laion2b
Publisher: timm | License: Apache-2.0 | Tags: Image Classification, Transformers | Downloads: 71 | Likes: 0

A CLIP-architecture Vision Transformer designed for image feature extraction, trained on the LAION-2B dataset.

## Vit Base Patch16 Clip 224.laion2b
Publisher: timm | License: Apache-2.0 | Tags: Image Classification, Transformers | Downloads: 4,460 | Likes: 0

A CLIP-architecture Vision Transformer containing only the image encoder, suitable for image feature extraction tasks.

## Convnext Large Mlp.clip Laion2b Ft Soup 320
Publisher: timm | License: Apache-2.0 | Tags: Image Classification, Transformers | Downloads: 173 | Likes: 0

A CLIP-architecture ConvNeXt-Large image encoder, fine-tuned on the LAION-2B dataset and supporting image feature extraction at 320x320 resolution.

## Convnext Large Mlp.clip Laion2b Augreg
Publisher: timm | License: Apache-2.0 | Tags: Image Classification, Transformers | Downloads: 107 | Likes: 0

A CLIP-framework ConvNeXt-Large image encoder trained on the LAION-2B dataset, supporting visual feature extraction.

## Aimv2 Large Patch14 224 Lit
Publisher: apple | Tags: Image-to-Text | Downloads: 222 | Likes: 6

AIMv2 is a series of vision models pretrained with multimodal autoregressive objectives, demonstrating outstanding performance across multiple multimodal understanding benchmarks.

## Camembertav2 Base
Publisher: almanach | License: MIT | Tags: Large Language Model, Transformers, French | Downloads: 2,972 | Likes: 19

CamemBERTav2 is a French language model pretrained on 275 billion tokens of French text, built on the DebertaV2 architecture, and performs strongly across a range of French NLP tasks.

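As a usage sketch (not part of the listing), an encoder like CamemBERTav2 can be exercised with the Transformers fill-mask pipeline; the repo id and the "[MASK]" token below are assumptions based on this entry and the DebertaV2 tokenizer family.

```python
# Minimal sketch: French masked-token prediction with CamemBERTav2.
# Assumptions: the Hub repo id is "almanach/camembertav2-base" and the mask
# token is "[MASK]".
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="almanach/camembertav2-base")
for prediction in fill_mask("Paris est la [MASK] de la France."):
    print(prediction["token_str"], round(prediction["score"], 3))
```
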
## Vit Base Patch16 Clip 224.metaclip 400m
Publisher: timm | Tags: Image Classification | Downloads: 1,206 | Likes: 1

A vision model trained on the MetaCLIP-400M dataset, compatible with both the OpenCLIP and timm frameworks.

## Resnet101 Clip.yfcc15m
Publisher: timm | License: MIT | Tags: Image Classification | Downloads: 134 | Likes: 0

A CLIP-style dual-modal model trained on the YFCC-15M dataset, compatible with both the open_clip and timm frameworks.

## Prolip ViT B 16 DC 1B 12 8B
Publisher: SanghyukChun | License: MIT | Tags: Text-to-Image, Safetensors | Downloads: 460 | Likes: 0

A Probabilistic Language-Image Pretraining (ProLIP) ViT-B/16 model pretrained on the DataComp-1B dataset.

## Rdt 1b
Publisher: robotics-diffusion-transformer | License: MIT | Tags: Multimodal Fusion, Transformers, English | Downloads: 2,644 | Likes: 80

A 1-billion-parameter imitation-learning diffusion Transformer pretrained on over 1M multi-robot operation trajectories, supporting multi-view vision-language-action prediction.

## Sam2 Hiera Large
Publisher: facebook | License: Apache-2.0 | Tags: Image Segmentation | Downloads: 155.85k | Likes: 68

A foundation model for promptable visual segmentation in images and videos, developed by FAIR.

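As a usage sketch (not part of the listing), this checkpoint can be driven through the `sam2` package's image predictor; the package, the repo id, and the point prompt below are assumptions.

```python
# Minimal sketch: point-prompted segmentation with the SAM 2 image predictor.
# Assumptions: the `sam2` package (facebookresearch/sam2) is installed and the
# Hub repo id is "facebook/sam2-hiera-large"; the image and click are placeholders.
import numpy as np
import torch
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

image = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for an RGB image array
with torch.inference_mode():
    predictor.set_image(image)
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[320, 240]]),  # one foreground click (x, y)
        point_labels=np.array([1]),           # 1 = foreground
    )
print(masks.shape, scores)
```
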
## Jina Embeddings V2 Base En Q5 K M GGUF
Publisher: djuna | License: Apache-2.0 | Tags: Text Embedding, English | Downloads: 85 | Likes: 2

Jina Embeddings V2 Base is an efficient English text embedding model focused on sentence similarity and feature extraction tasks.

## Depth Anything V2 Large Hf
Publisher: depth-anything | Tags: 3D Vision, Transformers | Downloads: 83.99k | Likes: 19

Depth Anything V2 is currently the most powerful monocular depth estimation (MDE) model, trained on 595,000 synthetically annotated images and over 62 million real unlabeled images, offering finer details and stronger robustness.

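As a usage sketch (not part of the listing), the "-hf" suffix indicates a Transformers-compatible checkpoint, so the depth-estimation pipeline should apply; the repo id below is assumed from this entry.

```python
# Minimal sketch: monocular depth estimation with the Transformers pipeline.
# Assumption: the Hub repo id is "depth-anything/Depth-Anything-V2-Large-hf".
from transformers import pipeline

depth = pipeline("depth-estimation", model="depth-anything/Depth-Anything-V2-Large-hf")
result = depth("path/to/photo.jpg")    # local path, URL, or PIL image
result["depth"].save("depth_map.png")  # predicted depth map as a PIL image
```
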
## Vit L 16 HTxt Recap CLIP
Publisher: UCSC-VLAA | Tags: Text-to-Image | Downloads: 538 | Likes: 17

A CLIP model trained on the Recap-DataComp-1B dataset using LLaMA-3-generated captions, suitable for zero-shot image classification tasks.

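As a usage sketch (not part of the listing), OpenCLIP checkpoints hosted on the Hub can usually be loaded with the "hf-hub:" prefix; the repo id below is assumed from this entry.

```python
# Minimal sketch: zero-shot classification with open_clip.
# Assumption: the Hub repo id is "UCSC-VLAA/ViT-L-16-HTxt-Recap-CLIP".
import torch
import open_clip
from PIL import Image

repo = "hf-hub:UCSC-VLAA/ViT-L-16-HTxt-Recap-CLIP"
model, _, preprocess = open_clip.create_model_and_transforms(repo)
tokenizer = open_clip.get_tokenizer(repo)

image = preprocess(Image.open("path/to/photo.jpg")).unsqueeze(0)
text = tokenizer(["a photo of a cat", "a photo of a dog"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(probs)  # per-label probabilities for the image
```
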
## Tic CLIP Basic Oracle
Publisher: apple | License: Other | Tags: Text-to-Image | Downloads: 37 | Likes: 0

TiC-CLIP is a vision-language model built on OpenCLIP, focused on continual learning over time, with training data spanning 2014 to 2022.

## Tookabert Base
Publisher: PartAI | License: Apache-2.0 | Tags: Large Language Model, Transformers, Other | Downloads: 127 | Likes: 24

TookaBERT is a family of encoder models trained on Persian, including base and large versions, suitable for a variety of natural language processing tasks.

## Llava Meta Llama 3 8B Instruct
Publisher: MBZUAI | Tags: Image-to-Text, Transformers | Downloads: 20 | Likes: 11

A multimodal model that combines Meta-Llama-3-8B-Instruct with LLaVA-v1.5, providing advanced vision-language understanding capabilities.

## Qwen Audio Nf4
Publisher: Ostixe360 | Tags: Audio-to-Text, Transformers, Supports Multiple Languages | Downloads: 134 | Likes: 1

Qwen-Audio-nf4 is an NF4-quantized version of Qwen-Audio, supporting multiple audio input types and text output.

## Pile T5 Xxl
Publisher: EleutherAI | Tags: Large Language Model, Transformers, English | Downloads: 44 | Likes: 28

Pile-T5 XXL is an encoder-decoder model trained on The Pile dataset using the T5x library, employing an MLM objective similar to the original T5 model and trained for 2 million steps (approximately 2 trillion tokens).

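As a usage sketch (not part of the listing), the checkpoint can be loaded as a seq2seq model through the Transformers Auto classes; the repo id and the sentinel-token format below are assumptions.

```python
# Minimal sketch: loading Pile-T5 XXL for span-infilling generation.
# Assumptions: the Hub repo id is "EleutherAI/pile-t5-xxl", the checkpoint loads
# through the Auto classes, and it uses T5-style "<extra_id_0>" sentinel tokens.
# device_map="auto" (requires accelerate) spreads the XXL weights across devices.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "EleutherAI/pile-t5-xxl"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("The capital of France is <extra_id_0>.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
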
## Infimm Zephyr
Publisher: Infi-MM | Tags: Image-to-Text, Transformers, English | Downloads: 23 | Likes: 10

InfiMM is a multimodal vision-language model inspired by the Flamingo architecture, integrating recent LLMs and suitable for a wide range of vision-language processing tasks.

## Kaori 70b V1
Publisher: KaeriJenti | Tags: Large Language Model, Transformers | Downloads: 907 | Likes: 2

kaori-70b-v1 is a large language model based on the LLaMA2 architecture, fine-tuned by the Kaeri and Jenti teams on the Open-Platypus, Dolphin, and OpenOrca datasets.

## Vit H 14 CLIPA 336 Datacomp1b
Publisher: UCSC-VLAA | License: Apache-2.0 | Tags: Text-to-Image | Downloads: 493 | Likes: 4

CLIPA-v2 is an efficient contrastive vision-language model focused on zero-shot image classification tasks.

## Sentence Transformers All Mini Lm L6 V2
Publisher: danielpark | License: Apache-2.0 | Tags: Text Embedding, English | Downloads: 78 | Likes: 1

A lightweight sentence embedding model built on the MiniLM architecture and optimized for efficient sentence-similarity computation.

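As a usage sketch (not part of the listing), this entry appears to mirror the upstream sentence-transformers checkpoint, which is used below; the upstream repo id is an assumption.

```python
# Minimal sketch: sentence embeddings and cosine similarity.
# Assumption: the upstream checkpoint "sentence-transformers/all-MiniLM-L6-v2"
# is equivalent to the mirror in this listing.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sentences = ["A cat sits on the mat.", "A kitten is resting on a rug."]
embeddings = model.encode(sentences)

print(util.cos_sim(embeddings[0], embeddings[1]))  # cosine similarity score
```
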
## Molm 700M 4B
Publisher: ibm-research | License: Apache-2.0 | Tags: Large Language Model, Transformers | Downloads: 36 | Likes: 6

MoLM is a series of language models based on the Mixture-of-Experts (MoE) architecture. The 700M-4B version has 4 billion parameters in total, with the computational cost of a 700-million-parameter dense model.

## Wav2vec2 Large Audioset
Publisher: ALM | Tags: Audio Classification, Transformers | Downloads: 43 | Likes: 0

An audio representation model based on the HuBERT architecture, pretrained on the complete AudioSet dataset and suitable for general audio tasks.

## Idefics 9b Instruct
Publisher: HuggingFaceM4 | License: Other | Tags: Image-to-Text, Transformers, English | Downloads: 28.34k | Likes: 104

IDEFICS is an open-source reproduction of DeepMind's proprietary vision-language model Flamingo. It is a multimodal model that accepts arbitrary sequences of images and text as input and generates text output.

## CLIP Giga Config Fixed
Publisher: Geonmo | License: MIT | Tags: Text-to-Image, Transformers | Downloads: 109 | Likes: 1

A large CLIP model trained on the LAION-2B dataset using the ViT-bigG-14 architecture, supporting cross-modal understanding between images and text.

## Video Blip Opt 2.7b Ego4d
Publisher: kpyu | License: MIT | Tags: Video-to-Text, Transformers, English | Downloads: 429 | Likes: 16

VideoBLIP is an enhanced version of BLIP-2 capable of processing video data, using OPT-2.7b as its language model backbone.