CLIP ConvNeXt-XXLarge LAION-2B s34B b82K augreg soup

Developed by laion
A CLIP model with a ConvNeXt-XXLarge image tower, trained on the LAION-2B dataset with the OpenCLIP framework. It is the first CLIP model with a non-ViT image tower to achieve >79% ImageNet top-1 zero-shot accuracy.
Downloads 9,412
Release Time: 2/26/2023

Model Overview

This is a CLIP model built on the ConvNeXt-XXLarge architecture and intended for zero-shot image classification and image-text retrieval. Its weights are a model soup, that is, an average of the weights from two training phases, and it performs strongly at 256x256 resolution.

Model Features

Large-scale ConvNeXt architecture
Uses the 847M-parameter ConvNeXt-XXLarge as the image tower, the largest pre-trained ConvNeXt model publicly released at the time (February 2023)
High-performance zero-shot classification
Achieves 79.4% zero-shot top-1 accuracy on ImageNet, surpassing many ViT-based CLIP models
Efficient computation
At 256x256 resolution, its capability falls between ViT-g and ViT-G, while its compute cost is significantly lower than the latter's
Model soup integration
Further improves performance by averaging weights from two different training phases
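
The weight averaging behind a model soup can be illustrated with a minimal sketch. It assumes each checkpoint is a plain dict mapping parameter names to flat lists of floats; the parameter names and values below are hypothetical toy stand-ins, and a real checkpoint would hold framework tensors rather than lists. It shows uniform averaging only and is not the authors' actual script.

```python
def average_checkpoints(state_dicts):
    """Uniformly average parameters across checkpoints that share names and shapes."""
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    keys = state_dicts[0].keys()
    for sd in state_dicts[1:]:
        if sd.keys() != keys:
            raise ValueError("checkpoints must share the same parameter names")
    n = len(state_dicts)
    soup = {}
    for name in keys:
        tensors = [sd[name] for sd in state_dicts]
        if len({len(t) for t in tensors}) != 1:
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise mean across all checkpoints.
        soup[name] = [sum(vals) / n for vals in zip(*tensors)]
    return soup


# Toy stand-ins for the two training-phase checkpoints (hypothetical values).
phase_a = {"proj.weight": [0.0, 1.0, 2.0], "proj.bias": [0.5]}
phase_b = {"proj.weight": [1.0, 1.0, 4.0], "proj.bias": [1.5]}

soup = average_checkpoints([phase_a, phase_b])
print(soup)
```

Averaging is only meaningful when the checkpoints come from the same architecture and a shared training lineage, which is why the sketch rejects mismatched names or shapes.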

Model Capabilities

Zero-shot image classification
Image-text retrieval
Image feature extraction
Text feature extraction
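
Zero-shot classification combines the image and text feature extraction listed above: the image embedding is compared with the embedding of one text prompt per class. The sketch below assumes the embeddings have already been produced by the two towers. The 3-dimensional vectors, the prompts, and the `logit_scale` value of 100 are illustrative assumptions, not values taken from this model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and N prompts."""
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        logits.append(logit_scale * sum(a * b for a, b in zip(img, txt)))
    # Subtract the max logit for numerical stability before exponentiating.
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical prompts and toy embeddings standing in for real tower outputs.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = [[1.0, 0.1, 0.0], [0.1, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.9, 0.2, 0.1]

probs = zero_shot_probs(image_emb, text_embs)
predicted = labels[probs.index(max(probs))]
print(predicted)
```

No class-specific training is involved: adding a class only requires adding another text prompt and its embedding.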

Use Cases

Computer vision
  Zero-shot image classification: classify images without task-specific training (79.4% top-1 accuracy on ImageNet)
  Image retrieval: retrieve relevant images from text descriptions
Multimodal research
  Vision-language alignment research: study how image and text representations align
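
The image retrieval use case above reduces to ranking stored image embeddings by their similarity to a text query embedding. The sketch below assumes the embeddings were computed ahead of time. The file names, vectors, and the `retrieve` helper are hypothetical and only illustrate the ranking step.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_emb, image_index, k=2):
    """Return the names of the k images most similar to the query embedding."""
    scored = [(cosine(query_emb, emb), name) for name, emb in image_index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Toy index of precomputed image embeddings (hypothetical values).
image_index = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "city.jpg": [0.0, 0.2, 0.9],
}
query_emb = [1.0, 0.2, 0.0]  # stand-in for the embedding of a text query

top = retrieve(query_emb, image_index, k=2)
print(top)
```

For a large collection, the linear scan would typically be replaced by an approximate nearest-neighbor index, but the similarity measure stays the same.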