Align Base

Developed by kakaobrain
ALIGN is a vision-language dual-encoder model that aligns image and text representations through contrastive learning. It learns state-of-the-art cross-modal representations from large-scale noisy image-text data.
Downloads 78.28k
Release Time: 2/24/2023

Model Overview

ALIGN uses EfficientNet as the visual encoder and BERT as the text encoder. This version was trained with contrastive learning on the COYO-700M dataset and supports zero-shot image classification and multimodal embedding retrieval.
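As a rough illustration of the zero-shot classification described above, the sketch below assumes the checkpoint is available on the Hugging Face Hub as `kakaobrain/align-base` and loads through the `AlignProcessor` and `AlignModel` classes in `transformers`. The model id and the helper name `classify_zero_shot` are assumptions for illustration, not details stated on this page.

```python
def classify_zero_shot(image, candidate_labels, model_id="kakaobrain/align-base"):
    """Return {label: probability} for a single PIL image.

    model_id is an assumed Hub identifier; adjust it to the checkpoint you use.
    """
    # Imported lazily so this module loads even without the heavy dependencies.
    import torch
    from transformers import AlignModel, AlignProcessor

    processor = AlignProcessor.from_pretrained(model_id)
    model = AlignModel.from_pretrained(model_id)

    # Tokenize the label prompts and preprocess the image in one call.
    inputs = processor(text=candidate_labels, images=image, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_image has shape (num_images, num_labels); softmax over labels.
    probs = outputs.logits_per_image.softmax(dim=1)[0]
    return dict(zip(candidate_labels, probs.tolist()))
```

A call might look like `classify_zero_shot(Image.open("cat.jpg"), ["an image of a cat", "an image of a dog"])`, where `cat.jpg` is a hypothetical local file opened with Pillow. Labels phrased as short captions tend to suit models trained on image-text pairs.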

Model Features

Noisy Data Training
Trained on a massive set of noisy image-text pairs (COYO-700M), showing that a simple method combined with large-scale data can achieve state-of-the-art representations.
Dual-Encoder Architecture
The visual and text branches encode their inputs independently, and a contrastive loss aligns the two modalities, balancing efficiency and flexibility.
Rich Metadata Support
The COYO training dataset provides metadata such as aesthetic scores, watermark detection, and face counts, which gives downstream applications more control over the data they use.
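The contrastive objective mentioned under Dual-Encoder Architecture can be sketched in plain Python. This is a minimal illustration of a symmetric, temperature-scaled softmax loss of the kind commonly used for dual encoders, not the training code for this model. The function names, the fixed temperature of 0.07, and the averaging of the two directions are all assumptions made for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image loss over a batch of pairs.

    Pair i (image_embs[i], text_embs[i]) is the positive; every other
    combination in the batch is treated as a negative.
    temperature=0.07 is an assumed constant for illustration.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    image_to_text = diagonal_cross_entropy(logits)
    text_to_image = diagonal_cross_entropy(logits_transposed)
    return 0.5 * (image_to_text + text_to_image)


# Toy batch: matching pairs point the same way, so the loss is near zero.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# Swapping the captions makes every positive pair dissimilar, so the loss is large.
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Because each encoder runs independently, image and text embeddings can be computed and cached separately, and only the inexpensive similarity step needs both.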

Model Capabilities

Zero-shot image classification
Image-text similarity calculation
Cross-modal embedding retrieval
Multimodal representation learning

Use Cases

Image Understanding
Zero-shot Image Classification
Classify images of arbitrary categories without fine-tuning.
Achieves performance comparable to dedicated classification models on standard benchmarks.
Cross-modal Retrieval
Image-Text Matching
Retrieve the most relevant images for a text description, or the best-matching text for an image.
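Cross-modal retrieval of this kind usually reduces to ranking precomputed embeddings by cosine similarity. The sketch below shows that ranking step with toy three-dimensional vectors standing in for real model outputs. The names `retrieve_top_k` and `gallery` and the example values are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_emb, candidate_embs, k=3):
    """Return up to k (index, score) pairs, most similar candidate first."""
    scored = [
        (cosine_similarity(query_emb, emb), idx)
        for idx, emb in enumerate(candidate_embs)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


# Toy image embeddings; in practice these would come from the image encoder.
gallery = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.2, 0.9],
]
# Toy text embedding; in practice this would come from the text encoder.
text_query = [1.0, 0.0, 0.0]

top_matches = retrieve_top_k(text_query, gallery, k=2)
```

The same function works in the other direction: pass an image embedding as the query and a list of text embeddings as the candidates. For large collections, an approximate nearest-neighbor index would typically replace the linear scan.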