nllb-clip-base-siglip Open-source Multilingual Vision-Language Model - Supports Image-Text Processing in 201 Languages

NLLB-CLIP Base SigLIP

Developed by visheratin

NLLB-CLIP-SigLIP is a multilingual vision-language model that combines the text encoder from NLLB and the image encoder from SigLIP, supporting 201 languages.

Text-to-Image #Multilingual zero-shot classification #Cross-modal retrieval #Low-resource language processing

Release Time: 11/14/2023

Model Overview

This model pairs the NLLB text encoder with the SigLIP image encoder. It is particularly strong on low-resource languages and on cross-modal tasks such as image-text retrieval.

Model Features

Multilingual support

Supports 201 languages from Flores-200, with particular strength in low-resource languages

Cross-modal capability

Combines text and image encoding capabilities, suitable for cross-modal tasks

Superior performance

Sets a new state of the art on the Crossmodal-3600 dataset

Model Capabilities

Zero-shot image classification

Multilingual text understanding

Cross-modal retrieval

Use Cases

Multilingual applications

Multilingual image classification

Classify images using different languages

Performs excellently across multiple languages

Cross-modal retrieval

Image-text matching

Match images and texts in a multilingual environment

Performs exceptionally well on the Crossmodal-3600 dataset
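The image-text matching use case can be sketched without the model itself: once an image and a set of captions have been embedded, retrieval reduces to ranking the captions by similarity to the image. The sketch below is a dependency-free illustration, assuming cosine similarity as the score and non-zero vectors; the function names and the three-dimensional embeddings are made up for the example and are not part of the model or of OpenCLIP.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_captions(image_embedding, caption_embeddings):
    """Return caption indices ordered from best to worst match."""
    scores = [cosine_similarity(image_embedding, c) for c in caption_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical embeddings; real ones would come from the image and text encoders.
image_embedding = [0.9, 0.1, 0.0]
caption_embeddings = [
    [0.0, 1.0, 0.0],  # mostly unrelated caption
    [1.0, 0.0, 0.0],  # close match
    [0.5, 0.5, 0.0],  # partial match
]

print(rank_captions(image_embedding, caption_embeddings))  # [1, 2, 0]
```

With real embeddings the same ranking is usually computed in one matrix product on normalized tensors; the loop form here only makes the scoring step explicit.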

🚀 NLLB-CLIP-SigLIP Model

NLLB-CLIP-SigLIP combines a text encoder from the NLLB model and an image encoder from the SigLIP model. It extends model capabilities to the 201 languages of Flores-200 and sets a new state of the art on the Crossmodal-3600 dataset, performing especially well on low-resource languages.

📦 Model Information

Property	Details
Model Type	NLLB-CLIP-SigLIP
Training Data	visheratin/laion-coco-nllb
License	cc-by-nc-4.0
New Version	visheratin/mexma-siglip2

✨ Features

Combines the text encoder from the NLLB model with the image encoder from SigLIP.
Extends capabilities to the 201 languages of Flores-200.
Achieves state-of-the-art results on the Crossmodal-3600 dataset and performs well on low-resource languages.
This version outperforms the standard version, with results available here and here.
An even better version is available: visheratin/mexma-siglip2.
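Languages are selected by identifiers such as rus_Cyrl or eng_Latn, as in the usage example further down. A mistyped identifier is easy to miss, so a small shape check can help before tokenizing. This is a minimal sketch, assuming every identifier follows the pattern of a three-letter lowercase language code, an underscore, and a four-letter script code; the helper name is hypothetical, and the check does not confirm that a code is actually one of the 201 supported languages.

```python
import re

# Assumed shape: three lowercase letters, "_", then a capitalized four-letter script.
_LANG_CODE = re.compile(r"[a-z]{3}_[A-Z][a-z]{3}")


def looks_like_flores_code(code):
    """Return True if `code` has the assumed Flores-200 identifier shape."""
    return _LANG_CODE.fullmatch(code) is not None


for candidate in ["rus_Cyrl", "eng_Latn", "afr_Latn", "en", "eng-Latn"]:
    print(candidate, looks_like_flores_code(candidate))
```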

🚀 Quick Start

This model is integrated into OpenCLIP. You can use it just like any other model.

📦 Installation

!pip install -U open_clip_torch

💻 Usage Examples

Basic Usage

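from open_clip import create_model_from_pretrained, get_tokenizer
from PIL import Image
import requests
import torch

# Use the GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model together with its image preprocessing transform.
model, transform = create_model_from_pretrained("nllb-clip-base-siglip", "v1", device=device)

tokenizer = get_tokenizer("nllb-clip-base-siglip")

# Candidate class names, each paired with the language code it is written in.
class_options = ["бабочка", "butterfly", "kat"]
class_langs = ["rus_Cyrl", "eng_Latn", "afr_Latn"]

# Tokenize each label with the tokenizer set to that label's language.
text_inputs = []
for label, lang in zip(class_options, class_langs):
    tokenizer.set_language(lang)
    text_inputs.append(tokenizer(label))
text_inputs = torch.stack(text_inputs).squeeze(1).to(device)

# Download the example image and preprocess it into a batch of one.
image_path = "https://huggingface.co/spaces/jjourney1125/swin2sr/resolve/main/samples/butterfly.jpg"
image = Image.open(requests.get(image_path, stream=True).raw)

image_inputs = transform(image).unsqueeze(0).to(device)

# Score the image against every label without tracking gradients.
with torch.inference_mode():
    logits_per_image, logits_per_text = model.get_logits(image_inputs, text_inputs)

# Probabilities over the candidate labels for the image.
print(logits_per_image.softmax(dim=-1))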
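The snippet above prints a row of probabilities, one per candidate label. To turn that row into a prediction, take the index of the largest value and look up the label at that position. The sketch below shows that step in plain Python so it runs without the model; the logit values are invented for illustration, and best_label is a hypothetical helper. With the real output, a list of logits could be obtained with something like logits_per_image[0].tolist().

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_label(logits, labels):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return labels[idx], probs[idx]


class_options = ["бабочка", "butterfly", "kat"]
example_logits = [2.0, 1.5, -3.0]  # hypothetical values, not real model output

label, prob = best_label(example_logits, class_options)
print(label, round(prob, 3))
```

Because the same concept may appear in several languages among the labels (here a Russian and an English word for butterfly), probability mass can be split between them; summing the probabilities of equivalent labels is one way to account for that.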

📚 Documentation

You can find more details about the model in the paper.

📄 License

This model is released under the cc-by-nc-4.0 license.

🔗 Try it in Colab

👏 Acknowledgements

I thank ML Collective for providing Google Cloud compute resources to train the OpenCLIP-compatible version of NLLB-CLIP.
