Open-source Multilingual Vision-Language Model nllb-clip-large-siglip - Supports Image-Text Processing in 201 Languages

Nllb Clip Large Siglip

Developed by visheratin

NLLB-CLIP-SigLIP is a multilingual vision-language model that combines the text encoder of the NLLB model and the image encoder of the SigLIP model, supporting 201 languages.

Text-to-Image #Multilingual zero-shot classification #Cross-modal retrieval #Low-resource language support

Downloads 384

Release Time: 11/14/2023

Model Overview

This model pairs the NLLB text encoder with the SigLIP image encoder. It performs particularly well on cross-modal tasks in low-resource languages and sets state-of-the-art results on the Crossmodal-3600 dataset.

Model Features

Multilingual support

Supports 201 languages from Flores-200, including many low-resource languages

Cross-modal capability

Combines text and image encoding abilities, excelling in image-text matching tasks

Low-resource language performance

Performs very well on low-resource languages, with state-of-the-art results on the Crossmodal-3600 dataset

Model Capabilities

Multilingual image classification

Cross-lingual image retrieval

Zero-shot learning

Use Cases

Multilingual content understanding

Multilingual image classification

Classify images using text labels in different languages

Outstanding performance on the Crossmodal-3600 dataset

Cross-lingual image retrieval

Retrieve relevant images using queries in different languages

Supports queries in 201 languages

🚀 NLLB-CLIP-SigLIP

NLLB-CLIP-SigLIP combines NLLB's text encoder and SigLIP's image encoder, enabling cross-modal tasks in the 201 languages of Flores-200.

🚀 Quick Start

This model is integrated into OpenCLIP, so you can use it like any other OpenCLIP model. A Colab notebook is also available for a quick try.

✨ Features

NLLB-CLIP-SigLIP combines a text encoder from the NLLB model and an image encoder from the SigLIP model, extending the model's capabilities to the 201 languages of Flores-200.
It sets a new state of the art on the Crossmodal-3600 dataset and performs very well on low-resource languages.
This version performs much better than the standard version. You can see the results here and here.
An even better version of this model is available: visheratin/mexma-siglip2.

📦 Installation

First, install the necessary library:

!pip install -U open_clip_torch

💻 Usage Examples

Basic Usage

from open_clip import create_model_from_pretrained, get_tokenizer
from PIL import Image
import requests
import torch

# Use the GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

model, transform = create_model_from_pretrained("nllb-clip-large-siglip", "v1", device=device)

tokenizer = get_tokenizer("nllb-clip-large-siglip")

# Candidate labels and the Flores-200 code of the language each one is written in.
class_options = ["бабочка", "butterfly", "kat"]
class_langs = ["rus_Cyrl", "eng_Latn", "afr_Latn"]

# Tokenize each label with the tokenizer set to that label's language.
text_inputs = []
for label, lang in zip(class_options, class_langs):
    tokenizer.set_language(lang)
    text_inputs.append(tokenizer(label))
text_inputs = torch.stack(text_inputs).squeeze(1).to(device)

image_path = "https://huggingface.co/spaces/jjourney1125/swin2sr/resolve/main/samples/butterfly.jpg"
image = Image.open(requests.get(image_path, stream=True).raw)

# Preprocess the image and add a batch dimension.
image_inputs = transform(image).unsqueeze(0).to(device)

with torch.inference_mode():
    logits_per_image, logits_per_text = model.get_logits(image_inputs, text_inputs)

# Probability of each label for the image, in the same order as class_options.
print(logits_per_image.softmax(dim=-1))
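The example above covers zero-shot classification. Cross-lingual image retrieval, the other use case listed for this model, can be approached the same way: embed a text query and a set of images, then rank the images by cosine similarity to the query. The sketch below shows only the ranking step, in plain Python on toy vectors. The helper names `cosine_similarity` and `rank_images` are hypothetical and not part of OpenCLIP. It assumes real embeddings would come from `model.encode_text` and `model.encode_image`, the standard OpenCLIP encoder methods, which has not been verified against this checkpoint here.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("a zero vector has no direction")
    return dot / (norm_a * norm_b)


def rank_images(
    query_embedding: Sequence[float],
    image_embeddings: Sequence[Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (image_index, similarity) pairs, best match first."""
    scored = [
        (index, cosine_similarity(query_embedding, embedding))
        for index, embedding in enumerate(image_embeddings)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional vectors standing in for real embeddings (assumption:
# real ones would be model.encode_text(...) / model.encode_image(...) outputs
# converted to lists, e.g. with .tolist()).
query = [1.0, 0.0, 0.0]
gallery = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
print(rank_images(query, gallery, top_k=2))
```

For large image collections the same ranking would normally be done as one matrix product on normalized tensors rather than in a Python loop. The plain-Python form is used here only to keep the logic visible. As in the classification example, the tokenizer language should be set with `tokenizer.set_language` before encoding each query.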

📚 Documentation

You can find more details about the model in the paper.

📄 License

This model is released under the cc-by-nc-4.0 license.

🔗 Related Information

Property	Details
Tags	clip
Library Name	open_clip
Pipeline Tag	zero-shot-image-classification
Datasets	visheratin/laion-coco-nllb
New Version	visheratin/mexma-siglip2

👏 Acknowledgements

I thank ML Collective for providing Google Cloud compute resources to train the OpenCLIP-compatible version of NLLB-CLIP.
