🚀 MEXMA-SigLIP
MEXMA-SigLIP is a high-performance zero-shot image classification model supporting 80 languages, combining MEXMA and SigLIP.
🚀 Quick Start
MEXMA-SigLIP combines the MEXMA multilingual text encoder with an image encoder from the SigLIP model, yielding a high-performance CLIP-style model that covers 80 languages. It sets the state of the art on the Crossmodal-3600 dataset among models with commercial-use-friendly licenses.
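Like other dual-encoder CLIP-style models, it scores an image against candidate texts by embedding both separately and comparing the normalized embeddings, scaled by a temperature. The sketch below illustrates only this scoring idea with placeholder tensors; it is not the model's actual API (see the usage example for that), and the embedding size and temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings standing in for the MEXMA text-tower and SigLIP
# image-tower outputs. The dimension (768) and temperature (100.0) are
# assumptions for illustration, not values from the released checkpoint.
image_embeddings = F.normalize(torch.randn(2, 768), dim=-1)  # 2 images
text_embeddings = F.normalize(torch.randn(3, 768), dim=-1)   # 3 candidate labels

logit_scale = 100.0
image_logits = logit_scale * image_embeddings @ text_embeddings.T  # shape (2, 3)
probs = image_logits.softmax(dim=-1)  # per-image distribution over the labels
print(probs)
```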
💻 Usage Examples
Basic Usage
```python
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
from PIL import Image
import requests
import torch

# Load the model in bfloat16; trust_remote_code is required because the
# architecture is defined in the model repository itself.
model = AutoModel.from_pretrained("visheratin/mexma-siglip", torch_dtype=torch.bfloat16, trust_remote_code=True, optimized=True).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("visheratin/mexma-siglip")
processor = AutoImageProcessor.from_pretrained("visheratin/mexma-siglip")

# Download and preprocess the image, then match the model's dtype and device.
img = Image.open(requests.get("https://static.independent.co.uk/s3fs-public/thumbnails/image/2014/03/25/12/eiffel.jpg", stream=True).raw)
img = processor(images=img, return_tensors="pt")["pixel_values"]
img = img.to(torch.bfloat16).to("cuda")

with torch.inference_mode():
    # Candidate labels can be written in any of the 80 supported languages.
    text = tokenizer(["кошка", "a dog", "एफिल टॉवर"], return_tensors="pt", padding=True).to("cuda")
    image_logits, text_logits = model.get_logits(text["input_ids"], text["attention_mask"], img)
    probs = image_logits.softmax(dim=-1)
    print(probs)
```
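To turn the probabilities into a prediction, keep the candidate labels in a list and take the argmax per image. This short sketch continues the example above; the `labels` variable is introduced here only for illustration.

```python
labels = ["кошка", "a dog", "एफिल टॉवर"]
best = probs.argmax(dim=-1)  # index of the most likely label for each image
for i, idx in enumerate(best.tolist()):
    print(f"image {i}: {labels[idx]} ({probs[i, idx].item():.3f})")
```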
📄 License
The model is released under the MIT license.
🔗 Additional Information
Supported Languages
- ar, kn, ka, af, kk, am, km, ky, ko, as, lo, az, ml, mr, be, mk, bn, my, bs, nl, bg, ca, no, cs, ne, ku, pl, cy, pt, da, ro, de, ru, el, sa, en, si, eo, sk, et, sl, eu, sd, fi, so, fr, es, gd, sr, ga, su, gl, sv, gu, sw, ha, ta, he, te, hi, th, hr, tr, hu, ug, hy, uk, id, ur, is, vi, it, xh, jv, zh, ja
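Because the text encoder is multilingual, the candidate labels do not all have to be in one language; the same image can be scored against labels written in any mix of the languages above. A minimal sketch, assuming `model`, `tokenizer`, and `img` from the Basic Usage example are already in scope and using example labels chosen here for illustration:

```python
labels = ["la tour Eiffel", "埃菲尔铁塔", "برج إيفل", "a cat"]  # fr, zh, ar, en
with torch.inference_mode():
    text = tokenizer(labels, return_tensors="pt", padding=True).to("cuda")
    image_logits, _ = model.get_logits(text["input_ids"], text["attention_mask"], img)
    print(image_logits.softmax(dim=-1))
```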
Pipeline Tag
zero-shot-image-classification
Tags
siglip, clip, mexma
New Version
visheratin/mexma-siglip2
Acknowledgements
I thank ML Collective and Lambda for providing compute resources to train the model.