MEXMA-SigLIP2: Open-Source Multimodal Model for Image and Text Matching, Supporting 80 Languages

Mexma Siglip2

Developed by visheratin

MEXMA-SigLIP2 is a high-performance CLIP model combining the MEXMA multilingual text encoder and SigLIP2 image encoder, supporting 80 languages.

Text-to-Image

Safetensors

Supports Multiple Languages

Open Source License: MIT

#Multilingual zero-shot retrieval #Cross-modal high precision #80-language support

Downloads 224

Release Time: 3/2/2025

Model Overview

This model integrates the MEXMA multilingual text encoder and SigLIP2 image encoder to achieve cross-modal retrieval capabilities, excelling particularly in zero-shot image classification tasks.

Model Features

Multilingual support

Supports 80 languages, including various Asian, European, and African languages

High-performance cross-modal retrieval

Achieves new state-of-the-art results on the Crossmodal-3600 dataset

Zero-shot learning capability

Performs image classification tasks without task-specific fine-tuning

Model Capabilities

Zero-shot image classification

Cross-modal retrieval

Multilingual text understanding

Image-text matching

Use Cases

Image retrieval

Multilingual image search

Retrieve relevant images using queries in different languages

Achieves 62.54% R@1 for image retrieval on the Crossmodal-3600 dataset

Text retrieval

Image-related text retrieval

Retrieve relevant text descriptions based on image content

Achieves 59.99% R@1 for text retrieval on the Crossmodal-3600 dataset

🚀 MEXMA-SigLIP2

MEXMA-SigLIP2 is a high-performance CLIP model that combines MEXMA and SigLIP2, supporting 80 languages and achieving state-of-the-art results on the Crossmodal-3600 dataset.

🚀 Quick Start

Model Information

Property	Details
Model Type	Zero-shot image classification
Supported Languages	ar, kn, ka, af, kk, am, km, ky, ko, as, lo, az, ml, mr, be, mk, bn, my, bs, nl, bg, ca, no, cs, ne, ku, pl, cy, pt, da, ro, de, ru, el, sa, en, si, eo, sk, et, sl, eu, sd, fi, so, fr, es, gd, sr, ga, su, gl, sv, gu, sw, ha, ta, he, te, hi, th, hr, tr, hu, ug, hy, uk, id, ur, is, vi, it, xh, jv, zh, ja
Tags	siglip2, clip, mexma

Model Performance

The mexma-siglip2 model achieves the following results on the Crossmodal-3600 dataset:

Task	Metric	Value
Zero-shot Image Retrieval	Image retrieval R@1	62.54%
Zero-shot Text Retrieval	Text retrieval R@1	59.99%
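
R@1 (recall at 1) is the share of queries whose top-ranked candidate is a correct match. The following is a minimal sketch of that computation, not the evaluation code behind the numbers above. It assumes a square score matrix in which the correct candidate for query i sits at index i; the `recall_at_1` helper and the toy scores are illustrative only, and a real benchmark may have several correct matches per query.

```python
def recall_at_1(similarity):
    """Fraction of queries whose highest-scoring candidate is the correct one.

    Assumes similarity[i][j] scores query i against candidate j, and that
    the correct candidate for query i is at index i (illustrative setup).
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Index of the best-scoring candidate for this query.
        best = max(range(len(row)), key=row.__getitem__)
        if best == i:
            hits += 1
    return hits / len(similarity)


# Toy scores (made up): queries 0 and 1 rank their own candidate first,
# query 2 ranks candidate 0 first and therefore counts as a miss.
toy_scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.8, 0.4],
    [0.7, 0.2, 0.6],
]
print(recall_at_1(toy_scores))
```

For image retrieval the queries are texts and the candidates are images; for text retrieval the roles are swapped.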

💻 Usage Examples

Basic Usage

from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
from PIL import Image
import requests
import torch

# Load the model (custom code from the Hub, hence trust_remote_code=True), tokenizer, and image processor.
model = AutoModel.from_pretrained("visheratin/mexma-siglip2", torch_dtype=torch.bfloat16, trust_remote_code=True, optimized=True).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("visheratin/mexma-siglip2")
processor = AutoImageProcessor.from_pretrained("visheratin/mexma-siglip2")

# Download the example image and convert it to a bfloat16 tensor on the GPU.
img = Image.open(requests.get("https://static.independent.co.uk/s3fs-public/thumbnails/image/2014/03/25/12/eiffel.jpg", stream=True).raw)
img = processor(images=img, return_tensors="pt")["pixel_values"]
img = img.to(torch.bfloat16).to("cuda")
with torch.inference_mode():
    # Candidate labels in Russian ("cat"), English, and Hindi ("Eiffel Tower").
    text = tokenizer(["кошка", "a dog", "एफिल टॉवर"], return_tensors="pt", padding=True).to("cuda")
    image_logits, text_logits = model.get_logits(text["input_ids"], text["attention_mask"], img)
    # Softmax over the last dimension turns the image logits into probabilities.
    probs = image_logits.softmax(dim=-1)
    print(probs)
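
The example above ends by printing probabilities; picking the predicted label from them is one more step. Below is a library-free sketch of that step. The logit values are made up for illustration and are not actual model output, and the `softmax` helper is a plain-Python stand-in for the tensor method used above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["кошка", "a dog", "एफिल टॉवर"]
example_logits = [1.0, 2.0, 9.0]  # hypothetical values, not model output

probs = softmax(example_logits)
# The predicted label is the one with the highest probability.
best_label = labels[probs.index(max(probs))]
print(best_label)
```

With real model output, the same selection can be done on the tensor directly, for example with `probs.argmax(dim=-1)`.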

📚 Documentation

MEXMA-SigLIP2 combines the MEXMA multilingual text encoder with the image encoder from the [SigLIP2](https://huggingface.co/google/siglip2-so400m-patch16-512/) model. The result is a high-performance CLIP model for 80 languages. MEXMA-SigLIP2 sets a new state of the art on the Crossmodal-3600 dataset with 62.54% R@1 for image retrieval and 59.99% R@1 for text retrieval.
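
In a CLIP-style model, the text encoder and the image encoder map their inputs into one shared embedding space, so matching reduces to comparing vectors. The sketch below illustrates that idea with cosine similarity, which is an assumption about the comparison rather than a description of this model's internals; models of this family typically also apply a learned scale to the scores, which is omitted here. The embedding values, file names, and helper functions are invented for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def rank_images(text_embedding, image_embeddings):
    """Return image names sorted from most to least similar to the text."""
    scores = {
        name: cosine_similarity(text_embedding, emb)
        for name, emb in image_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy 3-dimensional embeddings (hypothetical; real embeddings are much larger).
query = [0.9, 0.1, 0.0]
images = {
    "tower.jpg": [1.0, 0.0, 0.1],
    "cat.jpg": [0.0, 1.0, 0.2],
    "dog.jpg": [0.1, 0.2, 1.0],
}
ranking = rank_images(query, images)
print(ranking)
```

Because the text encoder is multilingual, queries in different languages with the same meaning are expected to land near each other in this space, which is what makes cross-lingual image search possible.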

📄 License

This project is licensed under the MIT license.

🙏 Acknowledgements

I thank ML Collective for providing compute resources to train the model.
