🚀 MEXMA-SigLIP
MEXMA-SigLIP is a high-performance zero-shot image classification model supporting 80 languages, combining MEXMA and SigLIP.
🚀 Quick Start
MEXMA-SigLIP combines the MEXMA multilingual text encoder with an image encoder from the SigLIP model, yielding a high-performance CLIP-style model that covers 80 languages. It sets the state of the art on the Crossmodal-3600 dataset among models with commercial-use-friendly licenses.
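Like other dual-encoder CLIP-style models, it scores an image against candidate texts by embedding both separately and comparing the normalized embeddings, scaled by a temperature. The sketch below illustrates only this scoring idea with placeholder tensors; it is not the model's actual API (see the usage example for that), and the embedding size and temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings standing in for the MEXMA text-tower and SigLIP
# image-tower outputs. The dimension (768) and temperature (100.0) are
# assumptions for illustration, not values from the released checkpoint.
image_embeddings = F.normalize(torch.randn(2, 768), dim=-1)  # 2 images
text_embeddings = F.normalize(torch.randn(3, 768), dim=-1)   # 3 candidate labels

logit_scale = 100.0
image_logits = logit_scale * image_embeddings @ text_embeddings.T  # shape (2, 3)
probs = image_logits.softmax(dim=-1)  # per-image distribution over the labels
print(probs)
```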
💻 Usage Examples
Basic Usage
```python
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
from PIL import Image
import requests
import torch

# Load the model in bfloat16; trust_remote_code is required because the
# architecture is defined in the model repository itself.
model = AutoModel.from_pretrained("visheratin/mexma-siglip", torch_dtype=torch.bfloat16, trust_remote_code=True, optimized=True).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("visheratin/mexma-siglip")
processor = AutoImageProcessor.from_pretrained("visheratin/mexma-siglip")

# Download and preprocess the image, then match the model's dtype and device.
img = Image.open(requests.get("https://static.independent.co.uk/s3fs-public/thumbnails/image/2014/03/25/12/eiffel.jpg", stream=True).raw)
img = processor(images=img, return_tensors="pt")["pixel_values"]
img = img.to(torch.bfloat16).to("cuda")

with torch.inference_mode():
    # Candidate labels can be written in any of the 80 supported languages.
    text = tokenizer(["кошка", "a dog", "एफिल टॉवर"], return_tensors="pt", padding=True).to("cuda")
    image_logits, text_logits = model.get_logits(text["input_ids"], text["attention_mask"], img)
    probs = image_logits.softmax(dim=-1)
    print(probs)
```
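To turn the probabilities into a prediction, keep the candidate labels in a list and take the argmax per image. This short sketch continues the example above; the `labels` variable is introduced here only for illustration.

```python
labels = ["кошка", "a dog", "एफिल टॉवर"]
best = probs.argmax(dim=-1)  # index of the most likely label for each image
for i, idx in enumerate(best.tolist()):
    print(f"image {i}: {labels[idx]} ({probs[i, idx].item():.3f})")
```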
📄 License
The model is released under the MIT license.
🔗 Additional Information
Supported Languages
- ar, kn, ka, af, kk, am, km, ky, ko, as, lo, az, ml, mr, be, mk, bn, my, bs, nl, bg, ca, no, cs, ne, ku, pl, cy, pt, da, ro, de, ru, el, sa, en, si, eo, sk, et, sl, eu, sd, fi, so, fr, es, gd, sr, ga, su, gl, sv, gu, sw, ha, ta, he, te, hi, th, hr, tr, hu, ug, hy, uk, id, ur, is, vi, it, xh, jv, zh, ja
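Because the text encoder is multilingual, the candidate labels do not all have to be in one language; the same image can be scored against labels written in any mix of the languages above. A minimal sketch, assuming `model`, `tokenizer`, and `img` from the Basic Usage example are already in scope and using example labels chosen here for illustration:

```python
labels = ["la tour Eiffel", "埃菲尔铁塔", "برج إيفل", "a cat"]  # fr, zh, ar, en
with torch.inference_mode():
    text = tokenizer(labels, return_tensors="pt", padding=True).to("cuda")
    image_logits, _ = model.get_logits(text["input_ids"], text["attention_mask"], img)
    print(image_logits.softmax(dim=-1))
```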
Pipeline Tag
zero-shot-image-classification
Tags
siglip, clip, mexma
New Version
visheratin/mexma-siglip2
Acknowledgements
I thank ML Collective and Lambda for providing compute resources to train the model.