🚀 CLIPfa: Connecting Farsi Text and Images
CLIPfa is a Farsi (Persian) version of OpenAI's CLIP model. It connects Farsi text and images by matching their corresponding vector representations. This project fine-tunes the text and vision encoders on a dataset of 400,000 (image, text) pairs, enabling effective text-image matching in Farsi.
🚀 Quick Start
OpenAI released the paper Learning Transferable Visual Models From Natural Language Supervision, introducing the CLIP (Contrastive Language–Image Pre-training) model. This model connects text and images by matching their corresponding vector representations using a contrastive learning objective. CLIP consists of two separate models: a vision encoder and a text encoder, trained on 400 million images and corresponding captions.
We trained a Farsi (Persian) version of OpenAI's CLIP on a dataset of 400,000 (image, text) pairs. We used Farahani's RoBERTa-fa as the text encoder and the ViT vision encoder from the original CLIP, and fine-tuned both.
- It should be noted that only 400K pairs were used for this training, while 400 million pairs were used for the original CLIP, whose training took about 30 days across 592 V100 GPUs.
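For intuition, the contrastive objective pulls each image embedding toward the embedding of its own caption and pushes it away from the other captions in the batch. The snippet below is only a minimal sketch of that symmetric loss in PyTorch; the function name, variable names, and temperature value are illustrative and not taken from the CLIPfa training code.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize both sets of embeddings so the dot product is a cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # Pairwise similarity matrix: logits[i, j] compares image i with caption j.
    logits = image_embeds @ text_embeds.t() / temperature
    # The matching pair for each image/caption sits on the diagonal.
    targets = torch.arange(len(logits), device=logits.device)
    # Symmetric cross-entropy: pick the right caption per image and vice versa.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2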
💻 Usage Examples
Basic Usage
Both encoders produce 768-dimensional embeddings, so text and image vectors can be compared directly.
from transformers import CLIPVisionModel, RobertaModel, AutoTokenizer, CLIPFeatureExtractor
from PIL import Image

# Load the fine-tuned Farsi text encoder and the CLIP vision encoder.
vision_encoder = CLIPVisionModel.from_pretrained('SajjadAyoubi/clip-fa-vision')
preprocessor = CLIPFeatureExtractor.from_pretrained('SajjadAyoubi/clip-fa-vision')
text_encoder = RobertaModel.from_pretrained('SajjadAyoubi/clip-fa-text')
tokenizer = AutoTokenizer.from_pretrained('SajjadAyoubi/clip-fa-text')

text = 'something'
image = Image.open('my_favorite_image.jpg')

# Encode the text and the image into 768-dimensional vectors.
text_embedding = text_encoder(**tokenizer(text, return_tensors='pt')).pooler_output
image_embedding = vision_encoder(**preprocessor(image, return_tensors='pt')).pooler_output
assert text_embedding.shape == image_embedding.shape
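Once both vectors exist, how well the caption matches the image can be read off their cosine similarity. The short follow-up below is illustrative, reusing the text_embedding and image_embedding tensors from the snippet above.

import torch.nn.functional as F

# A score close to 1 means the Farsi text and the image are a good match.
similarity = F.cosine_similarity(text_embedding, image_embedding)
print(similarity.item())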
Advanced Usage
The following are some use cases of CLIPfa, demonstrated on 25K images from Unsplash.
- First, install the required package:
pip install -q git+https://github.com/sajjjadayobi/clipfa.git
from clipfa import CLIPDemo

demo = CLIPDemo(vision_encoder, text_encoder, tokenizer)
# Precompute embeddings for a few Farsi labels: 'گاو' (cow), 'اسب' (horse), 'ماهی' (fish).
demo.compute_text_embeddings(['گاو', 'اسب', 'ماهی'])
# test_df is a DataFrame whose image_path column points to the Unsplash images.
demo.compute_image_embeddings(test_df.image_path.to_list())
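For a sense of what such a demo does under the hood, the sketch below ranks a collection of images against a Farsi query using the encoders from the Basic Usage snippet directly, rather than CLIPDemo's internal state. The search_images helper and its arguments are hypothetical and not part of the clipfa package.

import torch
import torch.nn.functional as F
from PIL import Image

def search_images(query, image_paths, top_k=5):
    with torch.no_grad():
        # Embed the Farsi query once with the text encoder.
        text_emb = text_encoder(**tokenizer(query, return_tensors='pt')).pooler_output
        # Embed every candidate image (for 25K images you would cache these embeddings).
        image_embs = torch.cat([
            vision_encoder(**preprocessor(Image.open(p), return_tensors='pt')).pooler_output
            for p in image_paths
        ])
    # Rank images by cosine similarity to the query embedding.
    scores = F.cosine_similarity(text_emb, image_embs)
    top = scores.topk(min(top_k, len(image_paths)))
    return [(image_paths[i], s.item()) for i, s in zip(top.indices.tolist(), top.values)]

# e.g. search_images('اسب', test_df.image_path.to_list())  # 'اسب' means 'horse'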
🌐 Online Demo
CLIPfa at Hugging Face 🤗 Spaces
We used a small set of images (25K) to keep this app close to real-time, but the quality of image search depends heavily on the size of the image database.
Made with ❤️ in my basement🤫