🚀 CLIPfa: Connecting Farsi Text and Images
CLIPfa is a Farsi (Persian) version of OpenAI's CLIP model. It connects Farsi text and images by matching their corresponding vector representations. This project fine-tunes the text and vision encoders on a dataset of 400,000 (image, text) pairs, enabling effective text-image matching in Farsi.
🚀 Quick Start
OpenAI released the paper Learning Transferable Visual Models From Natural Language Supervision, introducing the CLIP (Contrastive Language–Image Pre-training) model. This model connects text and images by matching their corresponding vector representations using a contrastive learning objective. CLIP consists of two separate models: a vision encoder and a text encoder, trained on 400 million images and corresponding captions.
We trained a Farsi (Persian) version of OpenAI's CLIP on a dataset of 400,000 (image, text) pairs. We used Farahani's RoBERTa-fa as the text encoder and the ViT vision encoder from the original CLIP, and fine-tuned both.
- It should be noted that only 400K pairs were used for this training, while 400 million pairs were used for the original CLIP, whose training took about 30 days across 592 V100 GPUs.
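For intuition, the contrastive objective pulls each image embedding toward the embedding of its own caption and pushes it away from the other captions in the batch. The snippet below is only a minimal sketch of that symmetric loss in PyTorch; the function name, variable names, and temperature value are illustrative and not taken from the CLIPfa training code.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize both sets of embeddings so the dot product is a cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # Pairwise similarity matrix: logits[i, j] compares image i with caption j.
    logits = image_embeds @ text_embeds.t() / temperature
    # The matching pair for each image/caption sits on the diagonal.
    targets = torch.arange(len(logits), device=logits.device)
    # Symmetric cross-entropy: pick the right caption per image and vice versa.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2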
💻 Usage Examples
Basic Usage
Both encoders produce 768-dimensional embeddings, so text and image vectors can be compared directly.
from transformers import CLIPVisionModel, RobertaModel, AutoTokenizer, CLIPFeatureExtractor
from PIL import Image

# Load the fine-tuned Farsi text encoder and the CLIP vision encoder.
vision_encoder = CLIPVisionModel.from_pretrained('SajjadAyoubi/clip-fa-vision')
preprocessor = CLIPFeatureExtractor.from_pretrained('SajjadAyoubi/clip-fa-vision')
text_encoder = RobertaModel.from_pretrained('SajjadAyoubi/clip-fa-text')
tokenizer = AutoTokenizer.from_pretrained('SajjadAyoubi/clip-fa-text')

text = 'something'
image = Image.open('my_favorite_image.jpg')

# Encode the text and the image into 768-dimensional vectors.
text_embedding = text_encoder(**tokenizer(text, return_tensors='pt')).pooler_output
image_embedding = vision_encoder(**preprocessor(image, return_tensors='pt')).pooler_output
assert text_embedding.shape == image_embedding.shape
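Once both vectors exist, how well the caption matches the image can be read off their cosine similarity. The short follow-up below is illustrative, reusing the text_embedding and image_embedding tensors from the snippet above.

import torch.nn.functional as F

# A score close to 1 means the Farsi text and the image are a good match.
similarity = F.cosine_similarity(text_embedding, image_embedding)
print(similarity.item())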
Advanced Usage
The following are some use cases of CLIPfa, demonstrated on 25K images from Unsplash.
- First, install the required package:
pip install -q git+https://github.com/sajjjadayobi/clipfa.git
from clipfa import CLIPDemo

demo = CLIPDemo(vision_encoder, text_encoder, tokenizer)
# Precompute embeddings for a few Farsi labels: 'گاو' (cow), 'اسب' (horse), 'ماهی' (fish).
demo.compute_text_embeddings(['گاو', 'اسب', 'ماهی'])
# test_df is a DataFrame whose image_path column points to the Unsplash images.
demo.compute_image_embeddings(test_df.image_path.to_list())
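For a sense of what such a demo does under the hood, the sketch below ranks a collection of images against a Farsi query using the encoders from the Basic Usage snippet directly, rather than CLIPDemo's internal state. The search_images helper and its arguments are hypothetical and not part of the clipfa package.

import torch
import torch.nn.functional as F
from PIL import Image

def search_images(query, image_paths, top_k=5):
    with torch.no_grad():
        # Embed the Farsi query once with the text encoder.
        text_emb = text_encoder(**tokenizer(query, return_tensors='pt')).pooler_output
        # Embed every candidate image (for 25K images you would cache these embeddings).
        image_embs = torch.cat([
            vision_encoder(**preprocessor(Image.open(p), return_tensors='pt')).pooler_output
            for p in image_paths
        ])
    # Rank images by cosine similarity to the query embedding.
    scores = F.cosine_similarity(text_emb, image_embs)
    top = scores.topk(min(top_k, len(image_paths)))
    return [(image_paths[i], s.item()) for i, s in zip(top.indices.tolist(), top.values)]

# e.g. search_images('اسب', test_df.image_path.to_list())  # 'اسب' means 'horse'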
🌐 Online Demo
CLIPfa at Hugging Face 🤗 Spaces
We used a small set of images (25K) to keep this app close to real-time, but the quality of image search depends heavily on the size of the image database.
Made with ❤️ in my basement🤫