# Italian CLIP
This project presents an Italian CLIP model. With only 1.4 million training samples and a few targeted training techniques, we fine-tuned a competitive Italian CLIP model built on the Italian BERT model and OpenAI's Vision Transformer.
## Quick Start
Want to try the model right away? Visit our demo application. It covers all the project details, from training tricks to results.
Paper: Contrastive Language-Image Pre-training for the Italian Language
Our Italian CLIP model is based on the Italian BERT model released by dbmdz and OpenAI's Vision Transformer (ViT).
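As a quick illustration of how the two encoders work together, the following minimal sketch scores Italian captions against an image with a CLIP-style dual encoder through the Transformers library. The Hub ID and the `VisionTextDualEncoder` classes used here are assumptions, not a guaranteed interface; see the demo for the canonical usage.

```python
# Minimal sketch: score Italian captions against an image with a
# CLIP-style dual encoder. The model ID and classes below are
# assumptions; adapt them to the actual published checkpoint.
import requests
from PIL import Image
from transformers import (VisionTextDualEncoderModel,
                          VisionTextDualEncoderProcessor)

model_id = "clip-italian/clip-italian"  # assumed Hub ID
model = VisionTextDualEncoderModel.from_pretrained(model_id)
processor = VisionTextDualEncoderProcessor.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# "a photo of a cat" / "a photo of a dog"
captions = ["una foto di un gatto", "una foto di un cane"]

inputs = processor(text=captions, images=image,
                   return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image: similarity of the image to each caption
probs = outputs.logits_per_image.softmax(dim=1)
print(probs)  # higher probability for the matching caption
```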
## Training Data
We considered four main data sources:
- WIT: an image-caption dataset based on Wikipedia (see Srinivasan et al., 2021).
- MSCOCO-IT: an image-caption dataset from [Scaiella et al., 2019](http://www.ai-lc.it/IJCoL/v5n2/IJCOL_5_2_3___scaiella_et_al.pdf).
- Conceptual Captions: from [Sharma et al., 2018](https://aclanthology.org/P18-1238.pdf).
- [La Foto del Giorno](https://www.ilpost.it/foto-del-giorno/): collected from Il Post, a well-known Italian online newspaper.
Our improvements come from better data augmentation, strategic training choices, and a pre-training phase with frozen backbones. For more details, refer to our demo.
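To make the backbone-freezing idea concrete, the PyTorch sketch below freezes both backbones and trains only the projection heads during a warm-up phase. This is an illustrative reconstruction, not our actual Flax training code; the layer sizes are placeholders.

```python
import torch
import torch.nn as nn

# Stand-ins for the real BERT and ViT backbones (hypothetical sizes).
text_backbone = nn.Sequential(nn.Linear(768, 768), nn.GELU())
vision_backbone = nn.Sequential(nn.Linear(768, 768), nn.GELU())

def freeze(module: nn.Module) -> None:
    """Disable gradient updates for every parameter in `module`."""
    for p in module.parameters():
        p.requires_grad = False

# Phase 1: freeze both backbones, train only the projection heads.
freeze(text_backbone)
freeze(vision_backbone)
text_proj = nn.Linear(768, 512)
vision_proj = nn.Linear(768, 512)

trainable = list(text_proj.parameters()) + list(vision_proj.parameters())
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

# Phase 2 (later in training): unfreeze the backbones and fine-tune
# end to end with a lower learning rate -- an assumed schedule.
```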
## Experiments
### Quantitative Evaluation
To assess the performance of our clip-italian model, we conducted an experimental evaluation. Since ours is the first CLIP-based model for Italian, we used the multilingual CLIP model (mCLIP) as a baseline.
### mCLIP
Multilingual CLIP (mCLIP) was introduced by Nils Reimers in his sentence-transformers library. It is based on a multilingual text encoder created through multilingual knowledge distillation (see [Reimers et al., 2020](https://aclanthology.org/2020.emnlp-main.365/)).
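For reference, the mCLIP baseline can be loaded through the sentence-transformers library roughly as follows. The checkpoint names are the publicly documented ones; treat this as a sketch rather than our exact evaluation setup.

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# Multilingual text encoder distilled from CLIP's text tower,
# paired with the original CLIP image encoder.
text_model = SentenceTransformer("sentence-transformers/clip-ViT-B-32-multilingual-v1")
image_model = SentenceTransformer("clip-ViT-B-32")

captions = ["un gatto che dorme sul divano"]  # "a cat sleeping on the sofa"
text_emb = text_model.encode(captions)
img_emb = image_model.encode(Image.open("photo.jpg"))  # placeholder path

# Cosine similarity between the caption and the image.
print(util.cos_sim(text_emb, img_emb))
```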
### Tasks
We selected two tasks:
- Image retrieval
- Zero-shot classification
### Reproducibility
Both experiments are easy to replicate; we share the two Colab notebooks we used to produce the results:
- [Image Retrieval](https://colab.research.google.com/drive/1bLVwVKpAndpEDHqjzxVPr_9nGrSbuOQd?usp=sharing)
- [ImageNet Zero Shot Evaluation](https://colab.research.google.com/drive/1zfWeVWY79XXH63Ci-pk8xxx3Vu_RRgW-?usp=sharing)
### Image Retrieval
This experiment was performed on the MSCOCO-IT validation set, which was not used in training. Given a caption, we search for the most similar image. We use MRR@K as the evaluation metric.
| MRR | CLIP-Italian | mCLIP |
| --- | --- | --- |
| MRR@1 | 0.3797 | 0.2874 |
| MRR@5 | 0.5039 | 0.3957 |
| MRR@10 | 0.5204 | 0.4129 |
Note that, although MSCOCO-IT was part of our training data, the original CLIP model was itself trained on 400 million images, some of which may have come from MSCOCO.
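To make the metric concrete, here is a minimal sketch of MRR@K for caption-to-image retrieval: for each caption we rank all images by similarity and average the reciprocal rank of the ground-truth image, counting 0 when it falls outside the top K. Variable names are illustrative.

```python
import numpy as np

def mrr_at_k(text_embs, image_embs, k):
    """Mean Reciprocal Rank@K for caption -> image retrieval.

    text_embs[i] is the embedding of the caption whose ground-truth
    image has embedding image_embs[i]; both are L2-normalized.
    """
    sims = text_embs @ image_embs.T              # cosine similarities
    order = np.argsort(-sims, axis=1)[:, :k]     # top-K images per caption
    reciprocal_ranks = []
    for i, top_k in enumerate(order):
        hits = np.where(top_k == i)[0]           # position of the true image
        reciprocal_ranks.append(1.0 / (hits[0] + 1) if hits.size else 0.0)
    return float(np.mean(reciprocal_ranks))

# Toy example with random unit vectors:
rng = np.random.default_rng(0)
t = rng.normal(size=(100, 512)); t /= np.linalg.norm(t, axis=1, keepdims=True)
v = rng.normal(size=(100, 512)); v /= np.linalg.norm(v, axis=1, keepdims=True)
print(mrr_at_k(t, v, k=10))
```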
### Zero-shot Image Classification
This experiment replicates OpenAI's zero-shot image classification on ImageNet. We used DeepL to translate the ImageNet labels into Italian, and we evaluate the models by computing accuracy@K.
| Accuracy | CLIP-Italian | mCLIP |
| --- | --- | --- |
| Accuracy@1 | 22.11 | 20.15 |
| Accuracy@5 | 43.69 | 36.57 |
| Accuracy@10 | 52.55 | 42.91 |
| Accuracy@100 | 81.08 | 67.11 |
Our results show that CLIP-Italian is competitive, outperforming mCLIP on both tasks. However, our scores are lower than those reported in the original OpenAI paper; the machine-translated image labels may account for part of the gap.
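For reference, the zero-shot protocol can be sketched as follows: embed the (translated) class names, embed each image, and check whether the true class appears among the K classes most similar to the image. The prompt template mentioned in the comment is an assumption; see the linked Colab notebook for the real setup.

```python
import numpy as np

def accuracy_at_k(image_embs, class_embs, labels, k):
    """Top-K accuracy: the true class must appear among the K classes
    most similar to the image. Embeddings are L2-normalized."""
    sims = image_embs @ class_embs.T             # (n_images, n_classes)
    top_k = np.argsort(-sims, axis=1)[:, :k]     # top-K classes per image
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())

# In practice: class_embs come from encoding Italian prompts such as
# "una foto di {label}" (an assumed template) with the text encoder,
# and image_embs from the vision encoder. Toy example:
rng = np.random.default_rng(0)
imgs = rng.normal(size=(50, 512)); imgs /= np.linalg.norm(imgs, axis=1, keepdims=True)
cls = rng.normal(size=(1000, 512)); cls /= np.linalg.norm(cls, axis=1, keepdims=True)
labels = rng.integers(0, 1000, size=50)
print(accuracy_at_k(imgs, cls, labels, k=5))
```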
## Team Members
## License
This project is licensed under the GPL-3.0 license.
| Property | Details |
| --- | --- |
| Model Type | Italian CLIP |
| Training Data | WIT, MSCOCO-IT, Conceptual Captions, La Foto del Giorno |
| Tags | italian, bert, vit, vision |