# Italian CLIP
This project presents an Italian CLIP model. With only 1.4 million training samples and a few targeted training techniques, we fine-tuned a competitive Italian CLIP model built on the Italian BERT model and OpenAI's Vision Transformer.
## Quick Start
Want to try the model right away? Visit our demo application. It covers all the project details, from training tricks to results.
Paper: Contrastive Language-Image Pre-training for the Italian Language
Our Italian CLIP model is based on the Italian BERT model released by dbmdz and OpenAI's Vision Transformer (ViT).
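As a quick illustration of how the two encoders work together, the following minimal sketch scores Italian captions against an image with a CLIP-style dual encoder through the Transformers library. The Hub ID and the `VisionTextDualEncoder` classes used here are assumptions, not a guaranteed interface; see the demo for the canonical usage.

```python
# Minimal sketch: score Italian captions against an image with a
# CLIP-style dual encoder. The model ID and classes below are
# assumptions; adapt them to the actual published checkpoint.
import requests
from PIL import Image
from transformers import (VisionTextDualEncoderModel,
                          VisionTextDualEncoderProcessor)

model_id = "clip-italian/clip-italian"  # assumed Hub ID
model = VisionTextDualEncoderModel.from_pretrained(model_id)
processor = VisionTextDualEncoderProcessor.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# "a photo of a cat" / "a photo of a dog"
captions = ["una foto di un gatto", "una foto di un cane"]

inputs = processor(text=captions, images=image,
                   return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image: similarity of the image to each caption
probs = outputs.logits_per_image.softmax(dim=1)
print(probs)  # higher probability for the matching caption
```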
## Training Data
We considered four main data sources:
- WIT: an image-caption dataset based on Wikipedia (see Srinivasan et al., 2021).
- MSCOCO-IT: an image-caption dataset from [Scaiella et al., 2019](http://www.ai-lc.it/IJCoL/v5n2/IJCOL_5_2_3___scaiella_et_al.pdf).
- Conceptual Captions: from [Sharma et al., 2018](https://aclanthology.org/P18-1238.pdf).
- [La Foto del Giorno](https://www.ilpost.it/foto-del-giorno/): collected from Il Post, a well-known Italian online newspaper.
Our improvements come from better data augmentation, strategic training choices, and a pre-training phase with frozen backbones. For more details, refer to our demo.
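To make the backbone-freezing idea concrete, the PyTorch sketch below freezes both backbones and trains only the projection heads during a warm-up phase. This is an illustrative reconstruction, not our actual Flax training code; the layer sizes are placeholders.

```python
import torch
import torch.nn as nn

# Stand-ins for the real BERT and ViT backbones (hypothetical sizes).
text_backbone = nn.Sequential(nn.Linear(768, 768), nn.GELU())
vision_backbone = nn.Sequential(nn.Linear(768, 768), nn.GELU())

def freeze(module: nn.Module) -> None:
    """Disable gradient updates for every parameter in `module`."""
    for p in module.parameters():
        p.requires_grad = False

# Phase 1: freeze both backbones, train only the projection heads.
freeze(text_backbone)
freeze(vision_backbone)
text_proj = nn.Linear(768, 512)
vision_proj = nn.Linear(768, 512)

trainable = list(text_proj.parameters()) + list(vision_proj.parameters())
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

# Phase 2 (later in training): unfreeze the backbones and fine-tune
# end to end with a lower learning rate -- an assumed schedule.
```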
## Experiments
### Quantitative Evaluation
To assess the performance of our clip-italian model, we conducted an experimental evaluation. Since ours is the first CLIP-based model for Italian, we used the multilingual CLIP model (mCLIP) as a baseline.
### mCLIP
Multilingual CLIP (mCLIP) was introduced by Nils Reimers in his sentence-transformers library. It is based on a multilingual text encoder created through multilingual knowledge distillation (see [Reimers et al., 2020](https://aclanthology.org/2020.emnlp-main.365/)).
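For reference, the mCLIP baseline can be loaded through the sentence-transformers library roughly as follows. The checkpoint names are the publicly documented ones; treat this as a sketch rather than our exact evaluation setup.

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# Multilingual text encoder distilled from CLIP's text tower,
# paired with the original CLIP image encoder.
text_model = SentenceTransformer("sentence-transformers/clip-ViT-B-32-multilingual-v1")
image_model = SentenceTransformer("clip-ViT-B-32")

captions = ["un gatto che dorme sul divano"]  # "a cat sleeping on the sofa"
text_emb = text_model.encode(captions)
img_emb = image_model.encode(Image.open("photo.jpg"))  # placeholder path

# Cosine similarity between the caption and the image.
print(util.cos_sim(text_emb, img_emb))
```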
### Tasks
We selected two tasks:
- Image retrieval
- Zero-shot classification
### Reproducibility
Both experiments are easy to replicate; we share the two Colab notebooks we used to produce the results:
- [Image Retrieval](https://colab.research.google.com/drive/1bLVwVKpAndpEDHqjzxVPr_9nGrSbuOQd?usp=sharing)
- [ImageNet Zero Shot Evaluation](https://colab.research.google.com/drive/1zfWeVWY79XXH63Ci-pk8xxx3Vu_RRgW-?usp=sharing)
### Image Retrieval
This experiment was performed on the MSCOCO-IT validation set, which was not used in training. Given a caption, we search for the most similar image. We use MRR@K as the evaluation metric.
| MRR | CLIP-Italian | mCLIP |
| --- | --- | --- |
| MRR@1 | 0.3797 | 0.2874 |
| MRR@5 | 0.5039 | 0.3957 |
| MRR@10 | 0.5204 | 0.4129 |
Note that, although MSCOCO-IT was part of our training data, the original CLIP model was itself trained on 400 million images, some of which may have come from MSCOCO.
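To make the metric concrete, here is a minimal sketch of MRR@K for caption-to-image retrieval: for each caption we rank all images by similarity and average the reciprocal rank of the ground-truth image, counting 0 when it falls outside the top K. Variable names are illustrative.

```python
import numpy as np

def mrr_at_k(text_embs, image_embs, k):
    """Mean Reciprocal Rank@K for caption -> image retrieval.

    text_embs[i] is the embedding of the caption whose ground-truth
    image has embedding image_embs[i]; both are L2-normalized.
    """
    sims = text_embs @ image_embs.T              # cosine similarities
    order = np.argsort(-sims, axis=1)[:, :k]     # top-K images per caption
    reciprocal_ranks = []
    for i, top_k in enumerate(order):
        hits = np.where(top_k == i)[0]           # position of the true image
        reciprocal_ranks.append(1.0 / (hits[0] + 1) if hits.size else 0.0)
    return float(np.mean(reciprocal_ranks))

# Toy example with random unit vectors:
rng = np.random.default_rng(0)
t = rng.normal(size=(100, 512)); t /= np.linalg.norm(t, axis=1, keepdims=True)
v = rng.normal(size=(100, 512)); v /= np.linalg.norm(v, axis=1, keepdims=True)
print(mrr_at_k(t, v, k=10))
```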
### Zero-shot Image Classification
This experiment replicates OpenAI's zero-shot image classification on ImageNet. We used DeepL to translate the ImageNet labels into Italian, and we evaluate the models by computing accuracy@K.
| Accuracy | CLIP-Italian | mCLIP |
| --- | --- | --- |
| Accuracy@1 | 22.11 | 20.15 |
| Accuracy@5 | 43.69 | 36.57 |
| Accuracy@10 | 52.55 | 42.91 |
| Accuracy@100 | 81.08 | 67.11 |
Our results show that CLIP-Italian is competitive, outperforming mCLIP on both tasks. However, our scores are lower than those reported in the original OpenAI paper; the machine-translated image labels may account for part of the gap.
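For reference, the zero-shot protocol can be sketched as follows: embed the (translated) class names, embed each image, and check whether the true class appears among the K classes most similar to the image. The prompt template mentioned in the comment is an assumption; see the linked Colab notebook for the real setup.

```python
import numpy as np

def accuracy_at_k(image_embs, class_embs, labels, k):
    """Top-K accuracy: the true class must appear among the K classes
    most similar to the image. Embeddings are L2-normalized."""
    sims = image_embs @ class_embs.T             # (n_images, n_classes)
    top_k = np.argsort(-sims, axis=1)[:, :k]     # top-K classes per image
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())

# In practice: class_embs come from encoding Italian prompts such as
# "una foto di {label}" (an assumed template) with the text encoder,
# and image_embs from the vision encoder. Toy example:
rng = np.random.default_rng(0)
imgs = rng.normal(size=(50, 512)); imgs /= np.linalg.norm(imgs, axis=1, keepdims=True)
cls = rng.normal(size=(1000, 512)); cls /= np.linalg.norm(cls, axis=1, keepdims=True)
labels = rng.integers(0, 1000, size=50)
print(accuracy_at_k(imgs, cls, labels, k=5))
```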
## Team Members
## License
This project is licensed under the GPL-3.0 license.
| Property | Details |
| --- | --- |
| Model Type | Italian CLIP |
| Training Data | WIT, MSCOCO-IT, Conceptual Captions, La Foto del Giorno |
| Tags | italian, bert, vit, vision |