BLIP Image Captioning - Arabic (Flickr8k Arabic)
This model is a fine-tuned version of Salesforce/blip-image-captioning-large, designed for generating Arabic captions for images using the Flickr8k Arabic dataset.
Quick Start
Basic Usage
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import torch
import matplotlib.pyplot as plt
processor = BlipProcessor.from_pretrained("omarsabri8756/blip-Arabic-flickr-8k")
model = BlipForConditionalGeneration.from_pretrained("omarsabri8756/blip-Arabic-flickr-8k")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
image_path = "path/to/your/image.jpg"
image = Image.open(image_path).convert("RGB")
plt.imshow(image)
plt.axis('off')
plt.title("Input Image")
plt.show()
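# Switch to inference mode and disable gradient tracking
model.eval()
with torch.no_grad():
    # Preprocess the image into the tensor format the model expects
    pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device)
    # Beam search with repetition controls
    generated_output = model.generate(
        pixel_values=pixel_values,
        max_length=75,
        min_length=20,
        num_beams=5,
        repetition_penalty=1.5,
        length_penalty=1.0,
        no_repeat_ngram_size=3,
        early_stopping=True,
    )
# Decode the generated token IDs into an Arabic caption
caption = processor.batch_decode(generated_output, skip_special_tokens=True)[0]
print(caption)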
Features
- This model is a fine-tuned version of Salesforce/blip-image-captioning-large for Arabic image captioning.
- It can take an input image and generate a relevant Arabic caption describing the image content.
Installation
The code example above assumes that you have installed the necessary libraries: transformers, torch, Pillow, and matplotlib. You can install them using pip:
pip install transformers torch pillow matplotlib
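Before running the Quick Start code, it can help to confirm that the four libraries are importable. The following is a minimal sketch using only the standard library; the helper name `missing_packages` and the `REQUIRED` mapping are hypothetical and not part of this model card's original code.

```python
from importlib.util import find_spec

# pip distribution name -> importable module name.
# Note that Pillow is installed as "pillow" but imported as "PIL".
REQUIRED = {
    "transformers": "transformers",
    "torch": "torch",
    "pillow": "PIL",
    "matplotlib": "matplotlib",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose module cannot be found in this environment."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages; try: pip install " + " ".join(missing))
    else:
        print("All required packages are importable.")
```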
Documentation
Model Sources
Training Details
Training Data
| Property | Details |
|---|---|
| Model Type | Fine-tuned BLIP model for Arabic image captioning |
| Training Data | Flickr8k Arabic dataset, consisting of 8,000 images with 32,000 captions |
The Flickr8k Arabic dataset provides a diverse collection of everyday scenes and activities described in Modern Standard Arabic.
Training Procedure
The model was fine-tuned from the original BLIP model by adapting its language generation capabilities to Arabic text.
Training Hyperparameters
| Parameter | Value |
|---|---|
| Training regime | fp16 mixed precision |
| Optimizer | AdamW |
| Learning rate | 5e-5 |
| per_device_train_batch_size | 2 |
| per_device_eval_batch_size | 16 |
| gradient_accumulation_steps | 14 |
| Total training batch size | 28 |
| Epochs | 5 |
| LR scheduler | Cosine with warmup |
| Weight decay | 0.01 |
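To make two of these settings concrete, the sketch below shows how the total batch size of 28 follows from the per-device batch size and gradient accumulation, and what a cosine schedule with linear warmup looks like. It assumes training on a single device. The warmup length and total step count are not stated in this card, so the values used in the demo are purely illustrative, and `cosine_lr_with_warmup` is a hypothetical helper rather than the training code that was actually used.

```python
import math

# Values taken from the hyperparameter table.
PEAK_LR = 5e-5
PER_DEVICE_TRAIN_BATCH_SIZE = 2
GRADIENT_ACCUMULATION_STEPS = 14

# Assumption: one device, so the effective batch size is 2 * 14 = 28.
EFFECTIVE_BATCH_SIZE = PER_DEVICE_TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS


def cosine_lr_with_warmup(step, total_steps, warmup_steps, peak_lr=PEAK_LR):
    """Linear warmup from 0 to peak_lr, then cosine decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


if __name__ == "__main__":
    # Hypothetical step counts, chosen only for illustration.
    total_steps, warmup_steps = 1000, 100
    print("effective batch size:", EFFECTIVE_BATCH_SIZE)
    for step in (0, 50, 100, 550, 1000):
        lr = cosine_lr_with_warmup(step, total_steps, warmup_steps)
        print(f"step {step:4d}: lr = {lr:.2e}")
```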
Evaluation
Testing Data
The model was evaluated on the Flickr8k Arabic test split, which contains 1,000 images with 4 reference captions each.
Metrics
| Metric | Value |
|---|---|
| BLEU-1 | 65.80 |
| BLEU-2 | 51.33 |
| BLEU-3 | 38.72 |
| BLEU-4 | 28.75 |
| METEOR | 46.29 |
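For readers unfamiliar with BLEU-n, the following is a minimal sketch of corpus-level BLEU with multiple references per image, using clipped n-gram precision and a brevity penalty. It assumes whitespace tokenization and reports scores on a 0 to 100 scale. This card does not say which toolkit or tokenization produced the scores above, so this function is illustrative and is not expected to reproduce them exactly. The example captions are made up and do not come from the dataset.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """BLEU-max_n on a 0-100 scale.

    hypotheses: list of token lists (one per image).
    references: list of lists of token lists (several references per image).
    """
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, refs in zip(hypotheses, references):
        hyp_len += len(hyp)
        # Use the reference length closest to the hypothesis (ties: shorter).
        ref_len += min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            # Clip each n-gram count by its maximum count in any reference.
            max_ref = Counter()
            for r in refs:
                for gram, count in ngrams(r, n).items():
                    max_ref[gram] = max(max_ref[gram], count)
            matched[n - 1] += sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)


if __name__ == "__main__":
    # Made-up captions for illustration only.
    hyps = ["كلب يركض في الحديقة".split()]
    refs = [["كلب يركض في الحديقة الخضراء".split(), "كلب بني يركض".split()]]
    for n in (1, 2, 3, 4):
        print(f"BLEU-{n}: {corpus_bleu(hyps, refs, max_n=n):.2f}")
```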
Results
The model performs well on common scenes and activities, generating grammatically correct and contextually appropriate Arabic captions. However, its performance decreases slightly for unusual scenes or culturally specific contexts not well represented in the training data.
Bias, Risks, and Limitations
Important Note
- The model was trained on Flickr8k Arabic, which may not represent the full diversity of images and linguistic expressions in Arabic-speaking regions.
- It may produce stereotypical or culturally insensitive descriptions.
- Performance may vary across different Arabic dialects and regional expressions.
- It has a limited ability to correctly describe culturally specific items, events, or contexts.
- It may struggle with complex scenes or unusual visual elements.
Recommendations
Usage Tip
- Users should review generated captions before using them in sensitive contexts.
- Consider post-processing or human review for public-facing applications.
- Test across diverse image types relevant to your use case.
- Be aware that the model may reflect biases present in the training data.
- Consider regional and dialectal differences when evaluating caption quality.
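As one possible shape for the post-processing and review steps recommended above, the sketch below tidies a generated caption and flags suspicious ones for a human to check. Every rule here is an assumption made for illustration: stripping diacritics and tatweel, collapsing immediately repeated words, and the three-word minimum are not part of this model's pipeline, and `clean_caption` and `needs_review` are hypothetical helpers. Automated checks like these do not replace human review in sensitive contexts.

```python
import re
import unicodedata

# Arabic diacritics (tashkeel, U+064B..U+0652) and the tatweel mark (U+0640).
_DIACRITICS = re.compile("[\u064B-\u0652]")
_TATWEEL = "\u0640"
# Basic Arabic Unicode block, used to check that the caption contains Arabic.
_ARABIC = re.compile("[\u0600-\u06FF]")


def clean_caption(caption):
    """Normalize whitespace, drop diacritics/tatweel, collapse repeated words."""
    text = unicodedata.normalize("NFC", caption)
    text = _DIACRITICS.sub("", text).replace(_TATWEEL, "")
    words = text.split()
    # Keep a word only if it differs from the word immediately before it.
    deduped = [w for i, w in enumerate(words) if i == 0 or w != words[i - 1]]
    return " ".join(deduped)


def needs_review(caption, min_words=3):
    """Flag captions that are very short or contain no Arabic characters."""
    cleaned = clean_caption(caption)
    too_short = len(cleaned.split()) < min_words
    no_arabic = _ARABIC.search(cleaned) is None
    return too_short or no_arabic


if __name__ == "__main__":
    raw = "كلب  كلب يركض في الحديقة"  # made-up caption with a repeated word
    print(clean_caption(raw))
    print("needs review:", needs_review(raw))
```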
License
This model is licensed under the MIT license.