blip-image-captioning-base-football-finetuned open-source model - Free deployment and accurate generation of football image descriptions

Blip Image Captioning Base Football Finetuned

Developed by ybelkada

A vision-language model pre-trained on COCO and fine-tuned on a football dataset, proficient in generating image captions

Image-to-Text

Transformers

Open Source License: BSD-3-Clause #Vision-Language Pretraining #Image Caption Generation #Multi-Task Unified Framework

Downloads 71

Release Time: 1/17/2023

Model Overview

BLIP is a unified vision-language pre-training framework, excelling in image understanding and caption generation tasks. This version is an image caption generation model fine-tuned on a football dataset.

Model Features

Unified Vision-Language Framework

Supports both visual understanding and language generation tasks simultaneously

Guided Annotation Strategy

Effectively utilizes noisy data through synthetic caption generation and filtering mechanisms

Optimized for Football Scenarios

Fine-tuned on a football dataset, providing more accurate descriptions of sports scenes

Model Capabilities

Image Caption Generation

Conditional Text Generation

Vision-Language Understanding

Use Cases

Sports Media

Automatic Annotation of Football Match Images

Generate descriptive text for match pictures in sports news

Improve the efficiency of sports content production

Accessibility Technology

Visual Assistance Application

Describe image content for visually impaired people

Enhance the accessibility of digital content

🚀 BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

This is a model card for image captioning. The model is pre-trained on the COCO dataset (base architecture with a ViT base backbone) and fine-tuned on the football dataset. It can be used for both conditional and unconditional image captioning.

Google Colab notebook for fine-tuning: https://colab.research.google.com/drive/1lbqiSiA0sDF7JDWPeS0tccrM85LloVha?usp=sharing


Pull figure from BLIP official repo

🚀 Quick Start

This model can be used for conditional and unconditional image captioning. Follow the usage examples below to get started.

✨ Features

Flexible Transfer: BLIP transfers flexibly to both vision-language understanding and generation tasks.
Effective Use of Noisy Data: BLIP makes use of noisy web data by bootstrapping the captions: a captioner generates synthetic captions and a filter removes the noisy ones.
State-of-the-Art Results: The paper reports state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval, image captioning, and VQA.
Strong Generalization: BLIP shows strong generalization ability when transferred directly to video-language tasks in a zero-shot manner.
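
The caption bootstrapping described in the feature list can be illustrated with a toy sketch. This is not BLIP's implementation: `toy_captioner` and `toy_filter_score` are hypothetical stand-ins for the learned captioner and filter, and the word-overlap score, the `IMAGE_TAGS` data, and the 0.5 threshold are assumptions chosen only to make the data flow visible.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (image_id, caption)


def bootstrap_captions(
    web_pairs: List[Pair],
    captioner: Callable[[str], str],
    filter_score: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[Pair]:
    """Keep web and synthetic captions whose filter score reaches the threshold."""
    kept: List[Pair] = []
    for image_id, web_caption in web_pairs:
        # Each image gets two candidates: the original web text and a synthetic caption.
        candidates = [web_caption, captioner(image_id)]
        for caption in candidates:
            if filter_score(image_id, caption) >= threshold:
                kept.append((image_id, caption))
    return kept


# Hypothetical image content, standing in for what a vision model would see.
IMAGE_TAGS = {"img1": {"player", "ball"}, "img2": {"goalkeeper", "net"}}


def toy_captioner(image_id: str) -> str:
    """Stand-in captioner: builds a caption from the image's tags."""
    return "a " + " and a ".join(sorted(IMAGE_TAGS[image_id]))


def toy_filter_score(image_id: str, caption: str) -> float:
    """Stand-in filter: fraction of the image's tags mentioned in the caption."""
    words = set(caption.lower().split())
    tags = IMAGE_TAGS[image_id]
    return len(words & tags) / len(tags)


web_pairs = [
    ("img1", "a player kicks the ball"),  # relevant web text
    ("img2", "click here for tickets"),   # noisy web text
]
cleaned = bootstrap_captions(web_pairs, toy_captioner, toy_filter_score)
for image_id, caption in cleaned:
    print(image_id, "->", caption)
```

In this toy run the noisy web caption for `img2` is dropped while its synthetic caption is kept, which is the effect the bootstrapping step is meant to have on a much larger scale.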

💻 Usage Examples

Basic Usage

You can use this model for conditional and unconditional image captioning.

Using the PyTorch model

Running the model on CPU

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("ybelkada/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("ybelkada/blip-image-captioning-base")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a photography of a woman and her dog

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a woman sitting on the beach with her dog

Running the model on GPU

In full precision

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base").to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a photography of a woman and her dog

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a woman sitting on the beach with her dog

In half precision (`float16`)

import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base", torch_dtype=torch.float16).to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a photography of a woman and her dog

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a woman sitting on the beach with her dog
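
The CPU, full-precision GPU, and half-precision GPU examples above follow the same steps and differ only in device and dtype. A minimal sketch of a helper that factors this out is shown below. `caption` and its parameters are hypothetical names, not part of `transformers`; the sketch assumes the processor output supports `.to(device)` and `.to(device, dtype)` in the way the examples above use them.

```python
def caption(model, processor, image, prompt=None, device="cpu", dtype=None):
    """Return a caption for `image`.

    Pass `prompt` for conditional captioning; leave it as None for
    unconditional captioning. `model` is expected to already live on
    `device` (and to have been loaded with `dtype`, if one is given).
    """
    if prompt is None:
        inputs = processor(image, return_tensors="pt")
    else:
        inputs = processor(image, prompt, return_tensors="pt")

    # Move the inputs to the target device; cast only when a dtype is requested.
    inputs = inputs.to(device) if dtype is None else inputs.to(device, dtype)

    out = model.generate(**inputs)
    return processor.decode(out[0], skip_special_tokens=True)
```

With `model`, `processor`, and `raw_image` set up as in the half-precision example, a call would look like `caption(model, processor, raw_image, prompt="a photography of", device="cuda", dtype=torch.float16)`.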

📚 Documentation

TL;DR

The authors of the paper write in the abstract:

Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released.
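
For readers unfamiliar with the retrieval metric quoted in the abstract, the sketch below shows one common way to compute recall@k from a similarity matrix. It is an illustration only, not the paper's evaluation code: it assumes each query has exactly one correct candidate, located at the same index, and the similarity values are made up.

```python
from typing import List


def recall_at_k(similarity: List[List[float]], k: int = 1) -> float:
    """Fraction of queries whose true match (same index) ranks in the top k."""
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices ordered from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Made-up image-to-text similarity scores: row i is image i, column j is caption j.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
print(recall_at_k(sim, k=1))
print(recall_at_k(sim, k=2))
```

Here the second image's correct caption is only ranked second, so it counts as a miss at k=1 and as a hit at k=2.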

📄 License

This model is licensed under the BSD-3-Clause license.

BibTeX and citation info

@misc{https://doi.org/10.48550/arxiv.2201.12086,
  doi = {10.48550/ARXIV.2201.12086},
  url = {https://arxiv.org/abs/2201.12086},
  author = {Li, Junnan and Li, Dongxu and Xiong, Caiming and Hoi, Steven},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences},
  title = {BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}

Property	Details
Model Type	Image captioning model pre-trained on COCO and fine-tuned on a football dataset
Training Data	COCO dataset, ybelkada/football-dataset
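
The table names the fine-tuning dataset, and the page title names the fine-tuned checkpoint. The sketch below shows how both might be loaded. The checkpoint ID is an assumption assembled from the developer name and the page title, the `train` split is also an assumption, and `load_captioner` and `load_training_data` are hypothetical helper names. Imports are placed inside the functions so that nothing is downloaded until a helper is called.

```python
# Assumed checkpoint ID: developer name plus the model name from the page title.
CHECKPOINT = "ybelkada/blip-image-captioning-base-football-finetuned"
# Dataset ID as listed under "Training Data".
DATASET = "ybelkada/football-dataset"


def load_captioner(checkpoint: str = CHECKPOINT):
    """Load the processor and captioning model for `checkpoint` (downloads weights)."""
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(checkpoint)
    model = BlipForConditionalGeneration.from_pretrained(checkpoint)
    return processor, model


def load_training_data(name: str = DATASET, split: str = "train"):
    """Load the fine-tuning dataset from the Hugging Face Hub (split name assumed)."""
    from datasets import load_dataset

    return load_dataset(name, split=split)
```

After `dataset = load_training_data()`, inspecting `dataset.column_names` shows which fields hold the images and captions before any fine-tuning code is written against them.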
