RGB_Language_Cap: Open-Source Vision-Language Model for Generating Text Descriptions of Spatial Relationships Between Image Entities

Rgb Language Cap

Developed by voxreality

This is a vision-language model trained on the COCO dataset, capable of generating descriptive texts that include spatial relationships between image entities.

Image-to-Text

Transformers

English

Open Source License: Apache-2.0

#Spatial Relationship Description #ViT-GPT2 Architecture #Multi-sentence Image Captioning

Downloads 24

Release Time: 9/3/2024

Model Overview

The model adopts a sequence-to-sequence architecture with a ViT encoder and GPT2 decoder, specifically designed for image caption generation, with outputs always including spatial orientation relationships between objects.

Model Features

Spatial Relationship Awareness

Generated captions explicitly indicate spatial orientation relationships between objects (e.g., 'on the left side').

Controllable Output Length

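Supports limiting the number of sentences generated (up to 5): the max_new_tokens parameter caps generation at roughly 5 sentences, and the caption can be trimmed to fewer sentences in post-processing.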

Lightweight Deployment

Requires only 4GB GPU memory to run.

Model Capabilities

Image Caption Generation

Spatial Relationship Recognition

Multi-sentence Text Generation

Use Cases

Assistive Technology

Visual Impairment Assistance

Generates environment descriptions with spatial relationships for visually impaired users.

Helps users understand the relative positions of objects.

Content Generation

Automatic Image Tagging

Generates metadata with spatial information for image libraries.

Improves the accuracy of image retrieval.

🚀 Spatially Aware Vision-Language (VL) Model

We are creating a spatially aware vision-language (VL) model that generates image captions containing spatial relationship information.

🚀 Quick Start

This model was trained on COCO dataset images augmented with extra information about the spatial relationships between the entities in each image. It is a sequence-to-sequence image-captioning model with a ViT encoder and a GPT2 decoder.

Requirements

- 4GB GPU RAM
- CUDA-enabled Docker
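
The GPU requirement can be checked before loading the model. The following is a minimal sketch, assuming PyTorch is the backend. The helper name `pick_device` and its `min_gib` parameter are hypothetical and not part of the model's published code. The check uses the total memory reported by the driver, which for a nominal 4GB card can be slightly under 4 GiB, so the threshold may need lowering.

```python
def pick_device(min_gib: float = 4.0) -> str:
    """Return "cuda:0" if a CUDA GPU with at least min_gib GiB exists, else "cpu"."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "cpu"
    if not torch.cuda.is_available():
        return "cpu"
    # total_memory is the card's total capacity in bytes, not the free memory.
    total_bytes = torch.cuda.get_device_properties(0).total_memory
    return "cuda:0" if total_bytes >= min_gib * 1024**3 else "cpu"


device = pick_device()
print(device)
```

The returned string can be passed as the device argument when creating the pipeline.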

Download and Run

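import torch
from transformers import pipeline

# Use the first CUDA GPU if available, otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# max_new_tokens=200 caps the caption at roughly 5 sentences.
image_captioner = pipeline("image-to-text", model="voxreality/rgb-language_cap", max_new_tokens=200, device=device)
filename = 'path/to/file'  # path to the input image
generated_captions = image_captioner(filename)
print(generated_captions)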

The model is trained to generate as much text as possible, up to a maximum of 200 tokens, which is approximately 5 sentences. A 6th sentence, if started, is usually cut off at the token limit. The output always follows the form: "Object1" is to the "Left/Right etc." of the "Object2".
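
Because the output follows a fixed sentence form, it can be turned into structured data. The following is a rough sketch that extracts (subject, relation, object) triples with a regular expression. The pattern assumes the literal wording "is to the ... of the ..."; the model's actual phrasing may vary, and `parse_relations` and the sample text are hypothetical illustrations rather than real model output.

```python
import re

# Assumed sentence shape: "<subject> is to the <relation> of the <object>".
# The exact wording the model emits may differ; adjust the pattern as needed.
RELATION_PATTERN = re.compile(
    r"(?P<subject>.+?)\s+is\s+to\s+the\s+(?P<relation>.+?)\s+of\s+(?:the\s+)?(?P<object>.+)",
    re.IGNORECASE,
)


def parse_relations(text):
    """Extract (subject, relation, object) triples from caption text."""
    triples = []
    for sentence in text.split("."):
        match = RELATION_PATTERN.fullmatch(sentence.strip())
        if match:
            triples.append(
                (match["subject"], match["relation"].lower(), match["object"])
            )
    return triples


# Hand-written sample in the assumed form, not actual model output.
sample = "The dog is to the left of the sofa. The lamp is to the right of the table."
print(parse_relations(sample))
```

Sentences that do not match the assumed form are skipped rather than raising an error, so a cropped final sentence is ignored.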

💻 Usage Examples

Basic Usage

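import torch
from transformers import pipeline

# Use the first CUDA GPU if available, otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# max_new_tokens=200 caps the caption at roughly 5 sentences.
image_captioner = pipeline("image-to-text", model="voxreality/rgb-language_cap", max_new_tokens=200, device=device)
filename = 'path/to/file'  # path to the input image
generated_captions = image_captioner(filename)
print(generated_captions)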

Advanced Usage

To keep only a specific number of sentences (up to 5) from the generated caption:

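def print_up_to_n_sentences(captions, n):
    """Return the first n sentences of each generated caption, joined into one string."""
    results = []
    for caption in captions:
        generated_text = caption.get('generated_text', '')
        # Split on periods and discard empty fragments (e.g. after the final period).
        sentences = [s.strip() for s in generated_text.split('.') if s.strip()]
        if sentences:
            results.append('. '.join(sentences[:n]) + '.')
    return ' '.join(results)

# Assumes image_captioner was created as in Basic Usage.
filename = 'path/to/file'
generated_captions = image_captioner(filename)
captions = print_up_to_n_sentences(generated_captions, 5)
print(captions)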
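
Since generation stops at the token limit, the final sentence may end mid-word. One possible way to handle this, sketched below, is to keep only the text up to the last period. The helper name `drop_incomplete_tail` and the sample string are hypothetical and not part of the model's published code.

```python
def drop_incomplete_tail(text: str) -> str:
    """Keep text up to and including the last period; drop any trailing fragment."""
    last_period = text.rfind(".")
    if last_period == -1:
        return ""  # no complete sentence at all
    return text[: last_period + 1].strip()


# Hand-written sample with a cropped second sentence, not actual model output.
cropped = "The cup is to the left of the plate. The fork is to the ri"
print(drop_incomplete_tail(cropped))
```

This can be applied to the generated_text field of each pipeline result before any sentence counting.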

📄 License

This project is licensed under the Apache-2.0 license.

Property	Details
Model Type	Sequence-to-sequence model for image captioning (ViT encoder and GPT2 decoder)
Training Data	COCO dataset images
Library Name	transformers
Pipeline Tag	image-to-text
Tags	text-generation-inference
Metrics	code_eval
