CogFlorence-2.2-Large Open-Source Model - Free Deployment to Facilitate Precise Image-to-Text Tasks

CogFlorence 2.2 Large

Developed by thwri

This model is a fine-tuned version of microsoft/Florence-2-large, trained on a 40,000-image subset of the Ejafa/ye-pop dataset, with annotation texts generated by THUDM/cogvlm2-llama3-chat-19B, suitable for image-to-text tasks.

Image-to-Text

Transformers

Supports Multiple Languages

Open Source License: MIT

#Artistic Image Annotation #Multimodal Generation #High-Precision Description

Downloads 20.64k

Release Time: 8/23/2024

Model Overview

A fine-tuned vision-language model focused on generating detailed image descriptions and annotations.

Model Features

High-Quality Image Annotation

Generates detailed, accurate image descriptions that capture both the fine detail and the mood of an image.

Multi-Stage Annotation Processing

Annotation texts are generated by CogVLM2 and then processed by Gemma, improving clarity of expression

Optimized Visual Encoding

Visual encoder parameters remain frozen during training, ensuring stability of visual features

Model Capabilities

Image Description Generation

Image Content Analysis

Visual Scene Understanding

Detailed Image Annotation

Use Cases

Content Creation

Automatic Image Annotation

Automatically generate detailed descriptions for images in a library

Improves image retrieval efficiency and enhances accessibility

Assistive Technology

Visual Impairment Assistance

Provide detailed image descriptions for visually impaired users

Helps in understanding visual content

🚀 microsoft/Florence-2-large tuned on Ejafa/ye-pop captioned with CogVLM2

This repository houses a fine-tuned version of the microsoft/Florence-2-large model. It was tuned on a 40,000-image subset of the Ejafa/ye-pop dataset, using captions generated by THUDM/cogvlm2-llama3-chat-19B, to improve its image-to-text captioning.

✨ Features

Fine-tuned on a diverse image subset from the Ejafa/ye-pop dataset.
Captions are generated by THUDM/cogvlm2-llama3-chat-19B and refined by google/gemma-2-9b.
Suitable for image-to-text tasks.

📦 Installation

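The model card lists no installation steps. The usage example below imports transformers, torch, PIL (Pillow), and requests, so those packages need to be available in your Python environment.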

💻 Usage Examples

Basic Usage

import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor
# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# trust_remote_code=True is required because the model ships its own modeling code
model = AutoModelForCausalLM.from_pretrained("thwri/CogFlorence-2.2-Large", trust_remote_code=True).to(device).eval()
processor = AutoProcessor.from_pretrained("thwri/CogFlorence-2.2-Large", trust_remote_code=True)
# Function to run the model on an example
def run_example(task_prompt, image):
    # Ensure the image is in RGB mode
    if image.mode != "RGB":
        image = image.convert("RGB")
    inputs = processor(text=task_prompt, images=image, return_tensors="pt").to(device)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
        do_sample=True
    )
    # Keep special tokens so post-processing can parse the task output
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(generated_text, task=task_prompt, image_size=(image.width, image.height))
    return parsed_answer
# Download a sample image and caption it
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
result = run_example("<MORE_DETAILED_CAPTION>", image)
print(result)
# {'<MORE_DETAILED_CAPTION>': 'A vivid portrayal of a classic Volkswagen Beetle parked on a cobblestone street. The car is painted a vibrant turquoise, contrasting with the muted yellow of the building behind it. The building has two wooden doors, one with a white frame and the other with a dark brown finish. The sky is clear, and the sun casts a warm glow on the scene, highlighting the car's details. The image evokes a nostalgic and nostalgic mood, capturing a moment in time without posed elements.'}
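
For the automatic image annotation use case described earlier, the single-image call can be wrapped in a small batch helper. The following is a minimal sketch rather than part of the model card: the names iter_images, caption_directory, and save_captions, the list of file extensions, and the JSON output format are all illustrative assumptions. The captioning step is passed in as a function so the helper itself does not depend on the model.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable

# Extensions treated as images. This set is an illustrative assumption,
# not something the model card specifies.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def iter_images(root: Path) -> Iterable[Path]:
    """Yield image files under root (recursively), in sorted order."""
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            yield path


def caption_directory(root: Path, caption_fn: Callable[[Path], str]) -> Dict[str, str]:
    """Map each image path (relative to root, POSIX-style) to its caption."""
    captions: Dict[str, str] = {}
    for path in iter_images(root):
        captions[path.relative_to(root).as_posix()] = caption_fn(path)
    return captions


def save_captions(captions: Dict[str, str], out_file: Path) -> None:
    """Write the path-to-caption mapping as UTF-8 JSON."""
    out_file.write_text(
        json.dumps(captions, indent=2, ensure_ascii=False), encoding="utf-8"
    )
```

With the model loaded as in the example above, caption_fn could be a small wrapper that opens the file with Image.open, calls run_example("<MORE_DETAILED_CAPTION>", image), and returns the string stored under the "<MORE_DETAILED_CAPTION>" key of the result.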

📚 Documentation

Training Details

Property	Details
Vision Encoder	The vision encoder was frozen during training.
Batch Size	64
Gradient Accumulation Steps	16
Learning Rate	5.12e-05
Optimizer	AdamW
Scheduler	polynomial
Epochs	8.36
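
To make the numbers in the table concrete, the sketch below works out the effective batch size and an approximate number of optimizer steps, and shows what a polynomial learning-rate decay looks like. It rests on assumptions the table does not confirm: that 64 is the size of each forward/backward batch (so gradient accumulation multiplies it by 16), that there is no warmup, and that the schedule decays to 0 with power 1.0. If 64 is already the effective batch size, the step count would be about 5,225 instead.

```python
import math

# Values taken from the training table.
NUM_IMAGES = 40_000
BATCH_SIZE = 64
GRAD_ACCUM_STEPS = 16
PEAK_LR = 5.12e-05
EPOCHS = 8.36

# Assumption: each optimizer update accumulates GRAD_ACCUM_STEPS batches
# of BATCH_SIZE samples.
effective_batch = BATCH_SIZE * GRAD_ACCUM_STEPS            # 64 * 16 = 1024
samples_seen = round(EPOCHS * NUM_IMAGES)                  # 334,400 samples
optimizer_steps = math.ceil(samples_seen / effective_batch)


def polynomial_lr(step, total_steps, peak_lr=PEAK_LR, end_lr=0.0, power=1.0):
    """Polynomial decay from peak_lr to end_lr over total_steps.

    end_lr and power are assumed values; the table only names the
    scheduler type ("polynomial").
    """
    progress = min(step, total_steps) / total_steps
    return (peak_lr - end_lr) * (1.0 - progress) ** power + end_lr


print("effective batch:", effective_batch)
print("optimizer steps:", optimizer_steps)
print("lr at start:", polynomial_lr(0, optimizer_steps))
print("lr at end:", polynomial_lr(optimizer_steps, optimizer_steps))
```

With power 1.0 the decay is linear; larger powers drop the learning rate faster early in training.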

Dataset

Fine-tuning used a 40,000-image subset of the Ejafa/ye-pop dataset. The dataset covers a wide variety of subjects, which makes it a solid basis for improving the model's captioning.

Captioning

The captions were generated by THUDM/cogvlm2-llama3-chat-19B and then post-processed by google/gemma-2-9b to eliminate vagueness.

📄 License

This project is licensed under the MIT license.
