vit-gpt2-image-chinese-captioning Open Source Model - Free Support for Chinese Image Caption Generation

Vit Gpt2 Image Chinese Captioning

Developed by yuanzhoulvpi

This model uses ViT for image encoding and GPT-2 for decoding, supporting Chinese image caption generation.

Image-to-Text

Transformers

Chinese

Open Source License: MIT

#Chinese Image Caption Generation #ViT-GPT2 Joint Model #Multimodal Chinese Processing

Downloads 22

Release Time: 3/2/2023

Model Overview

A Chinese image captioning model combining a vision encoder (ViT) and a language decoder (GPT-2), capable of generating Chinese text descriptions for input images.

Model Features

Chinese Support

Image captioning capability specifically optimized for Chinese.

Hybrid Architecture

Combines the strengths of Vision Transformer (ViT) and language model (GPT-2).

Pretrained Models

Based on pretrained models google/vit-base-patch16-224 and yuanzhoulvpi/gpt2_chinese.

Model Capabilities

Image Understanding

Chinese Text Generation

Image-to-Text Conversion

Use Cases

Content Generation

Automatic Image Tagging

Automatically generates Chinese descriptions for images on social media or e-commerce platforms.

Example generated description: 'A cat sitting on a sofa'

Assisting Visually Impaired Users

Converts visual content into text descriptions.

🚀 Vision Encoder-Decoder Model with ViT and GPT2

This project combines a Vision Transformer (ViT) encoder with a GPT2 decoder to support Chinese image captioning.

🚀 Quick Start

This model uses ViT to encode images and GPT2 to generate captions. It supports Chinese language processing.

✨ Features

Encoder-Decoder Architecture: Utilizes ViT (google/vit-base-patch16-224) for image encoding and GPT2 (yuanzhoulvpi/gpt2_chinese) for caption generation.
Chinese Support: The model is capable of generating captions in Chinese.

📦 Installation

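The original model card gives no installation steps. The usage example below imports transformers, torch, and PIL (provided by the Pillow package), so these packages need to be installed first.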

💻 Usage Examples

Basic Usage

from transformers import AutoTokenizer, VisionEncoderDecoderModel, ViTImageProcessor
import torch
from PIL import Image

# Hugging Face Hub ID of the model. A local checkpoint directory such as
# "vit-gpt2-image-chinese-captioning/checkpoint-3200" can be used instead.
vision_encoder_decoder_model_name_or_path = "yuanzhoulvpi/vit-gpt2-image-chinese-captioning"

# Load the image processor, tokenizer, and model from the same checkpoint.
processor = ViTImageProcessor.from_pretrained(vision_encoder_decoder_model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(vision_encoder_decoder_model_name_or_path)
model = VisionEncoderDecoderModel.from_pretrained(vision_encoder_decoder_model_name_or_path)

# Run on the GPU when one is available, otherwise on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Generation settings: captions of at most 16 tokens, decoded with beam search over 4 beams.
max_length = 16
num_beams = 4
gen_kwargs = {"max_length": max_length, "num_beams": num_beams}


def predict_step(image_paths):
    """Return one caption string per image path."""
    images = []
    for image_path in image_paths:
        i_image = Image.open(image_path)
        # The ViT processor expects three-channel RGB input.
        if i_image.mode != "RGB":
            i_image = i_image.convert(mode="RGB")

        images.append(i_image)

    # Resize and normalize the images into one batch tensor on the target device.
    pixel_values = processor(images=images, return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(device)

    output_ids = model.generate(pixel_values, **gen_kwargs)

    # Convert token IDs back to text and trim surrounding whitespace.
    preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    preds = [pred.strip() for pred in preds]
    return preds


print(predict_step(['bigdata/image_data/train-1000200.jpg']))
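
The predict_step function above puts every image in the list into a single batch. For a large set of images, one option is to feed it fixed-size chunks so that memory use stays bounded. The sketch below is an illustration and not part of the original model card: chunked and caption_in_batches are hypothetical helper names, the default batch size of 8 is an arbitrary assumption, and the captioning function is passed in as a parameter so the snippet can be tried without loading the model.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def caption_in_batches(
    image_paths: Sequence[str],
    caption_fn: Callable[[List[str]], List[str]],
    batch_size: int = 8,  # assumed default; tune against available memory
) -> List[str]:
    """Call `caption_fn` on one chunk at a time and return all captions in input order."""
    captions: List[str] = []
    for batch in chunked(image_paths, batch_size):
        captions.extend(caption_fn(batch))
    return captions
```

With the model loaded, caption_in_batches(paths, predict_step) would return one caption per path, in the same order as the input list.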

📚 Documentation

Training Code: You can find the training code here.

🔧 Technical Details

The model uses a Vision Transformer (ViT) as the encoder and GPT2 as the decoder. The ViT model is google/vit-base-patch16-224, and the GPT2 model is yuanzhoulvpi/gpt2_chinese. This combination allows the model to encode images and generate Chinese captions.
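
The encoder name google/vit-base-patch16-224 indicates a 224×224 input resolution split into 16×16 patches. As a rough illustration of what the decoder attends over, the sketch below computes how many tokens the encoder produces per image. The function name vit_sequence_length is hypothetical, and the calculation assumes the standard ViT layout of non-overlapping square patches plus one prepended classification token.

```python
def vit_sequence_length(
    image_size: int = 224,
    patch_size: int = 16,
    add_cls_token: bool = True,
) -> int:
    """Number of tokens a ViT encoder emits for one square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size  # 224 // 16 = 14
    num_patches = patches_per_side ** 2          # 14 * 14 = 196
    # Standard ViT prepends a single [CLS] token to the patch sequence.
    return num_patches + (1 if add_cls_token else 0)


print(vit_sequence_length())  # 197
```

Under these assumptions, each image becomes a sequence of 197 encoder states, which the GPT2 decoder reads through cross-attention while generating the caption.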

📄 License

This project is licensed under the MIT license.
