SmolVLM-500M-anime-caption-v0.1 Open-source Model - Accurately Describe Anime-style Images

SmolVLM-500M-Anime-Caption-v0.1

Developed by Andres77872

A vision-language model specialized in describing anime-style images. It is fine-tuned from SmolVLM-500M-Base on 180K synthetic image/caption pairs whose captions were generated by large language models.

Image-to-Text

Safetensors

Language: English
Open Source License: Apache-2.0
Tags: #Anime image captioning #Synthetic data training #Multi-model generation

Downloads 61

Release Time : 4/18/2025

Model Overview

The model is designed to generate high-quality captions for anime-style images efficiently, producing natural, fluent English descriptions of anime works and illustrations.

Model Features

Specialized for Anime Images

Optimized specifically for anime-style images, accurately capturing unique visual features and stylistic elements of anime.

High-Quality Synthetic Data Training

Trained on 180K synthetic image/caption pairs whose captions were generated by recent large language models (Gemma 3, Gemini 2.0 Flash, Llama 4 Maverick, and GPT-4.1).

Lightweight and Efficient

A lightweight model with 500M parameters, achieving efficient inference while maintaining performance.

Model Capabilities

Anime image caption generation

Anime content indexing and tagging

Anime style recognition

Use Cases

Anime Content Creation

Automatic Captioning for Anime Works

Automatically generates English captions for anime works and illustrations

Natural and fluent anime-style descriptions

Anime Database Annotation

Used for automatic content annotation in anime databases and archives

Improves content retrieval efficiency

🚀 SmolVLM-500M-Anime-Caption-v0.1

SmolVLM-500M-Anime-Caption-v0.1 is a vision-language model tailored for describing anime-style images. It was fine-tuned from HuggingFaceTB/SmolVLM-500M-Base on 180,000 synthetic image/caption pairs created by recent LLMs (Gemma 3, Gemini 2.0 Flash, Llama 4 Maverick, and GPT-4.1).

✨ Features

Specialized for Anime: Designed to efficiently and accurately caption anime-style images, producing natural English descriptions for a wide range of anime and illustration artworks.
High-Quality Fine-Tuning: Fine-tuned on a large dataset of 180k synthetic anime image/caption pairs, ensuring high-quality results.

📦 Installation

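The model card does not list installation steps. The usage example below imports torch, transformers, PIL (provided by the Pillow package), and requests, so these packages must be available in the Python environment. Loading the model with device_map="auto" additionally requires the accelerate package.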
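
The usage example below imports `torch`, `transformers`, `PIL`, and `requests`. As a minimal sketch that is not part of the original model card, the following check reports which of those modules cannot be imported in the current environment. The helper name `missing_modules` and the constant `REQUIRED_MODULES` are hypothetical.

```python
from importlib.util import find_spec

# Modules imported by the usage example; PIL is provided by the Pillow package.
REQUIRED_MODULES = ["torch", "transformers", "PIL", "requests"]

def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Any module reported as missing would need to be installed before running the inference pipeline.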

💻 Usage Examples

Basic Usage

Here is a recommended inference pipeline (transformers):

import requests
from PIL import Image
import torch
from transformers import AutoProcessor, Idefics3ForConditionalGeneration, TextIteratorStreamer, StoppingCriteria, StoppingCriteriaList

base_model_id = "Andres77872/SmolVLM-500M-anime-caption-v0.1"

processor = AutoProcessor.from_pretrained(base_model_id)
model = Idefics3ForConditionalGeneration.from_pretrained(
    base_model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

class StopOnTokens(StoppingCriteria):
    """Stop generation once `stop_sequence` appears near the end of the decoded text."""

    def __init__(self, tokenizer, stop_sequence):
        super().__init__()
        self.tokenizer = tokenizer
        self.stop_sequence = stop_sequence

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Decode the first (and only) sequence in the batch.
        new_text = self.tokenizer.decode(input_ids[0], skip_special_tokens=True)
        # Only inspect a short tail: the stop sequence plus a small margin.
        max_keep = len(self.stop_sequence) + 10
        if len(new_text) > max_keep:
            new_text = new_text[-max_keep:]
        return self.stop_sequence in new_text

def prepare_inputs(image: Image.Image):
    # IMPORTANT: The question prompt must remain fixed as "describe the image".
    # This model is NOT designed for visual question answering.
    # It is strictly an image captioning model, not intended to answer arbitrary visual questions.
    question = "describe the image"
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": question}
            ]
        }
    ]
    max_image_size = processor.image_processor.max_image_size["longest_edge"]
    size = processor.image_processor.size.copy()
    if "longest_edge" in size and size["longest_edge"] > max_image_size:
        size["longest_edge"] = max_image_size
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=[prompt], images=[[image]], return_tensors='pt', padding=True, size=size)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    return inputs

# Example: caption a sample anime image
image = Image.open(requests.get('https://img.arz.ai/5A7A-ckt', stream=True).raw).convert("RGB")
inputs = prepare_inputs(image)
stop_sequence = "</QUERY>"
streamer = TextIteratorStreamer(
    processor.tokenizer,
    skip_prompt=True,
    skip_special_tokens=True,
)
custom_stopping_criteria = StoppingCriteriaList([
    StopOnTokens(processor.tokenizer, stop_sequence)
])

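import threading

# model.generate() blocks until generation finishes, so it runs in a worker thread
# while the main thread prints text from the streamer as it arrives.
# generate() already disables gradient tracking, so no torch.no_grad() wrapper is needed.
generation_kwargs = dict(
    **inputs,
    streamer=streamer,
    do_sample=False,  # no sampling
    max_new_tokens=512,
    pad_token_id=processor.tokenizer.pad_token_id,
    stopping_criteria=custom_stopping_criteria,
)

generation_thread = threading.Thread(target=model.generate, kwargs=generation_kwargs)
generation_thread.start()

for new_text in streamer:
    print(new_text, end="", flush=True)

generation_thread.join()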
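
The stopping criterion halts generation only after `</QUERY>` has been produced, so the marker itself can still appear at the end of the streamed text. The following is a minimal sketch of joining the streamed chunks and trimming at the marker, assuming that behavior. `clean_caption` is a hypothetical helper, and the chunk list is made-up sample data standing in for streamer output.

```python
def clean_caption(chunks, stop_sequence="</QUERY>"):
    """Join streamed text chunks and drop the stop marker and anything after it."""
    text = "".join(chunks)
    cut = text.find(stop_sequence)
    if cut != -1:
        text = text[:cut]
    return text.strip()

# Made-up chunks standing in for what the streamer might yield.
sample_chunks = ["A girl with ", "long silver hair ", "stands in the rain.", "</QUERY>"]
print(clean_caption(sample_chunks))
```

With the pipeline above, passing the streamer itself (for example `clean_caption(streamer)`) in place of the print loop would be one way to obtain the full caption as a single string.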

📚 Documentation

Model Description

This model is crafted for efficient, high-quality captioning of anime-style images. It can generate natural English descriptions for various anime and illustration artworks.

Intended Use

Anime image captioning: Generate English descriptions for anime, manga panels, or illustrations.
Content indexing or tagging for anime-focused archives, databases, and creative tools.
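
As an illustration of the indexing use case, generated captions can feed a simple keyword index. The sketch below is an assumption about how such a tool might look, not something the model card provides. The image IDs and captions are made-up sample data, and `build_index` and `STOPWORDS` are hypothetical names.

```python
import re
from collections import defaultdict

# Small illustrative stopword list; a real tool would likely use a fuller one.
STOPWORDS = {"a", "an", "the", "with", "in", "on", "and", "of", "is", "at"}

def build_index(captions):
    """Map each lowercase keyword to the sorted list of image IDs whose caption contains it."""
    index = defaultdict(set)
    for image_id, caption in captions.items():
        for word in re.findall(r"[a-z0-9']+", caption.lower()):
            if word not in STOPWORDS:
                index[word].add(image_id)
    return {word: sorted(ids) for word, ids in index.items()}

# Made-up captions keyed by hypothetical image IDs.
captions = {
    "img_001": "A girl with silver hair holding an umbrella in the rain.",
    "img_002": "A boy with a sword standing on a cliff at sunset.",
    "img_003": "A girl reading a book in a library.",
}
index = build_index(captions)
print(index["girl"])
```

Looking up a keyword in `index` then returns the IDs of all images whose caption mentions it.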

Out of Scope / Limitations:
The model is not suitable for real-world photograph captioning, non-anime artwork, or critical decision-making scenarios.

Training Details

Fine-tuning dataset: 180,000 pairs of anime images and synthetic English captions.
Caption generation: Synthetic captions were produced using Gemma 3, Gemini 2.0 Flash, Llama 4 Maverick, and GPT-4.1.
Task: Image-to-text, focused on high-quality anime-style descriptions.
Base model: HuggingFaceTB/SmolVLM-500M-Base

📄 License

Apache 2.0 (Inherited from base and training components)

Attribution

This model is a fine-tuned derivative of HuggingFaceTB/SmolVLM-500M-Base using synthetic data generated with large language models, for the task of anime image captioning.
