Paligemma Longprompt V1 Safetensors
Experimental vision model combining keyword tags with long text descriptions for image prompt generation
Downloads: 38
Release Time: 6/15/2024
Model Overview
This is a vision-language model designed to generate ultra-long, complex, structured image descriptions. It outputs comma-separated keywords and long natural-language descriptions at the same time, making it suitable for image content analysis and prompt creation.
Model Features
Hybrid output format
Simultaneously generates booru-style tags (comma-separated keywords) and natural-language long-text descriptions
Complex structure processing
Specifically optimized for generating ultra-long, complex description structures
Dual-purpose output
Generated tags and descriptions can both be directly used as image generation prompts
Model Capabilities
Image content analysis
Keyword extraction
Natural language description generation
Image prompt creation
Use Cases
Creative assistance
AI painting prompt generation
Generates prompts containing keywords and detailed descriptions for AI painting tools
Example output includes 20+ keywords and 100+ words of coherent description
Content tagging
Automatic image library tagging
Automatically generates searchable tags and descriptive texts for image libraries
Provides both searchable keywords and readable descriptions
mnemic/paligemma-longprompt-v1-safetensors
This is an experimental vision model that combines booru-style tagging and longer descriptive texts to generate captions/prompts for input images.
Quick Start
This model is an experiment aiming to create high-quality prompts by mixing keyword tags and detailed descriptions. However, it currently requires further training and refinement.
Features
- Dual-mode Output: Combines booru-style tagging (comma-separated keyword tags) with longer descriptive texts.
- High-quality Prompt Goal: Aims to generate high-quality prompts that can be used effectively in various scenarios.
Installation
Install the requirements and PyTorch with CUDA.
pip install git+https://github.com/huggingface/transformers
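The original card only lists the transformers install above. The lines below are an assumed, fuller setup for the examples on this page: the CUDA wheel index (cu121) is only an example and should match your local CUDA version, and accelerate/bitsandbytes are only needed for the quantized loading path in the advanced example.
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install pillow requests colorama huggingface_hub accelerate bitsandbytes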
Usage Examples
Basic Usage
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch
model_id = "mnemic/paligemma-longprompt-v1-safetensors"
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).to('cuda').eval()
processor = AutoProcessor.from_pretrained(model_id)
# prefix
prompt = "caption en"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to('cuda')
input_len = model_inputs["input_ids"].shape[-1]
with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=256, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
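The decoded string mixes the comma-separated tag block and the longer free-text description in a single output. If you want to handle them separately (for example, feeding only the tags to an image generator), a minimal post-processing sketch is shown below. The heuristic of treating short, comma-separated fragments at the start of the output as tags is an assumption about the output format rather than something the model card specifies, and split_caption is a hypothetical helper name.
def split_caption(caption, max_tag_words=4):
    # Hypothetical helper: split a combined caption into (tags, description).
    # Assumes the tag block is a run of short comma-separated fragments at the
    # start, and everything from the first longer fragment onwards belongs to
    # the natural-language description. Adjust to the output you actually observe.
    tags, description_parts = [], []
    for chunk in caption.split(','):
        chunk = chunk.strip()
        if not chunk:
            continue
        if not description_parts and len(chunk.split()) <= max_tag_words:
            tags.append(chunk)
        else:
            description_parts.append(chunk)
    return tags, ', '.join(description_parts)

tags, description = split_caption(decoded)
print(tags)
print(description)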
Advanced Usage
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration, BitsAndBytesConfig
from PIL import Image
import torch
import os
import glob
from colorama import init, Fore, Style
from datetime import datetime
import time
import re
from huggingface_hub import snapshot_download
# Initialize colorama
init(autoreset=True)
# Settings
quantization_bits = 8 # Set to None for full precision, 4 for 4-bit quantization, or 8 for 8-bit quantization
generation_token_length = 256
min_tokens = 20 # Minimum number of tokens required in the generated output
max_word_character_length = 30 # Maximum length of a word before it's considered too long
prune_end = True # Remove any trailing chopped off end text until it reaches a . or ,
output_format = ".txt" # Output format for the generated captions
# Clean up of poorly generated prompts
repetition_penalty = 1.15 # Control the repetition penalty (higher values discourage repetition)
retry_words = ["no_parallel"] # If these words are encountered, the entire generation retries
max_retries = 10
remove_words = ["#", "/", "ã", "@", "__", "|", " ", ";", "~", "\"", "*", "^", ",,", "ON DISPLAY:"] # Words or characters to be removed from the output results
strip_contents_inside = ["(", "[", "{"] # Specify which characters to strip out along with their contents
remove_underscore_tags = True # Option to remove words containing underscores
# Specify the model path
model_name = "mnemic/paligemma-longprompt-v1-safetensors"
models_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'models')
model_path = os.path.join(models_dir, model_name.split('/')[-1])
# Ensure the local directory is correctly specified relative to the script's location
script_dir = os.path.dirname(os.path.abspath(__file__))
local_model_path = model_path # Use the specified model directory
# Directory paths
input_dir = os.path.join(script_dir, 'input')
output_in_input_dir = True # Set this to False if you want to use a separate output directory
output_dir = input_dir if output_in_input_dir else os.path.join(script_dir, 'output')
# Create output directory if it doesn't exist
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
# Function to download the model from HuggingFace using snapshot_download
def download_model(model_name, model_path):
    if not os.path.exists(model_path):
        print(Fore.YELLOW + f"Downloading model {model_name} to {model_path}...")
        snapshot_download(repo_id=model_name, local_dir=model_path, local_dir_use_symlinks=False, local_files_only=False)
        print(Fore.GREEN + "Model downloaded successfully.")
    else:
        print(Fore.GREEN + f"Model directory already exists: {model_path}")
# Download the model if not already present
download_model(model_name, model_path)
# Check that the required files are in the local_model_path
required_files = ["config.json", "tokenizer_config.json"]
missing_files = [f for f in required_files if not os.path.exists(os.path.join(local_model_path, f))]
safetensor_files = [f for f in os.listdir(local_model_path) if f.endswith(".safetensors")]
if missing_files:
    raise FileNotFoundError(f"Missing required files in {local_model_path}: {', '.join(missing_files)}")
if not safetensor_files:
    raise FileNotFoundError(f"No safetensors files found in {local_model_path}")
# Load model and processor from local directory
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(Fore.YELLOW + "Loading model and processor...")
try:
    if quantization_bits == 4:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
        model = PaliGemmaForConditionalGeneration.from_pretrained(
            local_model_path,
            quantization_config=bnb_config,
            device_map={"": 0},
        ).eval()
    elif quantization_bits == 8:
        bnb_config = BitsAndBytesConfig(
            load_in_8bit=True,
        )
        model = PaliGemmaForConditionalGeneration.from_pretrained(
            local_model_path,
            quantization_config=bnb_config,
            device_map={"": 0},
        ).eval()
    elif quantization_bits is None:
        model = PaliGemmaForConditionalGeneration.from_pretrained(
            local_model_path
        ).eval()
        model.to(device)  # Ensure the model is on the correct device
    else:
        raise ValueError("Unsupported quantization_bits value. Use None for full precision, 4 for 4-bit quantization, or 8 for 8-bit quantization.")
    processor = AutoProcessor.from_pretrained(local_model_path, local_files_only=True)
    print(Fore.GREEN + "Model and processor loaded successfully.")
except OSError as e:
    print(Fore.RED + f"Error loading model or processor: {e}")
    raise
# Process each image in the input directory recursively
image_extensions = ['jpg', 'jpeg', 'png', 'webp']
image_paths = []
for ext in image_extensions:
    image_paths.extend(glob.glob(os.path.join(input_dir, '**', f'*.{ext}'), recursive=True))
print(Fore.YELLOW + f"Found {len(image_paths)} image(s) to process.\n")
def prune_text(text):
    if not prune_end:
        return text
    # Find the last period or comma
    last_period_index = text.rfind('.')
    last_comma_index = text.rfind(',')
    prune_index = max(last_period_index, last_comma_index)
    if prune_index != -1:
        # Return text up to the last period or comma
        return text[:prune_index].strip()
    return text
def contains_retry_word(text, retry_words):
    return any(word in text for word in retry_words)

def remove_unwanted_words(text, remove_words):
    for word in remove_words:
        text = text.replace(word, ' ')
    return text
def strip_contents(text, chars):
    for char in chars:
        if char == "(":
            text = re.sub(r'\([^)]*\)', ' ', text)
        elif char == "[":
            text = re.sub(r'\[[^\]]*\]', ' ', text)
        elif char == "{":
            text = re.sub(r'\{[^}]*\}', ' ', text)
    text = re.sub(r'\s{2,}', ' ', text)  # Remove extra spaces
    text = re.sub(r'\s([,.!?;])', r'\1', text)  # Remove space before punctuation
    text = re.sub(r'([,.!?;])\s', r'\1 ', text)  # Add space after punctuation if missing
    return text.strip()
def remove_long_words(text, max_word_length):
    words = text.split()
    for i, word in enumerate(words):
        if len(word) > max_word_length:
            # Strip back to the previous comma or period
            last_period_index = text.rfind('.', 0, text.find(word))
            last_comma_index = text.rfind(',', 0, text.find(word))
            prune_index = max(last_period_index, last_comma_index)
            if prune_index != -1:
                return text[:prune_index].strip()
            else:
                return text[:text.find(word)].strip()
    return text
def clean_text(text):
    text = remove_unwanted_words(text, remove_words)
    text = strip_contents(text, strip_contents_inside)
    text = remove_long_words(text, max_word_character_length)
    # Remove unwanted characters
    text = re.sub(r'[^\x00-\x7F]+', '', text)
    # Normalize spaces
    text = re.sub(r'\s+', ' ', text).strip()
    if remove_underscore_tags:
        text = ' '.join([word for word in text.split() if '_' not in word])
    return text
for image_path in image_paths:
    output_file_path = os.path.splitext(image_path)[0] + output_format if output_in_input_dir else os.path.join(output_dir, os.path.splitext(os.path.relpath(image_path, input_dir))[0] + output_format)
    if os.path.exists(output_file_path):
        # print(Fore.CYAN + f"Skipping {image_path}, output already exists.")
        continue
    try:
        start_time = datetime.now()
        print(Fore.CYAN + f"[{start_time.strftime('%Y-%m-%d %H:%M:%S')}] Starting processing for {image_path}")
        image = Image.open(image_path).convert('RGB')
        prompt = "caption en"
        model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)  # Ensure inputs are on the correct device
        input_len = model_inputs["input_ids"].shape[-1]
        # Generate the caption with additional parameters to reduce repetitiveness
        retries = 0
        success = False
        while retries < max_retries:
            with torch.inference_mode():
                generation_start_time = time.time()
                generation = model.generate(
                    **model_inputs,
                    max_new_tokens=generation_token_length,
                    do_sample=True,  # Enable sampling
                    temperature=0.7,  # Control randomness of predictions
                    top_k=50,  # Consider top 50 candidates
                    top_p=0.9,  # Consider tokens that comprise the top 90% probability mass
                    no_repeat_ngram_size=2,  # Avoid repeating 2-grams
                    repetition_penalty=repetition_penalty  # Apply a penalty to repeated tokens
                )
                generation_end_time = time.time()
                generation = generation[0][input_len:]
                decoded = processor.decode(generation, skip_special_tokens=True)
                pruned_text = prune_text(decoded)
            if not contains_retry_word(pruned_text, retry_words) and len(pruned_text.split()) >= min_tokens:
                success = True
                break
            retries += 1
            print(Fore.YELLOW + f"Retrying generation for {image_path} due to retry word or insufficient tokens, attempt {retries}")
        if retries == max_retries:
            print(Fore.RED + f"Max retries reached for {image_path}. Saving the result with retry word or insufficient tokens.")
        # Clean the text
        cleaned_text = clean_text(pruned_text)
        # Save the output to a text file
        with open(output_file_path, 'w', encoding='utf-8') as f:
            f.write(cleaned_text)
        end_time = datetime.now()
        elapsed_time = end_time - start_time
        print(Fore.GREEN + f"[{end_time.strftime('%Y-%m-%d %H:%M:%S')}] Processed {image_path} in {elapsed_time.total_seconds():.2f} seconds. Saved to {output_file_path}")
    except Exception as e:
        print(Fore.RED + f"Error processing {image_path}: {e}")
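As written, the batch example expects an input folder next to the script and, because output_in_input_dir is True, writes a .txt caption file beside each image. A possible layout, assuming you save the script as caption_images.py (the filename is not specified anywhere in the original card):
project/
  caption_images.py   (the advanced example above)
  models/             (created automatically; the checkpoint is downloaded here)
  input/              (.jpg/.jpeg/.png/.webp images; subfolders are scanned too)
Run it with:
python caption_images.py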
License
This project is licensed under the GPL-3.0 license.