LeX-Enhancer: An Open-Source, Lightweight Prompt-Enhancement Model That Generates High-Quality Text-to-Image Prompts for Free


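Developed by X-ART

LeX-Enhancer is a lightweight prompt-enhancement model distilled from DeepSeek-R1. It generates high-quality, visually rich prompts suited to text-to-image generation.

Large Language Model

Transformers

Open-source license: MIT

Tags: #text-to-image prompt enhancement, #LoRA fine-tuning, #high-fidelity visual description

Downloads: 160

Release date: 3/10/2025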

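Model Overview

LeX-Enhancer is a prompt-enhancement model that turns a simple caption into a detailed visual description, making it suitable for high-fidelity text-to-image generation tasks.

Model Features

Efficient prompt enhancement

Converts a simple caption into a detailed visual description, improving the quality of text-to-image generation.

Lightweight model

Built with LoRA fine-tuning, the model is lightweight and efficient, which suits large-scale prompt generation.

High-quality output

The generated prompts are visually rich and detailed, suited to high-fidelity image generation.

Model Capabilities

Text generation

Prompt enhancement

Visual description generation

Use Cases

Text-to-image generation

Thank-you card design

Converts a simple thank-you card caption into a detailed visual description that is then used for high-quality image generation.

The generated description covers rich visual elements such as background tones, text style, and lighting and shadow effects.

Art creation

Art prompt generation

Provides artists with detailed visual descriptions to support the creative process.

The generated prompts are artistic and visually appealing.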

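🚀 🎨 LeX-Enhancer: Prompt Enhancer

LeX-Enhancer is a lightweight prompt-enhancement model distilled from DeepSeek-R1.
A total of 60,856 caption pairs were collected, each pair holding a caption before and after enhancement by DeepSeek-R1, and the DeepSeek-R1-Distill-Qwen-14B model was fine-tuned on them with LoRA to reproduce DeepSeek-R1's high-quality, visually rich prompt outputs.

This enables efficient, large-scale prompt enhancement that is well suited to high-fidelity text-to-image generation.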
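
To illustrate why LoRA fine-tuning keeps the trainable footprint small, the sketch below compares the parameter count of one dense weight matrix with that of its low-rank adapter. In LoRA the base weights stay frozen and only two small factor matrices are trained. The layer dimensions and the rank used here are hypothetical values chosen for illustration; the source does not state the LoRA configuration used for LeX-Enhancer.

```python
def full_param_count(d_out: int, d_in: int) -> int:
    """Parameters in a dense weight matrix W of shape (d_out, d_in)."""
    return d_out * d_in


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in a LoRA adapter: B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical values for illustration only (not taken from the model card).
D_OUT, D_IN, RANK = 4096, 4096, 16

full = full_param_count(D_OUT, D_IN)
lora = lora_param_count(D_OUT, D_IN, RANK)
print(f"full matrix:  {full:,} parameters")
print(f"LoRA adapter: {lora:,} parameters ({lora / full:.2%} of the full matrix)")
```

The adapter size grows linearly with the rank, so a small rank trains only a small fraction of the parameters of each adapted layer.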

Property	Details
Base model	deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
Dataset	X-ART/LeX-R1-60K
Library name	transformers
License	mit
Pipeline tag	text-to-image
Tags	art, text-rendering

🚀 Quick Start

💻 Usage Example

Basic Usage

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

# System instruction for reasoning + answering
SYSTEM_TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. "
    "The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. "
    "The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., "
    "<think> reasoning process here </think> <answer> answer here </answer>."
)

model_path = 'X-ART/LeX-Enhancer'

# Your simple caption goes here
simple_caption = "A thank you card with the words very much, with the text on it: \"VERY\" in black, \"MUCH\" in yellow."

def create_chat_template(user_prompt):
    return [
        {"role": "system", "content": SYSTEM_TEMPLATE},
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": "<think>"}
    ]

def create_direct_template(user_prompt):
    return user_prompt + "<think>"

def create_user_prompt(simple_caption):
    return (
        "Below is the simple caption of an image with text. Please deduce the detailed description of the image based on this simple caption. "
        "Note: 1. The description should only include visual elements and should not contain any extended meanings. "
        "2. The visual elements should be as rich as possible, such as the main objects in the image, their respective attributes, "
        "the spatial relationships between the objects, lighting and shadows, color style, any text in the image and its style, etc. "
        "3. The output description should be a single paragraph and should not be structured. "
        "4. The description should avoid certain situations, such as pure white or black backgrounds, blurry text, excessive rendering of text, "
        "or harsh visual styles. "
        "5. The detailed caption should be human readable and fluent. "
        "6. Avoid using vague expressions such as \"may be\" or \"might be\"; the generated caption must be in a definitive, narrative tone. "
        "7. Do not use negative sentence structures, such as \"there is nothing in the image,\" etc. The entire caption should directly describe the content of the image. "
        "8. The entire output should be limited to 200 words. "
        f"SIMPLE CAPTION: {simple_caption}"
    )

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", torch_dtype=torch.bfloat16)

# Prepare input prompt (a plain string ending in "<think>", not a chat-message list)
prompt = create_direct_template(create_user_prompt(simple_caption))
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)

# Stream output
streamer = TextStreamer(tokenizer, skip_special_tokens=True, clean_up_tokenization_spaces=True)
output = model.generate(
    input_ids,
    max_length=2048,
    num_return_sequences=1,
    do_sample=True,
    temperature=0.6,
    repetition_penalty=1.1,
    streamer=streamer
)

print("*" * 80)
# Output will stream via TextStreamer
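
The script above streams the raw generation, which can include the model's reasoning before the final caption. The helper below is a minimal sketch of how the enhanced caption might be separated from that reasoning. It assumes the reasoning ends with a `</think>` tag and that the answer may or may not be wrapped in `<answer>` tags, as described in `SYSTEM_TEMPLATE`. The function name `extract_enhanced_caption` is hypothetical, and whether the tags survive decoding depends on the tokenizer configuration.

```python
import re


def extract_enhanced_caption(generated_text: str) -> str:
    """Return the final caption from text that may contain reasoning tags."""
    # Keep only what follows the last closing reasoning tag.
    # If the tag is absent, rpartition returns the whole string as the tail.
    text = generated_text.rpartition("</think>")[2]
    # Prefer the content of <answer>...</answer> when the model emits it.
    match = re.search(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    if match:
        text = match.group(1)
    return text.strip()


sample = "<think>Plan the layout and colors.</think> <answer>A warm beige card.</answer>"
print(extract_enhanced_caption(sample))
```

In the script above, the newly generated text could be obtained with `tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)` and then passed to this helper.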

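Example: from a simple caption to an enhanced caption

🧾 Input (simple caption):
A thank you card with the words very much, with the text on it: "VERY" in black, "MUCH" in yellow.

🧾 Output (enhanced caption):
The image shows a rectangular thank-you card with a calm, warm-toned background in a soft gradient of creamy beige and blush pink, whose subtly textured finish adds depth without overwhelming the central elements. Centered near the top of the card, the word "VERY" is rendered in bold, uppercase black letters in a clean sans-serif font, curving slightly to follow the natural flow of the composition. Directly below it, the word "MUCH" appears in vivid bright yellow, in the same bold font style but at a slightly larger size, aligned neatly with the lower edge of "VERY" to create a balanced hierarchy. Both pieces of text carry a faint drop shadow that improves legibility against the softly lit background. The card's edges are crisp, with a simple border in a lighter shade of the background color, and the overall layout is symmetrical, conveying a polished, contemporary aesthetic. Gentle ambient light casts soft, diffused shadows at the card's corners, suggesting a slightly textured surface underneath. With no decorative embellishments, the focus stays on the typography. The color scheme balances warmth and neutrality, keeping the text as the focal point while maintaining a calm, approachable atmosphere.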
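
Several of the rules in the user prompt are mechanically checkable: the 200-word limit, the single-paragraph requirement, and the ban on vague phrases such as "may be" or "might be". The sketch below shows one way an enhanced caption might be screened before it is passed to an image generator. The function name `check_caption` and its return format are hypothetical, and the check covers only those three rules.

```python
from typing import List


def check_caption(caption: str, max_words: int = 200) -> List[str]:
    """Return a list of rule violations; an empty list means no issue was found."""
    problems = []
    # Rule 8: the whole output should stay within the word limit.
    word_count = len(caption.split())
    if word_count > max_words:
        problems.append(f"too long: {word_count} words (limit {max_words})")
    # Rule 3: the description should be a single paragraph.
    if "\n" in caption.strip():
        problems.append("not a single paragraph")
    # Rule 6: avoid vague expressions.
    lowered = caption.lower()
    for phrase in ("may be", "might be"):
        if phrase in lowered:
            problems.append(f"vague expression: {phrase!r}")
    return problems


print(check_caption("A rectangular card with bold black text on a beige background."))
```

A caption that fails the check could simply be regenerated, since the script above samples with `do_sample=True` and each run can produce a different result.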

References

GitHub repository: https://github.com/zhaoshitian/LeX-Art

@article{zhao2025lexart,
    title={LeX-Art: Rethinking Text Generation via Scalable High-Quality Data Synthesis},
    author={Zhao, Shitian and Wu, Qilong and Li, Xinyue and Zhang, Bo and Li, Ming and Qin, Qi and Liu, Dongyang and Zhang, Kaipeng and Li, Hongsheng and Qiao, Yu and Gao, Peng and Fu, Bin and Li, Zhen},
    journal={arXiv preprint arXiv:2503.21749},
    year={2025}
}

📄 License

This project is licensed under the MIT License.