Harmon-1_5B: an open-source multimodal model for text-to-image generation and multimodal understanding, free to deploy

Developed by wusize

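Harmon is a unified framework for multimodal understanding and generation. It harmonizes the visual representations used for understanding and for generation through a shared MAR encoder, and performs well on both text-to-image generation and multimodal understanding tasks.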

Text-to-Image

Safetensors

English #Unified multimodal framework #Bidirectional text-image generation #Harmonized visual representations

Downloads: 281

Release date: 3/30/2025

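Model overview

The Harmon framework handles multimodal understanding and generation tasks with a single shared MAR encoder. It supports both image-to-text and text-to-image conversion and reports strong results on mainstream benchmarks.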

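Model features

Unified multimodal framework

A shared MAR encoder serves both visual understanding and generation, avoiding the separate encoders that earlier approaches required for each task

Advanced generation performance

Achieves advanced generation quality on text-to-image generation benchmarks

Multimodal understanding

Achieves competitive results on multimodal understanding tasks

Two model variants

Available at two parameter scales, 0.5B and 1.5B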

Model capabilities

Image-to-text generation

Text-to-image generation

Multimodal understanding

Visual question answering

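Use cases

Content creation

Art creation

Generates creative images from text descriptions

Can produce high-quality artwork

Advertising design

Quickly generates product concept images

Improves the efficiency of advertising design

Education

Teaching aids

Visualizes teaching material

Enhances the learning experience

Human-computer interaction

Visual question answering

Answers questions about the content of an image

Provides accurate image understanding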

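🚀 Harmon: Harmonizing Visual Representations for Unified Multimodal Understanding and Generation

Harmon is a novel unified framework for multimodal understanding and generation. Unlike existing state-of-the-art architectures, which separate visual understanding and generation into different encoder models, the proposed framework harmonizes the visual representations of understanding and generation through a shared MAR encoder. Harmon achieves advanced generation performance on mainstream text-to-image generation benchmarks and shows competitive results on multimodal understanding tasks. This repository provides inference code for running Harmon on image understanding (image-to-text) and text-to-image generation, with two model variants: Harmon-0.5B and Harmon-1.5B.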

Harmonizing Visual Representations for Unified Multimodal Understanding and Generation

Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Zhonghua Wu, Qingyi Tao, Wentao Liu, Wei Li, Chen Change Loy

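🚀 Quick start

This repository provides inference code for the two tasks Harmon supports: image understanding (image-to-text) and text-to-image generation.

✨ Key features

Harmonizes the visual representations of understanding and generation through a shared MAR encoder.
Achieves advanced generation performance on mainstream text-to-image generation benchmarks.
Shows competitive results on multimodal understanding tasks.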

Model variants

Model variant	LLM	MAR	Hugging Face Hub
Harmon-0.5B	Qwen2.5-0.5B-Instruct	MAR-Base
Harmon-1.5B	Qwen2.5-1.5B-Instruct	MAR-Huge	wusize/Harmon-1_5B
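
When scripting against both variants, it can help to keep the table in code. The following is a minimal sketch, not part of the Harmon repository: `HarmonVariant`, `VARIANTS`, and `describe` are hypothetical names introduced for illustration, and the component names are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarmonVariant:
    """One row of the model variant table."""
    name: str
    llm: str
    mar: str


# Values copied from the variant table; the keys are an arbitrary choice.
VARIANTS = {
    '0.5B': HarmonVariant('Harmon-0.5B', 'Qwen2.5-0.5B-Instruct', 'MAR-Base'),
    '1.5B': HarmonVariant('Harmon-1.5B', 'Qwen2.5-1.5B-Instruct', 'MAR-Huge'),
}


def describe(size: str) -> str:
    """Return a one-line summary of the components behind a variant."""
    variant = VARIANTS[size]
    return f'{variant.name}: LLM={variant.llm}, MAR={variant.mar}'


print(describe('1.5B'))
```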

💻 Usage examples

Basic usage

🖌️ Image-to-text generation

import torch
import numpy as np
from transformers import AutoTokenizer, AutoModel
from einops import rearrange
from PIL import Image
import requests


PROMPT_TEMPLATE = dict(
    SYSTEM='<|im_start|>system\n{system}<|im_end|>\n',
    INSTRUCTION='<|im_start|>user\n{input}<|im_end|>\n<|im_start|>assistant\n',
    SUFFIX='<|im_end|>',
    SUFFIX_AS_EOS=True,
    SEP='\n',
    STOP_WORDS=['<|im_end|>', '<|endoftext|>'])


def expand2square(pil_img, background_color):
    width, height = pil_img.size
    if width == height:
        return pil_img
    elif width > height:
        result = Image.new(pil_img.mode, (width, width), background_color)
        result.paste(pil_img, (0, (width - height) // 2))
        return result
    else:
        result = Image.new(pil_img.mode, (height, height), background_color)
        result.paste(pil_img, ((height - width) // 2, 0))
        return result


@torch.no_grad()
def question_answer(question,
                    image,
                    model,
                    tokenizer,
                    max_new_tokens=512,
                    image_size=512
                    ):
    assert image_size == 512
    image = expand2square(
        image, (127, 127, 127))
    image = image.resize(size=(image_size, image_size))
    image = torch.from_numpy(np.array(image)).to(dtype=model.dtype, device=model.device)
    image = rearrange(image, 'h w c -> c h w')[None]
    image = 2 * (image / 255) - 1

    prompt = PROMPT_TEMPLATE['INSTRUCTION'].format(input="<image>\n" + question)
    assert '<image>' in prompt
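    # One <image> placeholder per 16x16 patch, plus model.mar.buffer_size extra positions.
    image_length = (image_size // 16) ** 2 + model.mar.buffer_size
    prompt = prompt.replace('<image>', '<image>'*image_length)
    # Keep the token ids on the same device as the model instead of hard-coding .cuda().
    input_ids = tokenizer.encode(
        prompt, add_special_tokens=True, return_tensors='pt').to(model.device)
    _, z_enc = model.extract_visual_feature(model.encode(image))
    # image_token_idx is a module-level variable that must be defined before this call.
    # Placeholder positions receive visual features; all others receive text embeddings.
    inputs_embeds = z_enc.new_zeros(*input_ids.shape, model.llm.config.hidden_size)
    inputs_embeds[input_ids == image_token_idx] = z_enc.flatten(0, 1)
    inputs_embeds[input_ids != image_token_idx] = model.llm.get_input_embeddings()(
        input_ids[input_ids != image_token_idx]
    )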
    output = model.llm.generate(inputs_embeds=inputs_embeds,
                                use_cache=True,
                                do_sample=False,
                                max_new_tokens=max_new_tokens,
                                eos_token_id=tokenizer.eos_token_id,
                                pad_token_id=tokenizer.pad_token_id
                                if tokenizer.pad_token_id is not None else
                                tokenizer.eos_token_id
                                )
    return tokenizer.decode(output[0])


harmon_tokenizer = AutoTokenizer.from_pretrained("wusize/Harmon-1_5B",
                                                 trust_remote_code=True)
harmon_model = AutoModel.from_pretrained("wusize/Harmon-1_5B",
                                         trust_remote_code=True).eval().cuda().bfloat16()

special_tokens_dict = {'additional_special_tokens': ["<image>", ]}
num_added_toks = harmon_tokenizer.add_special_tokens(special_tokens_dict)
assert num_added_toks == 1

image_token_idx = harmon_tokenizer.encode("<image>", add_special_tokens=False)[-1]
print(f"Image token: {harmon_tokenizer.decode(image_token_idx)}")

image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_file, stream=True).raw).convert('RGB')

output_text = question_answer(question='Describe the image in detail.',
                              image=raw_image,
                              model=harmon_model,
                              tokenizer=harmon_tokenizer,
                              )

print(output_text)
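
The prompt construction inside `question_answer` can be checked in isolation, without loading the model. The following is a minimal sketch in plain Python, not code from the Harmon repository: `image_token_count` and `build_prompt` are hypothetical helper names, and `buffer_size=64` is an assumed placeholder value. In the real script the value is read from `model.mar.buffer_size` at runtime.

```python
# Sketch only: mirrors the prompt logic of question_answer() in plain Python.
INSTRUCTION = '<|im_start|>user\n{input}<|im_end|>\n<|im_start|>assistant\n'
IMAGE_TOKEN = '<image>'


def image_token_count(image_size: int, buffer_size: int, patch_size: int = 16) -> int:
    """Number of <image> placeholders: one per patch plus the buffer positions."""
    return (image_size // patch_size) ** 2 + buffer_size


def build_prompt(question: str, image_length: int) -> str:
    """Insert the question into the chat template and expand the placeholder."""
    prompt = INSTRUCTION.format(input=IMAGE_TOKEN + '\n' + question)
    return prompt.replace(IMAGE_TOKEN, IMAGE_TOKEN * image_length)


# buffer_size=64 is an assumption for illustration, not a value from the source.
n_tokens = image_token_count(image_size=512, buffer_size=64)
prompt = build_prompt('Describe the image in detail.', n_tokens)
print(n_tokens)                    # 1024 patch positions + 64 assumed buffer positions
print(prompt.count(IMAGE_TOKEN))   # the prompt holds the same number of placeholders
```

Each placeholder position is later overwritten with one row of `z_enc`, so the number of placeholders has to match the number of visual feature vectors exactly.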

🖼️ Text-to-image generation

import os
import torch
from transformers import AutoTokenizer, AutoModel
from einops import rearrange
from PIL import Image


PROMPT_TEMPLATE = dict(
    SYSTEM='<|im_start|>system\n{system}<|im_end|>\n',
    INSTRUCTION='<|im_start|>user\n{input}<|im_end|>\n<|im_start|>assistant\n',
    SUFFIX='<|im_end|>',
    SUFFIX_AS_EOS=True,
    SEP='\n',
    STOP_WORDS=['<|im_end|>', '<|endoftext|>'])

GENERATION_TEMPLATE = "Generate an image: {text}"


@torch.no_grad()
def generate_images(prompts,
                    negative_prompt,
                    tokenizer,
                    model,
                    output,
                    grid_size=2,   # will produce 2 x 2 images per prompt
                    num_steps=64, cfg_scale=3.0, temperature=1.0, image_size=512):
    assert image_size == 512
    m = n = image_size // 16

    prompts = [
                  PROMPT_TEMPLATE['INSTRUCTION'].format(input=prompt)
                  for prompt in prompts
              ] * (grid_size ** 2)

    if cfg_scale != 1.0:
        # Classifier-free guidance: append one negative prompt per positive prompt.
        prompts += [PROMPT_TEMPLATE['INSTRUCTION'].format(input=negative_prompt)] * len(prompts)

    inputs = tokenizer(
        prompts, add_special_tokens=True, return_tensors='pt', padding=True).to(model.device)

    images = model.sample(**inputs, num_iter=num_steps, cfg=cfg_scale, cfg_schedule="constant",
                          temperature=temperature, progress=True, image_shape=(m, n))
    images = rearrange(images, '(m n b) c h w -> b (m h) (n w) c', m=grid_size, n=grid_size)

    # Map model outputs from roughly [-1, 1] to 8-bit pixel values.
    images = torch.clamp(
        127.5 * images + 128.0, 0, 255).to("cpu", dtype=torch.uint8).numpy()

    os.makedirs(output, exist_ok=True)
    for idx, image in enumerate(images):
        Image.fromarray(image).save(f"{output}/{idx:08d}.jpg")


harmon_tokenizer = AutoTokenizer.from_pretrained("wusize/Harmon-1_5B",
                                                 trust_remote_code=True)
harmon_model = AutoModel.from_pretrained("wusize/Harmon-1_5B",
                                         trust_remote_code=True).cuda().bfloat16().eval()


texts = ['a dog on the left and a cat on the right.',
         'a photo of a pink stop sign.']
pos_prompts = [GENERATION_TEMPLATE.format(text=text) for text in texts]
neg_prompt = 'Generate an image.'   # for classifier-free guidance


generate_images(prompts=pos_prompts,
                negative_prompt=neg_prompt,
                tokenizer=harmon_tokenizer,
                model=harmon_model,
                output='output',)
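
The batch bookkeeping in `generate_images` is easy to get wrong when changing `grid_size` or the number of prompts. The following is a minimal sketch in plain Python, not code from the Harmon repository: `build_cfg_batch`, `grid_position`, and `to_uint8` are hypothetical helper names. It assumes, as the `rearrange` pattern in the script implies, that `model.sample` returns one image per positive prompt in the batch.

```python
# Sketch only: reproduces the index arithmetic of generate_images() without torch.

def build_cfg_batch(prompts, negative_prompt, grid_size, cfg_scale):
    """Repeat the prompt list grid_size**2 times, then append the negative prompts."""
    batch = list(prompts) * (grid_size ** 2)
    if cfg_scale != 1.0:
        batch += [negative_prompt] * len(batch)
    return batch


def grid_position(flat_index, num_prompts, grid_size):
    """Decode a flat '(m n b)' index into (prompt index, grid row, grid column)."""
    cell, prompt_idx = divmod(flat_index, num_prompts)
    row, col = divmod(cell, grid_size)
    return prompt_idx, row, col


def to_uint8(value):
    """Apply 127.5 * x + 128.0, clamp to [0, 255], and truncate to an integer."""
    return int(min(max(127.5 * value + 128.0, 0.0), 255.0))


batch = build_cfg_batch(['dog and cat', 'pink stop sign'], 'Generate an image.',
                        grid_size=2, cfg_scale=3.0)
print(len(batch))                                     # 8 positive + 8 negative prompts
print(grid_position(5, num_prompts=2, grid_size=2))   # (1, 1, 0)
print([to_uint8(v) for v in (-1.0, 0.0, 1.0)])        # [0, 128, 255]
```

Because the prompt index is the fastest-varying axis, consecutive outputs belong to different prompts, and each saved file is a `grid_size` x `grid_size` mosaic for a single prompt.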

📚 Documentation

For details on Harmon, see the following paper.

Harmonizing Visual Representations for Unified Multimodal Understanding and Generation

Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Zhonghua Wu, Qingyi Tao, Wentao Liu, Wei Li, Chen Change Loy

📦 Citation

If you find Harmon useful for your research or applications, please cite our paper using the following BibTeX.

@misc{wu2025harmon,
      title={Harmonizing Visual Representations for Unified Multimodal Understanding and Generation}, 
      author={Size Wu and Wenwei Zhang and Lumin Xu and Sheng Jin and Zhonghua Wu and Qingyi Tao and Wentao Liu and Wei Li and Chen Change Loy},
      year={2025},
      eprint={2503.21979},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.21979}, 
}