Harmon-1_5B open-source multimodal model: free to deploy, text-to-image generation, strong multimodal understanding


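Developed by wusize

Harmon is a unified framework for multimodal understanding and generation. It harmonizes the visual representations used for understanding and for generation through a shared MAR encoder, and performs well on both text-to-image generation and multimodal understanding tasks.

Task: text-to-image generation

Format: Safetensors

Language: English
Tags: #unified multimodal framework #bidirectional text-image generation #visual representation harmonization

Downloads: 281

Published: 3/30/2025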

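Model overview

Harmon handles multimodal understanding and generation in one framework built around a shared MAR encoder. It supports both image-to-text and text-to-image conversion and reports advanced performance on mainstream benchmarks.

Model features

Unified multimodal framework: a shared MAR encoder serves both visual understanding and generation, which avoids the separate encoders that conventional approaches require.

Advanced generation performance: advanced generation quality on text-to-image benchmarks.

Multimodal understanding: competitive results on multimodal understanding tasks.

Two model variants: available at 0.5B and 1.5B parameters.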

Model capabilities

Image-to-text generation

Text-to-image generation

Multimodal understanding

Visual question answering

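Use cases

Content creation

Art creation: generate creative images from text descriptions; can produce high-quality artwork.

Advertising design: quickly generate product concept images; speeds up advertising design work.

Education

Teaching aids: visualize textbook content; enriches the learning experience.

Human-computer interaction

Visual question answering: answer questions about image content; provides accurate image understanding.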

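🚀 Harmon: harmonizing visual representations for unified multimodal understanding and generation

Harmon is a new unified framework for multimodal understanding and generation. Unlike existing state-of-the-art architectures that handle visual understanding and generation with separate encoders, it harmonizes the visual representations for both tasks through a shared MAR encoder. Harmon achieves advanced generation performance on mainstream text-to-image benchmarks and competitive results on multimodal understanding tasks. This repository provides inference code for running Harmon on image understanding (image-to-text) and text-to-image generation, in two variants: Harmon-0.5B and Harmon-1.5B.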

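🚀 Quick start

The repository provides inference code for both variants, Harmon-0.5B and Harmon-1.5B, covering image understanding (image-to-text) and text-to-image generation. The usage examples in this document load the 1.5B checkpoint `wusize/Harmon-1_5B` through transformers with `trust_remote_code=True`.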

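✨ Key features

Unified framework: a shared MAR encoder harmonizes the visual representations for understanding and generation, instead of handling the two with separate encoder models.
Strong performance: advanced generation performance on mainstream text-to-image benchmarks, and competitive results on multimodal understanding tasks.
Model choice: two variants are provided, Harmon-0.5B and Harmon-1.5B.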

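📦 Installation

The source gives no installation steps. The usage examples import torch, numpy, transformers, einops, PIL (Pillow), and requests, so those packages must be installed first. The examples also move the model to a CUDA device.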
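
Since no installation steps are given, it can help to confirm up front that the modules imported by the usage examples are available. The following is a minimal sketch using only the standard library. The helper name `missing_modules` and the list `REQUIRED_MODULES` are illustrative, not part of Harmon; the list holds import names taken from the example code, which are not necessarily the names used to install the packages.

```python
from importlib.util import find_spec

# Import names used by the usage examples in this document.
# These are import names, not necessarily install names (PIL is provided by Pillow).
REQUIRED_MODULES = ["torch", "numpy", "transformers", "einops", "PIL", "requests"]


def missing_modules(names):
    """Return the import names in `names` that cannot be located."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All modules used by the examples are importable.")
```

`find_spec` only locates a module without importing it, so the check is quick even when torch is installed.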

💻 Usage examples

Basic usage

🖌️ Image-to-text generation

import torch
import numpy as np
from transformers import AutoTokenizer, AutoModel
from einops import rearrange
from PIL import Image
import requests


PROMPT_TEMPLATE = dict(
    SYSTEM='<|im_start|>system\n{system}<|im_end|>\n',
    INSTRUCTION='<|im_start|>user\n{input}<|im_end|>\n<|im_start|>assistant\n',
    SUFFIX='<|im_end|>',
    SUFFIX_AS_EOS=True,
    SEP='\n',
    STOP_WORDS=['<|im_end|>', '<|endoftext|>'])


def expand2square(pil_img, background_color):
    width, height = pil_img.size
    if width == height:
        return pil_img
    elif width > height:
        result = Image.new(pil_img.mode, (width, width), background_color)
        result.paste(pil_img, (0, (width - height) // 2))
        return result
    else:
        result = Image.new(pil_img.mode, (height, height), background_color)
        result.paste(pil_img, ((height - width) // 2, 0))
        return result


@torch.no_grad()
def question_answer(question,
                    image,
                    model,
                    tokenizer,
                    max_new_tokens=512,
                    image_size=512
                    ):
    # This example only supports 512 x 512 inputs.
    assert image_size == 512
    # Pad to a square with mid-gray so the aspect ratio survives the resize.
    image = expand2square(
        image, (127, 127, 127))
    image = image.resize(size=(image_size, image_size))
    image = torch.from_numpy(np.array(image)).to(dtype=model.dtype, device=model.device)
    # HWC -> CHW, add a batch dimension, then scale pixels from [0, 255] to [-1, 1].
    image = rearrange(image, 'h w c -> c h w')[None]
    image = 2 * (image / 255) - 1

    prompt = PROMPT_TEMPLATE['INSTRUCTION'].format(input="<image>\n" + question)
    assert '<image>' in prompt
    # One placeholder per 16 x 16 pixel cell, plus the MAR buffer slots.
    image_length = (image_size // 16) ** 2 + model.mar.buffer_size
    prompt = prompt.replace('<image>', '<image>'*image_length)
    input_ids = tokenizer.encode(
        prompt, add_special_tokens=True, return_tensors='pt').to(model.device)
    _, z_enc = model.extract_visual_feature(model.encode(image))
    # Placeholder positions receive visual features; all other positions receive text embeddings.
    # image_token_idx is a module-level variable that is set before this function is called.
    inputs_embeds = z_enc.new_zeros(*input_ids.shape, model.llm.config.hidden_size)
    inputs_embeds[input_ids == image_token_idx] = z_enc.flatten(0, 1)
    inputs_embeds[input_ids != image_token_idx] = model.llm.get_input_embeddings()(
        input_ids[input_ids != image_token_idx]
    )
    output = model.llm.generate(inputs_embeds=inputs_embeds,
                                use_cache=True,
                                do_sample=False,
                                max_new_tokens=max_new_tokens,
                                eos_token_id=tokenizer.eos_token_id,
                                pad_token_id=tokenizer.pad_token_id
                                if tokenizer.pad_token_id is not None else
                                tokenizer.eos_token_id
                                )
    return tokenizer.decode(output[0])


harmon_tokenizer = AutoTokenizer.from_pretrained("wusize/Harmon-1_5B",
                                                 trust_remote_code=True)
harmon_model = AutoModel.from_pretrained("wusize/Harmon-1_5B",
                                         trust_remote_code=True).eval().cuda().bfloat16()

special_tokens_dict = {'additional_special_tokens': ["<image>", ]}
num_added_toks = harmon_tokenizer.add_special_tokens(special_tokens_dict)
assert num_added_toks == 1

image_token_idx = harmon_tokenizer.encode("<image>", add_special_tokens=False)[-1]
print(f"Image token: {harmon_tokenizer.decode(image_token_idx)}")

image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_file, stream=True).raw).convert('RGB')

output_text = question_answer(question='Describe the image in detail.',
                              image=raw_image,
                              model=harmon_model,
                              tokenizer=harmon_tokenizer,
                              )

print(output_text)
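
The image-to-text example expands a single `<image>` placeholder into `(image_size // 16) ** 2 + model.mar.buffer_size` copies, and it decodes the output without `skip_special_tokens`, so the returned text may still carry end-of-turn markers. The sketch below reproduces that bookkeeping in plain Python so it can be checked without loading the model. The three helper names are illustrative, and `buffer_size=64` is a made-up value used only for the arithmetic; the real value comes from `model.mar.buffer_size`.

```python
def image_token_count(image_size, buffer_size, cell=16):
    """Placeholders needed: one per cell x cell pixel block, plus buffer slots."""
    if image_size % cell != 0:
        raise ValueError("image_size must be a multiple of cell")
    return (image_size // cell) ** 2 + buffer_size


def expand_image_placeholder(prompt, count, token="<image>"):
    """Repeat each placeholder occurrence `count` times, as str.replace does in the example."""
    if token not in prompt:
        raise ValueError("prompt has no image placeholder")
    return prompt.replace(token, token * count)


def strip_stop_words(text, stop_words=("<|im_end|>", "<|endoftext|>")):
    """Cut the text at the first stop word, if any, and trim whitespace."""
    cut = len(text)
    for word in stop_words:
        pos = text.find(word)
        if pos != -1:
            cut = min(cut, pos)
    return text[:cut].strip()


# buffer_size=64 is hypothetical; read the real value from model.mar.buffer_size.
print(image_token_count(512, buffer_size=64))  # 1088
expanded = expand_image_placeholder("<image>\nDescribe the image.", 3)
print(expanded.count("<image>"))  # 3
print(strip_stop_words("Two cats on a couch.<|im_end|>"))  # Two cats on a couch.
```

The default stop words are the `STOP_WORDS` entries from `PROMPT_TEMPLATE`, so the cleanup could be applied to the string returned by `question_answer` if the markers show up in practice.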

🖼️ Text-to-image generation

import os
import torch
from transformers import AutoTokenizer, AutoModel
from einops import rearrange
from PIL import Image


PROMPT_TEMPLATE = dict(
    SYSTEM='<|im_start|>system\n{system}<|im_end|>\n',
    INSTRUCTION='<|im_start|>user\n{input}<|im_end|>\n<|im_start|>assistant\n',
    SUFFIX='<|im_end|>',
    SUFFIX_AS_EOS=True,
    SEP='\n',
    STOP_WORDS=['<|im_end|>', '<|endoftext|>'])

GENERATION_TEMPLATE = "Generate an image: {text}"


@torch.no_grad()
def generate_images(prompts,
                    negative_prompt,
                    tokenizer,
                    model,
                    output,
                    grid_size=2,   # will produce 2 x 2 images per prompt
                    num_steps=64, cfg_scale=3.0, temperature=1.0, image_size=512):
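    # This example only supports 512 x 512 outputs.
    assert image_size == 512
    # The image is sampled on an (image_size // 16) x (image_size // 16) token grid.
    m = n = image_size // 16

    # Repeat the prompt list so each prompt yields grid_size x grid_size samples.
    prompts = [
                  PROMPT_TEMPLATE['INSTRUCTION'].format(input=prompt)
                  for prompt in prompts
              ] * (grid_size ** 2)

    # Classifier-free guidance: append one negative prompt per positive prompt.
    if cfg_scale != 1.0:
        prompts += [PROMPT_TEMPLATE['INSTRUCTION'].format(input=negative_prompt)] * len(prompts)

    inputs = tokenizer(
        prompts, add_special_tokens=True, return_tensors='pt', padding=True).to(model.device)

    images = model.sample(**inputs, num_iter=num_steps, cfg=cfg_scale, cfg_schedule="constant",
                          temperature=temperature, progress=True, image_shape=(m, n))
    # Tile the samples of each prompt into one grid_size x grid_size image, channels last.
    images = rearrange(images, '(m n b) c h w -> b (m h) (n w) c', m=grid_size, n=grid_size)

    # Map values from [-1, 1] to [0, 255] and convert to uint8 on the CPU.
    images = torch.clamp(
        127.5 * images + 128.0, 0, 255).to("cpu", dtype=torch.uint8).numpy()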

    os.makedirs(output, exist_ok=True)
    for idx, image in enumerate(images):
        Image.fromarray(image).save(f"{output}/{idx:08d}.jpg")


harmon_tokenizer = AutoTokenizer.from_pretrained("wusize/Harmon-1_5B",
                                                 trust_remote_code=True)
harmon_model = AutoModel.from_pretrained("wusize/Harmon-1_5B",
                                         trust_remote_code=True).cuda().bfloat16().eval()


texts = ['a dog on the left and a cat on the right.',
         'a photo of a pink stop sign.']
pos_prompts = [GENERATION_TEMPLATE.format(text=text) for text in texts]
neg_prompt = 'Generate an image.'   # for classifier-free guidance


generate_images(prompts=pos_prompts,
                negative_prompt=neg_prompt,
                tokenizer=harmon_tokenizer,
                model=harmon_model,
                output='output',)
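
The batch layout in `generate_images` is easy to misread, so here is a small pure-Python sketch of it. It mirrors how the prompt list is repeated and extended with negative prompts, how the `'(m n b)'` pattern maps a sample index to a prompt and a grid cell, and how a value in [-1, 1] becomes a pixel. The helper names are illustrative. The sketch leaves out the chat template and assumes `model.sample` returns one image per positive prompt entry, which is what the `rearrange` call implies but the source does not state.

```python
def build_cfg_batch(prompts, negative_prompt, grid_size=2, cfg_scale=3.0):
    """Mirror the prompt list built in generate_images (chat template omitted)."""
    batch = list(prompts) * (grid_size ** 2)
    if cfg_scale != 1.0:
        # One negative prompt for every positive entry, appended at the end.
        batch += [negative_prompt] * len(batch)
    return batch


def grid_cell(index, num_prompts, grid_size=2):
    """Map a positive-sample index to (prompt_index, row, col) under '(m n b)'."""
    cell, prompt_index = divmod(index, num_prompts)  # b is the fastest-varying axis
    row, col = divmod(cell, grid_size)               # then n (columns), then m (rows)
    return prompt_index, row, col


def to_uint8(value):
    """Apply clamp(127.5 * value + 128, 0, 255), then truncate toward zero."""
    return int(min(max(127.5 * value + 128.0, 0.0), 255.0))


batch = build_cfg_batch(["dog and cat", "pink stop sign"], "negative")
print(len(batch))        # 16
print(grid_cell(5, 2))   # (1, 1, 0)
print(to_uint8(-1.0), to_uint8(0.0), to_uint8(1.0))  # 0 128 255
```

With two prompts and `grid_size=2`, the example therefore saves two JPEG files, each a 2 x 2 tiling of samples for one prompt.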

📚 Detailed documentation

Model variants

Variant	LLM	MAR
Harmon-0.5B	Qwen2.5-0.5B-Instruct	MAR-Base
Harmon-1.5B	Qwen2.5-1.5B-Instruct	MAR-Huge

Download links: not preserved in this copy. The usage examples load Harmon-1.5B from `wusize/Harmon-1_5B`.
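
For scripts that need to switch between variants, the variant information can be kept as a small lookup. This is only a sketch: `HarmonVariant`, `HARMON_VARIANTS`, and `describe` are illustrative names, and the table contains only the LLM and MAR names listed for each variant. Repository ids are left out on purpose, because the download links are missing from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarmonVariant:
    """Components of one Harmon variant, as listed in the variant information."""
    llm: str
    mar: str


HARMON_VARIANTS = {
    "Harmon-0.5B": HarmonVariant(llm="Qwen2.5-0.5B-Instruct", mar="MAR-Base"),
    "Harmon-1.5B": HarmonVariant(llm="Qwen2.5-1.5B-Instruct", mar="MAR-Huge"),
}


def describe(name):
    """Return a one-line summary for a variant, or raise KeyError if unknown."""
    variant = HARMON_VARIANTS[name]
    return f"{name}: LLM={variant.llm}, MAR={variant.mar}"


for variant_name in HARMON_VARIANTS:
    print(describe(variant_name))
```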

📚 Citation

If you find Harmon useful for your research or applications, please cite the paper using the following BibTeX entry:

@misc{wu2025harmon,
      title={Harmonizing Visual Representations for Unified Multimodal Understanding and Generation}, 
      author={Size Wu and Wenwei Zhang and Lumin Xu and Sheng Jin and Zhonghua Wu and Qingyi Tao and Wentao Liu and Wei Li and Chen Change Loy},
      year={2025},
      eprint={2503.21979},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.21979}, 
}