ChartVE: an open-source visual entailment model for evaluating the factual accuracy of generated chart captions


Developed by khhuang

ChartVE is a visual entailment model that evaluates the factual accuracy of a generated caption sentence with respect to an input chart.

Image-to-Text

Transformers

English

Open-source license: Apache-2.0

#Chart factuality verification #Visual entailment probability #Multimodal evaluation

Downloads: 38

Released: 12/16/2023

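Model Overview

ChartVE is a visual entailment model that evaluates the factual accuracy of a generated caption sentence with respect to an input chart. It takes a chart image and a caption sentence as input and outputs an entailment probability.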

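Model Features

Visual entailment analysis

Evaluates the factual accuracy of a generated caption sentence with respect to the input chart.

Single-sentence caption processing

Expects a single caption sentence as text input; a caption containing multiple sentences must be split and each sentence processed separately.

Based on the UniChart architecture

The underlying architecture is UniChart, a vision-language model for chart understanding.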

Model Capabilities

Visual entailment analysis for charts

Factual accuracy evaluation

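Use Cases

Data analysis

Chart caption accuracy verification

Checks whether an automatically generated chart caption accurately reflects the chart's content.

Outputs an entailment probability that quantifies the caption's accuracy.

Academic research

Chart understanding research

Used to study how well large vision-language models (LVLMs) understand charts.

Provides a quantitative metric to support analysis of model performance.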

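🚀 ChartVE (Chart Visual Entailment)

ChartVE is a visual entailment model introduced in the paper "Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning". It is used to evaluate the factuality of a generated caption sentence with respect to an input chart. The model takes a chart figure and a caption sentence as input and outputs an entailment probability. For how the entailment probability is computed, see the "How to use" section below. The underlying architecture of this model is UniChart.

Note that the model expects a single caption sentence as text input. If a caption consists of multiple sentences, split it into individual sentences, feed each sentence to ChartVE, and then aggregate the scores. An example of using ChartVE appears in the "How to use" section below.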
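The split-then-aggregate procedure described above could be sketched as follows. This is a minimal illustration, not the authors' implementation: `split_sentences`, `score_caption`, the regex-based sentence splitter, and the choice of `min` as the default aggregator are all assumptions made here, since the model card does not say how the per-sentence scores should be combined. `score_sentence` stands in for any function that returns ChartVE's entailment probability for one sentence.

```python
import re
from typing import Callable, Iterable, List


def split_sentences(caption: str) -> List[str]:
    """Naively split a caption into sentences (hypothetical helper).

    Splits after '.', '!' or '?' when followed by whitespace. This will
    wrongly split abbreviations such as "U.S. population"; a proper
    sentence tokenizer would be more robust.
    """
    parts = re.split(r"(?<=[.!?])\s+", caption.strip())
    return [part for part in parts if part]


def score_caption(
    caption: str,
    score_sentence: Callable[[str], float],
    aggregate: Callable[[Iterable[float]], float] = min,
) -> float:
    """Score each sentence separately, then combine the scores.

    `score_sentence` is assumed to return an entailment probability in [0, 1].
    `aggregate` defaults to `min` (an assumption, not the paper's stated rule).
    """
    sentences = split_sentences(caption)
    if not sentences:
        raise ValueError("caption contains no sentences")
    scores = [score_sentence(sentence) for sentence in sentences]
    return aggregate(scores)
```

Taking the minimum treats a caption as only as reliable as its least supported sentence; passing `aggregate=statistics.mean` would average the sentence scores instead.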

🚀 Quick Start

How to use

The pretrained model can be used directly as follows.

import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

# Use the GPU when one is available; otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "khhuang/chartve"
model = VisionEncoderDecoderModel.from_pretrained(model_name).to(device)
model.eval()
processor = DonutProcessor.from_pretrained(model_name)

image_path = "PATH_TO_IMAGE"

def format_query(sentence):
    # Query wording is kept verbatim from the model card (including "entails"); do not correct it
    return f"Does the image entails this statement: \"{sentence}\"?"

# Format text inputs
CAPTION_SENTENCE = "The state that has the highest number of population is California."
query = format_query(CAPTION_SENTENCE)

# Encode chart figure and tokenize text
img = Image.open(image_path)
pixel_values = processor(img.convert("RGB"), random_padding=False, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(device)
decoder_input_ids = processor.tokenizer(query, add_special_tokens=False, return_tensors="pt", max_length=510).input_ids.to(device)

# Forward pass without gradient tracking (inference only)
with torch.no_grad():
    outputs = model(pixel_values, decoder_input_ids=decoder_input_ids)

# Logits at the last decoder position for the two answer tokens:
# index 2334 is the negative logit and index 49922 is the positive ("yes") logit
answer_logits = outputs.logits[0, -1, [2334, 49922]]

# Probe the probability of generating "yes"
binary_entail_prob_positive = torch.nn.functional.softmax(answer_logits, dim=-1)[1].item()

# binary_entail_prob_positive corresponds to the computed probability that the chart entails the caption sentence.
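
To make the last step concrete: the snippet keeps only two logits at the final decoder position (token ids 2334 and 49922) and applies a softmax to them, so the result is a probability over just those two outcomes rather than over the whole vocabulary. With two classes, this is equivalent to a sigmoid of the logit difference. The plain-Python sketch below illustrates that arithmetic with made-up logit values; `two_way_softmax` is a hypothetical helper written for this illustration, not part of transformers or of the ChartVE code.

```python
import math


def two_way_softmax(negative_logit: float, positive_logit: float) -> float:
    """Return the softmax probability assigned to the positive logit."""
    # Subtract the larger logit first so exp() cannot overflow
    shift = max(negative_logit, positive_logit)
    exp_neg = math.exp(negative_logit - shift)
    exp_pos = math.exp(positive_logit - shift)
    return exp_pos / (exp_neg + exp_pos)


def sigmoid(x: float) -> float:
    """Logistic function, used here only to check the equivalence."""
    return 1.0 / (1.0 + math.exp(-x))


# Made-up logits for illustration only (not real model outputs)
negative_logit, positive_logit = 1.0, 3.0
prob = two_way_softmax(negative_logit, positive_logit)

# A two-class softmax equals a sigmoid of the logit difference
assert math.isclose(prob, sigmoid(positive_logit - negative_logit))
```

One consequence of this design is that the score depends only on how much the positive logit exceeds the negative one, not on how much probability mass the model places on other tokens.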

📄 License

This model is released under the Apache-2.0 license.

📚 Citation

@misc{huang-etal-2023-do,
    title = "Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning",
    author = "Huang, Kung-Hsiang  and
      Zhou, Mingyang and
      Chan, Hou Pong  and
      Fung, Yi R. and
      Wang, Zhenhailong and
      Zhang, Lingyu and
      Chang, Shih-Fu and
      Ji, Heng",
    year={2023},
    eprint={2312.10160},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}