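Phi-3-mini-4k-instruct-graph open-source model: extracts entity relationships from text to generate relationship graphs

Phi 3 Mini 4k Instruct Graph

Developed by EmergentMethods

Phi-3-mini-4k-instruct-graph is a fine-tuned version of Microsoft's Phi-3-mini-4k-instruct, built specifically to extract entity relationships from general text data. It aims to match GPT-4's quality and accuracy in generating entity relationship graphs.

Knowledge graph

Transformers

English #Entity relationship graph generation #Efficient JSON output #Large-scale text processing

Downloads: 524

Release date: 7/20/2024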

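Model overview

This model is designed to generate structured JSON data representing entity relationships from general text data. It can be used to enhance information retrieval, explore temporal relationships, and support trend analysis.

Model features

High-quality entity relationship extraction

Aims to match GPT-4's quality and accuracy when generating entity relationship graphs.

Efficient processing

Improves efficiency when processing large volumes of text, making it suitable for high-throughput applications.

Structured output

Generates structured JSON data representing entity relationships, which simplifies downstream processing and analysis.

Model capabilities

Entity identification

Relation extraction

Structured JSON generation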

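Use cases

Information retrieval

Text database enhancement

Extracting entities and relationships strengthens the information retrieval capabilities of a wide range of text databases.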
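One way extracted entities can support retrieval is an inverted index from entity to documents. The sketch below assumes each document has already been converted into the `{"nodes": [...], "edges": [...]}` JSON structure this model emits; the document IDs, the function names, and the case-insensitive matching are illustrative assumptions, not part of the model or its tooling.

```python
from collections import defaultdict

def build_entity_index(graphs_by_doc):
    """Map each lowercased entity id to the set of document ids that mention it."""
    index = defaultdict(set)
    for doc_id, graph in graphs_by_doc.items():
        for node in graph["nodes"]:
            index[node["id"].lower()].add(doc_id)
    return index

def docs_mentioning(index, entity):
    """Return a sorted list of document ids whose graph contains the entity."""
    return sorted(index.get(entity.lower(), set()))

# Hypothetical, hand-written graphs standing in for model output.
graphs_by_doc = {
    "doc-1": {
        "nodes": [
            {"id": "OpenAI", "type": "organization", "detailed_type": "ai research organization"},
            {"id": "San Francisco", "type": "location", "detailed_type": "city"},
        ],
        "edges": [{"from": "OpenAI", "to": "San Francisco", "label": "headquartered in"}],
    },
    "doc-2": {
        "nodes": [
            {"id": "OpenAI", "type": "organization", "detailed_type": "ai research organization"},
            {"id": "ChatGPT", "type": "technology", "detailed_type": "generative ai"},
        ],
        "edges": [{"from": "OpenAI", "to": "ChatGPT", "label": "released"}],
    },
}

index = build_entity_index(graphs_by_doc)
print(docs_mentioning(index, "openai"))   # ['doc-1', 'doc-2']
print(docs_mentioning(index, "ChatGPT"))  # ['doc-2']
```

A real system would likely also normalize aliases (for example "Open AI" versus "OpenAI"), which this sketch does not attempt.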

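Trend analysis

Advanced predictive modeling

Supports advanced predictive modeling by enabling trend analysis across a variety of text sources.

Content analysis

Content aggregation platforms

Suited to applications that must process large volumes of text at high throughput, such as content aggregation platforms.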

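🚀 Phi-3-mini-4k-instruct-graph

This model is a fine-tune of Microsoft's Phi-3-mini-4k-instruct, specialized for entity relationship extraction from general text data. It aims to generate entity relationship graphs with quality and accuracy comparable to GPT-4 while improving efficiency for large-scale processing.

🚀 Quick start

This code snippet shows how to run the model on a GPU.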

import torch 
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline 

torch.random.manual_seed(0) 
model = AutoModelForCausalLM.from_pretrained( 
    "EmergentMethods/Phi-3-mini-4k-instruct-graph",  
    device_map="cuda",  
    torch_dtype="auto",  
    trust_remote_code=True,  
) 

tokenizer = AutoTokenizer.from_pretrained("EmergentMethods/Phi-3-mini-4k-instruct-graph") 

messages = [ 
    {"role": "system", "content": """
A chat between a curious user and an artificial intelligence Assistant. The Assistant is an expert at identifying entities and relationships in text. The Assistant responds in JSON output only.

The User provides text in the format:

-------Text begin-------
<User provided text>
-------Text end-------

The Assistant follows the following steps before replying to the User:

1. **identify the most important entities** The Assistant identifies the most important entities in the text. These entities are listed in the JSON output under the key "nodes", they follow the structure of a list of dictionaries where each dict is:

"nodes":[{"id": <entity N>, "type": <type>, "detailed_type": <detailed type>}, ...]

where "type": <type> is a broad categorization of the entity. "detailed type": <detailed_type>  is a very descriptive categorization of the entity.

2. **determine relationships** The Assistant uses the text between -------Text begin------- and -------Text end------- to determine the relationships between the entities identified in the "nodes" list defined above. These relationships are called "edges" and they follow the structure of:

"edges":[{"from": <entity 1>, "to": <entity 2>, "label": <relationship>}, ...]

The <entity N> must correspond to the "id" of an entity in the "nodes" list.

The Assistant never repeats the same node twice. The Assistant never repeats the same edge twice.
The Assistant responds to the User in JSON only, according to the following JSON schema:

{"type":"object","properties":{"nodes":{"type":"array","items":{"type":"object","properties":{"id":{"type":"string"},"type":{"type":"string"},"detailed_type":{"type":"string"}},"required":["id","type","detailed_type"],"additionalProperties":false}},"edges":{"type":"array","items":{"type":"object","properties":{"from":{"type":"string"},"to":{"type":"string"},"label":{"type":"string"}},"required":["from","to","label"],"additionalProperties":false}}},"required":["nodes","edges"],"additionalProperties":false}
     """}, 
    {"role": "user", "content": """
-------Text begin-------
OpenAI is an American artificial intelligence (AI) research organization founded in December 2015 and headquartered in San Francisco, California. Its mission is to develop "safe and beneficial" artificial general intelligence, which it defines as "highly autonomous systems that outperform humans at most economically valuable work".[4] As a leading organization in the ongoing AI boom,[5] OpenAI is known for the GPT family of large language models, the DALL-E series of text-to-image models, and a text-to-video model named Sora.[6][7] Its release of ChatGPT in November 2022 has been credited with catalyzing widespread interest in generative AI.
-------Text end-------
"""}
] 

pipe = pipeline( 
    "text-generation", 
    model=model, 
    tokenizer=tokenizer, 
) 

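# Greedy decoding: with do_sample=False the output is deterministic,
# and the temperature value is not used for sampling.
generation_args = { 
    "max_new_tokens": 500,  # upper bound on generated tokens; long graphs may be truncated
    "return_full_text": False,  # return only the newly generated text, not the prompt
    "temperature": 0.0, 
    "do_sample": False, 
} 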

output = pipe(messages, **generation_args) 
print(output[0]['generated_text'])

# Output:

# {
#     "nodes": [
#         {
#             "id": "OpenAI",
#             "type": "organization",
#             "detailed_type": "ai research organization"
#         },
#         {
#             "id": "GPT family",
#             "type": "technology",
#             "detailed_type": "large language models"
#         },
#         {
#             "id": "DALL-E series",
#             "type": "technology",
#             "detailed_type": "text-to-image models"
#         },
#         {
#             "id": "Sora",
#             "type": "technology",
#             "detailed_type": "text-to-video model"
#         },
#         {
#             "id": "ChatGPT",
#             "type": "technology",
#             "detailed_type": "generative ai"
#         },
#         {
#             "id": "San Francisco",
#             "type": "location",
#             "detailed_type": "city"
#         },
#         {
#             "id": "California",
#             "type": "location",
#             "detailed_type": "state"
#         },
#         {
#             "id": "December 2015",
#             "type": "date",
#             "detailed_type": "foundation date"
#         },
#         {
#             "id": "November 2022",
#             "type": "date",
#             "detailed_type": "release date"
#         }
#     ],
#     "edges": [
#         {
#             "from": "OpenAI",
#             "to": "San Francisco",
#             "label": "headquartered in"
#         },
#         {
#             "from": "San Francisco",
#             "to": "California",
#             "label": "located in"
#         },
#         {
#             "from": "OpenAI",
#             "to": "December 2015",
#             "label": "founded in"
#         },
#         {
#             "from": "OpenAI",
#             "to": "GPT family",
#             "label": "developed"
#         },
#         {
#             "from": "OpenAI",
#             "to": "DALL-E series",
#             "label": "developed"
#         },
#         {
#             "from": "OpenAI",
#             "to": "Sora",
#             "label": "developed"
#         },
#         {
#             "from": "OpenAI",
#             "to": "ChatGPT",
#             "label": "released"
#         },
#         {
#             "from": "ChatGPT",
#             "to": "November 2022",
#             "label": "released in"
#         }
#     ]
# }
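
Because the model returns the graph as a JSON string, downstream code usually needs to parse and sanity-check it before use. The following is a minimal sketch that uses only the standard library and checks the rules stated in the system prompt above (required keys, unique nodes, unique edges, edges that reference known node ids). The function names `validate_graph` and `to_adjacency` are hypothetical helpers, not part of the model's tooling, and the sample string is a shortened hand-written stand-in for real output.

```python
import json

NODE_KEYS = {"id", "type", "detailed_type"}
EDGE_KEYS = {"from", "to", "label"}

def validate_graph(raw):
    """Parse a JSON string and check it against the structure the system prompt requests."""
    graph = json.loads(raw)  # raises json.JSONDecodeError on malformed or truncated output
    if not isinstance(graph, dict) or set(graph) != {"nodes", "edges"}:
        raise ValueError("expected an object with exactly the keys 'nodes' and 'edges'")
    ids = set()
    for node in graph["nodes"]:
        if not isinstance(node, dict) or set(node) != NODE_KEYS:
            raise ValueError(f"malformed node: {node!r}")
        if node["id"] in ids:
            raise ValueError(f"duplicate node id: {node['id']}")
        ids.add(node["id"])
    seen_edges = set()
    for edge in graph["edges"]:
        if not isinstance(edge, dict) or set(edge) != EDGE_KEYS:
            raise ValueError(f"malformed edge: {edge!r}")
        if edge["from"] not in ids or edge["to"] not in ids:
            raise ValueError(f"edge references an unknown node: {edge!r}")
        key = (edge["from"], edge["to"], edge["label"])
        if key in seen_edges:
            raise ValueError(f"duplicate edge: {edge!r}")
        seen_edges.add(key)
    return graph

def to_adjacency(graph):
    """Convert a validated graph into {source id: [(label, target id), ...]}."""
    adjacency = {node["id"]: [] for node in graph["nodes"]}
    for edge in graph["edges"]:
        adjacency[edge["from"]].append((edge["label"], edge["to"]))
    return adjacency

# Shortened, hand-written sample in the same shape as the output above.
sample = (
    '{"nodes": ['
    '{"id": "OpenAI", "type": "organization", "detailed_type": "ai research organization"}, '
    '{"id": "San Francisco", "type": "location", "detailed_type": "city"}], '
    '"edges": [{"from": "OpenAI", "to": "San Francisco", "label": "headquartered in"}]}'
)
adjacency = to_adjacency(validate_graph(sample))
print(adjacency["OpenAI"])  # [('headquartered in', 'San Francisco')]
```

In practice, `output[0]['generated_text']` from the quick-start snippet would be passed to `validate_graph`. If the text is cut off by `max_new_tokens`, parsing fails with `json.JSONDecodeError`, so callers may want to catch that and retry with a larger limit.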

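✨ Key features

This model is designed to generate structured JSON data representing the entity relationships in general text data. It can be used for:

Enhancing information retrieval across a variety of text databases
Exploring temporal relationships and evolving narratives across different types of documents
Advanced predictive modeling for trend analysis across diverse text sources

The model is particularly useful for applications that must process large volumes of text at high throughput, such as content aggregation platforms, research databases, and comprehensive text analysis systems.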
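
When many documents are processed, each one has to be wrapped in the `-------Text begin-------` / `-------Text end-------` delimiters that the system prompt describes. A minimal sketch of a helper for that follows; `build_messages` and `build_batch` are hypothetical names, and the short `SYSTEM_PROMPT` string is only a placeholder for the full system prompt shown in the quick-start snippet.

```python
from typing import Dict, List

TEXT_BEGIN = "-------Text begin-------"
TEXT_END = "-------Text end-------"

# Placeholder only: in real use this should be the full system prompt from the quick start.
SYSTEM_PROMPT = "The Assistant responds in JSON output only."

def build_messages(system_prompt: str, text: str) -> List[Dict[str, str]]:
    """Wrap one document in the delimiters and return a system + user chat."""
    user_content = f"\n{TEXT_BEGIN}\n{text.strip()}\n{TEXT_END}\n"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]

def build_batch(system_prompt: str, texts: List[str]) -> List[List[Dict[str, str]]]:
    """Build one chat per document, preserving input order."""
    return [build_messages(system_prompt, text) for text in texts]

batch = build_batch(SYSTEM_PROMPT, ["First article text.", "Second article text."])
print(len(batch))                 # 2
print(batch[0][1]["content"])
```

Each element of `batch` has the same shape as the `messages` list in the quick start, so it can be passed to the pipeline one chat at a time. Recent Transformers versions should also accept a list of chats in a single call, but that is worth verifying against the installed version.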

📚 Documentation

Model details

Developed by: Emergent Methods
Funded by: Emergent Methods
Shared by: Emergent Methods
Model type: microsoft/phi-3-mini-4k-instruct (fine-tuned)
Language: English
License: Creative Commons Attribution Non Commercial Share Alike 4.0
Fine-tuned from model: microsoft/phi-3-mini-4k-instruct

For more information, see the blog post.

📰 Blog

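Bias, risks, and limitations

Although the goal of the dataset is to reduce bias and improve diversity, it remains biased toward Western languages and countries. This limitation stems from the use of Llama2 for translation and summary generation. Any biases present in Llama2's training data are also present in this dataset, as are any biases present in Microsoft Phi-3.

Training details

Training data: More than 7,000 articles and updates from AskNews, curated to avoid topic overlap.
Training procedure: Fine-tuned using the Transformers library, SFTTrainer, PEFT, and QLoRA.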

Evaluation results

Results compared against GPT-4o (used as ground truth), Claude Sonnet 3.5, and the base Phi-3 model.

Metric	Phi-3 fine-tuned	Claude Sonnet 3.5	Phi-3 (base)
Node similarity	0.78	0.64	0.64
Edge similarity	0.49	0.41	0.30
JSON consistency	0.99	0.97	0.96
JSON similarity	0.75	0.67	0.63
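
This page does not define how the metrics above are computed. To give a feel for what such scores could measure, the sketch below shows one plausible reading: "JSON consistency" as the fraction of outputs that parse into an object with `nodes` and `edges`, and node similarity as Jaccard overlap of node ids against a reference graph. Both definitions and both function names are assumptions for illustration, not the authors' evaluation code, and they will not reproduce the reported numbers.

```python
import json

def json_consistency(outputs):
    """Assumed definition: share of raw outputs that parse as an object with 'nodes' and 'edges'."""
    if not outputs:
        return 0.0
    valid = 0
    for raw in outputs:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict) and {"nodes", "edges"} <= set(parsed):
            valid += 1
    return valid / len(outputs)

def node_jaccard(predicted, reference):
    """Assumed definition: Jaccard overlap of lowercased node ids between two graphs."""
    pred_ids = {node["id"].lower() for node in predicted["nodes"]}
    ref_ids = {node["id"].lower() for node in reference["nodes"]}
    if not pred_ids and not ref_ids:
        return 1.0  # two empty graphs are treated as identical
    return len(pred_ids & ref_ids) / len(pred_ids | ref_ids)

# Hand-written toy data, not real evaluation samples.
reference = {"nodes": [{"id": "A"}, {"id": "B"}, {"id": "C"}], "edges": []}
predicted = {"nodes": [{"id": "a"}, {"id": "B"}, {"id": "D"}], "edges": []}

print(node_jaccard(predicted, reference))  # 0.5 (2 shared ids out of 4 distinct ids)
print(json_consistency(['{"nodes": [], "edges": []}', "not json"]))  # 0.5
```

Exact string matching on ids is strict; a similarity-based comparison (for example over embeddings) would credit near-matches such as "GPT family" versus "GPT models", which this sketch does not.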

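Environmental impact

Hardware type: 1x A100 SXM
Hours used: 3
Carbon emitted: 0.44 kg (according to the Machine Learning Impact calculator)

Ethical considerations

Users should be aware that this model is designed for entity relationship extraction from general text data and may not be suitable for other domains without further fine-tuning. Model outputs should be reviewed and validated, especially when used in applications that may influence decision-making or public communication.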