Open-source Phi-3-mini-128k-instruct-graph model: extracting entity relationships from general text

Phi 3 Mini 128k Instruct Graph

Developed by Emergent Methods

Phi-3-mini-128k-instruct-graph is a fine-tuned version of Microsoft's Phi-3-mini-128k-instruct, specialized for extracting entity relationships from general text data.

Knowledge graph

Transformers

English #Entity relationship extraction #Structured JSON output #High-throughput processing

Downloads: 117

Released: 7/20/2024

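Model overview

The model generates structured JSON that represents the entity relationships found in general text data. It is suited to tasks such as information retrieval, trend analysis, and predictive modeling.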

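Model features

Efficient entity relationship extraction

Optimized specifically for extracting entities and their relationships from text and emitting them as structured JSON.

Quality comparable to GPT-4

Aims to match GPT-4 in quality and accuracy when generating entity relationship graphs.

High-throughput processing

Designed for processing text data at scale with improved efficiency.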

Model capabilities

Entity recognition

Relationship extraction

Structured data generation

Large-scale text processing

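Use cases

Information retrieval

Enhanced retrieval over text databases

Improves information retrieval across a variety of text databases by extracting entity relationships.

Trend analysis

Advanced predictive modeling

Advanced predictive modeling for trend analysis across a variety of text sources.

Content analysis

Document relationship exploration

Explores temporal relationships and evolving narratives across different types of documents.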

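🚀 Phi-3-mini-128k-instruct-graph model card

This model is a fine-tuned version of Microsoft's Phi-3-mini-128k-instruct, specialized for extracting entity relationships from general text data. It aims to reach quality and accuracy comparable to GPT-4 when generating entity relationship graphs, while being more efficient for large-scale processing.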

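📚 Detailed documentation

Model details

Attribute	Details
Developed by	Emergent Methods
Funded by	Emergent Methods
Shared by	Emergent Methods
Model type	microsoft/phi-3-mini-128k-instruct (fine-tuned)
Language	English
License	Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
Fine-tuned from	microsoft/phi-3-mini-128k-instruct

For more information, see our blog post: 📰 Blog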

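Uses

This model is intended to generate structured JSON that represents the entity relationships in general text data. It can be used for:

Enhancing information retrieval across a variety of text databases.
Exploring temporal relationships and evolving narratives across different types of documents.
Advanced predictive modeling for trend analysis across a variety of text sources.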
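
Exploring relationships across several documents usually means combining the per-document graphs into one. The following is a minimal sketch of such a merge, assuming each output is a dict with the "nodes" and "edges" keys described in the model's JSON schema. The function name `merge_graphs`, the de-duplication rule, and the two sample graphs are illustrative assumptions, not part of the model card.

```python
from typing import Dict, List, Tuple


def merge_graphs(graphs: List[dict]) -> dict:
    """Merge several {"nodes": [...], "edges": [...]} dicts into one graph.

    Nodes are de-duplicated by "id" (the first occurrence wins); edges are
    de-duplicated by their (from, to, label) triple.
    """
    nodes: Dict[str, dict] = {}
    edges: Dict[Tuple[str, str, str], dict] = {}
    for graph in graphs:
        for node in graph.get("nodes", []):
            nodes.setdefault(node["id"], node)
        for edge in graph.get("edges", []):
            key = (edge["from"], edge["to"], edge["label"])
            edges.setdefault(key, edge)
    return {"nodes": list(nodes.values()), "edges": list(edges.values())}


# Two hypothetical per-document outputs that share the "OpenAI" node.
doc_a = {
    "nodes": [
        {"id": "OpenAI", "type": "organization", "detailed_type": "ai research organization"},
        {"id": "San Francisco", "type": "location", "detailed_type": "city"},
    ],
    "edges": [{"from": "OpenAI", "to": "San Francisco", "label": "headquartered in"}],
}
doc_b = {
    "nodes": [
        {"id": "OpenAI", "type": "organization", "detailed_type": "ai research organization"},
        {"id": "ChatGPT", "type": "technology", "detailed_type": "generative ai"},
    ],
    "edges": [{"from": "OpenAI", "to": "ChatGPT", "label": "released"}],
}

merged = merge_graphs([doc_a, doc_b])
print(len(merged["nodes"]), len(merged["edges"]))  # 3 2
```

Matching nodes by exact "id" is the simplest possible rule. Real corpora will likely need some normalization (case, aliases) before merging, which this sketch leaves out.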

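The model is particularly well suited to applications that need high-throughput processing of large volumes of text, such as content aggregation platforms, research databases, and comprehensive text analysis systems.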
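
For high-throughput use, many documents have to be wrapped in the same prompt format and grouped into batches. The sketch below shows one way to do that with the standard library only. The `-------Text begin-------` and `-------Text end-------` delimiters are the ones used by the system prompt in the usage example of this card; the helper names `build_messages` and `batched`, the batch size, and the placeholder system prompt are assumptions made for illustration.

```python
from typing import Iterator, List

TEXT_BEGIN = "-------Text begin-------"
TEXT_END = "-------Text end-------"


def build_messages(system_prompt: str, text: str) -> List[dict]:
    """Wrap one document in the delimiters the system prompt expects."""
    user_content = f"\n{TEXT_BEGIN}\n{text.strip()}\n{TEXT_END}\n"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder only: in practice this is the full system prompt from the usage example.
SYSTEM_PROMPT = "<system prompt from the usage example>"

texts = ["First article.", "Second article.", "Third article."]
batches = [
    [build_messages(SYSTEM_PROMPT, text) for text in chunk]
    for chunk in batched(texts, batch_size=2)
]
print([len(batch) for batch in batches])  # [2, 1]
```

Each batch is a list of chat-style message lists. Whether a whole batch can be passed to the generation pipeline in a single call depends on the installed Transformers version, so that step is deliberately left out here.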

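Bias, risks, and limitations

Although the dataset aims to reduce bias and improve diversity, it is still biased toward Western languages and countries. This limitation stems from Llama2's capabilities in translation and summary generation. In addition, because Llama2 was used to summarize the open-web articles, any bias present in Llama2's training data is also present in this dataset. Likewise, any bias present in Microsoft Phi-3 is carried over into this dataset.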

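Training details

Training data: more than 7,000 stories and updates from AskNews, curated to avoid topical overlap.
Training procedure: fine-tuned with the Transformers library, SFTTrainer, PEFT, and QLoRA.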

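Evaluation results

Compared against GPT-4o (used as ground truth), Claude Sonnet 3.5, and the base Phi-3 model:

Metric	Phi-3 fine-tuned	Claude Sonnet 3.5	Phi-3 (base)
Node similarity	0.78	0.64	0.64
Edge similarity	0.49	0.41	0.30
JSON consistency	0.99	0.97	0.96
JSON similarity	0.75	0.67	0.63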

Environmental impact

Hardware type: 1x A100 SXM
Hours used: 3
Carbon emitted: 0.44 kg (according to the Machine Learning Impact calculator)

💻 Usage examples

Basic usage

The following snippet shows how to quickly run the model on a GPU:

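import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Fix the random seed for reproducibility.
torch.random.manual_seed(0)

# Load the fine-tuned model onto the GPU.
model = AutoModelForCausalLM.from_pretrained(
    "EmergentMethods/Phi-3-mini-128k-instruct-graph",
    device_map="cuda",
    torch_dtype="auto",  # use the dtype stored with the checkpoint
    trust_remote_code=True,  # allow the model code shipped with the repository to run
)

tokenizer = AutoTokenizer.from_pretrained("EmergentMethods/Phi-3-mini-128k-instruct-graph")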

messages = [ 
    {"role": "system", "content": """
A chat between a curious user and an artificial intelligence Assistant. The Assistant is an expert at identifying entities and relationships in text. The Assistant responds in JSON output only.

The User provides text in the format:

-------Text begin-------
<User provided text>
-------Text end-------

The Assistant follows the following steps before replying to the User:

1. **identify the most important entities** The Assistant identifies the most important entities in the text. These entities are listed in the JSON output under the key "nodes", they follow the structure of a list of dictionaries where each dict is:

"nodes":[{"id": <entity N>, "type": <type>, "detailed_type": <detailed type>}, ...]

where "type": <type> is a broad categorization of the entity. "detailed type": <detailed_type>  is a very descriptive categorization of the entity.

2. **determine relationships** The Assistant uses the text between -------Text begin------- and -------Text end------- to determine the relationships between the entities identified in the "nodes" list defined above. These relationships are called "edges" and they follow the structure of:

"edges":[{"from": <entity 1>, "to": <entity 2>, "label": <relationship>}, ...]

The <entity N> must correspond to the "id" of an entity in the "nodes" list.

The Assistant never repeats the same node twice. The Assistant never repeats the same edge twice.
The Assistant responds to the User in JSON only, according to the following JSON schema:

{"type":"object","properties":{"nodes":{"type":"array","items":{"type":"object","properties":{"id":{"type":"string"},"type":{"type":"string"},"detailed_type":{"type":"string"}},"required":["id","type","detailed_type"],"additionalProperties":false}},"edges":{"type":"array","items":{"type":"object","properties":{"from":{"type":"string"},"to":{"type":"string"},"label":{"type":"string"}},"required":["from","to","label"],"additionalProperties":false}}},"required":["nodes","edges"],"additionalProperties":false}
     """}, 
    {"role": "user", "content": """
-------Text begin-------
OpenAI is an American artificial intelligence (AI) research organization founded in December 2015 and headquartered in San Francisco, California. Its mission is to develop "safe and beneficial" artificial general intelligence, which it defines as "highly autonomous systems that outperform humans at most economically valuable work".[4] As a leading organization in the ongoing AI boom,[5] OpenAI is known for the GPT family of large language models, the DALL-E series of text-to-image models, and a text-to-video model named Sora.[6][7] Its release of ChatGPT in November 2022 has been credited with catalyzing widespread interest in generative AI.
-------Text end-------
"""}
] 

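# Wrap the model and tokenizer in a text-generation pipeline.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,  # return only the generated JSON, not the prompt
    "temperature": 0.0,  # not used when do_sample is False (greedy decoding)
    "do_sample": False,
}

output = pipe(messages, **generation_args)
print(output[0]['generated_text'])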

# Output:

# {
#     "nodes": [
#         {
#             "id": "OpenAI",
#             "type": "organization",
#             "detailed_type": "ai research organization"
#         },
#         {
#             "id": "GPT family",
#             "type": "technology",
#             "detailed_type": "large language models"
#         },
#         {
#             "id": "DALL-E series",
#             "type": "technology",
#             "detailed_type": "text-to-image models"
#         },
#         {
#             "id": "Sora",
#             "type": "technology",
#             "detailed_type": "text-to-video model"
#         },
#         {
#             "id": "ChatGPT",
#             "type": "technology",
#             "detailed_type": "generative ai"
#         },
#         {
#             "id": "San Francisco",
#             "type": "location",
#             "detailed_type": "city"
#         },
#         {
#             "id": "California",
#             "type": "location",
#             "detailed_type": "state"
#         },
#         {
#             "id": "December 2015",
#             "type": "date",
#             "detailed_type": "foundation date"
#         },
#         {
#             "id": "November 2022",
#             "type": "date",
#             "detailed_type": "release date"
#         }
#     ],
#     "edges": [
#         {
#             "from": "OpenAI",
#             "to": "San Francisco",
#             "label": "headquartered in"
#         },
#         {
#             "from": "San Francisco",
#             "to": "California",
#             "label": "located in"
#         },
#         {
#             "from": "OpenAI",
#             "to": "December 2015",
#             "label": "founded in"
#         },
#         {
#             "from": "OpenAI",
#             "to": "GPT family",
#             "label": "developed"
#         },
#         {
#             "from": "OpenAI",
#             "to": "DALL-E series",
#             "label": "developed"
#         },
#         {
#             "from": "OpenAI",
#             "to": "Sora",
#             "label": "developed"
#         },
#         {
#             "from": "OpenAI",
#             "to": "ChatGPT",
#             "label": "released"
#         },
#         {
#             "from": "ChatGPT",
#             "to": "November 2022",
#             "label": "released in"
#         }
#     ]
# }
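
Before the generated text is used downstream, it is worth checking that it parses as JSON and that every edge points at a declared node, as the system prompt requires. The following is a minimal sketch of such a check using only the standard library. The function names `parse_graph` and `adjacency` and the shortened sample string are illustrative assumptions; in practice the input would be `output[0]['generated_text']` from the example above.

```python
import json
from collections import defaultdict
from typing import Dict, List, Tuple


def parse_graph(generated_text: str) -> dict:
    """Parse model output and check the basic structure of the graph."""
    graph = json.loads(generated_text)
    if not isinstance(graph, dict):
        raise ValueError("expected a JSON object at the top level")
    for key in ("nodes", "edges"):
        if not isinstance(graph.get(key), list):
            raise ValueError(f"missing or invalid '{key}' list")
    node_ids = {node["id"] for node in graph["nodes"]}
    for edge in graph["edges"]:
        for end in ("from", "to"):
            if edge[end] not in node_ids:
                raise ValueError(f"edge refers to unknown node: {edge[end]!r}")
    return graph


def adjacency(graph: dict) -> Dict[str, List[Tuple[str, str]]]:
    """Map each source node id to its outgoing (label, target id) pairs."""
    out = defaultdict(list)
    for edge in graph["edges"]:
        out[edge["from"]].append((edge["label"], edge["to"]))
    return dict(out)


# Shortened version of the output shown above, used here as sample input.
sample = """
{"nodes": [
    {"id": "OpenAI", "type": "organization", "detailed_type": "ai research organization"},
    {"id": "San Francisco", "type": "location", "detailed_type": "city"},
    {"id": "California", "type": "location", "detailed_type": "state"}],
 "edges": [
    {"from": "OpenAI", "to": "San Francisco", "label": "headquartered in"},
    {"from": "San Francisco", "to": "California", "label": "located in"}]}
"""

graph = parse_graph(sample)
print(adjacency(graph)["OpenAI"])  # [('headquartered in', 'San Francisco')]
```

This check covers only the parts of the schema that matter for graph traversal. A full validation against the JSON schema in the system prompt would need a schema validator, which is outside the scope of this sketch.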