Phi-3-mini-4k-instruct-graph: an open-source model for extracting entities and relationships from text to generate relationship graphs


Phi 3 Mini 4k Instruct Graph

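Developed by EmergentMethods

Phi-3-mini-4k-instruct-graph is a fine-tuned version of Microsoft's Phi-3-mini-4k-instruct, built specifically for entity relationship extraction from general text data. It aims to match GPT-4 in the quality and accuracy of the entity relationship graphs it generates.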

Knowledge graph

Transformers

English

#entity relationship graph generation #efficient JSON output #large-scale text processing

Downloads: 524

Released: 7/20/2024

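Model Overview

The model is designed to generate structured JSON data representing entity relationships from general text data. It can be used to enhance information retrieval, explore temporal relationships, and perform trend analysis.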

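Model Features

High-quality entity relationship extraction

Aims for quality and accuracy comparable to GPT-4 when generating entity relationship graphs.

Efficient processing

Improves efficiency when processing text data at scale, making it suitable for high-throughput applications.

Structured output

Generates structured JSON data representing entity relationships, which simplifies downstream processing and analysis.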

Model Capabilities

Entity recognition

Relationship extraction

Structured JSON generation

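Use Cases

Information retrieval

Enhancing text databases

Improves information retrieval across a variety of text databases by extracting entities and relationships.

Trend analysis

Advanced predictive modeling

Supports advanced predictive modeling through trend analysis over a variety of text sources.

Content analysis

Content aggregation platforms

Suited to applications that need high-throughput processing of large volumes of text data, such as content aggregation platforms.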
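To make the information retrieval use case concrete, here is a minimal sketch of an entity index built from extracted graphs. It assumes each graph is a dict with an "edges" list whose items carry "from", "to", and "label" keys, as in the model's JSON schema. The function name `index_edges_by_entity` and the document ids are hypothetical and not part of the model card.

```python
from collections import defaultdict


def index_edges_by_entity(graphs: dict) -> dict:
    """Map each entity id to the (doc_id, triple) pairs it appears in.

    `graphs` maps a document id to a graph dict with an "edges" list.
    """
    index = defaultdict(list)
    for doc_id, graph in graphs.items():
        for edge in graph["edges"]:
            triple = (edge["from"], edge["label"], edge["to"])
            index[edge["from"]].append((doc_id, triple))
            # Avoid listing a self-loop twice under the same entity.
            if edge["to"] != edge["from"]:
                index[edge["to"]].append((doc_id, triple))
    return dict(index)


# Hypothetical graphs extracted from two documents.
graphs = {
    "doc1": {"nodes": [], "edges": [{"from": "OpenAI", "to": "ChatGPT", "label": "released"}]},
    "doc2": {"nodes": [], "edges": [{"from": "OpenAI", "to": "Sora", "label": "developed"}]},
}
index = index_edges_by_entity(graphs)
for doc_id, triple in index["OpenAI"]:
    print(doc_id, triple)
```

Looking up an entity then returns every relationship that mentions it, together with the document it came from, which is one simple way to layer entity-based lookup on top of a text database.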

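🚀 Phi-3-mini-4k-instruct-graph Model Card

This model is a fine-tuned version of Microsoft's Phi-3-mini-4k-instruct, built specifically for entity relationship extraction from general text data. It aims to match GPT-4 in the quality and accuracy of the entity relationship graphs it generates, while being more efficient for large-scale processing.

🚀 Quick Start

The following code snippet shows how to quickly run the model on a GPU: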

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Fix the random seed so repeated runs are reproducible.
torch.random.manual_seed(0)

# Load the fine-tuned model onto the GPU; "auto" uses the dtype stored in the checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "EmergentMethods/Phi-3-mini-4k-instruct-graph",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained("EmergentMethods/Phi-3-mini-4k-instruct-graph")

messages = [ 
    {"role": "system", "content": """
A chat between a curious user and an artificial intelligence Assistant. The Assistant is an expert at identifying entities and relationships in text. The Assistant responds in JSON output only.

The User provides text in the format:

-------Text begin-------
<User provided text>
-------Text end-------

The Assistant follows the following steps before replying to the User:

1. **identify the most important entities** The Assistant identifies the most important entities in the text. These entities are listed in the JSON output under the key "nodes", they follow the structure of a list of dictionaries where each dict is:

"nodes":[{"id": <entity N>, "type": <type>, "detailed_type": <detailed type>}, ...]

where "type": <type> is a broad categorization of the entity. "detailed type": <detailed_type>  is a very descriptive categorization of the entity.

2. **determine relationships** The Assistant uses the text between -------Text begin------- and -------Text end------- to determine the relationships between the entities identified in the "nodes" list defined above. These relationships are called "edges" and they follow the structure of:

"edges":[{"from": <entity 1>, "to": <entity 2>, "label": <relationship>}, ...]

The <entity N> must correspond to the "id" of an entity in the "nodes" list.

The Assistant never repeats the same node twice. The Assistant never repeats the same edge twice.
The Assistant responds to the User in JSON only, according to the following JSON schema:

{"type":"object","properties":{"nodes":{"type":"array","items":{"type":"object","properties":{"id":{"type":"string"},"type":{"type":"string"},"detailed_type":{"type":"string"}},"required":["id","type","detailed_type"],"additionalProperties":false}},"edges":{"type":"array","items":{"type":"object","properties":{"from":{"type":"string"},"to":{"type":"string"},"label":{"type":"string"}},"required":["from","to","label"],"additionalProperties":false}}},"required":["nodes","edges"],"additionalProperties":false}
     """}, 
    {"role": "user", "content": """
-------Text begin-------
OpenAI is an American artificial intelligence (AI) research organization founded in December 2015 and headquartered in San Francisco, California. Its mission is to develop "safe and beneficial" artificial general intelligence, which it defines as "highly autonomous systems that outperform humans at most economically valuable work".[4] As a leading organization in the ongoing AI boom,[5] OpenAI is known for the GPT family of large language models, the DALL-E series of text-to-image models, and a text-to-video model named Sora.[6][7] Its release of ChatGPT in November 2022 has been credited with catalyzing widespread interest in generative AI.
-------Text end-------
"""}
] 

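# Wrap the model and tokenizer in a text-generation pipeline.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

# Greedy decoding (do_sample=False) keeps the JSON output deterministic;
# temperature has no effect in this mode.
generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

output = pipe(messages, **generation_args)
print(output[0]['generated_text'])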

# Output:

# {
#     "nodes": [
#         {
#             "id": "OpenAI",
#             "type": "organization",
#             "detailed_type": "ai research organization"
#         },
#         {
#             "id": "GPT family",
#             "type": "technology",
#             "detailed_type": "large language models"
#         },
#         {
#             "id": "DALL-E series",
#             "type": "technology",
#             "detailed_type": "text-to-image models"
#         },
#         {
#             "id": "Sora",
#             "type": "technology",
#             "detailed_type": "text-to-video model"
#         },
#         {
#             "id": "ChatGPT",
#             "type": "technology",
#             "detailed_type": "generative ai"
#         },
#         {
#             "id": "San Francisco",
#             "type": "location",
#             "detailed_type": "city"
#         },
#         {
#             "id": "California",
#             "type": "location",
#             "detailed_type": "state"
#         },
#         {
#             "id": "December 2015",
#             "type": "date",
#             "detailed_type": "foundation date"
#         },
#         {
#             "id": "November 2022",
#             "type": "date",
#             "detailed_type": "release date"
#         }
#     ],
#     "edges": [
#         {
#             "from": "OpenAI",
#             "to": "San Francisco",
#             "label": "headquartered in"
#         },
#         {
#             "from": "San Francisco",
#             "to": "California",
#             "label": "located in"
#         },
#         {
#             "from": "OpenAI",
#             "to": "December 2015",
#             "label": "founded in"
#         },
#         {
#             "from": "OpenAI",
#             "to": "GPT family",
#             "label": "developed"
#         },
#         {
#             "from": "OpenAI",
#             "to": "DALL-E series",
#             "label": "developed"
#         },
#         {
#             "from": "OpenAI",
#             "to": "Sora",
#             "label": "developed"
#         },
#         {
#             "from": "OpenAI",
#             "to": "ChatGPT",
#             "label": "released"
#         },
#         {
#             "from": "ChatGPT",
#             "to": "November 2022",
#             "label": "released in"
#         }
#     ]
# }
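
The system prompt asks that every edge endpoint match a node "id" and that no node or edge be repeated. A fine-tuned model can still deviate from these rules, so it may be worth checking the reply before using it. The following is a minimal sketch of such a check; the helper name `validate_graph` and the shortened sample reply are illustrative assumptions, not part of the model card.

```python
import json


def validate_graph(graph: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the graph passed."""
    problems = []

    # Rule: the same node must not appear twice.
    seen_ids = set()
    for node in graph.get("nodes", []):
        node_id = node.get("id")
        if node_id in seen_ids:
            problems.append(f"duplicate node: {node_id!r}")
        seen_ids.add(node_id)

    seen_edges = set()
    for edge in graph.get("edges", []):
        # Rule: the same edge must not appear twice.
        key = (edge.get("from"), edge.get("to"), edge.get("label"))
        if key in seen_edges:
            problems.append(f"duplicate edge: {key!r}")
        seen_edges.add(key)
        # Rule: both endpoints must be ids from the "nodes" list.
        for endpoint in ("from", "to"):
            if edge.get(endpoint) not in seen_ids:
                problems.append(
                    f"edge {endpoint!r} endpoint is not a node id: {edge.get(endpoint)!r}"
                )
    return problems


# Shortened, hand-written stand-in for output[0]['generated_text'].
raw_reply = """
{"nodes": [{"id": "OpenAI", "type": "organization", "detailed_type": "ai research organization"},
           {"id": "Sora", "type": "technology", "detailed_type": "text-to-video model"}],
 "edges": [{"from": "OpenAI", "to": "Sora", "label": "developed"}]}
"""
graph = json.loads(raw_reply)
print(validate_graph(graph))
```

Returning a list of problems rather than raising on the first one makes it easier to log every issue for a document when processing text in bulk.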

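✨ Key Features

The model is designed to generate structured JSON data representing entity relationships from general text data.
It can be used to enhance information retrieval across a variety of text databases.
It helps explore temporal relationships and evolving narratives across different types of documents.
It supports advanced predictive modeling for trend analysis over a variety of text sources.
It is particularly suited to applications that need high-throughput processing of large volumes of text data, such as content aggregation platforms, research databases, and comprehensive text analysis systems.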
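
For high-throughput use, each input text has to be wrapped in the "Text begin" and "Text end" delimiters that the system prompt expects. The sketch below prepares one chat per document. The helper names `build_messages` and `build_batch` are hypothetical, and `system_prompt` stands for the system message from the quick-start example.

```python
TEXT_BEGIN = "-------Text begin-------"
TEXT_END = "-------Text end-------"


def build_messages(system_prompt: str, text: str) -> list[dict]:
    """Build one system + user chat for a single document."""
    user_content = f"\n{TEXT_BEGIN}\n{text.strip()}\n{TEXT_END}\n"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]


def build_batch(system_prompt: str, texts: list[str]) -> list[list[dict]]:
    """Build one chat per input text, preserving input order."""
    return [build_messages(system_prompt, text) for text in texts]


# Placeholder prompt; in practice, reuse the full system prompt from the quick start.
system_prompt = "The Assistant responds in JSON output only."
batch = build_batch(system_prompt, ["First article text.", "Second article text."])
print(batch[0][1]["content"])
```

Each element of `batch` has the same shape as the `messages` list in the quick-start example, so it can be passed to the same pipeline call.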

📚 Detailed Documentation

Model Details

Developed by: Emergent Methods
Funded by: Emergent Methods
Shared by: Emergent Methods
Model type: microsoft/Phi-3-mini-4k-instruct (fine-tuned)
Language: English
License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)
Fine-tuned from model: [microsoft/Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)