gliner_large_news-v2.1: open-source news entity recognition model for efficient entity extraction from long news text

Gliner Large News V2.1

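Developed by EmergentMethods

A news-domain entity recognition model fine-tuned from GLiNER. It is geared toward entity extraction from long news text, with zero-shot accuracy improved by up to 7.5% across 18 benchmark datasets.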

Sequence labeling

PyTorch

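Language: English

License: Apache-2.0

Tags: #NewsEntityExtraction #MultilingualSupport #ZeroShotLearning

Downloads: 2,558

Released: 4/18/2024

Model overview

An entity recognition model optimized for the news domain. It is built on the microsoft/deberta architecture and fine-tuned on synthetic data to improve accuracy across a broad range of topics. It also handles text translated into English from multiple languages.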

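Model features

Cross-domain performance gains

Zero-shot accuracy improved by up to 7.5% over the base model across 18 benchmark datasets

News-domain optimization

Specifically optimized for entity extraction from long news text

Global-perspective data

Training data designed to enforce diversity across country, language, topic, and time

Efficient inference

Compact model size, suited to high-throughput production environments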

Model capabilities

News entity recognition

Multilingual text processing

Zero-shot learning

Long-text analysis

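Use cases

News analysis

Entity extraction from news events

Extracts key entities such as people, locations, and organizations from news reports

The example demonstrates recognition accuracy above 90% on key entities

Cross-lingual news processing

Processes multilingual news content after translation

Supports text translated from 11 languages

Content analysis

Event linking

Links news events to one another through the entities they mention

Already in production use in the AskNews entity extraction system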
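
Event linking through shared entities can be sketched in a few lines. The snippet below is an illustrative assumption, not a description of how AskNews implements it: the `article_entities` data and the `shared_entity_links` helper are hypothetical names, and the input is assumed to be a mapping from article ID to the set of (entity text, label) pairs extracted from that article.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical input: article ID -> set of (entity text, label) pairs
article_entities = {
    "a1": {("Ciudad Juárez", "location"), ("SSPE", "organization")},
    "a2": {("Ciudad Juárez", "location"), ("GMC Yukon", "vehicle")},
    "a3": {("Oslo", "location")},
}


def shared_entity_links(article_entities, min_shared=1):
    """Return {(id_a, id_b): shared entities} for article pairs that share entities."""
    # Inverted index: entity -> IDs of the articles that mention it
    index = defaultdict(set)
    for art_id, ents in article_entities.items():
        for ent in ents:
            index[ent].add(art_id)

    # Every pair of articles under the same entity gets a link
    links = defaultdict(set)
    for ent, ids in index.items():
        for a, b in combinations(sorted(ids), 2):
            links[(a, b)].add(ent)

    # Keep only pairs with enough entities in common
    return {pair: ents for pair, ents in links.items() if len(ents) >= min_shared}


if __name__ == "__main__":
    for pair, ents in shared_entity_links(article_entities).items():
        print(pair, "share", sorted(ents))
```

Raising `min_shared` is one simple way to suppress links that rest on a single common entity, such as a large city that appears in many unrelated stories.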

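🚀 gliner_large_news-v2.1 model card

This model is a fine-tune of GLiNER aimed at improving accuracy across a broad range of topics, with a particular emphasis on long-context news entity extraction. As shown in the table below, these fine-tunes improved zero-shot accuracy over the base GLiNER model by up to 7.5% across 18 benchmark datasets.

[Results table]

The underlying dataset, AskNews-NER-v0, was designed to enrich global perspectives by enforcing diversity across country, language, topic, and time. All data used to fine-tune this model was synthetically generated. WizardLM 13B v1.2 was used to translate and summarize open-web news articles, and Llama3 70b instruct was used for entity extraction. Details of the diversity and fine-tuning methods are available in our paper on ArXiv.

🚀 Quick start

Use the code below to get started with the model: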

# GLiNER class from the gliner package
from gliner import GLiNER

# Load the fine-tuned news checkpoint by its model ID
model = GLiNER.from_pretrained("EmergentMethods/gliner_large_news-v2.1")

text = """
The Chihuahua State Public Security Secretariat (SSPE) arrested 35-year-old Salomón C. T. in Ciudad Juárez, found in possession of a stolen vehicle, a white GMC Yukon, which was reported stolen in the city's streets. The arrest was made by intelligence and police analysis personnel during an investigation in the border city. The arrest is related to a previous detention on February 6, which involved armed men in a private vehicle. The detainee and the vehicle were turned over to the Chihuahua State Attorney General's Office for further investigation into the case. 
"""

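# Entity types to extract, given as plain strings at inference time
labels = ["person", "location", "date", "event", "facility", "vehicle", "number", "organization"]

# Returns a list of dicts, one per detected entity span
entities = model.predict_entities(text, labels)

for entity in entities:
    print(entity["text"], "=>", entity["label"])

Output: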

Chihuahua State Public Security Secretariat => organization
SSPE => organization
35-year-old => number
Salomón C. T. => person
Ciudad Juárez => location
GMC Yukon => vehicle
February 6 => date
Chihuahua State Attorney General's Office => organization
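
Output like the above is often easier to consume once it is grouped by label. The sketch below assumes each entity dict also carries a `score` key alongside `text` and `label`. The `sample_entities` list is hypothetical stand-in data with made-up scores, so the snippet runs without downloading the model, and `group_by_label` is an illustrative helper, not part of the gliner package.

```python
from collections import defaultdict

# Hypothetical stand-in for the list returned by model.predict_entities;
# the scores are invented for illustration.
sample_entities = [
    {"text": "Chihuahua State Public Security Secretariat", "label": "organization", "score": 0.97},
    {"text": "SSPE", "label": "organization", "score": 0.91},
    {"text": "Salomón C. T.", "label": "person", "score": 0.88},
    {"text": "Ciudad Juárez", "label": "location", "score": 0.95},
    {"text": "GMC Yukon", "label": "vehicle", "score": 0.42},
]


def group_by_label(entities, min_score=0.5):
    """Group entity texts by label, dropping low-score and duplicate mentions."""
    grouped = defaultdict(list)
    for ent in entities:
        # Entities without a score are kept (treated as fully confident)
        if ent.get("score", 1.0) < min_score:
            continue
        if ent["text"] not in grouped[ent["label"]]:
            grouped[ent["label"]].append(ent["text"])
    return dict(grouped)


if __name__ == "__main__":
    for label, texts in group_by_label(sample_entities).items():
        print(label, "=>", texts)
```

With real model output, `sample_entities` would be replaced by the `entities` list from the quick-start code. A suitable `min_score` depends on the data and should be tuned rather than taken from this sketch.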

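✨ Key features

This model is a fine-tune of GLiNER that improves accuracy across a broad range of topics and is especially suited to long-context news entity extraction.
The underlying dataset was deliberately designed to enforce diversity across country, language, topic, and time.
The model was fine-tuned on synthetic data and improves zero-shot accuracy by up to 7.5% across 18 benchmark datasets.
The model is compact and suited to high-throughput production use.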
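
For articles longer than a given deployment can comfortably process in one call, one common workaround is to split the text into overlapping word windows and map the predicted spans back to document offsets. The sketch below is an assumption, not a method described in the model card: `window_text` and `predict_long` are hypothetical helpers, the window sizes are arbitrary, and `predict_fn` stands for any callable with the shape of `model.predict_entities(text, labels)` that returns dicts with `start`, `end`, and `label` keys.

```python
import re


def window_text(text, max_words=300, overlap=50):
    """Return (char_offset, chunk) windows of at most max_words words each."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    # Character span of every whitespace-delimited word
    spans = [m.span() for m in re.finditer(r"\S+", text)]
    step = max_words - overlap
    windows = []
    for i in range(0, len(spans), step):
        chunk = spans[i:i + max_words]
        start, end = chunk[0][0], chunk[-1][1]
        windows.append((start, text[start:end]))
        if i + max_words >= len(spans):  # this window reached the end of the text
            break
    return windows


def predict_long(text, labels, predict_fn, **window_kwargs):
    """Run predict_fn on each window and shift spans back to document offsets."""
    seen, merged = set(), []
    for offset, chunk in window_text(text, **window_kwargs):
        for ent in predict_fn(chunk, labels):
            key = (ent["start"] + offset, ent["end"] + offset, ent["label"])
            if key in seen:  # same span found again inside an overlap region
                continue
            seen.add(key)
            merged.append({**ent, "start": key[0], "end": key[1]})
    return merged
```

A call might look like `predict_long(article, labels, model.predict_entities)`. The overlap makes it less likely that an entity is cut in half at a window boundary, at the cost of some repeated computation.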

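📦 Installation

The source model card does not list installation steps. The code on this page imports the gliner Python package, which is typically installed from PyPI with pip install gliner.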

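📚 Detailed documentation

Model description

The synthetic data underlying this news fine-tune was pulled from the AskNews API. We enforced diversity across country, language, topic, and time.

[Figure: country distribution]

[Figure: entity types]

[Figure: topic distribution]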

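Property	Details
Developed by	Emergent Methods
Funded by	Emergent Methods
Shared by	Emergent Methods
Model type	microsoft/deberta
Language(s) (NLP)	English (en) (English text, plus text translated from Spanish (es), Portuguese (pt), German (de), Russian (ru), French (fr), Arabic (ar), Italian (it), Ukrainian (uk), Norwegian (no), Swedish (sv), and Danish (da))
License	Apache 2.0
Fine-tuned from	GLiNER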

Model sources

Repository: to be added
Paper: to be added
Demo: to be added

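Uses

Direct use

As the name suggests, this model is intended for general-purpose entity extraction. Although it was fine-tuned on news data, its accuracy improved by up to 7.5% across 18 benchmark datasets. This suggests that the broad and diverse underlying dataset helps it identify and extract a wider range of entity types.

The model is compact and can be used in high-throughput production settings. That is one reason we chose to release it under the Apache 2.0 license. AskNews currently uses this fine-tuned model for entity extraction in its own system.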
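
In a high-throughput setting, documents are usually processed in batches rather than one at a time. The sketch below shows only the batching logic. `batched` and `extract_in_batches` are hypothetical helpers, and `batch_predict_fn` is assumed to be a callable that takes a list of texts plus the labels and returns one entity list per text. Recent versions of the gliner package appear to offer a `batch_predict_entities` method with that shape, but this should be checked against the installed version.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(items)
    while batch := list(islice(it, batch_size)):
        yield batch


def extract_in_batches(texts, labels, batch_predict_fn, batch_size=8):
    """Call batch_predict_fn(batch, labels) per batch; results keep input order."""
    results = []
    for batch in batched(texts, batch_size):
        # Assumed contract: one list of entity dicts per input text
        results.extend(batch_predict_fn(batch, labels))
    return results
```

A suitable batch size depends on text length and available memory, so the default of 8 here is a placeholder and not a recommendation.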

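Bias, risks, and limitations

Although the dataset aims to reduce bias and improve diversity, it is still skewed toward Western languages and countries. This limitation stems from the abilities of Llama2 in translation and summary generation. In addition, because Llama2 was used to summarize open-web articles, any bias in Llama2's training data is also present in this dataset. Likewise, because Llama3 was used to extract entities from the summaries, any bias in Llama3 is present in the current dataset as well.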

Country distribution