gliner_small_news-v2.1 open-source model - optimized for news entity recognition, with zero-shot accuracy gains of up to 7.5%

Gliner Small News V2.1

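Developed by EmergentMethods

A fine-tuned version of GLiNER optimized for entity recognition in news text, with zero-shot accuracy gains of up to 7.5% across 18 benchmark datasets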

Sequence labeling

PyTorch

English

License: Apache-2.0

#NewsEntityRecognition #MultilingualNewsAnalysis #ZeroShotLearning

Downloads: 34

Released: 4/25/2024

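Model overview

The model excels at entity extraction from long-form news text. Its underlying dataset builds a global perspective by enforcing diversity across country, language, topic, and time, and all fine-tuning data is synthetically generated.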

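Model features

Cross-topic recognition

Specifically optimized for entity extraction from long-form news text

Global-perspective data

The training data enforces diversity across country, language, topic, and time

Synthetic data generation

WizardLM handled translation and summarization of news articles, and Llama3 handled entity annotation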

Model capabilities

Entity recognition in news text

Multilingual text processing (via translation)

Zero-shot transfer learning

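Use cases

News analysis

News event entity extraction

Extracts key information such as people, places, and dates from news reports

In the Ciudad Juárez arrest example, the model correctly identifies people, locations, organizations, and other entities

Content understanding

Cross-lingual news analysis

Runs entity recognition on news text after it has been translated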

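🚀 gliner_small_news-v2.1 model card

This model is a fine-tune of GLiNER aimed at improving accuracy across a broad range of topics, with particular strength in long-context news entity extraction. As the results table shows, these fine-tunes improved zero-shot accuracy over the base GLiNER model by up to 7.5% across 18 benchmark datasets.

[Figure: results table]

The underlying dataset, AskNews-NER-v0, was designed to diversify global perspectives by enforcing diversity across country, language, topic, and time. All data used to fine-tune this model was synthetically generated. WizardLM 13B v1.2 translated and summarized the open-web news articles, and Llama3 70b instruct performed the entity extraction. For details on data diversification and the fine-tuning method, see our paper on ArXiv.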
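To make the idea of enforced diversity concrete, here is a minimal sketch of one way to keep any single country, language, or topic from dominating a sample: bucket the articles by those attributes and draw from the buckets in round-robin order. This is an illustration only and not the procedure used to build AskNews-NER-v0 (that is described in the paper). The function name `diversified_sample` and the article fields are hypothetical.

```python
from collections import defaultdict
from itertools import zip_longest

def diversified_sample(articles, keys=("country", "language", "topic"), limit=None):
    """Interleave articles so each (country, language, topic) bucket is drawn in turn."""
    buckets = defaultdict(list)
    for article in articles:
        buckets[tuple(article[k] for k in keys)].append(article)
    # Round-robin: first item of every bucket, then second item of every bucket, ...
    interleaved = [
        article
        for group in zip_longest(*buckets.values())
        for article in group
        if article is not None
    ]
    return interleaved[:limit]

# Hypothetical metadata: four US articles, one from Brazil, one from Germany.
articles = (
    [{"country": "US", "language": "en", "topic": "politics", "id": i} for i in range(4)]
    + [{"country": "BR", "language": "pt", "topic": "sports", "id": 4}]
    + [{"country": "DE", "language": "de", "topic": "economy", "id": 5}]
)

sample = diversified_sample(articles, limit=3)
for article in sample:
    print(article["country"], article["language"], article["topic"])
```

With `limit=3`, the sample contains one article from each bucket rather than three US articles. A time dimension (for example a publication month) could be added to `keys` in the same way.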

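🚀 Quick start

Use the code below to get started with the model:

# Requires the gliner package (e.g. pip install gliner)
from gliner import GLiNER

# Load the fine-tuned checkpoint from the Hugging Face Hub
model = GLiNER.from_pretrained("EmergentMethods/gliner_small_news-v2.1")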

text = """
The Chihuahua State Public Security Secretariat (SSPE) arrested 35-year-old Salomón C. T. in Ciudad Juárez, found in possession of a stolen vehicle, a white GMC Yukon, which was reported stolen in the city's streets. The arrest was made by intelligence and police analysis personnel during an investigation in the border city. The arrest is related to a previous detention on February 6, which involved armed men in a private vehicle. The detainee and the vehicle were turned over to the Chihuahua State Attorney General's Office for further investigation into the case. 
"""

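# Entity types to extract; the label names are supplied at inference time
labels = ["person", "location", "date", "event", "facility", "vehicle", "number", "organization"]

# Returns a list of dicts, one per detected entity span
entities = model.predict_entities(text, labels)

for entity in entities:
    print(entity["text"], "=>", entity["label"])

Output: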

Chihuahua State Public Security Secretariat => organization
SSPE => organization
35-year-old => number
Salomón C. T. => person
Ciudad Juárez => location
GMC Yukon => vehicle
February 6 => date
Chihuahua State Attorney General's Office => organization
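
Downstream code often needs the entities grouped by type rather than as a flat list. The sketch below groups a list of entity dicts by label and drops low-confidence spans. It runs on hand-written sample data shaped like the output above, so it does not need the model. The `score` values are made up for illustration, and the helper name `group_by_label` is hypothetical. It assumes each dict may carry a `score` key and treats a missing score as fully confident.

```python
from collections import defaultdict

# Sample entities shaped like the dicts returned by predict_entities.
# The scores here are invented for illustration; real scores come from the model.
sample_entities = [
    {"text": "Chihuahua State Public Security Secretariat", "label": "organization", "score": 0.91},
    {"text": "SSPE", "label": "organization", "score": 0.78},
    {"text": "Salomón C. T.", "label": "person", "score": 0.88},
    {"text": "Ciudad Juárez", "label": "location", "score": 0.93},
    {"text": "GMC Yukon", "label": "vehicle", "score": 0.42},
]

def group_by_label(entities, min_score=0.5):
    """Group entity texts by label, skipping low-score spans and repeated texts."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity.get("score", 1.0) < min_score:
            continue  # below the confidence cutoff
        texts = grouped[entity["label"]]
        if entity["text"] not in texts:
            texts.append(entity["text"])
    return dict(grouped)

grouped = group_by_label(sample_entities)
for label, texts in grouped.items():
    print(label, "=>", texts)
```

The `min_score` cutoff of 0.5 is an arbitrary choice for this sketch; a suitable value depends on how the application weighs missed entities against false positives.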

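✨ Key features

Fine-tuned from GLiNER for higher accuracy across a broad range of topics, especially long-context news entity extraction.
Trained on synthetic data drawn from broad and diverse sources.
Compact model size, suited to high-throughput production use cases.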
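
When an article is longer than the model can process in a single pass, one common workaround is to split it into overlapping windows, run extraction on each window, and shift the returned character offsets back to document coordinates. The sketch below shows that pattern. It is not part of the gliner package: `split_into_chunks` and `predict_long` are hypothetical helpers, the window sizes are arbitrary, and `fake_predict` is a stand-in for `model.predict_entities` so the example runs without downloading the model. It assumes each entity dict carries `start` and `end` character offsets.

```python
import re

def split_into_chunks(text, max_words=200, overlap=20):
    """Return (chunk_text, char_offset) pairs from overlapping word windows."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    # Character span of every whitespace-delimited word.
    spans = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    chunks = []
    step = max_words - overlap
    for i in range(0, len(spans), step):
        window = spans[i:i + max_words]
        start, end = window[0][0], window[-1][1]
        chunks.append((text[start:end], start))
        if i + max_words >= len(spans):
            break  # this window already reaches the end of the text
    return chunks

def predict_long(text, labels, predict, max_words=200, overlap=20):
    """Call predict(chunk, labels) per chunk and map spans back to document offsets."""
    seen = set()
    merged = []
    for chunk, offset in split_into_chunks(text, max_words, overlap):
        for entity in predict(chunk, labels):
            start, end = entity["start"] + offset, entity["end"] + offset
            key = (start, end, entity["label"])
            if key in seen:
                continue  # same span already found in the overlapping window
            seen.add(key)
            merged.append({**entity, "start": start, "end": end})
    return sorted(merged, key=lambda e: e["start"])

def fake_predict(chunk, labels):
    # Stand-in for model.predict_entities: tags every "Juárez" as a location.
    return [
        {"start": m.start(), "end": m.end(), "text": m.group(), "label": "location"}
        for m in re.finditer("Juárez", chunk)
    ]

# The target word sits inside the overlap between the first two windows.
article = "filler " * 21 + "Ciudad Juárez " + "filler " * 30
entities = predict_long(article, ["location"], fake_predict, max_words=25, overlap=5)
print(entities)
```

With the real model, `model.predict_entities` could be passed as `predict`. The overlap reduces the chance that an entity is cut in half at a window boundary, and the `seen` set removes the duplicates that the overlap produces.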

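📚 Detailed documentation

Model details

Model description

The synthetic data behind this news fine-tune was pulled from the AskNews API. We diversified it across country, language, topic, and time.

[Figure: distribution by country]

[Figure: entity types]

[Figure: topics]

Developed by: Emergent Methods
Funded by: Emergent Methods
Shared by: Emergent Methods
Model type: microsoft/deberta
Language(s) (NLP): English (en) (English texts and translations from Spanish (es), Portuguese (pt), German (de), Russian (ru), French (fr), Arabic (ar), Italian (it), Ukrainian (uk), Norwegian (no), Swedish (sv), and Danish (da))
License: Apache 2.0
Fine-tuned from model: GLiNER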

Model sources (optional)

Repository: To be added
Paper: To be added
Demo: To be added

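Uses

Direct use

As the name suggests, the model is built for generalist entity extraction. Although we fine-tuned it on news data, the fine-tune improved accuracy by up to 7.5% across 18 benchmark datasets. This suggests that the broad and diversified underlying dataset helped the model recognize and extract more entity types.

The model is compact and can be used in high-throughput production use cases. This is one reason we licensed it under Apache 2.0. AskNews currently uses this fine-tune for entity extraction in its own system.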
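
For high-throughput use, texts are usually processed in fixed-size batches rather than one at a time. The sketch below shows a small batching wrapper. `extract_in_batches` and `fake_predict_batch` are hypothetical names: the wrapper accepts any callable that takes a list of texts plus labels and returns one entity list per text. If the installed gliner version exposes a batched prediction method such as `model.batch_predict_entities`, it could be passed in, but check that against the library before relying on it.

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(iterable)
    while batch := list(islice(iterator, batch_size)):
        yield batch

def extract_in_batches(texts, labels, predict_batch, batch_size=8):
    """Run predict_batch(batch, labels) per batch and return one result per input text."""
    results = []
    for batch in batched(texts, batch_size):
        results.extend(predict_batch(batch, labels))
    return results

calls = []

def fake_predict_batch(batch, labels):
    # Stand-in for a real batched model call: records the batch size, finds no entities.
    calls.append(len(batch))
    return [[] for _ in batch]

texts = [f"article {i}" for i in range(10)]
results = extract_in_batches(texts, ["person"], fake_predict_batch, batch_size=4)
print(calls)
```

The batch size of 8 is a placeholder. In practice it would be tuned to the available memory and the typical article length.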

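Bias, risks, and limitations

Although the dataset aims to reduce bias and improve diversity, it is still biased toward Western languages and countries. This limitation originates from the abilities of Llama2 (the base model of WizardLM 13B v1.2) in translation and summary generation. In addition, because Llama2 was used to summarize the open-web articles, any bias in the Llama2 training data is also present in this dataset. Likewise, because Llama3 was used to extract entities from the summaries, any bias in Llama3 is also present in the dataset.

[Figure: distribution by country]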