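🚀 gliner_small_news-v2.1 Model Card
This model is a fine-tune of GLiNER aimed at improving accuracy across a broad range of topics, with particular strength in long-context news entity extraction. As shown in the table below, these fine-tuned models improve on the zero-shot accuracy of the base GLiNER model by up to 7.5% across 18 benchmark datasets.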

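The underlying dataset, AskNews-NER-v0, was designed to diversify global perspectives by enforcing diversity across country, language, topic, and time. All data used to fine-tune this model was synthetically generated: WizardLM 13B v1.2 translated and summarized open-web news articles, and Llama3 70b instruct extracted the entities. Details on data diversification and the fine-tuning method are in our paper on ArXiv.
🚀 Quick Start
Use the code below to get started with the model: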
from gliner import GLiNER
model = GLiNER.from_pretrained("EmergentMethods/gliner_small_news-v2.1")
text = """
The Chihuahua State Public Security Secretariat (SSPE) arrested 35-year-old Salomón C. T. in Ciudad Juárez, found in possession of a stolen vehicle, a white GMC Yukon, which was reported stolen in the city's streets. The arrest was made by intelligence and police analysis personnel during an investigation in the border city. The arrest is related to a previous detention on February 6, which involved armed men in a private vehicle. The detainee and the vehicle were turned over to the Chihuahua State Attorney General's Office for further investigation into the case.
"""
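# Entity types to extract; GLiNER accepts arbitrary label names at inference time
labels = ["person", "location", "date", "event", "facility", "vehicle", "number", "organization"]
# Returns a list of dicts, one per detected entity span
entities = model.predict_entities(text, labels)
for entity in entities:
    print(entity["text"], "=>", entity["label"])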
Output:
Chihuahua State Public Security Secretariat => organization
SSPE => organization
35-year-old => number
Salomón C. T. => person
Ciudad Juárez => location
GMC Yukon => vehicle
February 6 => date
Chihuahua State Attorney General's Office => organization
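Downstream code often needs the entities grouped by type rather than as a flat list. The sketch below shows one way to do that, assuming each item carries the `text` and `label` keys read by the print loop above. `sample_entities` is hand-copied from the printed output rather than produced by a model call, and `group_by_label` is a hypothetical helper, not part of the gliner package.

```python
from collections import defaultdict

# Hand-copied from the printed output above, in the dict shape the print loop reads.
sample_entities = [
    {"text": "Chihuahua State Public Security Secretariat", "label": "organization"},
    {"text": "SSPE", "label": "organization"},
    {"text": "35-year-old", "label": "number"},
    {"text": "Salomón C. T.", "label": "person"},
    {"text": "Ciudad Juárez", "label": "location"},
    {"text": "GMC Yukon", "label": "vehicle"},
    {"text": "February 6", "label": "date"},
    {"text": "Chihuahua State Attorney General's Office", "label": "organization"},
]


def group_by_label(entities):
    """Collect unique entity texts under their label, keeping first-seen order."""
    grouped = defaultdict(list)
    for entity in entities:
        texts = grouped[entity["label"]]
        if entity["text"] not in texts:  # skip repeated mentions of the same text
            texts.append(entity["text"])
    return dict(grouped)


grouped = group_by_label(sample_entities)
for label, texts in grouped.items():
    print(f"{label}: {texts}")
```

Passing the list returned by `model.predict_entities` instead of `sample_entities` should work the same way, as long as the items keep those two keys.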
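✨ Key Features
- Fine-tuned from GLiNER for better accuracy across a broad range of topics, especially long-context news entity extraction.
- Trained on synthetic data drawn from broad, diverse sources.
- Small model size, suited to high-throughput production use.
💻 Usage Examples
Basic usage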
from gliner import GLiNER
model = GLiNER.from_pretrained("EmergentMethods/gliner_small_news-v2.1")
text = """
The Chihuahua State Public Security Secretariat (SSPE) arrested 35-year-old Salomón C. T. in Ciudad Juárez, found in possession of a stolen vehicle, a white GMC Yukon, which was reported stolen in the city's streets. The arrest was made by intelligence and police analysis personnel during an investigation in the border city. The arrest is related to a previous detention on February 6, which involved armed men in a private vehicle. The detainee and the vehicle were turned over to the Chihuahua State Attorney General's Office for further investigation into the case.
"""
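# Entity types to extract; GLiNER accepts arbitrary label names at inference time
labels = ["person", "location", "date", "event", "facility", "vehicle", "number", "organization"]
# Returns a list of dicts, one per detected entity span
entities = model.predict_entities(text, labels)
for entity in entities:
    print(entity["text"], "=>", entity["label"])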
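A common follow-up is to drop low-confidence predictions. In the gliner releases we are aware of, each returned entity also carries a `score` key and `predict_entities` accepts a `threshold` argument; treat both as assumptions and check your installed version. The sketch below filters after the fact on made-up data: the scores and the "border city" entry are hypothetical, and `keep_confident` is an illustrative helper rather than a library function.

```python
# Hypothetical results: the scores below are invented for illustration only.
scored_entities = [
    {"text": "Ciudad Juárez", "label": "location", "score": 0.93},
    {"text": "GMC Yukon", "label": "vehicle", "score": 0.71},
    {"text": "border city", "label": "location", "score": 0.42},
]


def keep_confident(entities, min_score):
    """Return only the entities whose score is at or above min_score."""
    # Entities without a score are treated as score 0.0 and therefore dropped.
    return [e for e in entities if e.get("score", 0.0) >= min_score]


confident = keep_confident(scored_entities, min_score=0.6)
print([e["text"] for e in confident])
```

Filtering afterwards keeps the raw predictions available, which can help when tuning the cutoff per label.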
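📚 Documentation
Model Details
Model Description
The synthetic data behind this news fine-tune was derived from the AskNews API. We enforced diversity across country, language, topic, and time.
Country distribution:

Entity types:

Topics: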

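Uses
Direct Use
As the name suggests, this model is intended for general-purpose entity extraction. Although it was fine-tuned on news data, it improves accuracy by up to 7.5% across 18 benchmark datasets, which suggests that a broad and diverse underlying dataset helps the model identify and extract more entity types.
The model is small enough for high-throughput production use, which is one reason we licensed it under Apache 2.0. AskNews currently uses this fine-tune for entity extraction in its own system.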
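For high-throughput use, articles are usually sent to the model in batches rather than one at a time. The sketch below shows one possible batching loop. `extract_in_batches` and the stand-in `fake_predict_batch` are hypothetical names introduced here so the example runs without downloading a model. With a real model, `predict_batch` could wrap a batched inference call such as gliner's `batch_predict_entities`, if your installed version provides it; that is an assumption to verify, not something this card states.

```python
from typing import Callable, Iterable, Iterator

Entity = dict
BatchPredictor = Callable[[list[str], list[str]], list[list[Entity]]]


def extract_in_batches(
    texts: Iterable[str],
    labels: list[str],
    predict_batch: BatchPredictor,
    batch_size: int = 8,
) -> Iterator[tuple[str, list[Entity]]]:
    """Yield (text, entities) pairs, calling predict_batch on fixed-size batches."""
    batch: list[str] = []
    for text in texts:
        batch.append(text)
        if len(batch) == batch_size:
            yield from zip(batch, predict_batch(batch, labels))
            batch = []
    if batch:  # flush the final, possibly smaller batch
        yield from zip(batch, predict_batch(batch, labels))


seen_batch_sizes: list[int] = []


def fake_predict_batch(texts: list[str], labels: list[str]) -> list[list[Entity]]:
    """Stand-in predictor: tags the first word of each text with the first label."""
    seen_batch_sizes.append(len(texts))
    return [[{"text": t.split()[0], "label": labels[0]}] for t in texts]


articles = [
    "Reuters reported on the arrest.",
    "Interpol issued a notice.",
    "Europol joined the inquiry.",
]
results = list(extract_in_batches(articles, ["organization"], fake_predict_batch, batch_size=2))
for article, entities in results:
    print(article, "->", entities)
```

Because the function is a generator, results can be consumed as they arrive, and the batch size can be tuned to the available memory.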
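Bias, Risks, and Limitations
Although the dataset aims to reduce bias and improve diversity, it remains skewed toward Western languages and countries. This limitation stems from the translation and summarization abilities of Llama2 (WizardLM 13B v1.2, the model named above for those steps, is a Llama2 fine-tune). Because Llama2 was used to summarize the open-web articles, any bias in Llama2's training data is also present in this dataset. Likewise, because Llama3 was used to extract entities from the summaries, any bias in Llama3 carries over as well.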

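Training Details
The training dataset is AskNews-NER-v0.
Other training details can be found in the companion paper.
Environmental Impact
- Hardware type: 1xA4500
- Hours used: 10
- Carbon emitted: 0.6 kg (according to the Machine Learning Impact calculator)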
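As a rough sanity check on the figures above, the sketch below back-computes the grid carbon intensity they imply. It assumes the estimate is simply power times time times carbon intensity, and that the GPU draws about 200 W (the rated board power of the RTX A4500, used here as an assumption; the actual draw during training is not stated in this card).

```python
# Figures reported in this model card
TRAINING_HOURS = 10.0
REPORTED_KG_CO2EQ = 0.6

# Assumption: one GPU running at its rated 200 W for the whole run
GPU_POWER_WATTS = 200.0

# Energy used, converted from watt-hours to kilowatt-hours
energy_kwh = GPU_POWER_WATTS * TRAINING_HOURS / 1000.0

# Carbon intensity implied by the reported emissions, in kg CO2eq per kWh
implied_intensity = REPORTED_KG_CO2EQ / energy_kwh

print(f"Energy: {energy_kwh:.1f} kWh")
print(f"Implied carbon intensity: {implied_intensity:.2f} kg CO2eq/kWh")
```

A different power assumption would change the implied intensity proportionally, so the result is only an order-of-magnitude check.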
Citation
BibTeX: to be added
APA: to be added
Model Authors
Elin Törnquist, Emergent Methods, email: elin at emergentmethods.ai
Robert Caulk, Emergent Methods, email: rob at emergentmethods.ai
Model Contact
Elin Törnquist, Emergent Methods, email: elin at emergentmethods.ai
Robert Caulk, Emergent Methods, email: rob at emergentmethods.ai
📄 License
This model is licensed under Apache 2.0.