gliner_large_news-v2.1: an open-source news entity recognition model for efficient entity extraction from long news text

Gliner Large News V2.1

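Developed by Emergent Methods

A news-domain entity recognition model fine-tuned from GLiNER. It excels at extracting entities from long news text and improves zero-shot accuracy by up to 7.5% across 18 benchmark datasets.

Sequence labeling

PyTorch

Language: English
License: Apache-2.0
Tags: #news-entity-extraction #multilingual-support #zero-shot-learning

Downloads: 2,558

Released: 4/18/2024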

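Model overview

This entity recognition model is optimized for the news domain. It is built on the microsoft/deberta architecture and fine-tuned on synthetic data to improve accuracy across a broad range of topics. It also handles text translated into English from several other languages.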

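Model features

Cross-domain performance gains

Zero-shot accuracy improves by up to 7.5% over the base model across 18 benchmark datasets

News-domain optimization

Optimized specifically for entity extraction from long news text

Global-perspective data

Training data is designed to enforce diversity across country, language, topic, and time

Efficient inference

Compact model size, suited to high-throughput production environments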

Model capabilities

News entity recognition

Multilingual text processing

Zero-shot learning

Long-text analysis

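Use cases

News analysis

News event entity extraction

Extracts key entities such as people, locations, and organizations from news reports

The example shows over 90% accuracy in identifying key entities

Cross-lingual news processing

Processes multilingual news content after translation

Supports text translated from 11 languages

Content analysis

Event association analysis

Links news events to one another through the entities recognized in them

Already in production use in the AskNews entity extraction system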
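
The event association use case above can be sketched as a set intersection over extracted entities. This is only an illustration, not a description of how AskNews links events: `link_articles`, `min_shared`, and the article IDs are hypothetical names, and the input is assumed to be a list of `{"text", "label"}` dictionaries per article, in the shape the model's entity output uses.

```python
from itertools import combinations


def link_articles(article_entities, min_shared=2):
    """Return {(id_a, id_b): shared_entities} for article pairs that
    share at least `min_shared` (text, label) entities."""
    # Normalise each article to a set of (lowercased text, label) pairs
    # so that "SSPE" and "sspe" count as the same entity.
    keyed = {
        art_id: {(e["text"].lower(), e["label"]) for e in ents}
        for art_id, ents in article_entities.items()
    }
    links = {}
    for a, b in combinations(sorted(keyed), 2):
        shared = keyed[a] & keyed[b]
        if len(shared) >= min_shared:
            links[(a, b)] = shared
    return links


# Hypothetical articles; entity strings are borrowed from the sample text.
articles = {
    "article-1": [
        {"text": "SSPE", "label": "organization"},
        {"text": "Ciudad Juárez", "label": "location"},
        {"text": "GMC Yukon", "label": "vehicle"},
    ],
    "article-2": [
        {"text": "Ciudad Juárez", "label": "location"},
        {"text": "sspe", "label": "organization"},
    ],
    "article-3": [{"text": "Ciudad Juárez", "label": "location"}],
}

links = link_articles(articles)
print(sorted(links))  # [('article-1', 'article-2')]
```

Requiring two or more shared entities is an arbitrary choice here; a single common location such as a large city would otherwise link many unrelated stories.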

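🚀 gliner_large_news-v2.1 model card

This model is a fine-tune of GLiNER aimed at improving accuracy across a broad range of topics, with particular strength in long-context news entity extraction. As the table below shows, these fine-tuned models improve zero-shot accuracy over the base GLiNER model by up to 7.5% across 18 benchmark datasets.

[Figure: results table]

The underlying dataset, AskNews-NER-v0, was engineered to enrich the global perspective by enforcing diversity across country, language, topic, and time. All data used to fine-tune this model was synthetically generated. WizardLM 13B v1.2 was used to translate and summarize open-web news articles, while Llama3 70b instruct was used for entity extraction. Details of the diversity and fine-tuning methods are available in our paper on ArXiv.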

🚀 Quick start

Use the code below to get started with the model:

from gliner import GLiNER

model = GLiNER.from_pretrained("EmergentMethods/gliner_large_news-v2.1")

text = """
The Chihuahua State Public Security Secretariat (SSPE) arrested 35-year-old Salomón C. T. in Ciudad Juárez, found in possession of a stolen vehicle, a white GMC Yukon, which was reported stolen in the city's streets. The arrest was made by intelligence and police analysis personnel during an investigation in the border city. The arrest is related to a previous detention on February 6, which involved armed men in a private vehicle. The detainee and the vehicle were turned over to the Chihuahua State Attorney General's Office for further investigation into the case. 
"""

labels = ["person", "location", "date", "event", "facility", "vehicle", "number", "organization"]

entities = model.predict_entities(text, labels)

for entity in entities:
    print(entity["text"], "=>", entity["label"])

Output:

Chihuahua State Public Security Secretariat => organization
SSPE => organization
35-year-old => number
Salomón C. T. => person
Ciudad Juárez => location
GMC Yukon => vehicle
February 6 => date
Chihuahua State Attorney General's Office => organization
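
The list returned by `predict_entities` can be post-processed before it is stored. The sketch below assumes each entry is a dictionary with `"text"`, `"label"`, and a confidence `"score"` key, which is what recent releases of the `gliner` package appear to return. `group_entities` and `min_score` are hypothetical names introduced for this illustration, and the scores in the sample are made up.

```python
from collections import defaultdict


def group_entities(entities, min_score=0.5):
    """Group entity texts by label, dropping low-score and repeated mentions."""
    grouped = defaultdict(list)
    for ent in entities:
        # Assumption: "score" is present; treat a missing score as 1.0.
        if ent.get("score", 1.0) < min_score:
            continue
        if ent["text"] not in grouped[ent["label"]]:
            grouped[ent["label"]].append(ent["text"])
    return dict(grouped)


# Hand-written sample in the assumed output shape (scores are invented).
sample = [
    {"text": "SSPE", "label": "organization", "score": 0.93},
    {"text": "Ciudad Juárez", "label": "location", "score": 0.88},
    {"text": "SSPE", "label": "organization", "score": 0.90},
    {"text": "border city", "label": "location", "score": 0.31},
]

print(group_entities(sample))
# {'organization': ['SSPE'], 'location': ['Ciudad Juárez']}
```

With a loaded model, the same helper could be applied as `group_entities(model.predict_entities(text, labels))`.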

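✨ Key features

This model is a fine-tune of GLiNER that improves accuracy across a broad range of topics and is especially suited to long-context news entity extraction.
The underlying dataset was carefully engineered to enforce diversity across country, language, topic, and time.
The model was fine-tuned on synthetic data and improves zero-shot accuracy by up to 7.5% across 18 benchmark datasets.
The model is compact and suited to high-throughput production use.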
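
For articles that exceed whatever input length a given deployment can handle, one common workaround is to split the text into overlapping word windows and map entity offsets back to the full article. The following is a minimal sketch of that generic technique, not part of the model card: `chunk_words`, `predict_long`, the window sizes, and the `predict` callable are all assumptions, and entities are assumed to carry `"start"` and `"end"` character offsets relative to the chunk.

```python
import re


def chunk_words(text, max_words=200, overlap=20):
    """Yield (char_offset, chunk_text) windows of at most max_words words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")
    spans = [m.span() for m in re.finditer(r"\S+", text)]
    step = max_words - overlap
    for i in range(0, len(spans), step):
        window = spans[i:i + max_words]
        start, end = window[0][0], window[-1][1]
        yield start, text[start:end]
        if i + max_words >= len(spans):
            break  # the last window already reaches the end of the text


def predict_long(text, labels, predict, max_words=200, overlap=20):
    """Run predict(chunk, labels) per window; return entities with
    offsets relative to the full text, de-duplicated across overlaps."""
    seen, merged = set(), []
    for offset, chunk in chunk_words(text, max_words, overlap):
        for ent in predict(chunk, labels):
            key = (ent["start"] + offset, ent["end"] + offset, ent["label"])
            if key not in seen:
                seen.add(key)
                merged.append({**ent, "start": key[0], "end": key[1]})
    return merged
```

With a loaded model, `predict` could be `model.predict_entities`. Entities that straddle a window boundary are only recovered if they fit entirely inside the overlap, so `overlap` should be at least as long as the longest expected entity.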

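📚 Detailed documentation

Model description

The synthetic data underlying this news fine-tune was pulled from the AskNews API. We enforced diversity across country, language, topic, and time.

[Figures: country distribution, entity types, topic distribution]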

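Property	Details
Developed by	Emergent Methods
Funded by	Emergent Methods
Shared by	Emergent Methods
Model type	microsoft/deberta
Language(s) (NLP)	English (en): English text, plus text translated from Spanish (es), Portuguese (pt), German (de), Russian (ru), French (fr), Arabic (ar), Italian (it), Ukrainian (uk), Norwegian (no), Swedish (sv), and Danish (da)
License	Apache 2.0
Fine-tuned from	GLiNER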

Model sources

Repository: to be added
Paper: to be added
Demo: to be added

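Uses

Direct use

As the name suggests, this model is aimed at generalist entity extraction. Although we fine-tuned it on news data, it improved accuracy by up to 7.5% across 18 benchmark datasets. This suggests that a broad and diverse underlying dataset helps it recognize and extract more types of entities.

The model is compact and can be used in high-throughput production settings. That is one of the reasons we chose to release it under the Apache 2.0 license. AskNews currently uses this fine-tuned model for entity extraction in its own system.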
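
For the high-throughput setting mentioned above, texts are usually processed in fixed-size batches rather than one at a time. The sketch below shows only the batching logic: `batched`, `extract_stream`, `batch_size`, and the `predict_batch` callable are hypothetical names, and the stub stands in for a real model call.

```python
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items from an iterable."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def extract_stream(texts, labels, predict_batch, batch_size=8):
    """Yield (text, entities) pairs, calling predict_batch once per batch.

    predict_batch(batch, labels) is assumed to return one entity list
    per input text, in the same order.
    """
    for batch in batched(texts, batch_size):
        for text, ents in zip(batch, predict_batch(batch, labels)):
            yield text, ents


# Stub predictor that records batch sizes instead of running a model.
sizes = []


def stub_predict_batch(batch, labels):
    sizes.append(len(batch))
    return [[] for _ in batch]


results = list(
    extract_stream(["t1", "t2", "t3", "t4", "t5"], ["person"],
                   stub_predict_batch, batch_size=2)
)
print(sizes)  # [2, 2, 1]
```

Recent versions of the `gliner` package appear to expose a `batch_predict_entities(texts, labels)` method that could be passed as `predict_batch`; treat that as an assumption and check the installed version before relying on it.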

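Bias, risks, and limitations

Although the dataset aims to reduce bias and improve diversity, it is still skewed toward Western languages and countries. This limitation stems from the abilities of Llama2 in translation and summary generation. In addition, because Llama2 was used to summarize open-web articles, any bias in Llama2's training data is also present in this dataset. Likewise, because Llama3 was used to extract entities from the summaries, any bias in Llama3 is also present in the current dataset.

[Figure: country distribution]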