gliner_small_news-v2.1 open-source model: optimized for news entity recognition, with zero-shot accuracy gains of up to 7.5%

Gliner Small News V2.1

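Developed by Emergent Methods

A fine-tuned version of GLiNER optimized for entity recognition in news text. Zero-shot accuracy improves by up to 7.5% across 18 benchmark datasets.

Sequence labeling

PyTorch

Language: English | License: Apache-2.0 | Tags: #news-entity-recognition #multilingual-news-analysis #zero-shot-learning

Downloads: 34

Released: April 25, 2024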

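Model overview

The model is strongest at entity extraction from long news texts. Its underlying dataset was built to give a global perspective by enforcing diversity across country, language, topic, and time. All fine-tuning data is synthetic.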

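Model features

Cross-domain topic coverage: specifically optimized for entity extraction from long-form news text.

Global-perspective data: the training data enforces diversity across country, language, topic, and time.

Synthetic data generation: WizardLM 13B v1.2 translated and summarized the news articles, and Llama3 70b instruct annotated the entities.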

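Model capabilities

Entity recognition in news text

Multilingual text processing (via translation into English)

Zero-shot transfer learning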

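Use cases

News analysis

News event entity extraction: extract key information such as people, places, and dates from news reports. In the Ciudad Juárez arrest example, the model correctly identifies persons, locations, organizations, and other entities.

Content understanding

Cross-lingual news analysis: run entity recognition on news text after it has been translated into English.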
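The translate-then-extract flow described above can be sketched as a small wrapper. This is a minimal illustration, not the authors' pipeline: `translate` and `predict` are hypothetical callables supplied by the caller (for example, `predict` could be `model.predict_entities` from the quick start), and the `source_lang` field is an assumption added here for bookkeeping.

```python
from typing import Callable, Dict, List


def extract_from_translated(
    text: str,
    source_lang: str,
    translate: Callable[[str, str], str],
    predict: Callable[[str, List[str]], List[Dict]],
    labels: List[str],
) -> List[Dict]:
    """Translate non-English text to English, then run entity extraction.

    `translate(text, source_lang)` and `predict(text, labels)` are
    hypothetical callables; neither is part of the model itself.
    """
    # English input is passed through untouched; everything else is translated.
    english = text if source_lang == "en" else translate(text, source_lang)
    entities = predict(english, labels)
    # Record the original language so downstream code can trace each entity.
    return [{**entity, "source_lang": source_lang} for entity in entities]
```

Note that character offsets returned by the predictor refer to the translated English text, not to the original-language article.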

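🚀 gliner_small_news-v2.1 Model Card

This model is a fine-tune of GLiNER aimed at improving accuracy across a broad range of topics, with particular strength in long-context news entity extraction. As the table below shows, these fine-tunes improve on the base GLiNER model's zero-shot accuracy by up to 7.5% across 18 benchmark datasets.

[Figure: results table]

The underlying dataset, AskNews-NER-v0, was engineered to diversify global perspectives by enforcing diversity across country, language, topic, and time. All data used to fine-tune this model was synthetically generated. WizardLM 13B v1.2 was used to translate and summarize open-web news articles, and Llama3 70b instruct was used for entity extraction. For details on the diversification and fine-tuning methods, see our paper on arXiv.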

🚀 Quick Start

Use the following code to get started with the model:

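# Requires the gliner package (pip install gliner).
from gliner import GLiNER

# Download the fine-tuned checkpoint from the Hugging Face Hub.
model = GLiNER.from_pretrained("EmergentMethods/gliner_small_news-v2.1")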

text = """
The Chihuahua State Public Security Secretariat (SSPE) arrested 35-year-old Salomón C. T. in Ciudad Juárez, found in possession of a stolen vehicle, a white GMC Yukon, which was reported stolen in the city's streets. The arrest was made by intelligence and police analysis personnel during an investigation in the border city. The arrest is related to a previous detention on February 6, which involved armed men in a private vehicle. The detainee and the vehicle were turned over to the Chihuahua State Attorney General's Office for further investigation into the case. 
"""

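# Entity types to extract; labels are plain strings chosen at inference time.
labels = ["person", "location", "date", "event", "facility", "vehicle", "number", "organization"]

# Returns a list of dicts, one per detected entity span.
entities = model.predict_entities(text, labels)

for entity in entities:
    print(entity["text"], "=>", entity["label"])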

The output is:

Chihuahua State Public Security Secretariat => organization
SSPE => organization
35-year-old => number
Salomón C. T. => person
Ciudad Juárez => location
GMC Yukon => vehicle
February 6 => date
Chihuahua State Attorney General's Office => organization
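For downstream use it is often handier to have the entities grouped by label than printed one per line. The sketch below assumes each entity is a dict with `text` and `label` keys, as in the loop above, plus an optional `score` key. The scores in `sample` are made up for illustration and are not real model output.

```python
from collections import defaultdict


def group_entities(entities, min_score=0.0):
    """Group entity dicts by label, skipping duplicates and low scores."""
    grouped = defaultdict(list)
    for entity in entities:
        # Entities without a "score" key are kept (treated as score 1.0).
        if entity.get("score", 1.0) < min_score:
            continue
        texts = grouped[entity["label"]]
        if entity["text"] not in texts:
            texts.append(entity["text"])
    return dict(grouped)


# Illustrative input only: the scores below are invented for this example.
sample = [
    {"text": "Chihuahua State Public Security Secretariat", "label": "organization", "score": 0.9},
    {"text": "SSPE", "label": "organization", "score": 0.8},
    {"text": "Salomón C. T.", "label": "person", "score": 0.9},
    {"text": "Ciudad Juárez", "label": "location", "score": 0.9},
    {"text": "GMC Yukon", "label": "vehicle", "score": 0.4},
    {"text": "SSPE", "label": "organization", "score": 0.7},
]

for label, texts in group_entities(sample, min_score=0.5).items():
    print(label, "=>", texts)
```

Raising `min_score` trades recall for precision; the right cutoff depends on the application and is worth tuning on your own data.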

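✨ Key Features

Fine-tuned from GLiNER for better accuracy across a broad range of topics, especially long-context news entity extraction.
Trained on synthetic data drawn from broad and diverse sources.
Small model size, suitable for high-throughput production use.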
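For articles longer than the model can take in one pass, one common workaround is to split the text into overlapping windows, run extraction on each window, and shift the character offsets back into the full document. The sketch below is one possible approach under stated assumptions, not something the model card prescribes: the window sizes are arbitrary, `predict` is a caller-supplied callable such as `model.predict_entities`, and each entity is assumed to carry `start` and `end` character offsets.

```python
import re


def chunk_text(text, max_words=200, overlap=20):
    """Yield (char_offset, chunk) windows of whitespace-separated words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = list(re.finditer(r"\S+", text))
    step = max_words - overlap
    for i in range(0, len(words), step):
        window = words[i:i + max_words]
        start, end = window[0].start(), window[-1].end()
        yield start, text[start:end]
        if i + max_words >= len(words):
            break  # the last window already reaches the end of the text


def predict_long(text, predict, labels, max_words=200, overlap=20):
    """Run `predict` per window and map entity spans back to `text`."""
    seen = {}
    for offset, chunk in chunk_text(text, max_words, overlap):
        for entity in predict(chunk, labels):
            shifted = {
                **entity,
                "start": entity["start"] + offset,
                "end": entity["end"] + offset,
            }
            # Entities found twice in an overlap region are kept once.
            key = (shifted["start"], shifted["end"], shifted["label"])
            seen.setdefault(key, shifted)
    return sorted(seen.values(), key=lambda e: e["start"])
```

The overlap reduces the chance that an entity is cut in half at a window boundary, at the cost of some repeated computation.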

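📚 Documentation

Model Details

Model Description

The synthetic data underlying this news fine-tune was pulled from the AskNews API. We enforced diversity across country, language, topic, and time.

[Figure: country distribution]

[Figure: entity types]

[Figure: topics]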

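Developed by: Emergent Methods
Funded by: Emergent Methods
Shared by: Emergent Methods
Model type: microsoft/deberta
Language(s) (NLP): English (en), covering English texts and texts translated from Spanish (es), Portuguese (pt), German (de), Russian (ru), French (fr), Arabic (ar), Italian (it), Ukrainian (uk), Norwegian (no), Swedish (sv), and Danish (da)
License: Apache 2.0
Fine-tuned from model: GLiNER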

Model Sources

Repository: to be added
Paper: to be added
Demo: to be added

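Uses

Direct Use

As the name suggests, this model is aimed at generalist entity extraction. Although it was fine-tuned on news data, zero-shot accuracy improved by up to 7.5% across 18 benchmark datasets. This suggests that a broad and diverse underlying dataset helps the model identify and extract a wider range of entity types.

The model is small, which makes it suitable for high-throughput production use cases. That is one reason we license it under Apache 2.0. AskNews currently uses this fine-tune for entity extraction in its own system.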
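In a high-throughput setting, documents are usually processed in batches rather than one call per article. The following is a minimal sketch of that pattern. `batch_predict` is a caller-supplied callable and the batch size of 16 is arbitrary; if your installed gliner version provides `model.batch_predict_entities`, it could be passed in here, but check the signature first, since that is an assumption and not something stated in this model card.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for i in range(0, len(items), size):
        yield items[i:i + size]


def extract_many(texts, labels, batch_predict, batch_size=16):
    """Run `batch_predict(batch, labels)` over `texts` in fixed-size batches.

    `batch_predict` is expected to return one list of entities per input
    text, in the same order as the batch it was given.
    """
    results = []
    for batch in batched(texts, batch_size):
        results.extend(batch_predict(batch, labels))
    return results
```

Keeping the predictor as a parameter also makes the batching logic easy to test without loading the model.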

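Bias, Risks, and Limitations

Although the dataset aims to reduce bias and improve diversity, it is still skewed toward Western languages and countries. This limitation stems from the translation and summarization abilities of Llama2 (the base model of WizardLM 13B v1.2). Because Llama2 was used to summarize the open-web articles, any bias in Llama2's training data is also present in this dataset. Likewise, because Llama3 was used to extract entities from the summaries, any bias in Llama3 carries over into the dataset as well.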
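One simple way to quantify this kind of skew on your own corpus is to compute each country's share of the records. The sketch below is illustrative only: the `country` field name and the record layout are hypothetical, and the real AskNews-NER-v0 schema may differ.

```python
from collections import Counter


def country_shares(records, key="country"):
    """Return each country's fraction of the records, largest first.

    `key` is a hypothetical field name; records missing it are ignored.
    """
    counts = Counter(r[key] for r in records if r.get(key))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {country: n / total for country, n in counts.most_common()}
```

A distribution dominated by a handful of countries signals the Western skew described above and may call for re-weighting or additional data before fine-tuning.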

[Figure: country distribution]