nuner-v1_orgs open-source model - free to deploy, identifies organization entities in text

Nuner V1 Orgs

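Developed by guishe

A numind/NuNER-v1.0 model fine-tuned to recognize organization (ORG) entities in text. The model card title refers to FewNERD-fine-supervised, whereas the card body gives NER-ORGS as the fine-tuning dataset.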

Sequence labeling

Transformers

Supports multiple languages  #Organization entity recognition #High-precision NER #RoBERTa fine-tuning

Downloads: 6,836

Released: 3/28/2024

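Model Overview

This is a NuNER model fine-tuned on the NER-ORGS dataset for named entity recognition, specifically for identifying organization names in text. NuNER uses RoBERTa-base as its backbone encoder and was pre-trained on a large, diverse dataset.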

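Model Features

High-quality pre-training

Pre-trained on a large, diverse dataset of 1 million sentences annotated synthetically with gpt-3.5-turbo-0301, which yields high-quality token embeddings

Domain-specific fine-tuning

Fine-tuned on the NER-ORGS dataset to specialize in organization entity recognition

Balanced performance

Balances precision (0.76) and recall (0.80) for an F1 of 0.78. These figures match the first-epoch row of the training results; by the fourth epoch precision is 0.79, recall 0.80, and F1 0.80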

Model Capabilities

Recognition of organization entities in text

Token classification for named entities

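Use Cases

News analysis

Extracting organization entities from news

Identifies companies, government agencies, and other organizations mentioned in news text

In the usage example on this page, CNN, Apple, and Google are each tagged as ORG with scores above 0.98

Business intelligence

Business document analysis

Identifies the organizations mentioned in business documents, contracts, or reports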

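🚀 numind/NuNER-v1.0 fine-tuned on FewNERD-fine-supervised

This is a NuNER model fine-tuned on the NER-ORGS dataset that can be used for named entity recognition. NuNER uses RoBERTa-base as its backbone encoder and was trained on the NuNER dataset, a large and diverse dataset of 1 million sentences annotated synthetically with gpt-3.5-turbo-0301. This further pre-training stage helps produce high-quality token embeddings, a good starting point for fine-tuning on more specialized datasets.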

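🚀 Quick Start

The model can be used directly for inference; see the Usage Example section on this page.

✨ Key Features

Built on the NuNER architecture, with RoBERTa-base as the backbone encoder for feature extraction.
Pre-trained on the large, synthetically annotated NuNER dataset, producing high-quality token embeddings.
Suited to named entity recognition, with a focus on the organization (ORG) entity type.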

💻 Usage Example

Basic usage

>>> from transformers import pipeline

>>> text = """Foreign governments may be spying on your smartphone notifications, senator says. Washington (CNN) — Foreign governments have reportedly attempted to spy on iPhone and Android users through the mobile app notifications they receive on their smartphones - and the US government has forced Apple and Google to keep quiet about it, according to a top US senator. Through legal demands sent to the tech giants, governments have allegedly tried to force Apple and Google to turn over sensitive information that could include the contents of a notification - such as previews of a text message displayed on a lock screen, or an update about app activity, Oregon Democratic Sen. Ron Wyden said in a new report. Wyden's report reflects the latest example of long-running tensions between tech companies and governments over law enforcement demands, which have stretched on for more than a decade. Governments around the world have particularly battled with tech companies over encryption, which provides critical protections to users and businesses while in some cases preventing law enforcement from pursuing investigations into messages sent over the internet."""

>>> classifier = pipeline(
...     "ner",
...     model="guishe/nuner-v1_orgs",
...     aggregation_strategy="simple",
... )
>>> classifier(text)

[{'entity_group': 'ORG',
  'score': 0.9821347,
  'word': 'CNN',
  'start': 94,
  'end': 97},
 {'entity_group': 'ORG',
  'score': 0.99382174,
  'word': ' Apple',
  'start': 288,
  'end': 293},
 {'entity_group': 'ORG',
  'score': 0.99351865,
  'word': ' Google',
  'start': 298,
  'end': 304},
 {'entity_group': 'ORG',
  'score': 0.992792,
  'word': ' Apple',
  'start': 449,
  'end': 454},
 {'entity_group': 'ORG',
  'score': 0.99385214,
  'word': ' Google',
  'start': 459,
  'end': 465}]
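
The records returned above often need light post-processing before use: several `word` values carry a leading space (for example `' Apple'`), and the same organization can appear more than once. The following is a minimal sketch of one way to normalize and count the mentions. The `count_organizations` helper and its `min_score` threshold of 0.9 are illustrative assumptions, not part of the model card or the pipeline API.

```python
from collections import Counter

# Entity records copied from the pipeline output shown above.
predictions = [
    {"entity_group": "ORG", "score": 0.9821347, "word": "CNN", "start": 94, "end": 97},
    {"entity_group": "ORG", "score": 0.99382174, "word": " Apple", "start": 288, "end": 293},
    {"entity_group": "ORG", "score": 0.99351865, "word": " Google", "start": 298, "end": 304},
    {"entity_group": "ORG", "score": 0.992792, "word": " Apple", "start": 449, "end": 454},
    {"entity_group": "ORG", "score": 0.99385214, "word": " Google", "start": 459, "end": 465},
]


def count_organizations(entities, min_score=0.9):
    """Count ORG mentions, dropping low-score records (hypothetical helper).

    min_score is an arbitrary illustrative threshold, not a recommended value.
    """
    counts = Counter()
    for entity in entities:
        if entity["entity_group"] != "ORG" or entity["score"] < min_score:
            continue
        # Strip the leading space that some aggregated words carry.
        counts[entity["word"].strip()] += 1
    return counts


org_counts = count_organizations(predictions)
print(org_counts.most_common())
```

Because each record also carries `start` and `end` character offsets, `text[start:end]` is an alternative way to recover the surface form without relying on `word`.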

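📚 Detailed Documentation

Model details

The model was fine-tuned for named entity recognition as a regular BERT-based model, using HuggingFace's Trainer class.

Model labels

Entity type: organization (ORG)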
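
To illustrate how a token classifier with a single entity type produces whole entity spans, here is a small sketch that merges per-token tags into organization names. It assumes a BIO label scheme with the tags `B-ORG`, `I-ORG`, and `O`; the model card only states that the entity type is ORG, so the exact tag names are an assumption, and `group_bio_tags` is a hypothetical helper rather than the pipeline's own aggregation code.

```python
def group_bio_tags(tokens, tags):
    """Merge consecutive B-ORG / I-ORG tokens into entity strings.

    Assumes a BIO scheme (B-ORG, I-ORG, O); the real label names may differ.
    """
    entities = []
    current = []
    for token, tag in zip(tokens, tags):
        if tag == "B-ORG":
            # A new entity starts; flush any entity in progress.
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif tag == "I-ORG" and current:
            # Continue the entity in progress.
            current.append(token)
        else:
            # Outside any entity (or a stray I-ORG with nothing to continue).
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


# Hand-written example tokens and tags, not model output.
tokens = ["The", "World", "Health", "Organization", "met", "with", "Google", "."]
tags = ["O", "B-ORG", "I-ORG", "I-ORG", "O", "O", "B-ORG", "O"]
print(group_bio_tags(tokens, tags))
```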

Uses

Can be used directly for inference to identify organization entities in text.

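Training procedure

Training hyperparameters

The following hyperparameters were used during training:

Learning rate (learning_rate): 5e-05
Training batch size (train_batch_size): 32
Evaluation batch size (eval_batch_size): 32
Random seed (seed): 42
Gradient accumulation steps (gradient_accumulation_steps): 2
Total training batch size (total_train_batch_size): 64
Optimizer (optimizer): Adam with betas=(0.9, 0.999) and epsilon=1e-08
Learning rate scheduler type (lr_scheduler_type): linear
Learning rate scheduler warmup ratio (lr_scheduler_warmup_ratio): 0.1
Number of epochs (num_epochs): 4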
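
The relationship between these values can be sketched in a few lines. The total batch size of 64 is consistent with a per-step batch of 32 and 2 gradient accumulation steps on a single device (the device count is an assumption; the card does not state it). The schedule function below is a simplified illustration of linear warmup followed by linear decay, not the library's exact implementation, and the total of 6840 steps is taken from the Step column of the training results table.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Examples consumed per optimizer update (num_devices=1 is an assumption)."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def linear_schedule_lr(step, total_steps, base_lr=5e-5, warmup_ratio=0.1):
    """Simplified linear warmup then linear decay to zero (illustrative only)."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 6840  # step count reported at epoch 4 in the results table
print(effective_batch_size(32, 2))
for step in (0, 342, 684, 3762, 6840):
    print(step, linear_schedule_lr(step, total_steps))
```

Under these assumptions the learning rate peaks at 5e-05 after 684 steps, which is 10% of training, and reaches zero at the final step.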

Training results

Training Loss	Epoch	Step	Validation Loss	Precision	Recall	F1	Accuracy
0.0631	1.0	1710	0.0566	0.7635	0.7952	0.7790	0.9778
0.0572	2.0	3420	0.0580	0.7816	0.7925	0.7870	0.9785
0.0429	3.0	5130	0.0562	0.7869	0.8084	0.7975	0.9790
0.0336	4.0	6840	0.0631	0.7912	0.8045	0.7978	0.9790
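
As a quick consistency check on the table, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the reported precision and recall and also picks out the epochs with the best F1 and the lowest validation loss. The variable names are illustrative; the numbers are copied from the table above.

```python
# Rows copied from the training results table:
# (epoch, validation_loss, precision, recall, reported_f1)
results = [
    (1, 0.0566, 0.7635, 0.7952, 0.7790),
    (2, 0.0580, 0.7816, 0.7925, 0.7870),
    (3, 0.0562, 0.7869, 0.8084, 0.7975),
    (4, 0.0631, 0.7912, 0.8045, 0.7978),
]


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Recompute F1 per epoch, rounded to the table's four decimal places.
recomputed = {epoch: round(f1_score(p, r), 4) for epoch, _, p, r, _ in results}
best_f1_epoch = max(results, key=lambda row: row[4])[0]
lowest_val_loss_epoch = min(results, key=lambda row: row[1])[0]

for epoch, _, _, _, reported in results:
    print(epoch, reported, recomputed[epoch])
print("best F1 epoch:", best_f1_epoch)
print("lowest validation loss epoch:", lowest_val_loss_epoch)
```

The recomputed values agree with the reported F1 column to within rounding. The best F1 falls in epoch 4, while the lowest validation loss falls in epoch 3; training loss keeps decreasing as validation loss rises in the final epoch.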