nuner-v1_orgs open-source model: free deployment for accurate recognition of organization entities in text

Nuner V1 Orgs

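Developed by guishe

A numind/NuNER-v1.0 model fine-tuned on FewNERD-fine-supervised, used to identify organization (ORG) entities in text

Sequence labeling

Transformers

Supports multiple languages
#Organization entity recognition #High-precision NER #RoBERTa fine-tuning

Downloads: 6,836

Release date: 3/28/2024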

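Model Overview

This is a NuNER model fine-tuned on the NER-ORGS dataset for named entity recognition, specifically for identifying organization names in text. NuNER uses RoBERTa-base as its backbone encoder and was pre-trained on a large, diverse dataset.

Model Features

High-quality pre-training

Pre-trained on a large, diverse dataset of 1 million sentences synthetically annotated by GPT-3.5-turbo-0301, which yields high-quality token embeddings

Domain-specific fine-tuning

Fine-tuned on the NER-ORGS dataset to specialize in organization entity recognition

Balanced performance

Strikes a good balance between precision (0.76) and recall (0.80), with an F1 score of 0.78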

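Model Capabilities

Organization entity recognition in text

Named entity token classification

Use Cases

News analysis

Organization entity extraction from news

Identifies companies, government agencies, and other organizations mentioned in news text

Recognizes organization names such as CNN, Apple, and Google

Business intelligence

Business document analysis

Analyzes business documents, contracts, or reports for the organizations they mention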

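🚀 numind/NuNER-v1.0 fine-tuned on FewNERD-fine-supervised

This is a NuNER model fine-tuned on the NER-ORGS dataset, and it can be used for named entity recognition. NuNER uses RoBERTa-base as its backbone encoder and was trained on the NuNER dataset, a large, diverse dataset of 1 million sentences synthetically annotated by gpt-3.5-turbo-0301. This further pre-training stage produces high-quality token embeddings, which are a good starting point for fine-tuning on more specialized datasets.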

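🚀 Quick Start

The model can be used directly for inference. A usage example is given below.

✨ Key Features

Built on the NuNER architecture with RoBERTa-base as the backbone encoder for feature extraction.
Pre-trained on the large, synthetically annotated NuNER dataset, which yields high-quality token embeddings.
Suited to named entity recognition, with a focus on the organization (ORG) entity type.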

💻 Usage Examples

Basic Usage

>>> from transformers import pipeline

>>> text = """Foreign governments may be spying on your smartphone notifications, senator says. Washington (CNN) — Foreign governments have reportedly attempted to spy on iPhone and Android users through the mobile app notifications they receive on their smartphones - and the US government has forced Apple and Google to keep quiet about it, according to a top US senator. Through legal demands sent to the tech giants, governments have allegedly tried to force Apple and Google to turn over sensitive information that could include the contents of a notification - such as previews of a text message displayed on a lock screen, or an update about app activity, Oregon Democratic Sen. Ron Wyden said in a new report. Wyden's report reflects the latest example of long-running tensions between tech companies and governments over law enforcement demands, which have stretched on for more than a decade. Governments around the world have particularly battled with tech companies over encryption, which provides critical protections to users and businesses while in some cases preventing law enforcement from pursuing investigations into messages sent over the internet."""

>>> classifier = pipeline(
...     "ner",
...     model="guishe/nuner-v1_orgs",
...     aggregation_strategy="simple",
... )
>>> classifier(text)

[{'entity_group': 'ORG',
  'score': 0.9821347,
  'word': 'CNN',
  'start': 94,
  'end': 97},
 {'entity_group': 'ORG',
  'score': 0.99382174,
  'word': ' Apple',
  'start': 288,
  'end': 293},
 {'entity_group': 'ORG',
  'score': 0.99351865,
  'word': ' Google',
  'start': 298,
  'end': 304},
 {'entity_group': 'ORG',
  'score': 0.992792,
  'word': ' Apple',
  'start': 449,
  'end': 454},
 {'entity_group': 'ORG',
  'score': 0.99385214,
  'word': ' Google',
  'start': 459,
  'end': 465}]
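
The output above is a list of dictionaries with `entity_group`, `score`, `word`, `start`, and `end` keys, and the `word` value can carry a leading space. The following is a minimal sketch of one way to turn such a list into per-organization mention counts. The helper name `summarize_orgs`, the `min_score` threshold of 0.5, and the sample data are assumptions made for illustration; the sample entities are hand-written in the same shape as the pipeline output and are not real model output.

```python
from collections import Counter
from typing import Dict, List


def summarize_orgs(text: str, entities: List[Dict], min_score: float = 0.5) -> Counter:
    """Count ORG mentions, reading each name from the text via its character span."""
    counts: Counter = Counter()
    for ent in entities:
        # Skip non-ORG groups and low-confidence predictions (threshold is an assumption).
        if ent["entity_group"] != "ORG" or ent["score"] < min_score:
            continue
        # Slice by offsets rather than using ent["word"], which may start with a space.
        counts[text[ent["start"]:ent["end"]]] += 1
    return counts


# Hand-written sample shaped like the pipeline output (hypothetical scores).
sample_text = "Apple and Google were asked to comply, CNN reported. Apple declined."
sample_entities = [
    {"entity_group": "ORG", "score": 0.99, "word": " Apple", "start": 0, "end": 5},
    {"entity_group": "ORG", "score": 0.99, "word": " Google", "start": 10, "end": 16},
    {"entity_group": "ORG", "score": 0.30, "word": " comply", "start": 31, "end": 37},
    {"entity_group": "ORG", "score": 0.98, "word": " CNN", "start": 39, "end": 42},
    {"entity_group": "ORG", "score": 0.99, "word": " Apple", "start": 53, "end": 58},
]

org_counts = summarize_orgs(sample_text, sample_entities)
print(org_counts.most_common())
```

Reading names from `start` and `end` keeps them aligned with the source text, which is useful when the mentions need to be highlighted or linked back to the document.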

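📚 Detailed Documentation

Model Details

The model was fine-tuned for named entity recognition as a regular BERT-based model, using HuggingFace's Trainer class.

Model Labels

Entity type: Organization (ORG)

Uses

Can be used directly for inference to identify organization entities in text.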
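
Direct inference works well for short passages. RoBERTa-base encoders accept at most 512 tokens per input, so longer documents usually need to be split first. The sketch below shows one possible approach under stated assumptions: it splits on whitespace using a character budget as a rough proxy for the token limit, runs a classifier on each piece, and shifts the returned offsets back to positions in the full text. The names `chunk_text` and `classify_long_text` and the `max_chars` default are hypothetical, and `classifier` stands for any callable that returns dictionaries with `start` and `end` keys, such as the pipeline object from the usage example.

```python
from typing import Callable, Dict, List, Tuple


def chunk_text(text: str, max_chars: int = 1000) -> List[Tuple[int, str]]:
    """Split text into (offset, chunk) pieces of at most max_chars characters."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Back up to the last space so that words are not cut in half.
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut
        chunks.append((start, text[start:end]))
        start = end
    return chunks


def classify_long_text(
    text: str,
    classifier: Callable[[str], List[Dict]],
    max_chars: int = 1000,
) -> List[Dict]:
    """Run the classifier chunk by chunk and map spans back to the full text."""
    entities = []
    for offset, chunk in chunk_text(text, max_chars):
        for ent in classifier(chunk):
            shifted = dict(ent)
            # Offsets returned for a chunk are relative to that chunk.
            shifted["start"] = ent["start"] + offset
            shifted["end"] = ent["end"] + offset
            entities.append(shifted)
    return entities
```

A multi-word organization name that straddles a chunk boundary can still be split by this scheme, so overlapping chunks or sentence-level splitting may be a better fit for real documents.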

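Training Procedure

Training Hyperparameters

The following hyperparameters were used during training:

Learning rate (learning_rate): 5e-05
Training batch size (train_batch_size): 32
Evaluation batch size (eval_batch_size): 32
Random seed (seed): 42
Gradient accumulation steps (gradient_accumulation_steps): 2
Total training batch size (total_train_batch_size): 64
Optimizer (optimizer): Adam with betas=(0.9, 0.999) and epsilon=1e-08
Learning rate scheduler type (lr_scheduler_type): linear
Learning rate scheduler warmup ratio (lr_scheduler_warmup_ratio): 0.1
Number of epochs (num_epochs): 4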
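
To make the interaction between these settings concrete, the sketch below derives the total batch size from the per-device batch size and the accumulation steps, then evaluates a linear schedule with warmup at a few steps. It is a plain-Python illustration of the usual shape of such a schedule (linear ramp up, then linear decay to zero), not the training code itself. The total of 6,840 optimizer steps is taken from the final row of the training results table, and the rounding of the warmup step count is an assumption.

```python
def effective_batch_size(train_batch_size: int, gradient_accumulation_steps: int) -> int:
    """Number of examples contributing to each optimizer update."""
    return train_batch_size * gradient_accumulation_steps


def linear_schedule_with_warmup(
    step: int,
    base_lr: float = 5e-5,
    total_steps: int = 6840,   # assumption: final step count in the results table
    warmup_ratio: float = 0.1,
) -> float:
    """Learning rate at a given optimizer step for linear warmup then linear decay."""
    warmup_steps = round(total_steps * warmup_ratio)  # rounding rule is an assumption
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_batch = effective_batch_size(32, 2)
for s in (0, 342, 684, 3420, 6840):
    print(s, linear_schedule_with_warmup(s))
```

With a warmup ratio of 0.1, the learning rate peaks at 5e-05 after 684 steps, which falls within the first epoch of 1,710 steps.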

Training Results

Training Loss	Epoch	Step	Validation Loss	Precision	Recall	F1	Accuracy
0.0631	1.0	1710	0.0566	0.7635	0.7952	0.7790	0.9778
0.0572	2.0	3420	0.0580	0.7816	0.7925	0.7870	0.9785
0.0429	3.0	5130	0.0562	0.7869	0.8084	0.7975	0.9790
0.0336	4.0	6840	0.0631	0.7912	0.8045	0.7978	0.9790
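
The F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a quick consistency check, the sketch below recomputes it for each row of the table. The values are copied from the table above, and the helper name `f1_score` is chosen here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (epoch, precision, recall, reported F1), copied from the training results table.
rows = [
    (1, 0.7635, 0.7952, 0.7790),
    (2, 0.7816, 0.7925, 0.7870),
    (3, 0.7869, 0.8084, 0.7975),
    (4, 0.7912, 0.8045, 0.7978),
]

for epoch, p, r, reported in rows:
    print(f"epoch {epoch}: recomputed F1 = {f1_score(p, r):.4f}, reported = {reported:.4f}")
```

The table also shows validation loss at its lowest in epoch 3 (0.0562) while F1 is marginally higher in epoch 4 (0.7978), so the choice of checkpoint depends on which metric is prioritized.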