Span Marker Roberta Large Fewnerd Fine Super
This is a SpanMarker model based on roberta-large for fine-grained named entity recognition, trained on the FewNERD dataset.
Downloads: 53
Released: 3/30/2023
Model Overview
The model uses the SpanMarker architecture with a roberta-large encoder to recognize many kinds of named entities in text, making it well suited to information extraction.
Model Highlights
Fine-grained entity recognition
Supports 66 fine-grained entity types covering people, locations, organizations, and more
Strong base model
Built on the roberta-large encoder for strong semantic understanding
SpanMarker architecture
Uses the SpanMarker approach to handle entity-boundary identification effectively
Capabilities
Named entity recognition
Fine-grained entity classification
Text information extraction
Use Cases
Information extraction
Recognizing people in news
Identify the people mentioned in news text and their types
Can accurately recognize person entities such as "Amelia Earhart"
Geographic information extraction
Recognize locations, buildings, and other geographic entities in text
Can recognize geographic entities such as "Paris" and "the Atlantic"
Content analysis
Film and TV analysis
Recognize films, TV programs, and similar works mentioned in text
Can accurately recognize works such as "Under Siege"
🚀 SpanMarker with roberta-large on the FewNERD dataset
This is a SpanMarker model for named entity recognition, trained on the FewNERD dataset, using roberta-large as the underlying encoder. See train.py for the training script.
🚀 Quick Start
Direct use
```python
from span_marker import SpanMarkerModel

# Download the model from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-roberta-large-fewnerd-fine-super")
# Run inference
entities = model.predict("Most of the Steven Seagal movie ``Under Siege`` (co-starring Tommy Lee Jones) was filmed aboard the Battleship USS Alabama, which is docked on Mobile Bay at Battleship Memorial Park and open to the public.")
```
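`predict` returns a list of entity dictionaries. A minimal sketch of post-processing that output, assuming each dict carries `span` and `label` keys (key names may differ between span_marker versions); the `sample` data below is hypothetical, not actual model output:

```python
from collections import defaultdict

def group_entities_by_label(entities):
    """Group predicted entity spans by their fine-grained label.

    Assumes each entity is a dict with 'span' and 'label' keys,
    as SpanMarkerModel.predict returns in recent versions.
    """
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity["label"]].append(entity["span"])
    return dict(grouped)

# Hypothetical predictions for the sentence above
sample = [
    {"span": "Steven Seagal", "label": "person-actor", "score": 0.99},
    {"span": "Under Siege", "label": "art-film", "score": 0.98},
    {"span": "Tommy Lee Jones", "label": "person-actor", "score": 0.99},
]
print(group_entities_by_label(sample))
# {'person-actor': ['Steven Seagal', 'Tommy Lee Jones'], 'art-film': ['Under Siege']}
```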
Downstream use
You can fine-tune this model on your own dataset.
Click to expand
```python
from datasets import load_dataset
from span_marker import SpanMarkerModel, Trainer

# Download the model from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-roberta-large-fewnerd-fine-super")
# Specify a dataset with "tokens" and "ner_tags" columns
dataset = load_dataset("conll2003")  # e.g. CoNLL2003
# Initialize a Trainer using the pretrained model & dataset
trainer = Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
trainer.save_model("tomaarsen/span-marker-roberta-large-fewnerd-fine-super-finetuned")
```
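If your own data uses string tags rather than integer ids, it must be converted into the "tokens"/"ner_tags" column format before training. A minimal sketch under the assumption that the raw data is lists of tokens with parallel string tags (the helper names here are illustrative, not part of span_marker):

```python
def build_label_mapping(tag_sequences):
    """Map string NER tags to integer ids, keeping 'O' at index 0.

    A sketch for preparing a custom dataset; the final label set
    must match the label scheme in the model's configuration.
    """
    labels = sorted({tag for seq in tag_sequences for tag in seq} - {"O"})
    label2id = {"O": 0}
    label2id.update({label: i + 1 for i, label in enumerate(labels)})
    return label2id

def encode_tags(tag_sequences, label2id):
    """Convert each string-tag sequence into an integer 'ner_tags' row."""
    return [[label2id[tag] for tag in seq] for seq in tag_sequences]

tags = [["O", "person-actor", "O"], ["art-film", "O"]]
mapping = build_label_mapping(tags)
print(mapping)                      # {'O': 0, 'art-film': 1, 'person-actor': 2}
print(encode_tags(tags, mapping))   # [[0, 2, 0], [1, 0]]
```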
✨ Key Features
- Usable for named entity recognition tasks.
- Uses roberta-large as the base encoder, providing strong feature extraction.
- Supports fine-tuning on your own dataset.
📦 Installation
The source documentation does not include installation steps; the library is typically installed with `pip install span_marker`.
📚 Documentation
Model Details
Model Description
Property | Details |
---|---|
Model type | SpanMarker |
Encoder | roberta-large |
Maximum sequence length | 256 tokens |
Maximum entity length | 8 words |
Training dataset | FewNERD |
Language | English |
License | cc-by-sa-4.0 |
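The 8-word maximum entity length means a span-based model only needs to score candidate spans up to that length. A sketch of the span enumeration this implies, as an illustration of the idea rather than span_marker's internal code:

```python
def enumerate_spans(tokens, max_length=8):
    """List every candidate (start, end) word span up to max_length words.

    Illustrates how a maximum entity length bounds the candidate set a
    span-based NER model must classify; SpanMarker's actual
    implementation differs in detail.
    """
    spans = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_length, len(tokens)) + 1):
            spans.append((start, end))
    return spans

tokens = "Battleship USS Alabama is docked".split()
spans = enumerate_spans(tokens, max_length=3)
print(len(spans))  # 12 candidate spans for 5 tokens with max length 3
```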
Model Labels
Label | Examples |
---|---|
art-broadcastprogram | "Street Cents", "The Gale Storm Show : Oh , Susanna", "Corazones" |
art-film | "Shawshank Redemption", "Bosch", "L'Atlantide" |
art-music | "Hollywood Studio Symphony", "Champion Lover", "Atkinson , Danko and Ford ( with Brockie and Hilton )" |
art-other | "Aphrodite of Milos", "Venus de Milo", "The Today Show" |
art-painting | "Production/Reproduction", "Cofiwch Dryweryn", "Touit" |
art-writtenart | "Imelda de ' Lambertazzi", "Time", "The Seven Year Itch" |
building-airport | "Sheremetyevo International Airport", "Newark Liberty International Airport", "Luton Airport" |
building-hospital | "Memorial Sloan-Kettering Cancer Center", "Hokkaido University Hospital", "Yeungnam University Hospital" |
building-hotel | "Flamingo Hotel", "The Standard Hotel", "Radisson Blu Sea Plaza Hotel" |
building-library | "British Library", "Berlin State Library", "Bayerische Staatsbibliothek" |
building-other | "Alpha Recording Studios", "Henry Ford Museum", "Communiplex" |
building-restaurant | "Fatburger", "Carnegie Deli", "Trumbull" |
building-sportsfacility | "Sports Center", "Glenn Warner Soccer Facility", "Boston Garden" |
building-theater | "Pittsburgh Civic Light Opera", "National Paris Opera", "Sanders Theatre" |
event-attack/battle/war/militaryconflict | "Jurist", "Vietnam War", "Easter Offensive" |
event-disaster | "the 1912 North Mount Lyell Disaster", "1990s North Korean famine", "1693 Sicily earthquake" |
event-election | "March 1898 elections", "Elections to the European Parliament", "1982 Mitcham and Morden by-election" |
event-other | "Eastwood Scoring Stage", "Union for a Popular Movement", "Masaryk Democratic Movement" |
event-protest | "Russian Revolution", "French Revolution", "Iranian Constitutional Revolution" |
event-sportsevent | "World Cup", "Stanley Cup", "National Champions" |
location-GPE | "Croatian", "the Republic of Croatia", "Mediterranean Basin" |
location-bodiesofwater | "Arthur Kill", "Norfolk coast", "Atatürk Dam Lake" |
location-island | "new Samsat district", "Staten Island", "Laccadives" |
location-mountain | "Ruweisat Ridge", "Salamander Glacier", "Miteirya Ridge" |
location-other | "Northern City Line", "Victoria line", "Cartuther" |
location-park | "Gramercy Park", "Shenandoah National Park", "Painted Desert Community Complex Historic District" |
location-road/railway/highway/transit | "NJT", "Friern Barnet Road", "Newark-Elizabeth Rail Link" |
organization-company | "Church 's Chicken", "Dixy Chicken", "Texas Chicken" |
organization-education | "MIT", "Barnard College", "Belfast Royal Academy and the Ulster College of Physical Education" |
organization-government/governmentagency | "Supreme Court", "Congregazione dei Nobili", "Diet" |
organization-media/newspaper | "Al Jazeera", "Clash", "TimeOut Melbourne" |
organization-other | "IAEA", "4th Army", "Defence Sector C" |
organization-politicalparty | "Al Wafa ' Islamic", "Kenseitō", "Shimpotō" |
organization-religion | "Jewish", "UPCUSA", "Christian" |
organization-showorganization | "Mr. Mister", "Lizzy", "Bochumer Symphoniker" |
organization-sportsleague | "China League One", "NHL", "First Division" |
organization-sportsteam | "Arsenal", "Luc Alphand Aventures", "Tottenham" |
other-astronomything | "Algol", "`` Caput Larvae ''", "Zodiac" |
other-award | "GCON", "Grand Commander of the Order of the Niger", "Order of the Republic of Guinea and Nigeria" |
other-biologything | "BAR", "N-terminal lipid", "Amphiphysin" |
other-chemicalthing | "carbon dioxide", "sulfur", "uranium" |
other-currency | "$", "Travancore Rupee", "lac crore" |
other-disease | "bladder cancer", "French Dysentery Epidemic of 1779", "hypothyroidism" |
other-educationaldegree | "Bachelor", "Master", "BSc ( Hons ) in physics" |
other-god | "El", "Fujin", "Raijin" |
other-language | "Latin", "Breton-speaking", "English" |
other-law | "Leahy–Smith America Invents Act ( AIA", "Thirty Years ' Peace", "United States Freedom Support Act" |
other-livingthing | "monkeys", "patchouli", "insects" |
other-medical | "Pediatrics", "pediatrician", "amitriptyline" |
person-actor | "Tchéky Karyo", "Ellaline Terriss", "Edmund Payne" |
person-artist/author | "George Axelrod", "Gaetano Donizett", "Hicks" |
person-athlete | "Jaguar", "Tozawa", "Neville" |
person-director | "Bob Swaim", "Frank Darabont", "Richard Quine" |
person-other | "Richard Benson", "Holden", "Campbell" |
person-politician | "Emeric", "Rivière", "William" |
person-scholar | "Stalmine", "Stedman", "Wurdack" |
person-soldier | "Helmuth Weidling", "Joachim Ziegler", "Krukenberg" |
product-airplane | "Luton", "Spey-equipped FGR.2s", "EC135T2 CPDS" |
product-car | "100EX", "Phantom", "Corvettes - GT1 C6R" |
product-food | "red grape", "yakiniku", "V. labrusca" |
product-game | "Airforce Delta", "Splinter Cell", "Hardcore RPG" |
product-other | "Fairbottom Bobs", "X11", "PDP-1" |
product-ship | "HMS `` Chinkara ''", "Congress", "Essex" |
product-software | "Wikipedia", "Apdf", "AmiPDF" |
product-train | "Royal Scots Grey", "High Speed Trains", "55022" |
product-weapon | "AR-15 's", "ZU-23-2M Wróbel", "ZU-23-2MR Wróbel II" |
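The 66 labels above follow a "coarsetype-finetype" naming convention. A small sketch of splitting them for analysis; note that some fine types contain `/` (e.g. `location-road/railway/highway/transit`), so only the first hyphen separates the two levels:

```python
def split_label(label):
    """Split a FewNERD fine-grained label like 'art-film' into
    its (coarse, fine) parts, splitting only on the first hyphen."""
    coarse, _, fine = label.partition("-")
    return coarse, fine

print(split_label("art-film"))  # ('art', 'film')
print(split_label("location-road/railway/highway/transit"))
# ('location', 'road/railway/highway/transit')
```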
Training Details
Training set metrics
Training set metric | Min | Median | Max |
---|---|---|---|
Sentence length (tokens) | 1 | 24.4945 | 267 |
Entities per sentence | 0 | 2.5832 | 88 |
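Metrics like those in the table above are simple corpus statistics. A sketch of how they can be computed over a tokenized corpus (the toy `corpus` here is illustrative; the actual numbers come from the FewNERD training split):

```python
from statistics import median

def length_stats(sentences):
    """Return (min, median, max) token counts over a tokenized corpus,
    mirroring the sentence-length row of the metrics table."""
    lengths = [len(tokens) for tokens in sentences]
    return min(lengths), median(lengths), max(lengths)

corpus = [["Paris", "is", "nice"], ["Hello"], ["The", "USS", "Alabama", "sailed"]]
print(length_stats(corpus))  # (1, 3, 4)
```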
Training hyperparameters
- Learning rate: 1e-05
- Train batch size: 8
- Eval batch size: 8
- Seed: 42
- Optimizer: Adam (β1=0.9, β2=0.999, ε=1e-08)
- LR scheduler type: linear
- LR scheduler warmup ratio: 0.1
- Number of epochs: 3
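A linear scheduler with warmup ratio 0.1 ramps the learning rate up over the first 10% of steps, then decays it linearly to zero. A sketch of the schedule's shape under those hyperparameters (an illustration, not transformers' exact scheduler code):

```python
def linear_schedule_lr(step, total_steps, base_lr=1e-05, warmup_ratio=0.1):
    """Learning rate at a given step under linear warmup + linear decay,
    using the hyperparameters above (lr 1e-05, warmup ratio 0.1)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # ramp linearly from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # decay linearly from base_lr down to 0 over the remaining steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

total = 1000
print(linear_schedule_lr(50, total))   # halfway through warmup (≈5e-06)
print(linear_schedule_lr(100, total))  # end of warmup, peak lr (≈1e-05)
```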
Training hardware
- On cloud: No
- GPU: 1 x NVIDIA GeForce RTX 3090
- CPU: 13th Gen Intel(R) Core(TM) i7-13700K
- RAM: 31.78 GB
Framework versions
- Python: 3.9.16
- SpanMarker: 1.3.1.dev
- Transformers: 4.29.2
- PyTorch: 2.0.1+cu118
- Datasets: 2.14.3
- Tokenizers: 0.13.2
🔧 Technical Details
The source documentation does not provide specific implementation details, so this section is omitted.
📄 License
This model is released under the cc-by-sa-4.0 license.