Span Marker Roberta Large Fewnerd Fine Super

A SpanMarker model built on roberta-large for fine-grained named entity recognition, trained on the FewNERD dataset.

Downloads: 53
Released: 3/30/2023

Model Overview

The model combines the SpanMarker architecture with a roberta-large encoder to recognize a wide range of named entities in text, making it well suited to information extraction.
Model Features

Fine-grained entity recognition
Recognizes 66 fine-grained entity types spanning people, locations, organizations, and more.

Strong base model
Built on the roberta-large encoder, providing strong semantic understanding.

SpanMarker architecture
Uses the SpanMarker method, which handles entity boundary detection effectively.

Model Capabilities

Named entity recognition
Fine-grained entity classification
Text information extraction
Use Cases

Information extraction

News person recognition
Identify people mentioned in news text along with their types.
Accurately recognizes person entities such as 'Amelia Earhart'.

Geographic information extraction
Recognize locations, buildings, and other geographic entities in text.
Recognizes geographic entities such as 'Paris' and 'the Atlantic'.

Content analysis

Film and TV analysis
Identify films, TV programs, and similar works mentioned in text.
Accurately recognizes works such as 'Under Siege'.
🚀 SpanMarker with roberta-large on the FewNERD dataset

This is a SpanMarker model for named entity recognition, trained on the FewNERD dataset. It uses roberta-large as the underlying encoder. See train.py for the training script.
🚀 Quick Start

Direct use

```python
from span_marker import SpanMarkerModel

# Download the model from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-roberta-large-fewnerd-fine-super")

# Run inference
entities = model.predict("Most of the Steven Seagal movie ``Under Siege`` (co-starring Tommy Lee Jones) was filmed aboard the Battleship USS Alabama, which is docked on Mobile Bay at Battleship Memorial Park and open to the public.")
```
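The `predict` call returns a list of entity dicts. As a rough illustration of how such results might be post-processed, the sketch below groups spans by their fine-grained FewNERD label; the dict keys (`span`, `label`, `score`) and the sample values are assumptions for illustration, not output taken from the card.

```python
from collections import defaultdict

# Hypothetical predict() output for the sentence above; the keys
# "span", "label", and "score" are assumptions, not from the card.
entities = [
    {"span": "Steven Seagal", "label": "person-actor", "score": 0.99},
    {"span": "Under Siege", "label": "art-film", "score": 0.98},
    {"span": "Tommy Lee Jones", "label": "person-actor", "score": 0.99},
    {"span": "USS Alabama", "label": "product-ship", "score": 0.97},
]

# Group recognized spans by their fine-grained label
by_label = defaultdict(list)
for ent in entities:
    by_label[ent["label"]].append(ent["span"])

print(dict(by_label))
```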
Downstream use

You can fine-tune this model on your own dataset.

```python
from datasets import load_dataset
from span_marker import SpanMarkerModel, Trainer

# Download the model from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-roberta-large-fewnerd-fine-super")

# Load a dataset with "tokens" and "ner_tags" columns, e.g. CoNLL2003
dataset = load_dataset("conll2003")

# Initialize a Trainer with the pretrained model and the dataset
trainer = Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
trainer.save_model("tomaarsen/span-marker-roberta-large-fewnerd-fine-super-finetuned")
```
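A custom dataset must mirror the CoNLL-style layout the Trainer expects: a `tokens` list with an aligned `ner_tags` list of integer label ids per example. A minimal sketch of one such example (the label ids here are illustrative placeholders, not the FewNERD label scheme):

```python
# One training example in the layout the Trainer consumes:
# word-level tokens plus one integer NER tag per token.
example = {
    "tokens": ["Amelia", "Earhart", "flew", "across", "the", "Atlantic"],
    # Illustrative ids: 0 = outside, 1 = person, 2 = location
    "ner_tags": [1, 1, 0, 0, 0, 2],
}

# The two columns must stay aligned token-for-token.
assert len(example["tokens"]) == len(example["ner_tags"])
```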
✨ Key Features

- Usable for named entity recognition tasks.
- Uses roberta-large as the base encoder, giving strong feature extraction.
- Can be fine-tuned on your own dataset.
📦 Installation

The original card does not cover installation; SpanMarker is typically installed from PyPI with `pip install span_marker`.
📚 Documentation

Model Details

Model Description

Property | Details
---|---
Model type | SpanMarker
Encoder | roberta-large
Maximum sequence length | 256 tokens
Maximum entity length | 8 words
Training dataset | FewNERD
Language | English
License | cc-by-sa-4.0
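The 8-word maximum entity length bounds the candidate spans the model scores: conceptually, every contiguous span of up to 8 words is a candidate. A sketch of that enumeration (my own illustration of the idea, not the library's internals):

```python
def candidate_spans(tokens, max_len=8):
    """Enumerate all contiguous (start, end) word spans of at most max_len words."""
    spans = []
    for start in range(len(tokens)):
        # end is exclusive; cap the span length at max_len words
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            spans.append((start, end))
    return spans

words = "Battleship Memorial Park is open to the public".split()
spans = candidate_spans(words)
```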
Model Sources

Model Labels

Label | Examples
---|---
art-broadcastprogram | "Street Cents", "The Gale Storm Show : Oh , Susanna", "Corazones" |
art-film | "Shawshank Redemption", "Bosch", "L'Atlantide" |
art-music | "Hollywood Studio Symphony", "Champion Lover", "Atkinson , Danko and Ford ( with Brockie and Hilton )" |
art-other | "Aphrodite of Milos", "Venus de Milo", "The Today Show" |
art-painting | "Production/Reproduction", "Cofiwch Dryweryn", "Touit" |
art-writtenart | "Imelda de ' Lambertazzi", "Time", "The Seven Year Itch" |
building-airport | "Sheremetyevo International Airport", "Newark Liberty International Airport", "Luton Airport" |
building-hospital | "Memorial Sloan-Kettering Cancer Center", "Hokkaido University Hospital", "Yeungnam University Hospital" |
building-hotel | "Flamingo Hotel", "The Standard Hotel", "Radisson Blu Sea Plaza Hotel" |
building-library | "British Library", "Berlin State Library", "Bayerische Staatsbibliothek" |
building-other | "Alpha Recording Studios", "Henry Ford Museum", "Communiplex" |
building-restaurant | "Fatburger", "Carnegie Deli", "Trumbull" |
building-sportsfacility | "Sports Center", "Glenn Warner Soccer Facility", "Boston Garden" |
building-theater | "Pittsburgh Civic Light Opera", "National Paris Opera", "Sanders Theatre" |
event-attack/battle/war/militaryconflict | "Jurist", "Vietnam War", "Easter Offensive" |
event-disaster | "the 1912 North Mount Lyell Disaster", "1990s North Korean famine", "1693 Sicily earthquake" |
event-election | "March 1898 elections", "Elections to the European Parliament", "1982 Mitcham and Morden by-election" |
event-other | "Eastwood Scoring Stage", "Union for a Popular Movement", "Masaryk Democratic Movement" |
event-protest | "Russian Revolution", "French Revolution", "Iranian Constitutional Revolution" |
event-sportsevent | "World Cup", "Stanley Cup", "National Champions" |
location-GPE | "Croatian", "the Republic of Croatia", "Mediterranean Basin" |
location-bodiesofwater | "Arthur Kill", "Norfolk coast", "Atatürk Dam Lake" |
location-island | "new Samsat district", "Staten Island", "Laccadives" |
location-mountain | "Ruweisat Ridge", "Salamander Glacier", "Miteirya Ridge" |
location-other | "Northern City Line", "Victoria line", "Cartuther" |
location-park | "Gramercy Park", "Shenandoah National Park", "Painted Desert Community Complex Historic District" |
location-road/railway/highway/transit | "NJT", "Friern Barnet Road", "Newark-Elizabeth Rail Link" |
organization-company | "Church 's Chicken", "Dixy Chicken", "Texas Chicken" |
organization-education | "MIT", "Barnard College", "Belfast Royal Academy and the Ulster College of Physical Education" |
organization-government/governmentagency | "Supreme Court", "Congregazione dei Nobili", "Diet" |
organization-media/newspaper | "Al Jazeera", "Clash", "TimeOut Melbourne" |
organization-other | "IAEA", "4th Army", "Defence Sector C" |
organization-politicalparty | "Al Wafa ' Islamic", "Kenseitō", "Shimpotō" |
organization-religion | "Jewish", "UPCUSA", "Christian" |
organization-showorganization | "Mr. Mister", "Lizzy", "Bochumer Symphoniker" |
organization-sportsleague | "China League One", "NHL", "First Division" |
organization-sportsteam | "Arsenal", "Luc Alphand Aventures", "Tottenham" |
other-astronomything | "Algol", "`` Caput Larvae ''", "Zodiac" |
other-award | "GCON", "Grand Commander of the Order of the Niger", "Order of the Republic of Guinea and Nigeria" |
other-biologything | "BAR", "N-terminal lipid", "Amphiphysin" |
other-chemicalthing | "carbon dioxide", "sulfur", "uranium" |
other-currency | "$", "Travancore Rupee", "lac crore" |
other-disease | "bladder cancer", "French Dysentery Epidemic of 1779", "hypothyroidism" |
other-educationaldegree | "Bachelor", "Master", "BSc ( Hons ) in physics" |
other-god | "El", "Fujin", "Raijin" |
other-language | "Latin", "Breton-speaking", "English" |
other-law | "Leahy–Smith America Invents Act ( AIA", "Thirty Years ' Peace", "United States Freedom Support Act" |
other-livingthing | "monkeys", "patchouli", "insects" |
other-medical | "Pediatrics", "pediatrician", "amitriptyline" |
person-actor | "Tchéky Karyo", "Ellaline Terriss", "Edmund Payne" |
person-artist/author | "George Axelrod", "Gaetano Donizett", "Hicks" |
person-athlete | "Jaguar", "Tozawa", "Neville" |
person-director | "Bob Swaim", "Frank Darabont", "Richard Quine" |
person-other | "Richard Benson", "Holden", "Campbell" |
person-politician | "Emeric", "Rivière", "William" |
person-scholar | "Stalmine", "Stedman", "Wurdack" |
person-soldier | "Helmuth Weidling", "Joachim Ziegler", "Krukenberg" |
product-airplane | "Luton", "Spey-equipped FGR.2s", "EC135T2 CPDS" |
product-car | "100EX", "Phantom", "Corvettes - GT1 C6R" |
product-food | "red grape", "yakiniku", "V. labrusca" |
product-game | "Airforce Delta", "Splinter Cell", "Hardcore RPG" |
product-other | "Fairbottom Bobs", "X11", "PDP-1" |
product-ship | "HMS `` Chinkara ''", "Congress", "Essex" |
product-software | "Wikipedia", "Apdf", "AmiPDF" |
product-train | "Royal Scots Grey", "High Speed Trains", "55022" |
product-weapon | "AR-15 's", "ZU-23-2M Wróbel", "ZU-23-2MR Wróbel II" |
Training Details

Training Set Metrics

Training set metric | Min | Median | Max
---|---|---|---
Sentence length | 1 | 24.4945 | 267
Entities per sentence | 0 | 2.5832 | 88
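Metrics of this shape can be reproduced for any tokenized dataset with a few lines of stdlib Python; a sketch on toy stand-in data (the actual FewNERD loading is omitted):

```python
from statistics import median

# Toy stand-in data: per-sentence token counts and entity counts
sentence_lengths = [1, 12, 24, 30, 267]
entities_per_sentence = [0, 1, 2, 3, 88]

def summarize(values):
    """Return the min/median/max summary used in the table above."""
    return {"min": min(values), "median": median(values), "max": max(values)}

length_stats = summarize(sentence_lengths)
entity_stats = summarize(entities_per_sentence)
```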
Training Hyperparameters

- Learning rate: 1e-05
- Train batch size: 8
- Eval batch size: 8
- Seed: 42
- Optimizer: Adam (β1=0.9, β2=0.999, ε=1e-08)
- LR scheduler type: linear
- LR scheduler warmup ratio: 0.1
- Number of epochs: 3
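With a linear scheduler and a 0.1 warmup ratio, the learning rate climbs from 0 to the 1e-05 peak over the first 10% of steps and then decays linearly back to 0. A minimal sketch of that schedule (a hand-rolled illustration, not the Transformers implementation):

```python
def linear_schedule_lr(step, total_steps, peak_lr=1e-05, warmup_ratio=0.1):
    """Linear warmup to peak_lr over the first warmup_ratio of steps, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: ramp up proportionally to the step count
        return peak_lr * step / warmup_steps
    # Decay phase: ramp down over the remaining steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)
```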
Training Hardware

- Cloud: No
- GPU: 1 x NVIDIA GeForce RTX 3090
- CPU: 13th Gen Intel(R) Core(TM) i7-13700K
- RAM: 31.78 GB

Framework Versions

- Python: 3.9.16
- SpanMarker: 1.3.1.dev
- Transformers: 4.29.2
- PyTorch: 2.0.1+cu118
- Datasets: 2.14.3
- Tokenizers: 0.13.2

🔧 Technical Details

The original card provides no further implementation details, so this section is omitted.

📄 License

This model is released under cc-by-sa-4.0.