twitter-roberta-base-dec2021-tweetner7-all Open-Source Model

Twitter Roberta Base Dec2021 Tweetner7 All

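Developed by tner

A named entity recognition (NER) model fine-tuned from Twitter-RoBERTa on the TweetNER7 dataset, built specifically for recognizing entities in tweets.

Sequence Labeling

Transformers

#TwitterNER #MultipleEntityTypes #HighRecall

Downloads: 6

Release date: 7/3/2022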

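Model Overview

This model is a fine-tuned version of Twitter-RoBERTa on the TweetNER7 dataset. It identifies named entities in tweets, such as people, locations, and corporations.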

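Model Features

High-performance entity recognition

Strongest on person entities (F1 of about 0.83 on the 2021 test set), with product (about 0.67) and location (about 0.66) next.

Tweet-specific text handling

Expects account names and URLs in tweets to be converted to the special format used by TweetNER7, the dataset the model was trained on.

Multiple entity types

Recognizes seven entity types: corporation, creative work, event, group, location, person, and product.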

Model Capabilities

Tweet entity recognition

Multi-class entity classification

Special-format text handling

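Use Cases

Social media analysis

Tweet entity extraction

Extract people, locations, corporations, and other entities from tweets for social media monitoring and analysis.

Micro-averaged F1 of 0.6447 on the 2021 test set.

Content recommendation

Entity-based content recommendation

Identify entities in tweets to recommend related content or products to users.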

🚀 tner/twitter-roberta-base-dec2021-tweetner7-all

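This model is a fine-tuned version of cardiffnlp/twitter-roberta-base-dec2021 on the tner/tweetner7 dataset (train_all split). It was fine-tuned using T-NER's hyperparameter search (see the T-NER repository for details). It achieves the following results on the 2021 test set: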

F1 (micro): 0.6447001005249637
Precision (micro): 0.6234607906675308
Recall (micro): 0.6674375578168362
F1 (macro): 0.5982200308213212
Precision (macro): 0.576608821080324
Recall (macro): 0.622268182336741
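
As a quick consistency check on the figures above, the micro-averaged F1 should equal the harmonic mean of the micro-averaged precision and recall. The short sketch below recomputes it from the two reported values; the variable names are illustrative and not part of T-NER.

```python
# Reported micro-averaged precision and recall on the 2021 test set.
precision_micro = 0.6234607906675308
recall_micro = 0.6674375578168362

# F1 is the harmonic mean of precision and recall.
f1_micro = 2 * precision_micro * recall_micro / (precision_micro + recall_micro)

print(f"{f1_micro:.4f}")  # 0.6447
```

The same identity is not expected to hold for the macro figures, since a macro F1 is usually an average of per-entity F1 scores rather than a combination of macro precision and macro recall.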

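✨ Key Features

Fine-tuned model: fine-tuned from the pretrained model cardiffnlp/twitter-roberta-base-dec2021 on the tner/tweetner7 dataset.
Multi-metric evaluation: evaluated on the test set with F1, precision, and recall, each reported as micro and macro averages.
Dataset-specific input format: tailored to the TweetNER7 dataset, whose tweets are preprocessed so that account names and URLs appear in a special format.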

📦 Installation

This model can be used through the tner library. Install the library with pip:

pip install tner

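💻 Usage Examples

Basic Usage

TweetNER7 preprocesses tweets so that account names and URLs are converted to a special format (see the dataset page for details). Tweets must therefore be formatted the same way before running model prediction, as in the following example. The example also imports the urlextract package, which can be installed with pip install urlextract.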

import re
from urlextract import URLExtract
from tner import TransformersNER

extractor = URLExtract()

def format_tweet(tweet):
    # Mask web URLs
    urls = extractor.find_urls(tweet)
    for url in urls:
        tweet = tweet.replace(url, "{{URL}}")
    # Format Twitter account mentions
    tweet = re.sub(r"\b(\s*)(@[\S]+)\b", r'\1{\2@}', tweet)
    return tweet


text = "Get the all-analog Classic Vinyl Edition of `Takin' Off` Album from @herbiehancock via @bluenoterecords link below: http://bluenote.lnk.to/AlbumOfTheWeek"
text_format = format_tweet(text)
model = TransformersNER("tner/twitter-roberta-base-dec2021-tweetner7-all")
model.predict([text_format])
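
To make the expected input format concrete, here is a minimal, dependency-free sketch of the same kind of preprocessing. It is an approximation rather than a replacement for the snippet above: the function name `format_tweet_simple` and both regular expressions are assumptions made for illustration, and the simple `https?://` pattern will miss URLs that urlextract would catch (for example, links written without a scheme).

```python
import re

# Assumption: URLs start with http:// or https:// and run to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")
# Assumption: a mention is "@" followed by word characters and is not preceded
# by a word character, so e-mail addresses like user@example.com are left alone.
MENTION_PATTERN = re.compile(r"(?<!\w)@(\w+)")


def format_tweet_simple(tweet: str) -> str:
    """Convert URLs to {{URL}} and @name mentions to {@name@}."""
    tweet = URL_PATTERN.sub("{{URL}}", tweet)
    return MENTION_PATTERN.sub(r"{@\1@}", tweet)


example = "New album from @herbiehancock via @bluenoterecords: http://bluenote.lnk.to/AlbumOfTheWeek"
formatted = format_tweet_simple(example)
print(formatted)
# New album from {@herbiehancock@} via {@bluenoterecords@}: {{URL}}
```

The formatted string is what would then be passed to the model's predict call in place of the raw tweet.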

Although the model can also be used through the transformers library, this is currently not recommended because the CRF layer is not supported there.

📚 Detailed Documentation

Test Set F1 Score Details

The per-entity breakdown of F1 scores on the test set is as follows:

Corporation: 0.5048128342245989
Creative work: 0.45297029702970293
Event: 0.46761313220940554
Group: 0.6009661835748793
Location: 0.6592252133946159
Person: 0.8302430243024302
Product: 0.6717095310136157
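
The macro-averaged F1 reported for the 2021 test set (0.5982) appears to be the unweighted mean of these seven per-entity scores. A short check, with dictionary keys chosen here for illustration only:

```python
# Per-entity F1 scores on the 2021 test set, copied from the list above.
f1_by_entity = {
    "corporation": 0.5048128342245989,
    "creative_work": 0.45297029702970293,
    "event": 0.46761313220940554,
    "group": 0.6009661835748793,
    "location": 0.6592252133946159,
    "person": 0.8302430243024302,
    "product": 0.6717095310136157,
}

# Macro averaging gives every entity type equal weight, regardless of frequency.
f1_macro = sum(f1_by_entity.values()) / len(f1_by_entity)

print(f"{f1_macro:.4f}")  # 0.5982
```

Because each type counts equally, the weaker scores for creative work and event pull the macro average below the micro average.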

For the F1 scores, the confidence intervals obtained by bootstrap are as follows:

F1 (micro):
- 90%: [0.6358921767926183, 0.6542958612061787]
- 95%: [0.6341987223616053, 0.6560992650244356]
F1 (macro):
- 90%: [0.6358921767926183, 0.6542958612061787]
- 95%: [0.6341987223616053, 0.6560992650244356]
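
Bootstrap confidence intervals of this kind are typically obtained by resampling the test sentences with replacement, recomputing the metric on each resample, and reading off percentiles. The following is a rough sketch of that idea for micro F1. The per-sentence counts are made-up toy data, the function names are hypothetical, and the sketch is not claimed to match how T-NER computes its intervals.

```python
import random


def micro_f1(counts):
    """Micro F1 from per-sentence (true positive, false positive, false negative) counts."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_interval(counts, level=0.90, n_resamples=1000, seed=0):
    """Percentile bootstrap interval for micro F1 over resampled sentences."""
    rng = random.Random(seed)
    scores = sorted(
        micro_f1(rng.choices(counts, k=len(counts))) for _ in range(n_resamples)
    )
    alpha = (1 - level) / 2
    lower = scores[round(alpha * n_resamples)]
    upper = scores[round((1 - alpha) * n_resamples) - 1]
    return lower, upper


# Toy data: (tp, fp, fn) entity counts for eight hypothetical sentences.
toy_counts = [(2, 1, 0), (1, 0, 1), (0, 1, 1), (3, 0, 0),
              (1, 1, 1), (2, 0, 1), (0, 0, 1), (1, 1, 0)]

low, high = bootstrap_interval(toy_counts, level=0.90)
print(f"point estimate {micro_f1(toy_counts):.3f}, 90% interval [{low:.3f}, {high:.3f}]")
```

Resampling whole sentences, rather than individual entities, keeps entities from the same tweet together in each resample.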

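Full evaluation results are available in the NER metric file and the entity span metric file.

Training Hyperparameters

The following hyperparameters were used during training: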

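Property	Details
Dataset	['tner/tweetner7']
Dataset split	train_all
Dataset name	None
Local dataset	None
Model	cardiffnlp/twitter-roberta-base-dec2021
CRF	True
Max length	128
Epochs	30
Batch size	32
Learning rate	1e-05
Random seed	0
Gradient accumulation steps	1
Weight decay	1e-07
LR warmup step ratio	0.3
Max gradient norm	1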

The full configuration is available in the fine-tuning parameter file.
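
To illustrate what a warmup step ratio of 0.3 could mean in practice, the sketch below derives a number of warmup steps from the listed batch size and epoch count, then applies a linear warmup. Two things here are assumptions rather than facts from the configuration: the training-set size (`num_examples = 1000` is a placeholder) and the shape of the schedule (linear warmup, with any later decay omitted). The actual scheduler is defined in T-NER.

```python
import math

# Values taken from the hyperparameter table.
learning_rate = 1e-05
batch_size = 32
epochs = 30
gradient_accumulation_steps = 1
warmup_step_ratio = 0.3

# Placeholder, NOT the real size of the TweetNER7 train_all split.
num_examples = 1000

steps_per_epoch = math.ceil(num_examples / (batch_size * gradient_accumulation_steps))
total_steps = steps_per_epoch * epochs
warmup_steps = round(warmup_step_ratio * total_steps)


def warmup_lr(step: int) -> float:
    """Learning rate at an optimizer step, assuming linear warmup then a constant rate."""
    if step < warmup_steps:
        return learning_rate * step / warmup_steps
    return learning_rate


print(total_steps, warmup_steps)  # 960 288
```

Under these assumptions, the learning rate would climb from 0 to 1e-05 over the first 30% of optimizer steps.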

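🔧 Technical Details

The model was fine-tuned using T-NER's hyperparameter search. A CRF layer was used during training to improve named entity recognition performance. Tweets were also preprocessed to convert account names and URLs to the special format used by the TweetNER7 dataset.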
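
For readers unfamiliar with CRF layers: instead of picking the best tag for each token independently, a CRF scores whole tag sequences using per-token scores plus tag-to-tag transition scores, and the best sequence is usually found with the Viterbi algorithm. The toy sketch below shows that decoding step in plain Python. All scores and the three-tag label set are invented for illustration, and this is not T-NER's implementation.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring sequence of tag indices.

    emissions[t][j]: score of tag j at token t.
    transitions[i][j]: score of moving from tag i to tag j.
    """
    num_tags = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for emission in emissions[1:]:
        step_scores, step_back = [], []
        for j in range(num_tags):
            # Best previous tag for ending at tag j on this token.
            best_prev = max(range(num_tags), key=lambda i: scores[i] + transitions[i][j])
            step_scores.append(scores[best_prev] + transitions[best_prev][j] + emission[j])
            step_back.append(best_prev)
        scores = step_scores
        backpointers.append(step_back)
    # Follow the backpointers from the best final tag.
    best = max(range(num_tags), key=lambda j: scores[j])
    path = [best]
    for step_back in reversed(backpointers):
        best = step_back[best]
        path.append(best)
    return path[::-1]


tags = ["O", "B-person", "I-person"]
# Toy transition scores: "O" -> "I-person" is heavily penalised.
transitions = [
    [0.0, 0.0, -1e4],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
# Toy per-token scores for a three-token input.
emissions = [
    [2.0, 1.0, 0.0],
    [0.0, 1.0, 1.5],
    [0.1, 0.0, 2.0],
]

greedy = [tags[max(range(len(tags)), key=lambda j: e[j])] for e in emissions]
decoded = [tags[i] for i in viterbi_decode(emissions, transitions)]
print(greedy)   # ['O', 'I-person', 'I-person']
print(decoded)  # ['O', 'B-person', 'I-person']
```

Per-token argmax yields an "I-person" tag directly after "O", which is not a valid span start, while sequence-level decoding picks "B-person" there. Losing this step is one plausible reason predictions can differ when a CRF-trained checkpoint is run without its CRF layer.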

📄 Citation

If you use this model, please cite the T-NER paper and the TweetNER7 paper:

T-NER

@inproceedings{ushio-camacho-collados-2021-ner,
    title = "{T}-{NER}: An All-Round Python Library for Transformer-based Named Entity Recognition",
    author = "Ushio, Asahi  and
      Camacho-Collados, Jose",
    booktitle = "Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations",
    month = apr,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.eacl-demos.7",
    doi = "10.18653/v1/2021.eacl-demos.7",
    pages = "53--62",
    abstract = "Language model (LM) pretraining has led to consistent improvements in many NLP downstream tasks, including named entity recognition (NER). In this paper, we present T-NER (Transformer-based Named Entity Recognition), a Python library for NER LM finetuning. In addition to its practical utility, T-NER facilitates the study and investigation of the cross-domain and cross-lingual generalization ability of LMs finetuned on NER. Our library also provides a web app where users can get model predictions interactively for arbitrary text, which facilitates qualitative model evaluation for non-expert programmers. We show the potential of the library by compiling nine public NER datasets into a unified format and evaluating the cross-domain and cross-lingual performance across the datasets. The results from our initial experiments show that in-domain performance is generally competitive across datasets. However, cross-domain generalization is challenging even with a large pretrained LM, which has nevertheless capacity to learn domain-specific features if fine-tuned on a combined dataset. To facilitate future research, we also release all our LM checkpoints via the Hugging Face model hub.",
}

TweetNER7

@inproceedings{ushio-etal-2022-tweet,
    title = "{N}amed {E}ntity {R}ecognition in {T}witter: {A} {D}ataset and {A}nalysis on {S}hort-{T}erm {T}emporal {S}hifts",
    author = "Ushio, Asahi  and
        Neves, Leonardo  and
        Silva, Vitor  and
        Barbieri, Francesco  and
        Camacho-Collados, Jose",
    booktitle = "The 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing",
    month = nov,
    year = "2022",
    address = "Online",
    publisher = "Association for Computational Linguistics",
}