roberta-large-finetuned-ner
This model is a fine-tuned version of roberta-large on the PLOD-unfiltered dataset, achieving high precision, recall, F1, and accuracy in token-classification tasks.
Quick Start
This model is a fine-tuned version of roberta-large on the PLOD-unfiltered dataset.
It achieves the following results on the evaluation set:
- Loss: 0.1393
- Precision: 0.9663
- Recall: 0.9627
- F1: 0.9645
- Accuracy: 0.9608
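A minimal usage sketch with the Transformers pipeline API; the Hub repository ID below (surrey-nlp/roberta-large-finetuned-ner) is an assumption, so replace it with the actual model ID if yours differs:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Assumed Hub repository ID; replace with the actual ID of this model if it differs.
model_id = "surrey-nlp/roberta-large-finetuned-ner"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# Token-classification (NER) pipeline; aggregation groups sub-word pieces into word-level spans.
ner = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
print(ner("Light dissolved inorganic carbon (DIC) resulting from the oxidation of hydrocarbons."))
```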
Features
- Model Creators: Leonardo Zilio, Hadeel Saadany, Prashant Sharma, Diptesh Kanojia, Constantin Orasan.
- Base Model: roberta-large
- Metrics: precision, recall, f1, accuracy
- Dataset: surrey-nlp/PLOD-unfiltered
| Property | Details |
|----------|---------|
| Model Type | roberta-large-finetuned-ner |
| Training Data | surrey-nlp/PLOD-unfiltered |
Documentation
Model description
RoBERTa is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means
it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
publicly available data), using an automatic process to generate inputs and labels from those texts.
More precisely, it was pretrained with the masked language modeling (MLM) objective. Taking a sentence, the model
randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict
the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see the words one
after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to
learn a bidirectional representation of the sentence.
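As a quick illustration of the MLM objective, a fill-mask call on the base roberta-large checkpoint (not this fine-tuned model) predicts candidates for the masked token:

```python
from transformers import pipeline

# Fill-mask demo of the pretraining objective on the base checkpoint (RoBERTa uses the <mask> token).
unmasker = pipeline("fill-mask", model="roberta-large")
print(unmasker("Paris is the <mask> of France."))
```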
This way, the model learns an inner representation of the English language that can then be used to extract features
useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard
classifier using the features produced by the RoBERTa model as inputs.
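For example, a short sketch of extracting such features with the base encoder (hidden states that a downstream classifier could consume):

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Encode a sentence with the base roberta-large encoder and take its last hidden states as features.
tokenizer = AutoTokenizer.from_pretrained("roberta-large")
encoder = AutoModel.from_pretrained("roberta-large")

inputs = tokenizer("Abbreviation detection is a sequence labelling task.", return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

features = outputs.last_hidden_state  # shape: (batch, sequence_length, 1024)
```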
Intended uses & limitations
More information needed
Training and evaluation data
The model is fine-tuned using the [PLOD-Unfiltered](https://huggingface.co/datasets/surrey-nlp/PLOD-unfiltered) dataset, which is used for both training and evaluation. The PLOD dataset was published at LREC 2022 and can help build sequence labeling models for the task of abbreviation detection.
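A short sketch of loading the dataset with the datasets library (assuming the Hub ID surrey-nlp/PLOD-unfiltered referenced above):

```python
from datasets import load_dataset

# Load the PLOD-unfiltered dataset from the Hugging Face Hub.
plod = load_dataset("surrey-nlp/PLOD-unfiltered")

print(plod)              # available splits and their sizes
print(plod["train"][0])  # one example with its tokens and token-level labels
```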
Technical Details
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 8
- eval_batch_size: 4
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 6
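These settings could be expressed with the Trainer API roughly as follows (a hypothetical reconstruction; the original training script is not part of this card):

```python
from transformers import TrainingArguments

# Hypothetical mapping of the reported hyperparameters onto TrainingArguments.
training_args = TrainingArguments(
    output_dir="roberta-large-finetuned-ner",
    learning_rate=2e-05,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=4,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-08,
    lr_scheduler_type="linear",
    num_train_epochs=6,
)
```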
Training results
| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
|---------------|-------|-------|-----------------|-----------|--------|--------|----------|
| 0.1281 | 1.0 | 14233 | 0.1300 | 0.9557 | 0.9436 | 0.9496 | 0.9457 |
| 0.1056 | 2.0 | 28466 | 0.1076 | 0.9620 | 0.9552 | 0.9586 | 0.9545 |
| 0.0904 | 3.0 | 42699 | 0.1054 | 0.9655 | 0.9585 | 0.9620 | 0.9583 |
| 0.0743 | 4.0 | 56932 | 0.1145 | 0.9658 | 0.9602 | 0.9630 | 0.9593 |
| 0.0523 | 5.0 | 71165 | 0.1206 | 0.9664 | 0.9619 | 0.9641 | 0.9604 |
| 0.044  | 6.0 | 85398 | 0.1393 | 0.9663 | 0.9627 | 0.9645 | 0.9608 |
Framework versions
- Transformers 4.18.0
- Pytorch 1.10.1+cu111
- Datasets 2.1.0
- Tokenizers 0.12.1
License
This model is released under the MIT license.