🚀 roberta-large-ner-english: model fine-tuned from roberta-large for NER task
This is an English NER model fine-tuned from roberta-large on the conll2003 dataset. It performs particularly well on informal text such as emails and chat, including entities that are not capitalised.
🚀 Quick Start
The roberta-large-ner-english model is an English NER model fine-tuned from roberta-large on the conll2003 dataset. It was validated on email/chat data and outperformed other models on that domain, in particular on entities that do not start with an uppercase letter.
✨ Features
- Fine-tuned from roberta-large on the conll2003 dataset.
- Performs well on email/chat data.
- Handles entities that do not start with an uppercase letter better than comparable models (see the sketch after this list).
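The behaviour on lower-cased entities can be checked directly with a token-classification pipeline. A minimal sketch; the chat-style sentence is only an illustrative input, not taken from the original card:

```python
from transformers import pipeline

# Load the fine-tuned model through a token-classification pipeline.
nlp = pipeline(
    "ner",
    model="Jean-Baptiste/roberta-large-ner-english",
    aggregation_strategy="simple",
)

# Chat-style text in which the entities are not capitalised (illustrative example).
print(nlp("hey, can you ask john smith from microsoft to join the call from paris?"))
```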
📦 Installation
The model is loaded through the Hugging Face transformers library; no model-specific installation steps are required beyond installing transformers (see the Usage Examples below).
💻 Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Jean-Baptiste/roberta-large-ner-english")
model = AutoModelForTokenClassification.from_pretrained("Jean-Baptiste/roberta-large-ner-english")

# Build a token-classification pipeline that merges sub-word tokens into entity spans
nlp = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="simple")
nlp("Apple was founded in 1976 by Steve Jobs, Steve Wozniak and Ronald Wayne to develop and sell Wozniak's Apple I personal computer")
```
Expected output:

```python
[{'entity_group': 'ORG',
  'score': 0.99381506,
  'word': ' Apple',
  'start': 0,
  'end': 5},
 {'entity_group': 'PER',
  'score': 0.99970853,
  'word': ' Steve Jobs',
  'start': 29,
  'end': 39},
 {'entity_group': 'PER',
  'score': 0.99981767,
  'word': ' Steve Wozniak',
  'start': 41,
  'end': 54},
 {'entity_group': 'PER',
  'score': 0.99956465,
  'word': ' Ronald Wayne',
  'start': 59,
  'end': 71},
 {'entity_group': 'PER',
  'score': 0.9997918,
  'word': ' Wozniak',
  'start': 92,
  'end': 99},
 {'entity_group': 'MISC',
  'score': 0.99956393,
  'word': ' Apple I',
  'start': 102,
  'end': 109}]
```
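The pipeline returns a list of dictionaries with entity_group, score, word, start, and end keys. A minimal sketch of turning that output into (span text, label, score) tuples, reusing the nlp pipeline built above; the helper name is chosen here for illustration and is not part of the model's API:

```python
def extract_entities(text, nlp):
    """Return (span text, entity label, score) tuples from the pipeline output."""
    return [
        (text[ent["start"]:ent["end"]], ent["entity_group"], float(ent["score"]))
        for ent in nlp(text)
    ]

text = "Apple was founded in 1976 by Steve Jobs, Steve Wozniak and Ronald Wayne"
print(extract_entities(text, nlp))
# e.g. [('Apple', 'ORG', 0.99...), ('Steve Jobs', 'PER', 0.99...), ...]
```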
📚 Documentation
Training data
| Property | Details |
|----------|---------|
| Model Type | English NER model fine-tuned from roberta-large |
| Training Data | conll2003 dataset |

Training data was classified as follows (the B- and I- prefixes from the original conll2003 tags were removed):

| Label | Description |
|-------|-------------|
| O | Outside of a named entity |
| MISC | Miscellaneous entity |
| PER | Person's name |
| ORG | Organization |
| LOC | Location |

The train and test splits of the original conll2003 were used for training, and the "validation" split was used for validation. The resulting dataset sizes:

| Split | Size |
|-------|------|
| Train | 17494 |
| Validation | 3250 |
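The label set described above can also be confirmed programmatically from the model configuration; a minimal check, not part of the original card:

```python
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("Jean-Baptiste/roberta-large-ner-english")

# Mapping from class index to label name, e.g. O / MISC / PER / ORG / LOC.
print(model.config.id2label)
```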
Model performances
Model performances were computed on the conll2003 validation dataset (computed on token predictions):

| Entity | Precision | Recall | F1 |
|--------|-----------|--------|-----|
| PER | 0.9914 | 0.9927 | 0.9920 |
| ORG | 0.9627 | 0.9661 | 0.9644 |
| LOC | 0.9795 | 0.9862 | 0.9828 |
| MISC | 0.9292 | 0.9262 | 0.9277 |
| Overall | 0.9740 | 0.9766 | 0.9753 |
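One way to compute similar token-level precision/recall/F1 scores, assuming you already have aligned lists of gold and predicted labels; this is a sketch under that assumption, not the exact evaluation script used for the table above:

```python
from sklearn.metrics import classification_report

# Gold and predicted labels, one per token, already aligned (toy example).
y_true = ["PER", "PER", "O", "ORG", "O", "LOC"]
y_pred = ["PER", "PER", "O", "ORG", "O", "O"]

# Per-entity precision, recall and F1, scoring only the entity classes (not "O").
print(classification_report(y_true, y_pred, labels=["PER", "ORG", "LOC", "MISC"], zero_division=0))
```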
On a private dataset (email, chat, informal discussion), computed on word predictions:
| Entity | Precision | Recall | F1 |
|--------|-----------|--------|-----|
| PER | 0.8823 | 0.9116 | 0.8967 |
| ORG | 0.7694 | 0.7292 | 0.7487 |
| LOC | 0.8619 | 0.7768 | 0.8171 |
By comparison, on the same private dataset, spaCy (en_core_web_trf-3.2.0) gave the following results:
| Entity | Precision | Recall | F1 |
|--------|-----------|--------|-----|
| PER | 0.9146 | 0.8287 | 0.8695 |
| ORG | 0.7655 | 0.6437 | 0.6993 |
| LOC | 0.8727 | 0.6180 | 0.7236 |
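The spaCy numbers above come from the en_core_web_trf pipeline. A minimal sketch of producing comparable entity predictions with spaCy, assuming the en_core_web_trf model is installed; this is not the author's benchmarking code:

```python
import spacy

# Requires: python -m spacy download en_core_web_trf
nlp_spacy = spacy.load("en_core_web_trf")

doc = nlp_spacy("Apple was founded in 1976 by Steve Jobs, Steve Wozniak and Ronald Wayne")
# spaCy uses PERSON/ORG/GPE-style labels, which need mapping to PER/ORG/LOC for a fair comparison.
print([(ent.text, ent.label_) for ent in doc.ents])
```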
🔧 Technical Details
No additional technical details are provided in the original document.
📄 License
No license information is provided in the original document.