nuner-v1_orgs Open-source Model - Free Deployment for Accurately Identifying Organizational Entities in Texts

Nuner V1 Orgs

Developed by guishe

A model fine-tuned from numind/NuNER-v1.0 on the NER-ORGS dataset for recognizing organizational entities (ORG) in text

Sequence Labeling

Transformers

Supports Multiple Languages #Organization Entity Recognition #High-precision NER #RoBERTa Fine-tuning

Downloads 6,836

Release Time: 3/28/2024

Model Overview

This model is NuNER fine-tuned on the NER-ORGS dataset for named entity recognition, specifically for identifying organization names in text. NuNER uses RoBERTa-base as its backbone encoder and was pre-trained on a large, diverse dataset.

Model Features

High-quality Pre-training

Pre-trained on a large, diverse dataset of 1 million sentences synthetically annotated by GPT-3.5-turbo-0301, generating high-quality token embeddings

Domain-specific Fine-tuning

Fine-tuned on the NER-ORGS dataset, specifically optimized for organizational entity recognition

Balanced Performance

Achieves a good balance between precision (0.76) and recall (0.80), with an F1 score of 0.78

Model Capabilities

Recognition of organizational entities in text

Named entity tag classification

Use Cases

News Analysis

Extraction of Organizational Entities in News

Identify mentioned companies, government agencies, and other organizational entities from news texts

Can accurately recognize organization names such as CNN, Apple, Google, etc.

Business Intelligence

Business Document Analysis

Analyze relevant organizations mentioned in business documents, contracts, or reports

🚀 numind/NuNER-v1.0 fine-tuned on FewNERD-fine-supervised

This is a NuNER model fine-tuned on the NER-ORGS dataset for Named Entity Recognition. It uses RoBERTa-base as the backbone encoder and was pre-trained on a large synthetic dataset.

🚀 Quick Start

This model is a fine-tuned NuNER model that can be used for Named Entity Recognition. It uses [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base) as the backbone encoder, which was trained on the NuNER dataset: 1M sentences from a large and diverse corpus, synthetically labeled by gpt-3.5-turbo-0301. This pre-training phase produced high-quality token embeddings, which are a good starting point for fine-tuning on more specialized datasets.

✨ Features

Fine-tuned for NER: Specifically fine-tuned for the Named Entity Recognition task.
High-quality embeddings: Benefits from pre-training on a large synthetic dataset.

📚 Documentation

Model Details

The model was fine-tuned as a regular BERT-based model for the NER task using the HuggingFace Trainer class.

Model labels

Entity Types: ORG
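The card lists only the entity type, not the token-level tag set. As a rough illustration of how token tags become entity spans, here is a minimal sketch that assumes a BIO scheme with the tags `O`, `B-ORG`, and `I-ORG`. Both the tag names and the `bio_to_spans` helper are assumptions made for this example and are not taken from the model's configuration.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans.

    Assumes tags of the form "O", "B-<TYPE>", "I-<TYPE>" (hypothetical scheme).
    """
    spans = []
    current_type, current_tokens = None, []

    def flush():
        # Emit the span collected so far, if any.
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and current_type != tag[2:]
        )
        if starts_new:
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:  # "O" closes any open span
            flush()
            current_type, current_tokens = None, []
    flush()
    return spans


tokens = ["The", "World", "Health", "Organization", "met", "Apple", "."]
tags = ["O", "B-ORG", "I-ORG", "I-ORG", "O", "B-ORG", "O"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

In practice the `aggregation_strategy` argument of the Transformers pipeline performs this grouping, so a helper like this is only needed when working with raw token-level predictions.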

Uses

Direct Use for Inference

>>> from transformers import pipeline

>>> text = """Foreign governments may be spying on your smartphone notifications, senator says. Washington (CNN) — Foreign governments have reportedly attempted to spy on iPhone and Android users through the mobile app notifications they receive on their smartphones - and the US government has forced Apple and Google to keep quiet about it, according to a top US senator. Through legal demands sent to the tech giants, governments have allegedly tried to force Apple and Google to turn over sensitive information that could include the contents of a notification - such as previews of a text message displayed on a lock screen, or an update about app activity, Oregon Democratic Sen. Ron Wyden said in a new report. Wyden's report reflects the latest example of long-running tensions between tech companies and governments over law enforcement demands, which have stretched on for more than a decade. Governments around the world have particularly battled with tech companies over encryption, which provides critical protections to users and businesses while in some cases preventing law enforcement from pursuing investigations into messages sent over the internet."""

>>> classifier = pipeline(
...     "ner",
...     model="guishe/nuner-v1_orgs",
...     aggregation_strategy="simple",
... )
>>> classifier(text)

[{'entity_group': 'ORG',
  'score': 0.9821347,
  'word': 'CNN',
  'start': 94,
  'end': 97},
 {'entity_group': 'ORG',
  'score': 0.99382174,
  'word': ' Apple',
  'start': 288,
  'end': 293},
 {'entity_group': 'ORG',
  'score': 0.99351865,
  'word': ' Google',
  'start': 298,
  'end': 304},
 {'entity_group': 'ORG',
  'score': 0.992792,
  'word': ' Apple',
  'start': 449,
  'end': 454},
 {'entity_group': 'ORG',
  'score': 0.99385214,
  'word': ' Google',
  'start': 459,
  'end': 465}]
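
The output above repeats organizations and some `word` values carry a leading space from the tokenizer. The following is a minimal sketch of one way to tidy such a result into per-organization counts. The `summarize_orgs` helper, its `min_score` threshold, and the toy input with made-up scores are all hypothetical and exist only to illustrate the output format.

```python
from collections import Counter


def summarize_orgs(entities, text, min_score=0.5):
    """Count ORG mentions in pipeline-style output (hypothetical helper)."""
    counts = Counter()
    for ent in entities:
        if ent["entity_group"] != "ORG" or ent["score"] < min_score:
            continue
        # Slice the source text by character offsets instead of using `word`,
        # which may include a leading space.
        counts[text[ent["start"]:ent["end"]].strip()] += 1
    return counts


# Toy data shaped like the pipeline output; the scores are illustrative only.
sample = "CNN says Apple and Google stayed quiet. Apple declined to comment."
entities = [
    {"entity_group": "ORG", "score": 0.98, "word": "CNN", "start": 0, "end": 3},
    {"entity_group": "ORG", "score": 0.99, "word": " Apple", "start": 9, "end": 14},
    {"entity_group": "ORG", "score": 0.99, "word": " Google", "start": 19, "end": 25},
    {"entity_group": "ORG", "score": 0.99, "word": " Apple", "start": 40, "end": 45},
]
counts = summarize_orgs(entities, sample)
print(counts.most_common())
```

Using the `start` and `end` offsets keeps the extracted names identical to the source text, which matters if they are later used for highlighting or linking.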

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-05
train_batch_size: 32
eval_batch_size: 32
seed: 42
gradient_accumulation_steps: 2
total_train_batch_size: 64
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_ratio: 0.1
num_epochs: 4
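
To make the relationship between these values concrete, here is a small sketch in plain Python. It assumes a single training device (so the total batch size is the per-device batch size times the accumulation steps), takes the total step count of 6840 from the training results, and models the linear scheduler as a linear warmup followed by a linear decay to zero. The function name `linear_schedule_lr` is hypothetical, and rounding the warmup step count is an assumption.

```python
# Values copied from the hyperparameter list; total_steps comes from the
# last row of the training results (4 epochs x 1710 steps).
learning_rate = 5e-05
train_batch_size = 32
gradient_accumulation_steps = 2
warmup_ratio = 0.1
total_steps = 6840

# Assumption: one device, so no extra multiplier for data parallelism.
effective_batch_size = train_batch_size * gradient_accumulation_steps

# Assumption: warmup length is the ratio applied to the total step count.
warmup_steps = round(warmup_ratio * total_steps)


def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then linear decay to zero (a sketch)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


print(effective_batch_size, warmup_steps)
for step in (0, 342, 684, 3762, 6840):
    print(step, linear_schedule_lr(step, learning_rate, warmup_steps, total_steps))
```

Under these assumptions the learning rate peaks at 5e-05 after the warmup phase and reaches zero at the final step.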

Training results

Training Loss	Epoch	Step	Validation Loss	Precision	Recall	F1	Accuracy
0.0631	1.0	1710	0.0566	0.7635	0.7952	0.7790	0.9778
0.0572	2.0	3420	0.0580	0.7816	0.7925	0.7870	0.9785
0.0429	3.0	5130	0.0562	0.7869	0.8084	0.7975	0.9790
0.0336	4.0	6840	0.0631	0.7912	0.8045	0.7978	0.9790
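
The F1 column can be cross-checked against the precision and recall columns with the usual harmonic-mean formula, F1 = 2PR / (P + R). The sketch below recomputes it per epoch and shows that the epoch with the lowest validation loss differs from the epoch with the highest F1. The variable names are hypothetical, and the card does not say which checkpoint was published.

```python
# Rows copied from the training results table:
# (epoch, validation_loss, precision, recall, reported_f1)
rows = [
    (1, 0.0566, 0.7635, 0.7952, 0.7790),
    (2, 0.0580, 0.7816, 0.7925, 0.7870),
    (3, 0.0562, 0.7869, 0.8084, 0.7975),
    (4, 0.0631, 0.7912, 0.8045, 0.7978),
]


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for epoch, val_loss, p, r, reported in rows:
    print(epoch, round(f1_score(p, r), 4), reported)

# Two common ways to pick a checkpoint can disagree.
best_by_loss = min(rows, key=lambda row: row[1])[0]
best_by_f1 = max(rows, key=lambda row: row[4])[0]
print(best_by_loss, best_by_f1)
```

Validation loss rises in the last epoch while F1 still improves slightly, which is a common pattern late in fine-tuning.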

Framework versions

Transformers 4.36.0
Pytorch 2.0.0+cu117
Datasets 2.18.0
Tokenizers 0.15.2

📄 License

This model is licensed under cc-by-sa-4.0.

📚 Citation

BibTeX

@misc{bogdanov2024nuner,
      title={NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data},
      author={Sergei Bogdanov and Alexandre Constantin and Timothée Bernard and Benoit Crabbé and Etienne Bernard},
      year={2024},
      eprint={2402.15343},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
