bert-base-NER-Russian Open Source Model - Freely Identify Entities Such as Person Names, Locations, and Institutions in Russian Texts

Bert Base NER Russian

Developed by Gherman

A named entity recognition (NER) model for Russian text, fine-tuned from bert-base-multilingual-cased. It uses the BIOLU annotation format and recognizes entity types such as person names, locations, and organizations.

Sequence Labeling

Transformers

Other

Open Source License: MIT

#Russian NER #Multi-entity recognition #BIOLU annotation

Downloads 128.72k

Release Time: 9/29/2024

Model Overview

This model is designed for named entity recognition in Russian text. It is suitable for information extraction, content analysis, and text preprocessing for downstream NLP tasks.

Model Features

Multi-type entity recognition

Recognizes entity types such as person names, locations, and organizations, with fine-grained sub-category labels (for example CITY, STREET, or FIRST_NAME).

High-quality training data

Trained on AlexKly's Detailed-NER-Dataset-RU dataset, with excellent annotation quality.

BIOLU annotation system

Uses the BIOLU annotation format, which, unlike plain BIO annotation, explicitly marks the last token of an entity (L) and single-token entities (U).

Model Capabilities

Russian text analysis

Named entity recognition

Information extraction

Use Cases

Information processing

Russian document analysis

Extracting key information such as person names and locations from Russian documents

Highly accurate entity recognition

Content classification

Classifying content based on identified entities

🚀 Russian Named Entity Recognition Model

This model is a fine-tuned version of bert-base-multilingual-cased for Named Entity Recognition (NER) in Russian text, capable of identifying various entity types.

🚀 Quick Start

Here's a simple example of how to use the model:

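from transformers import pipeline

# Load a token-classification pipeline backed by the fine-tuned checkpoint
ner_pipe = pipeline("ner", model="Gherman/bert-base-NER-Russian")

text = "Меня зовут Сергей Иванович из Москвы."
results = ner_pipe(text)

# Each result is one token with its predicted BIOLU tag and confidence score
for result in results:
    print(f"Word: {result['word']}, Entity: {result['entity']}, Score: {result['score']:.4f}")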
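With default settings, the pipeline reports one entry per WordPiece token, so a longer name can come back as several pieces, with continuation pieces prefixed by `##`. The following is a minimal sketch of stitching those pieces back into whole words. It assumes each entry carries the `word`, `entity`, and `score` keys that the Transformers token-classification pipeline normally emits. `merge_subwords` is a hypothetical helper, and the sample entries are hand-written for illustration, not real model output.

```python
def merge_subwords(results):
    """Merge WordPiece continuation pieces ('##...') into the preceding token.

    The merged word keeps the tag of its first piece and the lowest score
    among its pieces (a conservative choice made for this sketch).
    """
    merged = []
    for item in results:
        if item["word"].startswith("##") and merged:
            prev = merged[-1]
            prev["word"] += item["word"][2:]  # drop the '##' marker
            prev["score"] = min(prev["score"], item["score"])
        else:
            merged.append(
                {"word": item["word"], "entity": item["entity"], "score": item["score"]}
            )
    return merged


# Hand-written sample in the pipeline's output shape (NOT real model output).
sample = [
    {"word": "Сер", "entity": "U-FIRST_NAME", "score": 0.99},
    {"word": "##гей", "entity": "U-FIRST_NAME", "score": 0.97},
    {"word": "Москвы", "entity": "U-CITY", "score": 0.98},
]

for word in merge_subwords(sample):
    print(f"Word: {word['word']}, Entity: {word['entity']}, Score: {word['score']:.4f}")
```

Taking the tag from the first piece is only one possible policy; majority voting across pieces would be an equally reasonable choice.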

✨ Features

This model is a fine-tuned version of bert-base-multilingual-cased for Named Entity Recognition (NER) in Russian text.
It can identify various entity types such as person names, locations, and organizations using the BIOLU tagging format.
It can be used for tasks such as information extraction, content analysis, and text preprocessing for downstream NLP tasks.

📚 Documentation

Intended uses & limitations

The model is designed to identify named entities in Russian text. It can be used for tasks such as information extraction, content analysis, and text preprocessing for downstream NLP tasks.

Limitations and bias

The model's performance may vary depending on the domain and style of the input text.
It may struggle with rare or complex entity names not seen during training.
The model might exhibit biases present in the training data.

Training data

The model was trained on Detailed-NER-Dataset-RU by AlexKly. Check it out, the dataset is pretty good!

Label Information

The dataset is labeled using the BIOLU format, where:

B: Beginning token of an entity
I: Inner token of an entity
O: Other (non-entity) token
L: Last token of an entity
U: Unit token (single-token entity)

The following entity types are included in the dataset:

Location (LOC) tags:

COUNTRY
REGION
CITY
DISTRICT
STREET
HOUSE

Person (PER) tags:

LAST_NAME
FIRST_NAME
MIDDLE_NAME

For example, a full tag might look like "B-CITY" for the beginning token of a city name, or "U-COUNTRY" for a single-token country name.
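To make the tagging scheme concrete, here is a minimal sketch of turning a BIOLU tag sequence back into entity spans. `decode_biolu` is a hypothetical helper written for this illustration, and the example tokens and tags are hand-labeled, not model predictions. The sketch assumes one tag per whole word and silently drops spans whose tag sequence is inconsistent (for example an "I-" tag with no preceding "B-").

```python
def decode_biolu(tokens, tags):
    """Group (token, tag) pairs into (entity_type, text) spans using BIOLU rules."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "U":
            # Single-token entity: emit immediately
            entities.append((entity_type, token))
            current_tokens, current_type = [], None
        elif prefix == "B":
            # Start a new multi-token entity
            current_tokens, current_type = [token], entity_type
        elif prefix == "I" and entity_type == current_type:
            # Continue the open entity
            current_tokens.append(token)
        elif prefix == "L" and entity_type == current_type:
            # Close the open entity and emit it
            current_tokens.append(token)
            entities.append((entity_type, " ".join(current_tokens)))
            current_tokens, current_type = [], None
        else:
            # "O" or an inconsistent tag: discard any unfinished span
            current_tokens, current_type = [], None
    return entities


# Hand-labeled example (an assumption for illustration, not model output)
tokens = ["Я", "живу", "в", "Нижнем", "Новгороде", ",", "Россия"]
tags = ["O", "O", "O", "B-CITY", "L-CITY", "O", "U-COUNTRY"]
print(decode_biolu(tokens, tags))
```

Because "L" and "U" mark entity boundaries explicitly, the decoder never has to look ahead at the next tag to know that an entity has ended, which is the practical advantage over plain BIO.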

Training procedure

The model was fine-tuned from the bert-base-multilingual-cased checkpoint using the Hugging Face Transformers library.

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 2e-5
train_batch_size: 16
eval_batch_size: 16
seed: 42
optimizer: Adam with weight decay fix
lr_scheduler_type: linear
num_epochs: 10
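A linear scheduler decays the learning rate from its initial value to zero over the course of training. The sketch below shows that rule in plain Python. The model card lists no warmup setting, so `warmup_steps=0` is an assumption, and `total_steps=1000` is a hypothetical value chosen only to keep the example readable. `linear_lr` is an illustrative helper, not part of any library.

```python
def linear_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Learning rate at a given step under linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 1000  # hypothetical; the real value depends on dataset size and batch size
for step in (0, 250, 500, 750, 1000):
    print(f"step {step:4d}: lr = {linear_lr(step, total_steps):.2e}")
```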

Framework versions

Transformers 4.28.1
PyTorch 1.13.0
Datasets 2.12.0
Tokenizers 0.13.3

Evaluation results

The model achieves the following results on the evaluation set:

Precision: 0.987843
Recall: 0.988498
F1 Score: 0.988170
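As a quick sanity check, F1 is the harmonic mean of precision and recall, and the reported numbers are consistent with that definition. The snippet below recomputes it. Whether the original evaluation averaged over tokens, entities, or entity types is not stated in the model card, so this only confirms internal consistency of the three figures.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(0.987843, 0.988498)
print(f"F1 = {f1:.6f}")  # prints "F1 = 0.988170"
```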

Ethical considerations

This model is intended for use in analyzing Russian text and should be used responsibly. Users should be aware of potential biases in the model's predictions and use the results judiciously, especially in applications that may impact individuals or groups.

🔧 Technical Details

The model is a fine-tuned version of bert-base-multilingual-cased for NER in Russian text. It uses the BIOLU tagging format to identify entities. Fine-tuning was carried out with the Hugging Face Transformers library, using the hyperparameters listed above.

📄 License

The model is licensed under the MIT license.

Property	Details
Model Type	Fine-tuned `bert-base-multilingual-cased` for Russian NER
Training Data	Detailed-NER-Dataset-RU by AlexKly
License	MIT
