Bert4ner-base-chinese Open Source Model - Free Deployment for Efficient Chinese Named Entity Recognition

Bert4ner Base Chinese

Developed by shibing624

A BERT-based Chinese named entity recognition model, achieving near state-of-the-art performance on the People's Daily dataset

Sequence Labeling

Transformers

Supports Multiple Languages

Open Source License: Apache-2.0

#Chinese Entity Recognition #BERT Architecture #High-precision NER

Downloads 439

Release Time: 5/7/2022

Model Overview

This model is a BERT-based Chinese named entity recognition model, specifically designed to identify entities such as person names, locations, organization names, and time in Chinese text.

Model Features

High Performance

Achieves an F1 score of 0.9525 on the People's Daily test set, approaching state-of-the-art levels.

Supports Multiple Entity Types

Capable of recognizing various entities such as person names (PER), locations (LOC), organization names (ORG), and time (TIME).

Easy to Use

Provides both a simple API and HuggingFace Transformers integration.

Model Capabilities

Chinese Named Entity Recognition

Text Analysis

Information Extraction

Use Cases

Text Information Extraction

Resume Information Extraction

Extracts key information such as person names and birth years from resume text.

Successfully identified '常建良' (person name) and '1963年' (time) in the example input.

News Text Analysis

Analyzes people, locations, and organizations in news text.

Successfully identified '王宏伟' (person name), '北京' (location), and '王府井' (location) in the example input.

🚀 BERT for Chinese Named Entity Recognition (bert4ner) Model

This is a Chinese named entity recognition model. The bert4ner-base-chinese model was evaluated on the PEOPLE (Renmin Ribao, i.e. People's Daily) test data.

The overall performance of BERT on the PEOPLE test set:

Model	Precision	Recall	F1
BertSoftmax	0.9425	0.9627	0.9525

It reaches a level close to SOTA on the PEOPLE test set.
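
As a quick sanity check on the table, F1 is the harmonic mean of precision and recall. The sketch below assumes that the first score column (0.9425) is the precision; the function name `f1_score` is illustrative and is not part of nerpy.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Scores reported for BertSoftmax on the PEOPLE test set.
f1 = f1_score(0.9425, 0.9627)
print(round(f1, 4))  # 0.9525
```

Under that assumption, the first two columns reproduce the reported F1 of 0.9525.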

The network structure of BertSoftmax (native BERT):

[Figure: BertSoftmax network architecture]

🚀 Quick Start

✨ Features

This model performs Chinese named entity recognition.
It achieves high performance on the PEOPLE test set, approaching the SOTA level.

📦 Installation

This model is open-sourced as part of the named entity recognition project nerpy, which supports the bert4ner model. You can call the model as follows:

>>> from nerpy import NERModel
>>> model = NERModel("bert", "shibing624/bert4ner-base-chinese")
>>> predictions, raw_outputs, entities = model.predict(["常建良，男，1963年出生，工科学士，高级工程师"], split_on_space=False)
entities: [('常建良', 'PER'), ('1963年', 'TIME')]

The model files are composed as follows:

bert4ner-base-chinese
    ├── config.json
    ├── model_args.json
    ├── pytorch_model.bin
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    └── vocab.txt

💻 Usage Examples

Basic Usage

Without nerpy, you can use the model like this:

First, pass your input through the transformer model; then decode the predicted BIO tags to obtain the entity words.

Install package:

pip install transformers seqeval

import os
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
from seqeval.metrics.sequence_labeling import get_entities

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("shibing624/bert4ner-base-chinese")
model = AutoModelForTokenClassification.from_pretrained("shibing624/bert4ner-base-chinese")
label_list = ['I-ORG', 'B-LOC', 'O', 'B-ORG', 'I-LOC', 'I-PER', 'B-TIME', 'I-TIME', 'B-PER']

sentence = "王宏伟来自北京，是个警察，喜欢去王府井游玩儿。"


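def get_entity(sentence):
    # tokenize() returns tokens without special tokens, while encode() adds [CLS] and [SEP].
    tokens = tokenizer.tokenize(sentence)
    inputs = tokenizer.encode(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(inputs).logits
    # Drop the [CLS] and [SEP] predictions so each remaining prediction lines up with its token.
    predictions = torch.argmax(outputs, dim=2)[0, 1:-1].tolist()
    char_tags = [(token, label_list[prediction]) for token, prediction in zip(tokens, predictions)]
    print(sentence)
    print(char_tags)

    # Group the BIO tags into (type, start, end) spans.
    pred_labels = [i[1] for i in char_tags]
    entities = []
    line_entities = get_entities(pred_labels)
    for i in line_entities:
        # Token positions index the raw sentence, which assumes one token per character.
        word = sentence[i[1]: i[2] + 1]
        entity_type = i[0]
        entities.append((word, entity_type))

    print("Sentence entity:")
    print(entities)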


get_entity(sentence)

Output:

王宏伟来自北京，是个警察，喜欢去王府井游玩儿。
[('王', 'B-PER'), ('宏', 'I-PER'), ('伟', 'I-PER'), ('来', 'O'), ('自', 'O'), ('北', 'B-LOC'), ('京', 'I-LOC'), ('，', 'O'), ('是', 'O'), ('个', 'O'), ('警', 'O'), ('察', 'O'), ('，', 'O'), ('喜', 'O'), ('欢', 'O'), ('去', 'O'), ('王', 'B-LOC'), ('府', 'I-LOC'), ('井', 'I-LOC'), ('游', 'O'), ('玩', 'O'), ('儿', 'O'), ('。', 'O')]
Sentence entity:
[('王宏伟', 'PER'), ('北京', 'LOC'), ('王府井', 'LOC')]
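
The BIO decoding step can also be sketched without seqeval. The following is a simplified stand-in for `get_entities`, not seqeval's implementation: the function name `bio_to_entities` is hypothetical, and the sketch assumes exactly one tag per character.

```python
def bio_to_entities(chars, tags):
    """Group character-level BIO tags into (text, type) entity tuples."""
    entities = []
    start, current_type = None, None
    # A sentinel "O" at the end flushes an entity that runs to the last character.
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, etype = tag.partition("-")
        # An open entity ends on "O", on a new "B-", or on a change of type.
        if current_type is not None and (prefix != "I" or etype != current_type):
            entities.append(("".join(chars[start:i]), current_type))
            start, current_type = None, None
        # Start a new entity on "B-"; a stray "I-" is also treated as a start.
        if prefix in ("B", "I") and current_type is None:
            start, current_type = i, etype
    return entities


chars = list("王宏伟来自北京")
tags = ["B-PER", "I-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
print(bio_to_entities(chars, tags))  # [('王宏伟', 'PER'), ('北京', 'LOC')]
```

Treating a stray `I-` tag as the start of an entity is a design choice of this sketch; a stricter decoder could discard such tags instead.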

📚 Documentation

Training Datasets

Chinese Named Entity Recognition Datasets

Dataset	Corpus
CNER Chinese Named Entity Recognition Dataset	CNER (120,000 words)
PEOPLE Chinese Named Entity Recognition Dataset	Renmin Ribao Dataset (2 million words)

The data format of the CNER Chinese named entity recognition dataset:

美	B-LOC
国	I-LOC
的	O
华	B-PER
莱	I-PER
士	I-PER

我	O
跟	O
他	O
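
A file in this format can be read back into sentences with a few lines of Python. This is a minimal sketch rather than nerpy's own loader: the function name `read_bio_sentences` is hypothetical, and it assumes that each non-blank line holds one character and one tag separated by whitespace, with a blank line between sentences.

```python
def read_bio_sentences(text):
    """Parse 'char tag' lines into a list of (chars, tags) sentence pairs."""
    sentences, chars, tags = [], [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if chars:
                sentences.append((chars, tags))
                chars, tags = [], []
            continue
        # Assumption: character and tag are separated by whitespace (a tab in the sample).
        char, tag = line.split()
        chars.append(char)
        tags.append(tag)
    if chars:
        # Flush the last sentence when the text has no trailing blank line.
        sentences.append((chars, tags))
    return sentences


sample = "美\tB-LOC\n国\tI-LOC\n的\tO\n华\tB-PER\n莱\tI-PER\n士\tI-PER\n\n我\tO\n跟\tO\n他\tO\n"
for chars, tags in read_bio_sentences(sample):
    print("".join(chars), tags)
```

Because the sketch splits on whitespace, it would need adjusting for data in which the labeled character is itself a space.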

If you need to train bert4ner, please refer to https://github.com/shibing624/nerpy/tree/main/examples

📄 License

This project is licensed under the Apache-2.0 license.

📚 Citation

@software{nerpy,
  author = {Xu Ming},
  title = {nerpy: Named Entity Recognition toolkit},
  year = {2022},
  url = {https://github.com/shibing624/nerpy},
}
