# GuwenBERT
GuwenBERT is a RoBERTa model pre-trained on Classical Chinese. It can be fine-tuned for downstream tasks such as sentence breaking, punctuation, and named entity recognition, making it a useful base model for Classical Chinese text processing.
## 🚀 Quick Start
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ethanyt/guwenbert-large")
model = AutoModel.from_pretrained("ethanyt/guwenbert-large")
```
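Because GuwenBERT is pre-trained with masked language modeling, you can also probe it directly through the `fill-mask` pipeline. The snippet below is a minimal sketch, assuming the checkpoint ships with its MLM head and that the tokenizer defines a mask token; the example sentence is only illustrative.

```python
from transformers import pipeline

# Load a fill-mask pipeline; this assumes the checkpoint includes
# the masked-language-modeling head used during pre-training.
fill_mask = pipeline("fill-mask", model="ethanyt/guwenbert-large")

# Mask one character in a Classical Chinese sentence and inspect the
# top predictions. The mask token string is taken from the tokenizer
# rather than hard-coded.
mask = fill_mask.tokenizer.mask_token
for candidate in fill_mask(f"学而时习之，不亦{mask}乎"):
    print(candidate["token_str"], round(candidate["score"], 4))
```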
## ✨ Features
- Pre-trained on Classical Chinese: the model is pre-trained on a large-scale Classical Chinese corpus, so it captures the language characteristics of Classical Chinese better than general-purpose Chinese models.
- Multiple downstream tasks: it can be fine-tuned for various downstream tasks, such as sentence breaking, punctuation, and named entity recognition.
## 📦 Installation
GuwenBERT is used through the Hugging Face `transformers` library for Python. If it is not already installed, install it with `pip install transformers`.
## 💻 Usage Examples
### Basic Usage
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ethanyt/guwenbert-large")
model = AutoModel.from_pretrained("ethanyt/guwenbert-large")
```
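A typical use of the base model is to encode a sentence and take the hidden states as contextual character features. The following is a minimal sketch under that assumption; the input sentence is only an example.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ethanyt/guwenbert-large")
model = AutoModel.from_pretrained("ethanyt/guwenbert-large")
model.eval()

# Encode one Classical Chinese sentence and run a forward pass.
inputs = tokenizer("子曰学而时习之", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Hidden states of the last layer: one vector per token, usable as
# contextual character features for downstream tasks.
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```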
## 📚 Documentation
### Model description

This is a RoBERTa model pre-trained on Classical Chinese. You can fine-tune GuwenBERT for downstream tasks such as sentence breaking, punctuation, and named entity recognition.
For more information about RoBERTa, take a look at RoBERTa's official repository.
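Sentence breaking, punctuation, and named entity recognition can all be framed as character-level (token) classification. The sketch below shows one common way to set this up with `AutoModelForTokenClassification`; the label set is a hypothetical placeholder for a sentence-breaking task and is not part of the original release.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical label set for sentence breaking:
# "O" = no break after this character, "B" = sentence boundary.
labels = ["O", "B"]

tokenizer = AutoTokenizer.from_pretrained("ethanyt/guwenbert-large")
model = AutoModelForTokenClassification.from_pretrained(
    "ethanyt/guwenbert-large",
    num_labels=len(labels),
)

# From here, fine-tune as usual (e.g. with the Trainer API) on
# character-level annotations aligned to the tokenizer's output.
```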
### Training data
The training data is the Daizhige dataset (殆知阁古代文献), which contains 15,694 books in Classical Chinese covering Buddhism, Confucianism, Medicine, History, Zi, Yi, Yizang, Shizang, Taoism, and Jizang. 76% of the books are punctuated. The total number of characters is 1.7B (1,743,337,673). All traditional characters are converted to simplified characters. The vocabulary is constructed from this dataset and has a size of 23,292.
### Training procedure
The models are initialized with hfl/chinese-roberta-wwm-ext-large and then pre-trained with a 2-step strategy. In the first step, the model learns MLM with only the word embeddings updated during training, until convergence. In the second step, all parameters are updated during training.
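The first step amounts to freezing everything except the word embeddings and training with the usual MLM objective. Below is a minimal PyTorch sketch of that freezing logic; the parameter-name check is an assumption about how the embedding weights are named in this architecture, not code from the original training scripts.

```python
from transformers import AutoModelForMaskedLM

# Initialize from the checkpoint named above.
model = AutoModelForMaskedLM.from_pretrained("hfl/chinese-roberta-wwm-ext-large")

# Step 1: keep only the word embeddings trainable and run MLM training
# until convergence (assumes the embedding weights contain
# "word_embeddings" in their parameter name).
for name, param in model.named_parameters():
    param.requires_grad = "word_embeddings" in name

# Step 2: unfreeze everything and continue pre-training.
for param in model.parameters():
    param.requires_grad = True
```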
The models are trained on 4 V100 GPUs for 120K steps (20K for step 1, 100K for step 2) with a batch size of 2,048 and a sequence length of 512. The optimizer is Adam with a learning rate of 1e-4, betas of (0.9, 0.98), epsilon of 1e-6, and a weight decay of 0.01, with learning-rate warmup over the first 5K steps and linear decay afterwards.
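Expressed with standard PyTorch and `transformers` utilities, those hyperparameters correspond roughly to the configuration below. This is an illustrative sketch, not the original training script; AdamW is used here as a stand-in for "Adam with weight decay".

```python
from torch.optim import AdamW
from transformers import AutoModelForMaskedLM, get_linear_schedule_with_warmup

# The model being pre-trained (see the previous sketch).
model = AutoModelForMaskedLM.from_pretrained("hfl/chinese-roberta-wwm-ext-large")

# Optimizer with the hyperparameters reported above.
optimizer = AdamW(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.98),
    eps=1e-6,
    weight_decay=0.01,
)

# 5K warmup steps, then linear decay over the full 120K-step schedule.
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=5_000,
    num_training_steps=120_000,
)
```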
### Eval results
"Gulian Cup" Ancient Books Named Entity Recognition Evaluation
Second place in the competition. Detailed test results:
| NE Type | Precision | Recall | F1 |
| --- | --- | --- | --- |
| Book Name | 77.50 | 73.73 | 75.57 |
| Other Name | 85.85 | 89.32 | 87.55 |
| Micro Avg. | 83.88 | 85.39 | 84.63 |
## 🔧 Technical Details
The model is based on the RoBERTa architecture and is pre-trained on a large-scale Classical Chinese corpus. The 2-step pre-training strategy helps the model better capture the language features of Classical Chinese: different parameters are updated in each step, which improves training efficiency and final model performance.
## 📄 License
The project is licensed under the Apache-2.0 license.
## About Us
We are from Datahammer, Beijing Institute of Technology.
For cooperation, please contact us by email: ethanyt [at] qq.com
Created with ❤️ by [Tan Yan](https://github.com/Ethan-yt) and Zewen Chi