# GuwenBERT
GuwenBERT is a RoBERTa model pre-trained on Classical Chinese. It can be fine-tuned for downstream tasks such as sentence breaking, punctuation, and named entity recognition, making it a useful base model for Classical Chinese text processing.
## 🚀 Quick Start
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ethanyt/guwenbert-large")
model = AutoModel.from_pretrained("ethanyt/guwenbert-large")
```
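Because GuwenBERT is pre-trained with masked language modeling, you can also probe it directly through the `fill-mask` pipeline. The snippet below is a minimal sketch, assuming the checkpoint ships with its MLM head and that the tokenizer defines a mask token; the example sentence is only illustrative.

```python
from transformers import pipeline

# Load a fill-mask pipeline; this assumes the checkpoint includes
# the masked-language-modeling head used during pre-training.
fill_mask = pipeline("fill-mask", model="ethanyt/guwenbert-large")

# Mask one character in a Classical Chinese sentence and inspect the
# top predictions. The mask token string is taken from the tokenizer
# rather than hard-coded.
mask = fill_mask.tokenizer.mask_token
for candidate in fill_mask(f"学而时习之，不亦{mask}乎"):
    print(candidate["token_str"], round(candidate["score"], 4))
```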
## ✨ Features
- Pre-trained on Classical Chinese: the model is pre-trained on a large-scale Classical Chinese corpus, so it captures the language characteristics of Classical Chinese better than general-purpose Chinese models.
- Multiple downstream tasks: it can be fine-tuned for various downstream tasks, such as sentence breaking, punctuation, and named entity recognition.
## 📦 Installation
GuwenBERT is used through the Hugging Face `transformers` library for Python. If it is not already installed, install it with `pip install transformers`.
## 💻 Usage Examples
### Basic Usage
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ethanyt/guwenbert-large")
model = AutoModel.from_pretrained("ethanyt/guwenbert-large")
```
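A typical use of the base model is to encode a sentence and take the hidden states as contextual character features. The following is a minimal sketch under that assumption; the input sentence is only an example.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ethanyt/guwenbert-large")
model = AutoModel.from_pretrained("ethanyt/guwenbert-large")
model.eval()

# Encode one Classical Chinese sentence and run a forward pass.
inputs = tokenizer("子曰学而时习之", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Hidden states of the last layer: one vector per token, usable as
# contextual character features for downstream tasks.
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```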
## 📚 Documentation
### Model description

This is a RoBERTa model pre-trained on Classical Chinese. You can fine-tune GuwenBERT for downstream tasks such as sentence breaking, punctuation, and named entity recognition.
For more information about RoBERTa, take a look at RoBERTa's official repository.
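Sentence breaking, punctuation, and named entity recognition can all be framed as character-level (token) classification. The sketch below shows one common way to set this up with `AutoModelForTokenClassification`; the label set is a hypothetical placeholder for a sentence-breaking task and is not part of the original release.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical label set for sentence breaking:
# "O" = no break after this character, "B" = sentence boundary.
labels = ["O", "B"]

tokenizer = AutoTokenizer.from_pretrained("ethanyt/guwenbert-large")
model = AutoModelForTokenClassification.from_pretrained(
    "ethanyt/guwenbert-large",
    num_labels=len(labels),
)

# From here, fine-tune as usual (e.g. with the Trainer API) on
# character-level annotations aligned to the tokenizer's output.
```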
### Training data
The training data is the Daizhige dataset (殆知阁古代文献), which contains 15,694 books in Classical Chinese covering Buddhism, Confucianism, Medicine, History, Zi, Yi, Yizang, Shizang, Taoism, and Jizang. 76% of the books are punctuated. The total number of characters is 1.7B (1,743,337,673). All traditional characters are converted to simplified characters. The vocabulary is constructed from this dataset and has a size of 23,292.
### Training procedure
The models are initialized with hfl/chinese-roberta-wwm-ext-large and then pre-trained with a 2-step strategy. In the first step, the model learns MLM with only the word embeddings updated during training, until convergence. In the second step, all parameters are updated during training.
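The first step amounts to freezing everything except the word embeddings and training with the usual MLM objective. Below is a minimal PyTorch sketch of that freezing logic; the parameter-name check is an assumption about how the embedding weights are named in this architecture, not code from the original training scripts.

```python
from transformers import AutoModelForMaskedLM

# Initialize from the checkpoint named above.
model = AutoModelForMaskedLM.from_pretrained("hfl/chinese-roberta-wwm-ext-large")

# Step 1: keep only the word embeddings trainable and run MLM training
# until convergence (assumes the embedding weights contain
# "word_embeddings" in their parameter name).
for name, param in model.named_parameters():
    param.requires_grad = "word_embeddings" in name

# Step 2: unfreeze everything and continue pre-training.
for param in model.parameters():
    param.requires_grad = True
```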
The models are trained on 4 V100 GPUs for 120K steps (20K for step 1, 100K for step 2) with a batch size of 2,048 and a sequence length of 512. The optimizer is Adam with a learning rate of 1e-4, betas of (0.9, 0.98), epsilon of 1e-6, and a weight decay of 0.01, with learning-rate warmup over the first 5K steps and linear decay afterwards.
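Expressed with standard PyTorch and `transformers` utilities, those hyperparameters correspond roughly to the configuration below. This is an illustrative sketch, not the original training script; AdamW is used here as a stand-in for "Adam with weight decay".

```python
from torch.optim import AdamW
from transformers import AutoModelForMaskedLM, get_linear_schedule_with_warmup

# The model being pre-trained (see the previous sketch).
model = AutoModelForMaskedLM.from_pretrained("hfl/chinese-roberta-wwm-ext-large")

# Optimizer with the hyperparameters reported above.
optimizer = AdamW(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.98),
    eps=1e-6,
    weight_decay=0.01,
)

# 5K warmup steps, then linear decay over the full 120K-step schedule.
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=5_000,
    num_training_steps=120_000,
)
```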
### Eval results
"Gulian Cup" Ancient Books Named Entity Recognition Evaluation
Second place in the competition. Detailed test results:
| NE Type | Precision | Recall | F1 |
| --- | --- | --- | --- |
| Book Name | 77.50 | 73.73 | 75.57 |
| Other Name | 85.85 | 89.32 | 87.55 |
| Micro Avg. | 83.88 | 85.39 | 84.63 |
## 🔧 Technical Details
The model is based on the RoBERTa architecture and is pre-trained on a large-scale Classical Chinese corpus. The 2-step pre-training strategy helps the model better capture the language features of Classical Chinese: different parameters are updated in each step, which improves training efficiency and final model performance.
## 📄 License
The project is licensed under the Apache-2.0 license.
## About Us
We are from Datahammer, Beijing Institute of Technology.
For cooperation, please contact us by email: ethanyt [at] qq.com
Created with ❤️ by [Tan Yan](https://github.com/Ethan-yt) and Zewen Chi