roberta-classical-chinese-large-sentence-segmentation Open Source Model - Achieve Free and Precise Segmentation of Classical Chinese Sentences

Roberta Classical Chinese Large Sentence Segmentation

Developed by KoichiYasuoka

A RoBERTa model pre-trained on classical Chinese texts, specifically designed for sentence segmentation tasks in classical Chinese.

Sequence Labeling

Transformers

Other

Open Source License: Apache-2.0

#Classical Chinese sentence segmentation #Classical Chinese processing #RoBERTa pre-training

Downloads 20

Release Time: 3/2/2022

Model Overview

This model segments continuous, unpunctuated classical Chinese text into sentences by assigning a token class to each character: the first character of a sentence is labeled 'B', the last is labeled 'E', and a single-character sentence is labeled 'S'.
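The B/E/S scheme described above can be turned into a list of sentences with a few lines of plain Python. The following is a minimal sketch, not part of the original model card: the helper name `split_sentences` is hypothetical, the labels are written by hand rather than produced by the model, and the label "M" for the remaining characters is a placeholder assumption (the code only checks for "E" and "S").

```python
def split_sentences(chars, labels):
    """Group characters into sentences; a sentence closes on label "E" or "S"."""
    sentences, current = [], []
    for ch, label in zip(chars, labels):
        current.append(ch)
        if label in ("E", "S"):
            sentences.append("".join(current))
            current = []
    if current:  # trailing characters that never received a closing label
        sentences.append("".join(current))
    return sentences


# Hand-written labels for illustration only; "M" is a placeholder label.
text = "子曰學而時習之"
labels = ["B", "E", "B", "M", "M", "M", "E"]
print(split_sentences(text, labels))  # ['子曰', '學而時習之']
```

Returning a list rather than a punctuated string keeps the sentences easy to write out one per line or pass to later processing steps.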

Model Features

Specialized for Classical Chinese

Optimized specifically for classical Chinese texts, effectively handling the unique grammatical structures and expressions of ancient Chinese.

Accurate Sentence Segmentation

Uses a B/E/S tagging system to accurately identify sentence boundaries in classical Chinese.

Based on RoBERTa Architecture

Built on the RoBERTa architecture and derived from the roberta-classical-chinese-large-char base model.

Model Capabilities

Classical Chinese processing

Sentence boundary recognition

Text segmentation

Use Cases

Ancient text digitization

Automatic segmentation of ancient texts

Automatically segments unsegmented ancient literature into complete sentences

Improves the efficiency and accuracy of ancient text digitization

Academic research

Construction of classical Chinese corpora

Provides linguists with pre-processed segmented texts

Facilitates subsequent lexical analysis and grammatical research

🚀 roberta-classical-chinese-large-sentence-segmentation

This is a RoBERTa model for sentence segmentation of Classical Chinese texts. It builds on a pre-trained base model to place sentence boundaries in unpunctuated text.

🚀 Quick Start

This RoBERTa model is pre-trained on Classical Chinese texts for sentence segmentation. It is derived from roberta-classical-chinese-large-char. Every segmented sentence begins with token-class "B" and ends with token-class "E", except for single-character sentences, which receive token-class "S".

✨ Features

Language and Tags: The language is Classical Chinese (code lzh); tags include "classical chinese", "literary chinese", "ancient chinese", "sentence segmentation", and "token-classification".
Base Model: Built upon the KoichiYasuoka/roberta-classical-chinese-large-char base model.
License: Licensed under the "apache-2.0" license.
Pipeline Tag: It belongs to the "token-classification" pipeline.

📦 Installation

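The original model card lists no installation steps. The usage example below imports torch and transformers, so both packages must be installed first (for example with pip install torch transformers).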

💻 Usage Examples

Basic Usage

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
name = "KoichiYasuoka/roberta-classical-chinese-large-sentence-segmentation"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(name)
s = "子曰學而時習之不亦説乎有朋自遠方來不亦樂乎人不知而不慍不亦君子乎"
# Run the model without tracking gradients (inference only)
with torch.no_grad():
    logits = model(tokenizer.encode(s, return_tensors="pt"))["logits"]
# Take the best label per position; [1:-1] drops the first and last positions (special tokens)
p = [model.config.id2label[q] for q in torch.argmax(logits, dim=2)[0].tolist()[1:-1]]
# Append "。" after each character labeled "E" (sentence end) or "S" (single-character sentence)
print("".join(c + "。" if q in ("E", "S") else c for c, q in zip(s, p)))
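The example above encodes the whole string in a single call. For book-length texts, the input may need to be split into pieces before it is passed to the model. The sketch below shows one simple way to do that. It is not from the original model card: the helper name `chunk_text` and the `max_chars` default of 500 are assumptions, since the source does not state the model's input length limit, and a sentence that straddles a chunk boundary may be segmented incorrectly.

```python
def chunk_text(text, max_chars=500):
    """Yield consecutive slices of text, each at most max_chars characters long.

    max_chars=500 is an assumed value, not a documented limit of the model.
    """
    if max_chars < 1:
        raise ValueError("max_chars must be positive")
    for start in range(0, len(text), max_chars):
        yield text[start:start + max_chars]


# Small demonstration with an artificially low limit.
chunks = list(chunk_text("子曰學而時習之不亦説乎", max_chars=4))
print(chunks)  # ['子曰學而', '時習之不', '亦説乎']
```

Each chunk can then be processed like the string s in the example above, and the results concatenated in order.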

📚 Documentation

Reference

Koichi Yasuoka: Sentence Segmentation of Classical Chinese Texts Using Transformers and BERT/RoBERTa Models, IPSJ Symposium Series, Vol.2021, No.1 (December 2021), pp.104-109.

📄 License

This project is licensed under the "apache-2.0" license.

Property	Details
Language	lzh
Tags	classical chinese, literary chinese, ancient chinese, sentence segmentation, token-classification
Model Type	RoBERTa
Base Model	KoichiYasuoka/roberta-classical-chinese-large-char
License	apache-2.0
Pipeline Tag	token-classification
