🚀 Chinese Pretrained Longformer Model | Longformer_ZH with PyTorch
This project provides a pre-trained Chinese Longformer model. In contrast to the O(n^2) complexity of the standard Transformer, Longformer processes document-level long sequences with linear complexity. Its attention mechanism combines local windowed self-attention with global attention, helping the model learn from ultra-long sequences. Because resources for Chinese Longformer models and long-sequence Chinese tasks are scarce, we open-source our pre-trained model parameters together with the corresponding loading code and pre-training scripts.
🚀 Quick Start
✨ Features
- Efficient Processing: Handles document-level long sequences with linear complexity, in contrast to the O(n^2) complexity of the standard Transformer.
- Combined Attention Mechanism: Integrates local windowed attention with global attention to better capture information from long sequences (a minimal sketch follows this list).
- Chinese Adaptation: Pre-trained specifically for Chinese tasks, with the Whole-Word-Masking mechanism introduced for a better fit to the language.
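To make the combined attention pattern concrete, here is a minimal, hypothetical sketch built on the standard transformers Longformer classes; the vocabulary size, window width, and sequence length below are illustrative and do not describe the released checkpoint.

```python
import torch
from transformers import LongformerConfig, LongformerModel

# Illustrative configuration only (randomly initialized, not the released weights):
# a sliding window of 128 tokens keeps attention cost at O(n * w) instead of O(n^2).
config = LongformerConfig(
    vocab_size=21128,              # assumed BERT-style Chinese vocab size
    attention_window=128,
    max_position_embeddings=4098,
)
model = LongformerModel(config)

input_ids = torch.randint(100, config.vocab_size, (1, 1024))  # avoid special token ids
attention_mask = torch.ones_like(input_ids)

# Give the first ([CLS]) token global attention so it can attend to, and be
# attended by, every position in the sequence.
global_attention_mask = torch.zeros_like(input_ids)
global_attention_mask[:, 0] = 1

with torch.no_grad():
    outputs = model(
        input_ids,
        attention_mask=attention_mask,
        global_attention_mask=global_attention_mask,
    )
print(outputs.last_hidden_state.shape)  # torch.Size([1, 1024, 768])
```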
📦 Installation
You can download our model from Google Drive or Baidu Yun:
- Google Drive: https://drive.google.com/file/d/1IDJ4aVTfSFUQLIqCYBtoRpnfbgHPoxB4/view?usp=sharing
- Baidu Yun: https://pan.baidu.com/s/1HaVDENx52I7ryPFpnQmq1w (extraction code: y601)
We also support automatic loading with HuggingFace Transformers:
```python
from Longformer_zh import LongformerZhForMaksedLM

model = LongformerZhForMaksedLM.from_pretrained('ValkyriaLenneth/longformer_zh')
```
⚠️ Important Note
- Please use `transformers.LongformerModel.from_pretrained` to load the model directly (see the loading sketch after this list).
- The notes below are obsolete; please ignore them: Unlike the original English Longformer, Longformer_Zh is based on Roberta_zh, which is a subclass of `transformers.BertModel` rather than `RobertaModel`, so it cannot be loaded directly with the original Longformer code. We provide a modified `Longformer_zh` class for loading the model. If you want to use our model on more downstream tasks, please refer to `Longformer_zh.py` and replace the attention layers with Longformer attention layers.
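A minimal loading sketch following the note above. The tokenizer class is an assumption: Longformer_ZH is derived from Roberta_zh, which uses a BERT-style Chinese vocabulary, so a BertTokenizer is used here; substitute whatever tokenizer files ship with the checkpoint you downloaded.

```python
import torch
from transformers import LongformerModel, BertTokenizer

# Load the published weights directly with the stock Longformer class,
# as recommended above. The tokenizer choice is an assumption (BERT-style vocab).
model = LongformerModel.from_pretrained("ValkyriaLenneth/longformer_zh")
tokenizer = BertTokenizer.from_pretrained("ValkyriaLenneth/longformer_zh")

text = "这是一个用于超长中文文本编码的示例。" * 200   # a long document
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)

with torch.no_grad():
    outputs = model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
    )
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```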
🔧 Technical Details
- Pretraining Corpus: The pre-training corpus comes from https://github.com/brightmart/nlp_chinese_corpus. Following the Longformer paper, we use a mixture of four different Chinese corpora.
- Model Baseline: Our model is based on Roberta_zh_mid (https://github.com/brightmart/roberta_zh). The pre-training scripts are adapted from https://github.com/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb.
- Whole-Word-Masking: We introduce Whole-Word-Masking into pre-training to better fit the Chinese language. Our WWM scripts are refactored from Roberta_zh_Tensorflow and are, as far as we know, the first open-source whole-word-masking implementation in PyTorch (a simplified sketch follows this list).
- Model Parameters: The model uses `max_seq_length = 4096`. Pre-training took about 4 days on 4 * Titan RTX GPUs. We used Nvidia Apex for mixed-precision training to speed up pre-training. For data pre-processing, we used Jieba for Chinese word segmentation and JIONLP for data cleaning.
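To illustrate the idea, here is a simplified whole-word-masking sketch (not the project's actual pre-training script): Jieba segments the sentence into words, and all sub-tokens of a selected word are masked together. The bert-base-chinese tokenizer is only a stand-in vocabulary for this example.

```python
import random
import jieba
from transformers import BertTokenizer

# Stand-in vocabulary for illustration; the real script uses the project's own vocab.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

def whole_word_mask(text, mask_prob=0.15):
    """Mask every sub-token of a sampled word together, instead of masking
    individual characters/sub-tokens independently."""
    tokens, mask_flags = [], []
    for word in jieba.cut(text):                  # Chinese word segmentation
        pieces = tokenizer.tokenize(word)
        mask_whole_word = random.random() < mask_prob
        for piece in pieces:
            tokens.append(tokenizer.mask_token if mask_whole_word else piece)
            mask_flags.append(mask_whole_word)
    return tokens, mask_flags

tokens, flags = whole_word_mask("超长文本建模需要高效的注意力机制")
print(tokens)
```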
📚 Documentation
Evaluation
We conducted evaluations on several tasks:
CCF Sentiment Analysis
Since open-source long-sequence Chinese NLP benchmarks are hard to come by, we used the CCF Sentiment Analysis task for evaluation.
| Model | Dev F1 |
| --- | --- |
| Bert | 80.3 |
| Bert-wwm-ext | 80.5 |
| Roberta-mid | 80.5 |
| Roberta-large | 81.25 |
| Longformer_SC | 79.37 |
| Longformer_ZH | 80.51 |
Pretraining BPC
We also report the pre-training BPC (bits-per-character). The lower the BPC, the better the language model; it can be read as a character-level analogue of perplexity (PPL), as the conversion sketch after the table shows.
| Model | BPC |
| --- | --- |
| Longformer before pre-training | 14.78 |
| Longformer after pre-training | 3.10 |
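For reference, BPC is character-level cross-entropy expressed in bits, so it converts directly to a per-character perplexity; the arithmetic below is only a reading aid, not part of the released evaluation code.

```python
import math

# BPC = cross-entropy in nats / ln(2); per-character perplexity = 2 ** BPC.
def bpc_from_nats(cross_entropy_nats: float) -> float:
    return cross_entropy_nats / math.log(2)

def char_perplexity(bpc: float) -> float:
    return 2.0 ** bpc

print(char_perplexity(14.78))  # before pre-training: ~2.8e4
print(char_perplexity(3.10))   # after pre-training:  ~8.6
```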
CMRC (Chinese Machine Reading Comprehension)
| Model | F1 | EM |
| --- | --- | --- |
| Bert | 85.87 | 64.90 |
| Roberta | 86.45 | 66.57 |
| Longformer_zh | 86.15 | 66.84 |
Chinese Coreference Resolution
| Model | Conll-F1 | Precision | Recall |
| --- | --- | --- | --- |
| Bert | 66.82 | 70.30 | 63.67 |
| Roberta | 67.77 | 69.28 | 66.32 |
| Longformer_zh | 67.81 | 70.13 | 65.64 |
Acknowledgments
Thanks to the Okumura-Funakoshi Lab at the Tokyo Institute of Technology for providing the computing resources and the opportunity to complete this project.