# XLM-R Longformer Model
XLM-R Longformer is an extended XLM-R model that supports sequence lengths up to 4096 tokens instead of the typical 512. It was pre-trained from the XLM-RoBERTa checkpoint using the Longformer pre-training scheme on the English WikiText-103 corpus. The goal was to explore methods for creating efficient Transformers for low-resource languages, such as Swedish, without pre-training on long-context datasets in each respective language. The trained model is the outcome of a master's thesis project at Peltarion and was fine-tuned on multilingual question-answering tasks. The code is available [here](https://github.com/MarkusSagen/Master-Thesis-Multilingual-Longformer#xlm-r).
## Features
- Extended XLM-R model supporting sequence lengths up to 4096 tokens.
- Pre-trained on the English WikiText-103 corpus.
- Fine-tuned on multilingual question-answering tasks.
## Installation
The original README does not provide installation details. To use the model, you need the required Python libraries, such as `torch` and `transformers`, which can be installed with `pip`:

```bash
pip install torch transformers
```
## Usage Examples
### Basic Usage
```python
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

MAX_SEQUENCE_LENGTH = 4096
MODEL_NAME_OR_PATH = "markussagen/xlm-roberta-longformer-base-4096"

# Tokenizer configured for the extended 4096-token context
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME_OR_PATH,
    max_length=MAX_SEQUENCE_LENGTH,
    padding="max_length",
    truncation=True,
)

# Question-answering model loaded with the extended maximum length
model = AutoModelForQuestionAnswering.from_pretrained(
    MODEL_NAME_OR_PATH,
    max_length=MAX_SEQUENCE_LENGTH,
)
```
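
Once loaded, the tokenizer and model can be used for extractive question answering in the usual `transformers` way. The snippet below is an illustrative sketch that is not part of the original README; the question and context strings are made-up examples, and it assumes the checkpoint carries a fine-tuned question-answering head.

```python
# Hypothetical usage example (not from the original README): encode a
# question/context pair and decode the predicted answer span.
question = "Where was the model trained?"
context = "The model was pre-trained on the English WikiText-103 corpus."

inputs = tokenizer(
    question,
    context,
    return_tensors="pt",
    truncation=True,
    max_length=MAX_SEQUENCE_LENGTH,
)

with torch.no_grad():
    outputs = model(**inputs)

# Most likely start and end positions of the answer span
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())

answer_tokens = inputs["input_ids"][0][start : end + 1]
print(tokenizer.decode(answer_tokens, skip_special_tokens=True))
```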
## Documentation
### Training Procedure
The model was trained on the WikiText-103 corpus using a 48GB GPU with the following training script and parameters. The model was pre-trained for 6000 iterations, which took approximately 5 days. See the full [training script](https://github.com/MarkusSagen/Master-Thesis-Multilingual-Longformer/blob/main/scripts/finetune_qa_models.py) and the [GitHub repo](https://github.com/MarkusSagen/Master-Thesis-Multilingual-Longformer) for more information.
```bash
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip
export DATA_DIR=./wikitext-103-raw

scripts/run_long_lm.py \
    --model_name_or_path xlm-roberta-base \
    --model_name xlm-roberta-to-longformer \
    --output_dir ./output \
    --logging_dir ./logs \
    --val_file_path $DATA_DIR/wiki.valid.raw \
    --train_file_path $DATA_DIR/wiki.train.raw \
    --seed 42 \
    --max_pos 4096 \
    --adam_epsilon 1e-8 \
    --warmup_steps 500 \
    --learning_rate 3e-5 \
    --weight_decay 0.01 \
    --max_steps 6000 \
    --evaluate_during_training \
    --logging_steps 50 \
    --eval_steps 50 \
    --save_steps 6000 \
    --max_grad_norm 1.0 \
    --per_device_eval_batch_size 2 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 64 \
    --overwrite_output_dir \
    --fp16 \
    --do_train \
    --do_eval
```
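
The script converts the XLM-R checkpoint to a long-context model before continuing masked language model pre-training on WikiText-103. As an illustration of the core idea, the sketch below shows how the 512-token position embeddings of `xlm-roberta-base` can be tiled out to 4096 positions; it follows the publicly documented Longformer "convert to long" recipe and is not the exact code from the linked repository (the output paths and helper logic here are assumptions).

```python
# Illustrative sketch only: extend XLM-R's learned position embeddings from
# 512 to 4096 positions, as in the general Longformer conversion recipe.
# This is NOT the exact script used to produce this model.
import torch
from transformers import XLMRobertaForMaskedLM, XLMRobertaTokenizerFast

MAX_POS = 4096

model = XLMRobertaForMaskedLM.from_pretrained("xlm-roberta-base")
tokenizer = XLMRobertaTokenizerFast.from_pretrained(
    "xlm-roberta-base", model_max_length=MAX_POS
)

embeddings = model.roberta.embeddings

with torch.no_grad():
    old_pos_embed = embeddings.position_embeddings.weight  # (512 + 2, hidden_size)

    # RoBERTa-style models reserve the first two position ids, so the table has
    # 512 + 2 rows; the extended table needs 4096 + 2 rows.
    old_max_pos, embed_dim = old_pos_embed.shape
    new_max_pos = MAX_POS + 2

    new_pos_embed = old_pos_embed.new_empty(new_max_pos, embed_dim)
    new_pos_embed[:2] = old_pos_embed[:2]

    # Tile the learned 512 position vectors until all 4096 positions are filled.
    k, step = 2, old_max_pos - 2
    while k < new_max_pos:
        chunk = min(step, new_max_pos - k)
        new_pos_embed[k : k + chunk] = old_pos_embed[2 : 2 + chunk]
        k += chunk

    # Swap in the extended embedding table and record the new maximum length.
    embeddings.position_embeddings = torch.nn.Embedding.from_pretrained(
        new_pos_embed, freeze=False, padding_idx=model.config.pad_token_id
    )
    model.config.max_position_embeddings = new_max_pos

# The full conversion also replaces each self-attention module with
# Longformer's sliding-window attention before continuing masked language
# model pre-training on WikiText-103; see the linked repository for details.
model.save_pretrained("./xlm-roberta-long-4096")
tokenizer.save_pretrained("./xlm-roberta-long-4096")
```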
## Technical Details
Since both XLM-R and Longformer are large models, it is recommended to run them with NVIDIA Apex (16-bit precision), a large GPU, and several gradient accumulation steps.
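
For example, with the Hugging Face `Trainer`, 16-bit precision and gradient accumulation can be enabled through `TrainingArguments`. The snippet below is an illustrative sketch that is not part of the original README; the values simply mirror the flags in the training script above.

```python
# Illustrative only: fp16 and gradient accumulation via TrainingArguments,
# with values mirroring the training script above.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    fp16=True,                       # 16-bit mixed precision (Apex or native AMP)
    per_device_train_batch_size=1,   # small per-step batch to fit 4096-token inputs
    gradient_accumulation_steps=64,  # effective batch size of 1 x 64 per device
    learning_rate=3e-5,
    weight_decay=0.01,
    warmup_steps=500,
    max_steps=6000,
    max_grad_norm=1.0,
    seed=42,
)
```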
## License
This model is released under the Apache-2.0 license.

| Property | Details |
|----------|---------|
| Model Type | XLM-R Longformer |
| Training Data | WikiText-103 |
| License | Apache-2.0 |