Ru Longformer Base 4096
This is a base-size Longformer model for Russian that supports a context length of up to 4096 tokens. It was initialized from the weights of blinoff/roberta-base-russian-v0 and fine-tuned on a dataset of Russian books.
Release Time: 7/11/2023
Model Overview
This is a Transformer model designed for processing long Russian text sequences. It can be used to generate text embeddings or be fine-tuned for specific downstream tasks.
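A minimal usage sketch, assuming the checkpoint is published on the Hugging Face Hub as kazzand/ru-longformer-base-4096 and follows the standard Longformer interface in transformers; it encodes a long Russian text and takes the [CLS] vector as a document embedding:

```python
import torch
from transformers import LongformerModel, LongformerTokenizerFast

# Hub id assumed for illustration; adjust to the actual checkpoint location.
MODEL_ID = "kazzand/ru-longformer-base-4096"

tokenizer = LongformerTokenizerFast.from_pretrained(MODEL_ID)
model = LongformerModel.from_pretrained(MODEL_ID)
model.eval()

def get_cls_embedding(text: str) -> torch.Tensor:
    """Encode a (possibly very long) Russian text and return its [CLS] embedding."""
    batch = tokenizer(text, truncation=True, max_length=4096, return_tensors="pt")
    # Longformer applies sliding-window attention by default; the [CLS] token
    # is given global attention so it can aggregate the whole document.
    global_attention_mask = torch.zeros_like(batch["input_ids"])
    global_attention_mask[:, 0] = 1
    with torch.no_grad():
        output = model(**batch, global_attention_mask=global_attention_mask)
    return output.last_hidden_state[:, 0, :]  # shape: (1, hidden_size)

embedding = get_cls_embedding("Очень длинный русский текст о чём угодно.")
print(embedding.shape)
```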
Model Features
Ultra-long context support
Processes text sequences of up to 4096 tokens, making it well suited to long Russian documents
Efficient attention mechanism
Uses Longformer's sparse attention (sliding-window attention plus selective global attention), which scales much better than full self-attention on long sequences
Russian optimization
Initialized from a Russian RoBERTa model and fine-tuned on a dataset of Russian books
Multi-layer Transformer architecture
A 12-layer Transformer encoder with 12 attention heads per layer (see the configuration sketch after this list)
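The architecture claims above can be checked directly from the checkpoint configuration. A small sketch, assuming the same hub id as above; the expected values reflect what this card states:

```python
from transformers import AutoConfig

# Load the configuration of the (assumed) checkpoint and inspect its architecture.
config = AutoConfig.from_pretrained("kazzand/ru-longformer-base-4096")

print(config.num_hidden_layers)        # expected: 12
print(config.num_attention_heads)      # expected: 12
print(config.attention_window)         # per-layer sliding-window sizes
print(config.max_position_embeddings)  # enough positions for 4096-token inputs
```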
Model Capabilities
Russian text understanding
Long text sequence processing
Text embedding generation
Masked language modeling
Use Cases
Text processing
Russian document embedding
Generate high-quality embedding representations for long Russian documents
Can be used for downstream tasks such as document retrieval and classification
Russian text completion
Use the model's masked language modeling head to fill in missing words in Russian text (see the sketch below)
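A fill-mask sketch for the completion use case, again assuming the hub id used above and that the checkpoint ships a masked language modeling head:

```python
from transformers import pipeline

# Assumed hub id; the model must expose a masked-LM head for this pipeline.
fill = pipeline("fill-mask", model="kazzand/ru-longformer-base-4096")

mask = fill.tokenizer.mask_token  # "<mask>" for RoBERTa-style tokenizers
for prediction in fill(f"Москва - столица {mask}."):
    print(prediction["token_str"], round(prediction["score"], 3))
```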