# 🚀 Russian Text Summarization Model - LaciaSUM V1 (small)

This model is a fine-tuned version of `d0rj/rut5-base-summ` for automatic text summarization. It is specifically adapted for Russian texts and fine-tuned on a custom CSV dataset containing original texts and their summaries.
## ✨ Features

- Objective: Automatic abstractive summarization of texts.
- Base Model: `d0rj/rut5-base-summ`.
- Dataset: A custom CSV file with columns `Text` (original text) and `Summarize` (summary).
- Preprocessing: Before tokenization, the prefix `summarize: ` is added to the original text to help the model focus on summarization.
- Training Settings (a sketch of these as training arguments follows this list):
  - Number of epochs: 9.
  - Batch size: 4 per device.
  - Warmup steps: 1000.
  - FP16 training: enabled if CUDA is available.
- Hardware: Trained on an RTX 3070 (about 40 minutes of training).
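The settings above map naturally onto `Seq2SeqTrainingArguments` from Transformers. A minimal sketch, assuming standard `Seq2SeqTrainer` usage; `output_dir` and any argument not listed above are assumptions, not the authors' values:

```python
import torch
from transformers import Seq2SeqTrainingArguments

# Sketch of the training configuration described above;
# output_dir and omitted arguments are assumptions.
training_args = Seq2SeqTrainingArguments(
    output_dir="./lacia_sum_small_v1",  # hypothetical output path
    num_train_epochs=9,                 # 9 epochs
    per_device_train_batch_size=4,      # batch size of 4 per device
    warmup_steps=1000,                  # 1000 warmup steps
    fp16=torch.cuda.is_available(),     # FP16 when CUDA is available
)
```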
## 📦 Installation

No specific installation steps are provided in the original README. To use this model, install the `transformers` library used in the example code:

```bash
pip install transformers
```
## 💻 Usage Examples

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("LaciaStudio/Lacia_sum_small_v1")
model = AutoModelForSeq2SeqLM.from_pretrained("LaciaStudio/Lacia_sum_small_v1")

text = "Современные технологии оказывают значительное влияние на нашу повседневную жизнь и рабочие процессы. Искусственный интеллект становится важным инструментом, помогающим оптимизировать задачи и открывающим новые перспективы в различных областях."

# Prepend the same "summarize: " prefix used during training
input_text = "summarize: " + text

inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)

# Generate the summary with beam search
summary_ids = model.generate(inputs["input_ids"], max_length=150, num_beams=4, early_stopping=True)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Summary:", summary)
```
### Advanced Usage

The basic example above covers the main workflow. To adjust the output, modify generation parameters such as `max_length` and `num_beams`, as in the sketch below.
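A hedged sketch of tuned generation, continuing from the basic example (`model`, `inputs`, and `tokenizer` are reused); the parameter values are illustrative, not recommendations from the model authors:

```python
# Reuses model, inputs, and tokenizer from the basic example above.
summary_ids = model.generate(
    inputs["input_ids"],
    max_length=200,          # allow longer summaries
    min_length=30,           # avoid overly short outputs
    num_beams=8,             # wider beam search than the basic example
    length_penalty=1.2,      # mildly favor longer sequences
    no_repeat_ngram_size=3,  # suppress repeated trigrams
    early_stopping=True,     # stop when all beams are finished
)
print("Summary:", tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```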
## 📚 Documentation

The model was fine-tuned using the Transformers library along with the `Seq2SeqTrainer` from Hugging Face. The training script includes:

- Custom Dataset: The `SummarizationDataset` class reads the CSV file (ensuring correct encoding and separator), trims extra spaces from column names, and tokenizes both the source text and the target summary.
- Token Processing: To improve loss computation, padding tokens in the target text are replaced with `-100`, as shown in the sketch below.
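The training script itself is not published, so the following is only a minimal sketch of what such a dataset class might look like, assuming pandas and PyTorch; the file encoding, separator, and maximum length are assumptions:

```python
import pandas as pd
import torch
from torch.utils.data import Dataset

# Hypothetical reconstruction of the SummarizationDataset described above.
class SummarizationDataset(Dataset):
    def __init__(self, csv_path, tokenizer, max_length=512):
        # Encoding and separator are assumptions; the README only
        # states that they are handled correctly.
        df = pd.read_csv(csv_path, encoding="utf-8", sep=",")
        df.columns = df.columns.str.strip()  # trim extra spaces from column names
        self.texts = df["Text"].tolist()
        self.summaries = df["Summarize"].tolist()
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Tokenize the source text with the training prefix
        source = self.tokenizer(
            "summarize: " + self.texts[idx],
            max_length=self.max_length, truncation=True,
            padding="max_length", return_tensors="pt",
        )
        # Tokenize the target summary
        target = self.tokenizer(
            self.summaries[idx],
            max_length=self.max_length, truncation=True,
            padding="max_length", return_tensors="pt",
        )
        labels = target["input_ids"].squeeze(0)
        # Replace padding token ids with -100 so they are ignored by the loss
        labels[labels == self.tokenizer.pad_token_id] = -100
        return {
            "input_ids": source["input_ids"].squeeze(0),
            "attention_mask": source["attention_mask"].squeeze(0),
            "labels": labels,
        }
```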
This model is suitable for rapid prototyping and practical applications in automatic summarization of Russian documents, news articles, and other text formats.
## ⚠️ Important Note

The model also accepts English input, but English support has not been tested.
## 📄 License

This model is released under the `cc-by-nc-4.0` license.
## 📄 Information Table

| Property | Details |
|----------|---------|
| Model Type | Fine-tuned `d0rj/rut5-base-summ` for text summarization |
| Training Data | A custom CSV file with columns `Text` (original text) and `Summarize` (summary) |
| Pipeline Tag | summarization |
| Tags | summarization, natural-language-processing, text-summarization, machine-learning, deep-learning, transformer, artificial-intelligence, text-analysis, sequence-to-sequence, pytorch, tensorflow, safetensors, t5 |
| Library Name | Transformers |