FRED-T5-large-instruct-v0.1 Open-source Model - Free Deployment for Russian Text Editing and Question Answering

FRED T5 Large Instruct V0.1

Developed by bond005

FRED-T5-large-instruct-v0.1 is a Russian text auto-editing and question-answering model based on PyTorch and Transformers, primarily used for various Russian text processing tasks.

Large Language Model

Transformers

License: Apache-2.0

Tags: Russian text editing, speech recognition error correction, named entity recognition

Downloads 173

Release date: 4/1/2024

Model Overview

This model was developed by bond005 for automatic editing and question-answering of Russian text, supporting functions such as speech recognition error correction, text summarization, paragraph segmentation, text simplification, named entity recognition, and more.

Model Features

Speech recognition error correction

Correct errors in speech recognition text and restore punctuation and capitalization.

Text summarization

Generate abstractive summaries of long texts, extracting core ideas.

Text simplification

Rewrite complex sentences into simpler, more understandable forms.

Named entity recognition

Identify people, geographical locations, and organizations in text.

General question answering

Answer various questions and execute instructions.

Model Capabilities

Speech recognition error correction

Text summarization

Paragraph segmentation

Text simplification

Named entity recognition

General question answering

Use Cases

Text processing

Speech recognition error correction

Input: raw speech recognition output. Result: the same text with recognition errors corrected and punctuation and capitalization restored.

Text summarization

Input: a long text. Result: a concise abstractive summary that retains the core ideas of the original.

Information extraction

Named entity recognition

Input: any text. Result: a list of the people, geographical locations, or organizations mentioned in it.

Question answering system

General question answering

Input: a question or an instruction. Result: a generated answer or the completed instruction.

🚀 FRED-T5-large-instruct-v0.1

The FRED-T5-large-instruct-v0.1 model, trained by bond005, is designed for automatically editing text and generating answers to various questions in Russian. It can handle the following tasks:

asr_correction: Correct errors and restore punctuation and capitalization in ASR output (specifically, the output of Wav2Vec2-Large-Ru-Golos).
summarization: Perform abstractive summarization of long texts.
segmentation: Divide long text into paragraphs using the \n character as a special separator.
simplification: Transform a source sentence to make it easier to read and comprehend.
ner_organization: A variant of the "classical" named entity recognition task, aiming to find and list all organizations in the text, with each organization listed on a new line.
ner_person: A variant of the "classical" named entity recognition task, designed to find and list all persons in the text, with each person listed on a new line.
ner_location: A variant of the "classical" named entity recognition task, used to find and list all locations in the text, with each location listed on a new line.
Answering arbitrary questions and completing various instructions.
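
The ner_* tasks return their result as plain text with one entity per line. A minimal sketch of turning that text into a Python list, assuming the decoded model output is already available as a string. parse_entity_list is a hypothetical helper that is not part of the model's code, and the sample string is made up for illustration.

```python
from typing import List


def parse_entity_list(model_output: str) -> List[str]:
    """Split newline-separated NER output into a list of unique entity strings."""
    entities: List[str] = []
    seen = set()
    for line in model_output.splitlines():
        entity = line.strip()
        # Skip blank lines and repeated entities, keeping first-seen order.
        if entity and entity not in seen:
            seen.add(entity)
            entities.append(entity)
    return entities


# Hypothetical decoded output for a ner_organization prompt (not real model output).
sample_output = 'Сбер\nЯндекс\n\nСбер\n'
print(parse_entity_list(sample_output))
```

Deduplication is a design choice of this sketch; drop the `seen` check if repeated mentions should be kept.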

✨ Features

Automatically edit text and generate answers in Russian.
Capable of handling multiple natural language processing tasks.

📦 Installation

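The original README gives no installation steps. The usage examples import torch and transformers, so both packages need to be installed (for example, with pip install torch transformers).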

💻 Usage Examples

Basic Usage

The following table shows the solved tasks and their corresponding instruction texts in Russian:

Solved Task	Instruction Text (in Russian)
asr_correction	Исправь, пожалуйста, ошибки распознавания речи в следующем тексте.
summarization	Выполни саммаризацию и выдели, пожалуйста, основную мысль следующего текста.
segmentation	Разбей, пожалуйста, следующий текст на абзацы.
simplification	Упрости, пожалуйста, следующий текст.
ner_person	Найди, пожалуйста, все именованные сущности типа "Человек" в следующем тексте и выпиши список таких сущностей.
ner_location	Найди, пожалуйста, все именованные сущности типа "Местоположение" в следующем тексте и выпиши список таких сущностей.
ner_organization	Найди, пожалуйста, все именованные сущности типа "Организация" в следующем тексте и выпиши список таких сущностей.
Arbitrary questions	The text of the question itself (no fixed instruction)

A Colab notebook accompanying the model contains code examples for all of the tasks above.
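
Each prompt is the instruction text followed by a space and the input text, as in the examples under Advanced Usage. A minimal sketch of building such prompts from the table, assuming a hypothetical build_prompt helper and INSTRUCTIONS dictionary (neither is part of the model's code):

```python
from typing import Optional

# Instruction texts copied from the table of solved tasks.
INSTRUCTIONS = {
    'asr_correction': 'Исправь, пожалуйста, ошибки распознавания речи в следующем тексте.',
    'summarization': 'Выполни саммаризацию и выдели, пожалуйста, основную мысль следующего текста.',
    'segmentation': 'Разбей, пожалуйста, следующий текст на абзацы.',
    'simplification': 'Упрости, пожалуйста, следующий текст.',
    'ner_person': 'Найди, пожалуйста, все именованные сущности типа "Человек" '
                  'в следующем тексте и выпиши список таких сущностей.',
    'ner_location': 'Найди, пожалуйста, все именованные сущности типа "Местоположение" '
                    'в следующем тексте и выпиши список таких сущностей.',
    'ner_organization': 'Найди, пожалуйста, все именованные сущности типа "Организация" '
                        'в следующем тексте и выпиши список таких сущностей.',
}


def build_prompt(task: Optional[str], text: str) -> str:
    """Return the model input for a task; task=None means an arbitrary question."""
    if task is None:
        # Arbitrary questions have no fixed instruction: the question is the prompt.
        return text.strip()
    if task not in INSTRUCTIONS:
        raise ValueError(f'Unknown task: {task}')
    return INSTRUCTIONS[task] + ' ' + text.strip()


print(build_prompt('simplification', 'Пример текста.'))
```

The resulting string can be passed in a list to the helper functions shown under Advanced Usage.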

Advanced Usage

ASR Correction

from typing import List

from transformers import T5ForConditionalGeneration
from transformers import GenerationConfig
from transformers import GPT2Tokenizer
import torch


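def fix_recognition_error(texts: List[str], tokenizer: GPT2Tokenizer, config: GenerationConfig,
                          model: T5ForConditionalGeneration) -> List[str]:
    # Texts of 3 characters or fewer (after stripping) are not sent to the model.
    nonempty_texts = []
    for cur in texts:
        if len(cur.strip()) > 3:
            nonempty_texts.append(cur.strip())
    if len(nonempty_texts) == 0:
        # Nothing to correct: return the stripped inputs, as the merge loop below would.
        return [cur.strip() for cur in texts]
    x = tokenizer(nonempty_texts, return_tensors='pt', padding=True).to(model.device)
    # Allow up to twice the padded input length, plus a small margin, for the output.
    max_size = int(x.input_ids.shape[1] * 2.0 + 10)
    out = model.generate(**x, generation_config=config, max_length=max_size)
    # Decode each sequence and collapse runs of whitespace into single spaces.
    results_for_nonempty_texts = [
        ' '.join(tokenizer.decode(cur, skip_special_tokens=True).strip().split()) for cur in out
    ]
    # Put corrected texts back at their original positions; short texts pass through.
    united_results = []
    idx = 0
    for cur in texts:
        if len(cur.strip()) > 3:
            united_results.append(results_for_nonempty_texts[idx])
            idx += 1
        else:
            united_results.append(cur.strip())
    return united_results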


ru_llm_tokenizer = GPT2Tokenizer.from_pretrained('bond005/FRED-T5-large-instruct-v0.1')
ru_llm_model = T5ForConditionalGeneration.from_pretrained('bond005/FRED-T5-large-instruct-v0.1')
ru_llm_config = GenerationConfig.from_pretrained('bond005/FRED-T5-large-instruct-v0.1')
if torch.cuda.is_available():
    ru_llm_model = ru_llm_model.cuda()

asr_correction_example = \
    'Исправь, пожалуйста, ошибки распознавания речи в следующем тексте. ' \
    'краеугольным камнем любышь алгоритных машиного обучения является преждес его ' \
    'обобщающая способности тогда мы обучаем некоторую модель у нас есть обучающая ' \
    'выборка унаситькюмся ошибки и наша задачи сводится вообщем такомптиминационной ' \
    'задачи мы минимизируем в функцию ошибки по параметрам нашей модели на обучающие ' \
    'выбрать но на самом деле хотим там и не этого мы не обучающую ошибку хотим ' \
    'минимизировать'

output = fix_recognition_error([asr_correction_example], ru_llm_tokenizer,
                               ru_llm_config, ru_llm_model)[0]
print(output)

Output:

Краеугольным камнем любого алгоритма машинного обучения является прежде всего обобщающая способность. Тогда мы обучаем некоторую модель, у нас есть обучающая выборка, у нас есть коэффициенты ошибки, и наша задача сводится, в общем-то, к мотивационной задаче: мы минимизируем функцию ошибки по параметрам нашей модели, на обучающей выборке, но на самом деле хотим там и не этого. Мы не обучающую ошибку хотим минимизировать.
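
fix_recognition_error sends only sufficiently long inputs to the model and then puts the results back at their original positions. The same filter-and-merge pattern can be sketched without a model. This assumes a hypothetical process_batch callable that stands in for tokenization, generation, and decoding; map_long_texts and min_chars are illustrative names, not part of the original code.

```python
from typing import Callable, List


def map_long_texts(texts: List[str],
                   process_batch: Callable[[List[str]], List[str]],
                   min_chars: int = 3) -> List[str]:
    """Apply process_batch to texts longer than min_chars, preserving input order."""
    stripped = [cur.strip() for cur in texts]
    selected = [cur for cur in stripped if len(cur) > min_chars]
    if not selected:
        return stripped
    results = process_batch(selected)
    if len(results) != len(selected):
        raise ValueError('process_batch must return one result per input')
    processed = iter(results)
    # Long texts take the next processed result; short ones pass through unchanged.
    return [next(processed) if len(cur) > min_chars else cur for cur in stripped]


# Upper-casing stands in for the model call in this illustration.
print(map_long_texts(['  hello world ', 'ok', ''],
                     lambda batch: [t.upper() for t in batch]))
```

Batching only the selected texts avoids spending generation time on empty or near-empty strings while keeping the output list aligned with the input list.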

Summarization

from typing import List

from transformers import T5ForConditionalGeneration
from transformers import GenerationConfig
from transformers import GPT2Tokenizer
import torch

def generate_answer(answers: List[str], tokenizer: GPT2Tokenizer, config: GenerationConfig,
                    model: T5ForConditionalGeneration) -> List[str]:
    # `answers` holds the input prompts (instruction plus text); blank prompts skip the model.
    nonempty_answers = []
    for cur in answers:
        if len(cur.strip()) > 0:
            nonempty_answers.append(cur)
    if len(nonempty_answers) == 0:
        return ['' for _ in range(len(answers))]
    x = tokenizer(nonempty_answers, return_tensors='pt', padding=True).to(model.device)
    out = model.generate(**x, generation_config=config)
    # Decode the generated sequences and normalize Windows line endings.
    questions_for_nonempty_texts = [
        tokenizer.decode(cur, skip_special_tokens=True).strip().replace('\r\n', '\n') for cur in out
    ]
    # Rebuild the result list in input order; blank prompts map to empty strings.
    united_questions = []
    idx = 0
    for cur in answers:
        if len(cur.strip()) > 0:
            united_questions.append(questions_for_nonempty_texts[idx])
            idx += 1
        else:
            united_questions.append('')
    return united_questions


ru_llm_tokenizer = GPT2Tokenizer.from_pretrained('bond005/FRED-T5-large-instruct-v0.1')
ru_llm_model = T5ForConditionalGeneration.from_pretrained('bond005/FRED-T5-large-instruct-v0.1')
ru_llm_config = GenerationConfig.from_pretrained('bond005/FRED-T5-large-instruct-v0.1')
if torch.cuda.is_available():
    ru_llm_model = ru_llm_model.cuda()

summarization_example = \
    'Выполни саммаризацию и выдели, пожалуйста, основную мысль следующего текста. ' \
    'В данной работе проводится сравнение предварительного обучения трансформера на ' \
    'текстах естественного языка и на предложениях синтетического псевдоязыка. ' \
    'Искусственные тексты были автоматически сгенерированы по написанным нами правилам ' \
    'в контекстно-свободной грамматике. Результаты дообучения на выполнение заданий ' \
    'проекта RussianSuperGLUE статистически достоверно показали, что модели имеют ' \
    'одинаковые оценки, т.е. можно считать, что использование искусственных данных ' \
    'дает преимущество для “безопасности” искусственного интеллекта за счет ' \
    'возможности полностью контролировать состав выборки. Также мы можем говорить ' \
    'о том, что на этапе предобучения модели типа RoBERTa достаточно научиться ' \
    'распознавать только синтаксические и морфологические закономерности языка, ' \
    'которые могут быть успешно созданы довольно таким простым способом, как ' \
    'контекстно-свободная грамматика.'

output = generate_answer([summarization_example], ru_llm_tokenizer,
                         ru_llm_config, ru_llm_model)[0]
print(output)

Output:

В работе сравнивается предварительное обучение трансформера на текстах естественного языка и на предложениях синтетического псевдоязыка. Результаты дообучения на выполнение заданий проекта RussianSuperGLUE статистически достоверно показали, что модели имеют одинаковые оценки. Использование искусственных данных дает преимущество для безопасности искусственного интеллекта за счет возможности полностью контролировать состав выборки.

Segmentation

from typing import List

from transformers import T5ForConditionalGeneration
from transformers import GenerationConfig
from transformers import GPT2Tokenizer
import torch

def generate_answer(answers: List[str], tokenizer: GPT2Tokenizer, config: GenerationConfig,
                    model: T5ForConditionalGeneration) -> List[str]:
    # `answers` holds the input prompts (instruction plus text); blank prompts skip the model.
    nonempty_answers = []
    for cur in answers:
        if len(cur.strip()) > 0:
            nonempty_answers.append(cur)
    if len(nonempty_answers) == 0:
        return ['' for _ in range(len(answers))]
    x = tokenizer(nonempty_answers, return_tensors='pt', padding=True).to(model.device)
    out = model.generate(**x, generation_config=config)
    # Decode the generated sequences and normalize Windows line endings.
    questions_for_nonempty_texts = [
        tokenizer.decode(cur, skip_special_tokens=True).strip().replace('\r\n', '\n') for cur in out
    ]
    # Rebuild the result list in input order; blank prompts map to empty strings.
    united_questions = []
    idx = 0
    for cur in answers:
        if len(cur.strip()) > 0:
            united_questions.append(questions_for_nonempty_texts[idx])
            idx += 1
        else:
            united_questions.append('')
    return united_questions


ru_llm_tokenizer = GPT2Tokenizer.from_pretrained('bond005/FRED-T5-large-instruct-v0.1')
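ru_llm_model = T5ForConditionalGeneration.from_pretrained('bond005/FRED-T5-large-instruct-v0.1')
ru_llm_config = GenerationConfig.from_pretrained('bond005/FRED-T5-large-instruct-v0.1')
if torch.cuda.is_available():
    ru_llm_model = ru_llm_model.cuda()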
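
For the segmentation task, the model marks paragraph boundaries with the \n character, so the decoded output can be split on it. A minimal sketch, assuming a generate callable that takes a list of prompts and returns a list of decoded outputs, as generate_answer does once it is bound to a loaded model. segment_into_paragraphs and fake_generate are hypothetical names, and fake_generate returns a canned string instead of calling the model.

```python
from typing import Callable, List

# Instruction text for the segmentation task, taken from the table of solved tasks.
SEGMENTATION_INSTRUCTION = 'Разбей, пожалуйста, следующий текст на абзацы.'


def segment_into_paragraphs(text: str,
                            generate: Callable[[List[str]], List[str]]) -> List[str]:
    """Ask the model to segment the text, then split its output into paragraphs."""
    prompt = SEGMENTATION_INSTRUCTION + ' ' + text.strip()
    raw_output = generate([prompt])[0]
    # The model separates paragraphs with '\n'; drop empty fragments.
    return [part.strip() for part in raw_output.split('\n') if part.strip()]


def fake_generate(prompts: List[str]) -> List[str]:
    # Stand-in for the model: returns a canned two-paragraph answer per prompt.
    return ['First paragraph.\nSecond paragraph.' for _ in prompts]


print(segment_into_paragraphs('First paragraph. Second paragraph.', fake_generate))
```

With a loaded model, generate could be a small wrapper such as lambda prompts: generate_answer(prompts, ru_llm_tokenizer, ru_llm_config, ru_llm_model).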

📄 License

The model is licensed under the Apache 2.0 license.
