ruDialoGpt3-medium-finetuned-telegram open-source dialogue model - dialogue generation based on Russian forums and Telegram chat logs

Rudialogpt3 Medium Finetuned Telegram

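Developed by Kirili4ik

This dialogue-generation model is based on Sberbank-AI's DialoGPT, pretrained on Russian forum data and then fine-tuned on personal Telegram chat logs.

Large language model

Transformers

#Russian dialogue generation #Telegram chat fine-tuning #Personalized dialogue model

Downloads: 37

Release date: 3/2/2022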

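Model overview

A dialogue-generation model optimized for Russian and fine-tuned on one person's Telegram chat logs, so its replies follow that person's chat style.

Model features

Personalized fine-tuning

Fine-tuned on personal Telegram chat logs, so it can imitate a specific conversational style

Russian-optimized

Trained and optimized specifically for Russian-language dialogue

Interactive dialogue

Supports understanding and generation with multi-turn conversation context

Model capabilities

Russian dialogue generation

Context-aware replies

Personalized style imitation

Multi-turn conversation

Use cases

Personalized chatbots

Personal digital assistant

Build a conversational assistant with a personal chat style

Generates replies in a style similar to the user's own

Social applications

Automatic reply generation

Automatically generate replies in a personal style on social platforms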

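🚀 ruDialoGpt3-medium-finetuned-telegram

ruDialoGpt3-medium-finetuned-telegram is a dialogue model trained on Russian and fine-tuned on personal Telegram chat logs, intended for Russian-language conversation.

🚀 Quick start

DialoGPT trained on Russian and fine-tuned on my Telegram chat.

The model was created by sberbank-ai and trained on Russian forums (see Grossmend's model). You can find information about how it was trained on habr (in Russian). I built a simple pipeline and fine-tuned the model on my own exported Telegram chat (about 30 MB of JSON). Getting data from Telegram and fine-tuning a model on it is actually quite easy, so I made a Colab tutorial for it: https://colab.research.google.com/drive/1fnAVURjyZRK9VQg1Co_-SKUQnRES8l9R?usp=sharing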
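To make the data-preparation step more concrete, here is a minimal sketch of how an exported chat could be turned into one training string that uses the same `|speaker|length|` prefix as the usage example in this card. It is not taken from the Colab tutorial. The export layout (a top-level `messages` list whose items have `from` and `text`, where `text` may be a string or a list of fragments), the helper names, and the `"</s>"` end-of-sequence string are all assumptions for illustration; in practice you would pass the real tokenizer's token count and `eos_token`.

```python
def length_bucket(token_count: int) -> str:
    """Map a token count to a length marker (same thresholds as the usage example)."""
    if token_count <= 15:
        return "1"
    if token_count <= 50:
        return "2"
    if token_count <= 256:
        return "3"
    return "-"


def message_text(message: dict) -> str:
    """Flatten a message's text; assumes it is a string or a list of str/dict fragments."""
    text = message.get("text", "")
    if isinstance(text, list):
        parts = []
        for part in text:
            parts.append(part if isinstance(part, str) else part.get("text", ""))
        return "".join(parts)
    return text


def export_to_training_string(export: dict, my_name: str, count_tokens, eos_token: str) -> str:
    """Concatenate all non-empty messages as |who|len|text<eos> (hypothetical helper)."""
    pieces = []
    for message in export.get("messages", []):
        text = message_text(message).strip()
        if not text:
            continue  # skip stickers, photos, and other messages without text
        who = "1" if message.get("from") == my_name else "0"  # 1 = the imitated person
        pieces.append(f"|{who}|{length_bucket(count_tokens(text))}|{text}{eos_token}")
    return "".join(pieces)


if __name__ == "__main__":
    sample = {"messages": [{"from": "Me", "text": "привет"},
                           {"from": "Friend", "text": "как дела?"}]}
    # A whitespace split stands in for the real tokenizer here.
    print(export_to_training_string(sample, "Me", lambda s: len(s.split()), "</s>"))
```

With a real tokenizer, `count_tokens` could be something like `lambda s: len(tokenizer.encode(s))`, and a real export would first be read with `json.load`.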

⚠️ Important

Because of the specifics of the data, the Hosted Inference API may not work properly.

🤗 You can try the model in my Spaces demo 🤗

💻 Usage example

Basic usage

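import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Download the model and tokenizer
checkpoint = "Kirili4ik/ruDialoGpt3-medium-finetuned-telegram"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
model.eval()  # inference mode: disables dropout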


# util function: map a phrase to its length marker ('1', '2', '3' or '-') by token count
def get_length_param(text: str, tokenizer) -> str:
    tokens_count = len(tokenizer.encode(text))
    if tokens_count <= 15:
        len_param = '1'
    elif tokens_count <= 50:
        len_param = '2'
    elif tokens_count <= 256:
        len_param = '3'
    else:
        len_param = '-'
    return len_param


# util function: speaker marker for a chat message ('1' = machine, '0' = human)
def get_user_param(text: dict, machine_name_in_chat: str) -> str:
    if text['from'] == machine_name_in_chat:
        return '1'  # machine
    else:
        return '0'  # human


# start with an empty history; int64 matches the dtype returned by tokenizer.encode
chat_history_ids = torch.zeros((1, 0), dtype=torch.long)

while True:
    
    next_who = input("Whose phrase? (H/G)\t")  # H = Human, G = GPT

    # In case Human
    if next_who == "H" or next_who == "Human":
        input_user = input("===> Human: ")
        
        # encode the new user input, add parameters and return a tensor in Pytorch
        new_user_input_ids = tokenizer.encode(f"|0|{get_length_param(input_user, tokenizer)}|" \
                                              + input_user + tokenizer.eos_token, return_tensors="pt")
        # append the new user input tokens to the chat history
        chat_history_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1)

    elif next_who == "G" or next_who == "GPT":

        next_len = input("Phrase len? 1/2/3/-\t")  # expected length marker of the reply
        # encode the bot prefix (speaker 1 + requested length) and return a PyTorch tensor
        new_user_input_ids = tokenizer.encode(f"|1|{next_len}|", return_tensors="pt")
        # append the prefix tokens to the chat history
        chat_history_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1)
        
        # print(tokenizer.decode(chat_history_ids[-1])) # uncomment to see full gpt input
        
        # save the prompt length so that only the new tokens are printed
        input_len = chat_history_ids.shape[-1]
        # generate a response; the parameters are described at hf.co/blog/how-to-generate
        chat_history_ids = model.generate(
            chat_history_ids,
            num_return_sequences=1,                     # raise for more variants, then print each [i]
            max_length=512,                             # counts prompt tokens plus generated tokens
            no_repeat_ngram_size=3,
            do_sample=True,
            top_k=50,
            top_p=0.9,
            temperature=0.6,                            # lower is closer to greedy; must be > 0
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
        )
        
        
        # pretty print the last output tokens from the bot
        print(f"===> GPT-3:  {tokenizer.decode(chat_history_ids[:, input_len:][0], skip_special_tokens=True)}")
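One thing the loop above does not handle is history growth: `max_length=512` counts the prompt as well as the generated tokens, so a long conversation eventually leaves no room for a reply. A minimal sketch of one way to deal with this is to drop the oldest whole turns until the history fits a token budget. This assumes you keep each turn as its own list of token ids rather than one flat tensor; `trim_history` and the budget value are hypothetical and not part of the original example.

```python
from typing import List


def trim_history(turns: List[List[int]], max_tokens: int) -> List[List[int]]:
    """Keep the most recent whole turns whose combined length fits in max_tokens."""
    kept: List[List[int]] = []
    total = 0
    # Walk from the newest turn backwards and stop at the first turn that does not fit.
    for turn in reversed(turns):
        if total + len(turn) > max_tokens:
            break
        kept.append(turn)
        total += len(turn)
    kept.reverse()  # restore chronological order
    return kept


if __name__ == "__main__":
    turns = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    print(trim_history(turns, max_tokens=6))
```

Dropping whole turns, rather than slicing the flat tensor at an arbitrary position, keeps every remaining turn starting with its `|speaker|length|` prefix. The kept turns can then be concatenated and wrapped in a tensor before calling `model.generate`, with the budget chosen to leave space below `max_length` for the reply.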