Reformer-enwik8 open-source language model - free to use for text generation and compression tasks

Reformer Enwik8

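Developed by Google

A character-level Reformer language model trained on the enwik8 dataset, intended for text generation and compression tasks.

Large language model

Transformers

#character-level language model #trained on Wikipedia data #long-sequence processing

Downloads: 637

Release date: 3/2/2022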

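Model overview

This is a character-level language model trained on the enwik8 dataset, mainly intended for text generation and compression tasks. It operates directly on characters, so no tokenizer is required.

Model features

Character-level operation

Processes text directly at the character level without a tokenizer, which simplifies preprocessing.

Efficient training

Uses the Reformer architecture, which is optimized for long sequences and suited to processing large blocks of text.

Text compression capability

Trained on enwik8, a dataset commonly used to measure how well a model compresses text.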

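Model capabilities

Text generation

Text compression

Use cases

Text generation

Autocompletion

Generates a continuation from an input text fragment.

Produces coherent text continuations.

Data compression

Text compression

Uses the model to compress text data.

Applicable to compression benchmarks such as the Hutter Prize.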

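🚀 Character-level Reformer language model trained on enwik8

The character-level Reformer language model in this project was trained on the enwik8 dataset. enwik8 is a dataset built from Wikipedia and is often used to measure how well a model can compress data, for example within the scope of the Hutter Prize: https://en.wikipedia.org/wiki/Hutter_Prize .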
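Compression quality on enwik8 is usually reported in bits per character (bpc). As a minimal sketch, assuming you already have a mean per-character cross-entropy loss measured in nats (the unit PyTorch's cross-entropy uses), the conversion to bpc and to an idealized compressed size could look like this. The function names are hypothetical and the numbers are illustrative only, not results for this model.

```python
import math


def nats_to_bits_per_char(mean_loss_nats: float) -> float:
    """Convert a mean cross-entropy loss in nats to bits per character."""
    return mean_loss_nats / math.log(2)


def estimated_compressed_bytes(num_chars: int, bits_per_char: float) -> float:
    """Idealized size if every character cost exactly `bits_per_char` bits."""
    return num_chars * bits_per_char / 8


# Illustrative value only: a loss of ln(2) nats is exactly 1 bit per character.
bpc = nats_to_bits_per_char(math.log(2))
size = estimated_compressed_bytes(100_000_000, bpc)
```

A real compressor would add overhead for the entropy coder and the model itself, so this is a lower bound rather than an achievable file size.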

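The reformer-enwik8 model was pretrained on the first 90 million characters of enwik8, with the text split into chunks of 65536 characters (2^16). The model weights were taken from https://console.cloud.google.com/storage/browser/trax-ml/reformer/enwik8 and converted to Hugging Face's PyTorch ReformerLM model, ReformerModelWithLMHead.

Because the model is a character-based language model, no tokenizer is needed. The following functions can be used for encoding and decoding:

💻 Usage examples

Basic usage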
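The chunking described above can be sketched as follows. This is an illustration under assumptions, not the original training pipeline: `chunk_bytes` is a hypothetical helper, characters are treated as bytes, and how the original training handled a final partial chunk is not stated, so this sketch simply keeps it.

```python
def chunk_bytes(data: bytes, chunk_size: int = 2**16, limit: int = 90_000_000):
    """Take the first `limit` bytes and split them into `chunk_size` pieces."""
    train = data[:limit]
    # The last chunk may be shorter than chunk_size (assumption: it is kept).
    return [train[i:i + chunk_size] for i in range(0, len(train), chunk_size)]


# Small stand-in values so the behaviour is easy to inspect.
chunks = chunk_bytes(b"a" * 10, chunk_size=4, limit=9)
lengths = [len(c) for c in chunks]
```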

import torch

# Encoding
def encode(list_of_strings, pad_token_id=0):
    # convert every input to bytes first so that lengths are measured in bytes
    byte_strings = [s if isinstance(s, bytes) else s.encode("utf-8") for s in list_of_strings]
    max_length = max(len(b) for b in byte_strings)

    # create empty tensors
    attention_masks = torch.zeros((len(byte_strings), max_length), dtype=torch.long)
    input_ids = torch.full((len(byte_strings), max_length), pad_token_id, dtype=torch.long)

    for idx, byte_string in enumerate(byte_strings):
        # shift every byte value by 2; ids 0 and 1 are not used for text
        input_ids[idx, :len(byte_string)] = torch.tensor([x + 2 for x in byte_string], dtype=torch.long)
        attention_masks[idx, :len(byte_string)] = 1

    return input_ids, attention_masks

# Decoding
def decode(outputs_ids):
    decoded_outputs = []
    for output_ids in outputs_ids.tolist():
        # transform each id back to a character; ids < 2 are mapped to ""
        # note: each byte is mapped on its own, so multi-byte UTF-8 characters are not reassembled
        decoded_outputs.append("".join([chr(x - 2) if x > 1 else "" for x in output_ids]))
    return decoded_outputs
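To make the id scheme concrete without loading PyTorch, here is a minimal pure-Python sketch of the same byte-plus-2 mapping. `to_ids` and `from_ids` are hypothetical helpers, not part of the model card. Unlike the `decode` function above, this variant reassembles the bytes and decodes them as UTF-8, which is an assumption about what a caller would want for non-ASCII text.

```python
def to_ids(text: str) -> list:
    """Map text to ids: each UTF-8 byte value shifted up by 2."""
    return [b + 2 for b in text.encode("utf-8")]


def from_ids(ids) -> str:
    """Drop ids below 2 (such as padding), undo the shift, decode as UTF-8."""
    raw = bytes(i - 2 for i in ids if i > 1)
    # errors="replace" keeps decoding from failing on a truncated byte sequence
    return raw.decode("utf-8", errors="replace")


ids = to_ids("Hi")               # "H" is byte 72, "i" is byte 105
padded = ids + [0, 0]            # 0 plays the role of pad_token_id here
restored = from_ids(padded)
```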

Advanced usage

from transformers import ReformerModelWithLMHead

model = ReformerModelWithLMHead.from_pretrained("google/reformer-enwik8")
encoded, attention_masks = encode(["In 1965, Brooks left IBM to found the Department of"])
decode(model.generate(encoded, do_sample=True, max_length=150))

# gives (sampling is enabled, so the output varies between runs), for example:
# In 1965, Brooks left IBM to found the Department of Journalism in 1968. IBM had jurisdiction himself in 1980, while Brooks resolved, nevertheless thro