Reformer-enwik8 open-source language model - free for text generation and compression tasks

Reformer Enwik8

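Developed by Google

A character-level Reformer language model trained on the enwik8 dataset, for text generation and compression tasks.

Large language model

Transformers

#character-level language model #trained on Wikipedia data #long-sequence processing

Downloads: 637

Released: 3/2/2022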

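Model overview

This is a character-level language model trained on the enwik8 dataset, intended mainly for text generation and compression tasks. It operates directly on characters, so no tokenizer is needed.

Model features

Character-level operation

Processes text directly at the character level with no tokenizer, which simplifies preprocessing.

Efficient training

Uses the Reformer architecture, which is optimized for long sequences and suited to processing large blocks of text.

Text compression

Trained on the enwik8 dataset, a benchmark commonly used to measure how well a model compresses data.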

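Model capabilities

Text generation

Text compression

Use cases

Text generation

Autocompletion

Generates a continuation from an input text fragment.

Produces a coherent continuation of the text.

Data compression

Text compression

Uses the model to compress text data.

Suited to compression benchmarks of the kind used for the Hutter Prize.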

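🚀 Character-level Reformer language model trained on enwik8

This character-level Reformer language model was trained on the enwik8 dataset. enwik8 is built from Wikipedia and is often used to measure a model's ability to compress data, for example within the scope of the Hutter Prize: https://en.wikipedia.org/wiki/Hutter_Prize .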
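
To make the link between language modeling and compression concrete, here is a minimal sketch. It assumes the standard information-theoretic result that an ideal entropy coder spends about -log2(p) bits on a symbol the model predicted with probability p. The function name `bits_per_character` and the example probabilities are illustrative only and are not measurements of this model.

```python
import math

def bits_per_character(probabilities):
    """Average ideal code length, in bits, for a sequence of characters.

    `probabilities` holds the probability the model assigned to each
    character that actually occurred (hypothetical values in this sketch).
    """
    if not probabilities:
        raise ValueError("need at least one probability")
    total_bits = sum(-math.log2(p) for p in probabilities)
    return total_bits / len(probabilities)

# A model that always assigns probability 0.5 costs 1 bit per character.
uniform_guess = bits_per_character([0.5, 0.5, 0.5, 0.5])

# Raw bytes cost 8 bits each, so 1 bit per character is an 8x reduction.
compression_ratio = 8 / uniform_guess
```

The better the model predicts the next character, the closer each probability is to 1 and the fewer bits the text needs, which is why a language model's quality on enwik8 can be read as a compression score.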

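The reformer-enwik8 model was pretrained on the first 90 million characters of enwik8, with the text split into chunks of 65,536 characters (2^16). The model weights were taken from https://console.cloud.google.com/storage/browser/trax-ml/reformer/enwik8 and converted to Hugging Face's PyTorch ReformerLM model, ReformerModelWithLMHead.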
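
The chunking described above can be sketched as follows. This is an illustration under stated assumptions, not the original preprocessing code: the helper `chunk_bytes` is hypothetical, characters are treated as bytes, and the source does not say how a trailing partial chunk was handled, so the sketch simply keeps it.

```python
CHUNK_SIZE = 2 ** 16          # 65,536 characters per chunk, as stated above
TRAIN_CHARS = 90_000_000      # first 90 million characters of enwik8

def chunk_bytes(data, chunk_size=CHUNK_SIZE):
    """Split a bytes object into consecutive chunks of `chunk_size` bytes.

    The final chunk may be shorter (an assumption; the source does not
    say whether the remainder was padded, dropped, or kept).
    """
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

# How many full chunks fit into the training portion, and what is left over.
full_chunks, remainder = divmod(TRAIN_CHARS, CHUNK_SIZE)

# Small demonstration on synthetic data instead of the real dataset file.
demo_chunks = chunk_bytes(bytes(200_000))
```

With these numbers, the training portion holds 1,373 full chunks plus a remainder of 19,072 characters.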

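Because the model operates on characters, it does not need a tokenizer. The following functions can be used for encoding and decoding instead:

💻 Usage examples

Basic usage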

import torch

# Encoding
def encode(list_of_strings, pad_token_id=0):
    # convert every input to bytes first, so lengths are measured in bytes
    # (a non-ASCII string has more bytes than characters)
    byte_strings = [s if isinstance(s, bytes) else s.encode("utf-8") for s in list_of_strings]
    max_length = max(len(b) for b in byte_strings)

    # create empty tensors
    attention_masks = torch.zeros((len(byte_strings), max_length), dtype=torch.long)
    input_ids = torch.full((len(byte_strings), max_length), pad_token_id, dtype=torch.long)

    for idx, byte_string in enumerate(byte_strings):
        # shift each byte value by 2, so ids 0 and 1 never represent a byte
        input_ids[idx, :len(byte_string)] = torch.tensor([x + 2 for x in byte_string])
        attention_masks[idx, :len(byte_string)] = 1

    return input_ids, attention_masks

# Decoding
def decode(outputs_ids):
    decoded_outputs = []
    for output_ids in outputs_ids.tolist():
        # map each id back to a character; ids < 2 are dropped
        decoded_outputs.append("".join([chr(x - 2) if x > 1 else "" for x in output_ids]))
    return decoded_outputs
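
As a dependency-free illustration of the id mapping used by the functions above, the sketch below reproduces it with plain Python lists: each UTF-8 byte b becomes id b + 2, and id 0 is used for padding. The helper names `to_ids` and `from_ids` are hypothetical and exist only for this example. `from_ids` decodes the recovered bytes as UTF-8, which is a choice made for this sketch and differs from the per-byte `chr` conversion in `decode`.

```python
def to_ids(text, length=None, pad_token_id=0):
    """Map a string to byte ids shifted by 2, right-padded to `length`."""
    ids = [b + 2 for b in text.encode("utf-8")]
    if length is not None:
        ids += [pad_token_id] * (length - len(ids))
    return ids

def from_ids(ids):
    """Invert `to_ids`: drop ids below 2, shift back, decode as UTF-8."""
    raw = bytes(i - 2 for i in ids if i > 1)
    return raw.decode("utf-8", errors="replace")

ids = to_ids("Hi!", length=5)   # ids == [74, 107, 35, 0, 0]
text = from_ids(ids)            # padding is dropped, giving "Hi!" again
```

Because the shift is a fixed offset, encoding followed by decoding returns the original text, and padding never collides with a real byte value.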

Advanced usage

from transformers import ReformerModelWithLMHead

model = ReformerModelWithLMHead.from_pretrained("google/reformer-enwik8")
encoded, attention_masks = encode(["In 1965, Brooks left IBM to found the Department of"])
decode(model.generate(encoded, do_sample=True, max_length=150))

# example output (do_sample=True, so the text varies between runs):
# In 1965, Brooks left IBM to found the Department of Journalism in 1968. IBM had jurisdiction himself in 1980, while Brooks resolved, nevertheless thro