Transformers
A library for using pre-trained models for masked language modeling.
Quick Start
To use the pre-trained model for masked language modeling, use the following snippet:
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# MDLM uses the GPT-2 tokenizer; trust_remote_code is needed to load the custom MDLM architecture.
tokenizer = AutoTokenizer.from_pretrained('gpt2')
model_name = 'kuleshov-group/mdlm-owt'
model = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True)
```
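As a quick sanity check (a minimal sketch; the example sentence is made up for illustration), the GPT-2 tokenizer loaded above can be used to prepare inputs, keeping in mind the model's 1024-token context length:

```python
# Tokenize a hypothetical example sentence with the GPT-2 tokenizer loaded above.
text = "Masked diffusion language models reconstruct text from partially masked inputs."
input_ids = tokenizer(text, return_tensors='pt').input_ids

# Inputs must fit within the model's 1024-token context length.
print(input_ids.shape)  # (1, number_of_tokens)
```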
For more details, please see our GitHub repository: MDLM.
Features
- Model Structure: The model has a context length of 1024 and is similar in size to GPT2-medium, with approximately 130 million non-embedding parameters.
- Training Process: It was trained using a forward diffusion process that generates inputs ranging from fully masked to fully unmasked; the objective is to reconstruct the original input from these varying levels of masking, outputting logits in the process (a sketch of this masking step follows this list).
- Training Data: The training regimen comprised one million steps on the OpenWebText corpus, processing a total of 33 billion tokens.
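The forward diffusion process described in the Training Process bullet can be made concrete with a short sketch. This is not the authors' implementation: the mask-token id, the function name, and the uniform draw of the masking level below are assumptions chosen purely for illustration.

```python
import torch

MASK_ID = 50257  # hypothetical mask-token id (one slot past the GPT-2 vocabulary); check the model config

def forward_mask(input_ids: torch.Tensor, mask_prob: float) -> torch.Tensor:
    """Illustrative forward diffusion step: independently replace each token with
    the mask token with probability mask_prob. At mask_prob=0 the sequence is
    fully unmasked; at mask_prob=1 it is fully masked."""
    noise = torch.rand_like(input_ids, dtype=torch.float)
    corrupted = input_ids.clone()
    corrupted[noise < mask_prob] = MASK_ID
    return corrupted

# One masking level per sequence, so a batch spans inputs ranging from nearly
# fully unmasked to nearly fully masked; the model is trained to reconstruct
# the original tokens from these corrupted inputs.
batch = torch.randint(0, 50257, (4, 1024))   # stand-in token ids at the 1024-token context length
levels = torch.rand(4)
noisy = torch.stack([forward_mask(seq, p.item()) for seq, p in zip(batch, levels)])
```

In the actual training setup the masking probability is governed by the noise schedule described in the paper; the uniform draw above is only a stand-in for illustration.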
For more details, please see our paper: Simple and Effective Masked Diffusion Language Models.
Documentation
Model Details
The model, which has a context length of 1024 and is similar in size to GPT2-medium with approximately 130 million non-embedding parameters, was trained using a forward diffusion process that generates inputs varying from fully masked to fully unmasked. Its objective is to reconstruct the original input from these varying levels of masking, outputting logits in the process. The training regimen comprised one million steps on the OpenWebText corpus, processing a total of 33 billion tokens.
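To make the reconstruction objective concrete, the following is a minimal sketch of a cross-entropy loss computed only at masked positions, given the model's per-position logits. The function name and the unweighted averaging are assumptions for illustration; the paper's full objective adds a noise-level-dependent weighting.

```python
import torch
import torch.nn.functional as F

def masked_reconstruction_loss(logits: torch.Tensor,
                               original_ids: torch.Tensor,
                               corrupted_ids: torch.Tensor,
                               mask_id: int) -> torch.Tensor:
    """Cross-entropy between the model's logits and the original tokens,
    evaluated only at positions that were masked in the corrupted input."""
    was_masked = corrupted_ids == mask_id        # (batch, seq_len) boolean
    per_token = F.cross_entropy(
        logits.transpose(1, 2),                  # (batch, vocab_size, seq_len)
        original_ids,                            # (batch, seq_len)
        reduction='none',
    )
    return (per_token * was_masked).sum() / was_masked.sum().clamp(min=1)
```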
Citation
Please cite our work using the BibTeX below:
BibTeX:
@misc{sahoo2024simple,
title={Simple and Effective Masked Diffusion Language Models},
author={Subham Sekhar Sahoo and Marianne Arriola and Yair Schiff and Aaron Gokaslan and Edgar Marroquin and Justin T Chiu and Alexander Rush and Volodymyr Kuleshov},
year={2024},
eprint={2406.07524},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
APA:
Sahoo, S. S., Arriola, M., Schiff, Y., Gokaslan, A., Marroquin, E., Chiu, J. T., Rush, A., & Kuleshov, V. (2024). Simple and Effective Masked Diffusion Language Models (Version arXiv:2406.07524v1). arXiv. https://doi.org/10.48550/arXiv.2406.07524
Model Card Contact
Subham Sekhar Sahoo (ssahoo@cs.cornell.edu)
License
This project is licensed under the Apache-2.0 license.