
gpt2023

Developed by crumb
A 124M-parameter language model based on the GPT-2 architecture, fine-tuned on 2.23B tokens of diverse data for improved text generation.
Downloads 136
Release Time: 4/30/2023

Model Overview

This is a fine-tuned version of OpenAI's smallest GPT-2 model (124M parameters), trained on data from Common Crawl web pages, ArXiv papers, and GitHub code, and optimized for generation quality and temporal awareness.

Model Features

Efficient Fine-tuning
Fine-tuned on 2.23B tokens, approaching the Chinchilla-optimal token budget of roughly 20 tokens per parameter (about 2.5B tokens for a 124M-parameter model)
Diverse Data
Training data includes web content, academic papers, and code, covering multi-domain knowledge
Temporal Improvements
Compared to the original GPT-2, it has better awareness of recent events like the COVID-19 pandemic
Lightweight Deployment
Runs on a single RTX 3060 with 12GB VRAM, making it suitable for local deployment (see the sketch below)
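
A minimal sketch of loading the model for local generation with the Hugging Face transformers library. The repository id "crumb/gpt2023" and the prompt text are assumptions for illustration, not confirmed by this page:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository id is an assumption based on the developer and model name.
model_id = "crumb/gpt2023"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# A 124M-parameter model fits easily in 12GB VRAM; fall back to CPU if no GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

prompt = "The COVID-19 pandemic changed remote work by"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 defines no pad token by default
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Sampling parameters such as temperature and top_p are illustrative defaults; a model of this size benefits from moderate sampling rather than greedy decoding.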

Model Capabilities

Text Generation
Language Understanding
Contextual Completion

Use Cases

Content Creation
Article Generation
Generates coherent text paragraphs based on prompts
Example: COVID-19 analysis text generation
Education & Research
Academic Summarization
Generates research summaries based on ArXiv paper data (see the prompt sketch below)
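
A minimal sketch of prompt-based summarization using the transformers text-generation pipeline. The model id, the example abstract, and the prompt wording are assumptions for illustration; a 124M-parameter model will produce only rough, non-factual summaries:

from transformers import pipeline

# Model id "crumb/gpt2023" is an assumption for illustration.
generator = pipeline("text-generation", model="crumb/gpt2023")

abstract = "We study the scaling behaviour of small language models fine-tuned on web, paper, and code data."
prompt = f"Paper abstract: {abstract}\n\nSummary:"
result = generator(prompt, max_new_tokens=60, do_sample=True, temperature=0.7)
print(result[0]["generated_text"])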