Comedy Prompt Language Model
This is a Japanese comedy-prompt (ogiri) language model developed on AWS trn1 (Trainium) instances. It was pre-trained and then fine-tuned on comedy-prompt data.
Quick Start
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "watashiha/watashiha-gpt-6b"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Move the model to the GPU if one is available.
if torch.cuda.is_available():
    model = model.to("cuda")

# Prompt format: "お題:<topic><SEP>回答:"; the model generates the punchline after "回答:".
text = "お題:ホラー映画の「○○○」が過ぎる！<SEP>回答:"

token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt").to(model.device)
output_ids = model.generate(
    token_ids,
    do_sample=True,
    max_new_tokens=32,
    top_p=0.9,
    top_k=50,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
output = tokenizer.decode(output_ids.tolist()[0], skip_special_tokens=True)
print(output)
# Example output: the prompt followed by a generated punchline after "回答:".
Features
- Model architecture: based on the GPT-2 architecture.
- Vocabulary size: 44,880.
- Model size: 6B parameters.
- License: Apache License 2.0.
- Library: [aws-neuron-reference-for-megatron-lm](https://github.com/aws-neuron/aws-neuron-reference-for-megatron-lm).
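These specifications can be cross-checked against the published configuration. The snippet below is a minimal sketch, not part of the original card; it uses the standard transformers AutoConfig API, and the expected values are taken from the list above.

from transformers import AutoConfig

# Load only the configuration file; no 6B-parameter weights are downloaded.
config = AutoConfig.from_pretrained("watashiha/watashiha-gpt-6b")
print(config.model_type)   # architecture family (GPT-2-based according to the card)
print(config.vocab_size)   # expected to match the documented 44,880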
Installation
No dedicated installation steps are documented; the Quick Start above only requires the torch and transformers packages.
Usage Examples
Basic Usage
Basic usage is identical to the Quick Start example above.
Advanced Usage
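The source card does not include an advanced example. As one possible extension, the sketch below samples several candidate punchlines for a single topic in one call using the standard num_return_sequences argument of generate; the prompt string and sampling settings are reused from the Quick Start.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "watashiha/watashiha-gpt-6b"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
if torch.cuda.is_available():
    model = model.to("cuda")

# Same prompt format as the Quick Start: "お題:<topic><SEP>回答:".
text = "お題:ホラー映画の「○○○」が過ぎる！<SEP>回答:"
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt").to(model.device)

# Sample three candidate punchlines in a single generate call.
output_ids = model.generate(
    token_ids,
    do_sample=True,
    max_new_tokens=32,
    top_p=0.9,
    top_k=50,
    num_return_sequences=3,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
for ids in output_ids:
    print(tokenizer.decode(ids.tolist(), skip_special_tokens=True))

Sampling several candidates and then picking the best one, either manually or with a separate scoring step, is a common way to use generative joke models; nothing model-specific is needed beyond the prompt format.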
Documentation
Training Data
Pre-training was conducted on the following corpora, totaling 47.7 billion tokens:
- Japanese data from C4.
- Japanese data from CC-100.
- Japanese data from OSCAR.
- Japanese dump data from Wikipedia.
- In-house data.
Fine-tuning was then performed on 6.93 million comedy-prompt examples.
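The card does not document the fine-tuning preprocessing, but the Quick Start prompt suggests that each example pairs a topic (お題) and an answer (回答) joined by a <SEP> token. The helper below is a hypothetical illustration of that serialization, not the actual training code.

def format_ogiri_example(topic: str, answer: str) -> str:
    # Hypothetical serialization mirroring the inference-time prompt format.
    return f"お題:{topic}<SEP>回答:{answer}"

# Placeholder strings stand in for a real topic/answer pair.
print(format_ogiri_example("<topic text>", "<answer text>"))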
Performance Comparison
The table below shows the result of fine-tuning each model under the same conditions and having a legend of mobile comedy-prompt (ogiri) contests rate the generated jokes on a four-point scale:
- Out of range: the model fails to interpret the topic as Japanese.
- One star: the model understands the topic, but the joke is not well formed (not funny).
- Two stars: the joke is well formed (funny).
- Three stars: the joke is very funny (above a certain bar of funniness).
| Model | Out of range | One star | Two stars | Three stars |
| --- | --- | --- | --- | --- |
| watashiha-gpt-6b | 77 | 204 | 175 | 44 |
| [rinna/japanese-gpt-neox-3.6b](https://huggingface.co/rinna/japanese-gpt-neox-3.6b) | 88 | 194 | 185 | 30 |
| [stabilityai/japanese-stablelm-base-alpha-7b](https://huggingface.co/stabilityai/japanese-stablelm-base-alpha-7b) | 96 | 164 | 196 | 43 |
| [elyza/ELYZA-japanese-Llama-2-7b-fast](https://huggingface.co/elyza/ELYZA-japanese-Llama-2-7b-fast) | 75 | 197 | 198 | 25 |
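To make the counts easier to compare, the short script below (not part of the original card) recomputes each model's share of jokes rated two stars or better directly from the table above; note that the per-model totals differ slightly.

# Rating counts copied from the table:
# (out of range, one star, two stars, three stars)
results = {
    "watashiha-gpt-6b": (77, 204, 175, 44),
    "rinna/japanese-gpt-neox-3.6b": (88, 194, 185, 30),
    "stabilityai/japanese-stablelm-base-alpha-7b": (96, 164, 196, 43),
    "elyza/ELYZA-japanese-Llama-2-7b-fast": (75, 197, 198, 25),
}

for name, (out_of_range, one_star, two_stars, three_stars) in results.items():
    total = out_of_range + one_star + two_stars + three_stars
    good = two_stars + three_stars
    print(f"{name}: {good}/{total} jokes rated two stars or better ({good / total:.1%})")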
Technical Details
Training ran on AWS trn1 instances using the [aws-neuron-reference-for-megatron-lm](https://github.com/aws-neuron/aws-neuron-reference-for-megatron-lm) library; no further implementation details are documented.
License
This project is licensed under the Apache License 2.0.
Developers
- UCHIDA, Tatsuya
- KOBASHI, Yohei
- KUROKI, Shuya
- KUBOTA, Hikaru
- TAKENOUCHI, Daisuke