🚀 gpt2-large-japanese
This repository provides a large-sized Japanese GPT-2 model trained by ABEJA, Inc. It is intended for text generation tasks in Japanese.
🚀 Quick Start
First, install sentencepiece (the model uses a sentencepiece-based tokenizer). Behavior has been confirmed with the latest version as of August 2022. Skip this step if the package is already installed.
pip install sentencepiece
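The usage examples below also require transformers (and PyTorch or TensorFlow, depending on the backend). A quick import check like the following confirms both packages are available; the printed versions are simply whatever happens to be installed:
import sentencepiece
import transformers

# Print the installed versions to confirm the environment is set up.
print("sentencepiece:", sentencepiece.__version__)
print("transformers:", transformers.__version__)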
💻 Usage Examples
Basic Usage
When using pipeline for text generation:
from transformers import pipeline
generator = pipeline("text-generation", model="abeja/gpt2-large-japanese")
generated = generator(
"人とAIが協調するためには、",
max_length=30,
do_sample=True,
num_return_sequences=3,
top_p=0.95,
top_k=50,
pad_token_id=3
)
print(*generated, sep="\n")
"""
[out]
{'generated_text': '人とAIが協調するためには、社会的なルールをきちんと理解して、人と共存し、協働して生きていくのが重要だという。'}
{'generated_text': '人とAIが協調するためには、それぞれが人間性を持ち、またその人間性から生まれるインタラクションを調整しなければならないことはいうまで'}
{'generated_text': '人とAIが協調するためには、AIが判断すべきことを人間が決める必要がある。人工知能の目的は、人間の知性、記憶、理解、'}
"""
Advanced Usage
When using PyTorch:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("abeja/gpt2-large-japanese")
model = AutoModelForCausalLM.from_pretrained("abeja/gpt2-large-japanese")
input_text = "人とAIが協調するためには、"
input_ids = tokenizer.encode(input_text, return_tensors="pt")
gen_tokens = model.generate(
input_ids,
max_length=100,
do_sample=True,
num_return_sequences=3,
top_p=0.95,
top_k=50,
pad_token_id=tokenizer.pad_token_id
)
for gen_text in tokenizer.batch_decode(gen_tokens, skip_special_tokens=True):
print(gen_text)
When using TensorFlow:
from transformers import AutoTokenizer, TFAutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("abeja/gpt2-large-japanese")
model = TFAutoModelForCausalLM.from_pretrained("abeja/gpt2-large-japanese", from_pt=True)
input_text = "人とAIが協調するためには、"
input_ids = tokenizer.encode(input_text, return_tensors="tf")
gen_tokens = model.generate(
input_ids,
max_length=100,
do_sample=True,
num_return_sequences=3,
top_p=0.95,
top_k=50,
pad_token_id=tokenizer.pad_token_id
)
for gen_text in tokenizer.batch_decode(gen_tokens, skip_special_tokens=True):
print(gen_text)
📚 Documentation
Dataset
The model was trained on Japanese CC-100, Japanese Wikipedia, and Japanese OSCAR.
Tokenization
The model uses a sentencepiece-based tokenizer whose vocabulary was trained on Japanese Wikipedia.
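As a quick illustration of the sentencepiece tokenization (a minimal sketch; the exact subword pieces depend on the released vocabulary):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("abeja/gpt2-large-japanese")

text = "人とAIが協調するためには、"
tokens = tokenizer.tokenize(text)              # sentencepiece subword pieces
ids = tokenizer.convert_tokens_to_ids(tokens)  # corresponding vocabulary IDs
print(tokens)
print(ids)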
📄 License
This model is released under the MIT License.
Additional Information
| Property | Details |
|----------|---------|
| Model Type | Large-sized Japanese GPT-2 model |
| Training Data | CC-100, Wikipedia, OSCAR |
| Tags | ja, japanese, gpt2, text-generation, lm, nlp |
| Widget Input Example | "人とAIが協調するためには、" |