🚀 gpt2-large-japanese
This repository provides a large-sized Japanese GPT-2 model trained by ABEJA, Inc. It is intended for text generation tasks in Japanese.
🚀 Quick Start
First, install sentencepiece (the model uses a sentencepiece-based tokenizer). Behavior has been confirmed with the latest version as of August 2022. Skip this step if the package is already installed.
pip install sentencepiece
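The usage examples below also require transformers (and PyTorch or TensorFlow, depending on the backend). A quick import check like the following confirms both packages are available; the printed versions are simply whatever happens to be installed:
import sentencepiece
import transformers

# Print the installed versions to confirm the environment is set up.
print("sentencepiece:", sentencepiece.__version__)
print("transformers:", transformers.__version__)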
💻 Usage Examples
Basic Usage
When using pipeline for text generation:
from transformers import pipeline
generator = pipeline("text-generation", model="abeja/gpt2-large-japanese")
generated = generator(
"人とAIが協調するためには、",
max_length=30,
do_sample=True,
num_return_sequences=3,
top_p=0.95,
top_k=50,
pad_token_id=3
)
print(*generated, sep="\n")
"""
[out]
{'generated_text': '人とAIが協調するためには、社会的なルールをきちんと理解して、人と共存し、協働して生きていくのが重要だという。'}
{'generated_text': '人とAIが協調するためには、それぞれが人間性を持ち、またその人間性から生まれるインタラクションを調整しなければならないことはいうまで'}
{'generated_text': '人とAIが協調するためには、AIが判断すべきことを人間が決める必要がある。人工知能の目的は、人間の知性、記憶、理解、'}
"""
Advanced Usage
When using PyTorch:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("abeja/gpt2-large-japanese")
model = AutoModelForCausalLM.from_pretrained("abeja/gpt2-large-japanese")
input_text = "人とAIが協調するためには、"
input_ids = tokenizer.encode(input_text, return_tensors="pt")
gen_tokens = model.generate(
input_ids,
max_length=100,
do_sample=True,
num_return_sequences=3,
top_p=0.95,
top_k=50,
pad_token_id=tokenizer.pad_token_id
)
for gen_text in tokenizer.batch_decode(gen_tokens, skip_special_tokens=True):
print(gen_text)
When using TensorFlow:
from transformers import AutoTokenizer, TFAutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("abeja/gpt2-large-japanese")
model = TFAutoModelForCausalLM.from_pretrained("abeja/gpt2-large-japanese", from_pt=True)
input_text = "人とAIが協調するためには、"
input_ids = tokenizer.encode(input_text, return_tensors="tf")
gen_tokens = model.generate(
input_ids,
max_length=100,
do_sample=True,
num_return_sequences=3,
top_p=0.95,
top_k=50,
pad_token_id=tokenizer.pad_token_id
)
for gen_text in tokenizer.batch_decode(gen_tokens, skip_special_tokens=True):
print(gen_text)
📚 Documentation
Dataset
The model was trained on Japanese CC-100, Japanese Wikipedia, and Japanese OSCAR.
Tokenization
The model uses a sentencepiece-based tokenizer whose vocabulary was trained on Japanese Wikipedia.
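As a quick illustration of the sentencepiece tokenization (a minimal sketch; the exact subword pieces depend on the released vocabulary):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("abeja/gpt2-large-japanese")

text = "人とAIが協調するためには、"
tokens = tokenizer.tokenize(text)              # sentencepiece subword pieces
ids = tokenizer.convert_tokens_to_ids(tokens)  # corresponding vocabulary IDs
print(tokens)
print(ids)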
📄 License
This model is released under the MIT License.
Additional Information
| Property | Details |
|----------|---------|
| Model Type | Large-sized Japanese GPT-2 model |
| Training Data | CC-100, Wikipedia, OSCAR |
| Tags | ja, japanese, gpt2, text-generation, lm, nlp |
| Widget Input Example | "人とAIが協調するためには、" |