Comedy Prompt Language Model
This is a Japanese comedy-prompt (ogiri) language model developed on AWS trn1 (Trainium) instances. It was pre-trained and then fine-tuned on comedy-prompt data.
Quick Start
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "watashiha/watashiha-gpt-6b"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Move the model to the GPU if one is available.
if torch.cuda.is_available():
    model = model.to("cuda")

# Prompt format: "お題:<topic><SEP>回答:"; the model generates the punchline after "回答:".
text = "お題:ホラー映画の「○○○」が過ぎる！<SEP>回答:"

token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt").to(model.device)
output_ids = model.generate(
    token_ids,
    do_sample=True,
    max_new_tokens=32,
    top_p=0.9,
    top_k=50,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
output = tokenizer.decode(output_ids.tolist()[0], skip_special_tokens=True)
print(output)
# Example output: the prompt followed by a generated punchline after "回答:".
Features
- Model architecture: based on the GPT-2 architecture.
- Vocabulary size: 44,880.
- Model size: 6B parameters.
- License: Apache License 2.0.
- Library: [aws-neuron-reference-for-megatron-lm](https://github.com/aws-neuron/aws-neuron-reference-for-megatron-lm).
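These specifications can be cross-checked against the published configuration. The snippet below is a minimal sketch, not part of the original card; it uses the standard transformers AutoConfig API, and the expected values are taken from the list above.

from transformers import AutoConfig

# Load only the configuration file; no 6B-parameter weights are downloaded.
config = AutoConfig.from_pretrained("watashiha/watashiha-gpt-6b")
print(config.model_type)   # architecture family (GPT-2-based according to the card)
print(config.vocab_size)   # expected to match the documented 44,880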
Installation
No dedicated installation steps are documented; the Quick Start above only requires the torch and transformers packages.
Usage Examples
Basic Usage
Basic usage is identical to the Quick Start example above.
Advanced Usage
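The source card does not include an advanced example. As one possible extension, the sketch below samples several candidate punchlines for a single topic in one call using the standard num_return_sequences argument of generate; the prompt string and sampling settings are reused from the Quick Start.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "watashiha/watashiha-gpt-6b"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
if torch.cuda.is_available():
    model = model.to("cuda")

# Same prompt format as the Quick Start: "お題:<topic><SEP>回答:".
text = "お題:ホラー映画の「○○○」が過ぎる！<SEP>回答:"
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt").to(model.device)

# Sample three candidate punchlines in a single generate call.
output_ids = model.generate(
    token_ids,
    do_sample=True,
    max_new_tokens=32,
    top_p=0.9,
    top_k=50,
    num_return_sequences=3,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
for ids in output_ids:
    print(tokenizer.decode(ids.tolist(), skip_special_tokens=True))

Sampling several candidates and then picking the best one, either manually or with a separate scoring step, is a common way to use generative joke models; nothing model-specific is needed beyond the prompt format.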
Documentation
Training Data
Pre-training was conducted on the following corpora, totaling 47.7 billion tokens:
- Japanese data from C4.
- Japanese data from CC-100.
- Japanese data from OSCAR.
- Japanese dump data from Wikipedia.
- In-house data.
Fine-tuning was then performed on 6.93 million comedy-prompt examples.
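The card does not document the fine-tuning preprocessing, but the Quick Start prompt suggests that each example pairs a topic (お題) and an answer (回答) joined by a <SEP> token. The helper below is a hypothetical illustration of that serialization, not the actual training code.

def format_ogiri_example(topic: str, answer: str) -> str:
    # Hypothetical serialization mirroring the inference-time prompt format.
    return f"お題:{topic}<SEP>回答:{answer}"

# Placeholder strings stand in for a real topic/answer pair.
print(format_ogiri_example("<topic text>", "<answer text>"))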
Performance Comparison
The table below shows the result of fine-tuning each model under the same conditions and having a legend of mobile comedy-prompt (ogiri) contests rate the generated jokes on a four-point scale:
- Out of range: the model fails to interpret the topic as Japanese.
- One star: the model understands the topic, but the joke is not well formed (not funny).
- Two stars: the joke is well formed (funny).
- Three stars: the joke is very funny (above a certain bar of funniness).
| Model | Out of range | One star | Two stars | Three stars |
| --- | --- | --- | --- | --- |
| watashiha-gpt-6b | 77 | 204 | 175 | 44 |
| [rinna/japanese-gpt-neox-3.6b](https://huggingface.co/rinna/japanese-gpt-neox-3.6b) | 88 | 194 | 185 | 30 |
| [stabilityai/japanese-stablelm-base-alpha-7b](https://huggingface.co/stabilityai/japanese-stablelm-base-alpha-7b) | 96 | 164 | 196 | 43 |
| [elyza/ELYZA-japanese-Llama-2-7b-fast](https://huggingface.co/elyza/ELYZA-japanese-Llama-2-7b-fast) | 75 | 197 | 198 | 25 |
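To make the counts easier to compare, the short script below (not part of the original card) recomputes each model's share of jokes rated two stars or better directly from the table above; note that the per-model totals differ slightly.

# Rating counts copied from the table:
# (out of range, one star, two stars, three stars)
results = {
    "watashiha-gpt-6b": (77, 204, 175, 44),
    "rinna/japanese-gpt-neox-3.6b": (88, 194, 185, 30),
    "stabilityai/japanese-stablelm-base-alpha-7b": (96, 164, 196, 43),
    "elyza/ELYZA-japanese-Llama-2-7b-fast": (75, 197, 198, 25),
}

for name, (out_of_range, one_star, two_stars, three_stars) in results.items():
    total = out_of_range + one_star + two_stars + three_stars
    good = two_stars + three_stars
    print(f"{name}: {good}/{total} jokes rated two stars or better ({good / total:.1%})")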
Technical Details
Training ran on AWS trn1 instances using the [aws-neuron-reference-for-megatron-lm](https://github.com/aws-neuron/aws-neuron-reference-for-megatron-lm) library; no further implementation details are documented.
License
This project is licensed under the Apache License 2.0.
Developers
- UCHIDA, Tatsuya
- KOBASHI, Yohei
- KUROKI, Shuya
- KUBOTA, Hikaru
- TAKENOUCHI, Daisuke