🚀 Japanese BART base
A Japanese BART base model pre-trained on Japanese Wikipedia, useful for various natural language processing tasks.
🚀 Quick Start
You can use this model as follows:
```python
from transformers import AutoTokenizer, MBartForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained('ku-nlp/bart-base-japanese')
model = MBartForConditionalGeneration.from_pretrained('ku-nlp/bart-base-japanese')

# The input must be segmented into words by Juman++ in advance (see Tokenization below).
sentence = '京都 大学 で 自然 言語 処理 を 専攻 する 。'
encoding = tokenizer(sentence, return_tensors='pt')
...
```
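The snippet above stops after encoding. As a hedged continuation (not part of the original card), one way to run the pre-trained model end to end is to call `generate` on the encoding; `max_length` here is an illustrative choice:

```python
# Hedged continuation of the snippet above (not from the original card).
# If the checkpoint's config lacks a decoder_start_token_id, it may need to
# be passed to generate() explicitly.
output_ids = model.generate(**encoding, max_length=32)  # max_length is illustrative
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# Without fine-tuning, BART's denoising pre-training means the output is
# expected to roughly reconstruct the input sentence.
```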
You can fine-tune this model on downstream tasks.
✨ Features
This is a Japanese BART base model pre-trained on Japanese Wikipedia.
📦 Installation
No model-specific installation steps are required beyond the usual Transformers setup (e.g. `pip install transformers sentencepiece`); Juman++ must be installed separately for word segmentation (see Tokenization below).
💻 Usage Examples
Basic Usage
The basic usage is the same as the Quick Start example above: load the tokenizer and model with `AutoTokenizer` and `MBartForConditionalGeneration`, pass a Juman++-segmented sentence to the tokenizer, and feed the resulting encoding to the model.
Advanced Usage
You can fine-tune this model on downstream tasks; fine-tuning code is not included with this model card. A minimal sketch of what one fine-tuning step could look like is given below.
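Since no fine-tuning code is provided, the following is only a minimal sketch of one possible supervised seq2seq fine-tuning step using the Transformers and PyTorch APIs; the source/target pair and the learning rate are illustrative placeholders, and real data must be segmented into words with Juman++ first.

```python
# Minimal fine-tuning sketch (not part of the original model card).
# Source/target pairs are illustrative placeholders; real inputs must be
# segmented into words with Juman++ beforehand.
import torch
from transformers import AutoTokenizer, MBartForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained('ku-nlp/bart-base-japanese')
model = MBartForConditionalGeneration.from_pretrained('ku-nlp/bart-base-japanese')

source = '京都 大学 で 自然 言語 処理 を 専攻 する 。'  # placeholder source
target = '京都 大学 で 自然 言語 処理 を 学ぶ 。'      # placeholder target

batch = tokenizer(source, text_target=target, return_tensors='pt')
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # illustrative learning rate

model.train()
outputs = model(**batch)          # labels from text_target give a cross-entropy loss
outputs.loss.backward()
optimizer.step()
```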
📚 Documentation
Tokenization
The input text should be segmented into words by Juman++ in advance; Juman++ 2.0.0-rc3 was used for pre-training. Each word is then tokenized into subwords by SentencePiece.
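The card does not show how to perform this segmentation from Python; one common option (an assumption here) is the pyknp wrapper around Juman++:

```python
# One possible pre-segmentation step (an assumption; the card only states that
# Juman++ must be used). Requires a Juman++ binary and `pip install pyknp`.
from pyknp import Juman

jumanpp = Juman()  # by default pyknp launches the `jumanpp` command found on PATH
raw = '京都大学で自然言語処理を専攻する。'
words = [m.midasi for m in jumanpp.analysis(raw).mrph_list()]
segmented = ' '.join(words)
# segmented should look like: '京都 大学 で 自然 言語 処理 を 専攻 する 。'
```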
Training data
We used the following corpora for pre-training:
- Japanese Wikipedia (18M sentences)
Training procedure
We first segmented the texts in the corpora into words using Juman++.
Then, we built a SentencePiece model with 32,000 tokens, consisting of words from JumanDIC and subwords induced by SentencePiece's unigram language model.
We tokenized the segmented corpora into subwords with this SentencePiece model and trained the Japanese BART model using the fairseq library.
The training took 2 weeks using 4 Tesla V100 GPUs.
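The pre-training scripts themselves are not included in this card; the following is only a hedged illustration of the SentencePiece step described above, with placeholder file names and a placeholder JumanDIC word list:

```python
# Hedged illustration of the subword-model step (paths and the JumanDIC word
# list are placeholders; the actual pre-training scripts are not included here).
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='wikipedia_segmented.txt',        # Juman++-segmented corpus, one sentence per line
    model_prefix='japanese_bart_sp',
    vocab_size=32000,                       # 32,000 tokens as described above
    model_type='unigram',                   # unigram language model of SentencePiece
    user_defined_symbols=['京都', '大学'],   # placeholder for words taken from JumanDIC
)
sp = spm.SentencePieceProcessor(model_file='japanese_bart_sp.model')
```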
The following hyperparameters were used during pre-training:
- distributed_type: multi-GPU
- num_devices: 4
- batch_size: 512
- training_steps: 500,000
- encoder layers: 6
- decoder layers: 6
- hidden size: 768
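For reference, a hedged sketch of how these architecture numbers map onto a Transformers `MBartConfig` (illustrative only; the released checkpoint already ships its own configuration):

```python
# Illustrative only: the architecture hyperparameters above expressed as a
# Transformers MBartConfig. The released checkpoint ships its own config.
from transformers import MBartConfig

config = MBartConfig(
    vocab_size=32000,     # SentencePiece vocabulary described above
    encoder_layers=6,
    decoder_layers=6,
    d_model=768,          # hidden size
)
print(config.encoder_layers, config.decoder_layers, config.d_model)
```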
🔧 Technical Details
The model is pre-trained on Japanese Wikipedia. Input text must be pre-processed with Juman++ for word segmentation and then tokenized into subwords with SentencePiece. Pre-training consisted of building a 32,000-token SentencePiece model and training with the fairseq library under the hyperparameters listed above, on 4 Tesla V100 GPUs for 2 weeks.
📄 License
The model is licensed under CC BY-SA 4.0.