bart-large-japanese: Open-Source Japanese BART Large Model for Text Generation and Other NLP Tasks

Bart Large Japanese

Developed by ku-nlp

A Japanese BART large model pre-trained on Japanese Wikipedia, suitable for text generation and natural language processing tasks.

Large Language Model

Transformers

Japanese #Japanese text generation #Wikipedia pre-training #Juman++ tokenization

Downloads 206

Release Date: 5/9/2023

Model Overview

This is a Japanese BART large model pre-trained on Japanese Wikipedia, primarily used for text generation and natural language processing tasks.

Model Features

Japanese-specific pre-training

Pre-trained exclusively on Japanese text, so the vocabulary and model weights are tailored to Japanese.

Based on Juman++ tokenization

Input text must be segmented into words with Juman++ in advance, matching the segmentation used during pre-training.

Large-scale training data

Pre-trained using Japanese Wikipedia (18 million sentences).

Model Capabilities

Japanese text generation

Natural language processing

Text summarization

Machine translation

Use Cases

Academic research

Natural language processing research

Used for research and experiments related to Japanese natural language processing.

Text processing

Text summarization

Generate summaries of Japanese text.

🚀 Japanese BART large

A Japanese BART large model pre-trained on Japanese Wikipedia, usable as a starting point for various natural language processing tasks.

🚀 Quick Start

You can use this model as follows:

from transformers import AutoTokenizer, MBartForConditionalGeneration
tokenizer = AutoTokenizer.from_pretrained('ku-nlp/bart-large-japanese')
model = MBartForConditionalGeneration.from_pretrained('ku-nlp/bart-large-japanese')
sentence = '京都 大学 で 自然 言語 処理 を 専攻 する 。'  # input should be segmented into words by Juman++ in advance
encoding = tokenizer(sentence, return_tensors='pt')
...

You can fine-tune this model on downstream tasks.
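As a rough illustration of what fine-tuning involves, the sketch below computes the sequence-to-sequence training loss for one source/target pair. The helper names load_model_and_tokenizer and seq2seq_loss are hypothetical; only the checkpoint name and the Juman++ pre-segmentation requirement come from this card. It assumes the tokenizer returns input_ids and attention_mask, and that the model accepts a labels argument, as Hugging Face conditional-generation models generally do.

```python
def load_model_and_tokenizer(name="ku-nlp/bart-large-japanese"):
    # Imported lazily so the helper below can be read and tested
    # without the transformers library installed.
    from transformers import AutoTokenizer, MBartForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained(name)
    model = MBartForConditionalGeneration.from_pretrained(name)
    return model, tokenizer


def seq2seq_loss(model, tokenizer, source, target):
    """Return the training loss for one (source, target) pair.

    Both strings must already be segmented into words by Juman++.
    """
    enc = tokenizer(source, return_tensors="pt")
    labels = tokenizer(target, return_tensors="pt")["input_ids"]
    outputs = model(
        input_ids=enc["input_ids"],
        attention_mask=enc["attention_mask"],
        labels=labels,
    )
    return outputs.loss
```

In a training loop, one would call backward() on the returned loss and step an optimizer. Batching, padding, and the choice of task data are left out of this sketch.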

✨ Features

This is a Japanese BART large model pre-trained on Japanese Wikipedia.
It can be used for various natural language processing tasks and fine-tuned on downstream tasks.

📚 Documentation

Tokenization

The input text should be segmented into words by Juman++ in advance. Juman++ 2.0.0-rc3 was used for pre-training. Each word is tokenized into subwords by sentencepiece.
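A minimal sketch of preparing input in the expected form. It assumes a jumanpp executable is on the PATH and prints the usual Juman-style analysis: one morpheme per line with the surface form first, lines starting with "@ " for alternative analyses, and EOS closing each sentence. The function names are hypothetical, and an installed Juman++ may differ from the 2.0.0-rc3 release used for pre-training.

```python
import subprocess
from typing import List


def surfaces_from_juman_output(raw: str) -> List[str]:
    """Collect the surface form of each morpheme from Juman-style output."""
    words = []
    for line in raw.splitlines():
        stripped = line.strip()
        # Skip blank lines, sentence terminators, and alternative analyses.
        # Note: a literal "@" token in the input is not handled by this sketch.
        if not stripped or stripped == "EOS" or line.startswith("@ "):
            continue
        words.append(line.split(" ", 1)[0])
    return words


def segment(text: str, jumanpp_cmd: str = "jumanpp") -> str:
    """Run Juman++ on one sentence and return space-separated words."""
    result = subprocess.run(
        [jumanpp_cmd],
        input=text + "\n",
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,
    )
    return " ".join(surfaces_from_juman_output(result.stdout))
```

The string returned by segment is what would be passed to the tokenizer in place of raw, unsegmented text.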

Training data

We used the following corpora for pre-training:

Japanese Wikipedia (18M sentences)

Training procedure

We first segmented texts in the corpora into words using Juman++. Then, we built a sentencepiece model with 32000 tokens including words (JumanDIC) and subwords induced by the unigram language model of sentencepiece.

We tokenized the segmented corpora into subwords using the sentencepiece model and trained the Japanese BART model using the fairseq library. The training took about 1 month using 4 Tesla V100 GPUs.

The following hyperparameters were used during pre-training:

Property	Details
distributed_type	multi-GPU
num_devices	4
batch_size	512
training_steps	250,000
encoder layers	12
decoder layers	12
hidden size	1024
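
For a sense of scale, the figures above can be combined as follows. This is back-of-the-envelope arithmetic, not a reported result: it assumes batch_size counts sentences (the card does not state the unit) and treats "about 1 month" as 30 days.

```python
training_steps = 250_000
batch_size = 512               # unit assumed to be sentences
corpus_sentences = 18_000_000  # Japanese Wikipedia, 18M sentences
num_gpus = 4
days = 30                      # "about 1 month", assumed

examples_seen = training_steps * batch_size
approx_passes = examples_seen / corpus_sentences
gpu_hours = num_gpus * days * 24
seconds_per_step = days * 24 * 3600 / training_steps

print(f"examples seen: {examples_seen:,}")
print(f"approx. passes over the corpus: {approx_passes:.1f}")
print(f"GPU-hours: {gpu_hours:,}")
print(f"seconds per step: {seconds_per_step:.1f}")
```

Under these assumptions the model processed 128 million examples, roughly 7 passes over the corpus, in about 2,880 GPU-hours.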

📄 License

This model is licensed under cc-by-sa-4.0.
