albert-base-japanese-v1: Open-Source Japanese ALBERT Model for Fill-Mask Tasks

Albert Base Japanese V1

Developed by ken11

This is a pre-trained Japanese ALBERT model. It is intended as a base for fine-tuning and can also be used directly for fill-mask prediction on Japanese text.

Large Language Model

Transformers

Japanese

Open Source License: MIT

#Japanese Fill-Mask #ALBERT Lightweight Architecture #Wikipedia Training

Release Time: 3/2/2022

Model Overview

This is a Japanese pre-trained model based on the ALBERT architecture. It is intended to be fine-tuned for downstream natural language processing tasks and can be used as-is for fill-mask prediction.

Model Features

Japanese-Specific

Pre-trained on Japanese text

ALBERT Architecture

Uses the ALBERT architecture, which keeps the parameter count low through cross-layer parameter sharing and factorized embeddings

SentencePiece Tokenization

Uses SentencePiece as the tokenizer, which splits raw Japanese text into subword pieces without a separate word segmenter
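
To make the parameter-efficiency point of the ALBERT architecture concrete, here is a minimal sketch that compares the embedding parameter count of a single vocabulary-by-hidden matrix with ALBERT's factorized layout (a vocabulary-by-embedding lookup followed by an embedding-by-hidden projection). The embedding size 128 and hidden size 768 are the ALBERT-base values from the ALBERT paper. The vocabulary size of 32,000 is an assumption for illustration and is not taken from this model's configuration.

```python
def full_embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Parameters in one vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size


def factorized_embedding_params(vocab_size: int, embedding_size: int, hidden_size: int) -> int:
    """Parameters in a V x E lookup plus an E x H projection (bias terms ignored)."""
    return vocab_size * embedding_size + embedding_size * hidden_size


VOCAB_SIZE = 32_000    # assumption for illustration, not read from this model's config
EMBEDDING_SIZE = 128   # ALBERT-base embedding size
HIDDEN_SIZE = 768      # ALBERT-base hidden size

full = full_embedding_params(VOCAB_SIZE, HIDDEN_SIZE)
factorized = factorized_embedding_params(VOCAB_SIZE, EMBEDDING_SIZE, HIDDEN_SIZE)

print(f"single matrix: {full:,} parameters")
print(f"factorized:    {factorized:,} parameters")
print(f"reduction:     {full / factorized:.1f}x")
```

Cross-layer parameter sharing reduces the encoder's parameter count further. This sketch covers only the embedding part.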

Model Capabilities

Japanese Text Understanding

Fill-Mask Prediction

Fine-Tuning for NLP Tasks

Use Cases

Academic Research

Disciplinary Field Prediction

Predicting the academic field named in a sentence about someone's research

For the example sentence in the usage section, the top-5 predictions include '心理学' (psychology) and '数学' (mathematics)

Text Completion

Sentence Completion

Automatically completing missing parts in Japanese sentences

Suggests candidate tokens for the masked position, ranked by the model's score given the surrounding context

🚀 albert-base-japanese-v1

This is a pre-trained ALBERT model for the Japanese language.

✨ Features

This is a pre-trained ALBERT model for the Japanese language, which can be used for tasks such as fill-mask.

💻 Usage Examples

Basic Usage

This is a pre-trained model and is mainly intended to be fine-tuned for downstream tasks.

Advanced Usage - Fill-Mask

This model uses SentencePiece for its tokenizer. Tokenization inserts an extra token right after the [MASK] token, so that token has to be removed before running inference, as follows:

For PyTorch

from transformers import (
    AlbertForMaskedLM, AlbertTokenizerFast
)
import torch


tokenizer = AlbertTokenizerFast.from_pretrained("ken11/albert-base-japanese-v1")
model = AlbertForMaskedLM.from_pretrained("ken11/albert-base-japanese-v1")

text = "大学で[MASK]の研究をしています"
tokenized_text = tokenizer.tokenize(text)
# Remove the extra token that appears right after [MASK]
del tokenized_text[tokenized_text.index(tokenizer.mask_token) + 1]

# Build the input by hand: [CLS] + tokens + [SEP]
input_ids = [tokenizer.cls_token_id]
input_ids.extend(tokenizer.convert_tokens_to_ids(tokenized_text))
input_ids.append(tokenizer.sep_token_id)

# Batch of one sequence, single segment, no padding
inputs = {"input_ids": [input_ids], "token_type_ids": [[0]*len(input_ids)], "attention_mask": [[1]*len(input_ids)]}
batch = {k: torch.tensor(v, dtype=torch.int64) for k, v in inputs.items()}
# The first output holds the logits, shaped (batch, sequence, vocabulary)
output = model(**batch)[0]
# Take the five highest-scoring token ids at the [MASK] position
_, result = output[0, input_ids.index(tokenizer.mask_token_id)].topk(5)

print(tokenizer.convert_ids_to_tokens(result.tolist()))
# ['英語', '心理学', '数学', '医学', '日本語']
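
The example above prints only the top token strings. To see how strongly each candidate is preferred, the logits at the masked position can be turned into probabilities with a softmax. The following is a minimal, framework-free sketch of that step. The function name `top_k_with_probabilities`, the toy vocabulary, and the toy logits are all hypothetical and are not output from this model.

```python
import math


def top_k_with_probabilities(logits, id_to_token, k=5):
    """Return the k most probable (token, probability) pairs for one position."""
    # Subtract the largest logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank token ids by probability, highest first, and keep the first k.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(id_to_token[i], probs[i]) for i in ranked]


# Hypothetical six-token vocabulary and logits, for illustration only
toy_vocab = ["英語", "心理学", "数学", "医学", "日本語", "猫"]
toy_logits = [4.0, 3.5, 3.0, 2.5, 2.0, -1.0]

for token, prob in top_k_with_probabilities(toy_logits, toy_vocab, k=3):
    print(f"{token}\t{prob:.3f}")
```

On real model output, applying `torch.softmax(..., dim=-1)` to the logits tensor at the masked position before calling `topk` would be the more usual route. The pure-Python version is only meant to show the arithmetic.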

For TensorFlow

from transformers import (
    TFAlbertForMaskedLM, AlbertTokenizerFast
)
import tensorflow as tf


tokenizer = AlbertTokenizerFast.from_pretrained("ken11/albert-base-japanese-v1")
model = TFAlbertForMaskedLM.from_pretrained("ken11/albert-base-japanese-v1")

text = "大学で[MASK]の研究をしています"
tokenized_text = tokenizer.tokenize(text)
# Remove the extra token that appears right after [MASK]
del tokenized_text[tokenized_text.index(tokenizer.mask_token) + 1]

# Build the input by hand: [CLS] + tokens + [SEP]
input_ids = [tokenizer.cls_token_id]
input_ids.extend(tokenizer.convert_tokens_to_ids(tokenized_text))
input_ids.append(tokenizer.sep_token_id)

# Batch of one sequence, single segment, no padding
inputs = {"input_ids": [input_ids], "token_type_ids": [[0]*len(input_ids)], "attention_mask": [[1]*len(input_ids)]}
batch = {k: tf.convert_to_tensor(v, dtype=tf.int32) for k, v in inputs.items()}
# The first output holds the logits, shaped (batch, sequence, vocabulary)
output = model(**batch)[0]
# Take the five highest-scoring token ids at the [MASK] position
result = tf.math.top_k(output[0, input_ids.index(tokenizer.mask_token_id)], k=5)

# Convert to a plain list of Python ints before looking up the tokens
print(tokenizer.convert_ids_to_tokens(result.indices.numpy().tolist()))
# ['英語', '心理学', '数学', '医学', '日本語']
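
Both examples above delete the token after [MASK] with a single `del` statement, which handles only the first mask and raises an `IndexError` if [MASK] is the last token. As a sketch of a slightly more defensive variant, the hypothetical helper below removes the token following every mask and tolerates a mask at the end. It keeps the same assumption as the examples, namely that the token right after each [MASK] is the unwanted one. The token pieces in the demo are made up for illustration and are not actual tokenizer output.

```python
def drop_token_after_mask(tokens, mask_token="[MASK]"):
    """Return a copy of tokens with the one token after each mask removed."""
    cleaned = []
    skip_next = False
    for token in tokens:
        if skip_next:
            skip_next = False
            # Drop this token unless it is itself a mask, which must be kept.
            if token != mask_token:
                continue
        cleaned.append(token)
        if token == mask_token:
            skip_next = True
    return cleaned


# Hypothetical token pieces, for illustration only
pieces = ["大学", "で", "[MASK]", "▁", "の", "研究"]
print(drop_token_after_mask(pieces))
```

The returned list can then be passed to `tokenizer.convert_tokens_to_ids` in the same way as `tokenized_text` in the examples.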

📚 Documentation

Training Data

The model was trained on Japanese Wikipedia text.

Tokenizer

The tokenizer is a SentencePiece model. It was trained on the same data as above.

📄 License

MIT License
