RetrievaBERT Model
RetrievaBERT is a Transformer encoder pre-trained with Megatron-LM and designed specifically for Japanese. It offers advanced features and capabilities for natural language processing tasks.
Quick Start
RetrievaBERT is a pre-trained Transformer encoder designed for Japanese. You can use it directly as a Masked Language Model (MLM) or fine-tune it for downstream tasks.
Features
What's New
- November 2024 (v1.0.1): Bug fix for the model parameters. The bias of `up_proj` had been initialized with the bias of the gate projection; this has been fixed.
Model Details
Model Description
RetrievaBERT is a Transformer encoder pre-trained with Megatron-LM and tailored for Japanese. It has several advanced features compared to traditional BERT models (a configuration check follows this list):
- PreNorm: Improves stability during training.
- SwiGLU: An enhanced activation function for better performance.
- Grouped-Query Attention (Multi-Query Attention): An efficient attention mechanism.
- Max Sequence Length: 2048 tokens, allowing for longer context.
- Parameters: 1.3 billion parameters.
- Pre-training Objective: Masked Language Modeling (MLM) only, not Next Sentence Prediction (NSP).
- Token Type IDs: Not used in this model.
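The limits above can be verified against the published configuration. This is a minimal sketch: the attribute names (`max_position_embeddings`, `num_hidden_layers`, `hidden_size`) follow common Hugging Face conventions and are assumptions here, since RetrievaBERT ships a custom configuration class.
```python
from transformers import AutoConfig

# trust_remote_code is required because RetrievaBERT uses a custom implementation.
config = AutoConfig.from_pretrained("retrieva-jp/bert-1.3b", trust_remote_code=True)

# Attribute names are assumed from common Hugging Face configs; adjust if they differ.
print(getattr(config, "max_position_embeddings", None))  # expected: 2048
print(getattr(config, "num_hidden_layers", None))
print(getattr(config, "hidden_size", None))
```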
Model Sources
- Developed by: Retrieva, Inc.
- Model type: Based on MegatronBERT Architecture.
- Language(s) (NLP): Primarily Japanese (optional support for English).
- License: Apache 2.0
Installation
No model-specific installation is required: the model is loaded through the Hugging Face `transformers` library (with a PyTorch backend), as shown in the usage examples below.
Usage Examples
Basic Usage
This model can be used as a Masked Language Model (MLM). The mask token is `<MASK|LLM-jp>`. Note that you need to set `trust_remote_code=True`, because RetrievaBERT uses a custom model implementation.
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_id = "retrieva-jp/bert-1.3b"
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("fill-mask", model=model, tokenizer=tokenizer)

text = "こんにちは！私の名前は<MASK|LLM-jp>です！"
print(pipe(text))
```
Advanced Usage
RetrievaBERT is compatible with Hugging Face's AutoModel classes. To fine-tune RetrievaBERT for your specific task, use the corresponding AutoModel class, as sketched below. For detailed configuration, refer to the model's config.json file.
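The snippet below is a minimal fine-tuning setup, not the authors' recipe: it assumes a sequence-classification task and that the custom code provides a classification head; `num_labels=2` and the example sentence are purely illustrative. Training itself can then proceed with the standard `Trainer` API or a custom loop.
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "retrieva-jp/bert-1.3b"

# trust_remote_code=True is required because RetrievaBERT uses a custom implementation.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=2,  # illustrative: a binary classification task
    trust_remote_code=True,
)

# The model does not use token type IDs, so they are not requested here.
inputs = tokenizer("これはテストです。",  # "This is a test."
                   return_tensors="pt", return_token_type_ids=False)
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2])
```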
Documentation
Uses
This model can be used as a Masked Language Model (MLM), but it is mainly intended to be fine-tuned on downstream tasks.
Training Details
Training Data
The RetrievaBERT model was pre-trained on the union of five datasets:
- [Japanese CommonCrawl Dataset by LLM-jp](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v2)
- [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)
- Chinese Wikipedia (dumped on 2024-01-20)
- Korean Wikipedia (dumped on 2024-01-20)
- [The Stack](https://huggingface.co/datasets/bigcode/the-stack)
The model was trained on 180 billion tokens drawn from the datasets above.
Training Procedure
The model was trained on 4 to 32 H100 GPUs with a batch size of 1,024. A curriculum learning scheme similar to Sequence Length Warmup was adopted, training with the following sequence lengths and numbers of steps (a rough token-count check follows the list):
- Sequence length 128: 31,000 steps.
- Sequence length 256: 219,000 steps.
- Sequence length 512: 192,000 steps.
- Sequence length 2048: 12,000 steps.
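As a rough sanity check on the 180 billion token figure above, the per-phase token counts can be approximated as batch size × sequence length × steps. This is a back-of-the-envelope sketch; it ignores padding and data-packing details, so it only needs to land in the right ballpark.
```python
# Approximate tokens processed per curriculum phase: batch_size * seq_len * steps.
batch_size = 1024
phases = [(128, 31_000), (256, 219_000), (512, 192_000), (2048, 12_000)]

total_tokens = 0
for seq_len, steps in phases:
    tokens = batch_size * seq_len * steps
    total_tokens += tokens
    print(f"seq_len={seq_len:4d}: {tokens / 1e9:6.1f}B tokens")

# Comes out to roughly 187B, in the same ballpark as the reported 180B.
print(f"total: {total_tokens / 1e9:.0f}B tokens")
```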
Training Hyperparameters
The model was trained with the following hyperparameters (a sketch of the resulting learning-rate schedule follows the list):
- Learning rate: 1.5e-4
- Learning rate decay style: Linear
- Learning rate warmup fraction: 0.01
- Minimum learning rate: 1e-6
- Floating-point format: BF16
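The sketch below illustrates the schedule these settings describe: linear warmup over the first 1% of steps up to 1.5e-4, then linear decay toward the 1e-6 floor. It is an illustration of the stated hyperparameters, not Megatron-LM's exact scheduler.
```python
def learning_rate(step: int, total_steps: int,
                  peak_lr: float = 1.5e-4,
                  min_lr: float = 1e-6,
                  warmup_fraction: float = 0.01) -> float:
    """Linear warmup followed by linear decay, as described in the card.

    This mirrors the stated hyperparameters; Megatron-LM's own scheduler
    may differ in details (e.g. how the warmup boundary is handled).
    """
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)
    # Linear decay from peak_lr down to the min_lr floor over the remaining steps.
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return max(min_lr, peak_lr - (peak_lr - min_lr) * progress)

# Example: total steps across all curriculum phases (31k + 219k + 192k + 12k).
total = 454_000
for s in (0, 4_540, 100_000, 454_000):
    print(s, learning_rate(s, total))
```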
Evaluation
The following models were fine-tuned and evaluated on the JGLUE development set. The learning rate and number of training epochs were adjusted for each model and task according to the JGLUE paper.
| Model | MARC-ja/acc | JSTS/pearson | JSTS/spearman | JNLI/acc | JSQuAD/EM | JSQuAD/F1 | JComQA/acc |
|---|---|---|---|---|---|---|---|
| tohoku-nlp/bert-base-japanese-v3 | 0.957 | 0.914 | 0.876 | 0.906 | 0.878 | 0.946 | 0.849 |
| tohoku-nlp/bert-large-japanese-v2 | 0.959 | 0.916 | 0.877 | 0.901 | 0.884 | 0.951 | 0.867 |
| ku-nlp/deberta-v3-base-japanese | 0.958 | 0.925 | 0.890 | 0.902 | 0.925 | 0.910 | 0.882 |
| retrieva-jp/bert-1.3b | 0.959 | 0.917 | 0.881 | 0.898 | 0.875 | 0.874 | 0.827 |
Technical Details
Model Architectures
The RetrievaBERT model is based on BERT with the following hyperparameters:
- Number of layers: 48
- Hidden layer size: 1536
- FFN hidden layer size: 4096
- Number of attention heads: 24
- Maximum length of position embeddings: 2048
The main differences from the original BERT are listed below; a minimal SwiGLU sketch follows the list.
- PreNorm: Improved stability during training.
- SwiGLU: Enhanced activation function for better performance.
- Grouped-Query Attention (Multi-Query Attention): Efficient attention mechanism.
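For readers unfamiliar with SwiGLU, the block below shows its general form: a SiLU-activated gate projection modulating an up projection, followed by a down projection. It is a generic illustration using the hidden sizes listed above, not RetrievaBERT's actual module (the v1.0.1 fix mentioned earlier concerned the bias of the up projection in this kind of block).
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Generic SwiGLU feed-forward block: down(SiLU(gate(x)) * up(x)).

    Dimensions follow the card's hyperparameters (hidden 1536, FFN 4096),
    but this is an illustration, not the model's actual implementation.
    """

    def __init__(self, hidden_size: int = 1536, ffn_hidden_size: int = 4096):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, ffn_hidden_size)
        self.up_proj = nn.Linear(hidden_size, ffn_hidden_size)
        self.down_proj = nn.Linear(ffn_hidden_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU-gated linear unit: the gate modulates the up projection.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

# Quick shape check with a dummy batch.
ffn = SwiGLUFeedForward()
print(ffn(torch.randn(2, 8, 1536)).shape)  # torch.Size([2, 8, 1536])
```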
Compute Infrastructure
TSUBAME 4
This model is based on results obtained from the [TSUBAME deep-learning mini-camp](https://www.t4.gsic.titech.ac.jp/en/minicamp-dl-202406).
Software
The model was trained using [Megatron-LM](https://github.com/NVIDIA/Megatron-LM).
License
The model is licensed under Apache 2.0.
More Information
https://note.com/retrieva/n/n715bea2c2cd1 (in Japanese)
Model Card Authors
Satoru Katsumata, Daisuke Kimura, Jiro Nishitoba
Model Card Contact
pr@retrieva.jp