RetrievaBERT Model
RetrievaBERT is a Transformer encoder pre-trained with Megatron-LM and designed specifically for Japanese. It offers advanced features and capabilities for natural language processing tasks.
Quick Start
RetrievaBERT is a pre-trained Transformer encoder designed for Japanese. You can use it directly as a Masked Language Model (MLM) or fine-tune it for downstream tasks.
Features
What's New
- November 2024 (v1.0.1): Bug fix for the model parameters. The bias of `up_proj` had been initialized with the bias of the gate projection; this has been fixed.
Model Details
Model Description
RetrievaBERT is a Transformer encoder pre-trained with Megatron-LM and tailored for Japanese. It has several advanced features compared to traditional BERT models (a configuration check follows this list):
- PreNorm: Improves stability during training.
- SwiGLU: An enhanced activation function for better performance.
- Grouped-Query Attention (Multi-Query Attention): An efficient attention mechanism.
- Max Sequence Length: 2048 tokens, allowing for longer context.
- Parameters: 1.3 billion parameters.
- Pre-training Objective: Masked Language Modeling (MLM) only, not Next Sentence Prediction (NSP).
- Token Type IDs: Not used in this model.
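The limits above can be verified against the published configuration. This is a minimal sketch: the attribute names (`max_position_embeddings`, `num_hidden_layers`, `hidden_size`) follow common Hugging Face conventions and are assumptions here, since RetrievaBERT ships a custom configuration class.
```python
from transformers import AutoConfig

# trust_remote_code is required because RetrievaBERT uses a custom implementation.
config = AutoConfig.from_pretrained("retrieva-jp/bert-1.3b", trust_remote_code=True)

# Attribute names are assumed from common Hugging Face configs; adjust if they differ.
print(getattr(config, "max_position_embeddings", None))  # expected: 2048
print(getattr(config, "num_hidden_layers", None))
print(getattr(config, "hidden_size", None))
```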
Model Sources
- Developed by: Retrieva, Inc.
- Model type: Based on MegatronBERT Architecture.
- Language(s) (NLP): Primarily Japanese (optional support for English).
- License: Apache 2.0
Installation
No model-specific installation is required: the model is loaded through the Hugging Face `transformers` library (with a PyTorch backend), as shown in the usage examples below.
Usage Examples
Basic Usage
This model can be used as a Masked Language Model (MLM). The mask token is `<MASK|LLM-jp>`. Note that you need to set `trust_remote_code=True`, because RetrievaBERT uses a custom model implementation.
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_id = "retrieva-jp/bert-1.3b"
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("fill-mask", model=model, tokenizer=tokenizer)

text = "こんにちは！私の名前は<MASK|LLM-jp>です！"
print(pipe(text))
```
Advanced Usage
RetrievaBERT is compatible with Hugging Face's AutoModel classes. To fine-tune RetrievaBERT for your specific task, use the corresponding AutoModel class, as sketched below. For detailed configuration, refer to the model's config.json file.
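The snippet below is a minimal fine-tuning setup, not the authors' recipe: it assumes a sequence-classification task and that the custom code provides a classification head; `num_labels=2` and the example sentence are purely illustrative. Training itself can then proceed with the standard `Trainer` API or a custom loop.
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "retrieva-jp/bert-1.3b"

# trust_remote_code=True is required because RetrievaBERT uses a custom implementation.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=2,  # illustrative: a binary classification task
    trust_remote_code=True,
)

# The model does not use token type IDs, so they are not requested here.
inputs = tokenizer("これはテストです。",  # "This is a test."
                   return_tensors="pt", return_token_type_ids=False)
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2])
```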
Documentation
Uses
This model can be used as a Masked Language Model (MLM), but it is mainly intended to be fine-tuned on downstream tasks.
Training Details
Training Data
The RetrievaBERT model was pre-trained on the union of five datasets:
- [Japanese CommonCrawl Dataset by LLM-jp](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v2)
- [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)
- Chinese Wikipedia (dumped on 2024-01-20)
- Korean Wikipedia (dumped on 2024-01-20)
- [The Stack](https://huggingface.co/datasets/bigcode/the-stack)
The model was trained on 180 billion tokens drawn from the datasets above.
Training Procedure
The model was trained on 4 to 32 H100 GPUs with a batch size of 1,024. A curriculum learning scheme similar to Sequence Length Warmup was adopted, training with the following sequence lengths and numbers of steps (a rough token-count check follows the list):
- Sequence length 128: 31,000 steps.
- Sequence length 256: 219,000 steps.
- Sequence length 512: 192,000 steps.
- Sequence length 2048: 12,000 steps.
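As a rough sanity check on the 180 billion token figure above, the per-phase token counts can be approximated as batch size × sequence length × steps. This is a back-of-the-envelope sketch; it ignores padding and data-packing details, so it only needs to land in the right ballpark.
```python
# Approximate tokens processed per curriculum phase: batch_size * seq_len * steps.
batch_size = 1024
phases = [(128, 31_000), (256, 219_000), (512, 192_000), (2048, 12_000)]

total_tokens = 0
for seq_len, steps in phases:
    tokens = batch_size * seq_len * steps
    total_tokens += tokens
    print(f"seq_len={seq_len:4d}: {tokens / 1e9:6.1f}B tokens")

# Comes out to roughly 187B, in the same ballpark as the reported 180B.
print(f"total: {total_tokens / 1e9:.0f}B tokens")
```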
Training Hyperparameters
The model was trained with the following hyperparameters (a sketch of the resulting learning-rate schedule follows the list):
- Learning rate: 1.5e-4
- Learning rate decay style: Linear
- Learning rate warmup fraction: 0.01
- Minimum learning rate: 1e-6
- Floating-point format: BF16
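The sketch below illustrates the schedule these settings describe: linear warmup over the first 1% of steps up to 1.5e-4, then linear decay toward the 1e-6 floor. It is an illustration of the stated hyperparameters, not Megatron-LM's exact scheduler.
```python
def learning_rate(step: int, total_steps: int,
                  peak_lr: float = 1.5e-4,
                  min_lr: float = 1e-6,
                  warmup_fraction: float = 0.01) -> float:
    """Linear warmup followed by linear decay, as described in the card.

    This mirrors the stated hyperparameters; Megatron-LM's own scheduler
    may differ in details (e.g. how the warmup boundary is handled).
    """
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)
    # Linear decay from peak_lr down to the min_lr floor over the remaining steps.
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return max(min_lr, peak_lr - (peak_lr - min_lr) * progress)

# Example: total steps across all curriculum phases (31k + 219k + 192k + 12k).
total = 454_000
for s in (0, 4_540, 100_000, 454_000):
    print(s, learning_rate(s, total))
```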
Evaluation
The following models were fine-tuned and evaluated on the JGLUE development set. The learning rate and number of training epochs were adjusted for each model and task according to the JGLUE paper.
| Model | MARC-ja/acc | JSTS/pearson | JSTS/spearman | JNLI/acc | JSQuAD/EM | JSQuAD/F1 | JComQA/acc |
|---|---|---|---|---|---|---|---|
| tohoku-nlp/bert-base-japanese-v3 | 0.957 | 0.914 | 0.876 | 0.906 | 0.878 | 0.946 | 0.849 |
| tohoku-nlp/bert-large-japanese-v2 | 0.959 | 0.916 | 0.877 | 0.901 | 0.884 | 0.951 | 0.867 |
| ku-nlp/deberta-v3-base-japanese | 0.958 | 0.925 | 0.890 | 0.902 | 0.925 | 0.910 | 0.882 |
| retrieva-jp/bert-1.3b | 0.959 | 0.917 | 0.881 | 0.898 | 0.875 | 0.874 | 0.827 |
Technical Details
Model Architectures
The RetrievaBERT model is based on BERT with the following hyperparameters:
- Number of layers: 48
- Hidden layer size: 1536
- FFN hidden layer size: 4096
- Number of attention heads: 24
- Maximum length of position embeddings: 2048
The main differences from the original BERT are listed below; a minimal SwiGLU sketch follows the list.
- PreNorm: Improved stability during training.
- SwiGLU: Enhanced activation function for better performance.
- Grouped-Query Attention (Multi-Query Attention): Efficient attention mechanism.
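For readers unfamiliar with SwiGLU, the block below shows its general form: a SiLU-activated gate projection modulating an up projection, followed by a down projection. It is a generic illustration using the hidden sizes listed above, not RetrievaBERT's actual module (the v1.0.1 fix mentioned earlier concerned the bias of the up projection in this kind of block).
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Generic SwiGLU feed-forward block: down(SiLU(gate(x)) * up(x)).

    Dimensions follow the card's hyperparameters (hidden 1536, FFN 4096),
    but this is an illustration, not the model's actual implementation.
    """

    def __init__(self, hidden_size: int = 1536, ffn_hidden_size: int = 4096):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, ffn_hidden_size)
        self.up_proj = nn.Linear(hidden_size, ffn_hidden_size)
        self.down_proj = nn.Linear(ffn_hidden_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU-gated linear unit: the gate modulates the up projection.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

# Quick shape check with a dummy batch.
ffn = SwiGLUFeedForward()
print(ffn(torch.randn(2, 8, 1536)).shape)  # torch.Size([2, 8, 1536])
```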
Compute Infrastructure
TSUBAME 4
This model is based on results obtained from the [TSUBAME deep-learning mini-camp](https://www.t4.gsic.titech.ac.jp/en/minicamp-dl-202406).
Software
The model was trained using [Megatron-LM](https://github.com/NVIDIA/Megatron-LM).
License
The model is licensed under Apache 2.0.
More Information
https://note.com/retrieva/n/n715bea2c2cd1 (in Japanese)
Model Card Authors
Satoru Katsumata, Daisuke Kimura, Jiro Nishitoba
Model Card Contact
pr@retrieva.jp