🚀 makiart/multilingual-ModernBert-large-preview
This multilingual model, developed by the Algomatic team, aims to provide high-quality masked language prediction. It was trained using computational resources provided through the ABCI Generative AI Hackathon and offers a long context length and a large vocabulary for a variety of language tasks.
🚀 Quick Start
Prerequisites
Install the required package using:
pip install -U "transformers>=4.48.0"
If your GPU supports FlashAttention, you can achieve more efficient inference by installing:
pip install flash-attn --no-build-isolation
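If flash-attn is installed, you can also request it explicitly when loading the model. The snippet below is a minimal sketch using the standard attn_implementation argument of from_pretrained (with the model id used in the usage examples below); when the argument is omitted, Transformers selects a supported attention implementation automatically.

import torch
from transformers import AutoModelForMaskedLM

# Explicitly request FlashAttention 2 (requires a compatible GPU and the flash-attn package).
model = AutoModelForMaskedLM.from_pretrained(
    "makiart/multilingual-ModernBert-large",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)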
✨ Features
- Long Context Handling: With a context length of 8192 tokens, it handles long-text tasks effectively.
- Large Vocabulary: A vocabulary of 151,680 tokens covers a wide range of language expressions (a quick check of both numbers is sketched just below this list).
- Multilingual Support: Trained on the fineweb and fineweb2 datasets, it supports multiple languages.
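As a quick sanity check, both numbers can be read off the released checkpoint. This is a minimal sketch using the model id from the usage examples below; the expected values come from this card.

from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("makiart/multilingual-ModernBert-large")
tokenizer = AutoTokenizer.from_pretrained("makiart/multilingual-ModernBert-large")

print(config.max_position_embeddings)  # context length, expected 8192
print(len(tokenizer))                  # vocabulary size, expected around 151,680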
💻 Usage Examples
Basic Usage
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model = AutoModelForMaskedLM.from_pretrained("makiart/multilingual-ModernBert-large", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("makiart/multilingual-ModernBert-large")
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Korean (roughly: "Most of our anguish begins with [MASK] the other life that could have been.")
results = fill_mask("우리의 대부분의 고뇌는 가능했을 또 다른 인생을 [MASK] 데서 시작된다.")
for result in results:
    print(result)

# English
results = fill_mask("Pinning our hopes on the unreliable notion of our potential is the root of all our [MASK].")
for result in results:
    print(result)

# Chinese (roughly: "We must [MASK] that we can only be who we are, here and now, and no one else.")
results = fill_mask("我们必须[MASK],我们只能成为此时此地的那个自己,而无法成为其他任何人。")
for result in results:
    print(result)
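Each call returns a list of candidate fills; every entry is a dictionary containing the score, the predicted token, and the completed sequence. The top_k argument of the standard fill-mask pipeline (not specific to this model) limits how many candidates are returned:

# Keep only the three highest-scoring predictions for the English example above.
results = fill_mask(
    "Pinning our hopes on the unreliable notion of our potential is the root of all our [MASK].",
    top_k=3,
)
for result in results:
    print(result["token_str"], round(result["score"], 3))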
Advanced Usage
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model = AutoModelForMaskedLM.from_pretrained("makiart/multilingual-ModernBert-large", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("makiart/multilingual-ModernBert-large")
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Japanese (roughly: "The skill to pick exactly the ingredient you want out of the pot, even in [MASK].")
results = fill_mask("たとえ[MASK]の中であっても鍋から的確に意中の具をつまみだせる技術")
for result in results:
    print(result)
📚 Documentation
Model Description
- Training Approach:
  - The weights are inherited from the base model by tiling them from the middle.
  - Training used approximately 60B tokens with a context length of 8192.
- Tokenizer: Based on Qwen2.5, with a vocabulary size of 151,680 tokens. It has been customized to distinguish indentation, making it better suited to code text (see the sketch after this list).
- Dataset:
  - Uses the fineweb and fineweb2 datasets.
  - For languages with abundant data, the volume was downsampled.
- Computational Resources: Training ran on a single node (H200 x 8) provided by ABCI over approximately 2 days.
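Since the tokenizer is customized to distinguish indentation (see the Tokenizer item above), one illustrative way to observe this is to tokenize a small indented snippet. This is only a sketch using the model id from the usage examples above; the exact token strings are not documented in this card.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("makiart/multilingual-ModernBert-large")

# Indentation-heavy code: the customized tokenizer should keep leading
# whitespace distinguishable rather than collapsing it.
code = 'def greet(name):\n    if name:\n        return f"Hello, {name}"\n'
print(tokenizer.tokenize(code))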
Evaluation
A comprehensive evaluation has not yet been performed 😭. Given the total number of training tokens, the model may be less competitive than existing models.
📄 License
This project is licensed under the MIT license.
| Property | Details |
|----------|---------|
| Model Type | Multilingual Masked Language Model |
| Training Data | fineweb, fineweb2 |