Multilingual ModernBert Base Preview
A multilingual BERT model developed by the Algomatic team, supporting mask-filling tasks with an 8,192-token context length and a 151,680-token vocabulary.
Released: 2/10/2025
Model Overview
This is a multilingual BERT model designed primarily for mask-filling tasks. It handles multiple languages and offers an extended context window, making it suitable for text understanding and infilling over long inputs.
Model Features
Long Context Support
Supports an 8,192-token context length, well suited to long-text processing tasks.
Multilingual Capability
Supports multiple languages including Korean, English, Chinese, and Japanese.
Efficient Inference
Supports FlashAttention for more efficient inference on compatible GPUs; see the loading sketch after this list.
Custom Tokenizer
Uses a tokenizer derived from the Qwen2.5 tokenizer, with a 151,680-token vocabulary, optimized to recognize code indentation.
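The sketch below shows one way to load the model with FlashAttention enabled. The repository id is a placeholder assumption (substitute the actual model id), and FlashAttention 2 additionally requires the flash-attn package and a supported GPU.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Placeholder repository id (assumption): replace with the actual model id.
model_id = "algomatic/multilingual-modernbert-base-preview"

tokenizer = AutoTokenizer.from_pretrained(model_id)
print(len(tokenizer))  # vocabulary size; expected to be around 151,680

# FlashAttention 2 needs the flash-attn package and a supported GPU;
# drop the attn_implementation argument to use the default attention.
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda")
```

Loaded this way, the model can process inputs up to its 8,192-token context length in a single pass.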
Model Capabilities
Mask Filling
Multilingual Text Understanding
Long Text Processing
Use Cases
Text Understanding and Generation
Korean Text Filling
Fills missing parts in Korean sentences.
Example result: {'score': 0.248046875, 'token': 128956, 'token_str': ' 하는', 'sequence': '우리의 대부분의 고뇌는 가능했을 또 다른 인생을 하는 데서 시작된다.'} (sequence, roughly: "Most of our anguish begins from the other life that could have been.")
English Text Filling
Fills missing parts in English sentences; the pipeline sketch after these examples reproduces this call.
Example result: {'score': 0.20703125, 'token': 5322, 'token_str': ' problems', 'sequence': 'Pinning our hopes on the unreliable notion of our potential is the root of all our problems.'}
Chinese Text Filling
Fills missing parts in Chinese sentences.
Example result: {'score': 0.177734375, 'token': 99392, 'token_str': '知道', 'sequence': '我们必须知道,我们只能成为此时此地的那个自己,而无法成为其他任何人。'} (sequence, roughly: "We must know that we can only become the self of this time and place, and cannot become anyone else.")
Japanese Text Filling
Fills missing parts in Japanese sentences.
Example result: {'score': 0.11865234375, 'token': 142732, 'token_str': 'ケーキ', 'sequence': '大きなケーキを一人で切り分けて食べるというのは孤独の極地ですからね'} (sequence, roughly: "Cutting up a large cake and eating it all by yourself is the height of loneliness, after all.")
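Outputs like the ones above can be produced with the standard transformers fill-mask pipeline, as in the sketch below. The repository id is the same placeholder assumption as in the loading sketch, and the mask token is queried from the tokenizer rather than hard-coded.

```python
from transformers import pipeline

# Placeholder repository id (assumption), as in the loading sketch above.
fill_mask = pipeline("fill-mask", model="algomatic/multilingual-modernbert-base-preview")

# Query the mask token from the tokenizer rather than assuming "[MASK]".
mask = fill_mask.tokenizer.mask_token

results = fill_mask(
    f"Pinning our hopes on the unreliable notion of our potential is the root of all our {mask}."
)
for r in results:  # top predictions, ordered by score
    print(r["score"], r["token_str"], r["sequence"])
```

Each returned dict has the same fields as the example results above: score, token, token_str, and sequence.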