🚀 makiart/multilingual-ModernBert-large-preview
This multilingual model, developed by the Algomatic team, aims to provide high-quality masked language prediction. It was trained using computational resources provided through the ABCI Generative AI Hackathon and offers a long context length and a large vocabulary for a variety of language tasks.
🚀 Quick Start
Prerequisites
Install the required package using:
pip install -U "transformers>=4.48.0"
If your GPU supports FlashAttention, you can achieve more efficient inference by installing:
pip install flash-attn --no-build-isolation
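If flash-attn is installed, you can also request it explicitly when loading the model. The snippet below is a minimal sketch using the standard attn_implementation argument of from_pretrained (with the model id used in the usage examples below); when the argument is omitted, Transformers selects a supported attention implementation automatically.

import torch
from transformers import AutoModelForMaskedLM

# Explicitly request FlashAttention 2 (requires a compatible GPU and the flash-attn package).
model = AutoModelForMaskedLM.from_pretrained(
    "makiart/multilingual-ModernBert-large",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)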
✨ Features
- Long Context Handling: With a context length of 8192 tokens, it handles long-text tasks effectively.
- Large Vocabulary: A vocabulary of 151,680 tokens covers a wide range of language expressions (a quick check of both numbers is sketched just below this list).
- Multilingual Support: Trained on the fineweb and fineweb2 datasets, it supports multiple languages.
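As a quick sanity check, both numbers can be read off the released checkpoint. This is a minimal sketch using the model id from the usage examples below; the expected values come from this card.

from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("makiart/multilingual-ModernBert-large")
tokenizer = AutoTokenizer.from_pretrained("makiart/multilingual-ModernBert-large")

print(config.max_position_embeddings)  # context length, expected 8192
print(len(tokenizer))                  # vocabulary size, expected around 151,680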
💻 Usage Examples
Basic Usage
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model = AutoModelForMaskedLM.from_pretrained("makiart/multilingual-ModernBert-large", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("makiart/multilingual-ModernBert-large")
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Korean (roughly: "Most of our anguish begins with [MASK] the other life that could have been.")
results = fill_mask("우리의 대부분의 고뇌는 가능했을 또 다른 인생을 [MASK] 데서 시작된다.")
for result in results:
    print(result)

# English
results = fill_mask("Pinning our hopes on the unreliable notion of our potential is the root of all our [MASK].")
for result in results:
    print(result)

# Chinese (roughly: "We must [MASK] that we can only be who we are, here and now, and no one else.")
results = fill_mask("我们必须[MASK],我们只能成为此时此地的那个自己,而无法成为其他任何人。")
for result in results:
    print(result)
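Each call returns a list of candidate fills; every entry is a dictionary containing the score, the predicted token, and the completed sequence. The top_k argument of the standard fill-mask pipeline (not specific to this model) limits how many candidates are returned:

# Keep only the three highest-scoring predictions for the English example above.
results = fill_mask(
    "Pinning our hopes on the unreliable notion of our potential is the root of all our [MASK].",
    top_k=3,
)
for result in results:
    print(result["token_str"], round(result["score"], 3))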
Advanced Usage
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model = AutoModelForMaskedLM.from_pretrained("makiart/multilingual-ModernBert-large", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("makiart/multilingual-ModernBert-large")
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Japanese (roughly: "The skill to pick exactly the ingredient you want out of the pot, even in [MASK].")
results = fill_mask("たとえ[MASK]の中であっても鍋から的確に意中の具をつまみだせる技術")
for result in results:
    print(result)
📚 Documentation
Model Description
- Training Approach:
  - The weights are inherited from the base model by tiling them from the middle.
  - Training used approximately 60B tokens with a context length of 8192.
- Tokenizer: Based on Qwen2.5, with a vocabulary size of 151,680 tokens. It has been customized to distinguish indentation, making it better suited to code text (see the sketch after this list).
- Dataset:
  - Uses the fineweb and fineweb2 datasets.
  - For languages with abundant data, the volume was downsampled.
- Computational Resources: Training ran on a single node (H200 x 8) provided by ABCI over approximately 2 days.
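Since the tokenizer is customized to distinguish indentation (see the Tokenizer item above), one illustrative way to observe this is to tokenize a small indented snippet. This is only a sketch using the model id from the usage examples above; the exact token strings are not documented in this card.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("makiart/multilingual-ModernBert-large")

# Indentation-heavy code: the customized tokenizer should keep leading
# whitespace distinguishable rather than collapsing it.
code = 'def greet(name):\n    if name:\n        return f"Hello, {name}"\n'
print(tokenizer.tokenize(code))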
Evaluation
A comprehensive evaluation has not yet been performed 😭. Given the total number of training tokens, the model may be less competitive than existing models.
📄 License
This project is licensed under the MIT license.
| Property | Details |
|----------|---------|
| Model Type | Multilingual Masked Language Model |
| Training Data | fineweb, fineweb2 |