ModernBERT-base-ita Open-Source Model: Pretrained on Massive Data, with Stronger Long-Text Processing

Modernbert Base Ita

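Developed by DeepMount00

ModernBERT is a modernized bidirectional encoder-only Transformer model (BERT-style), pretrained on 2 trillion tokens of English and code data, with a native context length of up to 8,192 tokens.

Large Language Model

Transformers

Multilingual support · License: Apache-2.0 · #long-text-processing #code-semantic-search #rotary-position-embeddings

Downloads: 81

Released: 12/19/2024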

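Model Overview

ModernBERT is a modernized bidirectional encoder-only Transformer model suited to tasks that involve long documents, such as retrieval, classification, and semantic search over large corpora.

Model Features

Rotary Positional Embeddings (RoPE)

Supports long-context processing.

Alternating Local-Global Attention

Improves efficiency on long inputs.

Unpadding and Flash Attention

Enables efficient inference.

Native Long-Context Support

Native context length of up to 8,192 tokens.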
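
The alternating local-global attention listed above can be illustrated with a small mask-building sketch. The layer schedule (`global_every=3`) and the window size (`window=4`) are illustrative assumptions chosen to keep the example small, not values stated on this page, and `build_attention_mask` and `attended_pairs` are hypothetical helper names.

```python
def build_attention_mask(seq_len, layer_idx, global_every=3, window=4):
    """Return a seq_len x seq_len boolean mask for one encoder layer.

    Assumption for illustration: layers whose index is a multiple of
    `global_every` attend globally; all other layers use a bidirectional
    sliding window that reaches `window // 2` tokens to each side.
    """
    is_global = layer_idx % global_every == 0
    half = window // 2
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            # A global layer allows every pair; a local layer only nearby pairs.
            row.append(is_global or abs(q - k) <= half)
        mask.append(row)
    return mask


def attended_pairs(mask):
    """Count the query-key pairs that the mask allows."""
    return sum(sum(row) for row in mask)


global_mask = build_attention_mask(seq_len=8, layer_idx=0)
local_mask = build_attention_mask(seq_len=8, layer_idx=1)
print(attended_pairs(global_mask))  # 64
print(attended_pairs(local_mask))   # 34
```

In this toy setting the local layer scores 34 of the 64 possible pairs. The saving grows with sequence length, because the local cost scales with the window size rather than with the full sequence.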

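Model Capabilities

Fill-mask

Long-context processing

Semantic search

Code retrieval

Text classification

Use Cases

Natural Language Processing

Text classification

Classify long documents.

Semantic search

Run semantic search over large corpora.

Code Processing

Code retrieval

Run retrieval tasks over codebases.

Achieved state-of-the-art code retrieval results on CodeSearchNet and StackQA.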

🚀 ModernBERT

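ModernBERT is a modernized bidirectional encoder-only Transformer model (BERT-style), pretrained on 2 trillion tokens of English and code data, with a native context length of up to 8,192 tokens. It handles long-document tasks effectively and is suitable for a variety of downstream tasks.

🚀 Quick Start

You can use these models directly with the transformers library. Until the next transformers release, doing so requires installing transformers from the main branch: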

pip install git+https://github.com/huggingface/transformers.git

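Because ModernBERT is a masked language model (MLM), you can use the fill-mask pipeline or load it via AutoModelForMaskedLM. To use ModernBERT for downstream tasks such as classification, retrieval, or question answering, fine-tune it following standard BERT fine-tuning recipes.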

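✨ Key Features

Long-context support: uses Rotary Positional Embeddings (RoPE) to handle long contexts effectively.
Efficient long-input processing: uses Local-Global Alternating Attention to improve efficiency on long inputs.
Efficient inference: relies on unpadding and Flash Attention for efficient inference.
Multiple sizes: available as ModernBERT-base (22 layers, 149 million parameters) and ModernBERT-large (28 layers, 395 million parameters).
Broad applicability: trained on a large corpus of text and code, making it suitable for downstream tasks such as code retrieval and hybrid (text + code) semantic search.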
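
To make the RoPE feature above more concrete, here is a minimal pure-Python sketch of the rotation idea. It assumes the textbook formulation (consecutive pairs of dimensions, base 10000); the model's actual implementation and hyperparameters may differ, and `rope_rotate` and `dot` are hypothetical helper names.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    out = []
    for i in range(0, dim, 2):
        # Each pair gets its own frequency; later pairs rotate more slowly.
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical query and key vectors.
q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.1]

# Both pairs of positions are 3 tokens apart, so the scores should match.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 102))
print(abs(score_near - score_far) < 1e-9)  # True
```

The attention score depends only on the distance between the two positions, not on their absolute values, which is the property that makes rotary embeddings a natural fit for long contexts.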

📦 Installation

Install the transformers library

pip install git+https://github.com/huggingface/transformers.git

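Install Flash Attention (optional)

⚠️ Important

If your GPU supports it, we recommend using Flash Attention 2 for maximum efficiency. Install Flash Attention as follows, then use the model as usual: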

pip install flash-attn
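
The unpadding that this page pairs with Flash Attention can be pictured as follows. This is a conceptual pure-Python sketch only, not the `flash-attn` API: `unpad`, `repad`, the pad id `0`, and the token ids are all hypothetical, and real implementations usually rely on the attention mask rather than comparing against a pad id.

```python
def unpad(batch, pad_id=0):
    """Concatenate non-padding tokens and record sequence boundaries.

    Returns (flat_tokens, cu_seqlens), where the slice
    cu_seqlens[i]:cu_seqlens[i + 1] of flat_tokens belongs to sequence i.
    """
    flat_tokens = []
    cu_seqlens = [0]
    for seq in batch:
        real = [tok for tok in seq if tok != pad_id]
        flat_tokens.extend(real)
        cu_seqlens.append(len(flat_tokens))
    return flat_tokens, cu_seqlens


def repad(flat_tokens, cu_seqlens, max_len, pad_id=0):
    """Inverse of unpad: rebuild the right-padded batch."""
    batch = []
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        seq = flat_tokens[start:end]
        batch.append(seq + [pad_id] * (max_len - len(seq)))
    return batch


# Hypothetical right-padded batch of token ids.
padded = [
    [101, 7, 8, 102, 0, 0],
    [101, 9, 102, 0, 0, 0],
    [101, 4, 5, 6, 3, 102],
]
flat, cu = unpad(padded)
print(len(flat), cu)  # 13 [0, 4, 7, 13]
```

Here the model would process 13 tokens instead of the 18 slots in the padded batch, and the cumulative lengths are enough to restore the original layout afterwards.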

💻 Usage Examples

Basic Usage

Using AutoModelForMaskedLM:

from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "DeepMount00/ModernBERT-base-ita"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = "La capitale dell'Italia è [MASK]."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# To get predictions for the mask:
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(axis=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
# Predicted token:  Roma
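
The example above keeps only the single best token. As a sketch of how one might list several candidates with probabilities, the helper below applies a softmax to the logits at the mask position. The vocabulary and logit values are made up for illustration, and `top_k_predictions` is a hypothetical helper, not part of transformers.

```python
import math


def top_k_predictions(logits, vocab, k=3):
    """Softmax the logits and return the k most likely (token, prob) pairs."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical vocabulary and logits for the [MASK] position.
vocab = ["Roma", "Milano", "Napoli", "pizza"]
mask_logits = [4.0, 2.0, 1.0, -1.0]

for token, prob in top_k_predictions(mask_logits, vocab):
    print(f"{token}: {prob:.3f}")
```

With real model outputs, `logits` would be the row `outputs.logits[0, masked_index]` converted to a list, and the vocabulary lookup would go through the tokenizer.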

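Advanced Usage

Using a pipeline (this example loads the original English answerdotai/ModernBERT-base checkpoint):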

import torch
from transformers import pipeline
from pprint import pprint

pipe = pipeline(
    "fill-mask",
    model="answerdotai/ModernBERT-base",
    torch_dtype=torch.bfloat16,
)

input_text = "He walked to the [MASK]."
results = pipe(input_text)
pprint(results)

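📚 Detailed Documentation

Evaluation

ModernBERT was evaluated across a range of tasks, including natural language understanding (GLUE), general retrieval (BEIR), long-context retrieval (MLDR), and code retrieval (CodeSearchNet and StackQA).

On GLUE, ModernBERT-base surpasses other encoder models of similar size, while ModernBERT-large is second only to DeBERTa-v3-large.
On general retrieval, ModernBERT performs well on BEIR in both the single-vector (DPR-style) and multi-vector (ColBERT-style) settings.
Because code was included in the training data, ModernBERT as a backbone also achieves new state-of-the-art code retrieval results on CodeSearchNet and StackQA.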
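
The difference between the single-vector and multi-vector retrieval settings mentioned above can be sketched in a few lines. This is an illustration under stated assumptions, not the evaluation code: mean pooling is just one common way to obtain a single vector, the 2-dimensional token embeddings are made up, and all function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_pool(token_vecs):
    """Average token vectors into one vector (an assumed pooling choice)."""
    n = len(token_vecs)
    return [sum(col) / n for col in zip(*token_vecs)]


def single_vector_score(query_token_vecs, doc_token_vecs):
    """DPR-style: one vector per text, scored with a dot product."""
    return dot(mean_pool(query_token_vecs), mean_pool(doc_token_vecs))


def maxsim_score(query_token_vecs, doc_token_vecs):
    """ColBERT-style: for each query token take its best-matching document
    token, then sum those maxima over the query tokens."""
    return sum(
        max(dot(q, d) for d in doc_token_vecs)
        for q in query_token_vecs
    )


# Hypothetical 2-dimensional token embeddings.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_b_tokens = [[0.5, 0.5], [0.5, 0.5]]

print(single_vector_score(query_tokens, doc_a_tokens))  # 0.5
print(single_vector_score(query_tokens, doc_b_tokens))  # 0.5
print(maxsim_score(query_tokens, doc_a_tokens))         # 2.0
print(maxsim_score(query_tokens, doc_b_tokens))         # 1.0
```

In this toy case the pooled vectors tie, while the token-level MaxSim score prefers the document that contains exact matches for both query tokens. The multi-vector setting buys that extra resolution at the cost of storing one vector per token.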

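Limitations

Language coverage: ModernBERT's training data is primarily English and code, so performance on other languages may be lower.
Inference speed: although it handles long sequences efficiently, using the full 8,192-token window may be slower than short-context inference.
Data bias: as with any large language model, ModernBERT may produce representations that reflect biases present in its training data. Verify critical or sensitive outputs before relying on them.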

📄 License

We release the ModernBERT model architecture, model weights, and training codebase under the Apache 2.0 license.

🔖 Citation

@misc{modernbert,
      title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference}, 
      author={Benjamin Warner and Antoine Chaffin and Benjamin Clavié and Orion Weller and Oskar Hallström and Said Taghadouini and Alexis Gallagher and Raja Biswas and Faisal Ladhak and Tom Aarsen and Nathan Cooper and Griffin Adams and Jeremy Howard and Iacopo Poli},
      year={2024},
      eprint={2412.13663},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.13663}, 
}

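ModernBERT is a collaboration between Answer.AI, LightOn, and other partners.

To learn more about ModernBERT, we recommend the release blog post for a high-level overview and the arXiv preprint for in-depth information.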