Norbert3-xs: an open-source Norwegian language model for Norwegian text processing

Norbert3 Xs

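Developed by ltg

NorBERT 3 xs is a BERT model optimized for Norwegian. It is the smallest model in the new generation of NorBERT language models, with 15M parameters.

Large language model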

Transformers

Other open-source license: Apache-2.0 #Norwegian masked-token prediction #Lightweight BERT #Multi-dialect support

Downloads: 228

Released: 3/28/2023

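Model overview

This is a pretrained Norwegian language model based on the BERT architecture. It is optimized for Norwegian text processing and supports a range of downstream NLP tasks.

Model features

Optimized for Norwegian

Pretrained and optimized specifically for Norwegian, covering both written standards, Bokmål and Nynorsk.

Lightweight design

As the smallest model in the NorBERT 3 series (15M parameters), it is suited to resource-constrained environments.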
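To give a rough sense of what 15M parameters means in practice, the sketch below estimates the memory taken by the weights alone at a few common numeric widths. It assumes the rounded figure of 15M parameters quoted above and ignores activations, optimizer state, and framework overhead, so real usage will be higher.

```python
# Rough weight-only memory estimate for a 15M-parameter model.
# Assumption: every parameter is stored at the given width; activations,
# optimizer state, and framework overhead are not included.
PARAM_COUNT = 15_000_000  # "15M" from the model description (rounded)

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_megabytes(param_count: int, dtype: str) -> float:
    """Return the size of the weights in megabytes (1 MB = 10**6 bytes)."""
    return param_count * BYTES_PER_PARAM[dtype] / 1_000_000


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_megabytes(PARAM_COUNT, dtype):.0f} MB")
```

At 32-bit precision this comes to about 60 MB of weights, which is the main reason a model of this size can fit comfortably on CPU-only or memory-limited machines.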

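Multi-task support

Supports a range of downstream NLP tasks, including text classification and named entity recognition.

Remote code support

Requires a custom wrapper, so the model must be loaded with trust_remote_code=True.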

Model capabilities

Masked language modeling

Sequence classification

Token classification

Question answering

Multiple choice

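Use cases

Text understanding

Text completion

Predicts masked words; in the usage example, the '[MASK]' position is predicted as 'ny'.

Predicts Norwegian words that fit the surrounding context.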
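Beyond the single best prediction, it is often useful to look at several candidate words for a masked slot. The sketch below shows one common way to do that: apply a softmax to the raw scores and keep the top k. The scores here are made-up toy numbers, not output from NorBERT 3 xs, and `top_k_candidates` is a hypothetical helper name used only for illustration.

```python
import math
from typing import Dict, List, Tuple


def top_k_candidates(scores: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Softmax the raw scores and return the k most probable tokens."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scores.items()}
    total = sum(exps.values())
    ranked = sorted(exps.items(), key=lambda item: item[1], reverse=True)
    return [(tok, value / total) for tok, value in ranked[:k]]


# Hypothetical scores for the masked slot in "Nå ønsker de seg en[MASK] bolig."
# These numbers are invented for illustration and are NOT model output.
toy_scores = {"ny": 4.0, "stor": 2.5, "egen": 2.0, "rød": -1.0}

for token, prob in top_k_candidates(toy_scores, k=3):
    print(f"{token}: {prob:.2f}")
```

With a real model, the scores would come from the logits at the masked position, restricted to however many vocabulary entries you want to inspect.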

Text classification

Sentiment analysis

Classifies the sentiment of Norwegian text.

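🚀 NorBERT 3 xs

This is the official release of NorBERT 3 xs, part of the new generation of NorBERT language models described in the paper "NorBench -- A Benchmark for Norwegian Language Models". See the paper for further details about the model.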

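🚀 Quick start

NorBERT 3 xs is a Norwegian language model that can be applied to a range of natural language processing tasks. The NorBERT 3 family is available in several sizes to suit different use cases.

✨ Key features

Multiple sizes: the family includes xs, small, base, and large models, so you can pick the one that fits your requirements.
Multi-task support: several classes are implemented, such as AutoModel, AutoModelForMaskedLM, and AutoModelForSequenceClassification, covering a range of natural language processing tasks.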

📦 Related models

Other sizes of NorBERT 3

Generative NorT5 models

💻 Usage examples

Basic usage

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# The model relies on custom code from modeling_norbert.py, hence trust_remote_code=True.
tokenizer = AutoTokenizer.from_pretrained("ltg/norbert3-xs")
model = AutoModelForMaskedLM.from_pretrained("ltg/norbert3-xs", trust_remote_code=True)

# Look up the vocabulary id of the mask token, then tokenize a sentence containing one mask.
mask_id = tokenizer.convert_tokens_to_ids("[MASK]")
input_text = tokenizer("Nå ønsker de seg en[MASK] bolig.", return_tensors="pt")
output_p = model(**input_text)
# Keep the original ids everywhere except at masked positions, which take the highest-scoring prediction.
output_text = torch.where(input_text.input_ids == mask_id, output_p.logits.argmax(-1), input_text.input_ids)

# should output: '[CLS] Nå ønsker de seg en ny bolig.[SEP]'
print(tokenizer.decode(output_text[0].tolist()))

The model currently needs a custom wrapper from modeling_norbert.py, so you should load it with trust_remote_code=True.
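The `torch.where` call in the basic usage example does the actual mask filling: it keeps every original token id and swaps in the highest-scoring vocabulary id only where the input holds the mask id. The sketch below reproduces that selection logic for a single sequence with plain Python lists, so it runs without downloading the model. The ids and scores are toy values, and `fill_masks` is a hypothetical helper name, not part of any library.

```python
from typing import List


def fill_masks(input_ids: List[int], logits: List[List[float]], mask_id: int) -> List[int]:
    """Replace each masked position with the highest-scoring vocabulary id.

    Mirrors torch.where(input_ids == mask_id, logits.argmax(-1), input_ids)
    for a single sequence, using plain lists instead of tensors.
    """
    filled = []
    for token_id, scores in zip(input_ids, logits):
        if token_id == mask_id:
            # argmax over the vocabulary axis for this position
            best_id = max(range(len(scores)), key=scores.__getitem__)
            filled.append(best_id)
        else:
            filled.append(token_id)
    return filled


# Toy values (hypothetical): a vocabulary of 5 ids, where id 4 plays the role of [MASK].
toy_ids = [0, 2, 4, 1]
toy_logits = [
    [0.1, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.9, 0.0, 0.0],
    [0.2, 0.1, 0.3, 2.5, 0.0],  # masked position: id 3 scores highest
    [0.0, 0.7, 0.0, 0.0, 0.0],
]
print(fill_masks(toy_ids, toy_logits, mask_id=4))  # [0, 2, 3, 1]
```

Unmasked positions are passed through untouched even when the model would have predicted something else there, which is why the decoded sentence matches the input everywhere except the mask.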

Supported classes

The following classes are currently implemented: AutoModel, AutoModelForMaskedLM, AutoModelForSequenceClassification, AutoModelForTokenClassification, AutoModelForQuestionAnswering, and AutoModelForMultipleChoice.
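Since every one of these classes is loaded the same way, one option is a small lookup from task to class. The sketch below is an illustration only: the task keys, the `load_for_task` helper, and the optional `revision` argument are choices made for this example. The class names come from the list above, and `trust_remote_code` and `revision` are standard `from_pretrained` arguments in transformers.

```python
from typing import Any, Optional

MODEL_ID = "ltg/norbert3-xs"

# Task name -> transformers Auto class name. The task keys are hypothetical
# labels chosen for this sketch; the class names are the ones listed above.
TASK_TO_AUTO_CLASS = {
    "encoder": "AutoModel",
    "masked-lm": "AutoModelForMaskedLM",
    "sequence-classification": "AutoModelForSequenceClassification",
    "token-classification": "AutoModelForTokenClassification",
    "question-answering": "AutoModelForQuestionAnswering",
    "multiple-choice": "AutoModelForMultipleChoice",
}


def load_for_task(task: str, revision: Optional[str] = None) -> Any:
    """Load NorBERT 3 xs with the Auto class that matches `task`."""
    if task not in TASK_TO_AUTO_CLASS:
        raise ValueError(f"Unknown task {task!r}; expected one of {sorted(TASK_TO_AUTO_CLASS)}")
    import transformers  # imported lazily so the mapping can be inspected without it

    auto_class = getattr(transformers, TASK_TO_AUTO_CLASS[task])
    # trust_remote_code=True is needed because the model uses a custom wrapper.
    kwargs = {"trust_remote_code": True}
    if revision is not None:
        # Optionally pin a specific repository revision of the weights and code.
        kwargs["revision"] = revision
    return auto_class.from_pretrained(MODEL_ID, **kwargs)
```

For example, `load_for_task("sequence-classification")` would return a model with a classification head on top of the pretrained encoder. That head is not part of the pretrained checkpoint, so it would normally need fine-tuning on labeled data, for instance for sentiment analysis, before its predictions mean anything.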

📚 Citation

@inproceedings{samuel-etal-2023-norbench,
    title = "{N}or{B}ench {--} A Benchmark for {N}orwegian Language Models",
    author = "Samuel, David  and
      Kutuzov, Andrey  and
      Touileb, Samia  and
      Velldal, Erik  and
      {\O}vrelid, Lilja  and
      R{\o}nningstad, Egil  and
      Sigdel, Elina  and
      Palatkina, Anna",
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = may,
    year = "2023",
    address = "T{\'o}rshavn, Faroe Islands",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2023.nodalida-1.61",
    pages = "618--633",
    abstract = "We present NorBench: a streamlined suite of NLP tasks and probes for evaluating Norwegian language models (LMs) on standardized data splits and evaluation metrics. We also introduce a range of new Norwegian language models (both encoder and encoder-decoder based). Finally, we compare and analyze their performance, along with other existing LMs, across the different benchmark tests of NorBench.",
}