xlm-roberta-base-language-detection: an open-source text classification model that detects 20 languages

Xlm Roberta Base Language Detection

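Developed by papluca

A language detection model based on XLM-RoBERTa that classifies text into one of 20 languages.

Text Classification

Transformers

Multilingual. License: MIT. Tags: #multilingual-detection #high-accuracy #text-classification

Downloads: 2.7M

Released: 3/2/2022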

Model Overview

This model is a version of XLM-RoBERTa fine-tuned on a language identification dataset. It predicts the language a text is written in.

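Model Features

High accuracy

Reaches 99.6% average accuracy on the test set

Multilingual support

Detects 20 common languages

Based on XLM-RoBERTa

Builds on a cross-lingual pretrained model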

Model Capabilities

Text language identification

Multilingual text classification

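Use Cases

Content classification

Classifying content on multilingual websites

Automatically identifies the language of user-submitted content

99.6% average accuracy on the model's test set

Data preprocessing

Preprocessing multilingual datasets

Automatically identifies the language of a text before downstream NLP tasks

Improves the efficiency of downstream processing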

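🚀 xlm-roberta-base language detection model

This is a Transformer-based language detection model for multilingual text classification. It identifies 20 languages, reaching 99.6% average accuracy on its test set, and can serve as a first step in cross-lingual text processing.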

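🚀 Quick Start

The model can be used directly as a language detector, that is, for sequence classification. Two ways to use it are shown below.

Option 1: use the high-level `pipeline` API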

from transformers import pipeline

text = [
    "Brevity is the soul of wit.",
    "Amor, ch'a nullo amato amar perdona."
]

model_ckpt = "papluca/xlm-roberta-base-language-detection"
pipe = pipeline("text-classification", model=model_ckpt)
# top_k=1 keeps only the most likely language; truncation=True cuts overlong inputs
pipe(text, top_k=1, truncation=True)
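
The pipeline call above returns label/score records rather than plain language codes. The sketch below post-processes that output. It assumes the nested list-of-dicts shape (`[[{"label": ..., "score": ...}], ...]`) that recent `transformers` releases return when `top_k` is set on a list input; older releases may return a flat list of dicts, which the helper also accepts. The helper name `top_language`, the `min_score` threshold, and the sample scores are hypothetical and for illustration only.

```python
def top_language(result, min_score=0.5):
    """Return (label, score) for one pipeline result.

    `result` is either a single {"label", "score"} dict or a list of such
    dicts. If the best score is below `min_score`, the label is None.
    """
    candidates = [result] if isinstance(result, dict) else list(result)
    best = max(candidates, key=lambda r: r["score"])
    if best["score"] < min_score:
        return None, best["score"]
    return best["label"], best["score"]


# Hypothetical data in the shape of pipe(text, top_k=1, truncation=True);
# the scores are made up for illustration, not real model output.
sample_output = [
    [{"label": "en", "score": 0.98}],
    [{"label": "it", "score": 0.41}],
]

languages = [top_language(r, min_score=0.5) for r in sample_output]
print(languages)
```

Returning `None` for low-confidence predictions is one possible design choice; it lets the caller route uncertain texts to a fallback instead of trusting a weak guess.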

Option 2: use the tokenizer and model separately

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

text = [
    "Brevity is the soul of wit.",
    "Amor, ch'a nullo amato amar perdona."
]

model_ckpt = "papluca/xlm-roberta-base-language-detection"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSequenceClassification.from_pretrained(model_ckpt)

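# Tokenize the batch, padding to the longest text and truncating overlong inputs
inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt")

# Inference only, so gradient tracking is disabled
with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to per-language probabilities
preds = torch.softmax(logits, dim=-1)

# Map raw predictions to languages
id2lang = model.config.id2label
vals, idxs = torch.max(preds, dim=1)
{id2lang[k.item()]: v.item() for k, v in zip(idxs, vals)}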
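
To make the softmax and max steps concrete, here is a dependency-free sketch of what they compute, written in plain Python instead of torch. The logits and the three-entry `id2label` mapping are invented for illustration; the real mapping comes from `model.config.id2label` and has 20 entries. Unlike the dictionary built in the snippet above, this version returns one (language, probability) pair per input, so two texts in the same language are not merged under a single key.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_languages(batch_logits, id2label):
    """Return one (label, probability) pair per row of logits."""
    results = []
    for row in batch_logits:
        probs = softmax(row)
        # Index of the highest probability, i.e. the argmax
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((id2label[best], probs[best]))
    return results


# Made-up values: a real model emits 20 logits per text.
id2label = {0: "en", 1: "it", 2: "fr"}
batch_logits = [[4.0, 0.5, 0.1], [0.2, 3.5, 1.0]]

predictions = predict_languages(batch_logits, id2label)
print(predictions)
```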

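✨ Key Features

Multilingual support: detects 20 languages: Arabic (ar), Bulgarian (bg), German (de), Modern Greek (el), English (en), Spanish (es), French (fr), Hindi (hi), Italian (it), Japanese (ja), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Swahili (sw), Thai (th), Turkish (tr), Urdu (ur), Vietnamese (vi), and Chinese (zh).
High accuracy: average accuracy on the test set is 99.6%, which matches the average macro and weighted F1 scores.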

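📦 Installation

The source model card gives no installation steps. The code in this document imports the `transformers` and `torch` packages, so both need to be installed.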

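📚 Documentation

Model description

This model is a fine-tuned version of xlm-roberta-base on a language identification dataset. It is an XLM-RoBERTa transformer model with a classification head on top (i.e. a linear layer on top of the pooled output). For more information, see the xlm-roberta-base model card or the paper "Unsupervised Cross-lingual Representation Learning at Scale" by Conneau et al.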

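Intended uses and limitations

You can use this model directly as a language detector, i.e. for sequence classification. It currently supports the following 20 languages:

Arabic (ar), Bulgarian (bg), German (de), Modern Greek (el), English (en), Spanish (es), French (fr), Hindi (hi), Italian (it), Japanese (ja), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Swahili (sw), Thai (th), Turkish (tr), Urdu (ur), Vietnamese (vi), and Chinese (zh)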

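Training and evaluation data

The model was fine-tuned on a language identification dataset consisting of text sequences in 20 languages. The training set contains 70k samples, while the validation and test sets contain 10k samples each. Average accuracy on the test set is 99.6% (this matches the average macro/weighted F1 score, because the test set is perfectly balanced). The table below gives per-language results.

Language	Precision	Recall	F1-score	Support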
ar	0.998	0.996	0.997	500
bg	0.998	0.964	0.981	500
de	0.998	0.996	0.997	500
el	0.996	1.000	0.998	500
en	1.000	1.000	1.000	500
es	0.967	1.000	0.983	500
fr	1.000	1.000	1.000	500
hi	0.994	0.992	0.993	500
it	1.000	0.992	0.996	500
ja	0.996	0.996	0.996	500
nl	1.000	1.000	1.000	500
pl	1.000	1.000	1.000	500
pt	0.988	1.000	0.994	500
ru	1.000	0.994	0.997	500
sw	1.000	1.000	1.000	500
th	1.000	0.998	0.999	500
tr	0.994	0.992	0.993	500
ur	1.000	1.000	1.000	500
vi	0.992	1.000	0.996	500
zh	1.000	1.000	1.000	500
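
The statement that accuracy matches the macro and weighted averages can be checked from the table. With 500 samples per language, the weighted and macro averages coincide, and macro-averaged recall equals overall accuracy. A small sketch with the rows copied from the table above; the variable and function names are illustrative:

```python
# (language, precision, recall, f1, support), copied from the table above
ROWS = [
    ("ar", 0.998, 0.996, 0.997, 500), ("bg", 0.998, 0.964, 0.981, 500),
    ("de", 0.998, 0.996, 0.997, 500), ("el", 0.996, 1.000, 0.998, 500),
    ("en", 1.000, 1.000, 1.000, 500), ("es", 0.967, 1.000, 0.983, 500),
    ("fr", 1.000, 1.000, 1.000, 500), ("hi", 0.994, 0.992, 0.993, 500),
    ("it", 1.000, 0.992, 0.996, 500), ("ja", 0.996, 0.996, 0.996, 500),
    ("nl", 1.000, 1.000, 1.000, 500), ("pl", 1.000, 1.000, 1.000, 500),
    ("pt", 0.988, 1.000, 0.994, 500), ("ru", 1.000, 0.994, 0.997, 500),
    ("sw", 1.000, 1.000, 1.000, 500), ("th", 1.000, 0.998, 0.999, 500),
    ("tr", 0.994, 0.992, 0.993, 500), ("ur", 1.000, 1.000, 1.000, 500),
    ("vi", 0.992, 1.000, 0.996, 500), ("zh", 1.000, 1.000, 1.000, 500),
]

RECALL, F1, SUPPORT = 2, 3, 4  # column indices into each row


def macro_avg(rows, col):
    """Unweighted mean of one metric across languages."""
    return sum(row[col] for row in rows) / len(rows)


def weighted_avg(rows, col):
    """Mean of one metric weighted by each language's sample count."""
    total = sum(row[SUPPORT] for row in rows)
    return sum(row[col] * row[SUPPORT] for row in rows) / total


macro_recall = macro_avg(ROWS, RECALL)   # equals accuracy on a balanced set
macro_f1 = macro_avg(ROWS, F1)
weighted_f1 = weighted_avg(ROWS, F1)

print(f"macro recall {macro_recall:.3f}, macro F1 {macro_f1:.3f}, "
      f"weighted F1 {weighted_f1:.3f}")
```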

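Benchmarks

As a baseline to compare xlm-roberta-base-language-detection against, we used the Python langid library. Since it comes pre-trained on 97 languages, we used its .set_languages() method to restrict the language set to our 20 languages. langid's average accuracy on the test set is 98.5%. The table below gives more detail.

Language	Precision	Recall	F1-score	Support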
ar	0.990	0.970	0.980	500
bg	0.998	0.964	0.981	500
de	0.992	0.944	0.967	500
el	1.000	0.998	0.999	500
en	1.000	1.000	1.000	500
es	1.000	0.968	0.984	500
fr	0.996	1.000	0.998	500
hi	0.949	0.976	0.963	500
it	0.990	0.980	0.985	500
ja	0.927	0.988	0.956	500
nl	0.980	1.000	0.990	500
pl	0.986	0.996	0.991	500
pt	0.950	0.996	0.973	500
ru	0.996	0.974	0.985	500
sw	1.000	1.000	1.000	500
th	1.000	0.996	0.998	500
tr	0.990	0.968	0.979	500
ur	0.998	0.996	0.997	500
vi	0.971	0.990	0.980	500
zh	1.000	1.000	1.000	500
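
For readers who want to try a similar baseline setup, the following is a rough sketch of restricting langid to the 20 languages and scoring its predictions. It assumes the third-party `langid` package is installed (`pip install langid`) and relies on its module-level `set_languages` and `classify` functions. The helper names `langid_predict` and `accuracy` are hypothetical, and this is not the authors' evaluation script. The import sits inside the function so that the accuracy helper works without langid installed.

```python
LANGUAGES = [
    "ar", "bg", "de", "el", "en", "es", "fr", "hi", "it", "ja",
    "nl", "pl", "pt", "ru", "sw", "th", "tr", "ur", "vi", "zh",
]


def langid_predict(texts, languages=LANGUAGES):
    """Classify texts with langid, restricted to the given language codes."""
    import langid  # third-party package, assumed to be installed

    langid.set_languages(languages)
    # classify() returns a (language_code, score) tuple; keep only the code
    return [langid.classify(text)[0] for text in texts]


def accuracy(predicted, gold):
    """Fraction of positions where the predicted label equals the gold label."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and equally long")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Example usage (requires langid):
#   texts = ["Brevity is the soul of wit.", "Amor, ch'a nullo amato amar perdona."]
#   print(accuracy(langid_predict(texts), ["en", "it"]))
```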