OpenLID: an open-source, high-performance language identification model covering 201 languages

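Developed by laurievb

OpenLID is a high-coverage, high-performance language identification model that supports 201 languages.

Text classification #Multilingual identification #High coverage #fastText framework

Downloads: 1,854

Release date: 10/24/2023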

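Model Overview

A text classification model built on the fastText framework and designed specifically for language identification.

Model Features

High coverage

Supports 201 languages.

High performance

Evaluated on the FLORES-200 benchmark; the accompanying paper reports a macro-average F1 score of 0.93 across 201 languages.

Open data

The training data and performance metrics are public, to encourage further research.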

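Model Capabilities

Text classification

Language identification

Use Cases

Multilingual processing

Language detection

Identifies the language of a text.

Performs strongly on the FLORES-200 benchmark.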

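🚀 OpenLID

OpenLID is a high-coverage, high-performance language identification model. It is a fastText model that covers 201 languages. The training data and the per-language performance figures are openly available to encourage further research.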

🚀 Quick Start

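The following example shows how to use the model to detect the language of a given text:

>>> import fasttext
>>> from huggingface_hub import hf_hub_download

>>> # Download model.bin from the Hugging Face Hub (cached locally after the first call)
>>> model_path = hf_hub_download(repo_id="laurievb/OpenLID", filename="model.bin")
>>> model = fasttext.load_model(model_path)
>>> # predict() returns a tuple of labels and an array of the matching probabilities
>>> model.predict("Hello, world!")

(('__label__eng_Latn',), array([0.81148803]))

>>> # k=5 asks for the five most probable labels
>>> model.predict("Hello, world!", k=5)

(('__label__eng_Latn', '__label__vie_Latn', '__label__nld_Latn', '__label__pol_Latn', '__label__deu_Latn'),
 array([0.61224753, 0.21323682, 0.09696738, 0.01359863, 0.01319415]))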
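
The labels in the output above appear to combine a language code and a script code behind fastText's `__label__` prefix (for example `eng_Latn`). The sketch below shows one way to unpack them for downstream use. It is an illustration only: `LanguageTag`, `parse_label`, and `pair_predictions` are hypothetical helper names, not part of fastText or OpenLID, and the sample values are copied from the output above so the block runs without downloading the model.

```python
from typing import NamedTuple

LABEL_PREFIX = "__label__"  # prefix seen on every label in the output above


class LanguageTag(NamedTuple):
    language: str  # e.g. "eng"
    script: str    # e.g. "Latn"


def parse_label(label: str) -> LanguageTag:
    """Split a label such as '__label__eng_Latn' into language and script parts."""
    if not label.startswith(LABEL_PREFIX):
        raise ValueError(f"unexpected label format: {label!r}")
    language, _, script = label[len(LABEL_PREFIX):].partition("_")
    return LanguageTag(language, script)


def pair_predictions(labels, probabilities):
    """Zip the two parallel outputs of predict() into (LanguageTag, float) pairs."""
    return [(parse_label(label), float(prob)) for label, prob in zip(labels, probabilities)]


# Sample values copied from the k=5 output above (first two entries only).
labels = ("__label__eng_Latn", "__label__vie_Latn")
probabilities = [0.61224753, 0.21323682]

for tag, prob in pair_predictions(labels, probabilities):
    print(f"{tag.language} ({tag.script}): {prob:.3f}")
```

With a loaded model, the same helper could be applied as `pair_predictions(*model.predict(text, k=5))`.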

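✨ Key Features

High coverage: identifies 201 languages.
High performance: performs strongly on language identification.
Open data: the training data and per-language performance figures are public, which makes further research easier.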

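📚 Detailed Documentation

Model Description

The model and its training data are described in detail in Burchell et al. (2023). The original fastText implementation is available on GitHub.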

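Limitations and Bias

Limited language coverage: the dataset and model cover only 201 languages, namely those we were able to test with the FLORES-200 evaluation benchmark.
Domain limitation: because the test set consists of sentences from a single domain (wiki articles), performance on it may not reflect how well the classifier works in other domains. Future work could create an LID test set that is representative of web data, since that is where these classifiers are usually applied.
Insufficient auditing: most of the data was not audited by native speakers, as would be ideal. Future versions of the dataset should have more languages verified by native speakers, with a focus on the least-resourced languages.

Our work aims to broaden NLP coverage by allowing practitioners to identify relevant data in more languages. However, we note that language identification is inherently a normative activity that risks excluding minority dialects, scripts, or entire microlanguages from a macrolanguage. Choosing which languages to cover may reinforce power imbalances, because only some groups gain access to NLP technologies. In addition, errors in language identification can have a significant impact on downstream performance, particularly when a system is used as a "black box", which is often the case. Our classifier does not perform equally well across languages, which could lead to worse downstream performance for particular groups. We mitigate this by providing metrics by class.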
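
One way to avoid treating the classifier as a black box is to abstain when the top probability is low. The sketch below is an illustration rather than something the original model card prescribes: `clean_for_prediction`, `accept_prediction`, and the 0.5 default threshold are hypothetical, and a sensible threshold would have to be chosen per language and domain. The cleaning step is included because fastText's Python `predict()` raises an error on text that contains a newline.

```python
def clean_for_prediction(text: str) -> str:
    # Collapse newlines, tabs, and repeated spaces into single spaces,
    # because predict() processes one line at a time.
    return " ".join(text.split())


def accept_prediction(labels, probabilities, threshold=0.5):
    """Return the top label if its probability reaches the threshold, else None.

    `labels` and `probabilities` are the two parallel sequences returned by
    predict(); the threshold value is an assumption, not a recommended setting.
    """
    if len(labels) == 0:
        return None
    top_label = labels[0]
    top_probability = float(probabilities[0])
    return top_label if top_probability >= threshold else None


# Values copied from the single-label example output earlier on this page.
print(accept_prediction(("__label__eng_Latn",), [0.81148803]))
print(accept_prediction(("__label__eng_Latn",), [0.81148803], threshold=0.9))
print(clean_for_prediction("Hello,\nworld!"))
```

Texts for which the function returns `None` could be routed to manual review or a fallback rule instead of being silently mislabeled.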

Training Data

The model was trained on the OpenLID dataset, which is available from the GitHub repository.

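Training Procedure

The model was trained with fastText using the hyperparameters below. All other hyperparameters were left at their default values.

Loss function: softmax
Epochs: 2
Learning rate: 0.8
Minimum number of word occurrences: 1000
Embedding dimension: 256
Character n-grams: 2-5
Word n-grams: 1
Bucket size: 1,000,000
Number of threads: 68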
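
As a rough guide, the settings above can be mapped onto the keyword arguments of `fasttext.train_supervised` in the fastText Python bindings. This is a sketch under assumptions, not the authors' training script: the function name `train_openlid_like` and the `train_path` argument are hypothetical, and the training file is assumed to follow fastText's usual format of one example per line with a `__label__` prefix.

```python
# Hyperparameters from the list above, keyed by fastText's Python argument names.
OPENLID_HYPERPARAMETERS = {
    "loss": "softmax",    # loss function
    "epoch": 2,           # training epochs
    "lr": 0.8,            # learning rate
    "minCount": 1000,     # minimum number of word occurrences
    "dim": 256,           # embedding dimension
    "minn": 2,            # shortest character n-gram
    "maxn": 5,            # longest character n-gram
    "wordNgrams": 1,      # word n-grams
    "bucket": 1_000_000,  # bucket size
    "thread": 68,         # reflects the original hardware; lower it on smaller machines
}


def train_openlid_like(train_path: str):
    """Train a supervised fastText model with the settings above (not called here)."""
    import fasttext  # imported lazily so the settings can be inspected without fastText installed

    return fasttext.train_supervised(input=train_path, **OPENLID_HYPERPARAMETERS)


for name, value in OPENLID_HYPERPARAMETERS.items():
    print(f"{name} = {value}")
```

The thread count affects only training speed on a given machine, so it is the one value most likely to need changing.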

Evaluation Datasets

The model was evaluated on the FLORES-200 benchmark provided by Costa-jussà et al. (2022). More information is available in the paper.
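
The paper's abstract reports a macro-average F1 score, which gives every language equal weight regardless of how many test sentences it has. The sketch below shows how such a score could be computed from gold and predicted labels. It is not the authors' evaluation code: the function names and the toy labels are made up for illustration, and the average here is taken over every label that appears in either list.

```python
from collections import Counter


def per_class_counts(gold, predicted):
    """Count true positives, false positives, and false negatives per label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold_label, predicted_label in zip(gold, predicted):
        if gold_label == predicted_label:
            tp[gold_label] += 1
        else:
            fp[predicted_label] += 1  # predicted label was wrongly assigned
            fn[gold_label] += 1       # gold label was missed
    return tp, fp, fn


def macro_f1(gold, predicted):
    """Unweighted mean of per-label F1, where F1 = 2TP / (2TP + FP + FN)."""
    tp, fp, fn = per_class_counts(gold, predicted)
    labels = set(gold) | set(predicted)
    if not labels:
        raise ValueError("no labels to score")
    scores = []
    for label in labels:
        denominator = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Toy data: one English sentence is misclassified as German.
gold = ["eng_Latn", "eng_Latn", "deu_Latn", "nld_Latn"]
predicted = ["eng_Latn", "deu_Latn", "deu_Latn", "nld_Latn"]
print(f"macro F1 = {macro_f1(gold, predicted):.4f}")
```

Because each label contributes equally, a single poorly recognized low-resource language lowers the macro average as much as a poorly recognized high-resource one.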

📄 License

This model is released under the gpl-3.0 license.

BibTeX Entries and Citation Information

ACL citation (preferred)

@inproceedings{burchell-etal-2023-open,
    title = "An Open Dataset and Model for Language Identification",
    author = "Burchell, Laurie  and
      Birch, Alexandra  and
      Bogoychev, Nikolay  and
      Heafield, Kenneth",
    editor = "Rogers, Anna  and
      Boyd-Graber, Jordan  and
      Okazaki, Naoaki",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-short.75",
    doi = "10.18653/v1/2023.acl-short.75",
    pages = "865--879",
    abstract = "Language identification (LID) is a fundamental step in many natural language processing pipelines. However, current LID systems are far from perfect, particularly on lower-resource languages. We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033{\%} across 201 languages, outperforming previous work. We achieve this by training on a curated dataset of monolingual data, which we audit manually to ensure reliability. We make both the model and the dataset available to the research community. Finally, we carry out detailed analysis into our model{'}s performance, both in comparison to existing open models and by language class.",
}

ArXiv citation

@article{burchell2023open,
  title={An Open Dataset and Model for Language Identification},
  author={Burchell, Laurie and Birch, Alexandra and Bogoychev, Nikolay and Heafield, Kenneth},
  journal={arXiv preprint arXiv:2305.13820},
  year={2023}
}