Albertina 1.5B PTBR open-source language model: supporting the development of Brazilian Portuguese applications

Albertina 1b5 Portuguese Ptbr Encoder

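Developed by PORTULAN

Albertina 1.5B PTBR is a foundation large language model for the Brazilian variant of Portuguese. It is an encoder of the BERT family, based on the Transformer neural architecture and developed over the DeBERTa model.

Large language model

Transformers

Open-source license: MIT #Brazilian Portuguese encoder #1.5B-parameter model #DeBERTa-based architecture

Downloads: 83

Released: 10/27/2023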

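Model overview

A large language model built specifically for the Brazilian variant of Portuguese, with 1.5 billion parameters and highly competitive performance for this language.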

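Model features

Optimized for Brazilian Portuguese

Trained and optimized specifically for the Brazilian variant of Portuguese

Large parameter count

With 1.5 billion parameters, it sets a new state of the art for Brazilian Portuguese

High performance

Highly competitive performance on Brazilian Portuguese tasks

Open license

Distributed free of charge under the permissive MIT license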

Model capabilities

Text understanding

Masked language modeling

Brazilian Portuguese text processing

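Use cases

Natural language processing

Text completion

Automatically fills in masked spans of text

In the example, the model predicts 'costumes' (customs, traditions) as the top completion

Language understanding

Understands the meaning and context of Brazilian Portuguese text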

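🚀 Albertina 1.5B PTBR

Albertina 1.5B PTBR is a foundation large language model for the American variant of Portuguese spoken in Brazil. It is an encoder of the BERT family, based on the Transformer neural architecture and developed over the DeBERTa model, with highly competitive performance for this language. It has different versions that were trained for different variants of Portuguese, namely the European variant spoken in Portugal (PTPT) and the American variant spoken in Brazil (PTBR), and it is distributed free of charge under an open license.

🚀 Quick Start

You can use this model directly with a pipeline for masked language modeling: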

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='PORTULAN/albertina-1b5-portuguese-ptbr-encoder')
>>> unmasker("A culinária portuguesa é rica em sabores e [MASK], tornando-se um dos maiores tesouros do país.")

[{'score': 0.8332648277282715, 'token': 14690, 'token_str': ' costumes', 'sequence': 'A culinária portuguesa é rica em sabores e costumes, tornando-se um dos maiores tesouros do país.'},
{'score': 0.07860890030860901, 'token': 29829, 'token_str': ' cores', 'sequence': 'A culinária portuguesa é rica em sabores e cores, tornando-se um dos maiores tesouros do país.'},
{'score': 0.03278181701898575, 'token': 35277, 'token_str': ' arte', 'sequence': 'A culinária portuguesa é rica em sabores e arte, tornando-se um dos maiores tesouros do país.'},
{'score': 0.009515956044197083, 'token': 9240, 'token_str': ' cor', 'sequence': 'A culinária portuguesa é rica em sabores e cor, tornando-se um dos maiores tesouros do país.'},
{'score': 0.009381960146129131, 'token': 33455, 'token_str': ' nuances', 'sequence': 'A culinária portuguesa é rica em sabores e nuances, tornando-se um dos maiores tesouros do país.'}]
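
The pipeline returns a list of candidate dictionaries ranked by score. As a minimal sketch, the snippet below shows one way to pull out the top-ranked token and apply a confidence cut-off. The helper name `top_prediction` and the 0.5 threshold are illustrative assumptions, and the sample data is the first two entries of the output above with the `sequence` field omitted.

```python
# Two entries copied from the fill-mask output above ('sequence' omitted for brevity).
predictions = [
    {"score": 0.8332648277282715, "token": 14690, "token_str": " costumes"},
    {"score": 0.07860890030860901, "token": 29829, "token_str": " cores"},
]

def top_prediction(candidates, min_score=0.5):
    """Return the stripped token of the best candidate, or None if its score is too low.

    Hypothetical helper for illustration; it is not part of transformers.
    """
    best = max(candidates, key=lambda c: c["score"])
    if best["score"] < min_score:
        return None
    # token_str carries a leading space from the tokenizer, so strip it.
    return best["token_str"].strip()

print(top_prediction(predictions))
```

Raising `min_score` makes the helper return `None` when no candidate is confident enough, which can be useful when the completion feeds an automated step.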

The model can also be used by fine-tuning it for a specific task:

>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
>>> from datasets import load_dataset

>>> model = AutoModelForSequenceClassification.from_pretrained("PORTULAN/albertina-1b5-portuguese-ptbr-encoder", num_labels=2)
>>> tokenizer = AutoTokenizer.from_pretrained("PORTULAN/albertina-1b5-portuguese-ptbr-encoder")
>>> dataset = load_dataset("PORTULAN/glue-ptbr", "rte")

>>> def tokenize_function(examples):
...     return tokenizer(examples["sentence1"], examples["sentence2"], padding="max_length", truncation=True)

>>> tokenized_datasets = dataset.map(tokenize_function, batched=True)

>>> training_args = TrainingArguments(output_dir="albertina-ptbr-rte", evaluation_strategy="epoch")
>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=tokenized_datasets["train"],
...     eval_dataset=tokenized_datasets["validation"],
... )

>>> trainer.train()
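
The `Trainer` above is not given a metric function, so evaluation would report the loss only. A minimal sketch of an accuracy function that could be passed as `compute_metrics=compute_accuracy` when constructing the `Trainer` follows. The function name is an assumption, and it is written in plain Python so that it works on nested lists as well as on the arrays the `Trainer` supplies.

```python
def compute_accuracy(eval_pred):
    """Hypothetical metric function: share of examples whose argmax logit equals the label."""
    logits, labels = eval_pred  # unpack (predictions, label_ids)
    correct = 0
    total = 0
    for row, label in zip(logits, labels):
        scores = list(row)
        predicted = scores.index(max(scores))  # argmax over the class logits
        correct += int(predicted == int(label))
        total += 1
    return {"accuracy": correct / total if total else 0.0}

# Toy check: two-class logits for three examples and their gold labels.
toy_logits = [[0.1, 0.9], [2.0, -1.0], [0.3, 0.2]]
toy_labels = [1, 0, 1]
print(compute_accuracy((toy_logits, toy_labels)))
```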

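✨ Key Features

Targeted: designed specifically for the American variant of Portuguese spoken in Brazil, so it handles text in this variant better.
Modern architecture: built on the Transformer architecture and the DeBERTa model, with strong performance.
Multiple versions: separate versions exist for the different variants of Portuguese, serving the needs of different regions.
Free and open: distributed free of charge under an open license, for both research and practical use.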

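📚 Documentation

Model description

This model card is for Albertina 1.5B PTBR, which has 1.5 billion parameters, 48 layers and a hidden size of 1536. Albertina 1.5B PTBR is distributed under the MIT license; DeBERTa is distributed under the MIT license as well.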
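
As a rough sanity check on the 1.5 billion figure, the parameter count can be estimated from the layer count and hidden size. The sketch below assumes a standard Transformer encoder layer (four attention projections plus a feed-forward block with a 4x expansion, roughly 12·h² weights per layer) and a vocabulary of about 128k tokens. Neither assumption is stated in this card, and biases, layer norms and DeBERTa's relative-position parameters are ignored.

```python
NUM_LAYERS = 48        # from the model description
HIDDEN_SIZE = 1536     # from the model description
VOCAB_SIZE = 128_000   # assumption: approximate vocabulary size, not stated in this card
FFN_MULTIPLIER = 4     # assumption: feed-forward width is 4x the hidden size

def estimate_parameters(layers, hidden, vocab, ffn_mult):
    """Back-of-the-envelope weight count for a Transformer encoder."""
    attention = 4 * hidden * hidden                   # Q, K, V and output projections
    feed_forward = 2 * hidden * (ffn_mult * hidden)   # up- and down-projection
    embeddings = vocab * hidden                       # token embedding matrix
    return layers * (attention + feed_forward) + embeddings

total = estimate_parameters(NUM_LAYERS, HIDDEN_SIZE, VOCAB_SIZE, FFN_MULTIPLIER)
print(f"{total / 1e9:.2f}B parameters (approximate)")
```

Under these assumptions the estimate lands close to the stated size, with the encoder layers accounting for most of the weights and the embedding matrix for the remainder.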

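Training data

Albertina 1.5B PTBR was trained on a 36 billion token dataset, gathered from openly available corpora of American Portuguese from the following sources:

CulturaX: CulturaX is a multilingual corpus, freely available for research and AI development, created by combining and extensively cleaning two other large datasets, mC4 and OSCAR. It is the result of a selection performed over the Common Crawl dataset, which is crawled from the web: only pages whose metadata indicates permission to be crawled were retained, and deduplication and removal of some boilerplate, among other filters, were applied. Since it does not discriminate between the Portuguese variants, we performed extra filtering, retaining only documents whose metadata indicates that the Internet country code top-level domain is that of Portugal.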
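
The authors' filtering code is not shown in this card. As a minimal sketch of the idea of keeping only documents from a given country code top-level domain, the snippet below assumes each document is a dictionary that carries its source URL in a `url` field. The helper names and the sample records are made up for illustration, and the country code is a parameter.

```python
from urllib.parse import urlparse

def has_cctld(url, cctld):
    """True if the URL's host ends with the given country-code TLD (for example 'pt')."""
    host = urlparse(url).hostname or ""
    return host.lower().endswith("." + cctld.lower())

def filter_by_cctld(documents, cctld, url_field="url"):
    """Keep documents whose URL metadata points at the given ccTLD.

    Assumes each document is a dict carrying its source URL under `url_field`.
    """
    return [doc for doc in documents if has_cctld(doc.get(url_field, ""), cctld)]

# Made-up sample records, for illustration only.
sample = [
    {"url": "https://www.example.pt/noticias", "text": "..."},
    {"url": "https://example.com.br/artigo", "text": "..."},
    {"url": "https://example.com/pt/page", "text": "..."},
]
print([doc["url"] for doc in filter_by_cctld(sample, "pt")])
```

Matching on the parsed hostname rather than on the raw string avoids false positives from paths such as `/pt/`.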

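Preprocessing

We filtered the PTBR corpus using the BLOOM pre-processing pipeline. We skipped the default stopword filtering, since it would disrupt the syntactic structure, and also the language identification filtering, given that the corpus was pre-selected as Portuguese.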

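Training

As codebase, we resorted to DeBERTa V2 xxlarge, for English. To train Albertina 1.5B PTBR, the dataset was tokenized with the original DeBERTa tokenizer, with a 128-token sequence truncation and dynamic padding for 250k steps, a 256-token sequence truncation for 80k steps (Albertina 1.5B PTBR 256), and finally a 512-token sequence truncation for 60k steps. These steps correspond, on an a2-megagpu-16gb Google Cloud A2 node, to 48 hours of computation for the 128-token input sequences, 24 hours for the 256-token input sequences and 24 hours for the 512-token input sequences. We opted for a learning rate of 1e-5 with linear decay and 10k warm-up steps.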
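
The schedule described above can be sketched in a few lines. The step counts, truncation lengths, peak learning rate and warm-up length come from the text; treating the three phases as one continuous run of 390k steps, with the rate decaying linearly to zero at the last step, is an assumption, since the card does not say whether the schedule was restarted for each phase.

```python
PEAK_LR = 1e-5          # from the text
WARMUP_STEPS = 10_000   # from the text
PHASES = [(128, 250_000), (256, 80_000), (512, 60_000)]  # (max tokens, steps), from the text
TOTAL_STEPS = sum(steps for _, steps in PHASES)

def learning_rate(step):
    """Linear warm-up to PEAK_LR, then linear decay to zero at TOTAL_STEPS (assumed)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(TOTAL_STEPS - step, 0)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)

def max_length_at(step):
    """Sequence truncation length in effect at a given global step."""
    boundary = 0
    for max_tokens, steps in PHASES:
        boundary += steps
        if step < boundary:
            return max_tokens
    return PHASES[-1][0]

for s in (0, 5_000, 10_000, 250_000, 389_999):
    print(s, max_length_at(s), f"{learning_rate(s):.2e}")
```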

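Performance

We resorted to extraGLUE, a PTBR version of the GLUE and SuperGLUE benchmarks. We automatically translated the GLUE and SuperGLUE tasks using DeepL Translate, which specifically offers translation from English to either PTPT or PTBR as an option.

Model	RTE (Accuracy)	WNLI (Accuracy)	MRPC (F1)	STS-B (Pearson)	COPA (Accuracy)	CB (F1)	MultiRC (F1)	BoolQ (Accuracy)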
Albertina 1.5B PTBR	0.8676	0.4742	0.8622	0.9007	0.7767	0.6372	0.7667	0.8654
Albertina 1.5B PTBR 256	0.8123	0.4225	0.8638	0.8968	0.8533	0.6884	0.6799	0.8509
Albertina 900M PTBR	0.7545	0.4601	0.9071	0.8910	0.7767	0.5799	0.6731	0.8385
BERTimbau (335M)	0.6446	0.5634	0.8873	0.8842	0.6933	0.5438	0.6787	0.7783
Albertina 100M PTBR	0.6582	0.5634	0.8149	0.8489	n.a.	0.4771	0.6469	0.7537
DeBERTa 1.5B (English)	0.7112	0.5634	0.8545	0.0123	0.5700	0.4307	0.3639	0.6217
DeBERTa 100M (English)	0.5716	0.5587	0.8060	0.8266	n.a.	0.4739	0.6391	0.6838
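
To make the table easier to compare column by column, the sketch below looks up the top-scoring model or models for a task. The scores are copied from the table above, with `n.a.` entries stored as `None`; the helper name `best_models` is an illustrative assumption.

```python
TASKS = ["RTE", "WNLI", "MRPC", "STS-B", "COPA", "CB", "MultiRC", "BoolQ"]

# Scores copied from the table above; None marks an "n.a." cell.
ROWS = {
    "Albertina 1.5B PTBR": [0.8676, 0.4742, 0.8622, 0.9007, 0.7767, 0.6372, 0.7667, 0.8654],
    "Albertina 1.5B PTBR 256": [0.8123, 0.4225, 0.8638, 0.8968, 0.8533, 0.6884, 0.6799, 0.8509],
    "Albertina 900M PTBR": [0.7545, 0.4601, 0.9071, 0.8910, 0.7767, 0.5799, 0.6731, 0.8385],
    "BERTimbau (335M)": [0.6446, 0.5634, 0.8873, 0.8842, 0.6933, 0.5438, 0.6787, 0.7783],
    "Albertina 100M PTBR": [0.6582, 0.5634, 0.8149, 0.8489, None, 0.4771, 0.6469, 0.7537],
    "DeBERTa 1.5B (English)": [0.7112, 0.5634, 0.8545, 0.0123, 0.5700, 0.4307, 0.3639, 0.6217],
    "DeBERTa 100M (English)": [0.5716, 0.5587, 0.8060, 0.8266, None, 0.4739, 0.6391, 0.6838],
}

def best_models(task):
    """Return (sorted list of models tied for the top score, top score) for one task."""
    col = TASKS.index(task)
    scored = {model: row[col] for model, row in ROWS.items() if row[col] is not None}
    top = max(scored.values())
    return sorted(m for m, s in scored.items() if s == top), top

for task in TASKS:
    models, score = best_models(task)
    print(f"{task}: {score:.4f} ({', '.join(models)})")
```

Returning every model tied at the top matters for WNLI, where several rows share the same score.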

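🔧 Technical Details

Albertina 1.5B PTBR is an encoder built on the Transformer architecture and the DeBERTa model. It was trained on a 36 billion token dataset of American Portuguese, using the preprocessing and training steps described above. Training proceeded in three phases with sequence truncation at 128, 256 and 512 tokens, with dynamic padding, a learning rate of 1e-5 with linear decay, and 10k warm-up steps. Performance was evaluated on extraGLUE, a PTBR version of GLUE and SuperGLUE obtained by automatically translating the tasks into American Portuguese.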

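📄 License

Albertina 1.5B PTBR is distributed under the MIT license; DeBERTa is distributed under the MIT license as well.

Citation

When using or citing this model, please cite the following publication: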

@misc{albertina-pt-fostering,
      title={Fostering the Ecosystem of Open Neural Encoders
            for Portuguese with Albertina PT-* family}, 
      author={Rodrigo Santos and João Rodrigues and Luís Gomes
              and João Silva and António Branco
              and Henrique Lopes Cardoso and Tomás Freitas Osório
              and Bernardo Leite},
      year={2024},
      eprint={2403.01897},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

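Acknowledgments

The research reported here was partially supported by:

PORTULAN CLARIN, Research Infrastructure for the Science and Technology of Language, funded by Lisboa 2020, Alentejo 2020 and FCT (Foundation for Science and Technology) under the grant PINFRA/22117/2016.
Research project ALBERTINA - Foundation Encoder Model for Portuguese and AI, funded by FCT (Foundation for Science and Technology) under the grant CPCA-IAC/AV/478394/2022.
Innovation project ACCELERAT.AI - Multilingual Intelligent Contact Centers, funded by IAPMEI, I.P. - Agency for Competitiveness and Innovation under the grant C625734525-00462629 of the Recovery and Resilience Plan, call RE-C05-i01.01, Mobilizing Agendas/Alliances for Reindustrialization.
LIACC - Laboratory for AI and Computer Science, funded by FCT (Foundation for Science and Technology) under the grant FCT/UID/CEC/0027/2020.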