FlauBERT: An Open-Source French BERT Model, Pre-trained on a Large-Scale Corpus for French Text Processing

Flaubert Base Cased

Developed by flaubert

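FlauBERT is a French BERT model pre-trained on a large-scale French corpus, trained using the Jean Zay supercomputer of the French National Centre for Scientific Research (CNRS).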

Large Language Model

Transformers

French | License: MIT | #French BERT #Unsupervised pre-training #FLUE benchmark

Downloads: 4,253

Release date: 3/2/2022

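Model Overview

FlauBERT is an unsupervised language model for French, pre-trained with the BERT architecture and suitable for a wide range of French NLP tasks.

Model Features

French-specific pre-training

Pre-trained specifically on French text, so it is better adapted to the characteristics of the language than a general-purpose model.

Multiple versions

Available in several sizes, from small to large, to suit different requirements.

FLUE evaluation benchmark

Released together with FLUE, an evaluation framework for French NLP, to make model evaluation easier.

Cased and uncased options

Available in case-sensitive (cased) and case-insensitive (uncased) versions.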

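Model Capabilities

French text understanding

Contextual word embedding generation

Sentence classification

Named entity recognition

Question answering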

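Use Cases

Academic research

French linguistics research

Analyzing the linguistic features and grammatical structures of French.

Commercial applications

French customer-service chatbots

Building dialogue systems that understand French.

French content classification

Classifying French news articles, reviews, and similar content.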

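🚀 FlauBERT: Unsupervised Language Model Pre-training for French

FlauBERT is a French BERT model trained on a very large and heterogeneous French corpus. Models of different sizes were trained using the Jean Zay supercomputer of the French National Centre for Scientific Research (CNRS).

FlauBERT is released together with FLUE, an evaluation framework for French NLP systems similar to the popular GLUE benchmark. Its goal is to enable more reproducible experiments in the future and to share models and progress on the French language. See the official website for more details.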

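✨ Key Features

Pre-trained on a large and heterogeneous French corpus, so it is well adapted to the characteristics of French.
Available in several model sizes, so you can choose one that fits your needs.
Accompanied by the FLUE evaluation framework, similar to the GLUE benchmark, for evaluating and comparing models.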

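📦 Installation

The documentation does not give specific installation steps. Refer to the official website for installation information.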

💻 Usage Examples

Basic usage

import torch
from transformers import FlaubertModel, FlaubertTokenizer

# Choose one of the following model names:
# ['flaubert/flaubert_small_cased', 'flaubert/flaubert_base_uncased',
#  'flaubert/flaubert_base_cased', 'flaubert/flaubert_large_cased']
modelname = 'flaubert/flaubert_base_cased'

# Load the pretrained model and tokenizer.
# Use do_lowercase=False for cased models and do_lowercase=True for uncased models.
flaubert, log = FlaubertModel.from_pretrained(modelname, output_loading_info=True)
flaubert_tokenizer = FlaubertTokenizer.from_pretrained(modelname, do_lowercase=False)

sentence = "Le chat mange une pomme."
token_ids = torch.tensor([flaubert_tokenizer.encode(sentence)])

last_layer = flaubert(token_ids)[0]
print(last_layer.shape)
# torch.Size([1, 8, 768])  -> (batch size x number of tokens x embedding dimension)

# The BERT [CLS] token corresponds to the first hidden state of the last layer
cls_embedding = last_layer[:, 0, :]
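
The `do_lowercase` rule mentioned in the example above can be derived from the model name instead of being set by hand. The following is a minimal sketch under stated assumptions: `tokenizer_kwargs_for` is a hypothetical helper (it is not part of `transformers`), and it relies only on the naming convention of the four listed checkpoints, where uncased models end in `uncased`.

```python
def tokenizer_kwargs_for(modelname: str) -> dict:
    """Return keyword arguments for FlaubertTokenizer.from_pretrained.

    Hypothetical helper: it inspects only the checkpoint name and assumes
    that uncased checkpoints end in "uncased".
    """
    # Normalize separators so that both 'flaubert_base_uncased' and
    # 'flaubert-base-uncased' style names are handled the same way.
    normalized = modelname.lower().replace("-", "_")
    return {"do_lowercase": normalized.endswith("_uncased")}


if __name__ == "__main__":
    for name in ("flaubert/flaubert_base_cased", "flaubert/flaubert_base_uncased"):
        print(name, tokenizer_kwargs_for(name))
```

With this helper, the tokenizer call could be written as `FlaubertTokenizer.from_pretrained(modelname, **tokenizer_kwargs_for(modelname))`, which keeps the flag consistent with whichever checkpoint is selected.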

Notes

⚠️ Important

If your transformers version is 2.10.0 or earlier, modelname should be one of the following:

['flaubert-small-cased', 'flaubert-base-uncased', 'flaubert-base-cased', 'flaubert-large-cased']
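
The version rule above can be applied programmatically. The sketch below is illustrative only: `LEGACY_NAMES`, `parse_version`, and `resolve_modelname` are hypothetical names, and the pairing between the current and legacy identifiers is inferred from the two lists on this page rather than taken from the library.

```python
import re

# Assumed pairing between the current identifiers and the legacy names.
LEGACY_NAMES = {
    "flaubert/flaubert_small_cased": "flaubert-small-cased",
    "flaubert/flaubert_base_uncased": "flaubert-base-uncased",
    "flaubert/flaubert_base_cased": "flaubert-base-cased",
    "flaubert/flaubert_large_cased": "flaubert-large-cased",
}


def parse_version(version: str) -> tuple:
    """Extract the leading (major, minor, patch) integers from a version string."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    # A missing patch component (e.g. "2.9") is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def resolve_modelname(modelname: str, transformers_version: str) -> str:
    """Return the legacy name for transformers <= 2.10.0, else the name unchanged."""
    if parse_version(transformers_version) <= (2, 10, 0):
        return LEGACY_NAMES[modelname]
    return modelname


if __name__ == "__main__":
    print(resolve_modelname("flaubert/flaubert_base_cased", "2.10.0"))
    print(resolve_modelname("flaubert/flaubert_base_cased", "4.0.0"))
```

In practice the second argument would typically be `transformers.__version__`; it is passed in explicitly here so the helper can be checked without the library installed.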

📚 Detailed Documentation

FlauBERT models

Model name	Number of layers	Attention heads	Embedding dimension	Total parameters
`flaubert-small-cased`	6	8	512	54 M
`flaubert-base-uncased`	12	12	768	137 M
`flaubert-base-cased`	12	12	768	138 M
`flaubert-large-cased`	24	16	1024	373 M

⚠️ Important

The flaubert-small-cased model is only partially trained, so its performance is not guaranteed. Consider using it for debugging purposes only.
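
The table and the warning above can be captured as data when a script needs to pick a checkpoint. This is a sketch, not an official API: `FlaubertSpec`, `FLAUBERT_SPECS`, and `largest_model_within` are hypothetical names, and the figures are copied from the table on this page.

```python
from typing import NamedTuple, Optional


class FlaubertSpec(NamedTuple):
    layers: int
    attention_heads: int
    embedding_dim: int
    params_millions: int
    partially_trained: bool = False


# Values copied from the model table; only the small model is flagged
# as partially trained, following the warning on this page.
FLAUBERT_SPECS = {
    "flaubert-small-cased": FlaubertSpec(6, 8, 512, 54, partially_trained=True),
    "flaubert-base-uncased": FlaubertSpec(12, 12, 768, 137),
    "flaubert-base-cased": FlaubertSpec(12, 12, 768, 138),
    "flaubert-large-cased": FlaubertSpec(24, 16, 1024, 373),
}


def largest_model_within(budget_millions: int,
                         allow_debug_models: bool = False) -> Optional[str]:
    """Pick the largest model whose parameter count fits the budget.

    Partially trained checkpoints are skipped unless explicitly allowed.
    Returns None when no model qualifies.
    """
    candidates = [
        (spec.params_millions, name)
        for name, spec in FLAUBERT_SPECS.items()
        if spec.params_millions <= budget_millions
        and (allow_debug_models or not spec.partially_trained)
    ]
    return max(candidates)[1] if candidates else None


if __name__ == "__main__":
    print(largest_model_within(150))
    print(largest_model_within(100, allow_debug_models=True))
```

The `embedding_dim` field is also the expected size of the last dimension of the model output, so it can serve as a sanity check: the usage example reports 768 for the base cased model, which matches the table.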

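📄 License

This project is released under the MIT License.

📖 References

If you use FlauBERT or the FLUE benchmark in a scientific publication, or find the resources in this repository useful, please cite one of the following papers:

LREC paper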

@InProceedings{le2020flaubert,
  author    = {Le, Hang  and  Vial, Lo\"{i}c  and  Frej, Jibril  and  Segonne, Vincent  and  Coavoux, Maximin  and  Lecouteux, Benjamin  and  Allauzen, Alexandre  and  Crabb\'{e}, Beno\^{i}t  and  Besacier, Laurent  and  Schwab, Didier},
  title     = {FlauBERT: Unsupervised Language Model Pre-training for French},
  booktitle = {Proceedings of The 12th Language Resources and Evaluation Conference},
  month     = {May},
  year      = {2020},
  address   = {Marseille, France},
  publisher = {European Language Resources Association},
  pages     = {2479--2490},
  url       = {https://www.aclweb.org/anthology/2020.lrec-1.302}
}

TALN paper

@inproceedings{le2020flaubert,
  title         = {FlauBERT: des mod{\`e}les de langue contextualis{\'e}s pr{\'e}-entra{\^\i}n{\'e}s pour le fran{\c{c}}ais},
  author        = {Le, Hang and Vial, Lo{\"\i}c and Frej, Jibril and Segonne, Vincent and Coavoux, Maximin and Lecouteux, Benjamin and Allauzen, Alexandre and Crabb{\'e}, Beno{\^\i}t and Besacier, Laurent and Schwab, Didier},
  booktitle     = {Actes de la 6e conf{\'e}rence conjointe Journ{\'e}es d'{\'E}tudes sur la Parole (JEP, 31e {\'e}dition), Traitement Automatique des Langues Naturelles (TALN, 27e {\'e}dition), Rencontre des {\'E}tudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (R{\'E}CITAL, 22e {\'e}dition). Volume 2: Traitement Automatique des Langues Naturelles},
  pages         = {268--278},
  year          = {2020},
  organization  = {ATALA}
}