roberta-base-wechsel-chinese: an open-source model for efficient cross-lingual transfer from English to Chinese


Roberta Base Wechsel Chinese

Developed by benjamin

A Chinese RoBERTa model trained with the WECHSEL method, which transfers an English model to Chinese efficiently.

Large language model

Transformers

Language: Chinese · License: MIT · #cross-lingual-transfer #subword-embedding-initialization #Chinese-NLP


Released: 3/2/2022

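Model overview

This model was trained with WECHSEL, a method that transfers a monolingual language model to a new language by effectively initializing its subword embeddings. It targets Chinese natural language processing tasks.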

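Model features

Efficient cross-lingual transfer

WECHSEL transfers the parameters of an English model to Chinese. The paper reports that transferred models can outperform comparable models trained from scratch with up to 64x less training effort.

Competitive performance

Scores higher than `bert-base-chinese` on Chinese NLI (78.32 vs. 76.55) and on the NLI/NER average (79.44 vs. 79.30), but lower on NER (80.55 vs. 82.05).

Low-resource friendly

Well suited to transferring models to low-resource languages, since it reduces the compute needed for training.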

Model capabilities

Natural language understanding

Text classification

Named entity recognition

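Use cases

Natural language processing

Chinese text classification

Classify Chinese text.

Scores 78.32 on the NLI task.

Chinese named entity recognition

Identify named entities in Chinese text.

Scores 80.55 on the NER task.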

🚀 roberta-base-wechsel-chinese

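The model in this project was trained with WECHSEL, a method for effectively initializing subword embeddings when transferring a monolingual language model to another language.

Code: https://github.com/CPJKU/wechsel

Paper: https://aclanthology.org/2022.naacl-main.293/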
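The embedding initialization idea can be illustrated with a small NumPy sketch. This is a simplified illustration under stated assumptions, not the authors' implementation (the repository linked above is the reference for that). It assumes that static word vectors for the source and target subwords already live in one aligned space; the function name and the `k` and `temperature` defaults are hypothetical choices made for this example.

```python
import numpy as np


def init_target_embeddings(
    source_model_emb: np.ndarray,   # (n_src, d_model) trained source-language subword embeddings
    source_static_emb: np.ndarray,  # (n_src, d_static) static vectors of the source subwords
    target_static_emb: np.ndarray,  # (n_tgt, d_static) static vectors of the target subwords, same aligned space
    k: int = 10,                    # illustrative default, an assumption of this sketch
    temperature: float = 0.1,       # illustrative default, an assumption of this sketch
) -> np.ndarray:
    """Initialize each target subword embedding as a similarity-weighted
    average of the model embeddings of its k nearest source subwords."""
    # Cosine similarity between every target and every source subword
    # (assumes no all-zero static vectors).
    src = source_static_emb / np.linalg.norm(source_static_emb, axis=1, keepdims=True)
    tgt = target_static_emb / np.linalg.norm(target_static_emb, axis=1, keepdims=True)
    sim = tgt @ src.T  # (n_tgt, n_src)

    k = min(k, sim.shape[1])
    # Indices of the k most similar source subwords for each target subword.
    top_idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]
    top_sim = np.take_along_axis(sim, top_idx, axis=1)

    # Temperature softmax over the k neighbours, shifted for numerical stability.
    logits = top_sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Weighted average of neighbour embeddings: (n_tgt, k, d_model) -> (n_tgt, d_model).
    return np.einsum("tk,tkd->td", weights, source_model_emb[top_idx])
```

The result has one row per target subword and can replace the embedding matrix of the source model once its tokenizer has been swapped for a target-language one. All other model weights are kept as they are, which is where the savings in training effort come from.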

🚀 Quick start

WECHSEL-trained models are available for several languages, including French, German, Chinese and Swahili. See the code and paper links above for details.
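As a starting point, the checkpoint can be tried out as a masked language model with the Hugging Face `transformers` library. The sketch below assumes the model is published on the Hub under the id `benjamin/roberta-base-wechsel-chinese` (author name plus model name; adjust it if your copy lives elsewhere). The helper name `predict_masked` and the example sentence are hypothetical.

```python
MODEL_ID = "benjamin/roberta-base-wechsel-chinese"  # assumed Hub id: author + model name


def predict_masked(template: str, model_id: str = MODEL_ID, top_k: int = 5):
    """Fill the `{mask}` placeholder in `template` and return (token, score) pairs."""
    # Imported here so the dependency is only needed when a model is actually loaded.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_id)
    # Use the tokenizer's own mask token instead of hard-coding it.
    text = template.format(mask=fill_mask.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill_mask(text, top_k=top_k)]
```

Calling `predict_masked("北京是中国的{mask}。")` downloads the checkpoint on first use and returns the highest-scoring fillers. For NLI or NER, the same checkpoint would need a task head and fine-tuning first, since it is a pretrained masked language model.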

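✨ Key features

This project uses WECHSEL to transfer monolingual language models to new languages. Comparisons with baseline models show that the method is effective across several languages and tasks.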

📚 Documentation

RoBERTa model performance

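| Model | NLI score | NER score | Average score |
|---|---|---|---|
| `roberta-base-wechsel-french` | 82.43 | 90.88 | 86.65 |
| `camembert-base` | 80.88 | 90.26 | 85.57 |

| Model | NLI score | NER score | Average score |
|---|---|---|---|
| `roberta-base-wechsel-german` | 81.79 | 89.72 | 85.76 |
| `deepset/gbert-base` | 78.64 | 89.46 | 84.05 |

| Model | NLI score | NER score | Average score |
|---|---|---|---|
| `roberta-base-wechsel-chinese` | 78.32 | 80.55 | 79.44 |
| `bert-base-chinese` | 76.55 | 82.05 | 79.30 |

| Model | NLI score | NER score | Average score |
|---|---|---|---|
| `roberta-base-wechsel-swahili` | 75.05 | 87.39 | 81.22 |
| `xlm-roberta-base` | 69.18 | 87.37 | 78.28 |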
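To read the tables as gains over each baseline, the scores can be compared with a few lines of Python. The numbers are copied from the tables above; the `SCORES` layout and the `gains` helper are hypothetical names for this example. Because the inputs are already rounded to two decimals, the mean gain may differ from the difference of the published averages by about 0.01.

```python
# (NLI, NER) scores from the tables above: WECHSEL model first, baseline second.
SCORES = {
    "French": ((82.43, 90.88), (80.88, 90.26)),
    "German": ((81.79, 89.72), (78.64, 89.46)),
    "Chinese": ((78.32, 80.55), (76.55, 82.05)),
    "Swahili": ((75.05, 87.39), (69.18, 87.37)),
}


def gains(wechsel, baseline):
    """Return (NLI gain, NER gain, mean gain) of the WECHSEL model over the baseline."""
    nli = wechsel[0] - baseline[0]
    ner = wechsel[1] - baseline[1]
    return round(nli, 2), round(ner, 2), round((nli + ner) / 2, 2)


for language, (wechsel, baseline) in SCORES.items():
    nli, ner, mean = gains(wechsel, baseline)
    print(f"{language:8s} NLI {nli:+.2f}  NER {ner:+.2f}  mean {mean:+.2f}")
```

The Chinese row is the one case where a per-task gain is negative: the WECHSEL model is ahead on NLI but behind `bert-base-chinese` on NER.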

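GPT-2 model performance

| Model | Perplexity (PPL) |
|---|---|
| `gpt2-wechsel-french` | 19.71 |
| `gpt2` (retrained from scratch) | 20.47 |

| Model | Perplexity (PPL) |
|---|---|
| `gpt2-wechsel-german` | 26.8 |
| `gpt2` (retrained from scratch) | 27.63 |

| Model | Perplexity (PPL) |
|---|---|
| `gpt2-wechsel-chinese` | 51.97 |
| `gpt2` (retrained from scratch) | 52.98 |

| Model | Perplexity (PPL) |
|---|---|
| `gpt2-wechsel-swahili` | 10.14 |
| `gpt2` (retrained from scratch) | 10.58 |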

See the paper for more details.
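For readers unfamiliar with the metric in the GPT-2 tables, perplexity is conventionally the exponential of the average negative log-likelihood per token, so lower is better. The sketch below shows that standard definition only; this page does not state which evaluation data or tokenization produced the reported numbers, and the function name is hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Sanity check: a model that assigns probability 1/4 to every token has perplexity 4.
uniform = [math.log(0.25)] * 8
print(round(perplexity(uniform), 6))  # 4.0
```

Perplexities computed over different tokenizers are not directly comparable, which is one reason each table pairs the WECHSEL model only with a `gpt2` model retrained for the same language.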

📄 License

This project is released under the MIT license.

📖 Citation

If you use the models or the method from this project, please cite the following paper:

@inproceedings{minixhofer-etal-2022-wechsel,
    title = "{WECHSEL}: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models",
    author = "Minixhofer, Benjamin  and
      Paischer, Fabian  and
      Rekabsaz, Navid",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.293",
    pages = "3992--4006",
    abstract = "Large pretrained language models (LMs) have become the central building block of many NLP applications. Training these models requires ever more computational resources and most of the existing models are trained on English text only. It is exceedingly expensive to train these models in other languages. To alleviate this problem, we introduce a novel method {--} called WECHSEL {--} to efficiently and effectively transfer pretrained LMs to new languages. WECHSEL can be applied to any model which uses subword-based tokenization and learns an embedding for each subword. The tokenizer of the source model (in English) is replaced with a tokenizer in the target language and token embeddings are initialized such that they are semantically similar to the English tokens by utilizing multilingual static word embeddings covering English and the target language. We use WECHSEL to transfer the English RoBERTa and GPT-2 models to four languages (French, German, Chinese and Swahili). We also study the benefits of our method on very low-resource languages. WECHSEL improves over proposed methods for cross-lingual parameter transfer and outperforms models of comparable size trained from scratch with up to 64x less training effort. Our method makes training large language models for new languages more accessible and less damaging to the environment. We make our code and models publicly available.",
}