ALMA-7B and ALMA-13B-R Open-Source Translation Models: High-Quality Translation Comparable to GPT-4

ALMA 7B

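Developed by haoranxu

ALMA-13B-R is an advanced translation model built on a large language model and fine-tuned with Contrastive Preference Optimization (CPO). It can match or even exceed GPT-4 or WMT competition winners.

Machine Translation

Transformers

License: MIT #LLM-translation #two-stage-fine-tuning #contrastive-preference-optimization

Downloads: 256

Released: 9/17/2023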

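Model Overview

ALMA-13B-R is a translation model based on LLaMA-2-13B. It achieves high-performance machine translation through two-stage fine-tuning (monolingual fine-tuning followed by optimization on high-quality parallel data) and Contrastive Preference Optimization (CPO).

Model Features

Two-stage fine-tuning

The model is first fine-tuned on monolingual data and then optimized on high-quality parallel data, which ensures strong translation performance.

Contrastive Preference Optimization (CPO)

LoRA fine-tuning uses Contrastive Preference Optimization instead of conventional supervised fine-tuning, which significantly improves translation quality.

High-performance translation

Can match or even exceed GPT-4 or WMT competition winners, delivering professional-grade translation quality.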

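Model Capabilities

High-quality machine translation

Multilingual translation

Domain-specific translation

Use Cases

Professional translation

Technical document translation

Translate technical documents from one language to another while keeping specialized terminology accurate.

Translation quality comparable to professional human translation

International conference material translation

Provide high-quality translation of presentation materials and meeting minutes for international conferences.

On par with WMT competition winners

Business applications

Multinational corporate communication

Help companies with cross-language internal communication and document translation.

Improves communication efficiency in multinational companies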

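🚀 ALMA (Advanced Language Model-based Translator)

ALMA (Advanced Language Model-based trAnslator) is a translation model based on a large language model (LLM). It adopts a new translation-model paradigm: the model is first fine-tuned on monolingual data and then further optimized on high-quality parallel data. This two-step fine-tuning process ensures strong translation performance. For more details, please refer to our paper.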

@misc{xu2023paradigm,
      title={A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models}, 
      author={Haoran Xu and Young Jin Kim and Amr Sharaf and Hany Hassan Awadalla},
      year={2023},
      eprint={2309.11674},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

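ALMA-R (new release!): ALMA-R builds on the ALMA models. Instead of the supervised fine-tuning used in ALMA, it undergoes further LoRA fine-tuning with our proposed Contrastive Preference Optimization (CPO). CPO fine-tuning requires our [triplet preference data](https://huggingface.co/datasets/haoranxu/ALMA-R-Preference) for preference learning. ALMA-R can now match or even exceed GPT-4 or WMT winners!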

@misc{xu2024contrastive,
      title={Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation}, 
      author={Haoran Xu and Amr Sharaf and Yunmo Chen and Weiting Tan and Lingfeng Shen and Benjamin Van Durme and Kenton Murray and Young Jin Kim},
      year={2024},
      eprint={2401.08417},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
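To make the contrast with supervised fine-tuning concrete, the sketch below computes a CPO-style objective from sequence-level log-probabilities: a preference term that pushes the preferred translation above the dis-preferred one, plus a negative log-likelihood term on the preferred translation. It is a minimal illustration under stated assumptions, not the authors' training code: the function names, the pure-Python formulation, and the default `beta=0.1` are assumptions made here, and the exact objective should be checked against the CPO paper cited above.

```python
import math
from typing import Sequence


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def cpo_loss(
    chosen_logps: Sequence[float],
    rejected_logps: Sequence[float],
    beta: float = 0.1,  # assumed default, chosen for illustration
) -> float:
    """Batch-mean CPO-style loss.

    chosen_logps[i]   = log pi(y_preferred | x_i) under the model being trained
    rejected_logps[i] = log pi(y_dispreferred | x_i) under the same model
    """
    if len(chosen_logps) != len(rejected_logps) or not chosen_logps:
        raise ValueError("need two non-empty sequences of equal length")
    total = 0.0
    for lp_w, lp_l in zip(chosen_logps, rejected_logps):
        # Preference term: no reference model is involved.
        prefer = -log_sigmoid(beta * (lp_w - lp_l))
        # NLL term: keeps the model close to the preferred translation.
        nll = -lp_w
        total += prefer + nll
    return total / len(chosen_logps)
```

In an actual training loop the two log-probability inputs would come from summing token log-probabilities of each translation under the model; here they are plain floats so the arithmetic can be inspected directly.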

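✨ Key Features

Model Release

We release the six translation models presented in our papers:

ALMA-7B: full-weight fine-tuning of LLaMA-2-7B on 20B monolingual tokens, followed by full-weight fine-tuning on human-written parallel data.
ALMA-7B-LoRA: full-weight fine-tuning of LLaMA-2-7B on 20B monolingual tokens, followed by LoRA fine-tuning on human-written parallel data.
ALMA-7B-R (new release!): further LoRA fine-tuning on top of ALMA-7B-LoRA with Contrastive Preference Optimization.
ALMA-13B: full-weight fine-tuning of LLaMA-2-13B on 12B monolingual tokens, followed by full-weight fine-tuning on human-written parallel data.
ALMA-13B-LoRA (our best system): full-weight fine-tuning of LLaMA-2-13B on 12B monolingual tokens, followed by LoRA fine-tuning on human-written parallel data.
ALMA-13B-R (new release!): further LoRA fine-tuning on top of ALMA-13B-LoRA with Contrastive Preference Optimization.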

Model checkpoints are released on Hugging Face:

| Model | Base model link | LoRA link |
| --- | --- | --- |
| ALMA-7B | [haoranxu/ALMA-7B](https://huggingface.co/haoranxu/ALMA-7B) | - |
| ALMA-7B-LoRA | [haoranxu/ALMA-7B-Pretrain](https://huggingface.co/haoranxu/ALMA-7B-Pretrain) | [haoranxu/ALMA-7B-Pretrain-LoRA](https://huggingface.co/haoranxu/ALMA-7B-Pretrain-LoRA) |
| ALMA-7B-R (new release!) | [haoranxu/ALMA-7B-R (LoRA merged)](https://huggingface.co/haoranxu/ALMA-7B-R) | - |
| ALMA-13B | [haoranxu/ALMA-13B](https://huggingface.co/haoranxu/ALMA-13B) | - |
| ALMA-13B-LoRA | [haoranxu/ALMA-13B-Pretrain](https://huggingface.co/haoranxu/ALMA-13B-Pretrain) | [haoranxu/ALMA-13B-Pretrain-LoRA](https://huggingface.co/haoranxu/ALMA-13B-Pretrain-LoRA) |
| ALMA-13B-R (new release!) | [haoranxu/ALMA-13B-R (LoRA merged)](https://huggingface.co/haoranxu/ALMA-13B-R) | - |

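⚠️ Important Note

Note that ALMA-7B-Pretrain and ALMA-13B-Pretrain are NOT translation models. They have only gone through the first stage of monolingual fine-tuning (20B tokens for the 7B model, 12B tokens for the 13B model) and should be used together with their LoRA models.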
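The pairing rule in this note can be made explicit in code. The following is a minimal sketch, not part of the ALMA repository: `Checkpoint`, `CHECKPOINTS`, and `load_translator` are hypothetical names introduced here. The repository ids are copied from the checkpoint table, and the loading calls mirror those in the quick-start example of this document.

```python
from typing import NamedTuple, Optional


class Checkpoint(NamedTuple):
    base: str            # repo id to load with AutoModelForCausalLM
    lora: Optional[str]  # LoRA adapter repo id, or None if no adapter is needed


# Repo ids taken from the checkpoint table; the dict itself is illustrative.
CHECKPOINTS = {
    "ALMA-7B": Checkpoint("haoranxu/ALMA-7B", None),
    "ALMA-7B-LoRA": Checkpoint("haoranxu/ALMA-7B-Pretrain", "haoranxu/ALMA-7B-Pretrain-LoRA"),
    "ALMA-7B-R": Checkpoint("haoranxu/ALMA-7B-R", None),  # LoRA already merged
    "ALMA-13B": Checkpoint("haoranxu/ALMA-13B", None),
    "ALMA-13B-LoRA": Checkpoint("haoranxu/ALMA-13B-Pretrain", "haoranxu/ALMA-13B-Pretrain-LoRA"),
    "ALMA-13B-R": Checkpoint("haoranxu/ALMA-13B-R", None),  # LoRA already merged
}


def load_translator(name: str):
    """Load a system by name, attaching the LoRA adapter when one is listed."""
    # Heavy dependencies are imported lazily so the table above can be
    # inspected without torch/transformers/peft installed.
    import torch
    from transformers import AutoModelForCausalLM

    ckpt = CHECKPOINTS[name]
    model = AutoModelForCausalLM.from_pretrained(
        ckpt.base, torch_dtype=torch.float16, device_map="auto"
    )
    if ckpt.lora is not None:
        from peft import PeftModel

        # A *-Pretrain base is not a translation model until its adapter is attached.
        model = PeftModel.from_pretrained(model, ckpt.lora)
    return model
```

Calling `load_translator("ALMA-13B-LoRA")` would download the base weights and the adapter, so it needs network access and enough GPU memory; the lookup table itself can be used without either.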

Dataset Release

The datasets used for ALMA and ALMA-R are now also available on Hugging Face (new release!)

| Dataset | Train / validation | Test |
| --- | --- | --- |
| Human-written parallel data (ALMA) | [train and validation](https://huggingface.co/datasets/haoranxu/ALMA-Human-Parallel) | [WMT'22](https://huggingface.co/datasets/haoranxu/WMT22-Test) |
| Triplet preference data | [train](https://huggingface.co/datasets/haoranxu/ALMA-R-Preference) | [WMT'22](https://huggingface.co/datasets/haoranxu/WMT22-Test) and [WMT'23](https://huggingface.co/datasets/haoranxu/WMT23-Test) |

🚀 Quick Start

Below is a quick-start example that uses the ALMA-13B-LoRA system to translate "我爱机器翻译。" into English:

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM
from transformers import LlamaTokenizer

# Load base model and LoRA weights
model = AutoModelForCausalLM.from_pretrained("haoranxu/ALMA-13B-Pretrain", torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(model, "haoranxu/ALMA-13B-Pretrain-LoRA")
tokenizer = LlamaTokenizer.from_pretrained("haoranxu/ALMA-13B-Pretrain", padding_side='left')

# Add the source sentence into the prompt template
prompt="Translate this from Chinese to English:\nChinese: 我爱机器翻译。\nEnglish:"
input_ids = tokenizer(prompt, return_tensors="pt", padding=True, max_length=40, truncation=True).input_ids.cuda()

# Translation
with torch.no_grad():
    generated_ids = model.generate(input_ids=input_ids, num_beams=5, max_new_tokens=20, do_sample=True, temperature=0.6, top_p=0.9)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(outputs)
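The prompt in the example above follows a fixed template, and the decoded output of a causal language model normally echoes that prompt before the translation. The helpers below are a small sketch for reusing the template with other language pairs and for stripping the echoed prompt. `build_prompt` and `extract_translation` are hypothetical names, not part of the ALMA codebase, and the prefix check assumes the tokenizer round-trips the prompt text unchanged, which may not always hold.

```python
def build_prompt(source_text: str, source_lang: str, target_lang: str) -> str:
    """Fill the quick-start prompt template for an arbitrary language pair."""
    return (
        f"Translate this from {source_lang} to {target_lang}:\n"
        f"{source_lang}: {source_text}\n"
        f"{target_lang}:"
    )


def extract_translation(decoded: str, prompt: str) -> str:
    """Remove the echoed prompt from a decoded generation, if present."""
    if decoded.startswith(prompt):
        decoded = decoded[len(prompt):]
    # If the prompt was not reproduced verbatim, the text is returned as-is
    # (minus surrounding whitespace) rather than guessing where it ends.
    return decoded.strip()


prompt = build_prompt("我爱机器翻译。", "Chinese", "English")
```

With these helpers, `prompt` can be passed to the tokenizer in place of the hard-coded string, and each element of `outputs` can be passed through `extract_translation(output, prompt)` to keep only the generated translation.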