X-ALMA-13B-Pretrain: Open-Source Translation Model with Plug-and-Play Support for 50 Languages

X ALMA 13B Pretrain

Developed by haoranxu

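X-ALMA is a multilingual machine translation model that extends ALMA-R. It supports 50 languages and uses a plug-and-play architecture with language-specific modules.

Large Language Model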

Transformers

Supports Multiple Languages | Open Source License: MIT | #multilingual-machine-translation #plug-and-play-architecture #50-language-support

Downloads 2,928

Release Time : 6/27/2024

Model Overview

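X-ALMA is a multilingual machine translation model that extends ALMA-R, raising the number of supported languages from 6 to 50. It uses a plug-and-play architecture with language-specific modules, paired with a carefully designed training recipe.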

Model Features

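Multilingual support

Supports 50 languages spanning multiple language families.

Plug-and-play architecture

Uses a plug-and-play architecture with language-specific modules, paired with a carefully designed training recipe.

Modular design

You can load the base model together with a language-specific module, or load a merged model, depending on your needs.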

Model Capabilities

Machine translation

Multilingual open-ended question answering

Use Cases

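Machine translation

Chinese-to-English translation

Translates Chinese text into English.

High-quality translation results

Multilingual translation

Supports translation across the 50 languages.

Broad language coverage and high-quality translation

Question answering

Multilingual open-ended question answering

Supports open-ended question answering in multiple languages.

Accurate answers and broad language support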

🚀 X-ALMA

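X-ALMA extends ALMA-R, raising the number of supported languages from 6 to 50. It uses a plug-and-play architecture with language-specific modules, paired with a carefully designed training recipe. This release provides the X-ALMA pretrained base model.

🚀 Quick Start

There are three ways to load X-ALMA for translation. The example below translates “我愛機器翻譯。” into English (X-ALMA can also handle multilingual open-ended question answering).

Method 1: load the merged model, in which the language-specific module has already been merged into the base model (recommended)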

import torch
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
from peft import PeftModel

GROUP2LANG = {
    1: ["da", "nl", "de", "is", "no", "sv", "af"],
    2: ["ca", "ro", "gl", "it", "pt", "es"],
    3: ["bg", "mk", "sr", "uk", "ru"],
    4: ["id", "ms", "th", "vi", "mg", "fr"],
    5: ["hu", "el", "cs", "pl", "lt", "lv"],
    6: ["ka", "zh", "ja", "ko", "fi", "et"],
    7: ["gu", "hi", "mr", "ne", "ur"],
    8: ["az", "kk", "ky", "tr", "uz", "ar", "he", "fa"],
}
LANG2GROUP = {lang: str(group) for group, langs in GROUP2LANG.items() for lang in langs}
group_id = LANG2GROUP["zh"]

model = AutoModelForCausalLM.from_pretrained(f"haoranxu/X-ALMA-13B-Group{group_id}", torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(f"haoranxu/X-ALMA-13B-Group{group_id}", padding_side='left')

# Add the source sentence into the prompt template
prompt="Translate this from Chinese to English:\nChinese: 我愛機器翻譯。\nEnglish:"

# X-ALMA needs chat template but ALMA and ALMA-R don't need it.
chat_style_prompt = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(chat_style_prompt, tokenize=False, add_generation_prompt=True)

input_ids = tokenizer(prompt, return_tensors="pt", padding=True, max_length=40, truncation=True).input_ids.cuda()

# Translation
with torch.no_grad():
    generated_ids = model.generate(input_ids=input_ids, num_beams=5, max_new_tokens=20, do_sample=True, temperature=0.6, top_p=0.9)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(outputs)
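The prompt above hard-codes one language pair. A minimal sketch of building the same template for other pairs is shown here; `build_prompt` and the `LANG_NAMES` table are hypothetical helpers introduced for illustration (not part of X-ALMA), and the table covers only a few of the 50 languages.

```python
# Hypothetical helper, not part of X-ALMA: an illustrative subset of
# language codes mapped to the names used in the prompt template.
LANG_NAMES = {"en": "English", "zh": "Chinese", "de": "German", "ja": "Japanese"}


def build_prompt(src_code: str, tgt_code: str, text: str) -> str:
    """Fill the translation template from the example above."""
    src, tgt = LANG_NAMES[src_code], LANG_NAMES[tgt_code]
    return f"Translate this from {src} to {tgt}:\n{src}: {text}\n{tgt}:"


prompt = build_prompt("zh", "en", "我愛機器翻譯。")
# The result still needs the chat template before tokenizing:
# chat_style_prompt = [{"role": "user", "content": prompt}]
```

The returned string matches the hand-written prompt in the example, so it can be passed to `tokenizer.apply_chat_template` in the same way.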

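Method 2: load the base model and the language-specific module (recommended)

# Reuses the imports and `group_id` from Method 1.
# Load the pretrained base model, then attach the language-specific module with PEFT.
model = AutoModelForCausalLM.from_pretrained("haoranxu/X-ALMA-13B-Pretrain", torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(model, f"haoranxu/X-ALMA-13B-Group{group_id}")
tokenizer = AutoTokenizer.from_pretrained(f"haoranxu/X-ALMA-13B-Group{group_id}", padding_side='left')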

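Method 3: load the base model with all language-specific modules, like a mixture-of-experts (MoE) model (requires a GPU with a large amount of memory)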

from modeling_xalma import XALMAForCausalLM
model = XALMAForCausalLM.from_pretrained("haoranxu/X-ALMA", torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("haoranxu/X-ALMA", padding_side='left')

# Add `lang="zh"`: specify the language to instruct the model on which group to use for the third loading method during generation.
generated_ids = model.generate(input_ids=input_ids, num_beams=5, max_new_tokens=20, do_sample=True, temperature=0.6, top_p=0.9, lang="zh")
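For decoder-only models, `generate` returns the prompt tokens followed by the newly generated ones, so the decoded output in the examples above also contains the prompt text. Because the tokenizer pads on the left, every prompt in a batch has the same length, and the new tokens can be isolated by slicing. The sketch below illustrates the idea on plain Python lists; `strip_prompt` is a hypothetical helper, and with real tensors the equivalent would be `generated_ids[:, input_ids.shape[1]:]`.

```python
from typing import List, Sequence


def strip_prompt(generated_ids: Sequence[Sequence[int]], prompt_length: int) -> List[List[int]]:
    """Drop the first `prompt_length` ids (padding plus prompt) from each sequence."""
    return [list(seq[prompt_length:]) for seq in generated_ids]


# Toy ids: 0 = left padding, 11-13 = prompt tokens, 21-22 = generated tokens.
batch = [
    [0, 11, 12, 21, 22],
    [11, 12, 13, 21, 22],
]
new_ids = strip_prompt(batch, prompt_length=3)
```

Passing only the sliced ids to `tokenizer.batch_decode` would then yield the translation without the echoed prompt.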

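✨ Key Features

Multilingual support: building on ALMA-R, X-ALMA extends the supported languages from 6 to 50, spanning multiple language families.
Plug-and-play architecture: uses a plug-and-play architecture with language-specific modules, paired with a carefully designed training recipe.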

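📦 Model Information

| Property | Details |
| --- | --- |
| Base model | haoranxu/ALMA-13B-Pretrain |
| Training datasets | oscar-corpus/OSCAR-2301, allenai/nllb, Helsinki-NLP/opus-100 |
| Supported languages | English (en), Danish (da), Dutch (nl), German (de), Icelandic (is), Norwegian (no), Swedish (sv), Afrikaans (af), Catalan (ca), Romanian (ro), Galician (gl), Italian (it), Portuguese (pt), Spanish (es), Bulgarian (bg), Macedonian (mk), Serbian (sr), Ukrainian (uk), Russian (ru), Indonesian (id), Malay (ms), Thai (th), Vietnamese (vi), Malagasy (mg), French (fr), Hungarian (hu), Greek (el), Czech (cs), Polish (pl), Lithuanian (lt), Latvian (lv), Georgian (ka), Chinese (zh), Japanese (ja), Korean (ko), Finnish (fi), Estonian (et), Gujarati (gu), Hindi (hi), Marathi (mr), Nepali (ne), Urdu (ur), Azerbaijani (az), Kazakh (kk), Kyrgyz (ky), Turkish (tr), Uzbek (uz), Arabic (ar), Hebrew (he), Persian (fa) |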
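The Quick Start code assigns each non-English language to one of eight groups, and English itself belongs to none of them. A minimal sketch of picking the group for a translation pair follows. `group_for_pair` is a hypothetical helper, and it assumes that a pair with English on one side uses the group of the non-English language, as the Chinese-to-English example does. This page does not say which module serves a pair where neither side is English, so the sketch rejects such pairs rather than guessing.

```python
# Grouping copied from the Quick Start code.
GROUP2LANG = {
    1: ["da", "nl", "de", "is", "no", "sv", "af"],
    2: ["ca", "ro", "gl", "it", "pt", "es"],
    3: ["bg", "mk", "sr", "uk", "ru"],
    4: ["id", "ms", "th", "vi", "mg", "fr"],
    5: ["hu", "el", "cs", "pl", "lt", "lv"],
    6: ["ka", "zh", "ja", "ko", "fi", "et"],
    7: ["gu", "hi", "mr", "ne", "ur"],
    8: ["az", "kk", "ky", "tr", "uz", "ar", "he", "fa"],
}
LANG2GROUP = {lang: group for group, langs in GROUP2LANG.items() for lang in langs}


def group_for_pair(src: str, tgt: str) -> int:
    """Return the group id for a pair with English on exactly one side (assumption)."""
    non_english = [code for code in (src, tgt) if code != "en"]
    if len(non_english) != 1:
        raise ValueError("expected exactly one non-English language in the pair")
    return LANG2GROUP[non_english[0]]


# 49 grouped languages plus English gives the 50 supported languages.
total_languages = len(LANG2GROUP) + 1
```

For example, `group_for_pair("zh", "en")` selects group 6, which is the checkpoint the Quick Start example loads.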

📚 Detailed Documentation

Citation

@misc{xu2024xalmaplugplay,
      title={X-ALMA: Plug & Play Modules and Adaptive Rejection for Quality Translation at Scale}, 
      author={Haoran Xu and Kenton Murray and Philipp Koehn and Hieu Hoang and Akiko Eriguchi and Huda Khayrallah},
      year={2024},
      eprint={2410.03115},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.03115}, 
}

Model Links

All X-ALMA checkpoints are released on Hugging Face:

| Model | Model link | Description |
| --- | --- | --- |
| X-ALMA | [haoranxu/X-ALMA](https://huggingface.co/haoranxu/X-ALMA) | X-ALMA model with all modules |
| X-ALMA-13B-Pretrain | [haoranxu/X-ALMA-13B-Pretrain](https://huggingface.co/haoranxu/X-ALMA-13B-Pretrain) | X-ALMA 13B multilingual pretrained base model |
| X-ALMA-Group1 | [haoranxu/X-ALMA-13B-Group1](https://huggingface.co/haoranxu/X-ALMA-13B-Group1) | X-ALMA group 1 language-specific module and merged model |
| X-ALMA-Group2 | [haoranxu/X-ALMA-13B-Group2](https://huggingface.co/haoranxu/X-ALMA-13B-Group2) | X-ALMA group 2 language-specific module and merged model |
| X-ALMA-Group3 | [haoranxu/X-ALMA-13B-Group3](https://huggingface.co/haoranxu/X-ALMA-13B-Group3) | X-ALMA group 3 language-specific module and merged model |
| X-ALMA-Group4 | [haoranxu/X-ALMA-13B-Group4](https://huggingface.co/haoranxu/X-ALMA-13B-Group4) | X-ALMA group 4 language-specific module and merged model |
| X-ALMA-Group5 | [haoranxu/X-ALMA-13B-Group5](https://huggingface.co/haoranxu/X-ALMA-13B-Group5) | X-ALMA group 5 language-specific module and merged model |
| X-ALMA-Group6 | [haoranxu/X-ALMA-13B-Group6](https://huggingface.co/haoranxu/X-ALMA-13B-Group6) | X-ALMA group 6 language-specific module and merged model |
| X-ALMA-Group7 | [haoranxu/X-ALMA-13B-Group7](https://huggingface.co/haoranxu/X-ALMA-13B-Group7) | X-ALMA group 7 language-specific module and merged model |
| X-ALMA-Group8 | [haoranxu/X-ALMA-13B-Group8](https://huggingface.co/haoranxu/X-ALMA-13B-Group8) | X-ALMA group 8 language-specific module and merged model |
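The group checkpoints follow a regular naming pattern, so their repository ids and page URLs can be derived from the group number. The sketch below shows one way to do that; `group_checkpoint` and `checkpoint_url` are hypothetical helpers written for illustration and are not part of any library.

```python
HF_BASE = "https://huggingface.co"


def group_checkpoint(group: int) -> str:
    """Return the repository id for a group number in the range 1 to 8."""
    if not 1 <= group <= 8:
        raise ValueError(f"group must be between 1 and 8, got {group}")
    return f"haoranxu/X-ALMA-13B-Group{group}"


def checkpoint_url(repo_id: str) -> str:
    """Return the Hugging Face page URL for a repository id."""
    return f"{HF_BASE}/{repo_id}"


repo_id = group_checkpoint(6)
url = checkpoint_url(repo_id)
```

The repository id is the string passed to `from_pretrained` in the Quick Start examples.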