Awesome-align-with-co: Open-Source Word Alignment Tool - Extract Word Alignments from Parallel Corpora and Fine-tune for Better Quality

Awesome Align With Co

Developed by aneuraz

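AWESOME-align is a tool built on multilingual BERT that extracts word alignments from parallel corpora; alignment quality can be improved further by fine-tuning.

Text Embedding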

Transformers

Supports multiple languages

License: BSD-3-Clause

#multilingual-word-alignment #BERT-fine-tuning #parallel-corpus-processing

Downloads: 3,673

Released: 4/29/2022

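Model Overview

The tool extracts word alignments from multilingual BERT (mBERT) and lets you fine-tune mBERT on parallel corpora to obtain better alignment quality. It supports word alignment between many language pairs.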

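Model Features

Multilingual support

Supports word alignment between many languages, including Chinese, English, French, German, and Romanian

Fine-tuning on top of mBERT

The multilingual BERT model can be fine-tuned on parallel corpora to improve word alignment quality

Efficient alignment extraction

Word alignments can be extracted directly from the pretrained mBERT model, with no additional training required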

Model Capabilities

Cross-lingual word alignment

Parallel corpus processing

Multilingual word embedding analysis

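Use Cases

Machine translation

Bilingual dictionary construction

Build bilingual dictionaries from word alignment results

Improve the accuracy and efficiency of dictionary construction

Natural language processing research

Cross-lingual word embedding research

Analyze how well the word embedding spaces of different languages are aligned

Provide a foundation for cross-lingual NLP tasks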
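The bilingual dictionary use case can be illustrated with a minimal sketch. It assumes you already have word-level alignments as sets of `(source_index, target_index)` pairs, one set per sentence pair, which is the shape the usage example on this page prints. The function name `build_dictionary`, the `min_count` parameter, and the toy English-French data are hypothetical and chosen only for illustration; this is one simple counting heuristic, not a method described by the authors.

```python
from collections import Counter, defaultdict


def build_dictionary(sentence_pairs, alignments, min_count=1):
    """Count aligned word pairs and keep the most frequent target per source word.

    sentence_pairs: list of (src_words, tgt_words), each a list of tokens
    alignments: list of sets of (src_index, tgt_index), one set per sentence pair
    min_count: drop entries whose best translation was seen fewer times than this
    """
    counts = defaultdict(Counter)
    for (src_words, tgt_words), links in zip(sentence_pairs, alignments):
        for i, j in links:
            counts[src_words[i].lower()][tgt_words[j].lower()] += 1

    dictionary = {}
    for src_word, tgt_counts in counts.items():
        tgt_word, n = tgt_counts.most_common(1)[0]
        if n >= min_count:
            dictionary[src_word] = tgt_word
    return dictionary


# Hypothetical toy corpus and alignments (not produced by the model).
pairs = [
    ("the cat sleeps".split(), "le chat dort".split()),
    ("the dog sleeps".split(), "le chien dort".split()),
]
links = [
    {(0, 0), (1, 1), (2, 2)},
    {(0, 0), (1, 1), (2, 2)},
]

dictionary = build_dictionary(pairs, links)
print(dictionary)
```

Raising `min_count` trades coverage for precision: with `min_count=2` only word pairs seen in both toy sentences survive.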

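🚀 AWESOME: Aligning Word Embedding Spaces of Multilingual Encoders

This model comes from the following GitHub repository: https://github.com/neulab/awesome-align.

It corresponds to this paper: https://arxiv.org/abs/2101.08231.

If you decide to use this model, please cite the original paper: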

@inproceedings{dou2021word,
  title={Word Alignment by Fine-tuning Embeddings on Parallel Corpora},
  author={Dou, Zi-Yi and Neubig, Graham},
  booktitle={Conference of the European Chapter of the Association for Computational Linguistics (EACL)},
  year={2021}
}

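awesome-align is a tool that extracts word alignments from multilingual BERT (mBERT) and lets you fine-tune mBERT on parallel corpora for better alignment quality (see our paper for more details).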

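🚀 Quick Start

The model comes from the GitHub repository linked above and accompanies the paper cited above; cite that paper when you use the model. The awesome-align tool extracts word alignments from multilingual BERT and supports fine-tuning on parallel corpora.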

✨ Key Features

Extracts word alignments from multilingual BERT (mBERT).
Supports fine-tuning mBERT on parallel corpora to improve alignment quality.
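The extraction step used by the usage example on this page can be sketched on a toy similarity matrix, without downloading the model. The sketch normalizes the matrix with a softmax in both directions (source to target, target to source) and keeps a link only when both probabilities exceed a threshold. It is written in plain Python rather than PyTorch so it runs anywhere; the helper names `softmax` and `extract_links` and the similarity values are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def extract_links(similarity, threshold=1e-3):
    """Return the (i, j) pairs whose probability exceeds threshold in both directions."""
    n_src, n_tgt = len(similarity), len(similarity[0])
    # Source-to-target: normalize each row over the target positions.
    src_to_tgt = [softmax(row) for row in similarity]
    # Target-to-source: normalize each column over the source positions.
    tgt_to_src = [
        softmax([similarity[i][j] for i in range(n_src)]) for j in range(n_tgt)
    ]
    return {
        (i, j)
        for i in range(n_src)
        for j in range(n_tgt)
        if src_to_tgt[i][j] > threshold and tgt_to_src[j][i] > threshold
    }


# Hypothetical dot products between 3 source and 3 target subword vectors.
similarity = [
    [12.0, 1.0, 0.5],
    [0.8, 11.0, 1.2],
    [0.3, 0.9, 13.0],
]

links = extract_links(similarity)
print(sorted(links))  # [(0, 0), (1, 1), (2, 2)]
```

Requiring agreement in both directions filters out one-sided matches; a lower threshold keeps more links, a higher one keeps fewer.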

💻 Usage Example

Basic Usage

from transformers import AutoModel, AutoTokenizer
import itertools
import torch

# load model
model = AutoModel.from_pretrained("aneuraz/awesome-align-with-co")
tokenizer = AutoTokenizer.from_pretrained("aneuraz/awesome-align-with-co")

# model parameters
align_layer = 8
threshold = 1e-3

# define inputs
src = 'awesome-align is awesome !'
tgt = '牛對齊 是 牛 ！'

# pre-processing: split sentences into words, then into subword tokens and ids
sent_src, sent_tgt = src.strip().split(), tgt.strip().split()
token_src = [tokenizer.tokenize(word) for word in sent_src]
token_tgt = [tokenizer.tokenize(word) for word in sent_tgt]
wid_src = [tokenizer.convert_tokens_to_ids(x) for x in token_src]
wid_tgt = [tokenizer.convert_tokens_to_ids(x) for x in token_tgt]
# add the special tokens and truncate to the model's maximum length
ids_src = tokenizer.prepare_for_model(list(itertools.chain(*wid_src)), return_tensors='pt', truncation=True, max_length=tokenizer.model_max_length)['input_ids']
ids_tgt = tokenizer.prepare_for_model(list(itertools.chain(*wid_tgt)), return_tensors='pt', truncation=True, max_length=tokenizer.model_max_length)['input_ids']
# map each subword position back to the index of the word it belongs to
sub2word_map_src = []
for i, word_list in enumerate(token_src):
  sub2word_map_src += [i for x in word_list]
sub2word_map_tgt = []
for i, word_list in enumerate(token_tgt):
  sub2word_map_tgt += [i for x in word_list]
  
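# alignment (align_layer and threshold are set under "model parameters" above)
model.eval()
with torch.no_grad():
  # hidden states of the chosen layer, dropping the first and last (special token) positions
  out_src = model(ids_src.unsqueeze(0), output_hidden_states=True)[2][align_layer][0, 1:-1]
  out_tgt = model(ids_tgt.unsqueeze(0), output_hidden_states=True)[2][align_layer][0, 1:-1]

  # similarity between every source subword and every target subword
  dot_prod = torch.matmul(out_src, out_tgt.transpose(-1, -2))

  # normalize in both directions: source-to-target and target-to-source
  softmax_srctgt = torch.nn.Softmax(dim=-1)(dot_prod)
  softmax_tgtsrc = torch.nn.Softmax(dim=-2)(dot_prod)

  # keep a link only if it passes the threshold in both directions
  softmax_inter = (softmax_srctgt > threshold)*(softmax_tgtsrc > threshold)

# collapse subword-level links into word-level links
align_subwords = torch.nonzero(softmax_inter, as_tuple=False)
align_words = set()
for i, j in align_subwords:
  align_words.add( (sub2word_map_src[i], sub2word_map_tgt[j]) )

print(align_words)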
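The printed `align_words` is a set of word-index pairs, which is hard to read at a glance. A minimal sketch of turning it into word pairs and into the `i-j` notation commonly called Pharaoh format follows. The helper name `describe_alignment` is hypothetical, and the alignment set below is an illustrative placeholder for the sentence pair used above, not actual model output.

```python
def describe_alignment(align_words, sent_src, sent_tgt):
    """Return a Pharaoh-style string and the aligned word pairs, sorted by position."""
    # int() also accepts 0-dimensional tensors, should the indices come from torch.
    ordered = sorted((int(i), int(j)) for i, j in align_words)
    pharaoh = " ".join(f"{i}-{j}" for i, j in ordered)
    word_pairs = [(sent_src[i], sent_tgt[j]) for i, j in ordered]
    return pharaoh, word_pairs


sent_src = "awesome-align is awesome !".split()
sent_tgt = "牛對齊 是 牛 ！".split()

# Hypothetical alignment, assumed for illustration only.
align_words = {(1, 1), (0, 0), (3, 3), (2, 2)}

pharaoh, word_pairs = describe_alignment(align_words, sent_src, sent_tgt)
print(pharaoh)  # 0-0 1-1 2-2 3-3
for src_word, tgt_word in word_pairs:
    print(src_word, "->", tgt_word)
```

Sorting before formatting makes the output deterministic, since Python sets have no guaranteed iteration order.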