LaBSE
LaBSE is a BERT-based model that generates sentence embeddings for 109 languages, useful for multilingual sentence embedding and bi-text retrieval.
Quick Start
LaBSE, the Language-agnostic BERT Sentence Encoder, is a BERT-based model pre-trained using a combination of masked language modeling and translation language modeling, which enables it to generate high-quality sentence embeddings for 109 languages.
Features
- Multilingual Support: Generates sentence embeddings for 109 languages (af, am, ar, and more; the full list is under Model Information).
- Useful for Multiple Tasks: Well suited to obtaining multilingual sentence embeddings and to bi-text retrieval.
- Equivalent Embeddings: The embeddings produced by the migrated PyTorch model are equivalent to those of the original TensorFlow version.
Installation
The original README provides no specific installation steps. To use this model, install the relevant Python libraries, torch and transformers, for example with:
pip install torch transformers
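As a minimal sanity check that the environment is ready (not part of the original card):

import torch
import transformers

# BertModel and BertTokenizerFast have been available in transformers for many
# releases, so any reasonably recent versions printed here should work.
print(torch.__version__, transformers.__version__)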
Usage Examples
Basic Usage
import torch
from transformers import BertModel, BertTokenizerFast

# Load the tokenizer and model, and put the model in evaluation mode.
tokenizer = BertTokenizerFast.from_pretrained("setu4993/LaBSE")
model = BertModel.from_pretrained("setu4993/LaBSE")
model = model.eval()

english_sentences = [
    "dog",
    "Puppies are nice.",
    "I enjoy taking long walks along the beach with my dog.",
]
english_inputs = tokenizer(english_sentences, return_tensors="pt", padding=True)

# Run the encoder without tracking gradients.
with torch.no_grad():
    english_outputs = model(**english_inputs)
Advanced Usage
To get the sentence embeddings, use the pooler output:
english_embeddings = english_outputs.pooler_output
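As a quick sanity check (assuming the default LaBSE configuration, whose hidden size is 768), there is one 768-dimensional vector per input sentence:

# Three input sentences, each mapped to a 768-dimensional embedding.
print(english_embeddings.shape)  # torch.Size([3, 768])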
Embeddings for other languages are obtained the same way:
italian_sentences = [
"cane",
"I cuccioli sono carini.",
"Mi piace fare lunghe passeggiate lungo la spiaggia con il mio cane.",
]
japanese_sentences = ["犬", "子犬はいいです", "私は犬と一緒にビーチを散歩するのが好きです"]
italian_inputs = tokenizer(italian_sentences, return_tensors="pt", padding=True)
japanese_inputs = tokenizer(japanese_sentences, return_tensors="pt", padding=True)
with torch.no_grad():
    italian_outputs = model(**italian_inputs)
    japanese_outputs = model(**japanese_inputs)
italian_embeddings = italian_outputs.pooler_output
japanese_embeddings = japanese_outputs.pooler_output
For similarity between sentences, an L2-norm is recommended before calculating the similarity:
import torch.nn.functional as F
def similarity(embeddings_1, embeddings_2):
    normalized_embeddings_1 = F.normalize(embeddings_1, p=2)
    normalized_embeddings_2 = F.normalize(embeddings_2, p=2)
    return torch.matmul(
        normalized_embeddings_1, normalized_embeddings_2.transpose(0, 1)
    )
print(similarity(english_embeddings, italian_embeddings))
print(similarity(english_embeddings, japanese_embeddings))
print(similarity(italian_embeddings, japanese_embeddings))
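Since the English, Italian, and Japanese examples above are mutual translations, the same similarity matrix also illustrates bi-text retrieval. The sketch below is not part of the original card; it simply picks, for each English sentence, the Italian sentence with the highest similarity score (with this example data, the best match should be each sentence's own translation):

# Retrieve, for each English sentence, the most similar Italian sentence.
scores = similarity(english_embeddings, italian_embeddings)
best_match = scores.argmax(dim=1)
for i, j in enumerate(best_match.tolist()):
    print(f"{english_sentences[i]!r} -> {italian_sentences[j]!r}")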
Documentation
This model is migrated from the v2 model on the TF Hub, which uses dict-based input. The embeddings produced by both versions of the model are equivalent.
Details about data, training, evaluation and performance metrics are available in the original paper.
License
This model is licensed under the apache-2.0 license.
BibTeX entry and citation info
@misc{feng2020languageagnostic,
    title={Language-agnostic BERT Sentence Embedding},
    author={Fangxiaoyu Feng and Yinfei Yang and Daniel Cer and Naveen Arivazhagan and Wei Wang},
    year={2020},
    eprint={2007.01852},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
Model Information
| Property | Details |
|----------|---------|
| Model Type | BERT-based model for sentence embedding |
| Training Data | CommonCrawl, Wikipedia |
| Tags | bert, sentence_embedding, multilingual, google, sentence-similarity |
| Pipeline Tag | sentence-similarity |
| Languages | af, am, ar, as, az, be, bg, bn, bo, bs, ca, ceb, co, cs, cy, da, de, el, en, eo, es, et, eu, fa, fi, fr, fy, ga, gd, gl, gu, ha, haw, he, hi, hmn, hr, ht, hu, hy, id, ig, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mi, mk, ml, mn, mr, ms, mt, my, ne, nl, no, ny, or, pa, pl, pt, ro, ru, rw, si, sk, sl, sm, sn, so, sq, sr, st, su, sv, sw, ta, te, tg, th, tk, tl, tr, tt, ug, uk, ur, uz, vi, wo, xh, yi, yo, zh, zu |