# Smaller LaBSE
Smaller LaBSE is a distilled BERT-based model for multilingual sentence similarity, supporting 15 languages.
## 🚀 Quick Start
The Smaller Language-agnostic BERT Sentence Encoder (LaBSE) is a BERT-based model, distilled from the original LaBSE model down to 15 languages (from the original 109) by Ukjae Jeong, using the distillation technique described in the paper 'Load What You Need: Smaller Versions of Multilingual BERT'.
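The reduction described in that paper is essentially a vocabulary selection: the Transformer layers are kept, while word-piece embeddings that the target languages do not need are dropped. A minimal sketch of how to observe this (assuming the full-size counterpart is also published as `setu4993/LaBSE`) is to compare the tokenizer vocabulary sizes:

```python
from transformers import BertTokenizerFast

# Compare the 15-language vocabulary with the original 109-language LaBSE vocabulary.
# Note: "setu4993/LaBSE" is assumed here to be the full-size counterpart checkpoint.
small_tokenizer = BertTokenizerFast.from_pretrained("setu4993/smaller-LaBSE")
full_tokenizer = BertTokenizerFast.from_pretrained("setu4993/LaBSE")

print("smaller-LaBSE vocabulary size:", small_tokenizer.vocab_size)
print("original LaBSE vocabulary size:", full_tokenizer.vocab_size)
```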
## ✨ Features
- **Multilingual Support**: Supports 15 languages: Arabic, German, English, Spanish, French, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Russian, Thai, Turkish, and Chinese.
- **Sentence Similarity**: Ideal for calculating sentence similarity across multiple languages.
## 💻 Usage Examples
### Basic Usage
```python
import torch
from transformers import BertModel, BertTokenizerFast

# Load the tokenizer and model, and put the model in evaluation mode.
tokenizer = BertTokenizerFast.from_pretrained("setu4993/smaller-LaBSE")
model = BertModel.from_pretrained("setu4993/smaller-LaBSE")
model = model.eval()

english_sentences = [
    "dog",
    "Puppies are nice.",
    "I enjoy taking long walks along the beach with my dog.",
]
english_inputs = tokenizer(english_sentences, return_tensors="pt", padding=True)

with torch.no_grad():
    english_outputs = model(**english_inputs)
```
### Getting Sentence Embeddings

```python
english_embeddings = english_outputs.pooler_output
```
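The `pooler_output` holds one vector per input sentence. As a quick sanity check (a sketch; the embedding dimensionality is whatever the checkpoint defines, so it is not hard-coded here):

```python
# One row per input sentence, one column per hidden dimension.
print(english_embeddings.shape)   # e.g. torch.Size([3, hidden_size])
print(model.config.hidden_size)   # the embedding dimensionality used above
```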
### Output for Other Languages
```python
italian_sentences = [
    "cane",
    "I cuccioli sono carini.",
    "Mi piace fare lunghe passeggiate lungo la spiaggia con il mio cane.",
]
japanese_sentences = ["犬", "子犬はいいです", "私は犬と一緒にビーチを散歩するのが好きです"]

italian_inputs = tokenizer(italian_sentences, return_tensors="pt", padding=True)
japanese_inputs = tokenizer(japanese_sentences, return_tensors="pt", padding=True)

with torch.no_grad():
    italian_outputs = model(**italian_inputs)
    japanese_outputs = model(**japanese_inputs)

italian_embeddings = italian_outputs.pooler_output
japanese_embeddings = japanese_outputs.pooler_output
```
### Calculating Sentence Similarity
```python
import torch.nn.functional as F


def similarity(embeddings_1, embeddings_2):
    # L2-normalize both sets of embeddings, then take dot products, so each
    # entry of the result is the cosine similarity between two sentences.
    normalized_embeddings_1 = F.normalize(embeddings_1, p=2)
    normalized_embeddings_2 = F.normalize(embeddings_2, p=2)
    return torch.matmul(
        normalized_embeddings_1, normalized_embeddings_2.transpose(0, 1)
    )


print(similarity(english_embeddings, italian_embeddings))
print(similarity(english_embeddings, japanese_embeddings))
print(similarity(italian_embeddings, japanese_embeddings))
```
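Because the embeddings are L2-normalized inside `similarity()`, each entry of the returned matrix is a cosine similarity. If you only need scores for aligned pairs (the i-th sentence of each list), one convenient alternative sketch is `F.cosine_similarity`, which corresponds to the diagonal of the matrix above:

```python
# Cosine similarity for aligned sentence pairs only (diagonal of the matrix above).
pairwise_scores = F.cosine_similarity(english_embeddings, italian_embeddings, dim=1)
print(pairwise_scores)  # one score per English/Italian pair
```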
## 📚 Documentation
Details about data, training, evaluation and performance metrics are available in the original paper.
### BibTeX entry and citation info
```bibtex
@misc{feng2020languageagnostic,
    title={Language-agnostic BERT Sentence Embedding},
    author={Fangxiaoyu Feng and Yinfei Yang and Daniel Cer and Naveen Arivazhagan and Wei Wang},
    year={2020},
    eprint={2007.01852},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
## 📄 License
This model is licensed under the Apache 2.0 license.
## Model Information

| Property | Details |
|----------|---------|
| Model Type | BERT-based Sentence Encoder |
| Training Data | CommonCrawl, Wikipedia |
| Supported Languages | ar, de, en, es, fr, it, ja, ko, nl, pl, pt, ru, th, tr, zh |
| Tags | bert, sentence_embedding, multilingual, google, sentence-similarity, labse |