LEALLA-small
LEALLA is a collection of lightweight language-agnostic sentence embedding models. It supports 109 languages and is distilled from LaBSE. This model is useful for obtaining multilingual sentence embeddings and bi-text retrieval.
Quick Start
LEALLA-small is a lightweight model for obtaining multilingual sentence embeddings. You can get started with the steps below.
Features
- Multilingual Support: Supports 109 languages, including Afrikaans, Amharic, Arabic, etc.
- Lightweight: A lightweight alternative distilled from LaBSE, useful for various NLP tasks.
- Sentence Embeddings: Ideal for obtaining multilingual sentence embeddings and bi-text retrieval.
Installation
No specific installation steps are provided in the original document.
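The usage examples below only require PyTorch and Hugging Face Transformers. As an assumption (not stated in the original document), installing them with pip should be sufficient:

    pip install torch transformers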
Usage Examples
Basic Usage
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("setu4993/LEALLA-small")
model = BertModel.from_pretrained("setu4993/LEALLA-small")
model = model.eval()

english_sentences = [
    "dog",
    "Puppies are nice.",
    "I enjoy taking long walks along the beach with my dog.",
]
english_inputs = tokenizer(english_sentences, return_tensors="pt", padding=True)

with torch.no_grad():
    english_outputs = model(**english_inputs)
Advanced Usage
To get the sentence embeddings, use the pooler output:
english_embeddings = english_outputs.pooler_output
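As a quick sanity check (not part of the original document), the pooler output is a 2-D tensor with one row per input sentence; the width is the embedding dimension of the LEALLA variant you loaded:

    # One row per input sentence; the second dimension is the model's embedding size.
    print(english_embeddings.shape)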
Output for other languages:
italian_sentences = [
    "cane",
    "I cuccioli sono carini.",
    "Mi piace fare lunghe passeggiate lungo la spiaggia con il mio cane.",
]
japanese_sentences = ["犬", "子犬はいいです", "私は犬と一緒にビーチを散歩するのが好きです"]

italian_inputs = tokenizer(italian_sentences, return_tensors="pt", padding=True)
japanese_inputs = tokenizer(japanese_sentences, return_tensors="pt", padding=True)

with torch.no_grad():
    italian_outputs = model(**italian_inputs)
    japanese_outputs = model(**japanese_inputs)

italian_embeddings = italian_outputs.pooler_output
japanese_embeddings = japanese_outputs.pooler_output
For similarity between sentences, an L2-norm is recommended before calculating the similarity:

import torch.nn.functional as F

def similarity(embeddings_1, embeddings_2):
    normalized_embeddings_1 = F.normalize(embeddings_1, p=2)
    normalized_embeddings_2 = F.normalize(embeddings_2, p=2)
    return torch.matmul(
        normalized_embeddings_1, normalized_embeddings_2.transpose(0, 1)
    )
print(similarity(english_embeddings, italian_embeddings))
print(similarity(english_embeddings, japanese_embeddings))
print(similarity(italian_embeddings, japanese_embeddings))
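Since the model targets bi-text retrieval, here is a minimal retrieval sketch (not from the original document) that reuses the similarity matrix above to align each English sentence with its closest Italian sentence; variable names are illustrative:

    # Minimal bi-text retrieval sketch: for each English sentence, pick the
    # Italian sentence with the highest cosine similarity.
    english_italian_similarity = similarity(english_embeddings, italian_embeddings)
    best_match = english_italian_similarity.argmax(dim=1)
    for i, sentence in enumerate(english_sentences):
        print(sentence, "->", italian_sentences[best_match[i].item()])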
Documentation
This model is migrated from the v1 model on TF Hub. The embeddings produced by both versions of the model are equivalent, though for some languages (such as Japanese) the LEALLA models appear to require higher tolerances when comparing embeddings and similarities.
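As an illustration of such a tolerance-based comparison (not from the original document; `tf_hub_embeddings` is a hypothetical tensor obtained from the TF Hub version of the model, and the tolerance values are arbitrary):

    # Hypothetical check that embeddings from the TF Hub model and this port agree
    # within a loose tolerance; atol/rtol are illustrative values only.
    print(torch.allclose(tf_hub_embeddings, english_embeddings, atol=1e-3, rtol=1e-3))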
Details about data, training, evaluation and performance metrics are available in the original paper.
Technical Details
The LEALLA models are lightweight language-agnostic sentence embedding models distilled from LaBSE. They support 109 languages and are useful for multilingual sentence embeddings and bi-text retrieval. The embeddings produced by the TF Hub and Transformers versions of the model are equivalent, but for some languages higher tolerances may be required when comparing embeddings and similarities.
License
The model is licensed under the Apache-2.0 license.
Information Table
Property | Details
--- | ---
Pipeline Tag | sentence-similarity
Supported Languages | af, am, ar, as, az, be, bg, bn, bo, bs, ca, ceb, co, cs, cy, da, de, el, en, eo, es, et, eu, fa, fi, fr, fy, ga, gd, gl, gu, ha, haw, he, hi, hmn, hr, ht, hu, hy, id, ig, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mi, mk, ml, mn, mr, ms, mt, my, ne, nl, no, ny, or, pa, pl, pt, ro, ru, rw, si, sk, sl, sm, sn, so, sq, sr, st, su, sv, sw, ta, te, tg, th, tk, tl, tr, tt, ug, uk, ur, uz, vi, wo, xh, yi, yo, zh, zu
Tags | bert, sentence_embedding, multilingual, google, sentence-similarity, lealla, labse
License | apache-2.0
Datasets | CommonCrawl, Wikipedia
BibTeX entry and citation info
@inproceedings{mao-nakagawa-2023-lealla,
title = "{LEALLA}: Learning Lightweight Language-agnostic Sentence Embeddings with Knowledge Distillation",
author = "Mao, Zhuoyuan and
Nakagawa, Tetsuji",
booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
month = may,
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.eacl-main.138",
doi = "10.18653/v1/2023.eacl-main.138",
pages = "1886--1894",
abstract = "Large-scale language-agnostic sentence embedding models such as LaBSE (Feng et al., 2022) obtain state-of-the-art performance for parallel sentence alignment. However, these large-scale models can suffer from inference speed and computation overhead. This study systematically explores learning language-agnostic sentence embeddings with lightweight models. We demonstrate that a thin-deep encoder can construct robust low-dimensional sentence embeddings for 109 languages. With our proposed distillation methods, we achieve further improvements by incorporating knowledge from a teacher model. Empirical results on Tatoeba, United Nations, and BUCC show the effectiveness of our lightweight models. We release our lightweight language-agnostic sentence embedding models LEALLA on TensorFlow Hub.",
}