LEALLA-base
LEALLA is a collection of lightweight language-agnostic sentence embedding models. It supports 109 languages and is distilled from LaBSE. The model is useful for obtaining multilingual sentence embeddings and for bi-text retrieval.
Quick Start
LEALLA-base produces sentence embeddings across 109 languages and can be loaded and used with the `transformers` library in Python.
Features
- Multilingual Support: Supports 109 languages, including Afrikaans (`af`), Arabic (`ar`), English (`en`), Japanese (`ja`), etc.
- Lightweight: A collection of lightweight models that reduce inference time and computation overhead compared to the full-size LaBSE teacher (see the sketch after this list).
- Distilled from LaBSE: Distilled from the well-known LaBSE model, ensuring high-quality sentence embeddings.
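To get a feel for how compact the checkpoint is, you can inspect it directly after loading. This is a minimal sketch; the printed values simply reflect whatever configuration ships with the `setu4993/LEALLA-base` checkpoint.

```python
from transformers import BertModel

# Load the checkpoint and report its size; "lightweight" here means a smaller
# encoder and lower-dimensional embeddings than the LaBSE teacher.
model = BertModel.from_pretrained("setu4993/LEALLA-base")

num_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {num_params / 1e6:.1f}M")
print(f"Embedding dimension (hidden size): {model.config.hidden_size}")
print(f"Encoder layers: {model.config.num_hidden_layers}")
```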
Installation
To use the LEALLA-base model, you need to install the `transformers` library and PyTorch. You can install both with `pip`:

```bash
pip install transformers torch
```
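If you want to confirm the environment before downloading the model, a quick import check is enough. This is a minimal sketch; the card does not pin specific library versions.

```python
import torch
import transformers

# Any reasonably recent release of both libraries should work.
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
```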
Usage Examples
Basic Usage
```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("setu4993/LEALLA-base")
model = BertModel.from_pretrained("setu4993/LEALLA-base")
model = model.eval()

english_sentences = [
    "dog",
    "Puppies are nice.",
    "I enjoy taking long walks along the beach with my dog.",
]
english_inputs = tokenizer(english_sentences, return_tensors="pt", padding=True)

with torch.no_grad():
    english_outputs = model(**english_inputs)
```
Getting Sentence Embeddings
To get the sentence embeddings, use the pooler output:

```python
english_embeddings = english_outputs.pooler_output
```
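The pooler output contains one vector per input sentence. As a quick sanity check (the second dimension equals the model's hidden size):

```python
# Expect a [3, hidden_size] tensor: one embedding per input sentence.
print(english_embeddings.shape)
```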
Output for Other Languages

```python
italian_sentences = [
    "cane",
    "I cuccioli sono carini.",
    "Mi piace fare lunghe passeggiate lungo la spiaggia con il mio cane.",
]
japanese_sentences = ["犬", "子犬はいいです", "私は犬と一緒にビーチを散歩するのが好きです"]

italian_inputs = tokenizer(italian_sentences, return_tensors="pt", padding=True)
japanese_inputs = tokenizer(japanese_sentences, return_tensors="pt", padding=True)

with torch.no_grad():
    italian_outputs = model(**italian_inputs)
    japanese_outputs = model(**japanese_inputs)

italian_embeddings = italian_outputs.pooler_output
japanese_embeddings = japanese_outputs.pooler_output
```
Calculating Similarity
For similarity between sentences, an L2-norm is recommended before calculating the similarity:

```python
import torch.nn.functional as F


def similarity(embeddings_1, embeddings_2):
    normalized_embeddings_1 = F.normalize(embeddings_1, p=2)
    normalized_embeddings_2 = F.normalize(embeddings_2, p=2)
    return torch.matmul(
        normalized_embeddings_1, normalized_embeddings_2.transpose(0, 1)
    )


print(similarity(english_embeddings, italian_embeddings))
print(similarity(english_embeddings, japanese_embeddings))
print(similarity(italian_embeddings, japanese_embeddings))
```
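Since the English, Italian, and Japanese sentences above are translations of each other, the same similarity matrix supports a toy bi-text retrieval check: each sentence should be most similar to its translation at the same index. A minimal sketch building on the `similarity` helper above:

```python
# For each English sentence, find the most similar Italian sentence.
# With parallel inputs, the best match should be at the same index.
scores = similarity(english_embeddings, italian_embeddings)
best_match = scores.argmax(dim=1)
for i, j in enumerate(best_match.tolist()):
    print(f"{english_sentences[i]!r} -> {italian_sentences[j]!r}")
```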
Documentation
This model is migrated from the v1 model on the TF Hub. The embeddings produced by both versions of the model are equivalent. However, for some languages (such as Japanese), the LEALLA models may require higher tolerances when comparing embeddings and similarities.
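If you need to compare embeddings from this PyTorch checkpoint against the TF Hub version, a tolerance-aware check along these lines can be used. This is an illustrative sketch; the tolerance value is an assumption, not taken from the original migration tests.

```python
import torch


def embeddings_match(a: torch.Tensor, b: torch.Tensor, atol: float = 1e-3) -> bool:
    # torch.allclose defaults to atol=1e-8; a looser tolerance absorbs the small
    # numerical drift seen for some languages (e.g. Japanese).
    return torch.allclose(a, b, atol=atol)
```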
Technical Details
Details about data, training, evaluation and performance metrics are available in the original paper.
BibTeX entry and citation info
```bibtex
@inproceedings{mao-nakagawa-2023-lealla,
    title = "{LEALLA}: Learning Lightweight Language-agnostic Sentence Embeddings with Knowledge Distillation",
    author = "Mao, Zhuoyuan  and
      Nakagawa, Tetsuji",
    booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.eacl-main.138",
    doi = "10.18653/v1/2023.eacl-main.138",
    pages = "1886--1894",
    abstract = "Large-scale language-agnostic sentence embedding models such as LaBSE (Feng et al., 2022) obtain state-of-the-art performance for parallel sentence alignment. However, these large-scale models can suffer from inference speed and computation overhead. This study systematically explores learning language-agnostic sentence embeddings with lightweight models. We demonstrate that a thin-deep encoder can construct robust low-dimensional sentence embeddings for 109 languages. With our proposed distillation methods, we achieve further improvements by incorporating knowledge from a teacher model. Empirical results on Tatoeba, United Nations, and BUCC show the effectiveness of our lightweight models. We release our lightweight language-agnostic sentence embedding models LEALLA on TensorFlow Hub.",
}
```
License
This project is licensed under the Apache-2.0 license.
Information Table
| Property | Details |
|----------|---------|
| Model Type | Lightweight language-agnostic sentence embedding model |
| Training Data | CommonCrawl, Wikipedia |
| Pipeline Tag | sentence-similarity |
| Supported Languages | af, am, ar, as, az, be, bg, bn, bo, bs, ca, ceb, co, cs, cy, da, de, el, en, eo, es, et, eu, fa, fi, fr, fy, ga, gd, gl, gu, ha, haw, he, hi, hmn, hr, ht, hu, hy, id, ig, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mi, mk, ml, mn, mr, ms, mt, my, ne, nl, no, ny, or, pa, pl, pt, ro, ru, rw, si, sk, sl, sm, sn, so, sq, sr, st, su, sv, sw, ta, te, tg, th, tk, tl, tr, tt, ug, uk, ur, uz, vi, wo, xh, yi, yo, zh, zu |
| Tags | bert, sentence_embedding, multilingual, google, sentence-similarity, lealla, labse |