LEALLA-large
LEALLA is a collection of lightweight language-agnostic sentence embedding models distilled from LaBSE. It supports 109 languages and is useful for obtaining multilingual sentence embeddings and for bi-text retrieval.
Quick Start
LEALLA is a collection of lightweight language-agnostic sentence embedding models supporting 109 languages. It can be used to obtain multilingual sentence embeddings and for bi-text retrieval.
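A minimal sketch of loading the model and embedding a single sentence (the API calls mirror the usage examples later in this card; the example sentence is arbitrary):

import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("setu4993/LEALLA-large")
model = BertModel.from_pretrained("setu4993/LEALLA-large").eval()

inputs = tokenizer(["How are you?"], return_tensors="pt", padding=True)
with torch.no_grad():
    embedding = model(**inputs).pooler_output  # one embedding vector per input sentence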
Features
- Supports 109 languages.
- Distilled from LaBSE.
- Useful for multilingual sentence embeddings and bi-text retrieval.
Installation
The model is loaded through the Hugging Face transformers library with PyTorch; no model-specific installation is required beyond those two packages.
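A typical setup, assuming a standard Python environment (package names inferred from the imports in the usage examples below):

pip install torch transformers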
Usage Examples
Basic Usage
import torch
from transformers import BertModel, BertTokenizerFast

# Load the tokenizer and model, then switch to inference mode.
tokenizer = BertTokenizerFast.from_pretrained("setu4993/LEALLA-large")
model = BertModel.from_pretrained("setu4993/LEALLA-large")
model = model.eval()

english_sentences = [
    "dog",
    "Puppies are nice.",
    "I enjoy taking long walks along the beach with my dog.",
]
english_inputs = tokenizer(english_sentences, return_tensors="pt", padding=True)

with torch.no_grad():
    english_outputs = model(**english_inputs)
Advanced Usage
To get the sentence embeddings, use the pooler output:
english_embeddings = english_outputs.pooler_output
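Each row of pooler_output is one sentence embedding; the LEALLA paper describes low-dimensional embeddings (reportedly 256 dimensions for LEALLA-large), which can be confirmed from the tensor shape:

# For the three English sentences above, the shape should be
# (3, embedding_dim); the exact dimension can be read off here.
print(english_embeddings.shape)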
Output for other languages:
italian_sentences = [
    "cane",
    "I cuccioli sono carini.",
    "Mi piace fare lunghe passeggiate lungo la spiaggia con il mio cane.",
]
japanese_sentences = ["犬", "子犬はいいです", "私は犬と一緒にビーチを散歩するのが好きです"]
italian_inputs = tokenizer(italian_sentences, return_tensors="pt", padding=True)
japanese_inputs = tokenizer(japanese_sentences, return_tensors="pt", padding=True)

with torch.no_grad():
    italian_outputs = model(**italian_inputs)
    japanese_outputs = model(**japanese_inputs)

italian_embeddings = italian_outputs.pooler_output
japanese_embeddings = japanese_outputs.pooler_output
For similarity between sentences, L2-normalizing the embeddings before computing the similarity is recommended:
import torch.nn.functional as F


def similarity(embeddings_1, embeddings_2):
    # L2-normalize each embedding, then compute pairwise cosine similarities.
    normalized_embeddings_1 = F.normalize(embeddings_1, p=2)
    normalized_embeddings_2 = F.normalize(embeddings_2, p=2)
    return torch.matmul(
        normalized_embeddings_1, normalized_embeddings_2.transpose(0, 1)
    )
print(similarity(english_embeddings, italian_embeddings))
print(similarity(english_embeddings, japanese_embeddings))
print(similarity(italian_embeddings, japanese_embeddings))
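As an illustration of the bi-text retrieval use case mentioned above, the pairwise similarity matrix can be turned into an alignment by picking, for each English sentence, the Italian sentence with the highest cosine similarity. This is a minimal sketch, not part of the original card:

# Align each English sentence with its most similar Italian sentence.
english_italian_similarity = similarity(english_embeddings, italian_embeddings)
best_matches = english_italian_similarity.argmax(dim=1)
for english_index, italian_index in enumerate(best_matches.tolist()):
    print(english_sentences[english_index], "->", italian_sentences[italian_index])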
Documentation
This model is migrated from the v1 model on TensorFlow Hub. The embeddings produced by both versions of the model are equivalent. However, for some languages (such as Japanese), the LEALLA models appear to require higher tolerances when comparing embeddings and similarities.
Details about data, training, evaluation and performance metrics are available in the original paper.
Technical Details
No technical details beyond those in the documentation section and the original paper are provided.
License
The model is licensed under the Apache-2.0 license.
BibTeX entry and citation info
@inproceedings{mao-nakagawa-2023-lealla,
title = "{LEALLA}: Learning Lightweight Language-agnostic Sentence Embeddings with Knowledge Distillation",
author = "Mao, Zhuoyuan and
Nakagawa, Tetsuji",
booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
month = may,
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.eacl-main.138",
doi = "10.18653/v1/2023.eacl-main.138",
pages = "1886--1894",
abstract = "Large-scale language-agnostic sentence embedding models such as LaBSE (Feng et al., 2022) obtain state-of-the-art performance for parallel sentence alignment. However, these large-scale models can suffer from inference speed and computation overhead. This study systematically explores learning language-agnostic sentence embeddings with lightweight models. We demonstrate that a thin-deep encoder can construct robust low-dimensional sentence embeddings for 109 languages. With our proposed distillation methods, we achieve further improvements by incorporating knowledge from a teacher model. Empirical results on Tatoeba, United Nations, and BUCC show the effectiveness of our lightweight models. We release our lightweight language-agnostic sentence embedding models LEALLA on TensorFlow Hub.",
}
Information Table
| Property | Details |
| --- | --- |
| Pipeline Tag | sentence-similarity |
| Languages | af, am, ar, as, az, be, bg, bn, bo, bs, ca, ceb, co, cs, cy, da, de, el, en, eo, es, et, eu, fa, fi, fr, fy, ga, gd, gl, gu, ha, haw, he, hi, hmn, hr, ht, hu, hy, id, ig, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mi, mk, ml, mn, mr, ms, mt, my, ne, nl, no, ny, or, pa, pl, pt, ro, ru, rw, si, sk, sl, sm, sn, so, sq, sr, st, su, sv, sw, ta, te, tg, th, tk, tl, tr, tt, ug, uk, ur, uz, vi, wo, xh, yi, yo, zh, zu |
| Tags | bert, sentence_embedding, multilingual, google, sentence-similarity, lealla, labse |
| License | apache-2.0 |
| Datasets | CommonCrawl, Wikipedia |
| Model | HuggingFace's model hub |
| Paper | arXiv |
| Original model | TensorFlow Hub |
| Conversion from TensorFlow to PyTorch | GitHub |