# NaSE (News-adapted Sentence Encoder)
NaSE is a news-adapted sentence encoder, initialized from the pre-trained massively multilingual sentence encoder LaBSE and specialized to the news domain. Given an input text, it outputs a vector that captures its semantic information, which is useful for sentence similarity, information retrieval, or clustering tasks.
## Quick Start
Here is how to use this model to get the sentence embeddings of a given text in PyTorch:
```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('aiana94/NaSE')
model = BertModel.from_pretrained('aiana94/NaSE')

sentences = ["This is an example sentence", "Dies ist auch ein Beispielsatz in einer anderen Sprache."]
encoded_input = tokenizer(sentences, return_tensors='pt', padding=True)

with torch.no_grad():
    output = model(**encoded_input)

# the pooler output serves as the sentence embedding
sentence_embeddings = output.pooler_output
```
And in TensorFlow:
```python
from transformers import TFBertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('aiana94/NaSE')
model = TFBertModel.from_pretrained('aiana94/NaSE')

sentences = ["This is an example sentence", "Dies ist auch ein Beispielsatz in einer anderen Sprache."]
encoded_input = tokenizer(sentences, return_tensors='tf', padding=True)

output = model(**encoded_input)
sentence_embeddings = output.pooler_output
```
For similarity between sentences, an L2-norm is recommended before calculating the similarity:
```python
import torch
import torch.nn.functional as F

def cos_sim(a: torch.Tensor, b: torch.Tensor):
    # L2-normalize both batches, then take the dot product (cosine similarity)
    a_norm = F.normalize(a, p=2, dim=1)
    b_norm = F.normalize(b, p=2, dim=1)
    return torch.mm(a_norm, b_norm.transpose(0, 1))
```
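For instance, reusing `sentence_embeddings` from the PyTorch snippet above (a minimal illustration, not part of the original card):

```python
# Pairwise similarities between the two example sentences (shape: [2, 2]).
# The diagonal is 1.0, since each sentence is identical to itself.
similarities = cos_sim(sentence_embeddings, sentence_embeddings)
print(similarities)
```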
## Features
- Domain-Adapted: Specialized for the news domain, starting from the pre-trained LaBSE model.
- Multilingual: Supports a wide range of languages, including af, am, ar, etc.
- Useful for Multiple Tasks: Can be used for sentence similarity, information retrieval, or clustering tasks.
## Installation
The snippets above only require the Hugging Face `transformers` library, together with a PyTorch or TensorFlow backend.
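A standard pip installation suffices (versions are not pinned by the original card):

```bash
pip install transformers torch        # for the PyTorch snippet
pip install transformers tensorflow   # for the TensorFlow snippet
```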
## Documentation
### Model Details
#### Model Description
NaSE is a domain-adapted multilingual sentence encoder, initialized from [LaBSE](https://www.kaggle.com/models/google/labse/tensorFlow2/labse/1?tfhub-redirect=true). It was specialized to the news domain using two multilingual corpora, namely PolyNews and [PolyNewsParallel](https://huggingface.co/datasets/aiana94/polynews-parallel). More specifically, NaSE was pretrained with two objectives: denoising auto-encoding and sequence-to-sequence machine translation.
### Intended Uses
Our model is intended to be used as a sentence encoder and, in particular, a news encoder. Given an input text, it outputs a vector that captures its semantic information. The sentence vector may be used for sentence similarity, information retrieval, or clustering tasks.
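As an illustration of the retrieval use case, here is a minimal sketch that ranks a few news snippets against a query by cosine similarity (the query and corpus texts are made-up examples, not from the original card):

```python
import torch
import torch.nn.functional as F
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('aiana94/NaSE')
model = BertModel.from_pretrained('aiana94/NaSE')

def embed(texts):
    # Encode a batch of texts into L2-normalized sentence embeddings.
    encoded = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        embeddings = model(**encoded).pooler_output
    return F.normalize(embeddings, p=2, dim=1)

query = ["Central bank raises interest rates"]  # made-up query
corpus = [                                      # made-up corpus
    "The Federal Reserve announced a rate hike on Wednesday.",
    "Die Zentralbank hat die Zinsen erhöht.",
    "The local football team won its third match in a row.",
]

scores = embed(query) @ embed(corpus).T               # cosine similarities, shape [1, 3]
ranking = scores.squeeze(0).argsort(descending=True)  # most similar corpus texts first
print(ranking)
```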
### Training Details
#### Training Data
NaSE was domain-adapted using two multilingual datasets: PolyNews and the parallel [PolyNewsParallel](https://huggingface.co/datasets/aiana94/polynews-parallel).
We use the following procedure to smooth the per-language distribution when sampling for model training (see the sketch below):
- We sample only languages and language pairs that contain at least 100 texts in PolyNews and PolyNewsParallel, respectively.
- We sample texts from language L by sampling from the modified distribution p(L) ∝ |L|^α, where |L| is the number of examples in language L. We use a smoothing rate α = 0.3 (i.e., we upsample low-resource languages and downsample high-resource languages).
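A minimal sketch of this smoothed sampling (the per-language corpus sizes are made up for illustration):

```python
import numpy as np

ALPHA = 0.3  # smoothing rate from the card

# Hypothetical corpus sizes (number of texts per language), for illustration only.
corpus_sizes = {"en": 1_000_000, "de": 200_000, "sw": 1_000}

langs = list(corpus_sizes)
sizes = np.array([corpus_sizes[lang] for lang in langs], dtype=np.float64)

# p(L) ∝ |L|^alpha: exponentiating with alpha < 1 flattens the raw distribution,
# upsampling low-resource languages and downsampling high-resource ones.
weights = sizes ** ALPHA
probs = weights / weights.sum()

sampled_lang = np.random.choice(langs, p=probs)
print(dict(zip(langs, probs.round(3))), sampled_lang)
```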
#### Training Procedure
We initialize NaSE with the pretrained weights of the multilingual sentence encoder [LaBSE](https://huggingface.co/sentence-transformers/LaBSE). Please refer to its [model card](https://www.kaggle.com/models/google/labse/tensorFlow2/labse/1?tfhub-redirect=true) or the corresponding [paper](https://aclanthology.org/2022.acl-long.62.pdf) for more detailed information about the pre-training procedure.
We adapt the multilingual sentence encoder to the news domain using two objectives:
- Denoising auto-encoding (DAE): reconstructs the original input sentence from its corrupted version obtained by adding discrete noise (see [TSDAE](https://aclanthology.org/2021.findings-emnlp.59.pdf) for details).
- Machine translation (MT): generates the target-language translation from the source-language input sentence (i.e., the source sentence constitutes the corruption of the target sentence x in the target language, which is to be reconstructed).
NaSE is trained sequentially, first on reconstruction, and then on translation, i.e., we continue training the NaSE encoder obtained with the DAE objective for translation on parallel data.
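To make the DAE corruption concrete, here is a minimal sketch of TSDAE-style discrete noise via random token deletion (the deletion ratio and whitespace tokenization are illustrative assumptions; see the TSDAE paper for the exact setup):

```python
import random

def corrupt(tokens: list[str], deletion_ratio: float = 0.6) -> list[str]:
    # TSDAE-style discrete noise: randomly delete a fraction of the input tokens.
    # The model is then trained to reconstruct the original sentence from this input.
    kept = [t for t in tokens if random.random() > deletion_ratio]
    return kept if kept else [random.choice(tokens)]  # keep at least one token

original = "the central bank raised interest rates on wednesday".split()
noisy = corrupt(original)
# Training pair for the DAE objective: (noisy input -> original sentence)
print(noisy, "->", original)
```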
#### Training Hyperparameters
- Training regime: fp16 mixed precision
- Training steps: 100k (50k per objective), validating every 5k steps
- Learning rate: 3e-5
- Optimizer: AdamW
The full training script is accessible in the training code.
## Technical Details
The model was pretrained on a single 40GB NVIDIA A100 GPU for a total of 100k steps.
## License
The model is licensed under the Apache-2.0 license.
## Citation
BibTeX:
```bibtex
@misc{iana2024news,
  title={News Without Borders: Domain Adaptation of Multilingual Sentence Embeddings for Cross-lingual News Recommendation},
  author={Andreea Iana and Fabian David Schmidt and Goran Glavaš and Heiko Paulheim},
  year={2024},
  eprint={2406.12634},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2406.12634}
}
```
## Information Table

| Property | Details |
|----------|---------|
| Model Type | News-adapted multilingual sentence encoder |
| Training Data | PolyNews, [PolyNewsParallel](https://huggingface.co/datasets/aiana94/polynews-parallel) |
| License | Apache-2.0 |
| Pipeline Tag | sentence-similarity |
| Tags | bert, feature-extraction, sentence-embedding, sentence-similarity, multilingual |