# NaSE (News-adapted Sentence Encoder)
NaSE is a news-adapted sentence encoder, initialized from the pre-trained massively multilingual sentence encoder LaBSE and specialized to the news domain. Given an input text, it outputs a vector that captures its semantic information, which is useful for sentence similarity, information retrieval, or clustering tasks.
## Quick Start
Here is how to use this model to get the sentence embeddings of a given text in PyTorch:
```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('aiana94/NaSE')
model = BertModel.from_pretrained('aiana94/NaSE')

sentences = ["This is an example sentence", "Dies ist auch ein Beispielsatz in einer anderen Sprache."]
encoded_input = tokenizer(sentences, return_tensors='pt', padding=True)

with torch.no_grad():
    output = model(**encoded_input)

# the pooler output serves as the sentence embedding
sentence_embeddings = output.pooler_output
```
And in TensorFlow:
```python
from transformers import TFBertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('aiana94/NaSE')
model = TFBertModel.from_pretrained('aiana94/NaSE')

sentences = ["This is an example sentence", "Dies ist auch ein Beispielsatz in einer anderen Sprache."]
encoded_input = tokenizer(sentences, return_tensors='tf', padding=True)

output = model(**encoded_input)
sentence_embeddings = output.pooler_output
```
For similarity between sentences, an L2-norm is recommended before calculating the similarity:
```python
import torch
import torch.nn.functional as F

def cos_sim(a: torch.Tensor, b: torch.Tensor):
    # L2-normalize both batches, then take the dot product (cosine similarity)
    a_norm = F.normalize(a, p=2, dim=1)
    b_norm = F.normalize(b, p=2, dim=1)
    return torch.mm(a_norm, b_norm.transpose(0, 1))
```
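For instance, reusing `sentence_embeddings` from the PyTorch snippet above (a minimal illustration, not part of the original card):

```python
# Pairwise similarities between the two example sentences (shape: [2, 2]).
# The diagonal is 1.0, since each sentence is identical to itself.
similarities = cos_sim(sentence_embeddings, sentence_embeddings)
print(similarities)
```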
## Features
- Domain-Adapted: Specialized for the news domain, starting from the pre-trained LaBSE model.
- Multilingual: Supports a wide range of languages, including af, am, ar, etc.
- Useful for Multiple Tasks: Can be used for sentence similarity, information retrieval, or clustering tasks.
## Installation
The snippets above only require the Hugging Face `transformers` library, together with a PyTorch or TensorFlow backend.
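A standard pip installation suffices (versions are not pinned by the original card):

```bash
pip install transformers torch        # for the PyTorch snippet
pip install transformers tensorflow   # for the TensorFlow snippet
```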
## Documentation
### Model Details
#### Model Description
NaSE is a domain-adapted multilingual sentence encoder, initialized from [LaBSE](https://www.kaggle.com/models/google/labse/tensorFlow2/labse/1?tfhub-redirect=true). It was specialized to the news domain using two multilingual corpora, namely PolyNews and [PolyNewsParallel](https://huggingface.co/datasets/aiana94/polynews-parallel). More specifically, NaSE was pretrained with two objectives: denoising auto-encoding and sequence-to-sequence machine translation.
### Intended Uses
Our model is intended to be used as a sentence encoder and, in particular, a news encoder. Given an input text, it outputs a vector that captures its semantic information. The sentence vector may be used for sentence similarity, information retrieval, or clustering tasks.
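As an illustration of the retrieval use case, here is a minimal sketch that ranks a few news snippets against a query by cosine similarity (the query and corpus texts are made-up examples, not from the original card):

```python
import torch
import torch.nn.functional as F
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('aiana94/NaSE')
model = BertModel.from_pretrained('aiana94/NaSE')

def embed(texts):
    # Encode a batch of texts into L2-normalized sentence embeddings.
    encoded = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        embeddings = model(**encoded).pooler_output
    return F.normalize(embeddings, p=2, dim=1)

query = ["Central bank raises interest rates"]  # made-up query
corpus = [                                      # made-up corpus
    "The Federal Reserve announced a rate hike on Wednesday.",
    "Die Zentralbank hat die Zinsen erhöht.",
    "The local football team won its third match in a row.",
]

scores = embed(query) @ embed(corpus).T               # cosine similarities, shape [1, 3]
ranking = scores.squeeze(0).argsort(descending=True)  # most similar corpus texts first
print(ranking)
```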
### Training Details
#### Training Data
NaSE was domain-adapted using two multilingual datasets: PolyNews and the parallel [PolyNewsParallel](https://huggingface.co/datasets/aiana94/polynews-parallel).
We use the following procedure to smooth the per-language distribution when sampling for model training (see the sketch below):
- We sample only languages and language pairs that contain at least 100 texts in PolyNews and PolyNewsParallel, respectively.
- We sample texts from language L by sampling from the modified distribution p(L) ∝ |L|^α, where |L| is the number of examples in language L. We use a smoothing rate α = 0.3 (i.e., we upsample low-resource languages and downsample high-resource languages).
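A minimal sketch of this smoothed sampling (the per-language corpus sizes are made up for illustration):

```python
import numpy as np

ALPHA = 0.3  # smoothing rate from the card

# Hypothetical corpus sizes (number of texts per language), for illustration only.
corpus_sizes = {"en": 1_000_000, "de": 200_000, "sw": 1_000}

langs = list(corpus_sizes)
sizes = np.array([corpus_sizes[lang] for lang in langs], dtype=np.float64)

# p(L) ∝ |L|^alpha: exponentiating with alpha < 1 flattens the raw distribution,
# upsampling low-resource languages and downsampling high-resource ones.
weights = sizes ** ALPHA
probs = weights / weights.sum()

sampled_lang = np.random.choice(langs, p=probs)
print(dict(zip(langs, probs.round(3))), sampled_lang)
```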
#### Training Procedure
We initialize NaSE with the pretrained weights of the multilingual sentence encoder [LaBSE](https://huggingface.co/sentence-transformers/LaBSE). Please refer to its [model card](https://www.kaggle.com/models/google/labse/tensorFlow2/labse/1?tfhub-redirect=true) or the corresponding [paper](https://aclanthology.org/2022.acl-long.62.pdf) for more detailed information about the pre-training procedure.
We adapt the multilingual sentence encoder to the news domain using two objectives:
- Denoising auto-encoding (DAE): reconstructs the original input sentence from its corrupted version obtained by adding discrete noise (see [TSDAE](https://aclanthology.org/2021.findings-emnlp.59.pdf) for details).
- Machine translation (MT): generates the target-language translation from the source-language input sentence (i.e., the source sentence constitutes the corruption of the target sentence x in the target language, which is to be reconstructed).
NaSE is trained sequentially, first on reconstruction, and then on translation, i.e., we continue training the NaSE encoder obtained with the DAE objective for translation on parallel data.
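To make the DAE corruption concrete, here is a minimal sketch of TSDAE-style discrete noise via random token deletion (the deletion ratio and whitespace tokenization are illustrative assumptions; see the TSDAE paper for the exact setup):

```python
import random

def corrupt(tokens: list[str], deletion_ratio: float = 0.6) -> list[str]:
    # TSDAE-style discrete noise: randomly delete a fraction of the input tokens.
    # The model is then trained to reconstruct the original sentence from this input.
    kept = [t for t in tokens if random.random() > deletion_ratio]
    return kept if kept else [random.choice(tokens)]  # keep at least one token

original = "the central bank raised interest rates on wednesday".split()
noisy = corrupt(original)
# Training pair for the DAE objective: (noisy input -> original sentence)
print(noisy, "->", original)
```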
#### Training Hyperparameters
- Training regime: fp16 mixed precision
- Training steps: 100k (50k per objective), validating every 5k steps
- Learning rate: 3e-5
- Optimizer: AdamW
The full training script is accessible in the training code.
## Technical Details
The model was pretrained on a single 40GB NVIDIA A100 GPU for a total of 100k steps.
## License
The model is licensed under the Apache-2.0 license.
## Citation
BibTeX:
```bibtex
@misc{iana2024news,
  title={News Without Borders: Domain Adaptation of Multilingual Sentence Embeddings for Cross-lingual News Recommendation},
  author={Andreea Iana and Fabian David Schmidt and Goran Glavaš and Heiko Paulheim},
  year={2024},
  eprint={2406.12634},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2406.12634}
}
```
## Information Table

| Property | Details |
|----------|---------|
| Model Type | News-adapted multilingual sentence encoder |
| Training Data | PolyNews, [PolyNewsParallel](https://huggingface.co/datasets/aiana94/polynews-parallel) |
| License | Apache-2.0 |
| Pipeline Tag | sentence-similarity |
| Tags | bert, feature-extraction, sentence-embedding, sentence-similarity, multilingual |