bertin-roberta-base-finetuning-esnli Open-source Model - Optimized for Spanish Natural Language Inference Tasks


Bertin Roberta Base Finetuning Esnli

Developed by somosnlp-hackathon-2022

Spanish sentence embedding model based on BERTIN RoBERTa, optimized for natural language inference tasks

Text Embedding

PyTorch

Spanish

#Spanish Sentence Embedding #Semantic Similarity Calculation #Natural Language Inference

Downloads 103

Release Time: 3/28/2022

Model Overview

This model maps Spanish sentences and paragraphs into a 768-dimensional dense vector space, suitable for tasks such as sentence similarity calculation, semantic search, and text clustering.

Model Features

Spanish Language Optimization

Fine-tuned specifically for Spanish text, excelling in Spanish NLI tasks

High-Performance Sentence Embedding

10-16% relative improvement on semantic textual similarity metrics compared with a BETO-based model (see Evaluation Results)

Data Augmentation Training

Uses adversarial sample augmentation to enhance model robustness

Model Capabilities

Sentence vectorization

Semantic similarity calculation

Text clustering

Natural language inference

Use Cases

Text Analysis

Semantic Search

Building a Spanish semantic search engine

Accurately matches documents with similar query intent

Text Deduplication

Identifying semantically similar Spanish documents

Effectively reduces redundant content

Dialogue Systems

Intent Recognition

Determining similarity between user queries and predefined intents

Improves dialogue system understanding accuracy

🚀 bertin-roberta-base-finetuning-esnli

This is a sentence-transformers model trained on a collection of NLI tasks for Spanish. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search. It is based on the siamese networks approach from this paper.

📋 Model Information

Property	Details
Model Type	Sentence-transformers model for Spanish NLI tasks
Training Data	ESXNLI (Spanish part), SNLI (automatically translated), MultiNLI (automatically translated). Whole dataset available here

You can see a demo for this model here.

You can find our other model, paraphrase-spanish-distilroberta, here and its demo here.

🚀 Quick Start

📦 Installation

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

💻 Usage Examples

🔍 Basic Usage

from sentence_transformers import SentenceTransformer
sentences = ["Este es un ejemplo", "Cada oración es transformada"]

model = SentenceTransformer('hackathon-pln-es/bertin-roberta-base-finetuning-esnli')
embeddings = model.encode(sentences)
print(embeddings)
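The embeddings returned by model.encode can be compared with cosine similarity, which is the usual basis for semantic search. The following is a minimal sketch in plain Python and is not part of the original model card: the helper names cosine_similarity and rank_by_similarity are hypothetical, and the short toy vectors are stand-ins for real 768-dimensional embeddings so that the snippet runs without downloading the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors (assumption: stand-ins for model.encode output).
query = [1.0, 0.0, 1.0]
documents = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]

ranking = rank_by_similarity(query, documents)
print([index for index, _ in ranking])  # [1, 2, 0]
```

With the real model, the query and document vectors would come from model.encode. For larger collections, the sentence-transformers helper util.cos_sim computes the same similarity on whole batches.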

🌟 Advanced Usage

Without sentence-transformers, you can use the model like this: first, pass your input through the transformer model, then apply the appropriate pooling operation on top of the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['Este es un ejemplo', 'Cada oración es transformada']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('hackathon-pln-es/bertin-roberta-base-finetuning-esnli')
model = AutoModel.from_pretrained('hackathon-pln-es/bertin-roberta-base-finetuning-esnli')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
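To make the pooling step concrete, here is a small plain-Python sketch of the same masked average on toy numbers. It is an illustration only: the function name masked_mean and the two-dimensional vectors are assumptions made for this example, whereas the real model averages 768-dimensional token embeddings with PyTorch as shown above.

```python
def masked_mean(token_vectors, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            for j in range(dim):
                totals[j] += vec[j]
    # Same guard against division by zero as torch.clamp(..., min=1e-9).
    count = max(sum(attention_mask), 1e-9)
    return [t / count for t in totals]


# One sentence with three token slots; the last slot is padding (mask 0).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

print(masked_mean(tokens, mask))  # [2.0, 3.0]
```

The padded slot is ignored, so its large values do not distort the sentence vector. This is why the attention mask is passed to the pooling function when sentences in a batch are padded to the same length.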

📊 Evaluation Results

Our model was evaluated on the task of Semantic Textual Similarity using the SemEval-2015 Task for Spanish.

Metric	BETO STS	BERTIN STS (this model)	Relative improvement (%)
cosine_pearson	0.609803	0.683188	+12.03
cosine_spearman	0.528776	0.615916	+16.48
euclidean_pearson	0.590613	0.672601	+13.88
euclidean_spearman	0.526529	0.611539	+16.15
manhattan_pearson	0.589108	0.672040	+14.08
manhattan_spearman	0.525910	0.610517	+16.09
dot_pearson	0.544078	0.600517	+10.37
dot_spearman	0.460427	0.521260	+13.21
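The relative improvement column appears to be the percentage change of the BERTIN score over the BETO score. The short sketch below recomputes two rows from the table under that assumption; the function name relative_improvement is hypothetical.

```python
def relative_improvement(baseline, new):
    """Percentage change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100.0


# (BETO STS, BERTIN STS) pairs copied from the table above.
scores = {
    "cosine_pearson": (0.609803, 0.683188),
    "cosine_spearman": (0.528776, 0.615916),
}

for metric, (beto, bertin) in scores.items():
    print(f"{metric}: {relative_improvement(beto, bertin):+.2f}%")
# cosine_pearson: +12.03%
# cosine_spearman: +16.48%
```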

🔧 Technical Details

🏋️‍ Training

The model was trained with the following parameters:

Dataset: We used a collection of Natural Language Inference datasets as training data: ESXNLI (Spanish part only), SNLI (automatically translated), and MultiNLI (automatically translated). The whole dataset is available here.
DataLoader: sentence_transformers.datasets.NoDuplicatesDataLoader.NoDuplicatesDataLoader of length 1818 with parameters {'batch_size': 64}.
Loss: sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss with parameters {'scale': 20.0, 'similarity_fct': 'cos_sim'}.
Fit()-Method Parameters:

{
    "epochs": 10,
    "evaluation_steps": 0,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'transformers.optimization.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 909,
    "weight_decay": 0.01
}
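The loss named above treats each (anchor, positive) pair in a batch as the correct match and every other positive in the same batch as a negative. The following plain-Python sketch illustrates that idea with the listed settings (cosine similarity, scale 20.0). It is a simplified illustration on toy vectors, not the sentence-transformers implementation, and the function names are chosen for this example.

```python
import math


def cos_sim(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]
    and all other positives in the batch serve as negatives."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over this row of logits.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(value - peak) for value in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy 2-dimensional "embeddings" (assumption: stand-ins for real model output).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # each positive points the same way as its anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]  # each positive matches the other anchor instead

loss_aligned = multiple_negatives_ranking_loss(anchors, aligned)
loss_swapped = multiple_negatives_ranking_loss(anchors, swapped)
print(loss_aligned, loss_swapped)
```

The aligned batch yields a loss close to zero, while the swapped batch yields a large loss. Because other items in the batch act as negatives, a batch containing the same text twice would create a false negative, which is the situation NoDuplicatesDataLoader is meant to avoid.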

📐 Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 514, 'do_lower_case': False}) with Transformer model: RobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

👥 Authors
