Open-source sbert-roberta-large-anli-mnli-snli model for sentence similarity tasks

Sbert Roberta Large Anli Mnli Snli

Developed by usc-isi

A sentence-transformers model based on RoBERTa-large, designed for sentence similarity tasks and trained on the ANLI, MNLI, and SNLI datasets

Text Embedding

Transformers

English #Sentence Semantic Embedding #NLI Task Optimization #Multi-dataset Training

Downloads 38

Release Time: 3/2/2022

Model Overview

This model maps sentences and paragraphs into a 768-dimensional vector space, making it suitable for natural language processing tasks such as semantic search and clustering

Model Features

High-Quality Sentence Embeddings

Generates high-quality sentence embeddings based on the RoBERTa-large architecture

Multi-dataset Training

Jointly trained on three authoritative natural language inference datasets: ANLI, MNLI, and SNLI

Efficient Pooling Strategy

Uses mean pooling over token embeddings to produce a single vector per sentence

Model Capabilities

Sentence Vectorization

Semantic Similarity Calculation

Text Clustering

Semantic Search

Use Cases

Information Retrieval

Semantic Search System

Build a search system based on semantics rather than keywords

Improves the relevance of search results
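
As a rough illustration of the semantic search use case, the sketch below ranks a small corpus by cosine similarity to a query. The 3-dimensional vectors are toy stand-ins for real sentence embeddings, and `semantic_search` is a hypothetical helper written for this example, not a function of the model or of any library.

```python
import numpy as np


def semantic_search(query_vec, corpus_vecs, top_k=2):
    """Return (index, cosine score) pairs for the top_k most similar corpus rows."""
    # Normalize so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q
    # Sort by descending score and keep the best top_k entries.
    best = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in best]


# Toy vectors standing in for embeddings of three corpus sentences (assumption).
corpus = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
])
query = np.array([0.2, 1.0, 0.0])

hits = semantic_search(query, corpus, top_k=2)
print(hits)  # corpus row 2 ranks first, then row 1
```

In practice the corpus vectors would be computed once and stored, so that only the query needs to be embedded at search time.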

Text Analysis

Document Clustering

Automatically group semantically similar documents

Enables unsupervised document organization
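
One simple way to group documents by their embeddings is a greedy threshold scheme, sketched below. This is only an illustration under stated assumptions: the vectors are toy values rather than model output, `greedy_cluster` and the `threshold` value are hypothetical, and a production system would more likely use an established clustering algorithm.

```python
import numpy as np


def greedy_cluster(vectors, threshold=0.8):
    """Assign each vector to the first cluster whose centroid is similar enough."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sums = []    # running sum of member vectors, one entry per cluster
    labels = []
    for v in unit:
        for cluster_id, total in enumerate(sums):
            centroid = total / np.linalg.norm(total)
            if float(centroid @ v) >= threshold:
                sums[cluster_id] = total + v
                labels.append(cluster_id)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            sums.append(v.copy())
            labels.append(len(sums) - 1)
    return labels


# Toy vectors standing in for document embeddings (assumption).
docs = np.array([
    [1.0, 0.1, 0.0],
    [0.0, 1.0, 0.1],
    [0.9, 0.2, 0.0],
    [0.1, 0.9, 0.0],
])

labels = greedy_cluster(docs, threshold=0.8)
print(labels)  # [0, 1, 0, 1]
```

The result depends on the order of the input and on the threshold, so both are worth tuning on real data.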

Natural Language Understanding

Sentence Similarity Calculation

Calculate the semantic similarity between two sentences

Can be used in applications like question-answering systems and paraphrase detection

🚀 sbert-roberta-large-anli-mnli-snli

This is a sentence-transformers model. It maps sentences and paragraphs to a 768-dimensional dense vector space, which can be used for tasks such as clustering or semantic search.

🚀 Quick Start

Install sentence-transformers, load the model by name, and call encode to obtain one 768-dimensional dense vector per sentence or paragraph. The resulting vectors can be used for clustering and semantic search.

✨ Features

Vector Mapping: Converts sentences and paragraphs into 768-dimensional dense vectors.
Versatile Applications: Ideal for clustering and semantic search.
Robust Training: Initialized with RoBERTa-large weights and trained on ANLI, MNLI, and SNLI.

📦 Installation

Using this model is straightforward once sentence-transformers is installed:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer("usc-isi/sbert-roberta-large-anli-mnli-snli")
embeddings = model.encode(sentences)
print(embeddings)
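
Once sentences are encoded, they are usually compared with cosine similarity. The following is a minimal sketch, assuming the embeddings are available as a 2-D NumPy array (the default output type of encode, as far as we know). The array `toy_embeddings` holds made-up 3-dimensional values in place of real model output, and `cosine_similarity_matrix` is a hypothetical helper name.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


# Toy stand-ins for sentence embeddings (assumption, not real model output).
toy_embeddings = np.array([
    [1.0, 2.0, 2.0],
    [2.0, 4.0, 4.0],   # same direction as row 0
    [2.0, -1.0, 0.0],  # orthogonal to row 0
])

sims = cosine_similarity_matrix(toy_embeddings, toy_embeddings)
print(np.round(sims, 3))
```

Rows pointing in the same direction score 1.0 regardless of their length, and orthogonal rows score 0.0, which is why cosine similarity is a common choice for comparing sentence vectors.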

Advanced Usage

Without sentence-transformers, you can use the model as follows: first pass your input through the transformer model, then apply the appropriate pooling operation (here, mean pooling) on top of the contextualized word embeddings.

import torch
from transformers import AutoModel, AutoTokenizer


# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ["This is an example sentence", "Each sentence is converted"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("usc-isi/sbert-roberta-large-anli-mnli-snli")
model = AutoModel.from_pretrained("usc-isi/sbert-roberta-large-anli-mnli-snli")

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])

print("Sentence embeddings:")
print(sentence_embeddings)
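
To see why the attention mask matters in the pooling step above, here is a small NumPy sketch of the same masked average on hand-made numbers. The function `mean_pooling_np` and the toy tensors are assumptions made for illustration and are not part of the model's code.

```python
import numpy as np


def mean_pooling_np(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring positions where the mask is 0."""
    mask = attention_mask[..., np.newaxis].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Clip the token count to avoid division by zero for an all-padding row.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


# Shape (batch=2, tokens=3, dim=2); toy values only.
tokens = np.array([
    [[1.0, 1.0], [3.0, 5.0], [100.0, 100.0]],  # third position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
])
mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])

pooled = mean_pooling_np(tokens, mask)
print(pooled)  # rows: [2, 3] and [4, 2]
```

Without the mask, the padded position in the first sentence would pull its average toward [100, 100], so shorter sentences in a padded batch would get distorted embeddings.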

📚 Documentation

Evaluation Results

See section 4.1 of our paper for evaluation results.

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: RobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

🔧 Technical Details

The model's weights are initialized from RoBERTa-large, and the model is then trained on ANLI (Nie et al., 2020), MNLI (Williams et al., 2018), and SNLI (Bowman et al., 2015) using the training_nli.py example script.

Training Details:

Learning rate: 2e-5
Batch size: 8
Pooling: Mean
Training time: ~20 hours on one NVIDIA GeForce RTX 2080 Ti

📄 License

No license information provided in the original document.

📋 Information Table

Property	Details
Model Type	Sentence-transformers model for sentence similarity
Training Data	ANLI, multi_nli, snli

📖 Citing & Authors

For more information about the project, see our paper:

Ciosici, Manuel, et al. "Machine-Assisted Script Curation." Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations, Association for Computational Linguistics, 2021, pp. 8–17. ACLWeb, https://www.aclweb.org/anthology/2021.naacl-demos.2.

📚 References

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.
Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4885–4901, Online. Association for Computational Linguistics.
Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.
