Open-source sbert-roberta-large-anli-mnli-snli model - accurate sentence similarity comparison

Sbert Roberta Large Anli Mnli Snli

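Developed by usc-isi

A sentence-transformers model based on RoBERTa-large, designed for sentence similarity tasks and trained on the ANLI, MNLI, and SNLI datasets

Text embedding

Transformers

English #Sentence semantic embeddings #NLI task optimization #Multi-dataset training

Downloads: 38

Released: 3/2/2022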

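Model overview

The model maps sentences and paragraphs to a 768-dimensional vector space and is suited to natural language processing tasks such as semantic search and clustering.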

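Model features

High-quality sentence embeddings

Built on the RoBERTa-large architecture to produce high-quality sentence embeddings

Multi-dataset training

Trained jointly on three established natural language inference datasets: ANLI, MNLI, and SNLI

Efficient pooling strategy

Uses mean pooling to aggregate token embeddings into a single sentence vector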

Model capabilities

Sentence vectorization

Semantic similarity computation

Text clustering

Semantic search

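Use cases

Information retrieval

Semantic search systems

Build search systems that match on meaning rather than keywords

Improve the relevance of search results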
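
As a rough illustration of semantic search over sentence embeddings, the sketch below ranks a small corpus against a query by cosine similarity. The vectors are toy 3-dimensional stand-ins for real model output, and `normalize` and `semantic_search` are hypothetical helper names, not part of any library.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def semantic_search(query_vec, corpus_vecs, top_k=2):
    """Return (corpus index, score) pairs for the top_k most similar vectors."""
    q = normalize(query_vec)
    scored = []
    for i, vec in enumerate(corpus_vecs):
        score = sum(a * b for a, b in zip(q, normalize(vec)))
        scored.append((i, score))
    # Highest cosine similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for embeddings of three corpus sentences (assumption).
corpus = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
query = [2.0, 0.0, 0.0]

hits = semantic_search(query, corpus, top_k=2)
print(hits)
```

In practice the query and corpus vectors would come from encoding text with the model; the ranking logic stays the same.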

Text analysis

Document clustering

Automatically group semantically similar documents

Organize documents without supervision
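
One simple way to group documents by embedding similarity is a greedy, single-pass threshold clustering, sketched below. This is only one possible approach (k-means or agglomerative clustering are common alternatives); the function name `threshold_clusters`, the 0.8 threshold, and the 2-dimensional toy vectors are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def threshold_clusters(vectors, threshold=0.8):
    """Greedy clustering: a vector joins the first cluster whose seed vector
    is at least `threshold` similar; otherwise it seeds a new cluster."""
    clusters = []  # each cluster is a list of indices; element 0 is the seed
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            # No existing cluster was similar enough.
            clusters.append([i])
    return clusters


# Toy vectors standing in for document embeddings (assumption).
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
groups = threshold_clusters(docs, threshold=0.8)
print(groups)
```

The result depends on input order and on the chosen threshold, so this is best read as a starting point rather than a tuned method.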

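Natural language understanding

Sentence similarity computation

Compute the semantic similarity between two sentences

Applicable to question answering systems, paraphrase detection, and similar applications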
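
Sentence similarity is commonly scored as the cosine of the angle between two embeddings. The sketch below shows that computation together with a hypothetical paraphrase rule; the 0.8 cutoff and the toy 3-dimensional vectors are assumptions, and a real threshold would need tuning on labeled pairs.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity in [-1, 1]; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_paraphrase(u, v, threshold=0.8):
    """Hypothetical rule: call the pair a paraphrase at or above the threshold."""
    return cosine_similarity(u, v) >= threshold


# Toy vectors standing in for sentence embeddings (assumption).
emb_a = [0.2, 0.7, 0.1]
emb_b = [0.25, 0.65, 0.05]
emb_c = [0.9, -0.1, 0.3]

print(is_paraphrase(emb_a, emb_b))
print(is_paraphrase(emb_a, emb_c))
```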

🚀 sbert-roberta-large-anli-mnli-snli

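This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search.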

Model information

Property	Details
Model type	Sentence similarity model
Training data	ANLI, Multi NLI, SNLI
Tags	sentence-transformers, feature-extraction, sentence-similarity, transformers

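Training details

Learning rate: 2e-5
Batch size: 8
Pooling method: Mean
Training time: about 20 hours on one NVIDIA GeForce RTX 2080 Ti

The model was initialized from RoBERTa-large weights and trained with the example script training_nli.py on ANLI (Nie et al., 2020), MNLI (Williams et al., 2018), and SNLI (Bowman et al., 2015).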

🚀 Quick start

Install dependencies

Using this model is straightforward once sentence-transformers is installed:

pip install -U sentence-transformers

Usage examples

Basic usage (Sentence-Transformers)

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer("usc-isi/sbert-roberta-large-anli-mnli-snli")
embeddings = model.encode(sentences)
print(embeddings)

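Advanced usage (Hugging Face Transformers)

Without sentence-transformers, the model can be used as follows: first pass the input through the Transformer model, then apply the appropriate pooling operation to the contextualized token embeddings.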

import torch
from transformers import AutoModel, AutoTokenizer


# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ["This is an example sentence", "Each sentence is converted"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("usc-isi/sbert-roberta-large-anli-mnli-snli")
model = AutoModel.from_pretrained("usc-isi/sbert-roberta-large-anli-mnli-snli")

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])

print("Sentence embeddings:")
print(sentence_embeddings)
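
To see why `mean_pooling` weights tokens by the attention mask, the following pure-Python sketch compares a masked average with a naive one on a single toy sequence. The helper name `masked_mean` and the numbers are illustrative assumptions; the logic mirrors the torch function above for one sentence.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            total = [t + x for t, x in zip(total, vec)]
            count += 1
    # Same guard as torch.clamp(..., min=1e-9): avoid division by zero.
    return [t / max(count, 1e-9) for t in total]


# Two real tokens followed by one padding token (mask 0).
tokens = [[1.0, 3.0], [3.0, 5.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
naive = [sum(col) / len(tokens) for col in zip(*tokens)]

print(pooled)  # [2.0, 4.0]
print(naive)   # skewed by the padding vector
```

Because sentences in a batch are padded to a common length, ignoring the mask would let padding positions distort the sentence vector.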

📚 Detailed documentation

Evaluation results

For evaluation results, see Section 4.1 of the paper.

Full model architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: RobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

📖 Citation and authors

For more information about the project, see our paper:

Ciosici, Manuel, et al. "Machine-Assisted Script Curation." Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations, Association for Computational Linguistics, 2021, pp. 8–17. ACLWeb, https://www.aclweb.org/anthology/2021.naacl-demos.2.

References

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.
Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. AdversarialNLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4885–4901, Online. Association for Computational Linguistics.
Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.