The open-source sbert-roberta-large-anli-mnli-snli model: sentence embeddings for sentence-similarity tasks

Sbert Roberta Large Anli Mnli Snli

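Developed by usc-isi

A sentence-transformers model based on RoBERTa-large, designed for sentence-similarity tasks and trained on the ANLI, MNLI, and SNLI datasets.

Text embedding

Transformers

English #Semantic sentence embeddings #Optimized for NLI tasks #Trained on multiple datasets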

Downloads: 38

Release date: 3/2/2022

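Model overview

This model maps sentences and paragraphs to a 768-dimensional vector space and is suited to natural language processing tasks such as semantic search and clustering.

Model features

High-quality sentence embeddings

Built on the RoBERTa-large architecture to produce sentence embedding representations.

Trained on multiple datasets

Jointly trained on three widely used natural language inference datasets: ANLI, MNLI, and SNLI.

Pooling strategy

Uses mean pooling to aggregate the token embeddings of a sentence into a single vector.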

Model capabilities

Sentence vectorization

Semantic similarity computation

Text clustering

Semantic search

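Use cases

Information retrieval

Semantic search systems

Build search systems that match on meaning rather than on keywords.

Improves the relevance of search results.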
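
The semantic search use case can be sketched as ranking corpus embeddings by cosine similarity to a query embedding. This is only a minimal illustration: the helper names `cosine_similarity` and `semantic_search` are hypothetical, and the toy 3-dimensional vectors stand in for the output of `model.encode(...)`.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)


def semantic_search(query_embedding, corpus_embeddings, top_k=3):
    """Return (corpus_index, score) pairs, best match first."""
    scored = [
        (index, cosine_similarity(query_embedding, embedding))
        for index, embedding in enumerate(corpus_embeddings)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors (assumption): real embeddings would come from model.encode(...).
corpus_embeddings = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
query_embedding = [1.0, 0.0, 0.0]

results = semantic_search(query_embedding, corpus_embeddings, top_k=2)
for index, score in results:
    print(index, round(score, 3))
```

For large corpora, a brute-force scan like this is usually replaced by a vector index, but the ranking criterion stays the same.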

Text analysis

Document clustering

Automatically group semantically similar documents.

Enables unsupervised organization of documents.
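
One simple way to group embeddings without labels is greedy threshold clustering, sketched below. This is not necessarily the method the model authors use: the function names, the `threshold` value of 0.8, and the toy 2-dimensional vectors are all assumptions made for illustration.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0.0 else list(vector)


def greedy_cluster(embeddings, threshold=0.8):
    """Assign each embedding to the first cluster whose seed is similar enough.

    The seed of a cluster is its first member. Returns lists of indices.
    """
    unit_vectors = [normalize(e) for e in embeddings]
    clusters = []  # each cluster is a list of indices; cluster[0] is the seed
    for index, vector in enumerate(unit_vectors):
        for cluster in clusters:
            seed = unit_vectors[cluster[0]]
            similarity = sum(x * y for x, y in zip(vector, seed))
            if similarity >= threshold:
                cluster.append(index)
                break
        else:
            # No existing cluster was close enough: start a new one.
            clusters.append([index])
    return clusters


# Toy vectors (assumption): real embeddings would come from model.encode(...).
embeddings = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
]
clusters = greedy_cluster(embeddings, threshold=0.8)
print(clusters)
```

The result depends on input order and on the threshold, so a suitable value has to be tuned per dataset.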

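Natural language understanding

Sentence similarity computation

Compute the semantic similarity between two sentences.

Usable in applications such as question answering systems and paraphrase detection.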

🚀 sbert-roberta-large-anli-mnli-snli

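This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search.

The model is weight-initialized from RoBERTa-large and trained on ANLI (Nie et al., 2020), MNLI (Williams et al., 2018), and SNLI (Bowman et al., 2015) using the training_nli.py example script.

Training details:

Learning rate: 2e-5
Batch size: 8
Pooling: Mean
Training time: about 20 hours on one NVIDIA GeForce RTX 2080 Ti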

🚀 Quick Start

📦 Installation

Using this model is easiest with sentence-transformers installed.

pip install -U sentence-transformers

💻 Usage Examples

Basic usage (Sentence-Transformers)

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer("usc-isi/sbert-roberta-large-anli-mnli-snli")
embeddings = model.encode(sentences)
print(embeddings)

Advanced usage (Hugging Face Transformers)

To use the model without sentence-transformers, first pass the input through the Transformer model, then apply the appropriate pooling operation to the contextualized word embeddings.

import torch
from transformers import AutoModel, AutoTokenizer


# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ["This is an example sentence", "Each sentence is converted"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("usc-isi/sbert-roberta-large-anli-mnli-snli")
model = AutoModel.from_pretrained("usc-isi/sbert-roberta-large-anli-mnli-snli")

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])

print("Sentence embeddings:")
print(sentence_embeddings)
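
To see why the attention mask matters for the averaging step, here is a small pure-Python version of the same masked mean applied to one toy sentence. It is only a sketch: `masked_mean` is a hypothetical name, and the 2-dimensional token vectors are made up so the arithmetic is easy to follow.

```python
def masked_mean(token_embeddings, attention_mask, eps=1e-9):
    """Average token vectors where the mask is 1, ignoring padding (mask 0)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vector, mask in zip(token_embeddings, attention_mask):
        for d in range(dim):
            summed[d] += vector[d] * mask
    # Lower bound on the divisor, mirroring torch.clamp(..., min=1e-9) above.
    count = max(float(sum(attention_mask)), eps)
    return [value / count for value in summed]


# One sentence with three token positions; the last one is padding.
token_embeddings = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

pooled = masked_mean(token_embeddings, attention_mask)
print(pooled)  # [2.0, 3.0]
```

Without the mask, the padding vector would be averaged in and distort the sentence embedding.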

📚 Documentation

Evaluation results

For evaluation results, see Section 4.1 of the paper (Ciosici et al., 2021, cited below).

Model architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: RobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

Citation and Authors

For more details about this project, see the following paper.

Ciosici, Manuel, et al. "Machine-Assisted Script Curation." Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations, Association for Computational Linguistics, 2021, pp. 8–17. ACLWeb, https://www.aclweb.org/anthology/2021.naacl-demos.2.

References

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.
Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. AdversarialNLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4885–4901, Online. Association for Computational Linguistics.
Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.

📄 Information Table

Attribute	Details
Pipeline tag	Sentence Similarity
Tags	sentence-transformers, feature-extraction, sentence-similarity, transformers
Training datasets	anli, multi_nli, snli