miCSE開源句子嵌入模型 - 低樣本量下高效計算句子相似度

首頁

Micse

由sap-ai-research開發

miCSE是一個專為句子相似度計算而設計的低樣本量句子嵌入模型，通過注意力模式對齊和正則化自注意力分佈實現高效的樣本學習。

文本嵌入

Transformers

英語開源協議:Apache-2.0 #低樣本學習 #句子相似度 #自注意力對齊

下載量 30

發布時間 : 11/9/2022

模型概述

miCSE模型通過學習強制實現不同dropout增強視角間的句法一致性，通過正則化自注意力分佈使表徵學習更具樣本效率，特別適合訓練數據有限的場景。

模型特點

低樣本量學習

通過正則化自注意力分佈實現高效的樣本學習，適合訓練數據有限的場景。

注意力模式對齊

通過強制不同數據增強視角間的注意力模式對齊進行訓練，提高模型性能。

句法一致性

學習強制實現不同dropout增強視角間的句法一致性，提升表徵質量。

模型能力

句子相似度計算

文本檢索

句子聚類

使用案例

信息檢索

文檔檢索

使用miCSE生成的句子嵌入進行相似文檔檢索。

文本分析

句子聚類

利用miCSE對句子進行嵌入後，使用聚類算法對相似句子進行分組。

如示例2所示，能夠有效區分不同情感傾向的推文。

語義相似度計算

計算兩個句子之間的餘弦相似度，評估其語義相似性。

如示例1所示，能夠準確計算句子間的相似度。

🚀 互信息對比句嵌入模型（miCSE）：小樣本場景下的句嵌入方案

miCSE 是一種專為小樣本句嵌入設計的語言模型，通過正則化自注意力分佈，實現句法一致性學習，提升樣本利用效率，尤其適用於訓練數據有限的現實應用場景。它可用於句子或短段落編碼，在檢索、句子相似度比較和聚類等任務中表現出色。

🚀 快速開始

安裝依賴

確保你已經安裝了所需的 Python 庫，如 transformers、torch、numpy、tqdm、datasets、umap-learn、sentence-transformers 等。你可以使用以下命令進行安裝：

pip install transformers torch numpy tqdm datasets umap-learn sentence-transformers

模型使用

以下是幾個使用 miCSE 模型的示例：

句子相似度比較

from transformers import AutoTokenizer, AutoModel
import torch.nn as nn

tokenizer = AutoTokenizer.from_pretrained("sap-ai-research/miCSE")

model = AutoModel.from_pretrained("sap-ai-research/miCSE")

# 對列表中的句子進行編碼，預設最大 token 長度 (max_length)
max_length = 32

sentences = [
    "This is a sentence for testing miCSE.", 
    "This is yet another test sentence for the mutual information Contrastive Sentence Embeddings model."
]

batch = tokenizer.batch_encode_plus(
                sentences,
                return_tensors='pt',
                padding=True,
                max_length=max_length,
                truncation=True
            )

# 計算嵌入並僅保留 [CLS] 嵌入（第一個 token）
# 獲取原始嵌入（無梯度）
with torch.no_grad():
    outputs = model(**batch, output_hidden_states=True, return_dict=True)

embeddings = outputs.last_hidden_state[:,0]

# 定義相似度度量，例如餘弦相似度
sim = nn.CosineSimilarity(dim=-1)

# 計算第一句和第二句之間的相似度
cos_sim = sim(embeddings.unsqueeze(1),
             embeddings.unsqueeze(0))
             
print(f"Distance: {cos_sim[0,1].detach().item()}")

聚類

from transformers import AutoTokenizer, AutoModel
import torch.nn as nn
import torch
import numpy as np
import tqdm
from datasets import load_dataset
import umap
import umap.plot as umap_plot

# 確定可用的硬件
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
    
# 加載分詞器和模型
tokenizer = AutoTokenizer.from_pretrained("/Users/d065243/miCSE")
model = AutoModel.from_pretrained("/Users/d065243/miCSE")
model.to(device);

# 加載 Twitter 數據進行情感聚類
dataset = load_dataset("tweet_eval", "sentiment")

# 計算推文的嵌入
# 設置批量大小和最大推文 token 長度
batch_size = 50
max_length = 128

iterations = int(np.floor(len(dataset['train'])/batch_size))*batch_size

embedding_stack = []
classes = []
for i in tqdm.notebook.tqdm(range(0,iterations,batch_size)):
    # 創建批次
    batch = tokenizer.batch_encode_plus(
                    dataset['train'][i:i+batch_size]['text'],
                    return_tensors='pt',
                    padding=True,
                    max_length=max_length,
                    truncation=True
                ).to(device)
    classes = classes + dataset['train'][i:i+batch_size]['label'] 

    # 無梯度的模型推理
    with torch.no_grad():
        outputs = model(**batch, output_hidden_states=True, return_dict=True)
        
        embeddings = outputs.last_hidden_state[:,0]
        
        embedding_stack.append( embeddings.cpu().clone() )

embeddings = torch.vstack(embedding_stack)

# 使用 UMAP 在二維空間中對嵌入進行聚類
umap_model = umap.UMAP(n_neighbors=250,
                    n_components=2,
                    min_dist=1.0e-9,
                    low_memory=True,
                    angular_rp_forest=True,
                    metric='cosine')
umap_model.fit(embeddings)

# 繪製結果
umap_plot.points(umap_model, labels = np.array(classes),theme='fire')

UMAP Cluster

使用 SentenceTransformers

from sentence_transformers import SentenceTransformer, util
from sentence_transformers import models
import torch.nn as nn

# 使用帶有 [CLS] 嵌入的模型
model_name = 'sap-ai-research/miCSE'
word_embedding_model = models.Transformer(model_name, max_seq_length=32)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# 使用餘弦相似度作為度量
cos_sim = nn.CosineSimilarity(dim=-1)

# 用於比較的句子列表
sentences_1 = ["This is a sentence for testing miCSE.", 
    "This is using mutual information Contrastive Sentence Embeddings model."]

sentences_2 = ["This is testing miCSE.", 
    "Similarity with miCSE"]

# 計算兩個列表的嵌入
embeddings_1 = model.encode(sentences_1, convert_to_tensor=True)
embeddings_2 = model.encode(sentences_2, convert_to_tensor=True)

# 計算餘弦相似度
cosine_sim_scores = cos_sim(embeddings_1, embeddings_2)

# 輸出結果
for i in range(len(sentences_1)):
    print(f"Similarity {cosine_sim_scores[i][i]:.2f}: {sentences_1[i]} << vs. >> {sentences_2[i]}")

✨ 主要特性

句法一致性學習：在對比學習過程中，miCSE 模型通過正則化自注意力分佈，強制不同隨機失活增強視圖之間的句法一致性，使表示學習更加高效。
樣本高效性：通過正則化自注意力，模型在訓練過程中對樣本的利用效率大大提高，即使在訓練集規模有限的情況下，也能實現自監督學習。
適用於現實應用：由於現實應用中訓練數據通常有限，miCSE 的小樣本學習能力使其在實際場景中具有很高的應用價值。

📦 安裝指南

安裝所需的 Python 庫，可使用以下命令：

pip install transformers torch numpy tqdm datasets umap-learn sentence-transformers

💻 使用示例

基礎用法

上述的句子相似度比較示例展示了 miCSE 模型的基礎用法，通過輸入句子，模型輸出句子的嵌入向量，進而計算句子之間的相似度。

高級用法

聚類示例展示瞭如何使用 miCSE 模型對文本數據進行聚類分析，通過 UMAP 算法將高維的句子嵌入映射到二維空間，直觀地展示聚類結果。

📚 詳細文檔

模型用途

miCSE 模型旨在對句子或短段落進行編碼，輸入文本後，模型會生成一個捕捉語義的向量嵌入。句子表示對應於 [CLS] 標記的嵌入，該嵌入可用於多種任務，如檢索、句子相似度比較或聚類。

訓練數據

全量訓練數據：模型在從維基百科隨機收集的英語句子上進行訓練，全量訓練文件可點擊此處獲取。
小樣本訓練數據：小樣本訓練數據由 SimCSE 訓練語料庫的不同大小（從 10% 到 0.0064%）的數據分割組成。每個分割大小包含 5 個文件，文件名後綴表示不同的隨機種子。數據可點擊此處下載。

模型訓練

若要利用 miCSE 的小樣本學習能力，需要在自己的數據上對模型進行訓練。論文中使用的源代碼和數據分割可點擊此處獲取。

🔧 技術細節

Schematic illustration of attention mutual information (AMI) computation miCSE 語言模型通過在對比學習過程中對不同視圖（增強嵌入）的注意力模式進行對齊訓練，用於句子相似度計算。直觀地說，使用 miCSE 學習句子嵌入需要在隨機失活增強視圖之間強制實現句法一致性。實際上，這是通過正則化自注意力分佈來實現的。在訓練過程中對自注意力進行正則化，使得表示學習更加高效，即使訓練集規模有限，自監督學習也變得可行。這種特性使得 miCSE 在訓練數據通常有限的現實應用中特別有吸引力。

📄 許可證

本項目採用 Apache 2.0 許可證。

📊 基準測試

模型在 SentEval 基準測試中的結果如下：

點擊展開

+-------+-------+-------+-------+-------+--------------+-----------------+--------+                                               
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness | S.Avg. |                                               
+-------+-------+-------+-------+-------+--------------+-----------------+--------+                                               
| 71.71 | 83.09 | 75.46 | 83.13 | 80.22 |    79.70     |      73.62      | 78.13  |                                               
+-------+-------+-------+-------+-------+--------------+-----------------+--------+

📖 引用

如果您在研究中使用了此代碼或引用我們的工作，請引用以下文獻：

@inproceedings{klein-nabi-2023-micse,
    title = "mi{CSE}: Mutual Information Contrastive Learning for Low-shot Sentence Embeddings",
    author = "Klein, Tassilo  and
      Nabi, Moin",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.339",
    pages = "6159--6177",
    abstract = "This paper presents miCSE, a mutual information-based contrastive learning framework that significantly advances the state-of-the-art in few-shot sentence embedding.The proposed approach imposes alignment between the attention pattern of different views during contrastive learning. Learning sentence embeddings with miCSE entails enforcing the structural consistency across augmented views for every sentence, making contrastive self-supervised learning more sample efficient. As a result, the proposed approach shows strong performance in the few-shot learning domain. While it achieves superior results compared to state-of-the-art methods on multiple benchmarks in few-shot learning, it is comparable in the full-shot scenario. This study opens up avenues for efficient self-supervised learning methods that are more robust than current contrastive methods for sentence embedding.",
}