LLM2Vec open-source text encoding model: convert a large language model into a text encoder, free of charge


LLM2Vec-Meta-Llama-31-8B-Instruct-mntp-unsup-simcse

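Developed by McGill-NLP

LLM2Vec converts a decoder-only large language model into a text encoder. The conversion enables bidirectional attention, then trains the model with masked next token prediction and unsupervised contrastive learning.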

Text embeddings

Safetensors

Language: English. Open-source license: MIT. #Decoder-to-encoder conversion #Unsupervised contrastive learning #Instruction-aware embeddings

Downloads: 55

Release date: 10/8/2024

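Model overview

This model is a large language model converted into a text encoder through a three-step process. It supports tasks such as text embedding and information retrieval, and it can be fine-tuned further to improve performance.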

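Model features

Bidirectional attention

Enabling bidirectional attention lets every token attend to the tokens after it as well as before it, which strengthens the model's use of context.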
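As a rough illustration of the difference, the sketch below builds a causal mask and a bidirectional mask for a four-token sequence. The function names are made up for this example and are not part of the llm2vec package; the real change happens inside the model's attention layers.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def render(mask):
    """Draw the mask with 'x' for allowed attention and '.' for blocked attention."""
    return "\n".join(
        " ".join("x" if allowed else "." for allowed in row) for row in mask
    )


causal = causal_mask(4)
full = bidirectional_mask(4)

print("causal (decoder-only default):")
print(render(causal))
print("bidirectional (after conversion):")
print(render(full))
```

In the causal mask the first token sees only itself, so its representation cannot reflect the rest of the sentence. The bidirectional mask removes that restriction.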

Unsupervised contrastive learning

Unsupervised contrastive learning (SimCSE) improves the quality of the text representations.
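A minimal sketch of a SimCSE-style objective follows, assuming the usual setup in which each sentence is encoded twice with different dropout noise and the other sentences in the batch act as negatives. The vectors are toy values, `simcse_loss` is a hypothetical helper, and the temperature of 0.05 is an assumption for illustration rather than a setting taken from this model card.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def simcse_loss(view_a, view_b, temperature=0.05):
    """Mean InfoNCE loss: view_a[i] should match view_b[i] rather than any other view_b[j]."""
    total = 0.0
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_denominator - logits[i]
    return total / len(view_a)


# Toy embeddings: two sentences, each encoded twice.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b_aligned = [[0.9, 0.1], [0.1, 0.9]]   # each pair points the same way
view_b_swapped = [[0.1, 0.9], [0.9, 0.1]]   # pairs are mismatched

aligned_loss = simcse_loss(view_a, view_b_aligned)
swapped_loss = simcse_loss(view_a, view_b_swapped)
print(aligned_loss < swapped_loss)
```

The loss is small when the two views of the same sentence are closer to each other than to the other sentences in the batch, which is the property the training step rewards.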

Fine-tuning compatibility

Supports further fine-tuning to reach state-of-the-art performance.

Model capabilities

Text embedding generation

Information retrieval

Semantic textual similarity

Text classification

Text clustering

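Use cases

Information retrieval

Web search query matching

Matches a user query against candidate documents and retrieves the relevant ones.

In the usage example, a query and its relevant document reach a cosine similarity of about 0.6.

Question answering systems

Question answering about protein intake

Answers a question about daily protein intake for women.

The query is matched to the passage that cites the CDC guideline.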

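🚀 LLM2Vec

LLM2Vec is a simple recipe for converting a decoder-only large language model (LLM) into a text encoder. It consists of three steps: enabling bidirectional attention, masked next token prediction (MNTP), and unsupervised contrastive learning. The resulting model can be fine-tuned further to reach state-of-the-art performance.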

🚀 Quick start

LLM2Vec turns a decoder-only LLM into a text encoder. This README covers how to install LLM2Vec and walks through a usage example.

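✨ Key features

Converts a decoder-only LLM into a text encoder.
Uses three steps: bidirectional attention, masked next token prediction, and unsupervised contrastive learning.
Can be fine-tuned further to reach state-of-the-art performance.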
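The masked next token prediction step can be pictured with a small sketch. It assumes the formulation in which a masked token at position i is predicted from the model output at position i - 1, so the objective stays aligned with the next-token head of the original decoder. The helper `mntp_targets`, the `_` mask symbol, and the chosen mask positions are all illustrative and do not come from the llm2vec package.

```python
def mntp_targets(tokens, mask_positions, mask_token="_"):
    """Return the masked sequence and (output_position, target_token) pairs.

    A token masked at position i is scored against the output at position i - 1.
    """
    masked = list(tokens)
    targets = []
    for i in sorted(mask_positions):
        if i == 0:
            continue  # position 0 has no previous position to predict from
        masked[i] = mask_token
        targets.append((i - 1, tokens[i]))
    return masked, targets


tokens = ["the", "cat", "sat", "on", "the", "mat"]
masked, targets = mntp_targets(tokens, {2, 5})

print(masked)   # ['the', 'cat', '_', 'on', 'the', '_']
print(targets)  # [(1, 'sat'), (4, 'mat')]
```

Because attention is bidirectional during this step, the output at position i - 1 can draw on tokens both before and after the masked slot.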

📦 Installation

pip install llm2vec

💻 Usage example

Basic usage

from llm2vec import LLM2Vec

import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from peft import PeftModel

# Load the base Llama 3.1 model, along with custom code that enables bidirectional connections in decoder-only LLMs. The MNTP LoRA weights are then merged into the base model.
tokenizer = AutoTokenizer.from_pretrained(
    "McGill-NLP/LLM2Vec-Meta-Llama-31-8B-Instruct-mntp"
)
config = AutoConfig.from_pretrained(
    "McGill-NLP/LLM2Vec-Meta-Llama-31-8B-Instruct-mntp", trust_remote_code=True
)
model = AutoModel.from_pretrained(
    "McGill-NLP/LLM2Vec-Meta-Llama-31-8B-Instruct-mntp",
    trust_remote_code=True,
    config=config,
    torch_dtype=torch.bfloat16,
    device_map="cuda" if torch.cuda.is_available() else "cpu",
)
model = PeftModel.from_pretrained(
    model,
    "McGill-NLP/LLM2Vec-Meta-Llama-31-8B-Instruct-mntp",
)
model = model.merge_and_unload()  # This can take several minutes on cpu

# Loading unsupervised SimCSE model. This loads the trained LoRA weights on top of MNTP model. Hence the final weights are -- Base model + MNTP (LoRA) + SimCSE (LoRA).
model = PeftModel.from_pretrained(
    model, "McGill-NLP/LLM2Vec-Meta-Llama-31-8B-Instruct-mntp-unsup-simcse"
)

# Wrapper for encoding and pooling operations
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)

# Encoding queries using instructions
instruction = (
    "Given a web search query, retrieve relevant passages that answer the query:"
)
queries = [
    [instruction, "how much protein should a female eat"],
    [instruction, "summit define"],
]
q_reps = l2v.encode(queries)

# Encoding documents. Instructions are not required for documents
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments.",
]
d_reps = l2v.encode(documents)

# Compute cosine similarity
q_reps_norm = torch.nn.functional.normalize(q_reps, p=2, dim=1)
d_reps_norm = torch.nn.functional.normalize(d_reps, p=2, dim=1)
cos_sim = torch.mm(q_reps_norm, d_reps_norm.transpose(0, 1))

print(cos_sim)
"""
tensor([[0.6007, 0.3518],
        [0.4131, 0.4855]])
"""
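
To turn the similarity matrix into retrieval results, take the highest-scoring document in each query's row. The sketch below hard-codes the scores printed above so it runs without loading the model; `best_document` is a hypothetical helper, not part of llm2vec.

```python
# Rows are queries, columns are documents (values copied from the output above).
cos_sim = [
    [0.6007, 0.3518],  # "how much protein should a female eat"
    [0.4131, 0.4855],  # "summit define"
]


def best_document(scores):
    """Return, for each query row, the index of the highest-scoring document."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


best = best_document(cos_sim)
print(best)  # [0, 1]
```

Each query ranks its own document first: the protein question matches the CDC passage, and "summit define" matches the dictionary definition.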