FollowIR-7B: an open-source instruction-following retrieval model, specialized for reranking to improve search quality

FollowIR-7B

Developed by jhu-clsp

FollowIR-7B is an instruction-following retrieval model fine-tuned from Mistral-7B-Instruct-v0.2 and specialized for reranking in retrieval tasks.

Large language model

Transformers

Language: English. Open-source license: Apache-2.0. Tags: #Instruction-based reranking #TREC evaluation optimization #Query-document relevance assessment

Downloads: 39

Release date: 3/17/2024

Model overview

FollowIR-7B is a language model instruction-tuned for reranking in retrieval. It was fine-tuned on retrieval data from the FollowIR dataset, whose human-written instructions come from TREC evaluation tracks.

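Model features

Instruction fine-tuning

The model was fine-tuned on retrieval data and human-written instructions from the FollowIR dataset, so it is better at understanding and acting on the instructions that accompany a retrieval query.

Reranking

The model is specialized for reranking: given a query with an instruction, it scores candidate documents so that results can be reordered to match the instruction.

Performance

The model performs well on several instruction retrieval tasks, and its authors report that it outperforms other retrieval models at following instructions.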

Model capabilities

Instruction-following retrieval

Reranking

Query-document relevance scoring

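Use cases

Information retrieval

Film information retrieval

Retrieves information on films associated with a particular director or screenwriter, based on the instruction in the query.

The model can accurately identify documents that are relevant to the instruction in the query, for example films directed by James Cameron.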

🚀 FollowIR-7B

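FollowIR-7B is an instruction-tuned language model for reranking in retrieval. It was created by fine-tuning Mistral-7B-Instruct-v0.2 on retrieval data with instructions from the FollowIR dataset. The instructions are taken from TREC tracks and are human-written. The authors report that FollowIR-7B outperforms all other retrieval models at following instructions; see the paper cited at the end of this page for details.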

🚀 Quick start

FollowIR-7B is an instruction-tuned language model specialized for reranking in retrieval. The sections below describe how to use the model and how it was trained.

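✨ Key features

A language model fine-tuned on retrieval data with instructions.
Specialized for reranking in retrieval.
Reported to outperform other retrieval models at following instructions.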

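💻 Usage examples

Basic usage

The example below computes a relevance score for each query-document pair. Each query includes its instruction, and the score is the probability the model assigns to the answer "true".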

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
)
import torch

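# model loading and setup
model_name = "jhu-clsp/FollowIR-7B"
# use a GPU when one is available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(
    model_name
).to(device)
model.eval()  # inference only
tokenizer = AutoTokenizer.from_pretrained(
    model_name, padding_side="left"
)
# left padding keeps the final prompt token at position -1 for every row
tokenizer.pad_token = tokenizer.eos_token
# vocabulary ids of the two answer tokens whose logits are compared below
token_false_id = tokenizer.get_vocab()["false"]
token_true_id = tokenizer.get_vocab()["true"]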
template = """<s> [INST] You are an expert Google searcher, whose job is to determine if the following document is relevant to the query (true/false). Answer using only one word, one of those two choices.

Query: {query}
Document: {text}
Relevant (only output one word, either "true" or "false"): [/INST] """


# Let's define some example queries (each with its instruction included) and the passages to score
query1 = "What movies were written by James Cameron? A relevant document would describe a movie that was written by James Cameron only and not with anyone else"
query2 = "What movies were directed by James Cameron? A relevant document would describe any movie that was directed by James Cameron"
passages = ["Avatar: The Way of Water is a 2022 American epic science fiction film co-produced and directed by James Cameron, who co-wrote the screenplay with Rick Jaffa and Amanda Silver from a story the trio wrote with Josh Friedman and Shane Salerno. Distributed by 20th Century Studios, it is the sequel to Avatar (2009) and the second installment in the Avatar film series."] * 2

prompts = [
    template.format(query=query, text=text) for (query, text) in zip([query1, query2], passages)
]
tokens = tokenizer(
    prompts,
    padding=True,
    truncation=True,
    return_tensors="pt",
    pad_to_multiple_of=None,
)

# move the inputs to the same device as the model
tokens = tokens.to(model.device)

# calculate the scores by comparing the logits of the "true" and "false" tokens
with torch.no_grad():  # no gradients are needed for scoring
    batch_scores = model(**tokens).logits[:, -1, :]
true_vector = batch_scores[:, token_true_id]
false_vector = batch_scores[:, token_false_id]
batch_scores = torch.stack([false_vector, true_vector], dim=1)
batch_scores = torch.nn.functional.log_softmax(batch_scores, dim=1)
scores = batch_scores[:, 1].exp().tolist()  # probability of "true" for each pair
print(scores) # [0.0020704232156276703, 0.9999990463256836] first document is not relevant, as expected
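
The last step above takes a softmax over just the "false" and "true" logits and keeps the probability of "true". With only two classes this is the same as a sigmoid of the logit difference, and reranking then amounts to sorting candidates by that probability. The following is a minimal sketch in plain Python that needs no model; `relevance_probability` and `rerank` are hypothetical helper names, and the logits in the usage lines are made up for illustration rather than taken from FollowIR-7B.

```python
import math


def relevance_probability(true_logit: float, false_logit: float) -> float:
    """P("true") under a softmax restricted to the two answer tokens.

    Equivalent to sigmoid(true_logit - false_logit).
    """
    diff = true_logit - false_logit
    # Numerically stable sigmoid for large positive or negative differences.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def rerank(passages, scores):
    """Return (passage, score) pairs sorted from most to least relevant."""
    if len(passages) != len(scores):
        raise ValueError("passages and scores must have the same length")
    return sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)


# Illustrative logits only, not real model output.
example_scores = [
    relevance_probability(true_logit=-2.0, false_logit=4.0),
    relevance_probability(true_logit=5.0, false_logit=-1.0),
]
ranked = rerank(["passage A", "passage B"], example_scores)
print(ranked[0][0])  # passage B
```

In practice the `scores` list printed by the example above could be passed to `rerank` together with the candidate passages for a single query.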

📚 Documentation

Model information

Attribute	Details
Model type	Instruction-tuned language model
Training data	jhu-clsp/FollowIR-train

Evaluation results

Task	Dataset	p-MRR
InstructionRetrieval	MTEB Core17InstructionRetrieval	16.47851858684521
InstructionRetrieval	MTEB News21InstructionRetrieval	6.2615989256510005
InstructionRetrieval	MTEB Robust04InstructionRetrieval	13.717553757582253
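
p-MRR is the pairwise metric introduced in the FollowIR paper. As the paper describes it, the metric compares the rank of a document under the original instruction with its rank under an altered instruction that makes the document non-relevant. The per-document value lies between -1 and 1 and is reported multiplied by 100, so a positive number means the model tended to demote such documents and 0 means the ranking did not change. The sketch below is one reading of that definition, not the evaluation code behind the table; `p_mrr_single` is a hypothetical name.

```python
def p_mrr_single(rank_original: int, rank_new: int) -> float:
    """Per-document p-MRR from 1-based ranks before and after the instruction change."""
    if rank_original < 1 or rank_new < 1:
        raise ValueError("ranks are 1-based and must be >= 1")
    mrr_original = 1.0 / rank_original
    mrr_new = 1.0 / rank_new
    if rank_original > rank_new:
        # The document moved up although it became non-relevant: negative score.
        return mrr_original / mrr_new - 1.0
    # The document moved down (or stayed put): positive (or zero) score.
    return 1.0 - mrr_new / mrr_original


demoted = p_mrr_single(rank_original=1, rank_new=10)   # followed the instruction
promoted = p_mrr_single(rank_original=10, rank_new=1)  # did the opposite
unchanged = p_mrr_single(rank_original=5, rank_new=5)
print(demoted, promoted, unchanged)
```

According to the paper, these per-document values are averaged within each query and then across queries to give the dataset-level figure.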

🔧 Technical details

Training

FollowIR-7B was created by fine-tuning Mistral-7B-Instruct-v0.2 with LLaMA-Factory. An example training script follows.

#!/bin/bash
accelerate launch src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path "mistralai/Mistral-7B-Instruct-v0.2" \
    --dataset followIR-train \
    --template mistral \
    --output_dir OUTPUT \
    --finetuning_type lora \
    --lora_target q_proj,v_proj,o_proj,k_proj \
    --overwrite_cache \
    --per_device_train_batch_size 32 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 2 \
    --save_steps 29 \
    --learning_rate 3e-5 \
    --num_train_epochs 8.0 \
    --plot_loss \
    --max_length 2048 \
    --lora_rank 8 \
    --lora_alpha 16 \
    --bf16
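
The script trains rank-8 LoRA adapters on the four attention projections instead of updating all model weights. To give a rough sense of scale, the sketch below counts the adapter parameters this implies. The layer dimensions (hidden size 4096, 32 layers, key/value projection width 1024) are assumptions taken from the published Mistral-7B configuration, not from this README, and the count covers only the adapter matrices.

```python
LORA_RANK = 8     # --lora_rank in the script above
LORA_ALPHA = 16   # --lora_alpha in the script above
NUM_LAYERS = 32   # assumed Mistral-7B depth

# (in_features, out_features) per targeted projection; assumed Mistral-7B shapes.
PROJECTION_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}


def lora_parameter_count(in_features: int, out_features: int, rank: int) -> int:
    """A LoRA adapter adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(
    lora_parameter_count(n_in, n_out, LORA_RANK)
    for n_in, n_out in PROJECTION_SHAPES.values()
)
total_adapter_params = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / LORA_RANK  # standard LoRA scaling applied to the update B @ A

print(total_adapter_params)  # 6815744
print(scaling)               # 2.0
```

Under these assumptions the adapters hold roughly 6.8 million trainable parameters, a small fraction of the base model's roughly 7 billion.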

📄 License

This model is released under the Apache-2.0 license.

📖 Citation

@misc{weller2024followir,
      title={FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions}, 
      author={Orion Weller and Benjamin Chang and Sean MacAvaney and Kyle Lo and Arman Cohan and Benjamin Van Durme and Dawn Lawrie and Luca Soldaini},
      year={2024},
      eprint={2403.15246},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}