🚀 FollowIR-7B
FollowIR-7B is an instruction-tuned language model for retrieval reranking. It is based on Mistral-7B-Instruct-v0.2 and fine-tuned on retrieval data paired with instructions from the FollowIR dataset. The instructions are human-written and adapted from TREC tracks. FollowIR-7B outperforms all other retrieval models at following instructions. See the paper for details.
✨ Key Features
- Instruction-tuned: fine-tuned from Mistral-7B-Instruct-v0.2 to follow retrieval instructions more faithfully.
- Strong instruction following: outperforms other retrieval models at following instructions.
📦 Installation
The source documentation does not provide installation steps.
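As a hedged note (an assumption, not from the model card): the usage example below only needs the transformers and torch packages plus a CUDA-capable GPU, so a typical setup would be:

pip install transformers torch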
💻 Usage Examples
Basic Usage
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
)
import torch

model_name = "jhu-clsp/FollowIR-7B"

# Load the model onto the GPU and configure left padding for batched scoring.
model = AutoModelForCausalLM.from_pretrained(model_name).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

# Token ids for the two allowed one-word answers.
token_false_id = tokenizer.get_vocab()["false"]
token_true_id = tokenizer.get_vocab()["true"]

template = """<s> [INST] You are an expert Google searcher, whose job is to determine if the following document is relevant to the query (true/false). Answer using only one word, one of those two choices.
Query: {query}
Document: {text}
Relevant (only output one word, either "true" or "false"): [/INST] """

query1 = "What movies were written by James Cameron? A relevant document would describe a movie that was written by James Cameron only and not with anyone else"
query2 = "What movies were directed by James Cameron? A relevant document would describe any movie that was directed by James Cameron"
passages = ["Avatar: The Way of Water is a 2022 American epic science fiction film co-produced and directed by James Cameron, who co-wrote the screenplay with Rick Jaffa and Amanda Silver from a story the trio wrote with Josh Friedman and Shane Salerno. Distributed by 20th Century Studios, it is the sequel to Avatar (2009) and the second installment in the Avatar film series."] * 2

prompts = [
    template.format(query=query, text=text)
    for (query, text) in zip([query1, query2], passages)
]

tokens = tokenizer(
    prompts,
    padding=True,
    truncation=True,
    return_tensors="pt",
    pad_to_multiple_of=None,
)
for key in tokens:
    tokens[key] = tokens[key].cuda()

# Score each prompt by the probability of "true" vs. "false" at the last position.
batch_scores = model(**tokens).logits[:, -1, :]
true_vector = batch_scores[:, token_true_id]
false_vector = batch_scores[:, token_false_id]
batch_scores = torch.stack([false_vector, true_vector], dim=1)
batch_scores = torch.nn.functional.log_softmax(batch_scores, dim=1)
scores = batch_scores[:, 1].exp().tolist()
print(scores)
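Building on the snippet above, the scoring pass can be wrapped into a small reranking helper. This is a minimal sketch, not part of the official model card: the function name rerank_passages is illustrative, and it reuses template, token_false_id, and token_true_id from the example above.

def rerank_passages(query, passages):
    # Format every (query, passage) pair with the relevance-judgment template.
    prompts = [template.format(query=query, text=p) for p in passages]
    tokens = tokenizer(
        prompts, padding=True, truncation=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        logits = model(**tokens).logits[:, -1, :]
    # Probability of "true" vs. "false" at the final position, as in the example above.
    pair = torch.stack([logits[:, token_false_id], logits[:, token_true_id]], dim=1)
    probs = torch.nn.functional.log_softmax(pair, dim=1)[:, 1].exp()
    # Return passages sorted from most to least relevant.
    return sorted(zip(probs.tolist(), passages), reverse=True)

For the two example queries, the second ("directed by") should score the passage much higher than the first ("written by James Cameron only"), since the document describes a film he co-wrote with others.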
📚 Documentation
Model Information
Model Results
Results on the MTEB instruction-following retrieval tasks (p-MRR is the paper's instruction-following metric; higher is better):

| Task Type | Dataset | p-MRR |
|-----------|---------|-------|
| InstructionRetrieval | MTEB Core17InstructionRetrieval | 16.48 |
| InstructionRetrieval | MTEB News21InstructionRetrieval | 6.26 |
| InstructionRetrieval | MTEB Robust04InstructionRetrieval | 13.72 |
Related Links
- Paper: FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions (https://arxiv.org/abs/2403.15246)
🔧 Technical Details
We used LLaMA-Factory to fine-tune Mistral into FollowIR-7B. Before fine-tuning, we converted the data into the format LLaMA-Factory expects: the query plus its instruction fill the input slot of the template, the relevance label is the output, and the instruction sits at the start of the template (a sketch of a converted example follows the script below). We trained with the following script:
#!/bin/bash
accelerate launch src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path "mistralai/Mistral-7B-Instruct-v0.2" \
    --dataset followIR-train \
    --template mistral \
    --output_dir OUTPUT \
    --finetuning_type lora \
    --lora_target q_proj,v_proj,o_proj,k_proj \
    --overwrite_cache \
    --per_device_train_batch_size 32 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 2 \
    --save_steps 29 \
    --learning_rate 3e-5 \
    --num_train_epochs 8.0 \
    --plot_loss \
    --max_length 2048 \
    --lora_rank 8 \
    --lora_alpha 16 \
    --bf16
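For reference, a converted training example might look like the following. This is a minimal sketch assuming LLaMA-Factory's alpaca-style instruction/input/output fields and reusing the template from the usage example; the field contents are illustrative, not taken from the released followIR-train data.

# One converted example (illustrative): the filled-in template is the model input,
# and the relevance label ("true"/"false") is the supervision target.
example = {
    "instruction": "",
    "input": template.format(
        query="What movies were directed by James Cameron? "
              "A relevant document would describe any movie that was directed by James Cameron",
        text="Avatar: The Way of Water is a 2022 American epic science fiction film ...",
    ),
    "output": "true",
}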
📄 License
This project is released under the Apache-2.0 license.
📖 Citation
@misc{weller2024followir,
    title={FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions},
    author={Orion Weller and Benjamin Chang and Sean MacAvaney and Kyle Lo and Arman Cohan and Benjamin Van Durme and Dawn Lawrie and Luca Soldaini},
    year={2024},
    eprint={2403.15246},
    archivePrefix={arXiv},
    primaryClass={cs.IR}
}