LdIR-Qwen2-reranker-1.5B open-source model: efficient reranking for Chinese medical Q&A and general text

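Developed by neofung

A downstream-task model based on Qwen2-1.5B that focuses on reranking and performs strongly on Chinese medical question answering and general text reranking.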

Text embedding

Transformers

Multilingual support. License: Apache-2.0. Tags: Chinese Q&A reranking, medical information retrieval, 1.5B parameters

Downloads: 51

Released: 8/13/2024

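Model overview

This reranking model is built on Qwen2-1.5B. It is intended to improve relevance ranking in retrieval systems and is specifically optimized for Chinese medical question answering.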

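Model features

Optimized for Chinese medical Q&A

Performs well on the CMedQA medical Q&A datasets, with MAP of 86.50 on CMedQAv1 and 87.11 on CMedQAv2

Multi-task adaptation

Supports multiple reranking tasks, covering both general text and the medical domain

Efficient inference

Supports FP16 acceleration and multi-GPU parallel inference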

Model capabilities

Text relevance reranking

Medical Q&A optimization

Cross-lingual reranking

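Use cases

Information retrieval

Medical Q&A systems

Improves the ranking quality of answers in medical Q&A systems

MRR of 88.91 on the CMedQAv1 dataset

Search engine optimization

Improves the relevance ranking of search engine results

MAP of 39.35 on the MMarco dataset (MMarcoReranking)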
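
As a rough illustration of how a reranker fits into the retrieval use cases above, the sketch below scores each (query, candidate) pair and keeps the highest-scoring candidates. The names `rerank`, `score_fn`, and `overlap_score` are hypothetical and not part of this model's code; `overlap_score` is a toy stand-in so the example runs without downloading the model. In practice, `score_fn` would be something like the `compute_score` method from the usage section, assuming it returns one score per pair.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[List[List[str]]], List[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k candidates sorted by descending relevance score."""
    # Build [query, candidate] pairs, the input shape a pairwise reranker expects.
    pairs = [[query, doc] for doc in candidates]
    scores = score_fn(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


def overlap_score(pairs: List[List[str]]) -> List[float]:
    """Hypothetical toy scorer: number of words shared by query and document."""
    return [
        float(len(set(q.lower().split()) & set(d.lower().split())))
        for q, d in pairs
    ]


docs = ["hi", "the panda is a bear", "what is a panda bear"]
top = rerank("what is panda", docs, overlap_score, top_k=2)
print(top)  # [('what is a panda bear', 3.0), ('the panda is a bear', 2.0)]
```

Passing the scorer in as a function keeps the sorting logic independent of the model, so the same helper can be exercised with a cheap stand-in before switching to the real reranker.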

🚀 LdIR-Qwen2-reranker-1.5B

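This model is a downstream-task model based on Qwen/Qwen2-1.5B. We drew on the FlagEmbedding reranker work and implemented it with Qwen2-1.5B as the pretrained model.

🚀 Quick start

Install dependencies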

transformers==4.41.2
flash-attn==2.5.7

Usage

from typing import cast, List, Union, Tuple, Dict, Optional
import numpy as np
import torch
from tqdm import tqdm
import transformers
from transformers import AutoTokenizer, PreTrainedModel, PreTrainedTokenizer, DataCollatorWithPadding
from transformers.models.qwen2 import Qwen2Config, Qwen2ForSequenceClassification
from transformers.trainer_pt_utils import LabelSmoother
IGNORE_TOKEN_ID = LabelSmoother.ignore_index

def preprocess(
    sources,
    tokenizer: transformers.PreTrainedTokenizer,
    max_len: int = 1024,
) -> Dict:

    # Apply prompt templates
    input_ids, attention_masks = [], []
    for i, source in enumerate(sources):
        messages = [
            {"role": "user",
            "content": "\n\n".join(source)}
        ]
        text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        model_inputs = tokenizer([text])
        input_id = model_inputs['input_ids'][0]
        attention_mask = model_inputs['attention_mask'][0]
        if len(input_id) > max_len:
            ## last five tokens: <|im_end|>(151645), \n(198), <|im_start|>(151644), assistant(77091), \n(198)
            diff = len(input_id) - max_len
            input_id = input_id[:-5-diff] + input_id[-5:]
            attention_mask = attention_mask[:-5-diff] + attention_mask[-5:]
            assert len(input_id) == max_len
        input_ids.append(input_id)
        attention_masks.append(attention_mask)

    return dict(
        input_ids=input_ids,
        attention_mask=attention_masks
    )

class FlagRerankerCustom:
    def __init__(
            self,
            model: PreTrainedModel,
            tokenizer: PreTrainedTokenizer,
            use_fp16: bool = False
    ) -> None:
        self.tokenizer = tokenizer
        self.model = model
        self.data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

        if torch.cuda.is_available():
            self.device = torch.device('cuda')
        elif torch.backends.mps.is_available():
            self.device = torch.device('mps')
        else:
            self.device = torch.device('cpu')
            use_fp16 = False
        if use_fp16:
            self.model.half()

        self.model = self.model.to(self.device)

        self.model.eval()

        self.num_gpus = torch.cuda.device_count()
        if self.num_gpus > 1:
            print(f"----------using {self.num_gpus}*GPUs----------")
            self.model = torch.nn.DataParallel(self.model)

    @torch.no_grad()
    def compute_score(self, sentence_pairs: Union[List[Tuple[str, str]], Tuple[str, str]], batch_size: int = 64,
                      max_length: int = 1024) -> List[float]:
        
        if self.num_gpus > 0:
            batch_size = batch_size * self.num_gpus

        assert isinstance(sentence_pairs, list)
        if isinstance(sentence_pairs[0], str):
            sentence_pairs = [sentence_pairs]

        all_scores = []
        for start_index in tqdm(range(0, len(sentence_pairs), batch_size), desc="Compute Scores",
                                disable=True):
            sentences_batch = sentence_pairs[start_index:start_index + batch_size]
            inputs = preprocess(sources=sentences_batch, tokenizer=self.tokenizer, max_len=max_length)
            inputs = [dict(zip(inputs, t)) for t in zip(*inputs.values())]
            inputs = self.data_collator(inputs).to(self.device)
            scores = self.model(**inputs, return_dict=True).logits
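            # Flatten to shape (batch,). A bare squeeze() yields a 0-dim tensor for a
            # single pair, and extend() below would then fail on a plain float.
            scores = scores.reshape(-1)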
            all_scores.extend(scores.detach().to(torch.float).cpu().numpy().tolist())

        if len(all_scores) == 1:
            return all_scores[0]
        return all_scores

tokenizer = transformers.AutoTokenizer.from_pretrained(
    "neofung/LdIR-Qwen2-reranker-1.5B",
    padding_side="right",
)

config = Qwen2Config.from_pretrained(
    "neofung/LdIR-Qwen2-reranker-1.5B",
    trust_remote_code=True,
    bf16=True,
)

model = Qwen2ForSequenceClassification.from_pretrained(
    "neofung/LdIR-Qwen2-reranker-1.5B",
    config = config,
    trust_remote_code = True,
)

model = FlagRerankerCustom(model=model, tokenizer=tokenizer, use_fp16=False)

pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]

model.compute_score(pairs)

# [-2.655318021774292, 11.7670316696167]
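
The truncation step inside `preprocess` above can be easier to follow in isolation. The sketch below reproduces the same slicing on a toy list of token ids: when the sequence is too long, tokens are dropped from just before the last five positions, so the chat-template suffix survives and the end of the passage is what gets cut. The helper name `truncate_keep_tail` and the toy ids `0..9` are illustrative assumptions. The five suffix ids are the ones listed in the comment inside `preprocess`.

```python
from typing import List


def truncate_keep_tail(token_ids: List[int], max_len: int, tail: int = 5) -> List[int]:
    """Shorten token_ids to max_len while keeping the last `tail` tokens intact.

    Assumes max_len >= tail, as in the original code (max_len=1024, tail=5).
    """
    if len(token_ids) <= max_len:
        return list(token_ids)
    diff = len(token_ids) - max_len
    # Drop `diff` tokens immediately before the tail, then re-attach the tail.
    return token_ids[: -tail - diff] + token_ids[-tail:]


# Toy input: ten "content" ids followed by the five template suffix ids.
ids = list(range(10)) + [151645, 198, 151644, 77091, 198]
short = truncate_keep_tail(ids, max_len=12)
print(short)  # [0, 1, 2, 3, 4, 5, 6, 151645, 198, 151644, 77091, 198]
```

Because the query comes first in each pair, this scheme normally shortens the passage rather than the query. A query longer than `max_len` would itself be cut.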

Evaluation on C-MTEB

from C_MTEB.tasks import *
from mteb import MTEB

save_name = "LdIR-Qwen2-reranker-1.5B"

evaluation = MTEB(
    task_types=["Reranking"], task_langs=['zh', 'zh2en', 'en2zh']
    )

evaluation.run(model, output_folder=f"reranker_results/{save_name}")

📊 Evaluation results

Task type	Dataset	Metric	Value
Reranking	MTEB CMedQAv1	MAP	86.50438688414654
Reranking	MTEB CMedQAv1	MRR	88.91170634920635
Reranking	MTEB CMedQAv2	MAP	87.10592353383732
Reranking	MTEB CMedQAv2	MRR	89.10178571428571
Reranking	MTEB MMarcoReranking	MAP	39.354813242907133
Reranking	MTEB MMarcoReranking	MRR	39.075793650793655
Reranking	MTEB T2Reranking	MAP	68.83696915006163
Reranking	MTEB T2Reranking	MRR	79.77644651857584
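
For readers unfamiliar with the two metrics in the table, the following is a minimal sketch of the textbook definitions of MAP and MRR, computed from binary relevance labels listed in ranked order. It is not necessarily the exact implementation MTEB uses, and the toy `rankings` data is made up for illustration. The sketch returns values in the range 0 to 1, whereas the table values appear to be scaled by 100.

```python
from typing import List


def average_precision(relevance: List[int]) -> float:
    """Mean of precision@k over the ranks k where a relevant item appears."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def reciprocal_rank(relevance: List[int]) -> float:
    """1 / rank of the first relevant item, or 0.0 if there is none."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def mean(values: List[float]) -> float:
    return sum(values) / len(values) if values else 0.0


# Toy data: one list per query, 1 = relevant, 0 = not relevant, in ranked order.
rankings = [[1, 0, 1], [0, 1, 0]]
map_score = mean([average_precision(r) for r in rankings])
mrr_score = mean([reciprocal_rank(r) for r in rankings])
print(round(map_score, 4), mrr_score)  # 0.6667 0.75
```

MRR only looks at the first relevant result, while MAP rewards placing every relevant result high in the list.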