abstract-sim-sentence open-source model: mapping abstract sentence descriptions to matching sentences

Abstract Sim Sentence

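Developed by biu-nlp

A model that maps abstract sentence descriptions to sentences that fit those descriptions. It is trained on Wikipedia and uses a dual-encoder architecture.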

Text Embedding

Transformers

English #abstract-sentence-matching #dual-encoder-architecture #wikipedia-trained

Downloads: 51

Released: 5/13/2023

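Model Overview

The model maps abstract sentence descriptions to sentences that fit those descriptions. It is mainly used for sentence similarity and feature extraction tasks.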

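Model Features

Dual-encoder architecture

A separate query encoder and sentence encoder process queries and sentences independently, with the aim of improving matching accuracy.

Trained on Wikipedia

The model is trained on Wikipedia data, so it covers a broad range of semantic content.

Efficient feature extraction

Sentence features can be extracted efficiently and used for similarity computation or other downstream tasks.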

Model Capabilities

Sentence feature extraction

Sentence similarity computation

Abstract sentence matching

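Use Cases

Information retrieval

Company relationship queries

Match relevant sentences to an abstract query such as "A company is a part of a larger company."

The model matches sentences that describe company relationships, such as subsidiaries and parent companies.

Semantic search

Abstract query matching

Map an abstract query to concrete, relevant sentences.

The model separates relevant from irrelevant sentences, and the resulting ranking matches expectations.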
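The ranking step behind this use case can be sketched without the model itself. The snippet below is a minimal illustration, assuming the embeddings are already available as plain lists of floats; the helper names `cosine` and `rank_sentences` and the three-dimensional toy vectors are hypothetical and are not real model outputs.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_sentences(
    query_vec: Sequence[float],
    sentence_vecs: Sequence[Sequence[float]],
    sentences: Sequence[str],
) -> List[Tuple[str, float]]:
    """Pair each sentence with its similarity to the query, most similar first."""
    scored = [(s, cosine(query_vec, v)) for s, v in zip(sentences, sentence_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors for illustration only; real embeddings come from the two encoders.
toy_query = [1.0, 0.0, 0.0]
toy_sentences = ["sentence A", "sentence B", "sentence C"]
toy_vectors = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]

ranked = rank_sentences(toy_query, toy_vectors, toy_sentences)
for sentence, score in ranked:
    print(sentence, round(score, 4))
```

Because cosine similarity ignores vector length, "sentence B" ranks first even though its vector is twice as long as the query vector.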

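🚀 Abstract Sentence Mapping Model

This model maps abstract sentence descriptions to sentences that fit those descriptions. It is trained on Wikipedia data. Use load_finetuned_model to load the query and sentence encoders, and use encode_batch() to encode sentences.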

🚀 Quick Start

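Note: this method uses a dual-encoder architecture. This is the sentence encoder; it should be used together with the query encoder (biu-nlp/abstract-sim-query).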

💻 Usage Examples

Basic usage

from typing import List

import torch
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoModel, AutoTokenizer


def load_finetuned_model():
    # The two encoders are separate checkpoints; the sentence model's tokenizer is used for both.
    sentence_encoder = AutoModel.from_pretrained("biu-nlp/abstract-sim-sentence")
    query_encoder = AutoModel.from_pretrained("biu-nlp/abstract-sim-query")
    tokenizer = AutoTokenizer.from_pretrained("biu-nlp/abstract-sim-sentence")

    return tokenizer, query_encoder, sentence_encoder


def encode_batch(model, tokenizer, sentences: List[str], device: str):
    # Tokenize the whole batch, padding to the longest sentence and truncating at 512 tokens.
    inputs = tokenizer(sentences, padding=True, max_length=512, truncation=True,
                       return_tensors="pt", add_special_tokens=True).to(device)
    # Last hidden states, shape (batch, seq_len, hidden). The model must already be on `device`.
    features = model(**inputs)[0]
    # Mean-pool over non-padding tokens, skipping position 0 (the first special token).
    mask = inputs["attention_mask"][:, 1:]
    summed = torch.sum(features[:, 1:, :] * mask.unsqueeze(-1), dim=1)
    counts = torch.clamp(torch.sum(mask, dim=1, keepdim=True), min=1e-9)
    return summed / counts
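The pooling inside encode_batch can be easier to follow on a tiny example. The sketch below re-creates the same arithmetic in plain Python for a single sentence: it drops position 0, multiplies each remaining token vector by its attention-mask value, and divides by the number of unmasked positions. The function name `masked_mean_skip_first` and the toy numbers are hypothetical, and it assumes position 0 holds a special start token such as [CLS].

```python
from typing import List


def masked_mean_skip_first(
    token_vectors: List[List[float]], attention_mask: List[int]
) -> List[float]:
    """Average the token vectors at positions 1..n-1 whose mask value is 1."""
    vectors = token_vectors[1:]  # skip position 0
    mask = attention_mask[1:]
    dim = len(token_vectors[0])
    summed = [0.0] * dim
    for vec, m in zip(vectors, mask):
        for i in range(dim):
            summed[i] += vec[i] * m  # padding positions (m == 0) contribute nothing
    # Same guard against division by zero as torch.clamp(..., min=1e-9).
    count = max(sum(mask), 1e-9)
    return [x / count for x in summed]


# One toy "sentence": a start token, two unmasked tokens, and one padding position.
toy_tokens = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [7.0, 7.0]]
toy_mask = [1, 1, 1, 0]

pooled = masked_mean_skip_first(toy_tokens, toy_mask)
print(pooled)  # [2.0, 3.0]
```

Neither the start-token vector nor the padding vector affects the result; only the two unmasked positions after position 0 are averaged.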

Advanced usage

tokenizer, query_encoder, sentence_encoder = load_finetuned_model()
relevant_sentences = ["Fingersoft's parent company is the Finger Group.",
                      "WHIRC – a subsidiary company of Wright-Hennepin",
                      "CK Life Sciences International (Holdings) Inc. (), or CK Life Sciences, is a subsidiary of CK Hutchison Holdings",
                      "EM Microelectronic-Marin (subsidiary of The Swatch Group).",
                      "The company is currently a division of the corporate group Jam Industries.",
                      "Volt Technical Resources is a business unit of Volt Workforce Solutions, a subsidiary of Volt Information Sciences (currently trading over-the-counter as VISI.)."
                      ]

irrelevant_sentences = ["The second company is deemed to be a subsidiary of the parent company.",
                        "The company has gone through more than one incarnation.",
                        "The company is owned by its employees.",
                        "Larger companies compete for market share by acquiring smaller companies that may own a particular market sector.",
                        "A parent company is a company that owns 51% or more voting stock in another firm (or subsidiary).",
                        "It is a holding company that provides services through its subsidiaries in the following areas: oil and gas, industrial and infrastructure, government and power.",
                        "RXVT Technologies is no longer a subsidiary of the parent company."
                        ]

all_sentences = relevant_sentences + irrelevant_sentences
# The query passed to the query encoder carries the "<query>: " prefix.
query = "<query>: A company is a part of a larger company."

# Sentences go through the sentence encoder, the query through the query encoder.
embeddings = encode_batch(sentence_encoder, tokenizer, all_sentences, "cpu").detach().cpu().numpy()
query_embedding = encode_batch(query_encoder, tokenizer, [query], "cpu").detach().cpu().numpy()

# Cosine similarity between the query and every sentence, sorted from most to least similar.
sims = cosine_similarity(query_embedding, embeddings)[0]
sentences_sims = list(zip(all_sentences, sims))
sentences_sims.sort(key=lambda x: x[1], reverse=True)

for s, sim in sentences_sims:
    print(s, sim)

Expected output

WHIRC – a subsidiary company of Wright-Hennepin 0.9396286
EM Microelectronic-Marin (subsidiary of The Swatch Group). 0.93929046
Fingersoft's parent company is the Finger Group. 0.936247
CK Life Sciences International (Holdings) Inc. (), or CK Life Sciences, is a subsidiary of CK Hutchison Holdings 0.9350312
The company is currently a division of the corporate group Jam Industries. 0.9273489
Volt Technical Resources is a business unit of Volt Workforce Solutions, a subsidiary of Volt Information Sciences (currently trading over-the-counter as VISI.). 0.9005086
The second company is deemed to be a subsidiary of the parent company. 0.6723645
It is a holding company that provides services through its subsidiaries in the following areas: oil and gas, industrial and infrastructure, government and power. 0.60081375
A parent company is a company that owns 51% or more voting stock in another firm (or subsidiary). 0.59490484
The company is owned by its employees. 0.55286574
RXVT Technologies is no longer a subsidiary of the parent company. 0.4321953
The company has gone through more than one incarnation. 0.38889483
Larger companies compete for market share by acquiring smaller companies that may own a particular market sector. 0.25472647
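The scores listed above can be checked for how cleanly they separate the two groups. The snippet below copies the thirteen scores from the expected output and computes the gap between the lowest-scoring relevant sentence and the highest-scoring irrelevant one. The midpoint threshold is purely illustrative for this one query, not a value recommended by the model authors.

```python
# Scores copied from the expected output, grouped by the lists in the example.
relevant_scores = [0.9396286, 0.93929046, 0.936247, 0.9350312, 0.9273489, 0.9005086]
irrelevant_scores = [0.6723645, 0.60081375, 0.59490484, 0.55286574,
                     0.4321953, 0.38889483, 0.25472647]

lowest_relevant = min(relevant_scores)
highest_irrelevant = max(irrelevant_scores)

# Gap between the two groups for this query (about 0.228).
margin = lowest_relevant - highest_irrelevant

# Illustrative cut-off halfway between the groups; an assumption, not a tuned value.
threshold = (lowest_relevant + highest_irrelevant) / 2

print(f"margin={margin:.4f} threshold={threshold:.4f}")
```

For this query every relevant sentence scores above the midpoint and every irrelevant one below it; other queries may not separate as cleanly.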