abstract-sim-query open-source model: maps abstract sentence descriptions to matching sentences


Abstract Sim Query

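Developed by biu-nlp

A model that maps abstract sentence descriptions to sentences that fit those descriptions. It is trained on Wikipedia and uses a dual-encoder architecture.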

Text Embedding

Transformers

English #abstract sentence matching #dual-encoder architecture #Wikipedia-trained

Downloads: 53

Released: 5/13/2023

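Model overview

This model encodes an abstract query sentence into a vector. That vector is compared, by similarity, against the vectors produced by the sentence encoder in order to find the sentences that match the query description.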
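The comparison step can be sketched with toy vectors. This is a minimal illustration, assuming cosine similarity as the similarity measure (as in the usage example further down); `rank_by_cosine` and the 3-dimensional vectors are made up for this sketch, and real embeddings would come from the two encoders.

```python
import numpy as np

def rank_by_cosine(query_vec, sentence_vecs):
    """Return sentence indices sorted by descending cosine similarity, plus the scores."""
    q = query_vec / np.linalg.norm(query_vec)
    s = sentence_vecs / np.linalg.norm(sentence_vecs, axis=1, keepdims=True)
    scores = s @ q                    # one cosine score per sentence
    order = np.argsort(-scores)       # highest score first
    return order, scores

# Toy vectors standing in for encoder outputs (hypothetical values).
query_vec = np.array([1.0, 0.0, 0.0])
sentence_vecs = np.array([
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [2.0, 0.0, 0.0],   # same direction as the query
    [1.0, 1.0, 0.0],   # 45 degrees away from the query
])
order, scores = rank_by_cosine(query_vec, sentence_vecs)
```

Because cosine similarity ignores vector length, the second toy sentence ranks first even though its vector is twice as long as the query vector.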

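Model features

Dual-encoder architecture

Uses separate query and sentence encoders, so that each is optimized for its own type of text.

Abstract description matching

Optimized specifically for matching abstract query descriptions against concrete sentences.

Trained on Wikipedia

Trained on Wikipedia data, which makes it well suited to similarity computation over encyclopedic text.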

Model capabilities

Sentence vectorization

Semantic similarity computation

Abstract query matching

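Use cases

Information retrieval

Company relationship queries

Given an abstract description (e.g. 'A company is a part of a larger company'), find sentences that describe a subsidiary relationship.

Sentences that describe subsidiary relationships are retrieved with higher similarity scores than unrelated sentences.

Knowledge base construction

Relational fact extraction

Extract sentences from text that fit a specific relation pattern.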

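🚀 Abstract sentence matching model

This model maps abstract sentence descriptions to sentences that fit those descriptions. It is trained on Wikipedia data.

🚀 Quick start

Load the query encoder and the sentence encoder with load_finetuned_model(), then encode sentences with encode_batch().

Note: this method uses a dual-encoder architecture. This model is the query encoder; it should be used together with the sentence encoder (biu-nlp/abstract-sim-sentence).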
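In the advanced usage example, the text passed to the query encoder starts with `<query>: `, while the sentences passed to the sentence encoder are left as they are. A small helper can keep that convention in one place. `format_query` is a hypothetical name, not part of the model's API, and whether the prefix is strictly required is an assumption based on that example alone.

```python
QUERY_PREFIX = "<query>: "  # prefix as it appears in the usage example

def format_query(description: str) -> str:
    """Prepend the query prefix unless the text already carries it."""
    description = description.strip()
    if description.startswith(QUERY_PREFIX.strip()):
        return description
    return QUERY_PREFIX + description

query = format_query("A company is a part of a larger company.")
```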

💻 Usage examples

Basic usage

from typing import List

import torch
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer, AutoModel


def load_finetuned_model():
    # The two encoders are separate checkpoints; both share the sentence encoder's tokenizer.
    sentence_encoder = AutoModel.from_pretrained("biu-nlp/abstract-sim-sentence")
    query_encoder = AutoModel.from_pretrained("biu-nlp/abstract-sim-query")
    tokenizer = AutoTokenizer.from_pretrained("biu-nlp/abstract-sim-sentence")

    return tokenizer, query_encoder, sentence_encoder


def encode_batch(model, tokenizer, sentences: List[str], device: str):
    model.to(device)  # keep the model and its inputs on the same device
    # Pad to the longest sentence so the batch forms a single tensor.
    inputs = tokenizer(sentences, padding=True, max_length=512, truncation=True, return_tensors="pt",
                       add_special_tokens=True).to(device)
    features = model(**inputs)[0]  # token-level outputs: (batch, seq_len, hidden)
    # Mean-pool over non-padding tokens, skipping the first (special) token.
    mask = inputs["attention_mask"][:, 1:]
    features = torch.sum(features[:, 1:, :] * mask.unsqueeze(-1), dim=1) / torch.clamp(
        torch.sum(mask, dim=1, keepdim=True), min=1e-9)
    return features
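The pooling step in encode_batch() averages the token vectors of each sentence while leaving out the first token and any padding positions. The sketch below reproduces that arithmetic on a tiny hand-made array, using NumPy instead of PyTorch so it runs without downloading the model; `masked_mean_skip_first` and the input values are illustrative assumptions.

```python
import numpy as np

def masked_mean_skip_first(features, attention_mask):
    """Average token vectors per sentence, ignoring position 0 and padded positions."""
    feats = features[:, 1:, :]                               # drop the first token
    mask = attention_mask[:, 1:, None].astype(feats.dtype)   # (batch, seq_len - 1, 1)
    summed = (feats * mask).sum(axis=1)                      # padded positions add zero
    counts = np.clip(mask.sum(axis=1), 1e-9, None)           # guard against division by zero
    return summed / counts

# One sentence, four positions: first token, two real tokens, one padding position.
features = np.array([[[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = np.array([[1, 1, 1, 0]])
pooled = masked_mean_skip_first(features, attention_mask)
```

Only the two middle tokens contribute here, so the pooled vector is their plain average; the large values at the padding position have no effect.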

Advanced usage

tokenizer, query_encoder, sentence_encoder = load_finetuned_model()
relevant_sentences = ["Fingersoft's parent company is the Finger Group.",
                      "WHIRC – a subsidiary company of Wright-Hennepin",
                      "CK Life Sciences International (Holdings) Inc. (), or CK Life Sciences, is a subsidiary of CK Hutchison Holdings",
                      "EM Microelectronic-Marin (subsidiary of The Swatch Group).",
                      "The company is currently a division of the corporate group Jam Industries.",
                      "Volt Technical Resources is a business unit of Volt Workforce Solutions, a subsidiary of Volt Information Sciences (currently trading over-the-counter as VISI.)."
             ]

irrelevant_sentences = ["The second company is deemed to be a subsidiary of the parent company.",
                        "The company has gone through more than one incarnation.",
                        "The company is owned by its employees.",
                        "Larger companies compete for market share by acquiring smaller companies that may own a particular market sector.",
                        "A parent company is a company that owns 51% or more voting stock in another firm (or subsidiary).",
                        "It is a holding company that provides services through its subsidiaries in the following areas: oil and gas, industrial and infrastructure, government and power.",
                        "RXVT Technologies is no longer a subsidiary of the parent company."
                        ]

all_sentences = relevant_sentences + irrelevant_sentences
# The query text carries the "<query>: " prefix.
query = "<query>: A company is a part of a larger company."

# Sentences go through the sentence encoder, the query through the query encoder.
embeddings = encode_batch(sentence_encoder, tokenizer, all_sentences, "cpu").detach().cpu().numpy()
query_embedding = encode_batch(query_encoder, tokenizer, [query], "cpu").detach().cpu().numpy()

# Rank all sentences by cosine similarity to the query, highest first.
sims = cosine_similarity(query_embedding, embeddings)[0]
sentences_sims = list(zip(all_sentences, sims))
sentences_sims.sort(key=lambda x: x[1], reverse=True)

for s, sim in sentences_sims:
    print(s, sim)

Expected output

WHIRC – a subsidiary company of Wright-Hennepin 0.9396286
EM Microelectronic-Marin (subsidiary of The Swatch Group). 0.93929046
Fingersoft's parent company is the Finger Group. 0.936247
CK Life Sciences International (Holdings) Inc. (), or CK Life Sciences, is a subsidiary of CK Hutchison Holdings 0.9350312
The company is currently a division of the corporate group Jam Industries. 0.9273489
Volt Technical Resources is a business unit of Volt Workforce Solutions, a subsidiary of Volt Information Sciences (currently trading over-the-counter as VISI.). 0.9005086
The second company is deemed to be a subsidiary of the parent company. 0.6723645
It is a holding company that provides services through its subsidiaries in the following areas: oil and gas, industrial and infrastructure, government and power. 0.60081375
A parent company is a company that owns 51% or more voting stock in another firm (or subsidiary). 0.59490484
The company is owned by its employees. 0.55286574
RXVT Technologies is no longer a subsidiary of the parent company. 0.4321953
The company has gone through more than one incarnation. 0.38889483
Larger companies compete for market share by acquiring smaller companies that may own a particular market sector. 0.25472647
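In this output the six relevant sentences all score above 0.90 and the seven irrelevant ones all score below 0.68, so a single cutoff separates the two groups. The sketch below checks this on the scores copied from the output above. The cutoff of 0.8 is a hypothetical value picked by eye for this one query; nothing in the source says it carries over to other queries.

```python
# Scores copied from the expected output above, in ranked order.
scores = [0.9396286, 0.93929046, 0.936247, 0.9350312, 0.9273489, 0.9005086,
          0.6723645, 0.60081375, 0.59490484, 0.55286574, 0.4321953, 0.38889483,
          0.25472647]

THRESHOLD = 0.8  # hypothetical cutoff, not a value recommended by the model authors

matches = [s for s in scores if s >= THRESHOLD]
rejected = [s for s in scores if s < THRESHOLD]

# Gap between the weakest match and the strongest rejected sentence.
margin = min(matches) - max(rejected)
```

A wide margin like this one leaves room for error when choosing the cutoff; for other queries the gap would need to be checked on labelled examples before a threshold is fixed.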

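📄 Model information

Property	Details
Model type	Maps abstract sentence descriptions to sentences that fit those descriptions
Training data	Wikipedia
Tags	Feature extraction, sentence similarity
Dataset	biu-nlp/abstract-sim
Widgets	Sentence similarity, feature extraction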