🚀 🦪⚪ PEARL-small
PEARL-small is a lightweight string embedding model for computing the semantic similarity of strings. It produces strong embeddings for tasks such as string matching, entity retrieval, entity clustering, and fuzzy joins. Unlike typical sentence embedders, it incorporates phrase type information and morphological features, which lets it capture string variations more faithfully.
🚀 Quick Start
PEARL-small is a variant of E5-small, fine-tuned on a context-free dataset that we constructed, yielding better representations for phrases and short strings.
Related links:
🤗 PEARL-small
🤗 PEARL-base
📐 PEARL Benchmark
🏆 PEARL Leaderboard
✨ Key Features
- Lightweight: a compact 34M-parameter string embedding model.
- Phrase-aware: incorporates phrase type information and morphological features to better capture string variations.
- Fine-tuned: built on E5-small and fine-tuned on a dedicated context-free dataset, producing better representations for phrases and strings.
📊 Model Comparison
| Model | Size | Avg. Score | PPDB | PPDB filtered | Turney | BIRD | YAGO | UMLS | CoNLL | BC5CDR | AutoFJ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FastText | - | 40.3 | 94.4 | 61.2 | 59.6 | 58.9 | 16.9 | 14.5 | 3.0 | 0.2 | 53.6 |
| Sentence-BERT | 110M | 50.1 | 94.6 | 66.8 | 50.4 | 62.6 | 21.6 | 23.6 | 25.5 | 48.4 | 57.2 |
| Phrase-BERT | 110M | 54.5 | 96.8 | 68.7 | 57.2 | 68.8 | 23.7 | 26.1 | 35.4 | 59.5 | 66.9 |
| E5-small | 34M | 57.0 | 96.0 | 56.8 | 55.9 | 63.1 | 43.3 | 42.0 | 27.6 | 53.7 | 74.8 |
| E5-base | 110M | 61.1 | 95.4 | 65.6 | 59.4 | 66.3 | 47.3 | 44.0 | 32.0 | 69.3 | 76.1 |
| PEARL-small | 34M | 62.5 | 97.0 | 70.2 | 57.9 | 68.1 | 48.1 | 44.5 | 42.4 | 59.3 | 75.2 |
| PEARL-base | 110M | 64.8 | 97.3 | 72.2 | 59.7 | 72.6 | 50.7 | 45.8 | 39.3 | 69.4 | 77.1 |
📈 Cost Comparison
| Model | Avg. Score | Est. Memory | GPU Speed | CPU Speed |
| --- | --- | --- | --- | --- |
| FastText | 40.3 | 1200MB | - | 57ms |
| PEARL-small | 62.5 | 68MB | 42ms | 446ms |
| PEARL-base | 64.8 | 220MB | 89ms | 1394ms |
💻 Usage Examples
Basic Usage - Sentence Transformers
PEARL is integrated with the Sentence Transformers library and can be used as follows:
```python
from sentence_transformers import SentenceTransformer, util

query_texts = ["The New York Times"]
doc_texts = ["NYTimes", "New York Post", "New York"]
input_texts = query_texts + doc_texts

model = SentenceTransformer("Lihuchen/pearl_small")
embeddings = model.encode(input_texts)

# Cosine similarity (scaled by 100) between the query and each candidate
scores = util.cos_sim(embeddings[0], embeddings[1:]) * 100
print(scores.tolist())
```
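The same embeddings can drive the other string-level tasks mentioned above, such as a fuzzy join between two lists of entity names. Below is a minimal sketch; the name lists and the 0.8 similarity threshold are illustrative assumptions, not values from the paper, so tune them for your own data:

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical name lists used only to illustrate a fuzzy join.
left_names = ["The New York Times", "Apple Inc.", "Massachusetts Institute of Technology"]
right_names = ["NYTimes", "Apple", "MIT", "New York Post"]

model = SentenceTransformer("Lihuchen/pearl_small")
left_emb = model.encode(left_names, convert_to_tensor=True)
right_emb = model.encode(right_names, convert_to_tensor=True)

# Pairwise cosine similarities: shape (len(left_names), len(right_names)).
sim = util.cos_sim(left_emb, right_emb)

# Keep each left name's best match if it clears the (assumed) threshold.
threshold = 0.8
for i, name in enumerate(left_names):
    j = int(sim[i].argmax())
    score = float(sim[i][j])
    if score >= threshold:
        print(f"{name} -> {right_names[j]} (cosine similarity {score:.2f})")
```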
Advanced Usage - Transformers
You can also use PEARL with the transformers library. Here is an entity retrieval example:
```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Mean-pool token embeddings, ignoring padding positions.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


def encode_text(model, input_texts):
    # Uses the module-level tokenizer defined below.
    batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
    outputs = model(**batch_dict)
    embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
    return embeddings


query_texts = ["The New York Times"]
doc_texts = ["NYTimes", "New York Post", "New York"]
input_texts = query_texts + doc_texts

tokenizer = AutoTokenizer.from_pretrained('Lihuchen/pearl_small')
model = AutoModel.from_pretrained('Lihuchen/pearl_small')

embeddings = encode_text(model, input_texts)
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
```
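Note that with the raw transformers API you handle pooling and scoring yourself: average_pool mean-pools the token embeddings while masking out padding, and after L2 normalization the matrix product of the query and candidate embeddings equals their cosine similarity (scaled here by 100), matching the Sentence Transformers example above.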
📚 Documentation
For details on training and evaluation, please see our code on GitHub.
📄 License
This project is released under the Apache-2.0 license.
📖 Citation
If you find our work useful, please cite:
```bibtex
@inproceedings{chen2024learning,
  title={Learning High-Quality and General-Purpose Phrase Representations},
  author={Chen, Lihu and Varoquaux, Gael and Suchanek, Fabian},
  booktitle={Findings of the Association for Computational Linguistics: EACL 2024},
  pages={983--994},
  year={2024}
}
```