🚀 🦪⚪ PEARL-small
PEARL-small is a lightweight string embedding model for computing semantic similarity between strings. It produces strong embeddings for tasks such as string matching, entity retrieval, entity clustering, and fuzzy joining. Unlike typical sentence embedders, it incorporates phrase type information and morphological features, so it captures variations of strings more effectively.
🚀 Quick Start
PEARL-small is a variant of E5-small, fine-tuned on a context-free dataset we constructed, which yields better representations for phrases and strings.
Related links:
- 🤗 PEARL-small
- 🤗 PEARL-base
- 📐 PEARL Benchmark
- 🏆 PEARL Leaderboard
✨ Key Features
- Lightweight: a compact string embedding model (34M parameters, roughly 68MB of estimated memory; see the cost comparison below).
- Richer string signals: incorporates phrase type information and morphological features to better capture variations of strings.
- Fine-tuned: built on E5-small and fine-tuned on a context-free dataset, producing better representations for phrases and strings.
📊 Model Comparison

| Model | Size | Avg. Score | PPDB | PPDB filtered | Turney | BIRD | YAGO | UMLS | CoNLL | BC5CDR | AutoFJ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FastText | - | 40.3 | 94.4 | 61.2 | 59.6 | 58.9 | 16.9 | 14.5 | 3.0 | 0.2 | 53.6 |
| Sentence-BERT | 110M | 50.1 | 94.6 | 66.8 | 50.4 | 62.6 | 21.6 | 23.6 | 25.5 | 48.4 | 57.2 |
| Phrase-BERT | 110M | 54.5 | 96.8 | 68.7 | 57.2 | 68.8 | 23.7 | 26.1 | 35.4 | 59.5 | 66.9 |
| E5-small | 34M | 57.0 | 96.0 | 56.8 | 55.9 | 63.1 | 43.3 | 42.0 | 27.6 | 53.7 | 74.8 |
| E5-base | 110M | 61.1 | 95.4 | 65.6 | 59.4 | 66.3 | 47.3 | 44.0 | 32.0 | 69.3 | 76.1 |
| PEARL-small | 34M | 62.5 | 97.0 | 70.2 | 57.9 | 68.1 | 48.1 | 44.5 | 42.4 | 59.3 | 75.2 |
| PEARL-base | 110M | 64.8 | 97.3 | 72.2 | 59.7 | 72.6 | 50.7 | 45.8 | 39.3 | 69.4 | 77.1 |
📈 Cost Comparison

| Model | Avg. Score | Est. Memory | Speed on GPU | Speed on CPU |
|---|---|---|---|---|
| FastText | 40.3 | 1200MB | - | 57ms |
| PEARL-small | 62.5 | 68MB | 42ms | 446ms |
| PEARL-base | 64.8 | 220MB | 89ms | 1394ms |
💻 Usage Examples
Basic usage - Sentence Transformers
PEARL is integrated with the Sentence Transformers library and can be used like this:
```python
from sentence_transformers import SentenceTransformer, util

query_texts = ["The New York Times"]
doc_texts = ["NYTimes", "New York Post", "New York"]
input_texts = query_texts + doc_texts

model = SentenceTransformer("Lihuchen/pearl_small")
embeddings = model.encode(input_texts)

scores = util.cos_sim(embeddings[0], embeddings[1:]) * 100
print(scores.tolist())
```
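Since the introduction lists fuzzy joining among PEARL's target tasks, here is a minimal fuzzy-join sketch built on the same Sentence Transformers API. The two string lists and the one-best matching strategy are illustrative assumptions, not part of the official example.

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical columns to be fuzzy-joined (illustrative data only)
left = ["The New York Times", "Intl. Business Machines", "Procter and Gamble"]
right = ["NYTimes", "IBM", "Procter & Gamble", "New York Post"]

model = SentenceTransformer("Lihuchen/pearl_small")
left_emb = model.encode(left, convert_to_tensor=True)
right_emb = model.encode(right, convert_to_tensor=True)

# Cosine similarity between every left/right pair; keep the best match per left string
sim = util.cos_sim(left_emb, right_emb)
best_scores, best_idx = sim.max(dim=1)
for text, score, idx in zip(left, best_scores.tolist(), best_idx.tolist()):
    print(f"{text!r} -> {right[idx]!r} (cosine {score:.3f})")
```

In practice you would keep only matches above a similarity threshold chosen on your own data.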
Advanced usage - Transformers
You can also use PEARL with the transformers library. Below is an entity retrieval example:
```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Zero out padding tokens, then average the remaining token embeddings
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


def encode_text(model, input_texts):
    batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
    outputs = model(**batch_dict)
    embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
    return embeddings


query_texts = ["The New York Times"]
doc_texts = ["NYTimes", "New York Post", "New York"]
input_texts = query_texts + doc_texts

tokenizer = AutoTokenizer.from_pretrained('Lihuchen/pearl_small')
model = AutoModel.from_pretrained('Lihuchen/pearl_small')

embeddings = encode_text(model, input_texts)
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
```
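The introduction also mentions entity clustering. The sketch below shows one way to group surface forms using PEARL embeddings; the mention list, the 0.3 distance threshold, and the use of scikit-learn's AgglomerativeClustering (the `metric` argument requires scikit-learn 1.2 or newer) are assumptions for illustration, not part of the model card.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

# Illustrative mentions to group by entity (assumed data)
mentions = ["NYC", "New York City", "The Big Apple", "Los Angeles", "LA", "L.A."]

model = SentenceTransformer("Lihuchen/pearl_small")
embeddings = model.encode(mentions, normalize_embeddings=True)

# On unit-normalized vectors, cosine distance = 1 - cosine similarity;
# the 0.3 threshold is an assumption and may need tuning per dataset
clustering = AgglomerativeClustering(
    n_clusters=None, metric="cosine", linkage="average", distance_threshold=0.3
)
labels = clustering.fit_predict(embeddings)
for mention, label in zip(mentions, labels):
    print(label, mention)
```

Mentions assigned the same label are treated as surface forms of the same entity.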
📚 Documentation
For details on training and evaluation, please check out our code on GitHub.
📄 License
This project is licensed under the Apache-2.0 license.
📖 Citation
If you find our work useful, please cite:
```bibtex
@inproceedings{chen2024learning,
  title={Learning High-Quality and General-Purpose Phrase Representations},
  author={Chen, Lihu and Varoquaux, Gael and Suchanek, Fabian},
  booktitle={Findings of the Association for Computational Linguistics: EACL 2024},
  pages={983--994},
  year={2024}
}
```