🚀 🦪⚪ PEARL-small
PEARL-small is a lightweight string embedding model for computing the semantic similarity of strings. It produces strong embeddings for tasks such as string matching, entity retrieval, entity clustering, and fuzzy joins. Unlike typical sentence embedders, it incorporates phrase type information and morphological features, which lets it capture string variations more faithfully.
🚀 Quick Start
PEARL-small is a variant of E5-small, fine-tuned on a context-free dataset that we constructed, yielding better representations for phrases and short strings.
Related links:
🤗 PEARL-small
🤗 PEARL-base
📐 PEARL Benchmark
🏆 PEARL Leaderboard
✨ Key Features
- Lightweight: a compact 34M-parameter string embedding model.
- Phrase-aware: incorporates phrase type information and morphological features to better capture string variations.
- Fine-tuned: built on E5-small and fine-tuned on a dedicated context-free dataset, producing better representations for phrases and strings.
📊 Model Comparison
| Model | Size | Avg. Score | PPDB | PPDB filtered | Turney | BIRD | YAGO | UMLS | CoNLL | BC5CDR | AutoFJ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FastText | - | 40.3 | 94.4 | 61.2 | 59.6 | 58.9 | 16.9 | 14.5 | 3.0 | 0.2 | 53.6 |
| Sentence-BERT | 110M | 50.1 | 94.6 | 66.8 | 50.4 | 62.6 | 21.6 | 23.6 | 25.5 | 48.4 | 57.2 |
| Phrase-BERT | 110M | 54.5 | 96.8 | 68.7 | 57.2 | 68.8 | 23.7 | 26.1 | 35.4 | 59.5 | 66.9 |
| E5-small | 34M | 57.0 | 96.0 | 56.8 | 55.9 | 63.1 | 43.3 | 42.0 | 27.6 | 53.7 | 74.8 |
| E5-base | 110M | 61.1 | 95.4 | 65.6 | 59.4 | 66.3 | 47.3 | 44.0 | 32.0 | 69.3 | 76.1 |
| PEARL-small | 34M | 62.5 | 97.0 | 70.2 | 57.9 | 68.1 | 48.1 | 44.5 | 42.4 | 59.3 | 75.2 |
| PEARL-base | 110M | 64.8 | 97.3 | 72.2 | 59.7 | 72.6 | 50.7 | 45.8 | 39.3 | 69.4 | 77.1 |
📈 Cost Comparison
| Model | Avg. Score | Est. Memory | GPU Speed | CPU Speed |
| --- | --- | --- | --- | --- |
| FastText | 40.3 | 1200MB | - | 57ms |
| PEARL-small | 62.5 | 68MB | 42ms | 446ms |
| PEARL-base | 64.8 | 220MB | 89ms | 1394ms |
💻 Usage Examples
Basic Usage - Sentence Transformers
PEARL is integrated with the Sentence Transformers library and can be used as follows:
```python
from sentence_transformers import SentenceTransformer, util

query_texts = ["The New York Times"]
doc_texts = ["NYTimes", "New York Post", "New York"]
input_texts = query_texts + doc_texts

model = SentenceTransformer("Lihuchen/pearl_small")
embeddings = model.encode(input_texts)

# Cosine similarity (scaled by 100) between the query and each candidate
scores = util.cos_sim(embeddings[0], embeddings[1:]) * 100
print(scores.tolist())
```
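The same embeddings can drive the other string-level tasks mentioned above, such as a fuzzy join between two lists of entity names. Below is a minimal sketch; the name lists and the 0.8 similarity threshold are illustrative assumptions, not values from the paper, so tune them for your own data:

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical name lists used only to illustrate a fuzzy join.
left_names = ["The New York Times", "Apple Inc.", "Massachusetts Institute of Technology"]
right_names = ["NYTimes", "Apple", "MIT", "New York Post"]

model = SentenceTransformer("Lihuchen/pearl_small")
left_emb = model.encode(left_names, convert_to_tensor=True)
right_emb = model.encode(right_names, convert_to_tensor=True)

# Pairwise cosine similarities: shape (len(left_names), len(right_names)).
sim = util.cos_sim(left_emb, right_emb)

# Keep each left name's best match if it clears the (assumed) threshold.
threshold = 0.8
for i, name in enumerate(left_names):
    j = int(sim[i].argmax())
    score = float(sim[i][j])
    if score >= threshold:
        print(f"{name} -> {right_names[j]} (cosine similarity {score:.2f})")
```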
Advanced Usage - Transformers
You can also use PEARL with the transformers library. Here is an entity retrieval example:
```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Mean-pool token embeddings, ignoring padding positions.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


def encode_text(model, input_texts):
    # Uses the module-level tokenizer defined below.
    batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
    outputs = model(**batch_dict)
    embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
    return embeddings


query_texts = ["The New York Times"]
doc_texts = ["NYTimes", "New York Post", "New York"]
input_texts = query_texts + doc_texts

tokenizer = AutoTokenizer.from_pretrained('Lihuchen/pearl_small')
model = AutoModel.from_pretrained('Lihuchen/pearl_small')

embeddings = encode_text(model, input_texts)
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
```
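Note that with the raw transformers API you handle pooling and scoring yourself: average_pool mean-pools the token embeddings while masking out padding, and after L2 normalization the matrix product of the query and candidate embeddings equals their cosine similarity (scaled here by 100), matching the Sentence Transformers example above.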
📚 Documentation
For details on training and evaluation, please see our code on GitHub.
📄 License
This project is released under the Apache-2.0 license.
📖 Citation
If you find our work useful, please cite:
```bibtex
@inproceedings{chen2024learning,
  title={Learning High-Quality and General-Purpose Phrase Representations},
  author={Chen, Lihu and Varoquaux, Gael and Suchanek, Fabian},
  booktitle={Findings of the Association for Computational Linguistics: EACL 2024},
  pages={983--994},
  year={2024}
}
```