🚀 PEARL-small
PEARL-small is a lightweight string embedding model. It excels at semantic similarity computation for strings, making it ideal for tasks such as string matching, entity retrieval, entity clustering, and fuzzy joins. Unlike typical sentence embedders, it incorporates phrase type information and morphological features, so it better captures string variations. It is a variant of E5-small, fine-tuned on a context-free dataset to produce superior phrase and string representations.
🚀 Quick Start
PEARL-small offers an efficient solution for string-related tasks. It is based on the paper Learning High-Quality and General-Purpose Phrase Representations, accepted to Findings of EACL 2024. The model was developed by Lihu Chen, [Gaël Varoquaux](https://gael-varoquaux.info/), and Fabian M. Suchanek.
✨ Features
- Lightweight: With a relatively small size, it can be deployed efficiently.
- Innovative Design: Incorporates phrase type information and morphological features for better string representation.
- Versatile: Suitable for multiple string-related tasks such as string matching, entity retrieval, and fuzzy joins, as sketched below.
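As a concrete illustration of the fuzzy-join use case, here is a minimal sketch built on the Sentence Transformers API (the same model ID as in the usage examples further down; the two company-name lists are hypothetical toy data):

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical example: join two lists of company-name variants
left = ["Google LLC", "Amazon.com Inc", "Microsoft Corp"]
right = ["google", "MSFT Microsoft", "amazon"]

model = SentenceTransformer("Lihuchen/pearl_small")
left_emb = model.encode(left, convert_to_tensor=True)
right_emb = model.encode(right, convert_to_tensor=True)

# Pairwise cosine similarities: one row per left string
sim = util.cos_sim(left_emb, right_emb)

# Greedy match: each left string joins its most similar right string
for i, name in enumerate(left):
    j = int(sim[i].argmax())
    print(f"{name} -> {right[j]} ({sim[i][j].item():.2f})")
```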
📚 Documentation
Model Comparison
| Model | Size | Avg | PPDB | PPDB filtered | Turney | BIRD | YAGO | UMLS | CoNLL | BC5CDR | AutoFJ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FastText | - | 40.3 | 94.4 | 61.2 | 59.6 | 58.9 | 16.9 | 14.5 | 3.0 | 0.2 | 53.6 |
| Sentence-BERT | 110M | 50.1 | 94.6 | 66.8 | 50.4 | 62.6 | 21.6 | 23.6 | 25.5 | 48.4 | 57.2 |
| Phrase-BERT | 110M | 54.5 | 96.8 | 68.7 | 57.2 | 68.8 | 23.7 | 26.1 | 35.4 | 59.5 | 66.9 |
| E5-small | 34M | 57.0 | 96.0 | 56.8 | 55.9 | 63.1 | 43.3 | 42.0 | 27.6 | 53.7 | 74.8 |
| E5-base | 110M | 61.1 | 95.4 | 65.6 | 59.4 | 66.3 | 47.3 | 44.0 | 32.0 | 69.3 | 76.1 |
| PEARL-small | 34M | 62.5 | 97.0 | 70.2 | 57.9 | 68.1 | 48.1 | 44.5 | 42.4 | 59.3 | 75.2 |
| PEARL-base | 110M | 64.8 | 97.3 | 72.2 | 59.7 | 72.6 | 50.7 | 45.8 | 39.3 | 69.4 | 77.1 |
Cost Comparison
Cost comparison of FastText and PEARL. Estimated memory is computed from the number of parameters stored in float16. Inference speed is reported in ms per 512 samples. The FastText model used here is crawl-300d-2M-subword.bin.
| Model | Avg Score | Estimated Memory | Speed (GPU) | Speed (CPU) |
|---|---|---|---|---|
| FastText | 40.3 | 1200MB | - | 57ms |
| PEARL-small | 62.5 | 68MB | 42ms | 446ms |
| PEARL-base | 64.8 | 220MB | 89ms | 1394ms |
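As a rough sanity check of the speed numbers above, one can time encoding of 512 samples directly. This is a minimal sketch using dummy strings; measured values will vary with hardware, batch size, and string length:

```python
import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Lihuchen/pearl_small")
samples = [f"entity {i}" for i in range(512)]  # 512 short dummy strings

model.encode(samples)  # warm-up pass
start = time.perf_counter()
model.encode(samples)
print(f"{(time.perf_counter() - start) * 1000:.0f} ms / 512 samples")
```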
💻 Usage Examples
Basic Usage
Sentence Transformers
```python
from sentence_transformers import SentenceTransformer, util

query_texts = ["The New York Times"]
doc_texts = ["NYTimes", "New York Post", "New York"]
input_texts = query_texts + doc_texts

model = SentenceTransformer("Lihuchen/pearl_small")
embeddings = model.encode(input_texts)

# Cosine similarity between the query and each candidate, scaled to 0-100
scores = util.cos_sim(embeddings[0], embeddings[1:]) * 100
print(scores.tolist())
```
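This prints one similarity score per candidate; "NYTimes" should come out on top, since PEARL is trained to map surface variants of the same entity close together.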
Transformers
```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Mean-pool token embeddings, ignoring padding positions
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

def encode_text(model, input_texts):
    batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
    outputs = model(**batch_dict)
    embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
    return embeddings

query_texts = ["The New York Times"]
doc_texts = ["NYTimes", "New York Post", "New York"]
input_texts = query_texts + doc_texts

tokenizer = AutoTokenizer.from_pretrained('Lihuchen/pearl_small')
model = AutoModel.from_pretrained('Lihuchen/pearl_small')

embeddings = encode_text(model, input_texts)
embeddings = F.normalize(embeddings, p=2, dim=1)

scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
```
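Because the embeddings are L2-normalized first, the matrix product computes cosine similarities, so the scores should match the Sentence Transformers example above.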
📄 License
This project is licensed under the Apache-2.0 license.
📚 Training and Evaluation
Have a look at our code on GitHub.
📚 Citation
If you find our work useful, please give us a citation:
@inproceedings{chen2024learning,
title={Learning High-Quality and General-Purpose Phrase Representations},
author={Chen, Lihu and Varoquaux, Gael and Suchanek, Fabian},
booktitle={Findings of the Association for Computational Linguistics: EACL 2024},
pages={983--994},
year={2024}
}
Useful Links
🤗 PEARL-small 🤗 PEARL-base
📐 PEARL Benchmark 🏆 PEARL Leaderboard