🚀 PEARL-small
PEARL-small is a lightweight string embedding model. It excels at semantic similarity computation for strings, making it ideal for tasks such as string matching, entity retrieval, entity clustering, and fuzzy joins. Unlike typical sentence embedders, it incorporates phrase type information and morphological features, so it better captures string variations. It is a variant of E5-small, fine-tuned on a context-free dataset to produce superior phrase and string representations.
🚀 Quick Start
PEARL-small offers an efficient solution for string-related tasks. It is based on the paper Learning High-Quality and General-Purpose Phrase Representations, accepted to Findings of EACL 2024. The model was developed by Lihu Chen, [Gaël Varoquaux](https://gael-varoquaux.info/), and Fabian M. Suchanek.
✨ Features
- Lightweight: With a relatively small size, it can be deployed efficiently.
- Innovative Design: Incorporates phrase type information and morphological features for better string representation.
- Versatile: Suitable for multiple string-related tasks such as string matching, entity retrieval, and fuzzy joins, as sketched below.
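As a concrete illustration of the fuzzy-join use case, here is a minimal sketch built on the Sentence Transformers API (the same model ID as in the usage examples further down; the two company-name lists are hypothetical toy data):

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical example: join two lists of company-name variants
left = ["Google LLC", "Amazon.com Inc", "Microsoft Corp"]
right = ["google", "MSFT Microsoft", "amazon"]

model = SentenceTransformer("Lihuchen/pearl_small")
left_emb = model.encode(left, convert_to_tensor=True)
right_emb = model.encode(right, convert_to_tensor=True)

# Pairwise cosine similarities: one row per left string
sim = util.cos_sim(left_emb, right_emb)

# Greedy match: each left string joins its most similar right string
for i, name in enumerate(left):
    j = int(sim[i].argmax())
    print(f"{name} -> {right[j]} ({sim[i][j].item():.2f})")
```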
📚 Documentation
Model Comparison
| Model | Size | Avg | PPDB | PPDB filtered | Turney | BIRD | YAGO | UMLS | CoNLL | BC5CDR | AutoFJ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FastText | - | 40.3 | 94.4 | 61.2 | 59.6 | 58.9 | 16.9 | 14.5 | 3.0 | 0.2 | 53.6 |
| Sentence-BERT | 110M | 50.1 | 94.6 | 66.8 | 50.4 | 62.6 | 21.6 | 23.6 | 25.5 | 48.4 | 57.2 |
| Phrase-BERT | 110M | 54.5 | 96.8 | 68.7 | 57.2 | 68.8 | 23.7 | 26.1 | 35.4 | 59.5 | 66.9 |
| E5-small | 34M | 57.0 | 96.0 | 56.8 | 55.9 | 63.1 | 43.3 | 42.0 | 27.6 | 53.7 | 74.8 |
| E5-base | 110M | 61.1 | 95.4 | 65.6 | 59.4 | 66.3 | 47.3 | 44.0 | 32.0 | 69.3 | 76.1 |
| PEARL-small | 34M | 62.5 | 97.0 | 70.2 | 57.9 | 68.1 | 48.1 | 44.5 | 42.4 | 59.3 | 75.2 |
| PEARL-base | 110M | 64.8 | 97.3 | 72.2 | 59.7 | 72.6 | 50.7 | 45.8 | 39.3 | 69.4 | 77.1 |
Cost Comparison
Cost comparison of FastText and PEARL. Estimated memory is computed from the number of parameters stored in float16. Inference speed is reported in ms per 512 samples. The FastText model used here is crawl-300d-2M-subword.bin.
| Model | Avg Score | Estimated Memory | Speed (GPU) | Speed (CPU) |
|---|---|---|---|---|
| FastText | 40.3 | 1200MB | - | 57ms |
| PEARL-small | 62.5 | 68MB | 42ms | 446ms |
| PEARL-base | 64.8 | 220MB | 89ms | 1394ms |
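As a rough sanity check of the speed numbers above, one can time encoding of 512 samples directly. This is a minimal sketch using dummy strings; measured values will vary with hardware, batch size, and string length:

```python
import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Lihuchen/pearl_small")
samples = [f"entity {i}" for i in range(512)]  # 512 short dummy strings

model.encode(samples)  # warm-up pass
start = time.perf_counter()
model.encode(samples)
print(f"{(time.perf_counter() - start) * 1000:.0f} ms / 512 samples")
```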
💻 Usage Examples
Basic Usage
Sentence Transformers
```python
from sentence_transformers import SentenceTransformer, util

query_texts = ["The New York Times"]
doc_texts = ["NYTimes", "New York Post", "New York"]
input_texts = query_texts + doc_texts

model = SentenceTransformer("Lihuchen/pearl_small")
embeddings = model.encode(input_texts)

# Cosine similarity between the query and each candidate, scaled to 0-100
scores = util.cos_sim(embeddings[0], embeddings[1:]) * 100
print(scores.tolist())
```
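This prints one similarity score per candidate; "NYTimes" should come out on top, since PEARL is trained to map surface variants of the same entity close together.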
Transformers
```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Mean-pool token embeddings, ignoring padding positions
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

def encode_text(model, input_texts):
    batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
    outputs = model(**batch_dict)
    embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
    return embeddings

query_texts = ["The New York Times"]
doc_texts = ["NYTimes", "New York Post", "New York"]
input_texts = query_texts + doc_texts

tokenizer = AutoTokenizer.from_pretrained('Lihuchen/pearl_small')
model = AutoModel.from_pretrained('Lihuchen/pearl_small')

embeddings = encode_text(model, input_texts)
embeddings = F.normalize(embeddings, p=2, dim=1)

scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
```
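Because the embeddings are L2-normalized first, the matrix product computes cosine similarities, so the scores should match the Sentence Transformers example above.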
📄 License
This project is licensed under the Apache-2.0 license.
📚 Training and Evaluation
Have a look at our code on GitHub.
📚 Citation
If you find our work useful, please give us a citation:
@inproceedings{chen2024learning,
title={Learning High-Quality and General-Purpose Phrase Representations},
author={Chen, Lihu and Varoquaux, Gael and Suchanek, Fabian},
booktitle={Findings of the Association for Computational Linguistics: EACL 2024},
pages={983--994},
year={2024}
}
Useful Links
🤗 PEARL-small 🤗 PEARL-base
📐 PEARL Benchmark 🏆 PEARL Leaderboard