🚀 Graphlet-AI/eridu
This model is a deep fuzzy matching system for person and company names, leveraging representation learning for multilingual entity resolution. It outperforms traditional string distance methods and can be easily integrated into Python projects.
🚀 Quick Start
First, install the sentence-transformers library:

```bash
pip install -U sentence-transformers
```
Then, you can load the model and run inference:
```python
from sentence_transformers import SentenceTransformer

# Download the model from the 🤗 Hub
model = SentenceTransformer("Graphlet-AI/eridu")

sentences = [
    "Schori i Lidingö",
    "Yordan Canev",
    "ကားပေါ့ အန်နာတိုလီ",
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # 3 names x 384 embedding dimensions

# Pairwise similarity scores between all sentences
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # 3 x 3 similarity matrix
```
✨ Features
- Deep Fuzzy Matching: Capable of matching people and company names across languages and character sets.
- Representation Learning: Utilizes pre-trained text embeddings fine-tuned with contrastive learning.
- Easy Integration: Can be used in any Python project with just a few lines of code.
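To illustrate how similarity scores become match decisions, here is a minimal numpy sketch (not part of the library) that binarizes pairwise cosine similarities at the model's tuned decision threshold (0.7421, from the evaluation metrics below). The toy 2-D vectors stand in for real `model.encode` output:

```python
import numpy as np

def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between row vectors."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return normed @ normed.T

def match_decisions(similarities: np.ndarray, threshold: float = 0.7421) -> np.ndarray:
    """Binarize similarities at the tuned threshold: True = same entity."""
    return similarities >= threshold

# Toy 2-D embeddings standing in for real 384-dim model output:
# rows 0 and 1 are near-duplicates, row 2 is unrelated
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
sims = cosine_similarity_matrix(emb)
print(match_decisions(sims))
```

Rows 0 and 1 are flagged as the same entity; row 2 matches only itself.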
📦 Installation
To use this model, you need to install the sentence-transformers library:

```bash
pip install -U sentence-transformers
```
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Graphlet-AI/eridu")

names = [
    "Russell Jurney",
    "Russ Jurney",
    "Русс Джерни",  # "Russ Jurney" in Cyrillic
]
embeddings = model.encode(names)
print(embeddings.shape)  # 3 names x 384 embedding dimensions

similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # 3 x 3 similarity matrix
print(similarities.numpy())
```
📚 Documentation
Project Eridu Overview
This project is a deep fuzzy matching system for person and company names, built for entity resolution using representation learning. It fine-tunes a pre-trained text embedding model from Hugging Face with contrastive learning on 2 million labeled pairs of person and company names from the OpenSanctions Matcher training data. The project includes a CLI utility for training the model and for comparing name pairs using cosine similarity.
Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
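The pooling layer above averages the transformer's token vectors into a single 384-dimensional sentence embedding (mean pooling). A minimal numpy sketch of masked mean pooling, using toy dimensions rather than the real model weights:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors, ignoring padding positions (mask == 0)."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Toy batch: 1 sentence, 3 token slots (the last is padding), 4-dim embeddings
tokens = np.array([[[1.0, 2.0, 3.0, 4.0],
                    [3.0, 2.0, 1.0, 0.0],
                    [9.0, 9.0, 9.0, 9.0]]])  # padding row, ignored by the mask
mask = np.array([[1, 1, 0]])
print(mean_pool(tokens, mask))  # → [[2. 2. 2. 2.]]
```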
🔧 Technical Details
Evaluation
Metrics
Binary Classification
| Metric                    | Value  |
|---------------------------|--------|
| cosine_accuracy           | 0.9843 |
| cosine_accuracy_threshold | 0.7421 |
| cosine_f1                 | 0.9761 |
| cosine_f1_threshold       | 0.7421 |
| cosine_precision          | 0.9703 |
| cosine_recall             | 0.9819 |
| cosine_ap                 | 0.9956 |
| cosine_mcc                | 0.9644 |
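As a quick sanity check on the table, the reported F1 is the harmonic mean of the reported precision and recall:

```python
precision, recall = 0.9703, 0.9819
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # → 0.9761, matching cosine_f1 above
```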
Training Details
Training Dataset
Unnamed Dataset
- Size: 2,130,621 training samples
- Columns: `sentence1`, `sentence2`, and `label`
- Approximate statistics based on the first 1,000 samples:
|         | sentence1                                     | sentence2                                     | label                         |
|---------|-----------------------------------------------|-----------------------------------------------|-------------------------------|
| type    | string                                        | string                                        | float                         |
| details | min: 3 tokens, mean: 9.32 tokens, max: 57 tokens | min: 3 tokens, mean: 9.16 tokens, max: 54 tokens | min: 0.0, mean: 0.34, max: 1.0 |
- Samples:

| sentence1 | sentence2 | label |
|-----------|-----------|-------|
| 캐스린 설리번 | Kathryn D. Sullivanová | 1.0 |
| ଶିବରାଜ ଅଧାଲରାଓ ପାଟିଲ | Aleksander Lubocki | 0.0 |
| Пырванов, Георги | アナトーリー・セルジュコフ | 0.0 |
- Loss: ContrastiveLoss with these parameters:

```json
{
    "distance_metric": "SiameseDistanceMetric.COSINE_DISTANCE",
    "margin": 0.5,
    "size_average": true
}
```
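In sentence-transformers' formulation, this loss pulls matching pairs together and pushes non-matching pairs apart until they clear the margin. A numpy sketch of the computation, assuming the distance values are cosine distances (1 − cosine similarity) as configured above:

```python
import numpy as np

def contrastive_loss(dist: np.ndarray, label: np.ndarray, margin: float = 0.5) -> float:
    """0.5 * [ y * d^2 + (1 - y) * max(margin - d, 0)^2 ], averaged (size_average=True)."""
    pos = label * dist ** 2                                   # penalize distant positives
    neg = (1 - label) * np.maximum(margin - dist, 0.0) ** 2   # penalize close negatives
    return float(0.5 * np.mean(pos + neg))

# Matching pair at distance 0.1; non-matching pair at 0.8, beyond the margin
dist = np.array([0.1, 0.8])
label = np.array([1.0, 0.0])
print(contrastive_loss(dist, label))  # → 0.0025 (only the positive pair contributes)
```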
Evaluation Dataset
Unnamed Dataset
- Size: 2,663,276 evaluation samples
- Columns: `sentence1`, `sentence2`, and `label`
- Approximate statistics based on the first 1,000 samples:
|         | sentence1                                      | sentence2                                      | label                         |
|---------|------------------------------------------------|------------------------------------------------|-------------------------------|
| type    | string                                         | string                                         | float                         |
| details | min: 3 tokens, mean: 9.34 tokens, max: 102 tokens | min: 4 tokens, mean: 9.11 tokens, max: 100 tokens | min: 0.0, mean: 0.33, max: 1.0 |
📄 License
This model is licensed under the apache-2.0
license.