🚀 Graphlet-AI/eridu
This model is a deep fuzzy matching system for person and company names, leveraging representation learning for multilingual entity resolution. It outperforms traditional string distance methods and can be easily integrated into Python projects.
🚀 Quick Start
First, install the sentence-transformers library:

```bash
pip install -U sentence-transformers
```
Then, you can load the model and run inference:
```python
from sentence_transformers import SentenceTransformer

# Download the model from the 🤗 Hub
model = SentenceTransformer("Graphlet-AI/eridu")

sentences = [
    "Schori i Lidingö",
    "Yordan Canev",
    "ကားပေါ့ အန်နာတိုလီ",
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # 3 names x 384 embedding dimensions

# Pairwise similarity scores between all sentences
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # 3 x 3 similarity matrix
```
✨ Features
- Deep Fuzzy Matching: Capable of matching people and company names across languages and character sets.
- Representation Learning: Utilizes pre-trained text embeddings fine-tuned with contrastive learning.
- Easy Integration: Can be used in any Python project with just a few lines of code.
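To illustrate how similarity scores become match decisions, here is a minimal numpy sketch (not part of the library) that binarizes pairwise cosine similarities at the model's tuned decision threshold (0.7421, from the evaluation metrics below). The toy 2-D vectors stand in for real `model.encode` output:

```python
import numpy as np

def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between row vectors."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return normed @ normed.T

def match_decisions(similarities: np.ndarray, threshold: float = 0.7421) -> np.ndarray:
    """Binarize similarities at the tuned threshold: True = same entity."""
    return similarities >= threshold

# Toy 2-D embeddings standing in for real 384-dim model output:
# rows 0 and 1 are near-duplicates, row 2 is unrelated
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
sims = cosine_similarity_matrix(emb)
print(match_decisions(sims))
```

Rows 0 and 1 are flagged as the same entity; row 2 matches only itself.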
📦 Installation
To use this model, you need to install the sentence-transformers library:

```bash
pip install -U sentence-transformers
```
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Graphlet-AI/eridu")

names = [
    "Russell Jurney",
    "Russ Jurney",
    "Русс Джерни",  # "Russ Jurney" in Cyrillic
]
embeddings = model.encode(names)
print(embeddings.shape)  # 3 names x 384 embedding dimensions

similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # 3 x 3 similarity matrix
print(similarities.numpy())
```
📚 Documentation
Project Eridu Overview
This project is a deep fuzzy matching system for person and company names, built for entity resolution using representation learning. It fine-tunes a pre-trained text embedding model from Hugging Face with contrastive learning on 2 million labeled pairs of person and company names from the OpenSanctions Matcher training data. The project includes a CLI utility for training the model and for comparing name pairs using cosine similarity.
Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
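The pooling layer above averages the transformer's token vectors into a single 384-dimensional sentence embedding (mean pooling). A minimal numpy sketch of masked mean pooling, using toy dimensions rather than the real model weights:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors, ignoring padding positions (mask == 0)."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Toy batch: 1 sentence, 3 token slots (the last is padding), 4-dim embeddings
tokens = np.array([[[1.0, 2.0, 3.0, 4.0],
                    [3.0, 2.0, 1.0, 0.0],
                    [9.0, 9.0, 9.0, 9.0]]])  # padding row, ignored by the mask
mask = np.array([[1, 1, 0]])
print(mean_pool(tokens, mask))  # → [[2. 2. 2. 2.]]
```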
🔧 Technical Details
Evaluation
Metrics
Binary Classification
| Metric                    | Value  |
|---------------------------|--------|
| cosine_accuracy           | 0.9843 |
| cosine_accuracy_threshold | 0.7421 |
| cosine_f1                 | 0.9761 |
| cosine_f1_threshold       | 0.7421 |
| cosine_precision          | 0.9703 |
| cosine_recall             | 0.9819 |
| cosine_ap                 | 0.9956 |
| cosine_mcc                | 0.9644 |
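As a quick sanity check on the table, the reported F1 is the harmonic mean of the reported precision and recall:

```python
precision, recall = 0.9703, 0.9819
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # → 0.9761, matching cosine_f1 above
```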
Training Details
Training Dataset
Unnamed Dataset
- Size: 2,130,621 training samples
- Columns: `sentence1`, `sentence2`, and `label`
- Approximate statistics based on the first 1,000 samples:
|         | sentence1                                     | sentence2                                     | label                         |
|---------|-----------------------------------------------|-----------------------------------------------|-------------------------------|
| type    | string                                        | string                                        | float                         |
| details | min: 3 tokens, mean: 9.32 tokens, max: 57 tokens | min: 3 tokens, mean: 9.16 tokens, max: 54 tokens | min: 0.0, mean: 0.34, max: 1.0 |
- Samples:

| sentence1 | sentence2 | label |
|-----------|-----------|-------|
| 캐스린 설리번 | Kathryn D. Sullivanová | 1.0 |
| ଶିବରାଜ ଅଧାଲରାଓ ପାଟିଲ | Aleksander Lubocki | 0.0 |
| Пырванов, Георги | アナトーリー・セルジュコフ | 0.0 |
- Loss: ContrastiveLoss with these parameters:

```json
{
    "distance_metric": "SiameseDistanceMetric.COSINE_DISTANCE",
    "margin": 0.5,
    "size_average": true
}
```
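In sentence-transformers' formulation, this loss pulls matching pairs together and pushes non-matching pairs apart until they clear the margin. A numpy sketch of the computation, assuming the distance values are cosine distances (1 − cosine similarity) as configured above:

```python
import numpy as np

def contrastive_loss(dist: np.ndarray, label: np.ndarray, margin: float = 0.5) -> float:
    """0.5 * [ y * d^2 + (1 - y) * max(margin - d, 0)^2 ], averaged (size_average=True)."""
    pos = label * dist ** 2                                   # penalize distant positives
    neg = (1 - label) * np.maximum(margin - dist, 0.0) ** 2   # penalize close negatives
    return float(0.5 * np.mean(pos + neg))

# Matching pair at distance 0.1; non-matching pair at 0.8, beyond the margin
dist = np.array([0.1, 0.8])
label = np.array([1.0, 0.0])
print(contrastive_loss(dist, label))  # → 0.0025 (only the positive pair contributes)
```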
Evaluation Dataset
Unnamed Dataset
- Size: 2,663,276 evaluation samples
- Columns: `sentence1`, `sentence2`, and `label`
- Approximate statistics based on the first 1,000 samples:
|         | sentence1                                      | sentence2                                      | label                         |
|---------|------------------------------------------------|------------------------------------------------|-------------------------------|
| type    | string                                         | string                                         | float                         |
| details | min: 3 tokens, mean: 9.34 tokens, max: 102 tokens | min: 4 tokens, mean: 9.11 tokens, max: 100 tokens | min: 0.0, mean: 0.33, max: 1.0 |
📄 License
This model is licensed under the apache-2.0
license.