🚀 {MODEL_NAME}
This is a LinkTransformer model. At its core, it is a sentence-transformers model wrapped in the LinkTransformer class. It is designed for quick and easy record linkage (entity matching) via the LinkTransformer package, with tasks including clustering, deduplication, linking, aggregation, and more. It can also be used for any sentence-similarity task within the sentence-transformers framework. The model maps sentences and paragraphs to a 768-dimensional dense vector space and can be applied to tasks like clustering or semantic search. Check the sentence-transformers documentation if you want to use this model beyond what our applications support.
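Outside the LinkTransformer package, you can load the model directly with sentence-transformers. Here is a minimal sketch; it assumes the model is published on the Hugging Face Hub under the ID shown as the {MODEL_NAME} placeholder, and the company names are illustrative:

from sentence_transformers import SentenceTransformer, util

# Load the model by its Hub ID ({MODEL_NAME} is the placeholder used in this card).
model = SentenceTransformer("{MODEL_NAME}")

# Encode two company-name strings into 768-dimensional vectors.
embeddings = model.encode(["International Business Machines", "IBM"])

# Cosine similarity between the two embeddings; higher suggests a likelier match.
print(util.cos_sim(embeddings[0], embeddings[1]))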
🚀 Quick Start
✨ Features
- It's a LinkTransformer model, which is essentially a sentence-transformers model wrapper.
- Designed for quick and easy record linkage (entity-matching), including tasks like clustering, deduplication, linking, and aggregation.
- Can be used for any sentence similarity task within the sentence-transformers framework.
- Maps sentences and paragraphs to a 768-dimensional dense vector space for tasks like clustering or semantic search.
📦 Installation
Using this model is easy once you have LinkTransformer installed:
pip install -U linktransformer
💻 Usage Examples
Basic Usage
import linktransformer as lt
import pandas as pd

# Load the two tables to be linked (paths are illustrative).
df1 = pd.read_csv("data/df1.csv")
df2 = pd.read_csv("data/df2.csv")

# Semantic merge of the two tables on the CompanyName column.
df_merged = lt.merge(df1, df2, on="CompanyName", how="inner")
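To pin this particular model rather than the package default, pass its Hub ID through the model argument. A sketch (the merge_type="1:m" setting is an illustrative choice; see the package docs for the supported options):

# Link each row of df1 to candidate rows of df2 using this model's embeddings.
df_matched = lt.merge(df1, df2, merge_type="1:m", on="CompanyName", model="{MODEL_NAME}")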
🔧 Technical Details
This model was fine-tuned from multi-qa-mpnet-base-dot-v1, a base model pretrained for English. It was trained with the LinkTransformer framework on a dataset of company aliases drawn from Wikidata, for 100 epochs, with the remaining defaults recorded in the repo's LinkTransformer config file, LT_training_config.json.
DataLoader:
torch.utils.data.dataloader.DataLoader
of length 2087 with parameters:
{'batch_size': 64, 'sampler': 'torch.utils.data.dataloader._InfiniteConstantSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
Loss:
linktransformer.modified_sbert.losses.SupConLoss_wandb
Parameters of the fit()-Method:
{
"epochs": 100,
"evaluation_steps": 1044,
"evaluator": "sentence_transformers.evaluation.SequentialEvaluator.SequentialEvaluator",
"max_grad_norm": 1,
"optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
"optimizer_params": {
"lr": 2e-05
},
"scheduler": "WarmupLinear",
"steps_per_epoch": null,
"warmup_steps": 208700,
"weight_decay": 0.01
}
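Note that warmup_steps (208,700) equals the total number of training steps (2,087 batches per epoch × 100 epochs), so with the WarmupLinear scheduler the learning rate is still warming up for the entire run.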
LinkTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
)
Training your own LinkTransformer model
Any sentence-transformers model can be used as a backbone; any other transformer on Hugging Face can also be used by setting the option add_pooling_layer=True, which adds the required pooling layer on top. The model was trained using SupCon loss; usage details are in the package docs. The training config is in this repo under the name LT_training_config.json. To replicate the training, download that file and pass its path as the config_path argument of the training function. You can also override individual settings via the training_args argument. Here is an example:
import linktransformer as lt

# dataset_path and LINKAGE_CONFIG_PATH are placeholders from the package example;
# point them at your training CSV and the downloaded LT_training_config.json.
saved_model_path = lt.train_model(
    model_path="hiiamsid/sentence_similarity_spanish_es",
    dataset_path=dataset_path,
    left_col_names=["description47"],
    right_col_names=["description48"],
    left_id_name=["tariffcode47"],
    right_id_name=["tariffcode48"],
    log_wandb=False,
    config_path=LINKAGE_CONFIG_PATH,
    training_args={"num_epochs": 1},  # overrides the epoch count from the config
)
You can also use this package for deduplication (it clusters a dataframe on the supplied key column). Merging a fine class (like product) to a coarse class (like HS code) is also possible. Read our paper and the documentation for more!
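As a concrete sketch of deduplication (the lt.dedup_rows call, the clustering choice, and the 0.7 threshold below are assumptions; verify them against the current package docs):

import linktransformer as lt

# Cluster near-duplicate rows of df on CompanyName using this model's embeddings.
# Function name and clustering parameters are assumptions; check the LinkTransformer docs.
df_deduped = lt.dedup_rows(
    df,
    on="CompanyName",
    model="{MODEL_NAME}",
    cluster_type="agglomerative",
    cluster_params={"threshold": 0.7},
)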
Evaluation Results
You can evaluate the model with the LinkTransformer package's inference functions. A few datasets ship with the package for you to try out, and we plan to host more on Hugging Face and our website (coming soon).
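For a quick sanity check without the package helpers, you can also score candidate pairs directly with sentence-transformers. A hedged sketch with made-up pairs; it illustrates similarity scoring only, not the package's own evaluation functions:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("{MODEL_NAME}")  # Hub ID placeholder used in this card

left = ["Intl. Business Machines", "Alphabet Inc."]
right = ["IBM", "Google LLC"]

# Cosine similarity of each aligned pair; higher means more likely a match.
scores = util.cos_sim(model.encode(left), model.encode(right))
for i, (l, r) in enumerate(zip(left, right)):
    print(f"{l} <-> {r}: {float(scores[i][i]):.3f}")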
Citing & Authors
@misc{arora2023linktransformer,
      title={LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models},
      author={Abhishek Arora and Melissa Dell},
      year={2023},
      eprint={2309.00789},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}