🚀 {MODEL_NAME}
This is a LinkTransformer model. At its core, it is a sentence-transformers model wrapped in the LinkTransformer class. It is designed for quick and easy record linkage (entity matching) via the LinkTransformer package, with tasks including clustering, deduplication, linking, aggregation, and more. It can also be used for any sentence-similarity task within the sentence-transformers framework. The model maps sentences and paragraphs to a 768-dimensional dense vector space and can be applied to tasks like clustering or semantic search. Check the sentence-transformers documentation if you want to use this model beyond what our applications support.
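Outside the LinkTransformer package, you can load the model directly with sentence-transformers. Here is a minimal sketch; it assumes the model is published on the Hugging Face Hub under the ID shown as the {MODEL_NAME} placeholder, and the company names are illustrative:

from sentence_transformers import SentenceTransformer, util

# Load the model by its Hub ID ({MODEL_NAME} is the placeholder used in this card).
model = SentenceTransformer("{MODEL_NAME}")

# Encode two company-name strings into 768-dimensional vectors.
embeddings = model.encode(["International Business Machines", "IBM"])

# Cosine similarity between the two embeddings; higher suggests a likelier match.
print(util.cos_sim(embeddings[0], embeddings[1]))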
🚀 Quick Start
✨ Features
- It's a LinkTransformer model, which is essentially a sentence-transformers model wrapper.
- Designed for quick and easy record linkage (entity-matching), including tasks like clustering, deduplication, linking, and aggregation.
- Can be used for any sentence similarity task within the sentence-transformers framework.
- Maps sentences and paragraphs to a 768-dimensional dense vector space for tasks like clustering or semantic search.
📦 Installation
Using this model is easy once you have LinkTransformer installed:
pip install -U linktransformer
💻 Usage Examples
Basic Usage
import linktransformer as lt
import pandas as pd

# Load the two tables to be linked (paths are illustrative).
df1 = pd.read_csv("data/df1.csv")
df2 = pd.read_csv("data/df2.csv")

# Semantic merge of the two tables on the CompanyName column.
df_merged = lt.merge(df1, df2, on="CompanyName", how="inner")
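To pin this particular model rather than the package default, pass its Hub ID through the model argument. A sketch (the merge_type="1:m" setting is an illustrative choice; see the package docs for the supported options):

# Link each row of df1 to candidate rows of df2 using this model's embeddings.
df_matched = lt.merge(df1, df2, merge_type="1:m", on="CompanyName", model="{MODEL_NAME}")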
🔧 Technical Details
This model was fine-tuned from multi-qa-mpnet-base-dot-v1, a base model pretrained for English. It was trained with the LinkTransformer framework on a dataset of company aliases drawn from Wikidata, for 100 epochs, with the remaining defaults recorded in the repo's LinkTransformer config file, LT_training_config.json.
DataLoader:
torch.utils.data.dataloader.DataLoader
of length 2087 with parameters:
{'batch_size': 64, 'sampler': 'torch.utils.data.dataloader._InfiniteConstantSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
Loss:
linktransformer.modified_sbert.losses.SupConLoss_wandb
Parameters of the fit()-Method:
{
"epochs": 100,
"evaluation_steps": 1044,
"evaluator": "sentence_transformers.evaluation.SequentialEvaluator.SequentialEvaluator",
"max_grad_norm": 1,
"optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
"optimizer_params": {
"lr": 2e-05
},
"scheduler": "WarmupLinear",
"steps_per_epoch": null,
"warmup_steps": 208700,
"weight_decay": 0.01
}
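Note that warmup_steps (208,700) equals the total number of training steps (2,087 batches per epoch × 100 epochs), so with the WarmupLinear scheduler the learning rate is still warming up for the entire run.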
LinkTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
)
Training your own LinkTransformer model
Any sentence-transformers model can be used as a backbone; any other transformer on Hugging Face can also be used by setting the option add_pooling_layer=True, which adds the required pooling layer on top. The model was trained using SupCon loss; usage details are in the package docs. The training config is in this repo under the name LT_training_config.json. To replicate the training, download that file and pass its path as the config_path argument of the training function. You can also override individual settings via the training_args argument. Here is an example:
import linktransformer as lt

# dataset_path and LINKAGE_CONFIG_PATH are placeholders from the package example;
# point them at your training CSV and the downloaded LT_training_config.json.
saved_model_path = lt.train_model(
    model_path="hiiamsid/sentence_similarity_spanish_es",
    dataset_path=dataset_path,
    left_col_names=["description47"],
    right_col_names=["description48"],
    left_id_name=["tariffcode47"],
    right_id_name=["tariffcode48"],
    log_wandb=False,
    config_path=LINKAGE_CONFIG_PATH,
    training_args={"num_epochs": 1},  # overrides the epoch count from the config
)
You can also use this package for deduplication (it clusters a dataframe on the supplied key column). Merging a fine class (like product) to a coarse class (like HS code) is also possible. Read our paper and the documentation for more!
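As a concrete sketch of deduplication (the lt.dedup_rows call, the clustering choice, and the 0.7 threshold below are assumptions; verify them against the current package docs):

import linktransformer as lt

# Cluster near-duplicate rows of df on CompanyName using this model's embeddings.
# Function name and clustering parameters are assumptions; check the LinkTransformer docs.
df_deduped = lt.dedup_rows(
    df,
    on="CompanyName",
    model="{MODEL_NAME}",
    cluster_type="agglomerative",
    cluster_params={"threshold": 0.7},
)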
Evaluation Results
You can evaluate the model with the LinkTransformer package's inference functions. A few datasets ship with the package for you to try out, and we plan to host more on Hugging Face and our website (coming soon).
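For a quick sanity check without the package helpers, you can also score candidate pairs directly with sentence-transformers. A hedged sketch with made-up pairs; it illustrates similarity scoring only, not the package's own evaluation functions:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("{MODEL_NAME}")  # Hub ID placeholder used in this card

left = ["Intl. Business Machines", "Alphabet Inc."]
right = ["IBM", "Google LLC"]

# Cosine similarity of each aligned pair; higher means more likely a match.
scores = util.cos_sim(model.encode(left), model.encode(right))
for i, (l, r) in enumerate(zip(left, right)):
    print(f"{l} <-> {r}: {float(scores[i][i]):.3f}")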
Citing & Authors
@misc{arora2023linktransformer,
      title={LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models},
      author={Abhishek Arora and Melissa Dell},
      year={2023},
      eprint={2309.00789},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}