{MODEL_NAME}
This is a LinkTransformer model designed for quick and easy record linkage and sentence similarity tasks. It maps sentences and paragraphs to a 768-dimensional dense vector space.
🚀 Quick Start
This is a LinkTransformer model: at its core, a sentence-transformers model wrapped in the LinkTransformer class. It is designed for quick and easy record linkage (entity matching) through the LinkTransformer package, covering tasks such as clustering, deduplication, linking, and aggregation. It can also be used for any sentence similarity task within the sentence-transformers framework, such as clustering or semantic search.
If you want to use this model beyond what the LinkTransformer applications support, refer to the sentence-transformers documentation.
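For example, the model can be loaded and used directly with sentence-transformers; in the sketch below, {MODEL_NAME} stands in for this repo's Hub ID:

from sentence_transformers import SentenceTransformer

# {MODEL_NAME} is a placeholder for this repo's Hub ID
model = SentenceTransformer("{MODEL_NAME}")

# Encode two company aliases into 768-dimensional vectors
embeddings = model.encode(["Apple Inc.", "Apple Incorporated"])
print(embeddings.shape)  # (2, 768)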
This model was fine-tuned from sentence-transformers/paraphrase-multilingual-mpnet-base-v2, which is pretrained for the following languages: de, en, zh, ja, hi, ar, bn, pt, ru, es, fr, ko.
This model was trained on a dataset of company aliases from Wikidata using the LinkTransformer framework. It was trained for 70 epochs, with the remaining defaults specified in the repo's LinkTransformer config file, LT_training_config.json.
✨ Features
- Multilingual Support: Supports de, en, zh, ja, hi, ar, bn, pt, ru, es, fr, ko.
- Versatile Usage: Can be used for record linkage, sentence similarity tasks, clustering, and semantic search.
- Easy Integration: Can be easily integrated with the LinkTransformer package.
📦 Installation
Using this model becomes easy when you have LinkTransformer installed:
pip install -U linktransformer
💻 Usage Examples
Basic Usage
import linktransformer as lt
import pandas as pd

# Load the two dataframes you want to link, e.g. company names written differently
df1 = pd.read_csv("data/df1.csv")
df2 = pd.read_csv("data/df2.csv")

# Semantic merge on the key column; pass this model's Hub ID (placeholder {MODEL_NAME})
df_merged = lt.merge(df1, df2, merge_type="1:m", on="CompanyName", model="{MODEL_NAME}")
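The same model can also power within-table deduplication. A minimal sketch, assuming the package's dedup_rows helper (check the LinkTransformer docs for its exact signature and clustering options):

# Collapse near-duplicate company names within a single table
df_deduped = lt.dedup_rows(df1, model="{MODEL_NAME}", on="CompanyName")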
Advanced Usage
saved_model_path = lt.train_model(
    model_path="hiiamsid/sentence_similarity_spanish_es",
    dataset_path=dataset_path,
    left_col_names=["description47"],
    right_col_names=["description48"],
    left_id_name=["tariffcode47"],
    right_id_name=["tariffcode48"],
    log_wandb=False,
    config_path=LINKAGE_CONFIG_PATH,
    training_args={"num_epochs": 1},
)
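The returned saved_model_path points at the fine-tuned checkpoint, so it can be fed straight back into the inference functions. A one-line sketch, with df_left and df_right standing in for your two tables:

# Link records using the freshly trained model
df_linked = lt.merge(df_left, df_right, merge_type="1:m", on="CompanyName", model=saved_model_path)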
🔧 Technical Details
Training
The model was trained with the following parameters:
DataLoader:
torch.utils.data.dataloader.DataLoader
of length 5966 with parameters:
{'batch_size': 64, 'sampler': 'torch.utils.data.dataloader._InfiniteConstantSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
Loss:
linktransformer.modified_sbert.losses.SupConLoss_wandb
Parameters of the fit()-Method:
{
"epochs": 70,
"evaluation_steps": 2983,
"evaluator": "sentence_transformers.evaluation.SequentialEvaluator.SequentialEvaluator",
"max_grad_norm": 1,
"optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
"optimizer_params": {
"lr": 2e-05
},
"scheduler": "WarmupLinear",
"steps_per_epoch": null,
"warmup_steps": 417620,
"weight_decay": 0.01
}
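Note that warmup_steps (417,620) equals the DataLoader length (5,966) multiplied by the number of epochs (70), so the WarmupLinear schedule ramps the learning rate across the entire run and only reaches its 2e-05 peak at the very end of training.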
LinkTransformer(
(0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
)
Model Architecture
The model architecture consists of a Transformer encoder (XLMRobertaModel, max sequence length 128) followed by a mean-pooling layer that produces 768-dimensional sentence embeddings.
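Concretely, the pooling step is attention-masked mean pooling over token embeddings. A sketch using the plain transformers library, again with {MODEL_NAME} as a placeholder for this repo's Hub ID:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("{MODEL_NAME}")
model = AutoModel.from_pretrained("{MODEL_NAME}")

sentences = ["Apple Inc.", "Apple Incorporated"]
encoded = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**encoded).last_hidden_state  # (batch, seq_len, 768)

# Mean-pool over tokens, ignoring padding via the attention mask
mask = encoded["attention_mask"].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
print(sentence_embeddings.shape)  # torch.Size([2, 768])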
Training Data
The model was trained on a dataset of company aliases from Wikidata.
📚 Documentation
Training your own LinkTransformer model
Any sentence-transformers model can be used as a backbone by simply adding a pooling layer. Any other transformer on Hugging Face can also be used by specifying the option add_pooling_layer=True. The model was trained using SupCon (supervised contrastive) loss. Usage details can be found in the package docs. The training config is in this repo under the name LT_training_config.json. To replicate the training, download that file and pass its path via the config_path argument of the training function. You can also override the config by specifying the training_args argument.
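To make that concrete, here is a minimal sketch of replicating this model's training; company_aliases.csv is a hypothetical filename for a local copy of the training data, and the real column settings live in LT_training_config.json:

import linktransformer as lt

saved_model_path = lt.train_model(
    model_path="sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
    dataset_path="company_aliases.csv",     # hypothetical local copy of the Wikidata aliases data
    config_path="LT_training_config.json",  # downloaded from this repo
    training_args={"num_epochs": 70},       # matches this model's training; override to experiment
)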
Evaluation Results
You can evaluate the model using the LinkTransformer package's inference functions. A few example datasets ship with the package for you to try out; we plan to host more datasets on Hugging Face and on our website (coming soon).
📄 License
No license information is provided for this model.
Citing & Authors
@misc{arora2023linktransformer,
      title={LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models},
      author={Abhishek Arora and Melissa Dell},
      year={2023},
      eprint={2309.00789},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}