{MODEL_NAME}
This is a LinkTransformer model designed for quick and easy record linkage and sentence similarity tasks. It maps sentences and paragraphs to a 768-dimensional dense vector space.
🚀 Quick Start
This is a LinkTransformer model: at its core, a sentence-transformers model wrapped in the LinkTransformer class. It is designed for quick and easy record linkage (entity matching) through the LinkTransformer package, covering tasks such as clustering, deduplication, linking, and aggregation. It can also be used for any sentence similarity task within the sentence-transformers framework, such as clustering or semantic search.
If you want to use this model beyond what the LinkTransformer applications support, refer to the sentence-transformers documentation.
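For example, the model can be loaded and used directly with sentence-transformers; in the sketch below, {MODEL_NAME} stands in for this repo's Hub ID:

from sentence_transformers import SentenceTransformer

# {MODEL_NAME} is a placeholder for this repo's Hub ID
model = SentenceTransformer("{MODEL_NAME}")

# Encode two company aliases into 768-dimensional vectors
embeddings = model.encode(["Apple Inc.", "Apple Incorporated"])
print(embeddings.shape)  # (2, 768)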
This model was fine-tuned from sentence-transformers/paraphrase-multilingual-mpnet-base-v2, which is pretrained for the following languages: de, en, zh, ja, hi, ar, bn, pt, ru, es, fr, ko.
This model was trained on a dataset of company aliases from Wikidata using the LinkTransformer framework. It was trained for 70 epochs, with the remaining defaults specified in the repo's LinkTransformer config file, LT_training_config.json.
✨ Features
- Multilingual Support: Supports de, en, zh, ja, hi, ar, bn, pt, ru, es, fr, ko.
- Versatile Usage: Can be used for record linkage, sentence similarity tasks, clustering, and semantic search.
- Easy Integration: Can be easily integrated with the LinkTransformer package.
📦 Installation
Using this model becomes easy when you have LinkTransformer installed:
pip install -U linktransformer
💻 Usage Examples
Basic Usage
import linktransformer as lt
import pandas as pd

# Load the two dataframes you want to link, e.g. company names written differently
df1 = pd.read_csv("data/df1.csv")
df2 = pd.read_csv("data/df2.csv")

# Semantic merge on the key column; pass this model's Hub ID (placeholder {MODEL_NAME})
df_merged = lt.merge(df1, df2, merge_type="1:m", on="CompanyName", model="{MODEL_NAME}")
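The same model can also power within-table deduplication. A minimal sketch, assuming the package's dedup_rows helper (check the LinkTransformer docs for its exact signature and clustering options):

# Collapse near-duplicate company names within a single table
df_deduped = lt.dedup_rows(df1, model="{MODEL_NAME}", on="CompanyName")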
Advanced Usage
saved_model_path = lt.train_model(
    model_path="hiiamsid/sentence_similarity_spanish_es",
    dataset_path=dataset_path,
    left_col_names=["description47"],
    right_col_names=["description48"],
    left_id_name=["tariffcode47"],
    right_id_name=["tariffcode48"],
    log_wandb=False,
    config_path=LINKAGE_CONFIG_PATH,
    training_args={"num_epochs": 1},
)
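The returned saved_model_path points at the fine-tuned checkpoint, so it can be fed straight back into the inference functions. A one-line sketch, with df_left and df_right standing in for your two tables:

# Link records using the freshly trained model
df_linked = lt.merge(df_left, df_right, merge_type="1:m", on="CompanyName", model=saved_model_path)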
🔧 Technical Details
Training
The model was trained with the following parameters:
DataLoader:
torch.utils.data.dataloader.DataLoader
of length 5966 with parameters:
{'batch_size': 64, 'sampler': 'torch.utils.data.dataloader._InfiniteConstantSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
Loss:
linktransformer.modified_sbert.losses.SupConLoss_wandb
Parameters of the fit()-Method:
{
"epochs": 70,
"evaluation_steps": 2983,
"evaluator": "sentence_transformers.evaluation.SequentialEvaluator.SequentialEvaluator",
"max_grad_norm": 1,
"optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
"optimizer_params": {
"lr": 2e-05
},
"scheduler": "WarmupLinear",
"steps_per_epoch": null,
"warmup_steps": 417620,
"weight_decay": 0.01
}
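Note that warmup_steps (417,620) equals the DataLoader length (5,966) multiplied by the number of epochs (70), so the WarmupLinear schedule ramps the learning rate across the entire run and only reaches its 2e-05 peak at the very end of training.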
LinkTransformer(
(0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
)
Model Architecture
The model architecture consists of a Transformer encoder (XLMRobertaModel, max sequence length 128) followed by a mean-pooling layer that produces 768-dimensional sentence embeddings.
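Concretely, the pooling step is attention-masked mean pooling over token embeddings. A sketch using the plain transformers library, again with {MODEL_NAME} as a placeholder for this repo's Hub ID:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("{MODEL_NAME}")
model = AutoModel.from_pretrained("{MODEL_NAME}")

sentences = ["Apple Inc.", "Apple Incorporated"]
encoded = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**encoded).last_hidden_state  # (batch, seq_len, 768)

# Mean-pool over tokens, ignoring padding via the attention mask
mask = encoded["attention_mask"].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
print(sentence_embeddings.shape)  # torch.Size([2, 768])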
Training Data
The model was trained on a dataset of company aliases from Wikidata.
📚 Documentation
Training your own LinkTransformer model
Any sentence-transformers model can be used as a backbone by simply adding a pooling layer. Any other transformer on Hugging Face can also be used by specifying the option add_pooling_layer=True. The model was trained using SupCon (supervised contrastive) loss. Usage details can be found in the package docs. The training config is in this repo under the name LT_training_config.json. To replicate the training, download that file and pass its path via the config_path argument of the training function. You can also override the config by specifying the training_args argument.
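To make that concrete, here is a minimal sketch of replicating this model's training; company_aliases.csv is a hypothetical filename for a local copy of the training data, and the real column settings live in LT_training_config.json:

import linktransformer as lt

saved_model_path = lt.train_model(
    model_path="sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
    dataset_path="company_aliases.csv",     # hypothetical local copy of the Wikidata aliases data
    config_path="LT_training_config.json",  # downloaded from this repo
    training_args={"num_epochs": 70},       # matches this model's training; override to experiment
)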
Evaluation Results
You can evaluate the model using the LinkTransformer package's inference functions. A few example datasets ship with the package for you to try out; we plan to host more datasets on Hugging Face and on our website (coming soon).
📄 License
No license information is provided for this model.
Citing & Authors
@misc{arora2023linktransformer,
      title={LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models},
      author={Abhishek Arora and Melissa Dell},
      year={2023},
      eprint={2309.00789},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}