Open-source Tamil-sentence-similarity-sbert model: accurately calculate the similarity of Tamil sentences

Tamil Sentence Similarity Sbert

Developed by l3cube-pune

This is a Tamil SBERT model fine-tuned on the STS dataset for calculating sentence similarity.

Text Embedding

Transformers

Tags: Tamil sentence similarity, Multilingual SBERT fine-tuning, Indian language processing

Downloads 312

Release Time: 2/25/2023

Model Overview

This is a Tamil sentence embedding model fine-tuned on the STS dataset to calculate semantic similarity between Tamil sentences. It is released as part of the MahaNLP project.

Model Features

Tamil-specific

A sentence embedding model optimized specifically for the Tamil language.

Semantic similarity calculation

Accurately calculates semantic similarity between Tamil sentences.

Fine-tuned on STS

Fine-tuned using the STS dataset to optimize similarity calculation performance.

Model Capabilities

Sentence embedding generation

Sentence similarity calculation

Semantic feature extraction

Use Cases

Natural Language Processing

Semantic search

Used for building Tamil semantic search engines

Improves relevance of search results

Text clustering

Semantic-based clustering analysis for Tamil texts

Can group semantically related texts more accurately than surface-level matching

Question answering systems

Used for question matching in Tamil QA systems

Improves QA accuracy
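
As a rough illustration of the semantic search and question matching use cases above, the sketch below ranks stored items against a query by cosine similarity. The three-dimensional vectors and the `faq_*` identifiers are invented placeholders; in practice the vectors would be sentence embeddings produced by the model. `normalize` and `top_k` are hypothetical helper names, not part of any library.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(query_vec, corpus, k=2):
    """Return the k (id, score) pairs most similar to the query, best first."""
    q = normalize(query_vec)
    scored = (
        (doc_id, sum(a * b for a, b in zip(q, normalize(vec))))
        for doc_id, vec in corpus.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy embeddings standing in for encoded Tamil questions (assumed values).
corpus = {
    "faq_1": [0.9, 0.1, 0.0],
    "faq_2": [0.0, 1.0, 0.0],
    "faq_3": [0.7, 0.7, 0.1],
}
results = top_k([1.0, 0.2, 0.0], corpus, k=2)
print(results)
```

For a large collection, the corpus vectors would normally be normalized once and stored, rather than re-normalized on every query as in this sketch.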

🚀 TamilSBERT-STS

This is a TamilSBERT model (l3cube-pune/tamil-sentence-bert-nli) fine-tuned on the STS dataset. It is released as part of the MahaNLP project (MahaNLP on GitHub). A multilingual version of this model, supporting major Indic languages and cross-lingual sentence similarity, is also available.

More details on the dataset, models, and baseline results can be found in our paper.

Model Information

Property	Details
Pipeline Tag	sentence-similarity
Tags	sentence-transformers, feature-extraction, sentence-similarity, transformers
License	cc-by-4.0
Language	ta

Widget Examples

The widget provides several examples to demonstrate sentence similarity:

Example 1:
- Source Sentence: "மக்கள் குழு பாடுகிறது"
- Comparison Sentences:
  - "சிலர் பாடுகிறார்கள்"
  - "ஒரு இளைஞன் பியானோ பாடுகிறான்"
  - "மனிதன் ஒரு கடிதம் எழுதுகிறான்"
Example 2:
- Source Sentence: "நாய் பொம்மையை குரைக்கிறது"
- Comparison Sentences:
  - "ஒரு நாய் ஒரு பொம்மையில் குரைக்கிறது"
  - "ஒரு பூனை பால் குடிக்கிறது"
  - "ஒரு நாய் ஒரு பந்தைத் துரத்துகிறது"
Example 3:
- Source Sentence: "நான் முதல் முறையாக விமானத்தில் அமர்ந்தேன்"
- Comparison Sentences:
  - "அது எனது முதல் விமானப் பயணம்"
  - "முதல் முறையாக ரயிலில் அமர்ந்தேன்"
  - "புதிய இடங்களுக்கு பயணம் செய்வது எனக்கு மிகவும் பிடிக்கும்"
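
The comparison behind these widget examples can be sketched as follows: the source sentence and each comparison sentence are mapped to vectors, and each pair is scored with cosine similarity. To keep the sketch runnable without downloading the model, the vectors below are made-up stand-ins, not real model outputs; `cosine_similarity` and `score_against` are hypothetical helper names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def score_against(source_vec, comparison_vecs):
    """Score one source embedding against each comparison embedding."""
    return [cosine_similarity(source_vec, vec) for vec in comparison_vecs]


# Made-up 3-d vectors; real embeddings would come from the model.
source = [1.0, 0.0, 0.0]
comparisons = [[2.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 3.0, 0.0]]
scores = score_against(source, comparisons)
print([round(s, 4) for s in scores])
```

Scores lie between -1 and 1; a higher score means the comparison sentence is closer in meaning to the source sentence.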

Citations

@article{deode2023l3cube,
  title={L3Cube-IndicSBERT: A simple approach for learning cross-lingual sentence representations using multilingual BERT},
  author={Deode, Samruddhi and Gadre, Janhavi and Kajale, Aditi and Joshi, Ananya and Joshi, Raviraj},
  journal={arXiv preprint arXiv:2304.11434},
  year={2023}
}

@article{joshi2022l3cubemahasbert,
  title={L3Cube-MahaSBERT and HindSBERT: Sentence BERT Models and Benchmarking BERT Sentence Representations for Hindi and Marathi},
  author={Joshi, Ananya and Kajale, Aditi and Gadre, Janhavi and Deode, Samruddhi and Joshi, Raviraj},
  journal={arXiv preprint arXiv:2211.11187},
  year={2022}
}

🚀 Quick Start

Prerequisites

Using this model is easiest with sentence-transformers installed. Install it with the following command:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage (Sentence-Transformers)

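from sentence_transformers import SentenceTransformer

# Hub ID inferred from this page's developer and model name; verify it before use.
MODEL_NAME = 'l3cube-pune/tamil-sentence-similarity-sbert'

# Sentences to encode
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer(MODEL_NAME)
embeddings = model.encode(sentences)  # one embedding vector per sentence
print(embeddings)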

Advanced Usage (HuggingFace Transformers)

Without sentence-transformers, you can use the model as follows. First, pass your input through the transformer model, then apply the appropriate pooling operation (mean pooling here) on top of the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: use the attention mask so padding tokens are excluded from the average
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

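# Load model from HuggingFace Hub
# Hub ID inferred from this page's developer and model name; verify it before use.
MODEL_NAME = 'l3cube-pune/tamil-sentence-similarity-sbert'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)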

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
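
To see why the pooling step multiplies by the attention mask, the following pure-Python sketch reproduces the same arithmetic for a single sentence. The token vectors and mask are invented toy values, and `masked_mean` is a hypothetical name used only for this illustration.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average token vectors where mask == 1, ignoring padding positions."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    for vec, keep in zip(token_embeddings, attention_mask):
        for i in range(dim):
            totals[i] += vec[i] * keep
    # Mirrors the clamp with min=1e-9: avoids division by zero for an all-zero mask.
    count = max(sum(attention_mask), 1e-9)
    return [t / count for t in totals]


# Two real tokens followed by one padding token (toy values).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean(tokens, mask)
print(pooled)
```

The padding vector is multiplied by zero and the divisor counts only real tokens, so the result is the average of the first two vectors regardless of what the padding position contains.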

📄 License

This model is released under the cc-by-4.0 license.
