Open-source Tamil model tamil-sentence-bert-nli - Free to calculate sentence similarity and extract features

Tamil Sentence Bert Nli

Developed by l3cube-pune

This is a Tamil BERT model trained on NLI datasets for sentence similarity computation and feature extraction.

Text Embedding

Transformers

Other · #Tamil sentence embeddings · #Multilingual NLI training · #Sentence similarity computation

Downloads 214

Release Time: 3/4/2023

Model Overview

This is a sentence-BERT model built on Tamil BERT (l3cube-pune/tamil-bert) and designed for computing sentence similarity and extracting sentence features. It is released as part of the MahaNLP project.

Model Features

Tamil-specific

Sentence embedding model specifically optimized for Tamil

NLI-trained

Trained using Natural Language Inference (NLI) datasets to improve sentence representation quality

Multilingual support

Corresponding multilingual versions support major Indian languages and cross-lingual capabilities

Model Capabilities

Sentence similarity computation

Sentence feature extraction

Semantic search

Use Cases

Information retrieval

Semantic search

Using sentence embeddings for more accurate semantic search

Text analysis

Document clustering

Document clustering based on sentence similarity
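
As a rough illustration of clustering documents by sentence similarity, the sketch below groups embedding vectors greedily: each vector joins the first cluster whose seed it resembles closely enough, otherwise it starts a new cluster. The function names and the 0.8 threshold are assumptions made for this example, not something the model card prescribes, and the toy vectors stand in for real model embeddings.

```python
import math
from typing import List, Sequence


def normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def greedy_cluster(embeddings: Sequence[Sequence[float]],
                   threshold: float = 0.8) -> List[List[int]]:
    """Group vector indices by cosine similarity to each cluster's first member.

    `threshold` is an illustrative value and would need tuning on real data.
    """
    seeds: List[List[float]] = []   # unit-length seed vector per cluster
    clusters: List[List[int]] = []  # indices of the members of each cluster
    for index, vector in enumerate(embeddings):
        unit = normalize(vector)
        for cluster_id, seed in enumerate(seeds):
            # The dot product of unit vectors equals their cosine similarity.
            if sum(a * b for a, b in zip(unit, seed)) >= threshold:
                clusters[cluster_id].append(index)
                break
        else:
            seeds.append(unit)
            clusters.append([index])
    return clusters


# Toy 2-D vectors standing in for sentence embeddings.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(greedy_cluster(toy_embeddings))
```

For larger collections, an established clustering library would likely be a better fit than this greedy pass, whose result depends on input order.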

🚀 TamilSBERT

TamilSBERT is a sentence similarity model. It is a TamilBERT model (l3cube-pune/tamil-bert) trained on the NLI dataset and released as part of the MahaNLP project. A multilingual version supporting major Indic languages and cross-lingual capabilities is also available.

📋 Metadata

Property	Details
Pipeline Tag	sentence-similarity
Tags	sentence-transformers, feature-extraction, sentence-similarity, transformers
License	cc-by-4.0
Language	ta

🎛️ Widget Examples

Example 1

Source Sentence: "மக்கள் குழு பாடுகிறது"
Comparison Sentences:
- "சிலர் பாடுகிறார்கள்"
- "ஒரு இளைஞன் பியானோ பாடுகிறான்"
- "மனிதன் ஒரு கடிதம் எழுதுகிறான்"

Example 2

Source Sentence: "நாய் பொம்மையை குரைக்கிறது"
Comparison Sentences:
- "ஒரு நாய் ஒரு பொம்மையில் குரைக்கிறது"
- "ஒரு பூனை பால் குடிக்கிறது"
- "ஒரு நாய் ஒரு பந்தைத் துரத்துகிறது"

Example 3

Source Sentence: "நான் முதல் முறையாக விமானத்தில் அமர்ந்தேன்"
Comparison Sentences:
- "அது எனது முதல் விமானப் பயணம் "
- "முதல் முறையாக ரயிலில் அமர்ந்தேன்"
- "புதிய இடங்களுக்கு பயணம் செய்வது எனக்கு மிகவும் பிடிக்கும்"
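
Each widget example pairs one source sentence with several comparison sentences. One common way to score such pairs is cosine similarity between sentence embeddings; that this is how the widget scores them is an assumption here. The sketch below ranks candidates that way. `rank_candidates` and its `encode` parameter are hypothetical names: `encode` is any callable that maps a list of strings to a list of vectors.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(
    source: str,
    candidates: Sequence[str],
    encode: Callable[[List[str]], Sequence[Sequence[float]]],
) -> List[Tuple[str, float]]:
    """Return (candidate, score) pairs sorted from most to least similar."""
    # Encode the source together with the candidates in a single call.
    vectors = encode([source] + list(candidates))
    source_vector, candidate_vectors = vectors[0], vectors[1:]
    scored = [
        (candidate, cosine_similarity(source_vector, vector))
        for candidate, vector in zip(candidates, candidate_vectors)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With the sentence-transformers library installed, the `encode` method of a loaded `SentenceTransformer` model could be passed as `encode`, with the source sentence and comparison sentences from the examples above as inputs.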

🚀 Quick Start

This model can be used in two ways: with or without the sentence-transformers library.

📦 Installation

To use the sentence-transformers library, install it first:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage with Sentence-Transformers

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

# Model ID inferred from this page's title and developer name
model = SentenceTransformer('l3cube-pune/tamil-sentence-bert-nli')
embeddings = model.encode(sentences)
print(embeddings)

Advanced Usage without Sentence-Transformers

from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    # The first element of model_output contains all token embeddings
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
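# Model ID inferred from this page's title and developer name
tokenizer = AutoTokenizer.from_pretrained('l3cube-pune/tamil-sentence-bert-nli')
model = AutoModel.from_pretrained('l3cube-pune/tamil-sentence-bert-nli')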

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
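
To see why the pooling step uses the attention mask, here is a small worked example in plain Python. It mirrors the arithmetic of `mean_pooling` for a single sentence using made-up numbers; `masked_mean` is a hypothetical helper written for this illustration and is not part of any library.

```python
from typing import List


def masked_mean(token_embeddings: List[List[float]],
                attention_mask: List[int]) -> List[float]:
    """Average the token vectors of one sentence, skipping positions whose mask is 0."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                totals[i] += value
    # Mirrors torch.clamp(..., min=1e-9): avoids division by zero for an all-zero mask.
    count = max(sum(attention_mask), 1e-9)
    return [total / count for total in totals]


# Two real tokens followed by one padding position (made-up values).
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
naive = [sum(column) / len(tokens) for column in zip(*tokens)]

print(pooled)  # [2.0, 3.0]: only the two real tokens are averaged
print(naive)   # the padding vector skews an unmasked average
```

Because sentences in a batch are padded to a common length, an unmasked average would mix padding positions into the embeddings of the shorter sentences.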

📚 Additional Information

This model is part of the MahaNLP project: https://github.com/l3cube-pune/MarathiNLP.

A multilingual version of this model supporting major Indic languages and cross-lingual capabilities is available as indic-sentence-bert-nli.
A better sentence similarity model (a fine-tuned version of this model) is available at: https://huggingface.co/l3cube-pune/tamil-sentence-similarity-sbert.

📄 References

More details on the dataset, models, and baseline results can be found in our paper: https://arxiv.org/abs/2304.11434.

@article{deode2023l3cube,
  title={L3Cube-IndicSBERT: A simple approach for learning cross-lingual sentence representations using multilingual BERT},
  author={Deode, Samruddhi and Gadre, Janhavi and Kajale, Aditi and Joshi, Ananya and Joshi, Raviraj},
  journal={arXiv preprint arXiv:2304.11434},
  year={2023}
}

@article{joshi2022l3cubemahasbert,
  title={L3Cube-MahaSBERT and HindSBERT: Sentence BERT Models and Benchmarking BERT Sentence Representations for Hindi and Marathi},
  author={Joshi, Ananya and Kajale, Aditi and Gadre, Janhavi and Deode, Samruddhi and Joshi, Raviraj},
  journal={arXiv preprint arXiv:2211.11187},
  year={2022}
}

🔗 Related Models

Monolingual Indic sentence BERT models:
Monolingual similarity models:

📄 License

This model is released under the cc-by-4.0 license.
