Telugu Sentence Similarity SBERT: Open-Source Model for Calculating Telugu Sentence Similarity

Telugu Sentence Similarity Sbert

Developed by l3cube-pune

This is a Telugu SBERT model fine-tuned on the STS dataset for calculating sentence similarity.

Text Embedding

Transformers

Other

#Telugu sentence similarity #Multilingual BERT fine-tuning #Indian language processing

Downloads 100

Release Time: 2/25/2023

Model Overview

This is a Telugu sentence-transformer model designed for feature extraction and sentence similarity calculation. It is part of the MahaNLP project and supports Telugu text processing.

Model Features

Telugu-specific

A sentence similarity model optimized specifically for the Telugu language.

SBERT-based architecture

Uses the Sentence-BERT architecture to generate sentence embeddings.

Fine-tuned on STS dataset

Fine-tuned on the STS (Semantic Textual Similarity) dataset to improve similarity calculation performance.

Model Capabilities

Sentence feature extraction

Sentence similarity calculation

Telugu text processing

Use Cases

Text similarity analysis

Semantic search

Used for building Telugu semantic search engines.

Q&A systems

Used for matching questions with similar answers.

Text clustering

Document classification

Automatic document classification based on content similarity.
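
As a rough illustration of similarity-based classification, the sketch below assigns a document to the label whose example centroid is closest in cosine similarity. The helper names (`cosine`, `centroid`, `classify`), the labels, and the 2-D toy vectors are assumptions made for illustration; in practice the vectors would be sentence embeddings produced by the model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def classify(vector, labeled_examples):
    """Return the label whose example centroid is most similar to `vector`."""
    centroids = {label: centroid(vs) for label, vs in labeled_examples.items()}
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))


# Toy 2-D vectors standing in for embeddings of labeled documents (hypothetical data)
labeled_examples = {
    "sports": [[0.9, 0.1], [0.8, 0.2]],
    "politics": [[0.1, 0.9], [0.2, 0.8]],
}
print(classify([0.7, 0.3], labeled_examples))
```

Nearest-centroid is only one possible design; a k-nearest-neighbour vote or a small classifier trained on the embeddings would follow the same pattern.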

🚀 TeluguSBERT-STS

This is a TeluguSBERT model (l3cube-pune/telugu-sentence-bert-nli) fine-tuned on the STS dataset. It is released as part of the MahaNLP project: https://github.com/l3cube-pune/MarathiNLP. A multilingual version of this model, supporting major Indic languages and cross-lingual sentence similarity, is also shared by the authors.

More details on the dataset, models, and baseline results can be found in our paper.

📄 License

The model is released under the CC BY 4.0 license.

📚 Documentation

BibTeX Citations

@article{deode2023l3cube,
  title={L3Cube-IndicSBERT: A simple approach for learning cross-lingual sentence representations using multilingual BERT},
  author={Deode, Samruddhi and Gadre, Janhavi and Kajale, Aditi and Joshi, Ananya and Joshi, Raviraj},
  journal={arXiv preprint arXiv:2304.11434},
  year={2023}
}

@article{joshi2022l3cubemahasbert,
  title={L3Cube-MahaSBERT and HindSBERT: Sentence BERT Models and Benchmarking BERT Sentence Representations for Hindi and Marathi},
  author={Joshi, Ananya and Kajale, Aditi and Gadre, Janhavi and Deode, Samruddhi and Joshi, Raviraj},
  journal={arXiv preprint arXiv:2211.11187},
  year={2022}
}

🚀 Quick Start

📦 Installation

The model is easiest to use with sentence-transformers installed:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage with Sentence-Transformers

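from sentence_transformers import SentenceTransformer

# Sentences to embed
sentences = ["This is an example sentence", "Each sentence is converted"]

# Replace '{MODEL_NAME}' with this model's Hugging Face Hub ID
model = SentenceTransformer('{MODEL_NAME}')
embeddings = model.encode(sentences)  # one embedding vector per sentence
print(embeddings)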
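
Once embeddings are available, sentence similarity is commonly scored with cosine similarity. The following is a minimal, dependency-free sketch of that step; the helper name `cosine_similarity` and the toy 3-D vectors are illustrative assumptions standing in for rows of `embeddings`, not output of the model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: a zero vector has no direction
    return dot / (norm_u * norm_v)


# Toy vectors standing in for real sentence embeddings (hypothetical values)
emb_a = [1.0, 2.0, 3.0]
emb_b = [2.0, 4.0, 6.0]   # same direction as emb_a
emb_c = [-3.0, 0.0, 1.0]  # orthogonal to emb_a

print(cosine_similarity(emb_a, emb_b))  # close to 1: same direction
print(cosine_similarity(emb_a, emb_c))  # 0: orthogonal vectors
```

With sentence-transformers installed, `sentence_transformers.util.cos_sim(embeddings[0], embeddings[1])` computes the same quantity directly on the encoded sentences.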

Advanced Usage with HuggingFace Transformers

Without sentence-transformers, you can use the model as follows: first pass the input through the transformer model, then apply the appropriate pooling operation on top of the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from the Hugging Face Hub (replace '{MODEL_NAME}' with this model's Hub ID)
tokenizer = AutoTokenizer.from_pretrained('{MODEL_NAME}')
model = AutoModel.from_pretrained('{MODEL_NAME}')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
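
To see why `mean_pooling` weights tokens by the attention mask, the toy example below repeats the same calculation in plain Python for a single sentence. The function name `masked_mean` and the numbers are assumptions chosen for illustration; the last "token" plays the role of padding.

```python
def masked_mean(token_vectors, attention_mask):
    """Average token vectors where the mask is 1, ignoring padding positions."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                totals[i] += value
    count = max(count, 1e-9)  # mirrors the clamp that avoids division by zero
    return [total / count for total in totals]


# One sentence with two real tokens and one padding token (mask value 0)
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

print(masked_mean(tokens, mask))       # padding row is excluded from the average
print(masked_mean(tokens, [1, 1, 1]))  # unmasked mean is distorted by the padding row
```

Because `padding=True` pads shorter sentences to the batch length, averaging without the mask would let padding positions shift every short sentence's embedding.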

📋 Model Details

Property	Details
Model Type	Fine-tuned TeluguSBERT model on STS dataset
Training Data	STS dataset
Pipeline Tag	sentence-similarity
Tags	sentence-transformers, feature-extraction, sentence-similarity, transformers

🎛️ Widget Examples

The model comes with some widget examples to demonstrate its usage:

Example 1:
- Source Sentence: "ఒక మహిళ ఉల్లిపాయను కత్తిస్తోంది"
- Comparison Sentences:
  - "ఒక స్త్రీ ఉల్లిపాయలు కోస్తోంది"
  - "ఒక స్త్రీ బంగాళాదుంపను తొక్కడం"
  - "ఒక పిల్లి ఇంటి చుట్టూ నడుస్తోంది"
Example 2:
- Source Sentence: "పిల్లల బృందం జంపింగ్ పోటీని నిర్వహిస్తోంది"
- Comparison Sentences:
  - "పిల్లల గుంపు సరదాగా గడుపుతోంది"
  - "పిల్లలు పార్కులో ఆడుకోవడానికి ఇష్టపడతారు"
  - "ముగ్గురు అబ్బాయిలు నడుస్తున్నారు"
Example 3:
- Source Sentence: "మీ రెండు ప్రశ్నలకు అవుననే సమాధానం వస్తుంది"
- Comparison Sentences:
  - "రెండు ప్రశ్నలకు అవుననే సమాధానం వస్తోంది"
  - "మేము మీ అన్ని ప్రశ్నలకు సమాధానమిచ్చాము"
  - "నేను ఈ ప్రశ్నకు సమాధానం ఇస్తాను"
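
Each widget example compares one source sentence against several comparison sentences. A minimal sketch of that comparison is shown below. The function `rank_by_similarity`, the stand-in encoder `toy_encode`, and its hand-picked vectors are hypothetical and exist only so the sketch runs without downloading the model; they are not model output.

```python
import math


def rank_by_similarity(source, candidates, encode):
    """Return (candidate, score) pairs, most similar first.

    `encode` is any callable that maps a list of sentences to a list of
    vectors, e.g. the `encode` method of a loaded SentenceTransformer.
    """
    vectors = [[float(x) for x in v] for v in encode([source] + list(candidates))]
    src, rest = vectors[0], vectors[1:]

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norms if norms else 0.0

    scored = [(c, cosine(src, v)) for c, v in zip(candidates, rest)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical stand-in encoder: hand-picked 2-D vectors, NOT model output
TOY_VECTORS = {
    "source": [1.0, 0.0],
    "close paraphrase": [0.9, 0.1],
    "related topic": [0.5, 0.5],
    "unrelated": [0.0, 1.0],
}


def toy_encode(sentences):
    return [TOY_VECTORS[s] for s in sentences]


ranked = rank_by_similarity(
    "source", ["unrelated", "related topic", "close paraphrase"], toy_encode
)
for sentence, score in ranked:
    print(f"{score:.3f}  {sentence}")
```

To score the Telugu examples above, pass a source sentence, its comparison sentences, and `model.encode` from a loaded `SentenceTransformer` in place of `toy_encode`.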
