🚀 HindSBERT-STS
This is a model designed for sentence similarity tasks. It is a fine-tuned version of the HindSBERT model (l3cube-pune/hindi-sentence-bert-nli) on the STS dataset. It was released as part of the project MahaNLP: https://github.com/l3cube-pune/MarathiNLP. A multilingual version of this model, which supports major Indic languages and cross-lingual sentence similarity, is available at indic-sentence-similarity-sbert.
🚀 Quick Start
Prerequisites
This is a sentence-transformers model. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search.
Installation
Using this model is easy once you have sentence-transformers installed:
pip install -U sentence-transformers
Basic Usage
Then you can use the model like this:
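from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# Replace '{MODEL_NAME}' with this model's Hugging Face Hub ID
model = SentenceTransformer('{MODEL_NAME}')

# encode() returns one 768-dimensional vector per input sentence
embeddings = model.encode(sentences)
print(embeddings)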
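Because the model targets sentence similarity, a typical next step is to compare embeddings with cosine similarity. The following is a minimal sketch and not part of the original model card: `cosine_similarity` and `rank_by_similarity` are hypothetical helper names, and the toy 3-dimensional vectors stand in for the 768-dimensional output of `model.encode`. It uses only the standard library, so it runs without downloading the model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(float(a) * float(b) for a, b in zip(u, v))
    norm_u = math.sqrt(sum(float(a) ** 2 for a in u))
    norm_v = math.sqrt(sum(float(b) ** 2 for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: all-zero vectors score 0
    return dot / (norm_u * norm_v)


def rank_by_similarity(source_vec, candidate_vecs):
    """Return (candidate_index, score) pairs, most similar first."""
    scores = [
        (i, cosine_similarity(source_vec, c))
        for i, c in enumerate(candidate_vecs)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings (assumption for illustration).
# With the model loaded, you would instead do something like:
#   vectors = model.encode([source_sentence] + comparison_sentences)
#   source, candidates = vectors[0], vectors[1:]
source = [1.0, 0.0, 1.0]
candidates = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]

ranking = rank_by_similarity(source, candidates)
print(ranking)  # candidate 1 ranks first, then 2, then 0
```

Cosine similarity ignores vector length, which is why the second candidate (a scaled copy of the source) scores highest. If sentence-transformers is installed, its `util.cos_sim` function computes the same quantity on whole batches of embeddings.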
Advanced Usage
Without sentence-transformers, you can use the model like this: first, pass your input through the transformer model, then apply the appropriate pooling operation on top of the contextualized word embeddings.
from transformers import AutoTokenizer, AutoModel
import torch

# Mean pooling: average the token embeddings, using the attention mask to ignore padding
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output holds all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences to embed
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load the tokenizer and model; replace '{MODEL_NAME}' with this model's Hugging Face Hub ID
tokenizer = AutoTokenizer.from_pretrained('{MODEL_NAME}')
model = AutoModel.from_pretrained('{MODEL_NAME}')

# Tokenize the sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings without tracking gradients
with torch.no_grad():
    model_output = model(**encoded_input)

# Pool token embeddings into one vector per sentence
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
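To see why the attention mask matters in `mean_pooling`, here is a small dependency-free sketch of the same computation for a single toy sentence. The helper name `masked_mean` and all numbers are illustrative assumptions; the real function above works on batched PyTorch tensors.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        for j in range(dim):
            totals[j] += vector[j] * keep
    # Mirrors torch.clamp(..., min=1e-9): avoids dividing by zero
    # when every position is masked out.
    count = max(float(sum(attention_mask)), 1e-9)
    return [t / count for t in totals]


# Three token positions with 2-dimensional vectors; the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

A plain average over all three positions would instead give roughly [34.67, 35.33], which shows how padding tokens would skew the sentence embedding if the mask were ignored.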
📚 Documentation
Model Details
More details on the dataset, models, and baseline results can be found in our paper (arXiv:2211.11187):
@article{joshi2022l3cubemahasbert,
title={L3Cube-MahaSBERT and HindSBERT: Sentence BERT Models and Benchmarking BERT Sentence Representations for Hindi and Marathi},
author={Joshi, Ananya and Kajale, Aditi and Gadre, Janhavi and Deode, Samruddhi and Joshi, Raviraj},
journal={arXiv preprint arXiv:2211.11187},
year={2022}
}
Related Papers
Related Models
Monolingual Similarity Models
Monolingual Indic Sentence BERT Models
📄 License
This model is released under the cc-by-4.0 license.
🔍 Widget Examples
Example 1
- Source Sentence: "एक आदमी एक रस्सी पर चढ़ रहा है"
- Comparison Sentences:
  - "एक आदमी एक रस्सी पर चढ़ता है"
  - "एक आदमी एक दीवार पर चढ़ रहा है"
  - "एक आदमी बांसुरी बजाता है"
Example 2
- Source Sentence: "कुछ लोग गा रहे हैं"
- Comparison Sentences:
  - "लोगों का एक समूह गाता है"
  - "बिल्ली दूध पी रही है"
  - "दो आदमी लड़ रहे हैं"
Example 3
- Source Sentence: "फेडरर ने 7वां विंबलडन खिताब जीत लिया है"
- Comparison Sentences:
  - "फेडरर अपने करियर में कुल 20 ग्रैंडस्लैम खिताब जीत चुके है"
  - "फेडरर ने सितंबर में अपने निवृत्ति की घोषणा की"
  - "एक आदमी कुछ खाना पकाने का तेल एक बर्तन में डालता है"