# 🚀 Bhasha embed v0 model
This is an embedding model for texts in Hindi (Devanagari script), English, and Romanized Hindi. Many existing multilingual embedding models perform well on Hindi and English texts separately, but they lack the following capabilities:
- Romanized Hindi support: This is the first embedding model to support Romanized Hindi (transliterated Hindi / hin_Latn).
- Cross-lingual alignment: The model outputs language-agnostic embeddings, enabling queries against a multilingual candidate pool that contains a mix of Hindi, English, and Romanized Hindi texts.
## ✨ Features
- Supported Languages: Hindi, English, Romanized Hindi
- Base model: [google/muril-base-cased](https://huggingface.co/google/muril-base-cased)
- Training GPUs: 1x RTX 4090
- Training methodology: Distillation from an English embedding model, followed by fine-tuning on triplet data.
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
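The triplet fine-tuning objective mentioned above can be sketched numerically: each training example pairs an anchor with a positive (a match, possibly in another script) and a negative, and the loss pushes the positive closer to the anchor than the negative. This is a minimal NumPy illustration, not the training code; the margin value is an assumption for illustration only.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style triplet loss on L2-normalized embeddings.

    The loss is zero once the positive is more similar to the anchor
    than the negative by at least `margin` (margin here is illustrative).
    """
    sim_pos = float(np.dot(anchor, positive))
    sim_neg = float(np.dot(anchor, negative))
    return max(0.0, sim_neg - sim_pos + margin)

# Toy unit vectors: the positive is aligned with the anchor,
# the negative is orthogonal to it.
a = np.array([1.0, 0.0])
p = np.array([1.0, 0.0])
n = np.array([0.0, 1.0])
print(triplet_loss(a, p, n))  # 0.0: already separated by more than the margin
```

Swapping the positive and negative in the call above yields a positive loss, which is what drives the embeddings of matching texts together during fine-tuning.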
## Model Sources
- Repository: [github_link](https://github.com/akshita-sukhlecha/bhasha-embed)
- Developer: [Akshita Sukhlecha](https://www.linkedin.com/in/akshita-sukhlecha/)
## 📚 Documentation
### Results

**Results for English-Hindi cross-lingual alignment**: Tasks with a corpus containing both Hindi and English texts

**Results for Romanized Hindi tasks**: Tasks with Romanized Hindi texts

**Results for retrieval tasks with a multilingual corpus**: Retrieval tasks with a corpus containing Hindi, English, and Romanized Hindi texts

**Results for Hindi tasks**: Tasks with Hindi (Devanagari script) texts

### Additional information
- Some task dataset links: Belebele, MLQA, XQuAD, SemRel24
- hin_Latn tasks: Most hin_Latn tasks were created by transliterating Hindi texts using the [indic-trans library](https://github.com/libindic/indic-trans)
- Detailed results: [github_link](https://github.com/akshita-sukhlecha/bhasha-embed/blob/main/eval/results/all_results.csv)
- Script to reproduce the results: [github_link](https://github.com/akshita-sukhlecha/bhasha-embed/blob/main/eval/evaluator.py)
### Sample outputs
Example 1

Example 2

Example 3

Example 4

## 💻 Usage Examples
### Basic Usage
Below are examples that encode queries and passages and compute similarity scores using Sentence Transformers and 🤗 Transformers.
#### Using Sentence Transformers
First, install the Sentence Transformers library (`pip install -U sentence-transformers`), and then run the following code:
```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("AkshitaS/bhasha-embed-v0")

queries = [
    "प्रणव ने कानून की पढ़ाई की और ३० की उम्र में राजनीति से जुड़ गए",
    "Pranav studied law and became a politician at the age of 30.",
    "Pranav ne kanoon ki padhai kari aur 30 ki umar mein rajneeti se jud gaye"
]
documents = [
    "प्रणव ने कानून की पढ़ाई की और ३० की उम्र में राजनीति से जुड़ गए",
    "Pranav studied law and became a politician at the age of 30.",
    "Pranav ne kanoon ki padhai kari aur 30 ki umar mein rajneeti se jud gaye",
    "प्रणव का जन्म राजनीतिज्ञों के परिवार में हुआ था",
    "Pranav was born in a family of politicians",
    "Pranav ka janm rajneetigyon ke parivar mein hua tha"
]

query_embeddings = model.encode(queries, normalize_embeddings=True)
document_embeddings = model.encode(documents, normalize_embeddings=True)
# Embeddings are L2-normalized, so the dot product is the cosine similarity.
similarity_matrix = query_embeddings @ document_embeddings.T
print(similarity_matrix.shape)  # (3, 6)
print(np.round(similarity_matrix, 2))
```
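For retrieval, the similarity matrix above can be turned into per-query document rankings with `argsort`. The matrix below is a toy stand-in (its values are illustrative, not actual model outputs):

```python
import numpy as np

# Stand-in for a query x document similarity matrix
# (rows = queries, columns = documents).
similarity_matrix = np.array([
    [0.95, 0.90, 0.88, 0.40, 0.35, 0.33],
    [0.91, 0.96, 0.89, 0.38, 0.42, 0.34],
])

# For each query, document indices sorted from most to least similar.
ranking = np.argsort(-similarity_matrix, axis=1)
print(ranking[:, 0])  # best-matching document per query -> [0 1]
```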
#### Using 🤗 Transformers
```python
import numpy as np
from torch import Tensor
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Mean-pool token embeddings, ignoring padding positions.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

model_id = "AkshitaS/bhasha-embed-v0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

queries = [
    "प्रणव ने कानून की पढ़ाई की और ३० की उम्र में राजनीति से जुड़ गए",
    "Pranav studied law and became a politician at the age of 30.",
    "Pranav ne kanoon ki padhai kari aur 30 ki umar mein rajneeti se jud gaye"
]
documents = [
    "प्रणव ने कानून की पढ़ाई की और ३० की उम्र में राजनीति से जुड़ गए",
    "Pranav studied law and became a politician at the age of 30.",
    "Pranav ne kanoon ki padhai kari aur 30 ki umar mein rajneeti se jud gaye",
    "प्रणव का जन्म राजनीतिज्ञों के परिवार में हुआ था",
    "Pranav was born in a family of politicians",
    "Pranav ka janm rajneetigyon ke parivar mein hua tha"
]

input_texts = queries + documents
batch_dict = tokenizer(input_texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)
similarity_matrix = (embeddings[:len(queries)] @ embeddings[len(queries):].T).detach().numpy()
print(similarity_matrix.shape)  # (3, 6)
print(np.round(similarity_matrix, 2))
```
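A quick sanity check of the masked mean pooling used above, on toy numbers (the values are chosen for illustration): padded positions must not affect the resulting sentence embedding.

```python
import torch

# One sequence of 3 tokens with 2-dim hidden states; the last token is padding
# with a deliberately huge value to make any leak obvious.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])

# Same pooling as average_pool: zero out padding, then divide by the true length.
pooled = hidden.masked_fill(~mask[..., None].bool(), 0.0).sum(dim=1) / mask.sum(dim=1)[..., None]
print(pooled)  # mean of the first two tokens only: [[2.0, 3.0]]
```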
## Citation
To cite this model:
```bibtex
@misc{sukhlecha_2024_bhasha_embed_v0,
  author = {Sukhlecha, Akshita},
  title = {Bhasha-embed-v0},
  howpublished = {Hugging Face},
  month = {June},
  year = {2024},
  url = {https://huggingface.co/AkshitaS/bhasha-embed-v0}
}
```
## 📄 License
This project is licensed under the Apache 2.0 license.