# 🚀 Bhasha embed v0 model
This is an embedding model for texts in Hindi (Devanagari script), English, and Romanized Hindi. Many existing multilingual embedding models perform well on Hindi and English texts separately, but they lack the following capabilities:
- Romanized Hindi support: This is the first embedding model to support Romanized Hindi (transliterated Hindi / hin_Latn).
- Cross-lingual alignment: The model outputs language-agnostic embeddings, enabling queries against a multilingual candidate pool that contains a mix of Hindi, English, and Romanized Hindi texts.
## ✨ Features
- Supported Languages: Hindi, English, Romanized Hindi
- Base model: [google/muril-base-cased](https://huggingface.co/google/muril-base-cased)
- Training GPUs: 1x RTX 4090
- Training methodology: Distillation from an English embedding model, followed by fine-tuning on triplet data.
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
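The triplet fine-tuning objective mentioned above can be sketched numerically: each training example pairs an anchor with a positive (a match, possibly in another script) and a negative, and the loss pushes the positive closer to the anchor than the negative. This is a minimal NumPy illustration, not the training code; the margin value is an assumption for illustration only.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style triplet loss on L2-normalized embeddings.

    The loss is zero once the positive is more similar to the anchor
    than the negative by at least `margin` (margin here is illustrative).
    """
    sim_pos = float(np.dot(anchor, positive))
    sim_neg = float(np.dot(anchor, negative))
    return max(0.0, sim_neg - sim_pos + margin)

# Toy unit vectors: the positive is aligned with the anchor,
# the negative is orthogonal to it.
a = np.array([1.0, 0.0])
p = np.array([1.0, 0.0])
n = np.array([0.0, 1.0])
print(triplet_loss(a, p, n))  # 0.0: already separated by more than the margin
```

Swapping the positive and negative in the call above yields a positive loss, which is what drives the embeddings of matching texts together during fine-tuning.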
## Model Sources
- Repository: [github_link](https://github.com/akshita-sukhlecha/bhasha-embed)
- Developer: [Akshita Sukhlecha](https://www.linkedin.com/in/akshita-sukhlecha/)
## 📚 Documentation
### Results

**Results for English-Hindi cross-lingual alignment**: Tasks with a corpus containing both Hindi and English texts

**Results for Romanized Hindi tasks**: Tasks with Romanized Hindi texts

**Results for retrieval tasks with a multilingual corpus**: Retrieval tasks with a corpus containing Hindi, English, and Romanized Hindi texts

**Results for Hindi tasks**: Tasks with Hindi (Devanagari script) texts

### Additional information
- Some task dataset links: Belebele, MLQA, XQuAD, SemRel24
- hin_Latn tasks: Most hin_Latn tasks were created by transliterating Hindi texts using the [indic-trans library](https://github.com/libindic/indic-trans)
- Detailed results: [github_link](https://github.com/akshita-sukhlecha/bhasha-embed/blob/main/eval/results/all_results.csv)
- Script to reproduce the results: [github_link](https://github.com/akshita-sukhlecha/bhasha-embed/blob/main/eval/evaluator.py)
### Sample outputs
Example 1

Example 2

Example 3

Example 4

## 💻 Usage Examples
### Basic Usage
Below are examples that encode queries and passages and compute similarity scores using Sentence Transformers and 🤗 Transformers.
#### Using Sentence Transformers
First, install the Sentence Transformers library (`pip install -U sentence-transformers`), and then run the following code:
```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("AkshitaS/bhasha-embed-v0")

queries = [
    "प्रणव ने कानून की पढ़ाई की और ३० की उम्र में राजनीति से जुड़ गए",
    "Pranav studied law and became a politician at the age of 30.",
    "Pranav ne kanoon ki padhai kari aur 30 ki umar mein rajneeti se jud gaye"
]
documents = [
    "प्रणव ने कानून की पढ़ाई की और ३० की उम्र में राजनीति से जुड़ गए",
    "Pranav studied law and became a politician at the age of 30.",
    "Pranav ne kanoon ki padhai kari aur 30 ki umar mein rajneeti se jud gaye",
    "प्रणव का जन्म राजनीतिज्ञों के परिवार में हुआ था",
    "Pranav was born in a family of politicians",
    "Pranav ka janm rajneetigyon ke parivar mein hua tha"
]

query_embeddings = model.encode(queries, normalize_embeddings=True)
document_embeddings = model.encode(documents, normalize_embeddings=True)
# Embeddings are L2-normalized, so the dot product is the cosine similarity.
similarity_matrix = query_embeddings @ document_embeddings.T
print(similarity_matrix.shape)  # (3, 6)
print(np.round(similarity_matrix, 2))
```
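For retrieval, the similarity matrix above can be turned into per-query document rankings with `argsort`. The matrix below is a toy stand-in (its values are illustrative, not actual model outputs):

```python
import numpy as np

# Stand-in for a query x document similarity matrix
# (rows = queries, columns = documents).
similarity_matrix = np.array([
    [0.95, 0.90, 0.88, 0.40, 0.35, 0.33],
    [0.91, 0.96, 0.89, 0.38, 0.42, 0.34],
])

# For each query, document indices sorted from most to least similar.
ranking = np.argsort(-similarity_matrix, axis=1)
print(ranking[:, 0])  # best-matching document per query -> [0 1]
```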
#### Using 🤗 Transformers
```python
import numpy as np
from torch import Tensor
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Mean-pool token embeddings, ignoring padding positions.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

model_id = "AkshitaS/bhasha-embed-v0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

queries = [
    "प्रणव ने कानून की पढ़ाई की और ३० की उम्र में राजनीति से जुड़ गए",
    "Pranav studied law and became a politician at the age of 30.",
    "Pranav ne kanoon ki padhai kari aur 30 ki umar mein rajneeti se jud gaye"
]
documents = [
    "प्रणव ने कानून की पढ़ाई की और ३० की उम्र में राजनीति से जुड़ गए",
    "Pranav studied law and became a politician at the age of 30.",
    "Pranav ne kanoon ki padhai kari aur 30 ki umar mein rajneeti se jud gaye",
    "प्रणव का जन्म राजनीतिज्ञों के परिवार में हुआ था",
    "Pranav was born in a family of politicians",
    "Pranav ka janm rajneetigyon ke parivar mein hua tha"
]

input_texts = queries + documents
batch_dict = tokenizer(input_texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)
similarity_matrix = (embeddings[:len(queries)] @ embeddings[len(queries):].T).detach().numpy()
print(similarity_matrix.shape)  # (3, 6)
print(np.round(similarity_matrix, 2))
```
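A quick sanity check of the masked mean pooling used above, on toy numbers (the values are chosen for illustration): padded positions must not affect the resulting sentence embedding.

```python
import torch

# One sequence of 3 tokens with 2-dim hidden states; the last token is padding
# with a deliberately huge value to make any leak obvious.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])

# Same pooling as average_pool: zero out padding, then divide by the true length.
pooled = hidden.masked_fill(~mask[..., None].bool(), 0.0).sum(dim=1) / mask.sum(dim=1)[..., None]
print(pooled)  # mean of the first two tokens only: [[2.0, 3.0]]
```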
## Citation
To cite this model:
```bibtex
@misc{sukhlecha_2024_bhasha_embed_v0,
  author = {Sukhlecha, Akshita},
  title = {Bhasha-embed-v0},
  howpublished = {Hugging Face},
  month = {June},
  year = {2024},
  url = {https://huggingface.co/AkshitaS/bhasha-embed-v0}
}
```
## 📄 License
This project is licensed under the Apache 2.0 license.