bge-m3_en_ru Open Source Model - Streamlined Vocabulary for Efficient English and Russian Embedding Tasks


Developed by TatonkaHF

A bge-m3 model optimized for English and Russian, featuring a streamlined vocabulary that retains only English and Russian words, reducing the vocabulary size to 21% of the original while maintaining the quality of embeddings for both languages. The total model parameters are 63.3% of the original version.

Text Embedding

Transformers

Supports Multiple Languages #English-Russian Bilingual Embedding #Vocabulary Optimization #Multi-granularity Semantic Encoding

Downloads 1,174

Release Time: 6/14/2024

Model Overview

This is a streamlined vocabulary version of the bge-m3 model, specifically optimized for English and Russian, suitable for sentence similarity computation and feature extraction tasks.

Model Features

Streamlined Vocabulary

Only English and Russian words are retained, reducing the vocabulary size to 21% of the original, with model parameters at 63.3% of the original.

Bilingual Support

Covers English and Russian only; embedding quality for these two languages is unchanged from the original model.

Efficient Embedding

Suited to sentence similarity computation and feature extraction, with fewer parameters to load than the original model.

Model Capabilities

Sentence Embedding

Feature Extraction

Sentence Similarity Computation

Use Cases

Natural Language Processing

Sentence Similarity Computation

Compute the similarity between two sentences, suitable for search, recommendation systems, and other scenarios.

Feature Extraction

Convert sentences into 1024-dimensional vectors for subsequent machine learning tasks.

🚀 BGE-M3 Model for English and Russian

This model is designed for sentence similarity tasks and provides feature extraction for both English and Russian. It is a version of [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) with a shrunken tokenizer that retains only English and Russian tokens in the vocabulary. As a result, the vocabulary size is 21% of the original and the total parameter count is 63.3% of the original, with no loss in the quality of English and Russian embeddings.
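The link between the two percentages can be checked with a little arithmetic: only the token embedding matrix shrinks, while the transformer layers stay the same size. The sketch below is a rough consistency check, not the author's calculation. The original vocabulary size (250,002), hidden size (1024), and total parameter count (about 568M) are assumptions about BAAI/bge-m3 that this card does not state, and `shrunk_param_fraction` is a hypothetical helper name.

```python
# Assumed figures for the original BAAI/bge-m3 (not stated in this card).
ORIGINAL_VOCAB_SIZE = 250_002          # assumed XLM-RoBERTa vocabulary size
HIDDEN_SIZE = 1024                     # matches word_embedding_dimension
ORIGINAL_TOTAL_PARAMS = 568_000_000    # approximate, assumed


def shrunk_param_fraction(vocab_fraction,
                          vocab_size=ORIGINAL_VOCAB_SIZE,
                          hidden_size=HIDDEN_SIZE,
                          total_params=ORIGINAL_TOTAL_PARAMS):
    """Fraction of parameters left after keeping `vocab_fraction` of the tokens.

    Only the token embedding matrix (vocab_size x hidden_size) is reduced;
    every other parameter is assumed to be untouched.
    """
    embedding_params = vocab_size * hidden_size
    other_params = total_params - embedding_params
    kept_embedding_params = round(vocab_size * vocab_fraction) * hidden_size
    return (other_params + kept_embedding_params) / total_params


if __name__ == "__main__":
    fraction = shrunk_param_fraction(0.21)
    print(f"Estimated share of original parameters: {fraction:.1%}")
```

Under these assumptions the estimate lands in the mid-60% range, close to the 63.3% stated above; the remaining gap is presumably due to rounding in the 21% figure or in the assumed totals.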

🚀 Quick Start

✨ Features

Shrunken-tokenizer version for English and Russian.
Maintains embedding quality with reduced vocabulary and parameters.

📦 Installation

To use this model, first install sentence-transformers:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

Using the sentence-transformers library:

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('TatonkaHF/bge-m3_en_ru')
embeddings = model.encode(sentences)
print(embeddings)
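
For the sentence similarity use case, the embeddings returned by `encode` are usually compared with cosine similarity. The following is a minimal sketch of that comparison in plain Python. `cosine_similarity` and `compare_sentences` are hypothetical helper names, not part of the model or of sentence-transformers, and `compare_sentences` downloads the model when called.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (lists or arrays)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def compare_sentences(sentence_a, sentence_b,
                      model_name="TatonkaHF/bge-m3_en_ru"):
    """Embed two sentences with the model and return their cosine similarity.

    Requires sentence-transformers and network access to fetch the model.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    emb_a, emb_b = model.encode([sentence_a, sentence_b])
    return float(cosine_similarity(emb_a, emb_b))
```

A call such as `compare_sentences("The weather is nice today", "Сегодня хорошая погода")` would score an English sentence against its Russian counterpart; values closer to 1 indicate closer meaning.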

Advanced Usage

Without the sentence-transformers library, using the transformers library directly:

from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('TatonkaHF/bge-m3_en_ru')
model = AutoModel.from_pretrained('TatonkaHF/bge-m3_en_ru')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
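
To make the arithmetic inside `mean_pooling` concrete, here is the same masked average written in plain Python for a single sentence. This is an illustration on toy numbers rather than a replacement for the torch version; `masked_mean` is a hypothetical name.

```python
def masked_mean(token_vectors, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0 (padding)."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                totals[i] += value
    # Same guard as torch.clamp(..., min=1e-9): avoid dividing by zero
    # when every position is masked out.
    divisor = max(count, 1e-9)
    return [total / divisor for total in totals]


if __name__ == "__main__":
    # Two real tokens followed by one padding token.
    tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
    mask = [1, 1, 0]
    print(masked_mean(tokens, mask))  # → [2.0, 3.0]
```

The padding vector `[9.0, 9.0]` does not affect the result, which is why the attention mask has to be passed to the pooling step.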

📚 Documentation

Specs

Other bge-m3 models have also been shrunk in the same way.

Property	Details
Model Type	Shrunken-tokenizer version for English and Russian
Related Models	[bge-m3-retromae_en_ru](https://huggingface.co/TatonkaHF/bge-m3-retromae_en_ru), [bge-m3-unsupervised_en_ru](https://huggingface.co/TatonkaHF/bge-m3-unsupervised_en_ru), [bge-m3_en_ru](https://huggingface.co/TatonkaHF/bge-m3_en_ru)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)

🔧 Technical Details

The model is based on the concept of self-knowledge distillation. The tokenizer shrinking process is inspired by [LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) and [https://discuss.huggingface.co/t/tokenizer-shrinking-recipes/8564/1](https://discuss.huggingface.co/t/tokenizer-shrinking-recipes/8564/1).
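
The general idea behind tokenizer shrinking can be sketched as follows: pick the tokens to keep, assign them new consecutive ids, and copy only the matching rows of the embedding matrix. The code below is a toy illustration under stated assumptions, not the author's recipe. A simple script-based filter (Latin or Cyrillic letters) stands in for whatever selection criterion was actually used, the embedding matrix is a plain list of rows, and `is_en_ru_token` and `shrink_vocab` are hypothetical names.

```python
import unicodedata


def is_en_ru_token(token):
    """True if every alphabetic character in the token is Latin or Cyrillic.

    Digits, punctuation, and the SentencePiece marker are always allowed.
    """
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if not (name.startswith("LATIN") or name.startswith("CYRILLIC")):
                return False
    return True


def shrink_vocab(vocab, embeddings):
    """Drop tokens outside the kept scripts and remap the embedding rows.

    vocab:      dict mapping token string -> old id
    embeddings: list of rows, where embeddings[old_id] is the token's vector
    Returns (new_vocab, new_embeddings) with consecutive new ids that
    preserve the original ordering of the kept tokens.
    """
    kept = sorted(
        (old_id, token) for token, old_id in vocab.items()
        if is_en_ru_token(token)
    )
    new_vocab = {token: new_id for new_id, (_, token) in enumerate(kept)}
    new_embeddings = [embeddings[old_id] for old_id, _ in kept]
    return new_vocab, new_embeddings


if __name__ == "__main__":
    toy_vocab = {"<s>": 0, "▁hello": 1, "▁привет": 2, "你好": 3, "▁123": 4}
    toy_embeddings = [[0.0], [1.0], [2.0], [3.0], [4.0]]
    new_vocab, new_embeddings = shrink_vocab(toy_vocab, toy_embeddings)
    print(new_vocab)
    print(new_embeddings)
```

Because the kept rows are copied unchanged, a retained token maps to the same vector before and after shrinking, which is consistent with the claim that embedding quality for the two languages is preserved.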

📄 License

This model is licensed under the MIT license.

Reference

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, Zheng Liu. BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation.
