Multilingual-E5-base (sentence-transformers)
This model is the sentence-transformers version of intfloat/multilingual-e5-base. It maps sentences and paragraphs to a 768-dimensional dense vector space, which can be used for tasks such as clustering or semantic search.
Quick Start
Features
- Maps sentences & paragraphs to a 768-dimensional dense vector space.
- Suitable for tasks like clustering or semantic search.
Installation
Using this model becomes easy when you have sentence-transformers installed:
pip install -U sentence-transformers
Usage Examples
Basic Usage (Sentence-Transformers)
from sentence_transformers import SentenceTransformer
sentences = ['query: how much protein should a female eat',
             'query: 南瓜的家常做法',
             "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
             "passage: 1.清炒南瓜丝 原料:嫩南瓜半个 调料:葱、盐、白糖、鸡精 做法: 1、南瓜用刀薄薄的削去表面一层皮,用勺子刮去瓤 2、擦成细丝(没有擦菜板就用刀慢慢切成细丝) 3、锅烧热放油,入葱花煸出香味 4、入南瓜丝快速翻炒一分钟左右,放盐、一点白糖和鸡精调味出锅 2.香葱炒南瓜 原料:南瓜1只 调料:香葱、蒜末、橄榄油、盐 做法: 1、将南瓜去皮,切成片 2、油锅8成热后,将蒜末放入爆香 3、爆香后,将南瓜片放入,翻炒 4、在翻炒的同时,可以不时地往锅里加水,但不要太多 5、放入盐,炒匀 6、南瓜差不多软和绵了之后,就可以关火 7、撒入香葱,即可出锅"]
model = SentenceTransformer('embaas/sentence-transformers-multilingual-e5-base')
embeddings = model.encode(sentences)
print(embeddings)
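Every input in the example above starts with either `query: ` or `passage: `. A minimal sketch of a helper that applies these prefixes consistently is shown below; `add_prefix` is a hypothetical name introduced here for illustration and is not part of sentence-transformers.

```python
def add_prefix(texts, role):
    """Prepend an E5-style role prefix ("query" or "passage") to each text.

    Hypothetical helper: it only builds strings and does not call the model.
    """
    if role not in ("query", "passage"):
        raise ValueError(f"role must be 'query' or 'passage', got {role!r}")
    return [f"{role}: {text}" for text in texts]


queries = add_prefix(["how much protein should a female eat"], "query")
passages = add_prefix(["Protein needs vary with age and activity."], "passage")
print(queries[0])   # query: how much protein should a female eat
print(passages[0])  # passage: Protein needs vary with age and activity.
```

The resulting lists can be passed to `model.encode(...)` in the same way as `sentences` above.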
Advanced Usage (Hugging Face Transformers)
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel
def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Zero out padding positions, then average over the real tokens only.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
input_texts = ['query: how much protein should a female eat',
               'query: 南瓜的家常做法',
               "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
               "passage: 1.清炒南瓜丝 原料:嫩南瓜半个 调料:葱、盐、白糖、鸡精 做法: 1、南瓜用刀薄薄的削去表面一层皮,用勺子刮去瓤 2、擦成细丝(没有擦菜板就用刀慢慢切成细丝) 3、锅烧热放油,入葱花煸出香味 4、入南瓜丝快速翻炒一分钟左右,放盐、一点白糖和鸡精调味出锅 2.香葱炒南瓜 原料:南瓜1只 调料:香葱、蒜末、橄榄油、盐 做法: 1、将南瓜去皮,切成片 2、油锅8成热后,将蒜末放入爆香 3、爆香后,将南瓜片放入,翻炒 4、在翻炒的同时,可以不时地往锅里加水,但不要太多 5、放入盐,炒匀 6、南瓜差不多软和绵了之后,就可以关火 7、撒入香葱,即可出锅"]
tokenizer = AutoTokenizer.from_pretrained('intfloat/multilingual-e5-base')
model = AutoModel.from_pretrained('intfloat/multilingual-e5-base')
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
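Because the embeddings are L2-normalized, each dot product in the example above is a cosine similarity, scaled by 100. The first two inputs are queries and the last two are passages, so `scores.tolist()` is a 2x2 nested list with one row per query and one column per passage. The sketch below shows one way to pick the best passage for each query. The score values are made up for illustration, and `best_passage_per_query` is a hypothetical helper.

```python
# Made-up score matrix with the same layout as `scores.tolist()`:
# rows are queries, columns are passages. These are NOT real model outputs.
example_scores = [[90.0, 70.0],
                  [71.0, 89.0]]


def best_passage_per_query(score_rows):
    """Return, for each query row, the index of its highest-scoring passage."""
    return [max(range(len(row)), key=row.__getitem__) for row in score_rows]


best = best_passage_per_query(example_scores)
print(best)  # [0, 1]
```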
Using with API
You can use the embaas API to encode your input. Get a free API key from embaas.io.
import requests
url = "https://api.embaas.io/v1/embeddings/"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer ${YOUR_API_KEY}"
}
data = {
    "texts": ["This is an example sentence.", "Here is another sentence."],
    "instruction": "query",
    "model": "multilingual-e5-base"
}
response = requests.post(url, json=data, headers=headers)
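Note that `${YOUR_API_KEY}` in the snippet above is a placeholder: Python does not substitute it, so the literal text would be sent unless you replace it with your key. A minimal sketch of building the headers from an environment variable follows. The variable name `EMBAAS_API_KEY` and the helper `build_headers` are assumptions made for this example and are not defined by the embaas API.

```python
import os


def build_headers(api_key):
    """Build the request headers used above, with a real key filled in."""
    return {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }


# EMBAAS_API_KEY is a hypothetical variable name chosen for this sketch.
api_key = os.environ.get("EMBAAS_API_KEY", "")
headers = build_headers(api_key)
```

After sending the request, `response.raise_for_status()` raises an exception on an HTTP error status, which is a reasonable check to make before reading the body with `response.json()`.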
Documentation
Evaluation Results
You can find the MTEB results here.
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
(2): Normalize()
)
Technical Details
The model maps sentences and paragraphs to a 768-dimensional dense vector space. An XLMRobertaModel Transformer with a maximum sequence length of 512 encodes the input, a pooling layer averages the token embeddings (mean pooling), and a final Normalize layer scales each vector to unit L2 norm.
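The pooling and normalization steps can be illustrated with toy numbers. The sketch below is plain Python rather than the model's own code: it assumes a single sequence of 2-dimensional token vectors (the real model uses 768 dimensions), and the names `mean_pool` and `l2_normalize` are hypothetical.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, skipping padding."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    return [t / count for t in total]


def l2_normalize(vector):
    """Scale a vector so that its Euclidean length is 1."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


# Two real tokens and one padding token; the padding values are ignored.
tokens = [[3.0, 0.0], [3.0, 8.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)    # [3.0, 4.0]
embedding = l2_normalize(pooled)    # approximately [0.6, 0.8]
print(pooled, embedding)
```

Masking matters here: without it, the padding vector would dominate the average.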
License
No license information provided in the original document.