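🚀 Multilingual E5 Base Model (Sentence Transformers)
This is the sentence-transformers version of the intfloat/multilingual-e5-base model. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search.
🚀 Quick Start
Install dependencies
Before using this model, install sentence-transformers: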
pip install -U sentence-transformers
Usage examples
Basic usage
from sentence_transformers import SentenceTransformer
sentences = ['query: how much protein should a female eat',
'query: 南瓜的家常做法',
"passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
"passage: 1.清炒南瓜絲 原料:嫩南瓜半個 調料:蔥、鹽、白糖、雞精 做法: 1、南瓜用刀薄薄的削去表面一層皮,用勺子颳去瓤 2、擦成細絲(沒有擦菜板就用刀慢慢切成細絲) 3、鍋燒熱放油,入蔥花煸出香味 4、入南瓜絲快速翻炒一分鐘左右,放鹽、一點白糖和雞精調味出鍋 2.香蔥炒南瓜 原料:南瓜1只 調料:香蔥、蒜末、橄欖油、鹽 做法: 1、將南瓜去皮,切成片 2、油鍋8成熱後,將蒜末放入爆香 3、爆香後,將南瓜片放入,翻炒 4、在翻炒的同時,可以不時地往鍋里加水,但不要太多 5、放入鹽,炒勻 6、南瓜差不多軟和綿了之後,就可以關火 7、撒入香蔥,即可出鍋"]
model = SentenceTransformer('embaas/sentence-transformers-multilingual-e5-base')
embeddings = model.encode(sentences)
print(embeddings)
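Every input in the example above starts with either `query: ` or `passage: `. The following is a minimal sketch of a helper that applies those prefixes consistently. The name `add_prefix` is hypothetical and is not part of sentence-transformers; the sample passage text is made up for illustration.

```python
from typing import List

# Prefixes as they appear in the example inputs above.
VALID_ROLES = ("query", "passage")


def add_prefix(texts: List[str], role: str) -> List[str]:
    """Prepend 'query: ' or 'passage: ' to each text (hypothetical helper)."""
    if role not in VALID_ROLES:
        raise ValueError(f"role must be one of {VALID_ROLES}, got {role!r}")
    prefix = f"{role}: "
    # Avoid double-prefixing inputs that already carry the prefix.
    return [t if t.startswith(prefix) else prefix + t for t in texts]


queries = add_prefix(["how much protein should a female eat"], "query")
passages = add_prefix(["Protein needs vary by age and activity."], "passage")
print(queries + passages)
```

The resulting list can be passed to `model.encode` in the same way as the `sentences` list above.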
Advanced usage
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Zero out padding positions so they do not contribute to the sum.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    # Divide by the number of real tokens in each sequence.
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


input_texts = ['query: how much protein should a female eat',
'query: 南瓜的家常做法',
"passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
"passage: 1.清炒南瓜絲 原料:嫩南瓜半個 調料:蔥、鹽、白糖、雞精 做法: 1、南瓜用刀薄薄的削去表面一層皮,用勺子颳去瓤 2、擦成細絲(沒有擦菜板就用刀慢慢切成細絲) 3、鍋燒熱放油,入蔥花煸出香味 4、入南瓜絲快速翻炒一分鐘左右,放鹽、一點白糖和雞精調味出鍋 2.香蔥炒南瓜 原料:南瓜1只 調料:香蔥、蒜末、橄欖油、鹽 做法: 1、將南瓜去皮,切成片 2、油鍋8成熱後,將蒜末放入爆香 3、爆香後,將南瓜片放入,翻炒 4、在翻炒的同時,可以不時地往鍋里加水,但不要太多 5、放入鹽,炒勻 6、南瓜差不多軟和綿了之後,就可以關火 7、撒入香蔥,即可出鍋"]
tokenizer = AutoTokenizer.from_pretrained('intfloat/multilingual-e5-base')
model = AutoModel.from_pretrained('intfloat/multilingual-e5-base')
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
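Because the embeddings are L2-normalized before the matrix product, each entry of `scores` is a cosine similarity scaled by 100. As a rough sketch of how such scores can be used to pick the best passage for each query, here is a NumPy version on toy vectors. The function names and the 3-dimensional vectors are illustrative assumptions, not model outputs.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def best_passage_per_query(query_emb: np.ndarray, passage_emb: np.ndarray):
    """Return (scores, best) where scores[i, j] = 100 * cos(query i, passage j)."""
    scores = (l2_normalize(query_emb) @ l2_normalize(passage_emb).T) * 100
    return scores, scores.argmax(axis=1)


# Toy 3-dimensional vectors standing in for real 768-dimensional embeddings.
toy_queries = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
toy_passages = np.array([[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]])
scores, best = best_passage_per_query(toy_queries, toy_passages)
print(best.tolist())  # [0, 1]
```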
Encoding with the API
You can use the embaas API to encode your input. Get your free API key from embaas.io.
import requests
url = "https://api.embaas.io/v1/embeddings/"
headers = {
"Content-Type": "application/json",
"Authorization": "Bearer ${YOUR_API_KEY}"
}
data = {
"texts": ["This is an example sentence.", "Here is another sentence."],
"instruction": "query",
"model": "multilingual-e5-base"
}
response = requests.post(url, json=data, headers=headers)
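The request above stores the response without inspecting it. The sketch below shows one way to keep the API key out of the source code and to fail loudly on HTTP errors. The helper names and the `EMBAAS_API_KEY` environment variable are assumptions for illustration; the response schema is not described in this document, so the decoded JSON is returned as-is.

```python
import os
from typing import Dict, List, Tuple

API_URL = "https://api.embaas.io/v1/embeddings/"  # endpoint from the example above


def build_request(texts: List[str], instruction: str,
                  api_key: str) -> Tuple[Dict[str, str], Dict[str, object]]:
    """Return (headers, payload) matching the example request (hypothetical helper)."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    payload = {
        "texts": texts,
        "instruction": instruction,
        "model": "multilingual-e5-base",
    }
    return headers, payload


def embed(texts: List[str], instruction: str = "query") -> object:
    """Send the request and return the decoded JSON body, whatever its shape."""
    import requests  # imported here so build_request works without it

    # EMBAAS_API_KEY is an assumed environment variable name.
    headers, payload = build_request(texts, instruction, os.environ["EMBAAS_API_KEY"])
    response = requests.post(API_URL, json=payload, headers=headers, timeout=30)
    response.raise_for_status()  # raise on 4xx/5xx instead of continuing silently
    return response.json()
```

Calling `embed(["This is an example sentence."])` would then return whatever JSON the service sends back.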
📚 Detailed documentation
Evaluation results
You can find the MTEB evaluation results here.
Full model architecture
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
  (2): Normalize()
)
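The architecture chains a Transformer encoder, a mean-pooling layer (`pooling_mode_mean_tokens: True`), and a normalization layer. To illustrate what stages (1) and (2) do to the token embeddings, here is a small NumPy sketch on toy data. It is not the library's implementation; the function name and the 2-dimensional vectors are assumptions chosen so the arithmetic is easy to follow.

```python
import numpy as np


def mean_pool_and_normalize(token_emb: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """token_emb: (batch, seq, dim); attention_mask: (batch, seq) of 0/1."""
    mask = attention_mask[..., None].astype(token_emb.dtype)
    # Stage (1): average only over real tokens, ignoring padding positions.
    pooled = (token_emb * mask).sum(axis=1) / mask.sum(axis=1)
    # Stage (2): scale each sentence vector to unit L2 norm.
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)


# One sentence, three token slots (the last is padding), 2-dimensional toy embeddings.
tokens = np.array([[[3.0, 0.0], [3.0, 8.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
print(mean_pool_and_normalize(tokens, mask).tolist())  # [[0.6, 0.8]]
```

The padded slot is ignored, the two real tokens average to `[3, 4]`, and dividing by the norm `5` gives a unit-length vector.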
Citation and authors
No citation or author details are provided in this document.