🚀 Snowflake Arctic-embed-m-v2.0
Snowflake Arctic-embed-m-v2.0 is an embedding model built for multilingual workloads and optimized for both retrieval quality and inference efficiency. It delivers high-quality multilingual text retrieval without sacrificing English performance, making it well suited to applications that need reliable, enterprise-scale multilingual search and retrieval.
🚀 Quick Start
Using Sentence Transformers
```python
from sentence_transformers import SentenceTransformer

model_name = 'Snowflake/snowflake-arctic-embed-m-v2.0'
model = SentenceTransformer(model_name, trust_remote_code=True)

queries = ['what is snowflake?', 'Where can I get the best tacos?']
documents = ['The Data Cloud!', 'Mexico City of Course!']

# Queries are encoded with the "query" prompt; documents need no prefix.
query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

# Score each query against all documents and print them ranked by similarity.
scores = model.similarity(query_embeddings, document_embeddings)
for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    print("Query:", query)
    for document, score in doc_score_pairs:
        print(score, document)
```
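The model also accepts long inputs (up to an 8192-token context window, see Key Features below). A minimal sketch, assuming you want to raise the Sentence Transformers sequence-length cap explicitly rather than rely on the model's default configuration:

```python
# Allow inputs up to 8192 tokens; longer sequences encode more slowly
# and use more memory, so only raise this if your documents need it.
model.max_seq_length = 8192

long_document = "..."  # placeholder: substitute a long document of your own
long_embedding = model.encode([long_document])
print(long_embedding.shape)  # (1, 768)
```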
Using Hugging Face Transformers
```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = 'Snowflake/snowflake-arctic-embed-m-v2.0'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, add_pooling_layer=False, trust_remote_code=True)
model.eval()

# Queries must carry the 'query: ' prefix; documents are tokenized as-is.
query_prefix = 'query: '
queries = ['what is snowflake?', 'Where can I get the best tacos?']
queries_with_prefix = ["{}{}".format(query_prefix, i) for i in queries]
query_tokens = tokenizer(queries_with_prefix, padding=True, truncation=True, return_tensors='pt', max_length=8192)

documents = ['The Data Cloud!', 'Mexico City of Course!']
document_tokens = tokenizer(documents, padding=True, truncation=True, return_tensors='pt', max_length=8192)

# CLS-token pooling, followed by L2 normalization.
with torch.no_grad():
    query_embeddings = model(**query_tokens)[0][:, 0]
    document_embeddings = model(**document_tokens)[0][:, 0]

query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)
document_embeddings = torch.nn.functional.normalize(document_embeddings, p=2, dim=1)

# With normalized embeddings, the dot product equals cosine similarity.
scores = torch.mm(query_embeddings, document_embeddings.transpose(0, 1))
for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    print("Query:", query)
    for document, score in doc_score_pairs:
        print(score, document)
```
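If a GPU is available, the same pipeline can run in half precision for faster encoding. This is an optional sketch building on the variables above; the fp16 speedup and its (usually negligible) accuracy impact are assumptions to verify on your own data:

```python
# Optional: GPU + float16 inference (assumes a CUDA device is available).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
if device == "cuda":
    model = model.half()  # fp16 weights; check retrieval quality on your data

query_tokens = query_tokens.to(device)
document_tokens = document_tokens.to(device)
with torch.no_grad():
    query_embeddings = model(**query_tokens)[0][:, 0]
    document_embeddings = model(**document_tokens)[0][:, 0]
```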
Using Hugging Face Transformers.js
```bash
npm i @huggingface/transformers
```

```js
import { pipeline, dot } from '@huggingface/transformers';

// Create a feature-extraction pipeline.
const extractor = await pipeline('feature-extraction', 'Snowflake/snowflake-arctic-embed-m-v2.0');

// The query carries the 'query: ' prefix; documents do not.
const sentences = [
    'query: what is snowflake?',
    'The Data Cloud!',
    'Mexico City of Course!',
];

// CLS pooling + normalization, then score each document against the query.
const output = await extractor(sentences, { normalize: true, pooling: 'cls' });
const [source_embeddings, ...document_embeddings] = output.tolist();
const similarities = document_embeddings.map(x => dot(source_embeddings, x));
console.log(similarities);
```
✨ Key Features
- Multilingual without compromise: Excels at both English and non-English retrieval, outperforming many leading open-source and proprietary models on benchmarks such as MTEB Retrieval, CLEF, and MIRACL.
- Inference-efficient: With 113M non-embedding parameters, inference is fast and practical for applications of any scale.
- Compression-friendly: Thanks to Matryoshka Representation Learning (MRL) and quantization-aware embedding training, retrieval quality stays high even at as little as 128 bytes per vector. Note that, as with the v1.5 models, this model's MRL dimensionality is 256, and high-quality 128-byte compression is achieved via 4-bit quantization (e.g., with a `pq256x4fs` fast-scan FAISS index, or with the example code released alongside the 1.5 models); see the sketch after this list.
- Long-context support: Built on GTE-multilingual-base, it supports a context window of up to 8192 tokens via RoPE.
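As a concrete illustration of the MRL and quantization point above, the sketch below truncates embeddings to 256 dimensions and stores them in a 4-bit fast-scan FAISS PQ index (the `pq256x4fs` index mentioned above). It is a minimal sketch, not the official example code; the random training vectors stand in for a representative corpus sample, which a real deployment would use instead.

```python
import faiss  # assumes faiss-cpu (or faiss-gpu) is installed
import numpy as np
from sentence_transformers import SentenceTransformer

# Keep only the first 256 MRL dimensions at encode time.
model = SentenceTransformer(
    "Snowflake/snowflake-arctic-embed-m-v2.0",
    trust_remote_code=True,
    truncate_dim=256,
)

documents = ['The Data Cloud!', 'Mexico City of Course!']
doc_emb = model.encode(documents, normalize_embeddings=True).astype(np.float32)

# 256 sub-quantizers x 4 bits each = 128 bytes per stored vector.
index = faiss.index_factory(256, "PQ256x4fs", faiss.METRIC_INNER_PRODUCT)

# PQ training needs enough points per centroid; random vectors are used here
# purely for illustration.
train_vecs = np.random.default_rng(0).standard_normal((4096, 256)).astype(np.float32)
faiss.normalize_L2(train_vecs)
index.train(train_vecs)
index.add(doc_emb)

query_emb = model.encode(
    ['what is snowflake?'], prompt_name="query", normalize_embeddings=True
).astype(np.float32)
scores, ids = index.search(query_emb, k=2)
print(scores, ids)
```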
📦 Quality Benchmarks
Unlike most other open-source models, Arctic-embed-m-v2.0 performs strongly on both English (via MTEB Retrieval) and multilingual (via MIRACL and CLEF) tasks. Average NDCG@10 per model across datasets:
| Model Name | Parameters | Non-Embedding Parameters | Dimension | BEIR (15) | MIRACL (4) | CLEF (Focus) | CLEF (Full) |
|---|---|---|---|---|---|---|---|
| snowflake-arctic-m-v2.0 | 305M | 113M | 768 | 55.4 | 55.2 | 51.7 | 53.9 |
| snowflake-arctic-m | 109M | 86M | 768 | 54.9 | 24.9 | 34.4 | 29.1 |
| me5 base | 560M | 303M | 1024 | 51.4 | 54.0 | 43.0 | 34.6 |
| bge-m3 (BAAI) | 568M | 303M | 1024 | 48.8 | 56.8 | 40.8 | 41.3 |
| gte (Alibaba) | 305M | 113M | 768 | 51.1 | 52.3 | 47.7 | 53.1 |
In addition, Arctic-embed-m-v2.0 embeddings compress well. Truncating vectors via MRL shrinks them 3x with only about a 3% loss in quality, and combining MRL truncation with int4 vector compression enables retrieval at 128 bytes per document.
| Model | Dimension | BEIR (15) | Relative Performance | MIRACL (4) | Relative Performance | CLEF (5) | Relative Performance | CLEF (Full) | Relative Performance |
|---|---|---|---|---|---|---|---|---|---|
| snowflake-arctic-m-v2.0 | 768 | 55.4 | N/A | 55.2 | N/A | 51.7 | N/A | 53.9 | N/A |
| snowflake-arctic-m-v2.0 | 256 | 54.4 | -1.81% | 54.0 | -2.17% | 50.6 | -2.13% | 52.3 | -3.06% |
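The relative-performance columns are computed against the 768-dimensional row; for example, on BEIR the 256-dimensional variant scores 54.4 versus 55.4, i.e. (54.4 - 55.4) / 55.4 ≈ -1.81%.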
📄 License
Arctic is licensed under Apache-2.0. The released models are free to use for commercial purposes.
📞 Contact Us
If you have questions or suggestions about this project, feel free to open an issue or submit a pull request. You can also reach Daniel Campos by email at daniel.campos@snowflake.com.