Snowflake Arctic Embed L open-source model: free sentence similarity and feature extraction for natural language processing

Snowflake Arctic Embed L

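Developed by Snowflake

Snowflake Arctic Embed L is a model focused on sentence similarity and feature extraction, suitable for a range of natural language processing tasks.

Text embedding

Transformers

License: Apache-2.0 #sentence-embedding #multi-task-evaluation #high-dimensional-feature-extraction

Downloads: 50.58k

Released: 4/12/2024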

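Model overview

The model is mainly used for sentence transformation, feature extraction, and sentence similarity computation. It has been evaluated on classification, clustering, retrieval, and semantic textual similarity (STS) tasks.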

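Model features

Multi-task support

Supports several natural language processing tasks, including classification, clustering, retrieval, and semantic textual similarity.

High performance

Performs well on multiple evaluation datasets, such as AmazonCounterfactualClassification and BIOSSES.

Easy integration

Supports Transformers.js for straightforward integration in front-end and server-side code.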

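Model capabilities

Sentence similarity computation

Feature extraction

Text classification

Text clustering

Information retrieval

Semantic textual similarity analysis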

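Use cases

E-commerce

Product review classification

Sentiment analysis and classification of Amazon product reviews.

Reaches 78.40% accuracy on the AmazonPolarityClassification task.

Academic research

Paper clustering

Topic clustering of arXiv and bioRxiv papers.

Reaches a V-measure of 47.46% on the ArxivClusteringP2P task.

Question answering systems

Question retrieval

Retrieval of related questions on Q&A platforms such as CQADupstack.

Reaches a MAP@10 of 49.43 on the CQADupstackAndroidRetrieval task.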

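🚀 Snowflake's Arctic-embed-l

snowflake-arctic-embed is a suite of text embedding models focused on high-quality retrieval optimized for performance; snowflake-arctic-embed-l is the largest model in the suite. The models target both accuracy and efficiency in text retrieval.

🚀 Quick Start

Environment setup

Make sure the required Python libraries, such as sentence-transformers and transformers, are installed.

Code example

The following example calls the snowflake-arctic-embed-l model through the sentence-transformers library: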

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l")

queries = ['what is snowflake?', 'Where can I get the best tacos?']
documents = ['The Data Cloud!', 'Mexico City of Course!']

query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

scores = query_embeddings @ document_embeddings.T
for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    # Print the documents and their scores, highest first
    print("Query:", query)
    for document, score in doc_score_pairs:
        print(score, document)
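
In the example above, `prompt_name="query"` applies a query-side prompt before encoding, while documents are encoded as-is. Below is a minimal sketch of the same idea done by hand. It assumes the prompt text is the query prefix used in the Hugging Face transformers example on this page; `add_query_prefix` is a hypothetical helper, not part of any library.

```python
# Assumption: this is the query prefix used in the transformers example on this page.
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "


def add_query_prefix(queries, prefix=QUERY_PREFIX):
    """Return a new list with the retrieval prefix prepended to each query."""
    return [prefix + q for q in queries]


queries = ["what is snowflake?", "Where can I get the best tacos?"]
prefixed_queries = add_query_prefix(queries)

# Documents are encoded without any prefix.
documents = ["The Data Cloud!", "Mexico City of Course!"]

for text in prefixed_queries:
    print(text)
```

The prefixed strings could then be passed to `model.encode(...)` without `prompt_name`. Whether that matches `prompt_name="query"` exactly depends on the prompts stored in the model's configuration.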

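✨ Key features

High-performance retrieval: on the MTEB/BEIR leaderboard, the models achieve state-of-the-art performance for each of their size variants.
Multiple model sizes: snowflake-arctic-embed-xs, snowflake-arctic-embed-s, snowflake-arctic-embed-m, snowflake-arctic-embed-m-long, and snowflake-arctic-embed-l cover different deployment needs.
Alternative to closed-source models: the largest model, snowflake-arctic-embed-l, can serve as a natural replacement for closed-source embedding models.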

📦 Installation

Using Sentence Transformers

pip install sentence-transformers

Using Hugging Face transformers

pip install transformers

Using Transformers.js

npm i @xenova/transformers

💻 Usage examples

Basic usage

Using Sentence Transformers

The Sentence Transformers example is the same as the one shown under Quick Start.

Using Hugging Face transformers

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Snowflake/snowflake-arctic-embed-l')
model = AutoModel.from_pretrained('Snowflake/snowflake-arctic-embed-l', add_pooling_layer=False)
model.eval()

query_prefix = 'Represent this sentence for searching relevant passages: '
queries  = ['what is snowflake?', 'Where can I get the best tacos?']
queries_with_prefix = ["{}{}".format(query_prefix, i) for i in queries]
query_tokens = tokenizer(queries_with_prefix, padding=True, truncation=True, return_tensors='pt', max_length=512)

documents = ['The Data Cloud!', 'Mexico City of Course!']
document_tokens =  tokenizer(documents, padding=True, truncation=True, return_tensors='pt', max_length=512)

# Compute token embeddings and keep the first ([CLS]) token of each sequence
with torch.no_grad():
    query_embeddings = model(**query_tokens)[0][:, 0]
    document_embeddings = model(**document_tokens)[0][:, 0]

# L2-normalize the embeddings so the dot product below is a cosine similarity
query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)
document_embeddings = torch.nn.functional.normalize(document_embeddings, p=2, dim=1)

scores = torch.mm(query_embeddings, document_embeddings.transpose(0, 1))
for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    # Print the documents and their scores, highest first
    print("Query:", query)
    for document, score in doc_score_pairs:
        print(score, document)
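
The three steps in the example above (take the first token's vector, L2-normalize, score with a dot product) can be illustrated without downloading the model. The following is a sketch in plain Python on made-up numbers. The helper names `cls_pool`, `l2_normalize`, and `dot` are hypothetical and stand in for the tensor operations used above.

```python
import math


def cls_pool(token_embeddings):
    """Take the first token's vector ([CLS]) from each sequence."""
    return [sequence[0] for sequence in token_embeddings]


def l2_normalize(vector):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy "model outputs" with made-up values, shaped [sequence][token][dimension].
query_tokens = [[[3.0, 4.0], [9.0, 9.0]]]            # 1 query, 2 tokens
document_tokens = [[[6.0, 8.0], [1.0, 0.0]],         # document 0
                   [[4.0, -3.0], [0.0, 1.0]]]        # document 1

query_vec = l2_normalize(cls_pool(query_tokens)[0])
doc_vecs = [l2_normalize(v) for v in cls_pool(document_tokens)]

# After normalization the dot product equals the cosine similarity.
scores = [dot(query_vec, d) for d in doc_vecs]
print(scores)  # approximately [1.0, 0.0]
```

Document 0 points in the same direction as the query and scores close to 1, while document 1 is orthogonal to it and scores close to 0. Vector length does not affect the result once the embeddings are normalized.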

Using Transformers.js

import { pipeline, dot } from '@xenova/transformers';

// Create a feature-extraction pipeline
const extractor = await pipeline('feature-extraction', 'Snowflake/snowflake-arctic-embed-l', {
    quantized: false, // Comment out this line to use the quantized version
});

// Generate sentence embeddings (the first sentence is the prefixed query)
const sentences = [
    'Represent this sentence for searching relevant passages: Where can I get the best tacos?',
    'The Data Cloud!',
    'Mexico City of Course!',
]
const output = await extractor(sentences, { normalize: true, pooling: 'cls' });

// Compute similarity scores between the query and each document
const [source_embeddings, ...document_embeddings ] = output.tolist();
const similarities = document_embeddings.map(x => dot(source_embeddings, x));
console.log(similarities); // [0.25145517380846977, 0.3865060421197194]

Advanced usage

Deploying an OpenAI-compatible API with Infinity

docker run --gpus all -v $PWD/data:/app/.cache -p "7997":"7997" \
michaelf34/infinity:0.0.70 \
v2 --model-id Snowflake/snowflake-arctic-embed-l --dtype float16 --batch-size 32 --engine torch --port 7997
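
Once the container is running, a client could request embeddings over HTTP. The following is a minimal sketch using only the Python standard library. It assumes the server is reachable at `http://localhost:7997` and exposes an OpenAI-style `POST /embeddings` route that returns a `data` list of `{index, embedding}` items. The route and response shape are assumptions and should be checked against the Infinity documentation for the deployed version. `build_request`, `parse_embeddings`, and `embed` are hypothetical helper names.

```python
import json
import urllib.request

# Assumptions: host, port, and route are not confirmed by this page.
BASE_URL = "http://localhost:7997"
MODEL_ID = "Snowflake/snowflake-arctic-embed-l"


def build_request(texts, base_url=BASE_URL, model=MODEL_ID):
    """Build (without sending) an OpenAI-style embeddings request."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings(response_json):
    """Extract vectors from an OpenAI-style response, ordered by index."""
    items = sorted(response_json["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed(texts):
    """Send the request; this only works while the server is running."""
    with urllib.request.urlopen(build_request(texts), timeout=30) as resp:
        return parse_embeddings(json.load(resp))


# Building the request does not contact the server.
request = build_request(["The Data Cloud!"])
print(request.get_method(), request.full_url)
```

Calling `embed(["The Data Cloud!"])` would return one vector per input text if the assumptions above hold. Query texts would still need the retrieval prefix added on the client side.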

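📚 Detailed documentation

Model introduction

snowflake-arctic-embed is a suite of text embedding models built on existing open-source text representation models (such as bert-base-uncased) and trained in a multi-stage pipeline to optimize retrieval performance.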

Model comparison

Name                            MTEB Retrieval Score (NDCG@10)   Parameters (millions)   Embedding dimension
snowflake-arctic-embed-xs       50.15                            22                      384
snowflake-arctic-embed-s        51.98                            33                      384
snowflake-arctic-embed-m        54.90                            110                     768
snowflake-arctic-embed-m-long   54.83                            137                     768
snowflake-arctic-embed-l        55.98                            335                     1024
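
The embedding dimension in the table above also determines how much space a vector index needs, which is one practical factor when choosing a model size. The following is a rough sketch of that arithmetic. It assumes vectors are stored as uncompressed float32 values (4 bytes each) and ignores any index overhead; `index_size_bytes` is a hypothetical helper.

```python
# Embedding dimensions copied from the comparison table above.
EMBEDDING_DIMS = {
    "snowflake-arctic-embed-xs": 384,
    "snowflake-arctic-embed-s": 384,
    "snowflake-arctic-embed-m": 768,
    "snowflake-arctic-embed-m-long": 768,
    "snowflake-arctic-embed-l": 1024,
}

BYTES_PER_FLOAT32 = 4  # assumption: uncompressed float32 storage


def index_size_bytes(num_documents, dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw storage for num_documents vectors of the given dimension."""
    return num_documents * dim * bytes_per_value


for name, dim in EMBEDDING_DIMS.items():
    gib = index_size_bytes(1_000_000, dim) / 2**30
    print(f"{name}: {gib:.2f} GiB per million documents")
```

Under these assumptions, the 1024-dimensional embeddings of snowflake-arctic-embed-l need 4,096,000,000 bytes per million documents, compared with 1,536,000,000 bytes for the 384-dimensional models.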