SFR-Embedding-Code-2B_R: A General-Purpose Embedding Model for Multilingual, Multi-task Code and Text Retrieval

SFR Embedding Code 2B R

Developed by Salesforce

A general-purpose embedding model developed by Salesforce Research, suitable for multilingual and multi-task code and text retrieval, excelling in various code retrieval tasks.

Text Embedding

Transformers

Other #Code Retrieval #Multilingual Support #2 Billion Parameters

Downloads 6,977

Release Time: 1/17/2025

Model Overview

SFR-Embedding-Code is a series of general-purpose embedding models suitable for multilingual and multi-task code and text retrieval. In multiple code retrieval tasks, this model demonstrates superior performance compared to various open-source code embedding models.

Model Features

Multilingual Support

Suitable for multilingual and multi-task code and text retrieval.

High-Performance Retrieval

Outperforms other open-source code embedding models in multiple code retrieval tasks.

Large Parameter Scale

2 billion parameters. On the CoIR benchmark, the 2B model reaches an average NDCG@10 of 67.4, compared with 61.9 for the 400M variant.

Model Capabilities

Code Retrieval

Text Retrieval

Multilingual Processing

Use Cases

Code Retrieval

Code Snippet Retrieval

Retrieve relevant code snippets based on natural language queries.

Achieves an average NDCG@10 of 67.4 on the CoIR benchmark.

Text Retrieval

Technical Document Retrieval

Retrieve relevant documents or solutions based on technical questions.

🚀 Salesforce/SFR-Embedding-Code-2B_R

SFR-Embedding by Salesforce Research.

Salesforce/SFR-Embedding-Code is a generalist embedding model family designed for multilingual and multi-task code and text retrieval. It outperforms various open-source code embedding models on multiple code retrieval tasks.

Check out our paper for more details!

We also offer a 400M-size model, Salesforce/SFR-Embedding-Code-400M_R.

🚀 Quick Start

Ethical Considerations

This release is solely for research purposes to support an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream applications. We strongly advise users to assess and address potential issues related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and follow best practices when choosing use cases, especially in high-risk scenarios where errors or misuse could significantly affect people's lives, rights, or safety. For more guidance on use cases, refer to our AUP and AI AUP.

License Statement

Users are required to assess their own obligations or responsibilities under the corresponding licenses or terms and conditions of the original datasets and data. This release is for research purposes only to support an academic paper.

This released model is a fine-tuned version of Gemma. Gemma is provided under and subject to the Gemma Terms of Use found at ai.google.dev/gemma/terms. Additionally, the use of this model is restricted as set forth in the Gemma Prohibited Use Policy at ai.google.dev/gemma/prohibited_use_policy ("Prohibited Use Policy"), which is hereby incorporated by reference into this Agreement.

Performance on CoIR Benchmark

Property	Details
Model Type	Salesforce/SFR-Embedding-Code
Training Data	Not specified

Model	Model Size	CoIR AVG (NDCG@10)
SFR-Embedding-Code	2B	67.4
CodeSage-Large-v2	1.3B	64.2
CodeSage-Large	1.3B	61.0
SFR-Embedding-Code	400M	61.9
CodeRankEmbed	137M	60.1
CodeSage-Base	356M	57.5
Voyage-Code-002	-	56.3
CodeSage-Small	130M	54.4
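
For readers unfamiliar with the metric in the table above, NDCG@10 discounts each relevant result by the logarithm of its rank and then normalizes by the score of the best possible ordering. The following is a minimal sketch of the standard formula with linear gain, not the CoIR evaluation code. The function names and the toy relevance labels are illustrative assumptions, and the sketch assumes the input list contains labels for every judged document.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Toy example (hypothetical labels): relevance of retrieved items in ranked order.
ranked_relevances = [0, 1, 0, 0, 1]
print(round(ndcg_at_k(ranked_relevances, k=10) * 100, 1))
```

Benchmark tables such as the one above typically report this value multiplied by 100 and averaged over queries and tasks.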

Team Members

SFR-Embedding Team († indicates co-leaders)

Ye Liu
Rui Meng
Shafiq Rayhan Joty
Silvio Savarese
Caiming Xiong †
Yingbo Zhou †
Semih Yavuz †

💻 Usage Examples

Basic Usage

Transformers

import torch.nn.functional as F
from transformers import AutoModel

# Each query must be accompanied by a corresponding instruction describing the task.
query_instruction_example = "Given Code or Text, retrieval relevant content"
queries = [
    "how to implement quick sort in Python?"
]

# No instruction needed for retrieval passages
passages = [
    "def quick_sort(arr):\n    if len(arr) <= 1:\n        return arr\n    pivot = arr[len(arr) // 2]\n    left = [x for x in arr if x < pivot]\n    middle = [x for x in arr if x == pivot]\n    right = [x for x in arr if x > pivot]\n    return quick_sort(left) + middle + quick_sort(right)",
    "def bubble_sort(arr):\n    n = len(arr)\n    for i in range(n):\n        for j in range(0, n-i-1):\n            if arr[j] > arr[j+1]:\n                arr[j], arr[j+1] = arr[j+1], arr[j]\n    return arr"
]

# Load the model; trust_remote_code=True allows the repository's custom model code to run
model = AutoModel.from_pretrained('Salesforce/SFR-Embedding-Code-2B_R', trust_remote_code=True)

# get the embeddings
max_length = 32768
query_embeddings = model.encode_queries(queries, instruction=query_instruction_example, max_length=max_length)
passage_embeddings = model.encode_corpus(passages, max_length=max_length)

# normalize embeddings
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
passage_embeddings = F.normalize(passage_embeddings, p=2, dim=1)

# Cosine similarity (embeddings are unit-normalized), scaled by 100
scores = (query_embeddings @ passage_embeddings.T) * 100
print(scores.tolist())
# [[69.26929473876953, 58.41606903076172]]
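
The normalize-then-multiply step in the example above amounts to cosine similarity scaled by 100. The sketch below reproduces that arithmetic in plain Python and then sorts passages by score. The toy 3-dimensional vectors and the helper names (`l2_normalize`, `scaled_cosine_scores`, `rank_passages`) are hypothetical stand-ins, not part of the model's API.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (the role F.normalize plays in the example)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def scaled_cosine_scores(query_vec, passage_vecs, scale=100.0):
    """The dot product of unit vectors is their cosine similarity; scale is cosmetic."""
    q = l2_normalize(query_vec)
    return [scale * sum(a * b for a, b in zip(q, l2_normalize(p))) for p in passage_vecs]

def rank_passages(scores, passages):
    """Return (score, passage) pairs sorted from most to least similar."""
    return sorted(zip(scores, passages), key=lambda pair: pair[0], reverse=True)

# Toy vectors standing in for real embeddings (hypothetical values).
query = [1.0, 2.0, 0.0]
passage_vectors = [[2.0, 4.0, 0.0], [0.0, 1.0, 1.0]]
scores = scaled_cosine_scores(query, passage_vectors)
for score, name in rank_passages(scores, ["quick_sort", "bubble_sort"]):
    print(f"{name}: {score:.2f}")
```

Because the scale factor is applied uniformly, it does not change the ranking; it only makes the scores easier to read.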

Sentence Transformers

from sentence_transformers import SentenceTransformer

# Each query must be accompanied by a corresponding instruction describing the task.
query_instruction_example = "Instruct: Given Code or Text, retrieval relevant content\nQuery: "
queries = ["how to implement quick sort in Python?"]

# No instruction needed for retrieval passages
passages = [
    "def quick_sort(arr):\n    if len(arr) <= 1:\n        return arr\n    pivot = arr[len(arr) // 2]\n    left = [x for x in arr if x < pivot]\n    middle = [x for x in arr if x == pivot]\n    right = [x for x in arr if x > pivot]\n    return quick_sort(left) + middle + quick_sort(right)",
    "def bubble_sort(arr):\n    n = len(arr)\n    for i in range(n):\n        for j in range(0, n-i-1):\n            if arr[j] > arr[j+1]:\n                arr[j], arr[j+1] = arr[j+1], arr[j]\n    return arr"
]

# Load the Sentence Transformer model, including pooling
model = SentenceTransformer('Salesforce/SFR-Embedding-Code-2B_R', trust_remote_code=True)

# Compute the embeddings for both queries and passages. Use 'prompt' for queries only
query_embeddings = model.encode(queries, prompt=query_instruction_example)
passage_embeddings = model.encode(passages)

# Compute the similarities between the queries and passages
similarities = model.similarity(query_embeddings, passage_embeddings)
print(similarities)
# tensor([[0.6927, 0.5842]])
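
The two examples pass the instruction differently: the Transformers example hands `encode_queries` the bare instruction, while the Sentence Transformers example supplies a full `Instruct: ...\nQuery: ` prefix through `prompt`, which Sentence Transformers prepends to each query text. The sketch below builds that prefix from the bare instruction. `build_query_prompt` is a hypothetical helper, and the idea that `encode_queries` applies an equivalent template internally is an assumption, suggested only by the matching scores (69.27 versus 0.6927) in the two examples.

```python
def build_query_prompt(instruction: str) -> str:
    """Wrap a bare task instruction in the 'Instruct: ...\\nQuery: ' prefix."""
    return f"Instruct: {instruction}\nQuery: "

# Instruction string copied verbatim from the examples above.
instruction = "Given Code or Text, retrieval relevant content"
prompt = build_query_prompt(instruction)

# The prefix is prepended to the query text before encoding.
print(repr(prompt + "how to implement quick sort in Python?"))
```

Keeping the instruction wording identical across both code paths avoids accidental differences in the resulting query embeddings.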

📄 License

This model is released under the CC-BY-NC-4.0 license.

📚 Documentation

Citation

@article{liu2024codexembed,
  title={CodeXEmbed: A Generalist Embedding Model Family for Multiligual and Multi-task Code Retrieval},
  author={Liu, Ye and Meng, Rui and Joty, Shafiq and Savarese, Silvio and Xiong, Caiming and Zhou, Yingbo and Yavuz, Semih},
  journal={arXiv preprint arXiv:2411.12644},
  year={2024}
}
