OASIS-code-embedding-1.5B Open-Source Code Embedding Model: Enhance Code Search Efficiency and Accuracy

OASIS Code Embedding 1.5B

Developed by Kwaipilot

OASIS is a state-of-the-art code embedding model developed by Kwaipilot, integrating repository-level program analysis, the OASIS-instruct data synthesis algorithm, and a specialized fusion loss function, setting new benchmarks in code search efficiency and accuracy.

Text Embedding

Safetensors

Open Source License: MIT #Code Semantic Retrieval #Cross-language Code Understanding #Repository-level Program Analysis

Downloads: 4,576

Release Time: 3/11/2025

Model Overview

OASIS is an embedding model optimized for code retrieval systems, excelling in semantic understanding and retrieving code snippets across different programming environments.

Model Features

Repository-level Program Analysis

Enhances understanding of code context by analyzing entire code repositories.

OASIS-instruct Data Synthesis

Uses a proprietary algorithm to generate high-quality training data, improving model generalization.

Specialized Fusion Loss Function

Optimizes the training objective function to enhance code search accuracy.

Multi-language Support

Supports code embedding and retrieval for multiple mainstream programming languages.

Model Capabilities

Code Semantic Understanding

Code Snippet Retrieval

Code Similarity Calculation

Cross-language Code Search

Use Cases

Development Tools

Code Search Engine

Builds a semantic code search system.

Reports the highest average score among the compared models on the code search benchmarks listed under Performance.

Code Recommendation System

Recommends relevant code snippets to developers.

Improves developer productivity.

Education

Programming Learning Aid

Helps students find and understand relevant code examples.

🚀 Kwaipilot OASIS-1.5B

Kwaipilot OASIS-1.5B is a code embedding model that combines repository-level program analysis, the OASIS-instruct data synthesis algorithm, and a specialized fusion loss function to improve code search efficiency and accuracy. It is aimed at developers and researchers working on code retrieval systems.

🚀 Quick Start

Direct Usage

pip install -U torch
pip install -U transformers

⚠️ Important Note

Avoid PyTorch 2.5.0 (torch==2.5.0) when loading the model with torch_dtype=torch.bfloat16. For optimal performance and stability, use PyTorch 2.4.1 or earlier, or upgrade to 2.5.1 or later.
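The version constraint in the note above can be checked before choosing a dtype. The following is a minimal sketch, not part of the model's own tooling: the helper name `bf16_load_is_safe` is hypothetical, and falling back to float32 on the affected release is an assumption rather than a documented recommendation.

```python
import re


def bf16_load_is_safe(torch_version: str) -> bool:
    """Return False only for the 2.5.0 release flagged in the note above.

    Hypothetical helper: it parses a version string, so it runs without torch.
    """
    # Keep the leading "major.minor.patch" and ignore build suffixes such as "+cu121".
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", torch_version)
    if match is None:
        raise ValueError(f"unrecognized version string: {torch_version!r}")
    return tuple(int(part) for part in match.groups()) != (2, 5, 0)


# Typical use (requires torch to be installed; the float32 fallback is an assumption):
#   import torch
#   dtype = torch.bfloat16 if bf16_load_is_safe(torch.__version__) else torch.float32
```

The function only inspects the version string, so the same check can run in a setup script before the model weights are downloaded.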

Sentence Transformers

First install the Sentence Transformers library:

pip install -U sentence-transformers

✨ Features

Unique Methods: Incorporates repository-level program analysis, the OASIS-instruct data synthesis algorithm, and a specialized fusion loss function.
Broad Understanding: Trained on a synthetic dataset created through repository-level analysis, ensuring understanding across different coding styles and languages.
State-of-the-Art Performance: Demonstrates excellent performance on recent code search benchmarks.

💻 Usage Examples

Basic Usage

import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoModel, AutoTokenizer

def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # If every sequence ends in a real token, the batch is left-padded (or unpadded).
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        # Right padding: index each row at its last non-padding position.
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

# Add query prompt (applied to queries only; code snippets are embedded as-is)
def get_query_prompt(query: str) -> str:
    query_description = 'Given a code search query, retrieve relevant code snippet that answer the query'
    prompt = f'Instruct: {query_description}\nQuery: {query}'
    return prompt

query = "How to do quicksort in python?"

code1 = """def bubble_sort(arr):
    n = len(arr)
    for i in range(n):
        swapped = False
        for j in range(1, n - i):
            if arr[j - 1] > arr[j]:
                arr[j - 1], arr[j] = arr[j], arr[j - 1]
                swapped = True
        if not swapped:
            break
    return arr"""
code2 = """def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[0]
        less = [x for x in arr[1:] if x <= pivot]
        greater = [x for x in arr[1:] if x > pivot]
        return quick_sort(less) + [pivot] + quick_sort(greater)"""
model = AutoModel.from_pretrained("Kwaipilot/OASIS-code-1.5B", output_hidden_states=True)
tokenizer = AutoTokenizer.from_pretrained("Kwaipilot/OASIS-code-1.5B")

# Tokenize and run inference (gradients are not needed to extract embeddings)
inputs = tokenizer([get_query_prompt(query), code1, code2], max_length=1024, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
# Last token pooling
embeddings = last_token_pool(outputs.hidden_states[-1], inputs['attention_mask'])
print(embeddings.shape)
# torch.Size([3, 1536])
embeddings = F.normalize(embeddings, dim=1, p=2)
similarity = embeddings @ embeddings.T
print(similarity[0, 1:])
# tensor([0.6895, 0.8240])
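The `last_token_pool` function above picks, for each sequence, the hidden state of the final non-padding token. The sketch below illustrates that index selection on plain Python lists so the logic can be followed without tensors. It is a simplified per-row version (the original decides left versus right padding once for the whole batch), and the name `last_token_index` is hypothetical.

```python
def last_token_index(mask_row: list[int]) -> int:
    """Index of the final real (non-padding) token in one attention-mask row."""
    if mask_row[-1] == 1:
        # Left padding (or no padding): the final position is already a real token.
        return len(mask_row) - 1
    # Right padding: real tokens come first, so count them and subtract one.
    return sum(mask_row) - 1


# Two toy masks for a four-token window: 1 marks a real token, 0 marks padding.
left_padded = [0, 0, 1, 1]
right_padded = [1, 1, 0, 0]
print(last_token_index(left_padded), last_token_index(right_padded))
```

With consistent padding across a batch, this per-row rule selects the same positions as the batched tensor indexing in `last_token_pool`.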

Advanced Usage

from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
# Optional: pass model_kwargs={"torch_dtype": torch.bfloat16} (requires `import torch`)
model = SentenceTransformer("Kwaipilot/OASIS-code-1.5B")
query = "How to do quicksort in python?"
code1 = """def bubble_sort(arr):
    n = len(arr)
    for i in range(n):
        swapped = False
        for j in range(1, n - i):
            if arr[j - 1] > arr[j]:
                arr[j - 1], arr[j] = arr[j], arr[j - 1]
                swapped = True
        if not swapped:
            break
    return arr"""
code2 = """def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[0]
        less = [x for x in arr[1:] if x <= pivot]
        greater = [x for x in arr[1:] if x > pivot]
        return quick_sort(less) + [pivot] + quick_sort(greater)"""
# Run inference
query_embedding = model.encode([query], prompt_name="query")
code_embeddings = model.encode([code1, code2])
print(code_embeddings.shape)
# (2, 1536)
# Get the similarity scores for the embeddings
print(model.similarity(query_embedding[0], code_embeddings[0]))
print(model.similarity(query_embedding[0], code_embeddings[1]))
# tensor([[0.6895]])
# tensor([[0.8240]])
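In a retrieval setting, the similarity scores are usually sorted to rank candidate snippets for a query. The following is a minimal sketch of that step in plain Python. The three-dimensional vectors are made-up toy values, not real model embeddings, and the helper names `cosine` and `rank` are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query_vec: list[float], candidates: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for the query and code embeddings (illustrative only).
query_vec = [1.0, 0.0, 1.0]
candidates = {
    "bubble_sort": [1.0, 1.0, 0.0],
    "quick_sort": [1.0, 0.0, 0.5],
}
ranked = rank(query_vec, candidates)
print(ranked[0][0])  # name of the best-matching candidate
```

With real embeddings, the vectors returned by `model.encode` would replace the toy lists, and the ranking step stays the same.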

📚 Documentation

Model Details

Model Name: OASIS (Order-Augmented Strategy for Improved Code Search)

Introduction

OASIS is a state-of-the-art code embedding model developed by Kwaipilot. This model incorporates unique, proprietary methods including repository-level program analysis, the OASIS-instruct data synthesis algorithm, and a specialized fusion loss function, setting new benchmarks in code search efficiency and accuracy.

Intended Use

This model is ideal for developers and researchers engaged in enhancing code retrieval systems. OASIS excels in scenarios requiring semantic understanding and retrieval of code snippets within varied programming contexts.

Training and Performance

OASIS was trained on a synthetic dataset created through repository-level analysis, ensuring broad understanding across different coding styles and languages. It has demonstrated state-of-the-art performance on recent code search benchmarks.

Our preprint is now available at OASIS-arxiv.

Performance

Property	Details
Model Type	Code Embedding Model
Training Data	Synthetic dataset created through repository-level analysis

Model	Size	CoSQA	AdvTest	CSN-Py	CSN-Ja	CSN-JS	CSN-PHP	CSN-Go	CSN-Ruby	Avg
OpenAI-Embedding-Ada-002	Unknown	0.4423	0.3808	0.6802	0.7149	0.6750	0.6062	0.8563	0.7472	0.6378
OpenAI-Text-embedding-3-large	Unknown	0.5538	0.4684	0.7084	0.7292	0.6813	0.5959	0.8764	0.7525	0.6707
jina-embeddings-v2-base-code	161M	0.6837	0.385	0.6634	0.6803	0.6304	0.5701	0.8595	0.7095	0.6477
CodeSage-large	1.3B	0.4753	0.5267	0.7077	0.7021	0.695	0.6133	0.8371	0.7192	0.6595
CodeFuse-CGE-Small	3.8B	0.5619	0.4639	0.6958	0.6863	0.6564	0.6133	0.8637	0.7341	0.6594
OASIS-code-1.5B	1.5B	0.5577	0.5727	0.7369	0.7397	0.6980	0.6384	0.8821	0.7547	0.6975
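The Avg column appears to be the unweighted mean of the eight benchmark columns. That reading is an inference from the numbers, not something the table states. The sketch below checks it for the OASIS-code-1.5B row, with scores copied from the table and a small tolerance because the published values are rounded to four decimals.

```python
# Scores copied from the OASIS-code-1.5B row of the table above.
oasis_scores = {
    "CoSQA": 0.5577,
    "AdvTest": 0.5727,
    "CSN-Py": 0.7369,
    "CSN-Ja": 0.7397,
    "CSN-JS": 0.6980,
    "CSN-PHP": 0.6384,
    "CSN-Go": 0.8821,
    "CSN-Ruby": 0.7547,
}
reported_avg = 0.6975

# Unweighted mean over the eight benchmarks (assumed definition of "Avg").
mean = sum(oasis_scores.values()) / len(oasis_scores)

# Allow for rounding in the published four-decimal figures.
matches = abs(mean - reported_avg) < 5e-4
print(f"{mean:.6f}", matches)
```

The same check can be repeated for the other rows by swapping in their scores and reported averages.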

🔧 Technical Details

OASIS was trained on a synthetic dataset created through repository-level analysis, ensuring broad understanding across different coding styles and languages. It uses unique methods such as repository-level program analysis, the OASIS-instruct data synthesis algorithm, and a specialized fusion loss function.

📄 License

This project is licensed under the MIT license.

News 📢

🔥 [2025/03/12] Our latest Code Embedding Model OASIS-code-1.5B is now released.
🔥 [2025/03/12] Our preprint is now available at OASIS-arxiv.

BibTeX

@misc{kwaipilotoasis,
  title = {Optimized Augmentation Strategy for Improved code Search},
  author = {Kwaipilot team},
  year = {2024},
}
