Kwaipilot OASIS-1.5B
Kwaipilot OASIS-1.5B is a code embedding model that combines repository-level program analysis, the OASIS-instruct data synthesis algorithm, and a specialized fusion loss function to improve code search efficiency and accuracy. It is aimed at developers and researchers building code retrieval systems.
Quick Start
Direct Usage
pip install -U torch
pip install -U transformers
Important Note
Avoid PyTorch 2.5.0 when loading the model with torch_dtype=torch.bfloat16. For optimal performance and stability, use PyTorch 2.4.1 or earlier, or upgrade to 2.5.1 or later.
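The version constraint above can be checked before loading the model. The following is a minimal sketch: `is_affected_torch_version` is a hypothetical helper introduced for illustration (it is not part of PyTorch or this repository), and it only inspects a version string.

```python
import re


def is_affected_torch_version(version: str) -> bool:
    """Return True if `version` is PyTorch 2.5.0.

    Hypothetical helper for illustration. Local build suffixes such as
    "+cu121" are ignored; unparseable strings are treated as unaffected.
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        return False
    return tuple(int(part) for part in match.groups()) == (2, 5, 0)


# In practice, pass torch.__version__; literal strings are used here so the
# sketch runs without PyTorch installed.
for candidate in ("2.4.1", "2.5.0+cu121", "2.5.1"):
    status = "avoid with bfloat16" if is_affected_torch_version(candidate) else "ok"
    print(f"torch {candidate}: {status}")
```

A check like this could run once at startup, before the model is loaded with `torch_dtype=torch.bfloat16`.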
Sentence Transformers
First install the Sentence Transformers library:
pip install -U sentence-transformers
Features
- Unique Methods: Incorporates repository-level program analysis, the OASIS-instruct data synthesis algorithm, and a specialized fusion loss function.
- Broad Understanding: Trained on a synthetic dataset created through repository-level analysis, ensuring understanding across different coding styles and languages.
- State-of-the-Art Performance: Demonstrates state-of-the-art results on recent code search benchmarks (see the Performance table).
Usage Examples
Basic Usage
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoModel, AutoTokenizer


def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Use the hidden state of the last non-padding token as the sequence embedding.
    # If every row ends with a real token, the batch is left-padded (or unpadded).
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        # Right padding: index each row at its own last real token.
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_query_prompt(query: str) -> str:
    # Queries are wrapped in an instruction prompt; code snippets are embedded as-is.
    query_description = 'Given a code search query, retrieve relevant code snippet that answer the query'
    prompt = f'Instruct: {query_description}\nQuery: {query}'
    return prompt


query = "How to do quicksort in python?"

code1 = """def bubble_sort(arr):
    n = len(arr)
    for i in range(n):
        swapped = False
        for j in range(1, n - i):
            if arr[j - 1] > arr[j]:
                arr[j - 1], arr[j] = arr[j], arr[j - 1]
                swapped = True
        if not swapped:
            break
    return arr"""

code2 = """def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[0]
        less = [x for x in arr[1:] if x <= pivot]
        greater = [x for x in arr[1:] if x > pivot]
        return quick_sort(less) + [pivot] + quick_sort(greater)"""

model = AutoModel.from_pretrained("Kwaipilot/OASIS-code-1.5B", output_hidden_states=True)
tokenizer = AutoTokenizer.from_pretrained("Kwaipilot/OASIS-code-1.5B")

# Embed the prompted query together with the two raw code snippets.
inputs = tokenizer([get_query_prompt(query), code1, code2], max_length=1024, padding=True, truncation=True, return_tensors='pt')

# Inference only, so gradient tracking is not needed.
with torch.no_grad():
    outputs = model(**inputs)

embeddings = last_token_pool(outputs.hidden_states[-1], inputs['attention_mask'])
print(embeddings.shape)

# L2-normalize so that the dot product equals cosine similarity.
embeddings = F.normalize(embeddings, dim=1, p=2)
similarity = embeddings @ embeddings.T

# Row 0 is the query; columns 1 and 2 are its similarities to code1 and code2.
print(similarity[0, 1:])
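To make the pooling step concrete, the sketch below reproduces the index selection of `last_token_pool` on plain Python lists, one attention-mask row at a time. `last_token_index` is a hypothetical helper introduced for illustration; it assumes each mask row is non-empty and has padding on one side only.

```python
def last_token_index(mask_row: list[int]) -> int:
    """Return the position of the last real token in one attention-mask row.

    Hypothetical helper: assumes 1 marks a real token, 0 marks padding, and
    padding sits entirely on one side of the row.
    """
    if mask_row[-1] == 1:
        # Left padding (or no padding): the final position is a real token.
        return len(mask_row) - 1
    # Right padding: real tokens occupy positions 0 .. sum(mask_row) - 1.
    return sum(mask_row) - 1


right_padded = [1, 1, 1, 0, 0]
left_padded = [0, 0, 1, 1, 1]
print(last_token_index(right_padded))  # 2
print(last_token_index(left_padded))   # 4
```

`last_token_pool` makes the left-padding decision once per batch rather than per row; under the one-sided padding assumption, both approaches select the same position.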
Advanced Usage
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Kwaipilot/OASIS-code-1.5B")

query = "How to do quicksort in python?"

code1 = """def bubble_sort(arr):
    n = len(arr)
    for i in range(n):
        swapped = False
        for j in range(1, n - i):
            if arr[j - 1] > arr[j]:
                arr[j - 1], arr[j] = arr[j], arr[j - 1]
                swapped = True
        if not swapped:
            break
    return arr"""

code2 = """def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[0]
        less = [x for x in arr[1:] if x <= pivot]
        greater = [x for x in arr[1:] if x > pivot]
        return quick_sort(less) + [pivot] + quick_sort(greater)"""

query_embedding = model.encode([query], prompt_name="query")
code_embeddings = model.encode([code1, code2])
print(code_embeddings.shape)
print(model.similarity(query_embedding[0], code_embeddings[0]))
print(model.similarity(query_embedding[0], code_embeddings[1]))
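With more than two candidates, the similarity scores are typically sorted to produce a ranking. The sketch below shows one way to do that in plain Python. `rank_candidates` is a hypothetical helper, and the scores are placeholder values chosen for illustration, not output from the model.

```python
def rank_candidates(scores: list[float], names: list[str]) -> list[tuple[str, float]]:
    """Pair each candidate name with its score and sort best-first."""
    if len(scores) != len(names):
        raise ValueError("scores and names must have the same length")
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)


# Placeholder scores for illustration only; these are not model outputs.
example_scores = [0.41, 0.78]
example_names = ["bubble_sort", "quick_sort"]

for name, score in rank_candidates(example_scores, example_names):
    print(f"{name}: {score:.2f}")
```

In a real pipeline, the scores would come from the similarity values computed above, converted to Python floats.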
Documentation
Model Details
Model Name: OASIS (Order-Augmented Strategy for Improved Code Search)
Introduction
OASIS is a state-of-the-art code embedding model developed by Kwaipilot. This model incorporates unique, proprietary methods including repository-level program analysis, the OASIS-instruct data synthesis algorithm, and a specialized fusion loss function, setting new benchmarks in code search efficiency and accuracy.
Intended Use
This model is ideal for developers and researchers engaged in enhancing code retrieval systems. OASIS excels in scenarios requiring semantic understanding and retrieval of code snippets within varied programming contexts.
Training and Performance
OASIS was trained on a synthetic dataset created through repository-level analysis, ensuring broad understanding across different coding styles and languages. It has demonstrated state-of-the-art performance on recent code search benchmarks.
Our preprint is now available at OASIS-arxiv.
Performance
| Property | Details |
|---|---|
| Model Type | Code Embedding Model |
| Training Data | Synthetic dataset created through repository-level analysis |

| Model | Size | CoSQA | AdvTest | CSN-Py | CSN-Ja | CSN-JS | CSN-PHP | CSN-Go | CSN-Ruby | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| OpenAI-Embedding-Ada-002 | Unknown | 0.4423 | 0.3808 | 0.6802 | 0.7149 | 0.6750 | 0.6062 | 0.8563 | 0.7472 | 0.6378 |
| OpenAI-Text-embedding-3-large | Unknown | 0.5538 | 0.4684 | 0.7084 | 0.7292 | 0.6813 | 0.5959 | 0.8764 | 0.7525 | 0.6707 |
| jina-embeddings-v2-base-code | 161M | 0.6837 | 0.385 | 0.6634 | 0.6803 | 0.6304 | 0.5701 | 0.8595 | 0.7095 | 0.6477 |
| CodeSage-large | 1.3B | 0.4753 | 0.5267 | 0.7077 | 0.7021 | 0.695 | 0.6133 | 0.8371 | 0.7192 | 0.6595 |
| CodeFuse-CGE-Small | 3.8B | 0.5619 | 0.4639 | 0.6958 | 0.6863 | 0.6564 | 0.6133 | 0.8637 | 0.7341 | 0.6594 |
| OASIS-code-1.5B | 1.5B | 0.5577 | 0.5727 | 0.7369 | 0.7397 | 0.6980 | 0.6384 | 0.8821 | 0.7547 | 0.6975 |
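The Avg column appears to be the unweighted mean of the eight benchmark scores in each row. This is an inference from the numbers, not something the source states. The sketch below recomputes it for the OASIS-code-1.5B row, allowing for rounding in the last reported digit.

```python
# Scores copied from the OASIS-code-1.5B row of the table above.
oasis_scores = {
    "CoSQA": 0.5577,
    "AdvTest": 0.5727,
    "CSN-Py": 0.7369,
    "CSN-Ja": 0.7397,
    "CSN-JS": 0.6980,
    "CSN-PHP": 0.6384,
    "CSN-Go": 0.8821,
    "CSN-Ruby": 0.7547,
}
reported_avg = 0.6975

# Assumption: Avg is the plain arithmetic mean over the eight benchmarks.
mean_score = sum(oasis_scores.values()) / len(oasis_scores)
matches = abs(mean_score - reported_avg) < 1e-4

print(f"computed mean: {mean_score:.6f}")
print(f"reported Avg:  {reported_avg:.4f}")
print(f"agree to within 1e-4: {matches}")
```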
Technical Details
OASIS was trained on a synthetic dataset created through repository-level analysis, ensuring broad understanding across different coding styles and languages. It uses unique methods such as repository-level program analysis, the OASIS-instruct data synthesis algorithm, and a specialized fusion loss function.
License
This project is licensed under the MIT license.
News
- [2025/03/12] Our latest code embedding model, OASIS-code-1.5B, is now released.
- [2025/03/12] Our preprint is now available at OASIS-arxiv.
BibTeX
@misc{kwaipilotoasis,
title = {Optimized Augmentation Strategy for Improved code Search},
author = {Kwaipilot team},
year = {2024},
}