CodeBERT-base-cd-ft: An Open-Source Model for Code Clone Detection

codebert-base-cd-ft

Developed by mchochlov

This is a sentence-transformers-based model specifically fine-tuned for code clone detection tasks, capable of mapping code snippets into a 768-dimensional vector space.

Text Embedding

Transformers

#Code Clone Detection #Code Vectorization #Contrastive Learning Fine-tuning

Downloads 5,080

Release Time: 8/16/2022

Model Overview

The model is based on the CodeBERT architecture and was fine-tuned with contrastive learning on parts of the BigCloneBench dataset. It is intended primarily for code similarity computation and clone detection.

Model Features

Code-Specific Embedding

Produces vector representations tuned for code snippets, intended to capture the semantic features of the code.

Clone Detection Optimization

Fine-tuned on the BigCloneBench dataset using contrastive learning, making it particularly suitable for code clone detection scenarios.

High-Dimensional Semantic Representation

Generates 768-dimensional dense vectors that effectively represent deep semantic features of code.

Model Capabilities

Code Similarity Computation

Code Clone Detection

Code Feature Extraction

Use Cases

Code Analysis

Code Clone Detection

Identify similarities between different code snippets to detect potential code clones.

Aimed at Type-1 through Type-4 clones, that is, from near-exact copies to functionally similar code with different syntax.

Code Search

Achieve more precise code search through semantic similarity.

Code Quality

Duplicate Code Identification

Identify duplicate or highly similar code fragments in large codebases.

Helps reduce code redundancy and improve maintainability.
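
The code search use case above comes down to ranking stored fragments by how close their embeddings are to a query embedding. The following is a minimal sketch of that idea, not the author's implementation: `normalize` and `rank_by_similarity` are hypothetical helper names, and the 2-dimensional toy vectors stand in for the 768-dimensional embeddings the model would produce.

```python
import math


def normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned as a plain copy."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def rank_by_similarity(query_vec, corpus_vecs, top_k=3):
    """Return (index, cosine similarity) for the top_k corpus vectors closest to the query."""
    q = normalize(query_vec)
    scored = []
    for idx, vec in enumerate(corpus_vecs):
        v = normalize(vec)
        # The dot product of two unit vectors is their cosine similarity.
        scored.append((idx, sum(a * b for a, b in zip(q, v))))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy stand-ins for embeddings (assumption: real ones come from model.encode).
corpus = [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]]
query = [1.0, 0.0]
results = rank_by_similarity(query, corpus, top_k=2)
print(results)
```

For large codebases a linear scan like this becomes slow, and an approximate nearest-neighbour index would be the usual replacement.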

🚀 mchochlov/codebert-base-cd-ft

This is a sentence-transformers model that maps code to a 768-dimensional dense vector space. It's fine-tuned for clone detection using contrastive learning on parts of BigCloneBench code.

🚀 Quick Start

The sections below show two ways to use the mchochlov/codebert-base-cd-ft model: through sentence-transformers, or directly through Hugging Face Transformers.

✨ Features

Maps code to a 768-dimensional dense vector space.
Specifically fine-tuned for clone detection using contrastive learning on parts of BigCloneBench code.

📦 Installation

The simplest way to use this model is through sentence-transformers, which you can install with:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

If you have sentence-transformers installed, you can use the model like this:

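from sentence_transformers import SentenceTransformer

# Code fragments to embed, one string per fragment
code_fragments = [...]

# Load the model from the Hugging Face Hub and encode each fragment
model = SentenceTransformer('mchochlov/codebert-base-cd-ft')
embeddings = model.encode(code_fragments)
print(embeddings)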
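
Once fragments are embedded, clone candidates can be found by comparing the vectors. The sketch below shows one possible way to do this and is not necessarily the procedure the author used: it scores every pair with cosine similarity and keeps the pairs above a cut-off. The helper names and the 0.9 threshold are illustrative assumptions, and the 3-dimensional toy vectors stand in for the rows returned by `model.encode`.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_clone_candidates(embeddings, threshold=0.9):
    """Return (i, j, score) for every pair scoring at or above threshold, best first."""
    pairs = []
    for i, j in combinations(range(len(embeddings)), 2):
        score = cosine_similarity(embeddings[i], embeddings[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Toy stand-ins for embeddings; the threshold is an assumption to tune on your own data.
toy_embeddings = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
candidates = find_clone_candidates(toy_embeddings, threshold=0.9)
print(candidates)
```

The all-pairs loop grows quadratically with the number of fragments, so it is only practical for small sets.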

Advanced Usage

Without sentence-transformers, you can use the model as follows: first pass your input through the transformer model, then apply mean pooling over the contextualized token embeddings, using the attention mask so that padding tokens are ignored.

from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Inputs we want embeddings for (illustrative code fragments)
sentences = ['int add(int a, int b) { return a + b; }', 'int sum(int x, int y) { return x + y; }']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('mchochlov/codebert-base-cd-ft')
model = AutoModel.from_pretrained('mchochlov/codebert-base-cd-ft')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
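
To make the masking step concrete, here is a small worked example in plain Python that follows the same arithmetic as the `mean_pooling` function above for a single sequence. It is only an illustration: `masked_mean_pool` is a hypothetical name, and the 2-dimensional token vectors are made up so the result can be checked by hand.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding positions."""
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for k in range(dim):
                sums[k] += vec[k]
    # Same guard as torch.clamp(..., min=1e-9): avoids division by zero for an all-zero mask.
    denom = max(count, 1e-9)
    return [s / denom for s in sums]


# Two real tokens followed by one padding token with arbitrary values.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector has no effect on the result, which is why the attention mask has to be applied before averaging.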

📚 Documentation

Evaluation Results

For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: RobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
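
The `max_seq_length` of 128 in the configuration above means that, when the model is used through sentence-transformers, longer inputs are truncated to 128 tokens. One possible workaround, which the original model card does not describe, is to split a long token sequence into overlapping windows and embed each window separately. The sketch below shows only the windowing step; `split_into_windows` and the overlap of 16 are illustrative assumptions, and integers stand in for real tokens.

```python
def split_into_windows(tokens, max_len=128, overlap=16):
    """Split a token list into consecutive windows of at most max_len items.

    Neighbouring windows share `overlap` items so that code near a
    boundary appears in both windows.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be in the range [0, max_len)")
    step = max_len - overlap
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the current window already reaches the end
        start += step
    return windows


# Integers stand in for the tokens of a 300-token code fragment.
tokens = list(range(300))
windows = split_into_windows(tokens, max_len=128, overlap=16)
print([len(w) for w in windows])  # [128, 128, 76]
```

The limit most likely counts the special tokens the tokenizer adds as well, so a slightly smaller window may be needed in practice.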

📄 License

No license information is provided in the original README.

📖 Citing & Authors

If you use this model, please cite the following paper:

@inproceedings{chochlov2022using,
  title={Using a Nearest-Neighbour, BERT-Based Approach for Scalable Clone Detection},
  author={Chochlov, Muslim and Ahmed, Gul Aftab and Patten, James Vincent and Lu, Guoxian and Hou, Wei and Gregg, David and Buckley, Jim},
  booktitle={2022 IEEE International Conference on Software Maintenance and Evolution (ICSME)},
  pages={582--591},
  year={2022},
  organization={IEEE}
}
