🚀 CodeBERT fine-tuned for Insecure Code Detection 💾⛔
This model is a fine-tuned version of codebert-base on the CodeXGLUE -- Defect Detection dataset for the Insecure Code Detection downstream task.
✨ Features
We present CodeBERT, a bimodal pre-trained model for programming language (PL) and natural language (NL). CodeBERT learns general-purpose representations that support downstream NL-PL applications such as natural language code search, code documentation generation, etc.
We develop CodeBERT with a Transformer-based neural architecture and train it with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators. This enables us to utilize both bimodal data of NL-PL pairs and unimodal data, where the former provides input tokens for model training while the latter helps to learn better generators.
We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters. Results show that CodeBERT achieves state-of-the-art performance on both natural language code search and code documentation generation tasks. Furthermore, to investigate what type of knowledge is learned in CodeBERT, we construct a dataset for NL-PL probing and evaluate CodeBERT in a zero-shot setting where the parameters of pre-trained models are fixed. Results show that CodeBERT performs better than previous pre-trained models on NL-PL probing.
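As a rough, illustrative sketch (not the authors' implementation), the replaced token detection objective mentioned above boils down to a per-token binary classification: a small generator proposes plausible replacement tokens, and the discriminator predicts for every position whether the token is original or was replaced. The tensor names below are made up for illustration.

import torch
import torch.nn.functional as F

# Illustrative-only sketch of a replaced-token-detection loss.
# `token_scores`: per-token logits from the discriminator, shape (batch, seq_len).
# `is_replaced`: 1.0 where a generator swapped in a plausible alternative token.
token_scores = torch.randn(2, 8)
is_replaced = torch.randint(0, 2, (2, 8)).float()

rtd_loss = F.binary_cross_entropy_with_logits(token_scores, is_replaced)
print(rtd_loss)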
Details of the downstream task (code classification) - Dataset 📚
Given a piece of source code, the task is to identify whether it is insecure code that could harm software systems, for example through resource leaks, use-after-free vulnerabilities, or DoS attacks. The task is treated as binary classification (0/1), where 1 stands for insecure code and 0 for secure code.
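For illustration, here is a hypothetical pair of C snippets and the labels the task would assign to them (these examples are made up and are not drawn from the dataset):

# Hypothetical C snippets illustrating the 0/1 label convention (not from the dataset).
insecure_snippet = r"""
void copy_input(char *input) {
    char buf[16];
    strcpy(buf, input);  /* unbounded copy -> potential buffer overflow */
}
"""
# expected label: 1 (insecure)

secure_snippet = r"""
void copy_input(const char *input) {
    char buf[16];
    strncpy(buf, input, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
}
"""
# expected label: 0 (secure)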
The dataset used comes from the paper Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks. All projects are combined and split 80%/10%/10% for training/dev/test.
The data statistics of the dataset are shown in the following table:
| Property       | Details |
| -------------- | ------- |
| Train Examples | 21,854  |
| Dev Examples   | 2,732   |
| Test Examples  | 2,732   |
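If you want to inspect the data yourself, a minimal sketch with the 🤗 datasets library is shown below. The dataset id code_x_glue_cc_defect_detection and the field names func / target are assumptions based on the CodeXGLUE release on the Hugging Face Hub.

from datasets import load_dataset

# Assumed Hub id for the CodeXGLUE Defect Detection (Devign) dataset.
ds = load_dataset("code_x_glue_cc_defect_detection")

print(ds)                          # expected splits: train / validation / test
print(ds["train"][0]["func"])      # a C function from the Devign corpus
print(ds["train"][0]["target"])    # vulnerability label for that function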
Test set metrics 🧾
💻 Usage Examples
Basic Usage
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import numpy as np

tokenizer = AutoTokenizer.from_pretrained('mrm8488/codebert-base-finetuned-detect-insecure-code')
model = AutoModelForSequenceClassification.from_pretrained('mrm8488/codebert-base-finetuned-detect-insecure-code')

# Tokenize the code you want to classify (truncated/padded to the model's max length)
inputs = tokenizer("your code here", return_tensors="pt", truncation=True, padding='max_length')

# Optional: pass a label to also get the loss (batch size 1)
labels = torch.tensor([1]).unsqueeze(0)
outputs = model(**inputs, labels=labels)
loss = outputs.loss
logits = outputs.logits

# 0 = secure code, 1 = insecure code
print(np.argmax(logits.detach().numpy()))
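A minimal inference-only variant (no labels, class probabilities via softmax) is sketched below; it assumes the tokenizer and model are already loaded as above, and the input snippet is a made-up example.

code_snippet = "int main() { char buf[8]; gets(buf); return 0; }"  # hypothetical input

inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)[0]
print(f"secure: {probs[0]:.3f} | insecure: {probs[1]:.3f}")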
Created by Manuel Romero/@mrm8488 | LinkedIn
Made with ♥ in Spain