CmdCaliper-large Open-source Command-line Embedding Model - Freely Empower Command-line-related Application Development

Developed by CyCraftAI

CmdCaliper is the first embedding model series specifically designed for command line embedding, developed by CyCraft AI Lab.

Release Time: 9/25/2024

Model Overview

The CmdCaliper models are built for command-line embedding and come in three sizes (large, base, small) to accommodate different hardware resource constraints.

Model Features

Designed for Command Line

The first model dedicated to command line embedding, optimized for command line semantics.

Multiple Size Options

Offers large, base, and small model sizes to meet different hardware resource needs.

High Performance

Even the smallest model, with around 30 million parameters, outperforms state-of-the-art sentence embedding models with over 10 times as many parameters (335 million) on various command-line-specific tasks.

Model Capabilities

Command Line Semantic Understanding

Command Line Similarity Calculation

Command Line Feature Extraction

Use Cases

Security Research

Malicious Command Line Detection

Identifies potentially malicious commands by analyzing command line semantic similarity.

Improves accuracy in detecting malicious commands.
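One way to picture similarity-based detection is a nearest-neighbor check against embeddings of known malicious commands. The sketch below is only an illustration: the function names, the labels, the 0.8 threshold, and the toy 3-dimensional vectors are all assumptions standing in for real model embeddings and a tuned cutoff.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def flag_suspicious(query_vec, known_malicious, threshold=0.8):
    """Return (is_flagged, best_label, best_score) for one query embedding.

    known_malicious maps a label to the embedding of a known-bad command.
    The threshold is a hypothetical value that would need tuning on real data.
    """
    best_label, best_score = None, -1.0
    for label, vec in known_malicious.items():
        score = cosine(query_vec, vec)
        if score > best_score:
            best_label, best_score = label, score
    return best_score >= threshold, best_label, best_score

# Toy 3-dimensional vectors standing in for real model embeddings (assumption).
known = {
    "credential_dump": [0.9, 0.1, 0.0],
    "log_clearing": [0.0, 0.2, 0.9],
}
flagged, label, score = flag_suspicious([0.8, 0.2, 0.1], known)
print(flagged, label, round(score, 3))
```

In practice the vectors would come from the embedding model, and the reference set and threshold would be chosen from labeled data.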

Command Line Behavior Analysis

Clusters and analyzes command lines in system logs.

Identifies abnormal command line patterns.
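Clustering command-line embeddings can be sketched with a simple greedy scheme: each vector joins the first cluster whose representative is similar enough, and otherwise starts a new cluster. This is a deliberately minimal stand-in rather than the method used by the authors; the 0.9 threshold, the helper names, and the toy vectors are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def greedy_cluster(vectors, threshold=0.9):
    """Assign each vector to the first cluster whose representative is similar enough.

    Returns a list of clusters, each a list of indices into `vectors`.
    The first member of a cluster acts as its representative.
    """
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            # No existing cluster matched, so this vector starts a new one.
            clusters.append([i])
    return clusters

def singletons(clusters):
    """Indices that ended up alone, a crude proxy for unusual command lines."""
    return [members[0] for members in clusters if len(members) == 1]

# Toy vectors standing in for embeddings of logged command lines (assumption).
vectors = [
    [1.0, 0.0, 0.0],
    [0.95, 0.05, 0.0],
    [0.0, 1.0, 0.0],
    [0.05, 0.98, 0.0],
    [0.0, 0.0, 1.0],
]
clusters = greedy_cluster(vectors)
print(clusters)              # [[0, 1], [2, 3], [4]]
print(singletons(clusters))  # [4]
```

The result depends on input order and on the threshold, so a production pipeline would more likely rely on an established clustering library.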

System Administration

Command Line Recommendation

Recommends related command lines based on semantic similarity.

Enhances administrator productivity.
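Recommendation by semantic similarity amounts to ranking a catalog of commands by their similarity to a query embedding and returning the top few. The following sketch assumes a hypothetical `recommend` helper, a small catalog, and toy 2-dimensional vectors in place of real embeddings.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recommend(query_vec, catalog, k=2):
    """Return the k catalog commands most similar to the query, best first.

    catalog is a list of (command_text, embedding) pairs.
    """
    scored = [(cosine(query_vec, vec), cmd) for cmd, vec in catalog]
    best = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    return [(cmd, score) for score, cmd in best]

# Toy catalog: command texts paired with made-up vectors (assumption).
catalog = [
    ("schtasks /query", [0.9, 0.1]),
    ("at 00:00 program.exe", [0.8, 0.3]),
    ("xcopy src dst /E", [0.1, 0.9]),
]
results = recommend([1.0, 0.0], catalog)
for cmd, score in results:
    print(f"{score:.3f}  {cmd}")
```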

🚀 CmdCaliper-large

The CmdCaliper models, developed by CyCraft AI Lab, are the first embedding models specifically tailored for command-line embeddings. Even the smallest version of CmdCaliper, with around 30 million parameters, can outperform state-of-the-art sentence embedding models with over 10 times more parameters (335 million) in various command-line-specific tasks. CmdCaliper offers three models of different sizes (CmdCaliper-large, CmdCaliper-base, and CmdCaliper-small), providing flexible options for different hardware resource constraints. It was introduced in the EMNLP 2024 paper titled "CmdCaliper: A Semantic-Aware Command-Line Embedding Model and Dataset for Security Research".

📚 Documentation

[Dataset] [[Code](https://github.com/cycraft-corp/CmdCaliper)] [Paper]

✨ Features

The CmdCaliper models are specifically designed for command-line embeddings. They come in different sizes to suit various hardware resource constraints and achieve strong performance on command-line-specific tasks even with relatively few parameters.

📦 Installation

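The source gives no installation steps. The usage examples import `torch`, `transformers`, and `sentence_transformers`, so those packages need to be available in the Python environment.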

💻 Usage Examples

Basic Usage

HuggingFace Transformers

import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Zero out padding positions, then average over the real tokens only
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

# Raw strings keep the Windows backslashes literal
input_texts = [
    r'cronjob schedule daily 00:00 ./program.exe',
    r'schtasks /create /tn "TaskName" /tr "C:\program.exe" /sc daily /st 00:00',
    r'xcopy C:\Program Files (x86) E:\Program Files /E /H /K /O /X'
]

# The source example loads the base checkpoint; other CmdCaliper sizes can be substituted
tokenizer = AutoTokenizer.from_pretrained("CyCraftAI/CmdCaliper-base")
model = AutoModel.from_pretrained("CyCraftAI/CmdCaliper-base")

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

# Inference only, so gradient tracking is disabled
with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
# Similarity of the first command to each of the others, scaled by 100
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
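
The `average_pool` step above averages token vectors while ignoring padding. A plain-Python sketch of the same idea for a single sequence may make the masking easier to follow; the `masked_mean` name and the toy values are illustrative only.

```python
def masked_mean(token_vectors, attention_mask):
    """Average token vectors where the mask is 1, ignoring padding positions."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vec):
                total[j] += value
    # Like the tensor version, this assumes at least one unmasked token.
    return [t / count for t in total]

# Two real tokens followed by one padding token with arbitrary values.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector does not influence the result, which is why sequences of different lengths can be batched together without skewing their embeddings.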

Sentence Transformers

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("CyCraftAI/CmdCaliper-base")
# Run inference
# Raw strings keep the Windows backslashes literal
sentences = [
    r'cronjob schedule daily 00:00 ./program.exe',
    r'schtasks /create /tn "TaskName" /tr "C:\program.exe" /sc daily /st 00:00',
    r'xcopy C:\Program Files (x86) E:\Program Files /E /H /K /O /X'
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]

📊 Metric

Property	Details
Model Type	CmdCaliper models (CmdCaliper-large, CmdCaliper-base, CmdCaliper-small)
Training Data	CyCraftAI/CyPHER

Methods	Model Parameters	MRR@3	MRR@10	Top@3	Top@10
Levenshtein distance	-	71.23	72.45	74.99	81.83
Word2Vec	-	45.83	46.93	48.49	54.86
E5-small	Small (0.03B)	81.59	82.60	84.97	90.59
GTE-small	Small (0.03B)	82.35	83.28	85.39	90.84
CmdCaliper-small	Small (0.03B)	86.81	87.78	89.21	94.76
BGE-en-base	Base (0.11B)	79.49	80.41	82.33	87.39
E5-base	Base (0.11B)	83.16	84.07	86.14	91.56
GTR-base	Base (0.11B)	81.55	82.51	84.54	90.10
GTE-base	Base (0.11B)	78.20	79.07	81.22	86.14
CmdCaliper-base	Base (0.11B)	87.56	88.47	90.27	95.26
BGE-en-large	Large (0.34B)	84.11	84.92	86.64	91.09
E5-large	Large (0.34B)	84.12	85.04	87.32	92.59
GTR-large	Large (0.34B)	88.09	88.68	91.27	94.58
GTE-large	Large (0.34B)	84.26	85.03	87.14	91.41
CmdCaliper-large	Large (0.34B)	89.12	89.91	91.45	95.65
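
The table reports MRR@k and Top@k without defining them here. Under the common retrieval definitions (an assumption, since the exact evaluation protocol is not given in this document), they could be computed as in the sketch below, where `ranks` is a hypothetical list holding the 1-based position of the correct item for each query. The table appears to report the values as percentages, which would correspond to multiplying these by 100.

```python
def mrr_at_k(ranks, k):
    """Mean reciprocal rank, counting a query as 0 when its rank exceeds k.

    ranks holds the 1-based position of the correct item for each query.
    """
    return sum(1.0 / r if r <= k else 0.0 for r in ranks) / len(ranks)

def top_at_k(ranks, k):
    """Fraction of queries whose correct item appears within the first k results."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Four made-up queries; the third one ranks its correct item outside the top 3.
ranks = [1, 2, 4, 1]
print(mrr_at_k(ranks, 3))  # (1 + 0.5 + 0 + 1) / 4 = 0.625
print(top_at_k(ranks, 3))  # 3 / 4 = 0.75
```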

⚠️ Limitation

⚠️ Important Note

This model focuses exclusively on Windows command lines. Additionally, any lengthy texts will be truncated to a maximum of 512 tokens.
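
Because inputs beyond 512 tokens are cut off, it can help to find such command lines before embedding them. The sketch below takes the token counter as a parameter; `find_truncated` and the whitespace-based counter are hypothetical stand-ins, and whitespace splitting typically undercounts compared with a real subword tokenizer.

```python
def find_truncated(commands, count_tokens, max_tokens=512):
    """Return the commands whose token count exceeds max_tokens."""
    return [cmd for cmd in commands if count_tokens(cmd) > max_tokens]

# Stand-in tokenizer (assumption): whitespace splitting underestimates subword counts.
def whitespace_count(text):
    return len(text.split())

commands = ["whoami /all", "echo " + "A " * 600]
long_ones = find_truncated(commands, whitespace_count)
print(len(long_ones))  # 1
```

With a Hugging Face tokenizer loaded as in the usage examples, a counter such as `lambda s: len(tokenizer(s)["input_ids"])` would give counts that match what the model actually sees.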

📄 Citation

@inproceedings{huang2024cmdcaliper,
  title={CmdCaliper: A Semantic-Aware Command-Line Embedding Model and Dataset for Security Research},
  author={SianYao Huang and ChengLin Yang and CheYu Lin and ChunYing Huang},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
  year={2024}
}
