CmdCaliper-large
The CmdCaliper models, developed by CyCraft AI Lab, are the first embedding models specifically tailored for command-line embeddings. Even the smallest version of CmdCaliper, with around 30 million parameters, can outperform state-of-the-art sentence embedding models with over 10 times more parameters (335 million) in various command-line-specific tasks. CmdCaliper is available in three sizes (CmdCaliper-large, CmdCaliper-base, and CmdCaliper-small), providing flexible options for different hardware resource constraints. The models were introduced in the EMNLP 2024 paper titled "CmdCaliper: A Semantic-Aware Command-Line Embedding Model and Dataset for Security Research".
Documentation
[Dataset] [[Code](https://github.com/cycraft-corp/CmdCaliper)] [Paper]
Features
The CmdCaliper models are specifically designed for command-line embeddings. They come in different sizes to suit various hardware resource constraints and achieve strong performance on command-line-specific tasks even with relatively few parameters.
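Installation
No specific installation steps are provided. The usage examples below import torch, transformers, and sentence_transformers, so those packages need to be available in your Python environment.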
Usage Examples
Basic Usage
HuggingFace Transformers
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Zero out padding positions, then average over the real tokens only.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


# Raw strings keep the Windows backslashes literal.
input_texts = [
    'cronjob schedule daily 00:00 ./program.exe',
    r'schtasks /create /tn "TaskName" /tr "C:\program.exe" /sc daily /st 00:00',
    r'xcopy C:\Program Files (x86) E:\Program Files /E /H /K /O /X'
]

tokenizer = AutoTokenizer.from_pretrained("CyCraftAI/CmdCaliper-base")
model = AutoModel.from_pretrained("CyCraftAI/CmdCaliper-base")

# Tokenize the input texts (inputs longer than 512 tokens are truncated).
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# Normalize the embeddings, then score the first command against the other two.
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
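To make the pooling and scoring steps concrete, here is a minimal pure-Python sketch on made-up numbers. It mirrors what `average_pool`, `F.normalize`, and the final matrix product do for a single pair of inputs. The toy vectors and the helper names `masked_mean`, `l2_normalize`, and `score` are illustrative assumptions, not part of the CmdCaliper code.

```python
import math

def masked_mean(token_vectors, mask):
    """Average token vectors, skipping positions where mask is 0 (padding)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, m in zip(token_vectors, mask) if m == 1]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def score(a, b):
    """Dot product of unit vectors (cosine similarity), scaled by 100."""
    return 100 * sum(x * y for x, y in zip(a, b))

# Toy "last hidden states": 3 token positions, 2 dimensions.
# The last position of tokens_a is padding and must be ignored.
tokens_a = [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]]
mask_a = [1, 1, 0]
tokens_b = [[0.0, 2.0], [0.0, 4.0], [0.0, 6.0]]
mask_b = [1, 1, 1]

emb_a = l2_normalize(masked_mean(tokens_a, mask_a))  # [1.0, 0.0]
emb_b = l2_normalize(masked_mean(tokens_b, mask_b))  # [0.0, 1.0]
print(score(emb_a, emb_a), score(emb_a, emb_b))  # 100.0 0.0
```

Because the embeddings are unit length before the product, the scores are cosine similarities multiplied by 100, so they fall between -100 and 100.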
Sentence Transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("CyCraftAI/CmdCaliper-base")

# Raw strings keep the Windows backslashes literal.
sentences = [
    'cronjob schedule daily 00:00 ./program.exe',
    r'schtasks /create /tn "TaskName" /tr "C:\program.exe" /sc daily /st 00:00',
    r'xcopy C:\Program Files (x86) E:\Program Files /E /H /K /O /X'
]

# One embedding vector per command line.
embeddings = model.encode(sentences)
print(embeddings.shape)

# Pairwise similarity between all command lines.
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
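A pairwise similarity matrix like the one above can be used to look up the nearest neighbors of each command line. The following is a minimal sketch that uses a hand-written 3x3 matrix in place of real model output; the values and the helper name `top_k_neighbors` are made up for illustration.

```python
def top_k_neighbors(similarities, query_index, k=1):
    """Return indices of the k items most similar to query_index, excluding itself."""
    row = similarities[query_index]
    candidates = [i for i in range(len(row)) if i != query_index]
    # Sort by descending similarity; the sort is stable, so ties keep the lower index first.
    candidates.sort(key=lambda i: -row[i])
    return candidates[:k]

# Made-up similarity matrix for three command lines (not real model output).
toy_similarities = [
    [1.00, 0.80, 0.30],
    [0.80, 1.00, 0.35],
    [0.30, 0.35, 1.00],
]

print(top_k_neighbors(toy_similarities, query_index=0, k=2))  # [1, 2]
```

With real output, the matrix would first need to be converted to nested lists (for a tensor, `similarities.tolist()`) before being passed to this helper.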
Metric
| Property | Details |
|---|---|
| Model Type | CmdCaliper models (CmdCaliper-large, CmdCaliper-base, CmdCaliper-small) |
| Training Data | CyCraftAI/CyPHER |


| Methods | Model Parameters | MRR@3 | MRR@10 | Top@3 | Top@10 |
|---|---|---|---|---|---|
| Levenshtein distance | - | 71.23 | 72.45 | 74.99 | 81.83 |
| Word2Vec | - | 45.83 | 46.93 | 48.49 | 54.86 |
| | | | | | |
| E5-small | Small (0.03B) | 81.59 | 82.6 | 84.97 | 90.59 |
| GTE-small | Small (0.03B) | 82.35 | 83.28 | 85.39 | 90.84 |
| CmdCaliper-small | Small (0.03B) | 86.81 | 87.78 | 89.21 | 94.76 |
| | | | | | |
| BGE-en-base | Base (0.11B) | 79.49 | 80.41 | 82.33 | 87.39 |
| E5-base | Base (0.11B) | 83.16 | 84.07 | 86.14 | 91.56 |
| GTR-base | Base (0.11B) | 81.55 | 82.51 | 84.54 | 90.1 |
| GTE-base | Base (0.11B) | 78.2 | 79.07 | 81.22 | 86.14 |
| CmdCaliper-base | Base (0.11B) | 87.56 | 88.47 | 90.27 | 95.26 |
| | | | | | |
| BGE-en-large | Large (0.34B) | 84.11 | 84.92 | 86.64 | 91.09 |
| E5-large | Large (0.34B) | 84.12 | 85.04 | 87.32 | 92.59 |
| GTR-large | Large (0.34B) | 88.09 | 88.68 | 91.27 | 94.58 |
| GTE-large | Large (0.34B) | 84.26 | 85.03 | 87.14 | 91.41 |
| CmdCaliper-large | Large (0.34B) | 89.12 | 89.91 | 91.45 | 95.65 |
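
For readers unfamiliar with the column headers, the sketch below shows how MRR@k and Top@k are commonly computed for retrieval tasks. It assumes one relevant item per query and a 1-based rank. The exact evaluation protocol behind the table is described in the paper and may differ in details; the function names and example ranks here are illustrative only.

```python
def mrr_at_k(ranks, k):
    """Mean reciprocal rank; a query contributes 0 if its relevant item is ranked below k."""
    return sum(1.0 / r if r <= k else 0.0 for r in ranks) / len(ranks)

def top_at_k(ranks, k):
    """Fraction of queries whose relevant item appears within the top k results."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical 1-based ranks of the correct match for four queries.
example_ranks = [1, 2, 5, 1]

print(mrr_at_k(example_ranks, 3))   # (1 + 0.5 + 0 + 1) / 4 = 0.625
print(top_at_k(example_ranks, 3))   # 3 of 4 queries -> 0.75
print(top_at_k(example_ranks, 10))  # 4 of 4 queries -> 1.0
```

The table reports these quantities as percentages, so a value of 0.625 from this sketch would correspond to 62.5.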
Limitation
Important Note
This model focuses exclusively on Windows command lines. Additionally, any lengthy texts will be truncated to a maximum of 512 tokens.
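
Because inputs beyond 512 tokens are truncated, it can be useful to flag long command lines before embedding them. The sketch below is one possible way to do that. `find_truncated` and its `count_tokens` argument are hypothetical names, and the whitespace-based counter is only a stand-in for a real tokenizer.

```python
from typing import Callable, List

def find_truncated(commands: List[str],
                   count_tokens: Callable[[str], int],
                   max_length: int = 512) -> List[int]:
    """Return indices of commands whose token count exceeds max_length."""
    return [i for i, cmd in enumerate(commands) if count_tokens(cmd) > max_length]

# Stand-in counter that splits on whitespace. A real subword tokenizer will
# usually produce more tokens than this for the same command line.
def whitespace_count(cmd: str) -> int:
    return len(cmd.split())

commands = ["dir /s", "echo " + "A " * 600]
print(find_truncated(commands, whitespace_count))  # [1]
```

With the tokenizer loaded in the Transformers example, a counter such as `lambda cmd: len(tokenizer(cmd)["input_ids"])` could be passed in place of `whitespace_count`.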
Citation
@inproceedings{huang2024cmdcaliper,
  title={CmdCaliper: A Semantic-Aware Command-Line Embedding Model and Dataset for Security Research},
  author={SianYao Huang and ChengLin Yang and CheYu Lin and ChunYing Huang},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
  year={2024}
}