🚀 PassGPT
PassGPT is a causal language model trained on password leaks, intended to support password-related research in cybersecurity.
🚀 Quick Start
PassGPT was first introduced in this paper. This version of the model was trained on passwords from the RockYou leak, filtered to those at most 16 characters long. You can also access PassGPT trained on passwords of up to 10 characters, without restrictions, here.
✨ Features
- Curated Model: This is a curated version of the model reported in the paper: the vocabulary was reduced to the most meaningful characters and training was slightly optimized, yielding slightly better results.
- Inherited Architecture: The model inherits the GPT2LMHeadModel architecture and implements a custom BertTokenizer that encodes each character in a password as a single token, avoiding merges.
📦 Installation
No model-specific installation is required; the usage examples below only need the `transformers` and `torch` packages (`pip install torch transformers`).
💻 Usage Examples
Basic Usage
Passwords can be sampled from the model using the built-in generation methods provided by HuggingFace, with the start-of-password token `<s>` as the seed. The code below generates one password with PassGPT. Note that you may need to generate an [access token](https://huggingface.co/docs/hub/security-tokens) to authenticate your download.
```python
import torch
from transformers import GPT2LMHeadModel, RobertaTokenizerFast

# Load the character-level tokenizer (one token per password character)
tokenizer = RobertaTokenizerFast.from_pretrained("javirandor/passgpt-16characters",
                                                 use_auth_token="YOUR_ACCESS_TOKEN",
                                                 max_len=18,
                                                 padding="max_length",
                                                 truncation=True,
                                                 do_lower_case=False,
                                                 strip_accents=False,
                                                 mask_token="<mask>",
                                                 unk_token="<unk>",
                                                 pad_token="<pad>",
                                                 truncation_side="right")

model = GPT2LMHeadModel.from_pretrained("javirandor/passgpt-16characters",
                                        use_auth_token="YOUR_ACCESS_TOKEN").eval()

NUM_GENERATIONS = 1

# Sample passwords, seeding generation with the start-of-password token <s>
g = model.generate(torch.tensor([[tokenizer.bos_token_id]]),
                   do_sample=True,
                   num_return_sequences=NUM_GENERATIONS,
                   max_length=18,
                   pad_token_id=tokenizer.pad_token_id,
                   bad_words_ids=[[tokenizer.bos_token_id]])

# Remove the start-of-password token
g = g[:, 1:]

decoded = tokenizer.batch_decode(g.tolist())

# Keep only the content before the end-of-password token </s>
decoded_clean = [i.split("</s>")[0] for i in decoded]

print(decoded_clean)
```
Advanced Usage
You can find a more flexible script for sampling here.
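That script is not reproduced here, but a minimal sketch of batched sampling might look like the following. It reuses the `model` and `tokenizer` loaded in the basic usage example; the helper name `sample_passwords` and the default batch sizes are illustrative assumptions, not part of the official script.

```python
import torch

def sample_passwords(model, tokenizer, total=1000, batch_size=250):
    """Hypothetical helper: sample `total` passwords in batches."""
    passwords = []
    for _ in range(total // batch_size):
        # Seed every sequence in the batch with the start-of-password token <s>
        seeds = torch.full((batch_size, 1), tokenizer.bos_token_id, dtype=torch.long)
        out = model.generate(seeds,
                             do_sample=True,
                             max_length=18,
                             pad_token_id=tokenizer.pad_token_id,
                             bad_words_ids=[[tokenizer.bos_token_id]])
        # Drop the seed token and cut each string at the end-of-password token
        decoded = tokenizer.batch_decode(out[:, 1:].tolist())
        passwords += [p.split("</s>")[0] for p in decoded]
    return passwords

# Example: passwords = sample_passwords(model, tokenizer)
```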
📚 Documentation
Model description
The model inherits the GPT2LMHeadModel architecture and implements a custom BertTokenizer that encodes each character in a password as a single token, avoiding merges. It was trained from a random initialization, and the code for training can be found in the official repository.
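To illustrate the character-level encoding, the snippet below reuses the tokenizer loaded in the usage example above; the exact output shown, including the special tokens, is an assumption based on this description rather than verified model output.

```python
# Illustrative sketch: the tokenizer should emit one token per character,
# wrapped by the start/end-of-password special tokens.
encoded = tokenizer("password1")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# Expected (assumed) output:
# ['<s>', 'p', 'a', 's', 's', 'w', 'o', 'r', 'd', '1', '</s>']
```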
Password Generation
Passwords can be sampled from the model using the built-in generation methods provided by HuggingFace, with the start-of-password token `<s>` as the seed (see the usage examples above).
Usage and License Notices
PassGPT is intended and licensed for research use only. The model and code are CC BY-NC 4.0 (allowing only non-commercial use) and should not be used outside of research purposes. This model should never be used to attack real systems. Access will be granted upon request; please make sure to indicate the details and scope of your project.
Cite our work
```bibtex
@article{rando2023passgpt,
  title={PassGPT: Password Modeling and (Guided) Generation with Large Language Models},
  author={Rando, Javier and Perez-Cruz, Fernando and Hitaj, Briland},
  journal={arXiv preprint arXiv:2306.01545},
  year={2023}
}
```
Additional Information
| Property | Details |
|----------|---------|
| Model Type | Causal language model |
| Training Data | Passwords from the RockYou leak, filtered to at most 16 characters long (a version trained on passwords of up to 10 characters is also available) |