Nano Mistral
🚀 Model Card for crumb/nano-mistral
This is a model card for a 🤗 Transformers model pushed to the Hub. It provides details about the model, including its uses, training, and evaluation.
🚀 Quick Start
Use the code below to get started with the model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("crumb/nano-mistral")
tokenizer = AutoTokenizer.from_pretrained("crumb/nano-mistral")

inputs = tokenizer(["Once upon a time,"], return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Unpack the tokenized inputs as keyword arguments; passing the dict
# positionally to `generate` is a bug.
outputs = model.generate(**inputs, max_new_tokens=128, temperature=0.7, top_k=20, do_sample=True)

for text in tokenizer.batch_decode(outputs):
    print(text)
```
✨ Features
- General Web Text Completions: generates general web-text continuations with extremely low resource use.
- Mistral Architecture: a compact model built on the Mistral architecture.
📦 Installation
Install the 🤗 Transformers library and PyTorch: `pip install transformers torch`
📚 Documentation
Model Details
Model Description
This is the model card of a 🤗 Transformers model that has been pushed to the Hub. This model card has been automatically generated.
- Developed by: me
- Model type: Mistral
- Language(s) (NLP): en
- License: apache
Uses
- General Use: general web-text completion at extremely low resource use.
- Out-of-Scope Use: this is not an instruction-tuned model; it is not suited to chat or instruction following.
Bias, Risks, and Limitations
The model is trained on web text; although the data was filtered, there is no guarantee that it is free of toxic content.
Training Details
Training Data
Training Procedure
Parameter | Value |
---|---|
Context Length | 2048 |
Batch Size | 128 |
Learning Rate | 6e-4 |
Scheduler | One-Cycle |
Adam eps | 1e-8 |
Adam beta1 | 0.9 |
Adam beta2 | 0.95 |
Weight Decay | 0.1 |
Max Grad Norm | 1.0 |
Optimizer | adamw_torch |
Tokens | 3,401,640,960 |
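For context, the batch and context settings above fix the number of tokens consumed per optimizer step, which gives an approximate total step count (simple arithmetic from the table, not stated in the card):

```python
# Tokens processed per optimizer step, from the hyperparameter table above
context_length = 2048
batch_size = 128
tokens_per_step = batch_size * context_length  # 262,144

total_tokens = 3_401_640_960
steps = total_tokens / tokens_per_step
print(f"{tokens_per_step=}, steps≈{steps:,.0f}")  # roughly 12,976 steps
```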
Training Hyperparameters
- Training regime: bf16 non-mixed precision
Evaluation
Testing Data, Factors & Metrics
Testing Data
A held-out set of crumb/askmistral-pile-2-15.
Metrics
Open LLM Leaderboard evaluation datasets and settings.
Results
Open LLM Leaderboard mean score ± stderr: 29.30 ± 0.42
Tasks | Version | Filter | n-shot | Metric | Value | | Stderr |
---|---|---|---|---|---|---|---|
arc_challenge | 1 | none | 25 | acc | 0.1843 | ± | 0.0113 |
arc_challenge | 1 | none | 25 | acc_norm | 0.2167 | ± | 0.0120 |
truthfulqa_mc2 | 2 | none | 0 | acc | 0.4719 | ± | 0.0156 |
winogrande | 1 | none | 5 | acc | 0.517 | ± | 0.014 |
hellaswag | 1 | none | 10 | acc | 0.2803 | ± | 0.0045 |
hellaswag | 1 | none | 10 | acc_norm | 0.2886 | ± | 0.0045 |
gsm8k | 3 | strict-match | 5 | exact_match | 0.0008 | ± | 0.0008 |
gsm8k | 3 | flexible-extract | 5 | exact_match | 0.0099 | ± | 0.0027 |
MMLU
Mean accuracy ± stderr: 0.2540 ± 0.0044
Tasks | Version | Filter | n-shot | Metric | Value | | Stderr |
---|---|---|---|---|---|---|---|
world_religions | 0 | none | 5 | acc | 0.2222 | ± | 0.0319 |
virology | 0 | none | 5 | acc | 0.2711 | ± | 0.0346 |
us_foreign_policy | 0 | none | 5 | acc | 0.3300 | ± | 0.0473 |
sociology | 0 | none | 5 | acc | 0.2388 | ± | 0.0301 |
security_studies | 0 | none | 5 | acc | 0.2367 | ± | 0.0272 |
public_relations | 0 | none | 5 | acc | 0.2273 | ± | 0.0401 |
professional_psychology | 0 | none | 5 | acc | 0.2484 | ± | 0.0175 |
professional_medicine | 0 | none | 5 | acc | 0.4596 | ± | 0.0303 |
professional_law | 0 | none | 5 | acc | 0.2464 | ± | 0.0110 |
professional_accounting | 0 | none | 5 | acc | 0.2021 | ± | 0.0240 |
prehistory | 0 | none | 5 | acc | 0.2130 | ± | 0.0228 |
philosophy | 0 | none | 5 | acc | 0.2219 | ± | 0.0236 |
nutrition | 0 | none | 5 | acc | 0.2157 | ± | 0.0236 |
moral_scenarios | 0 | none | 5 | acc | 0.2380 | ± | 0.0142 |
moral_disputes | 0 | none | 5 | acc | 0.2486 | ± | 0.0233 |
miscellaneous | 0 | none | 5 | acc | 0.2516 | ± | 0.0155 |
medical_genetics | 0 | none | 5 | acc | 0.3000 | ± | 0.0461 |
marketing | 0 | none | 5 | acc | 0.2265 | ± | 0.0274 |
management | 0 | none | 5 | acc | 0.1748 | ± | 0.0376 |
machine_learning | 0 | none | 5 | acc | 0.3125 | ± | 0.0440 |
logical_fallacies | 0 | none | 5 | acc | 0.2393 | ± | 0.0335 |
jurisprudence | 0 | none | 5 | acc | 0.2315 | ± | 0.0408 |
international_law | 0 | none | 5 | acc | 0.3140 | ± | 0.0424 |
human_sexuality | 0 | none | 5 | acc | 0.2519 | ± | 0.0381 |
human_aging | 0 | none | 5 | acc | 0.3049 | ± | 0.0309 |
high_school_world_history | 0 | none | 5 | acc | 0.2658 | ± | 0.0288 |
high_school_us_history | 0 | none | 5 | acc | 0.2451 | ± | 0.0302 |
high_school_statistics | 0 | none | 5 | acc | 0.4722 | ± | 0.0340 |
high_school_psychology | 0 | none | 5 | acc | 0.1963 | ± | 0.0170 |
high_school_physics | 0 | none | 5 | acc | 0.3046 | ± | 0.0376 |
high_school_microeconomics | 0 | none | 5 | acc | 0.2773 | ± | 0.0291 |
high_school_mathematics | 0 | none | 5 | acc | 0.2667 | ± | 0.0270 |
high_school_macroeconomics | 0 | none | 5 | acc | 0.2667 | ± | 0.0224 |
high_school_government_and_politics | 0 | none | 5 | acc | 0.2591 | ± | 0.0316 |
high_school_geography | 0 | none | 5 | acc | 0.2424 | ± | 0.0305 |
high_school_european_history | 0 | none | 5 | acc | 0.2242 | ± | 0.0326 |
high_school_computer_science | 0 | none | 5 | acc | 0.2800 | ± | 0.0451 |
high_school_chemistry | 0 | none | 5 | acc | 0.2857 | ± | 0.0318 |
high_school_biology | 0 | none | 5 | acc | 0.3129 | ± | 0.0264 |
global_facts | 0 | none | 5 | acc | 0.1500 | ± | 0.0359 |
formal_logic | 0 | none | 5 | acc | 0.1905 | ± | 0.0351 |
elementary_mathematics | 0 | none | 5 | acc | 0.2513 | ± | 0.0223 |
electrical_engineering | 0 | none | 5 | acc | 0.2759 | ± | 0.0372 |
econometrics | 0 | none | 5 | acc | 0.2456 | ± | 0.0405 |
conceptual_physics | 0 | none | 5 | acc | 0.2638 | ± | 0.0288 |
computer_security | 0 | none | 5 | acc | 0.1800 | ± | 0.0386 |
college_physics | 0 | none | 5 | acc | 0.2549 | ± | 0.0434 |
college_medicine | 0 | none | 5 | acc | 0.2023 | ± | 0.0306 |
college_mathematics | 0 | none | 5 | acc | 0.2900 | ± | 0.0456 |
college_computer_science | 0 | none | 5 | acc | 0.2700 | ± | 0.0446 |
college_chemistry | 0 | none | 5 | acc | 0.2500 | ± | 0.0435 |
college_biology | 0 | none | 5 | acc | 0.2222 | ± | 0.0348 |
clinical_knowledge | 0 | none | 5 | acc | 0.2377 | ± | 0.0262 |
business_ethics | 0 | none | 5 | acc | 0.2100 | ± | 0.0409 |
astronomy | 0 | none | 5 | acc | 0.1776 | ± | 0.0311 |
anatomy | 0 | none | 5 | acc | 0.2593 | ± | 0.0379 |
abstract_algebra | 0 | none | 5 | acc | 0.2200 | ± | 0.0416 |
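The reported leaderboard mean can be roughly cross-checked from the per-task values above. This is a sketch under an assumption: that the mean averages arc_challenge (acc_norm), hellaswag (acc_norm), the MMLU mean, truthfulqa_mc2, winogrande, and gsm8k (flexible-extract), each expressed as a percentage.

```python
# Hypothetical cross-check of the reported mean (29.30); which metric
# variant enters the average is an assumption, not stated in the card.
scores = {
    "arc_challenge (acc_norm)": 21.67,
    "hellaswag (acc_norm)": 28.86,
    "mmlu": 25.40,
    "truthfulqa_mc2": 47.19,
    "winogrande": 51.70,
    "gsm8k (flexible-extract)": 0.99,
}
mean = sum(scores.values()) / len(scores)
print(f"{mean:.2f}")  # 29.30, matching the reported mean
```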
🔧 Technical Details
Model Architecture and Objective
Mistral architecture, trained with a causal language modeling objective.
Compute Infrastructure
Hardware
Lambda Vector workstation with 2× NVIDIA A6000 GPUs.
Software
Hugging Face Transformers, PyTorch, and a custom trainer.
🌱 Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: A6000
- Hours used: 34.74
- Cloud Provider: n/a
- Compute Region: Iowa
- Carbon Emitted: 4.5 kg CO₂eq
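The emissions figure is consistent with the standard energy × grid-intensity estimate used by the calculator. A rough sketch, assuming one A6000 drawing its ~300 W TDP and a grid intensity of about 0.432 kg CO₂eq/kWh for the Iowa region (both values are assumptions; the actual calculator inputs are not documented here):

```python
hours = 34.74            # reported GPU hours
power_kw = 0.300         # assumed: A6000 board power (300 W TDP)
grid_kg_per_kwh = 0.432  # assumed: grid carbon intensity, Iowa region

energy_kwh = hours * power_kw
emissions_kg = energy_kwh * grid_kg_per_kwh
print(f"{emissions_kg:.1f} kg CO2eq")  # ≈ 4.5, matching the reported estimate
```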
📄 License
Apache

