Qwen2.5-1.5B-Instruct GGUF Models
This repository provides Qwen2.5-1.5B-Instruct in GGUF format, in multiple precision and quantization variants to match different hardware and memory constraints. The models are suitable for a wide range of text-generation tasks and support multiple languages.
Quick Start
Choosing the Right Model Format
Selecting the correct model format depends on your hardware capabilities and memory constraints.
BF16 (Brain Float 16) – Use if BF16 acceleration is available
- A 16-bit floating-point format designed for faster computation while retaining good precision.
- Provides a dynamic range similar to FP32 while using half the memory.
- Recommended if your hardware supports BF16 acceleration (check your device's specs; a quick programmatic check is shown below).
- Ideal for high-performance inference with a reduced memory footprint compared to FP32.
Use BF16 if:
✔ Your hardware has native BF16 support (e.g., newer GPUs, TPUs).
✔ You want higher precision while saving memory.
✔ You plan to requantize the model into another format.
Avoid BF16 if:
❌ Your hardware does not support BF16 (it may fall back to FP32 and run slower).
❌ You need compatibility with older devices that lack BF16 optimization.
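If you are unsure whether your GPU offers native BF16 or FP16 acceleration, the sketch below is one quick way to check. It assumes PyTorch is installed and only covers CUDA devices; for CPUs, consult your processor's documentation (e.g., AVX512_BF16 on recent x86 or BF16 support on newer ARM cores).

```python
# Minimal capability probe (assumes PyTorch is installed: pip install torch).
# Covers CUDA GPUs only; CPU BF16/FP16 support must be checked against your
# processor's documentation.
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"GPU: {torch.cuda.get_device_name(0)} (compute capability {major}.{minor})")
    # Rough heuristic: FP16 math is generally usable from compute capability 5.3+.
    print("FP16 likely supported:", (major, minor) >= (5, 3))
    print("BF16 supported:       ", torch.cuda.is_bf16_supported())
else:
    print("No CUDA device detected; check your CPU vendor's documentation instead.")
```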
F16 (Float 16) – More widely supported than BF16
- A 16-bit floating-point format with high precision but a narrower range of values than BF16.
- Works on most devices with FP16 acceleration support (including many GPUs and some CPUs).
- Slightly lower numerical precision than BF16 but generally sufficient for inference.
Use F16 if:
✔ Your hardware supports FP16 but not BF16.
✔ You need a balance between speed, memory usage, and accuracy.
✔ You are running on a GPU or another device optimized for FP16 computations.
Avoid F16 if:
❌ Your device lacks native FP16 support (it may run slower than expected).
❌ You have memory limitations.
Quantized Models (Q4_K, Q6_K, Q8, etc.) – For CPU & Low-VRAM Inference
Quantization reduces model size and memory usage while maintaining as much accuracy as possible.
- Lower-bit models (Q4_K) → Best for minimal memory usage, but may have lower precision.
- Higher-bit models (Q6_K, Q8_0) → Better accuracy, but require more memory.
Use Quantized Models if:
✔ You are running inference on a CPU and need an optimized model.
✔ Your device has low VRAM and cannot load full-precision models.
✔ You want to reduce memory footprint while keeping reasonable accuracy.
Avoid Quantized Models if:
❌ You need maximum accuracy (full-precision models are better for this).
❌ Your hardware has enough VRAM for higher-precision formats (BF16/F16).
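Once you have picked a quantization, the corresponding .gguf file can be fetched directly from the Hugging Face Hub. A minimal sketch using the huggingface_hub client is shown below; the repository ID and file name are placeholders, so substitute the actual values for this repository and the variant you chose.

```python
# Download one GGUF file (assumes: pip install huggingface_hub).
# REPO_ID and FILENAME are hypothetical placeholders; replace them with this
# repository's actual ID and the exact .gguf file name you selected.
from huggingface_hub import hf_hub_download

REPO_ID = "your-namespace/Qwen2.5-1.5B-Instruct-GGUF"   # placeholder
FILENAME = "qwen2.5-1.5b-instruct-q4_k_m.gguf"          # placeholder

local_path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME)
print("Model saved to:", local_path)
```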
Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)
These models are optimized for extreme memory efficiency, making them ideal for low-power devices or large-scale deployments where memory is a critical constraint.
- IQ3_XS: Ultra-low-bit quantization (3-bit) with extreme memory efficiency.
  - Use case: Best for ultra-low-memory devices where even Q4_K is too large.
  - Trade-off: Lower accuracy compared to higher-bit quantizations.
- IQ3_S: Small block size for maximum memory efficiency.
  - Use case: Best for low-memory devices where IQ3_XS is too aggressive.
- IQ3_M: Medium block size for better accuracy than IQ3_S.
  - Use case: Suitable for low-memory devices where IQ3_S is too limiting.
- Q4_K: 4-bit quantization with block-wise optimization for better accuracy.
  - Use case: Best for low-memory devices where Q6_K is too large.
- Q4_0: Pure 4-bit quantization, optimized for ARM devices.
  - Use case: Best for ARM-based devices or low-memory environments.
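After downloading whichever quantized file fits your device, it can be run entirely on the CPU with llama-cpp-python. The sketch below makes a few assumptions: the model path is a placeholder for your downloaded file, and parameters such as n_ctx and n_threads should be tuned to your machine.

```python
# CPU-only inference sketch (assumes: pip install llama-cpp-python).
# The model path is a placeholder for whichever .gguf file you downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen2.5-1.5b-instruct-q4_k_m.gguf",  # hypothetical file name
    n_ctx=4096,        # context window; raise it if you have the RAM
    n_threads=8,       # match your CPU core count
    n_gpu_layers=0,    # 0 = pure CPU; increase to offload layers to a GPU
)

out = llm("Write a haiku about quantization.", max_tokens=64)
print(out["choices"][0]["text"])
```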
Summary Table: Model Format Selection

| Model Format | Precision | Memory Usage | Best Use Case |
|--------------|-----------|--------------|---------------|
| BF16 | Highest (16-bit) | High | High-speed inference on hardware with native BF16 support |
| F16 | High (16-bit) | High | FP16-accelerated devices when BF16 is unavailable |
| Q8_0 / Q6_K | Medium-High | Moderate | Better accuracy while still quantized |
| Q4_K | Medium-Low | Low | CPU and low-VRAM inference with reasonable accuracy |
| Q4_0 | Low | Low | ARM-based or other low-memory devices |
| IQ3_M / IQ3_S / IQ3_XS | Very Low | Very Low | Extreme memory efficiency where some accuracy can be traded away |
Model Details

| Property | Details |
|----------|---------|
| Model Type | Causal Language Models |
| Training Data | Pretraining & Post-training |
| Architecture | transformers with RoPE, SwiGLU, RMSNorm, Attention QKV bias and tied word embeddings |
| Number of Parameters | 1.54B |
| Number of Parameters (Non-Embedding) | 1.31B |
| Number of Layers | 28 |
| Number of Attention Heads (GQA) | 12 for Q and 2 for KV |
| Context Length | Full 32,768 tokens and generation 8192 tokens |
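Because the model supports a 32,768-token context and is intended to generate up to 8,192 tokens, a chat-style call might look like the sketch below (again using llama-cpp-python; the .gguf file name is a placeholder). Setting n_ctx=32768 noticeably increases memory use, so lower it if you don't need the full window.

```python
# Chat-style usage reflecting the 32K context / 8K generation limits above
# (assumes: pip install llama-cpp-python; the .gguf file name is a placeholder).
from llama_cpp import Llama

llm = Llama(
    model_path="qwen2.5-1.5b-instruct-q4_k_m.gguf",  # hypothetical file name
    n_ctx=32768,   # full context window; reduce to save memory
)

resp = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the trade-offs between Q4_K and Q8_0."},
    ],
    max_tokens=512,  # generation can go up to 8192 tokens, but keep it small for a quick test
)
print(resp["choices"][0]["message"]["content"])
```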