DiffLlama-1B Open-Source Large Language Model - Achieving Efficient Intelligent Q&A Communication Based on Differential Architecture

Diffllama 1B

Developed by kajuma

DiffLlama-1B is a large language model pre-trained from scratch on approximately 100 billion tokens with around 1 billion parameters, innovatively adopting the 'Differential Transformer' architecture concept.

Large Language Model

Safetensors

JapaneseOpen Source License:Apache-2.0 #Differential Attention Mechanism #Japanese Text Generation #Efficient Training Optimization

Downloads 202

Release Time : 3/29/2025

Model Overview

By integrating the differential attention mechanism into the Llama model framework, this model achieves precise focus on key contextual information and noise suppression, making it suitable for Japanese text generation tasks.

Model Features

Differential Attention Mechanism

Innovatively integrates the differential attention mechanism into the Llama model framework to achieve precise focus on key contextual information and noise suppression.

Efficient Training Techniques

Adopts chunked training methods and μ-optimizer, improving training efficiency by 2 times (equivalent to training on 200 billion tokens).

Large-scale Pre-training

Pre-trained in a single round on approximately 100 billion high-quality Japanese educational data tokens.

Model Capabilities

Japanese Text Generation

Context Understanding

Long Text Processing

Use Cases

Education

Japanese Learning Assistance

Generates Japanese learning materials and exercises

Provides high-quality Japanese texts suitable for educational scenarios.

Content Creation

Japanese Content Generation

Automatically generates Japanese articles, stories, and other creative content

🚀 DiffLlama-1B

DiffLlama-1B is a large language model with approximately 1 billion parameters that has undergone pre - training on about 100 billion tokens from scratch. This model incorporates the concept of "Differential Transformer", which is proposed as an improvement over the traditional Transformer architecture. In particular, by applying the Differential Attention mechanism to the Llama model, it is designed to focus attention on highly relevant contexts and reduce noise.

🚀 Quick Start

The following code snippet demonstrates how to use the DiffLlama-1B model for text generation:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, set_seed

model = AutoModelForCausalLM.from_pretrained("kajuma/DiffLlama-1B", torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("kajuma/DiffLlama-1B")
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
set_seed(123)

text = generator(
    "おはようございます、今日の天気は",
    max_length=30,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
    num_return_sequences=3,
)
for t in text:
    print(t)

✨ Features

Architecture: Integrates the Differential Attention mechanism into the Llama model.
Number of Parameters: 1 billion (1B) parameters.
Patch - level Training: Uses the learning cost reduction technique Patch - level Training.
Muon Optimizer: Doubles the learning efficiency by using an optimizer that converges faster than AdamW (effectively equivalent to learning on 200 billion tokens). The implementation can be found here.

📦 Installation

No installation steps are provided in the original document, so this section is skipped.

💻 Usage Examples

Basic Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, set_seed

model = AutoModelForCausalLM.from_pretrained("kajuma/DiffLlama-1B", torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("kajuma/DiffLlama-1B")
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
set_seed(123)

text = generator(
    "おはようございます、今日の天気は",
    max_length=30,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
    num_return_sequences=3,
)
for t in text:
    print(t)

📚 Documentation

Training Data

DiffLlama-1B was trained for one epoch on a total of approximately 100 billion (100B) tokens using the following datasets:

hotchpotch/fineweb-2-edu-japanese: Approximately 90 billion (90B) tokens.
HuggingFaceFW/fineweb-edu: Approximately 10 billion (10B) tokens.

License

This model is provided under the Apache License 2.0.

Acknowledgments

During the development of this model, we received support from the "Free GPU Program for Generative AI Engineers" provided by ZEALS, Inc..

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご