# lightblue/DeepSeek-R1-Distill-Qwen-7B-Japanese
DeepSeek's R1 models are excellent at reasoning but can output an inconsistent mix of languages. This Japanese version reliably responds to prompts in Japanese.
## Quick Start
When using this model, we recommend a sampling temperature between 0.5 and 0.7, as recommended for the original distilled R1 models.

We have also observed that this model sometimes repeats itself more than the original R1 model, so we recommend setting `repetition_penalty` to 1.1, or higher if it still repeats itself on your prompts.
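The same settings can also be applied outside vLLM. Below is a minimal sketch using the Hugging Face transformers API; this example is ours, not from the original card, and the prompt is illustrative:

```python
# Minimal transformers sketch applying the recommended sampling settings.
# Assumes a GPU with enough memory for the 7B model in bfloat16.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lightblue/DeepSeek-R1-Distill-Qwen-7B-Japanese"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# "Find the sum of the integers from 1 to 100."
messages = [{"role": "user", "content": "1から100までの整数の和を求めてください。"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(
    input_ids,
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.6,        # within the recommended 0.5-0.7 range
    repetition_penalty=1.1, # as recommended above
)
print(tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True))
```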
## Installation

To use this model with vLLM, first install vLLM:

```bash
pip install vllm
```
## Usage Examples

### Basic Usage
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="lightblue/DeepSeek-R1-Distill-Qwen-7B-Japanese",
    max_model_len=8_000
)

sampling_params = SamplingParams(
    temperature=0.5,
    max_tokens=8_000,
    repetition_penalty=1.1
)

# "Each class at a school has 20 students, and there are 3 classes in total.
#  Across the whole school, 50% of students are boys and 50% are girls.
#  The first class has 15 girls and the second class has 12 girls.
#  How many boys are in the third class?"
prompts = [
    """学校には1クラスにつき20人の生徒がいて、クラスは合計3つあります。
学校全体では男子と女子がそれぞれ50%ずついます。
1つ目のクラスには女子が15人、2つ目のクラスには女子が12人います。
3つ目のクラスには何人の男子がいますか？"""
]

conversations = [
    [{"role": "user", "content": x}] for x in prompts
]

outputs = llm.chat(conversations, sampling_params=sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```
Example output (truncated):

```
<think>
...
...
```
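Responses place the model's reasoning inside a `<think>...</think>` block before the final answer. A small helper like the following (our sketch, not part of the original card) can separate the two:

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer), splitting on the model's </think> tag."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text  # no complete reasoning block found
    return match.group(1).strip(), text[match.end():].strip()

# e.g. with the vLLM output above:
# reasoning, answer = split_reasoning(output.outputs[0].text)
```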
## Documentation

### Evaluation

We evaluated this model for answer accuracy and the percentage of valid Japanese `<think>` sections using the first 50 rows of the SakanaAI/gsm8k-ja-test_250-1319 dataset. We compare against the original R1 distill model, testing with repetition penalties of both 1.0 and 1.1:
| Model | Repetition Penalty | Answer accuracy (%) | Valid Japanese `<think>` (%) |
|:--|:-:|:-:|:-:|
| deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | 1.0 | 60 | 94 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | 1.1 | 62 | 96 |
| lightblue/DeepSeek-R1-Distill-Qwen-7B-Japanese | 1.0 | 66 | 92 |
| lightblue/DeepSeek-R1-Distill-Qwen-7B-Japanese | 1.1 | 70 | 98 |
Code for the SakanaAI/gsm8k-ja-test_250-1319 evaluation can be found here.
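The card does not show the metric implementation, but a plausible sketch of the "valid Japanese `<think>`" check, assuming "valid" means the block closes properly and is written mostly in Japanese, might look like this (the 0.5 threshold is our assumption):

```python
import re

# Hiragana, katakana, and CJK ideograph ranges.
JAPANESE = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")

def is_valid_japanese_think(text: str, min_ratio: float = 0.5) -> bool:
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return False  # missing or unterminated <think> section
    chars = [c for c in match.group(1) if not c.isspace() and not c.isdigit()]
    if not chars:
        return False
    return sum(bool(JAPANESE.match(c)) for c in chars) / len(chars) >= min_ratio
```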
We further use the first 50 prompts from DeL-TaiseiOzaki/Tengentoppa-sft-reasoning-ja to evaluate the percentage of valid Japanese `<think>` sections in model responses. This benchmark contains more varied and complex prompts, making it a more realistic test of how reliably the model outputs Japanese.
| Model | Repetition Penalty | Valid Japanese `<think>` (%) |
|:--|:-:|:-:|
| deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | 1.0 | 48 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | 1.1 | 48 |
| lightblue/DeepSeek-R1-Distill-Qwen-7B-Japanese | 1.0 | 84 |
| lightblue/DeepSeek-R1-Distill-Qwen-7B-Japanese | 1.1 | 94 |
Code for the DeL-TaiseiOzaki/Tengentoppa-sft-reasoning-ja evaluation can be found here.
### How this model was made
We made the data for this model using the following steps:
- Sample English reasoning-style prompts from argilla/distilabel-reasoning-prompts.
- Remove similar prompts using text similarity based on BAAI/bge-m3 embeddings (a sketch of this step follows this list).
- Translate the English prompts to Japanese using gpt-4o-mini-2024-07-18.
- Generate answers to the prompts using deepseek-ai/DeepSeek-R1-Distill-Llama-70B.
- Filter out responses that did not:
  - finish within 2,048 tokens;
  - contain a valid `<think>` section; and
  - have the `<think>` section written in Japanese.
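The card does not include the deduplication code for the second step; a hedged sketch using BAAI/bge-m3 via sentence-transformers (the 0.9 similarity threshold is our assumption) could look like:

```python
# Hypothetical near-duplicate filter over a list of prompt strings.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-m3")
prompts = ["...", "..."]  # the sampled English reasoning prompts
embeddings = encoder.encode(prompts, normalize_embeddings=True)

kept: list[int] = []
for i, emb in enumerate(embeddings):
    # On unit-norm vectors, cosine similarity is a plain dot product.
    if all(float(np.dot(emb, embeddings[j])) < 0.9 for j in kept):
        kept.append(i)

deduplicated = [prompts[i] for i in kept]
```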
### Training details

Full training config (LLaMA-Factory yaml):

```yaml
model_name_or_path: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
stage: sft
do_train: true
finetuning_type: full
deepspeed: /root/LLaMA-Factory/examples/deepspeed/ds_z2_config.json
dataset: distilabel-reasoning-R1-Llama-70B-ja-train
template: qwen
cutoff_len: 4500
overwrite_cache: true
preprocessing_num_workers: 16
packing: true
output_dir: /root/train_outputs/DeepSeek-R1-Distill-Qwen-7B/distilabel-reasoning-R1-Llama-70B-ja-train
logging_steps: 1
save_steps: 0.99999
plot_loss: true
overwrite_output_dir: true
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 1.0e-5
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.01
bf16: true
ddp_timeout: 180000000
val_size: 0.01
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 0.1
```
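Note that `packing: true` concatenates multiple training samples into sequences of up to `cutoff_len` (4,500) tokens, so each batch is densely filled even though the per-device batch size is 1.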
Training run script:

```bash
echo '{
"distilabel-reasoning-R1-Llama-70B-ja-train": {
"hf_hub_url": "lightblue/distilabel-reasoning-R1-Llama-70B-ja-train",
"formatting": "sharegpt"
}
}' > /root/LLaMA-Factory/data/dataset_info.json
cd /root/LLaMA-Factory && llamafactory-cli train /root/reasoning_train.yaml
rm -r /root/train_outputs/DeepSeek-R1-Distill-Qwen-7B/distilabel-reasoning-R1-Llama-70B-ja-train/checkpoint*
huggingface-cli upload lightblue/DeepSeek-R1-Distill-Qwen-7B-Japanese /root/train_outputs/DeepSeek-R1-Distill-Qwen-7B/distilabel-reasoning-R1-Llama-70B-ja-train
```
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- total_train_batch_size: 8
- total_eval_batch_size: 8
- optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.01
- num_epochs: 1.0
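The total batch size of 8 follows directly from the values above: `per_device_train_batch_size` (1) × `gradient_accumulation_steps` (1) × `num_devices` (8).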
### Training results
| Training Loss | Epoch | Step | Validation Loss |
|:-:|:-:|:-:|:-:|
| 0.766 | 0.1087 | 5 | 0.5912 |
| 0.5873 | 0.2174 | 10 | 0.5282 |
| 0.3868 | 0.3261 | 15 | 0.4958 |
| 0.5101 | 0.4348 | 20 | 0.4761 |
| 0.4085 | 0.5435 | 25 | 0.4644 |
| 0.5561 | 0.6522 | 30 | 0.4578 |
| 0.4683 | 0.7609 | 35 | 0.4542 |
| 0.5055 | 0.8696 | 40 | 0.4526 |
| 0.5359 | 0.9783 | 45 | 0.4519 |
### Framework versions
- Transformers 4.46.1
- Pytorch 2.5.1+cu124
- Datasets 3.1.0
- Tokenizers 0.20.3
## License
We share this model under an Apache 2.0 license.
### Developed by

This model was trained by Peter Devine (ptrdvn) for Lightblue.