seed-coder-triton-8b-v1 open-source large language model - long-sequence input and efficient training


Seed Coder Triton 8b V1

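Developed by winglian

A large language model fine-tuned from ByteDance-Seed/Seed-Coder-8B-Base on the axolotl-ai-internal/gpumode-py2triton-reasoning-v2 dataset, trained with a sequence length of 16384 and sample packing.

Large language model

Transformers

License: MIT #long-sequence-inference #code-generation #efficient-fine-tuning

Downloads: 1,388

Released: 5/13/2025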

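Model overview

This model is the result of fine-tuning Seed-Coder-8B-Base on the axolotl-ai-internal/gpumode-py2triton-reasoning-v2 dataset and targets the tasks covered by that dataset.

Model features

Long-sequence support

Trained with a sequence length of 16384 tokens, which suits long text or complex code

Efficient training strategy

Uses sample packing with padding to the full sequence length, combined with several optimization plugins, to improve training efficiency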

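Optimized kernels

Uses LigerPlugin (Liger kernels for RoPE, RMSNorm, and the GLU activation) and CutCrossEntropyPlugin during training; these target memory use and throughput rather than changing the model architecture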

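Model capabilities

Code generation

Logical reasoning

Long-text processing

Use cases

Code-related

Code generation

Generates code for a specified function from a requirement

Evaluation-set loss: 0.2177

Code reasoning

Understands and analyzes the logic of existing code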

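🚀 Transformers model project

This project fine-tunes a base model using the transformers library. The result is a checkpoint adapted to the tasks represented in its training dataset.

🚀 Quick start

This model is a fine-tuned version of ByteDance-Seed/Seed-Coder-8B-Base on the axolotl-ai-internal/gpumode-py2triton-reasoning-v2 dataset. It achieves the following result on the evaluation set:

Loss: 0.2177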
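
To make the loss figure easier to interpret, the sketch below converts it to perplexity. It assumes the reported value is a mean per-token cross-entropy in nats, which is what the Hugging Face Trainer normally logs for causal language models; the page itself does not state the unit.

```python
import math


def perplexity(mean_cross_entropy_nats: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy_nats)


final_eval_loss = 0.2177  # evaluation loss reported on this page
print(f"perplexity ~ {perplexity(final_eval_loss):.3f}")
```

A perplexity close to 1 means the model assigns high probability to the reference tokens of the evaluation split; it says nothing by itself about whether generated kernels compile or run correctly.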

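✨ Key features

Fine-tuned from the base model ByteDance-Seed/Seed-Coder-8B-Base.
Trained with several plugins and optimizations, including LigerPlugin and CutCrossEntropyPlugin.
Supports long-sequence input, with a sequence length of up to 16384.
Uses sample packing and padding to improve training efficiency.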
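
The effect of sample packing can be illustrated with a small sketch. It uses a greedy first-fit-decreasing heuristic, which is an assumption chosen for clarity and is not claimed to be the algorithm Axolotl uses; the sample lengths are made up for the example.

```python
from typing import List

SEQUENCE_LEN = 16384  # matches sequence_len in the config


def pack_first_fit(lengths: List[int], capacity: int = SEQUENCE_LEN) -> List[List[int]]:
    """Greedily pack samples (given by token length) into bins of `capacity` tokens.

    Returns a list of bins, each a list of sample indices.
    Samples longer than `capacity` are skipped.
    """
    # Place the longest samples first (first-fit decreasing).
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins: List[List[int]] = []
    free: List[int] = []  # remaining room in each bin
    for i in order:
        n = lengths[i]
        if n > capacity:
            continue
        for b, room in enumerate(free):
            if n <= room:
                bins[b].append(i)
                free[b] -= n
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([i])
            free.append(capacity - n)
    return bins


def padding_fraction(lengths: List[int], bins: List[List[int]], capacity: int = SEQUENCE_LEN) -> float:
    """Fraction of token slots that hold padding after each bin is padded to `capacity`."""
    if not bins:
        return 0.0
    used = sum(lengths[i] for b in bins for i in b)
    return 1.0 - used / (len(bins) * capacity)


lengths = [9000, 7000, 5000, 4000, 3000, 384]  # hypothetical sample lengths in tokens
bins = pack_first_fit(lengths)
unpacked = [[i] for i in range(len(lengths))]  # one sample per padded sequence
print(bins)
print(f"padding with packing:    {padding_fraction(lengths, bins):.1%}")
print(f"padding without packing: {padding_fraction(lengths, unpacked):.1%}")
```

With `pad_to_sequence_len: true`, every packed sequence has the same shape, so the remaining padding is the cost paid for fixed-size batches.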

📦 Installation

The source documentation gives no installation steps.

💻 Usage examples

The source documentation gives no code examples.
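
In the absence of an official example, the following is only a sketch of how a checkpoint like this is typically loaded with transformers. The repository id in `MODEL_ID` is hypothetical and not confirmed by this page. The manual prompt builder assumes the Llama-3 header layout implied by `chat_template: llama3` and the added tokens in the config, and it does not prepend a BOS token. If the published tokenizer ships its own chat template, that template is used instead.

```python
from typing import Dict, List

# ASSUMPTION: hypothetical repository id, not confirmed by this page.
MODEL_ID = "winglian/seed-coder-triton-8b-v1"


def build_llama3_prompt(messages: List[Dict[str, str]]) -> str:
    """Render chat messages in Llama-3 header style and open an assistant turn."""
    parts = []
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


def generate(messages: List[Dict[str, str]], max_new_tokens: int = 1024) -> str:
    """Load the model and generate one reply (needs torch, transformers, accelerate)."""
    # Imported lazily so the prompt helper works without these packages installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    if tokenizer.chat_template is not None:
        # Prefer the template stored with the tokenizer when one exists.
        prompt = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
    else:
        prompt = build_llama3_prompt(messages)
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


example = [{"role": "user", "content": "Convert this PyTorch softmax into a Triton kernel."}]
print(build_llama3_prompt(example))
```

Calling `generate(example)` downloads the weights, so it is left out of the script body. Because the config adds `<think>` and `</think>` tokens, replies may contain a reasoning section before the final code.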

📚 Detailed documentation

Axolotl configuration details


Axolotl version: 0.10.0.dev0

base_model: ByteDance-Seed/Seed-Coder-8B-Base

plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true

chat_template: llama3
datasets:
  - path: axolotl-ai-internal/gpumode-py2triton-reasoning-v2
    type: chat_template
    split: train

dataset_prepared_path: last_run_prepared
val_set_size: 0.005
output_dir: ./outputs/out

sequence_len: 16384
sample_packing: true
pad_to_sequence_len: true

wandb_project: seed-coder-8b-grpo-triton
wandb_entity: axolotl-ai
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 1
micro_batch_size: 2
num_epochs: 3
optimizer: adamw_torch_fused
max_grad_norm: 0.1
neftune_noise_alpha: 10
lr_scheduler: cosine
learning_rate: 1e-6
lr_groups:
  - name: embeddings
    modules:
      - embed_tokens
      - lm_head
    lr: 0.00003  # scale up LR for embeddings as these are unused tokens

bf16: true
tf32: true

gradient_checkpointing: offload
gradient_checkpointing_kwargs:
  use_reentrant: false
logging_steps: 1
flash_attention: true

warmup_steps: 100
evals_per_epoch: 5
saves_per_epoch: 1
weight_decay: 0.01
deepspeed: deepspeed_configs/zero1.json
special_tokens:
  eos_token: <|eot_id|>
added_tokens_overrides:
  7: <|start_header_id|>
  8: <|end_header_id|>
  9: <|eot_id|>
  10: <think>
  11: </think>
fix_untrained_tokens: [7, 8, 9, 10, 11]
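
The `lr_groups` entry in the config gives the embedding and output layers a higher learning rate than the rest of the model, since the overridden tokens 7 to 11 start out untrained. The sketch below shows the idea on plain parameter names. Matching by substring is an assumption made for illustration and is not claimed to be how Axolotl resolves modules; the parameter names are hypothetical.

```python
from typing import Dict, Iterable, List

BASE_LR = 1e-6  # learning_rate in the config
LR_GROUPS: List[dict] = [
    {"name": "embeddings", "modules": ["embed_tokens", "lm_head"], "lr": 3e-5},
]


def assign_learning_rates(
    param_names: Iterable[str],
    base_lr: float = BASE_LR,
    lr_groups: List[dict] = LR_GROUPS,
) -> Dict[str, float]:
    """Map each parameter name to the LR of the first group it matches, else base_lr."""
    rates: Dict[str, float] = {}
    for name in param_names:
        rates[name] = base_lr
        for group in lr_groups:
            # ASSUMPTION: a parameter belongs to a group if a module name occurs in its name.
            if any(module in name for module in group["modules"]):
                rates[name] = group["lr"]
                break
    return rates


# Hypothetical parameter names in the style of a decoder-only transformer.
names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "lm_head.weight",
]
for name, lr in assign_learning_rates(names).items():
    print(f"{name}: {lr:g}")
```

With these values the embedding group trains at 30 times the base rate.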

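Training hyperparameters

The following hyperparameters were used during training:

Learning rate: 1e-06
Training batch size: 2
Evaluation batch size: 2
Random seed: 42
Distributed type: multi-GPU
Number of devices: 10
Total training batch size: 20
Total evaluation batch size: 20
Optimizer: OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
Learning rate scheduler type: cosine
Learning rate scheduler warmup steps: 100
Number of epochs: 3.0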
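
The relationship between these numbers can be checked with a short sketch. The total batch size is the per-device batch size times the gradient accumulation steps times the device count. The schedule function follows the common linear-warmup-then-cosine shape. The total step count of 654 is an estimate inferred from the training log (about 218 steps per epoch over 3 epochs) and is not stated on this page.

```python
import math

MICRO_BATCH_SIZE = 2   # micro_batch_size
GRAD_ACCUM_STEPS = 1   # gradient_accumulation_steps
NUM_DEVICES = 10
SEQUENCE_LEN = 16384   # sequence_len
PEAK_LR = 1e-6         # learning_rate
WARMUP_STEPS = 100     # warmup_steps
TOTAL_STEPS = 654      # ASSUMPTION: estimated from the training log, not reported


def total_batch_size(micro: int, accum: int, devices: int) -> int:
    """Sequences consumed per optimizer step across all devices."""
    return micro * accum * devices


def cosine_lr(step: int, total_steps: int = TOTAL_STEPS,
              peak_lr: float = PEAK_LR, warmup_steps: int = WARMUP_STEPS) -> float:
    """Linear warmup from 0 to peak_lr, then half-cosine decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


batch = total_batch_size(MICRO_BATCH_SIZE, GRAD_ACCUM_STEPS, NUM_DEVICES)
print(f"total batch size: {batch} sequences")
print(f"token slots per optimizer step: {batch * SEQUENCE_LEN}")
for step in (0, 50, 100, 377, 654):
    print(f"step {step:>3}: lr = {cosine_lr(step):.3e}")
```

Because sequences are packed and padded to 16384 tokens, each optimizer step processes a fixed number of token slots regardless of individual sample lengths.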

Training results

Training loss	Epoch	Step	Validation loss
0.5293	0.0046	1	5.7151
0.4449	0.2018	44	0.4878
0.425	0.4037	88	0.4319
0.3437	0.6055	132	0.3322
0.2903	0.8073	176	0.2893
0.2528	1.0092	220	0.2677
0.2578	1.2110	264	0.2531
0.2522	1.4128	308	0.2414
0.2403	1.6147	352	0.2352
0.232	1.8165	396	0.2252
0.2093	2.0183	440	0.2360
0.2406	2.2202	484	0.2311
0.2523	2.4220	528	0.2260
0.2139	2.6239	572	0.2259
0.2296	2.8257	616	0.2177
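
The log above can be inspected programmatically. The sketch below copies the rows into a list, finds the evaluation point with the lowest validation loss, and estimates the number of optimizer steps per epoch from the last row. The estimate is derived from the logged values and is not stated anywhere on this page.

```python
# Rows copied from the table: (training loss, epoch, step, validation loss)
ROWS = [
    (0.5293, 0.0046, 1, 5.7151),
    (0.4449, 0.2018, 44, 0.4878),
    (0.4250, 0.4037, 88, 0.4319),
    (0.3437, 0.6055, 132, 0.3322),
    (0.2903, 0.8073, 176, 0.2893),
    (0.2528, 1.0092, 220, 0.2677),
    (0.2578, 1.2110, 264, 0.2531),
    (0.2522, 1.4128, 308, 0.2414),
    (0.2403, 1.6147, 352, 0.2352),
    (0.2320, 1.8165, 396, 0.2252),
    (0.2093, 2.0183, 440, 0.2360),
    (0.2406, 2.2202, 484, 0.2311),
    (0.2523, 2.4220, 528, 0.2260),
    (0.2139, 2.6239, 572, 0.2259),
    (0.2296, 2.8257, 616, 0.2177),
]

# Evaluation point with the lowest validation loss.
best = min(ROWS, key=lambda row: row[3])

# Steps where validation loss rose relative to the previous evaluation.
regressions = [cur[2] for prev, cur in zip(ROWS, ROWS[1:]) if cur[3] > prev[3]]

# Estimate steps per epoch from the last logged row (step / epoch).
last = ROWS[-1]
steps_per_epoch = last[2] / last[1]

print(f"best validation loss {best[3]} at step {best[2]} (epoch {best[1]})")
print(f"validation loss increased at steps: {regressions}")
print(f"estimated steps per epoch: {steps_per_epoch:.1f}")
```

The lowest validation loss coincides with the last logged evaluation, which matches the 0.2177 figure quoted elsewhere on the page.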