CodeGeeX4-ALL-9B open-source code generation model - continually trained from GLM-4-9B, with significantly improved capabilities

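CodeGeeX4-ALL-9B

Developed by THUDM

CodeGeeX4-ALL-9B is the latest open-source release in the CodeGeeX4 model series. It is continually trained from GLM-4-9B, which significantly improves its code generation capability.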

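Large Language Model

Transformers

Supports multiple languages | License: Other | #multilingual-code-generation #repository-level-code-qa #function-calling

Downloads: 294

Release date: 7/5/2024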

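Model Overview

CodeGeeX4-ALL-9B is a multilingual code generation model. It supports code completion and generation, a code interpreter, web search, function calling, and repository-level code Q&A, covering a wide range of software development scenarios.

Model Features

Multilingual code generation

Generates and completes code in many programming languages, covering a broad range of development scenarios.

High performance

Performs strongly on public benchmarks such as BigCodeBench and NaturalCodeBench, and is described by its authors as the strongest code generation model with fewer than 10 billion parameters at the time of release.

Multi-function support

Supports a code interpreter, web search, function calling, repository-level code Q&A, and other functions.

Balance of inference speed and performance

Aims for the best trade-off between inference speed and model performance.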

Model Capabilities

Code completion

Code generation

Code explanation

Web search

Function calling

Repository-level code Q&A

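Use Cases

Software development

Quick sort implementation

Generate an implementation of the quick sort algorithm

Produces executable quick sort code

Code explanation

Explain the function and logic of a complex code snippet

Provides a detailed code explanation

Code completion

Code block completion

Fill in the middle code block given the prefix and suffix code

Produces a completion that fits the surrounding context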

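🚀 CodeGeeX4: Open Multilingual Code Generation Model

CodeGeeX4 is a code generation model based on the Transformer architecture. It supports code completion, code generation, code explanation, web search, function calling, and repository-level code Q&A, covering a wide range of software development scenarios. CodeGeeX4 achieves highly competitive results on several public benchmarks. Its authors describe it as the most powerful code generation model with fewer than 10 billion parameters at the time of release, and as striking the best balance between inference speed and model performance.

🚀 Quick Start

Use 4.39.0<=transformers<=4.40.2 to quickly launch codegeex4-all-9b: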

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Use the GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# trust_remote_code=True lets transformers load the custom code shipped with the model.
tokenizer = AutoTokenizer.from_pretrained("THUDM/codegeex4-all-9b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/codegeex4-all-9b",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to(device).eval()

# Build the chat prompt from a single user message.
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "write a quick sort"}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(device)

with torch.no_grad():
    # max_length counts the prompt tokens plus the generated tokens.
    outputs = model.generate(**inputs, max_length=256)
    # Drop the prompt tokens so that only the completion is decoded.
    outputs = outputs[:, inputs["input_ids"].shape[1]:]
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
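
The version range stated for the quick start (4.39.0<=transformers<=4.40.2) can be checked before loading the model. The following is a minimal sketch using only the standard library; the helper names `parse_release` and `is_supported_transformers` are hypothetical and not part of transformers or CodeGeeX4, and the bounds simply restate the range given above.

```python
import re

# Bounds restated from the quick-start text; adjust them if the upstream
# requirement changes.
MIN_VERSION = (4, 39, 0)
MAX_VERSION = (4, 40, 2)


def parse_release(version: str) -> tuple:
    """Extract the leading major.minor.patch numbers from a version string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def is_supported_transformers(version: str) -> bool:
    """Return True when the version lies inside the stated range."""
    return MIN_VERSION <= parse_release(version) <= MAX_VERSION
```

In practice the argument would be `transformers.__version__`, for example `is_supported_transformers(transformers.__version__)` after `import transformers`. Pre-release suffixes such as `.dev0` are ignored by this sketch.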

If you want to build the chat prompt manually, make sure it follows this format:

f"<|system|>\n{system_prompt}\n<|user|>\n{prompt}\n<|assistant|>\n"

Default system prompt (Chinese):

你是一位智能编程助手，你叫CodeGeeX。你会为用户回答关于编程、代码、计算机方面的任何问题，并提供格式规范、可以执行、准确安全的代码，并在必要时提供详细的解释。

English version:

You are an intelligent programming assistant named CodeGeeX. You will answer any questions users have about programming, coding, and computers, and provide code that is formatted correctly, executable, accurate, and secure, and offer detailed explanations when necessary.
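
The manual chat format above can be wrapped in a small helper. This is only a sketch: `build_chat_prompt` and `DEFAULT_SYSTEM_PROMPT` are hypothetical names introduced for illustration, and the default value is the Chinese system prompt quoted above, copied verbatim.

```python
# Default system prompt, copied verbatim from the Chinese version above.
DEFAULT_SYSTEM_PROMPT = (
    "你是一位智能编程助手，你叫CodeGeeX。"
    "你会为用户回答关于编程、代码、计算机方面的任何问题，"
    "并提供格式规范、可以执行、准确安全的代码，并在必要时提供详细的解释。"
)


def build_chat_prompt(prompt: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    """Assemble the system, user, and assistant segments in the documented order."""
    return f"<|system|>\n{system_prompt}\n<|user|>\n{prompt}\n<|assistant|>\n"


# Example: a prompt asking for a quick sort, using the default system prompt.
chat_prompt = build_chat_prompt("write a quick sort")
```

The helper only produces the prompt string; how that string is tokenized and passed to the model is outside the scope of this sketch.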

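For infilling, use the following format (without a system prompt):

f"<|user|>\n<|code_suffix|>{suffix}<|code_prefix|>{prefix}<|code_middle|><|assistant|>\n"

Additional information (such as file path, programming language, and mode) can be added. Example: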

<|user|>
###PATH:src/example.py
###LANGUAGE:Python
###MODE:BLOCK
<|code_suffix|>{suffix}<|code_prefix|>{prefix}<|code_middle|><|assistant|>
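
The infilling format and its optional metadata lines can be assembled programmatically. The sketch below assumes that each metadata line is optional and that the lines appear in the order shown in the example (PATH, LANGUAGE, MODE); the function name `build_infill_prompt` and its parameters are hypothetical, and the trailing newline follows the f-string format given earlier.

```python
from typing import Optional


def build_infill_prompt(
    prefix: str,
    suffix: str,
    path: Optional[str] = None,
    language: Optional[str] = None,
    mode: Optional[str] = None,
) -> str:
    """Build an infilling prompt: metadata lines first, then suffix, prefix, middle marker."""
    header = ""
    if path is not None:
        header += f"###PATH:{path}\n"
    if language is not None:
        header += f"###LANGUAGE:{language}\n"
    if mode is not None:
        header += f"###MODE:{mode}\n"
    return (
        f"<|user|>\n{header}"
        f"<|code_suffix|>{suffix}<|code_prefix|>{prefix}<|code_middle|><|assistant|>\n"
    )


# Example: ask the model to fill in the body between a function header and its return.
infill_prompt = build_infill_prompt(
    prefix="def add(a, b):\n",
    suffix="\n    return result\n",
    path="src/example.py",
    language="Python",
    mode="BLOCK",
)
```

Note that the suffix precedes the prefix in the prompt, so the model's output continues directly from the end of the prefix.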

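✨ Key Features

Multilingual support: supports many programming languages to meet the needs of different developers.
High performance: achieves highly competitive results on several public benchmarks; its authors describe it as the most powerful code generation model with fewer than 10 billion parameters at the time of release.
Rich functionality: supports code completion, code generation, code explanation, web search, function calling, and repository-level code Q&A, covering a wide range of software development scenarios.
Fast inference: aims for the best balance between inference speed and model performance.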

📊 Evaluation

Model	Seq Length	HumanEval	MBPP	NCB	LCB	HumanEvalFIM	CRUXEval-O
Llama3-70B-Instruct	8K	77.4	82.3	37.0	27.4	-	-
DeepSeek Coder 33B Instruct	16K	81.1	80.4	39.3	29.3	78.2	49.9
Codestral-22B	32K	81.1	78.2	46.0	35.3	91.6	51.3
CodeGeeX4-All-9B	128K	82.3	75.7	40.4	28.5	85.0	47.1

📄 License

The model weights are released under the following license.

📚 Citation

If you find our work helpful, please cite the following paper:

@inproceedings{zheng2023codegeex,
  title={CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X},
  author={Qinkai Zheng and Xiao Xia and Xu Zou and Yuxiao Dong and Shan Wang and Yufei Xue and Zihan Wang and Lei Shen and Andi Wang and Yang Li and Teng Su and Zhilin Yang and Jie Tang},
  booktitle={Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
  pages={5673--5684},
  year={2023}
}