🚀 Motif 2.6B
Motif 2.6B is a 2.6-billion-parameter language model trained from scratch on AMD Instinct™ MI250 GPUs. It marks our first step toward building AI that is aligned with human values, helpful, and reliable. With this initial release, our goal is for Motif 2.6B to match the performance of well-known open-source models such as Gemma, Llama, and Phi, particularly in the small language model (sLLM) regime.
🚀 Quick Start
You can use the Motif 2.6B model as shown in the following example:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model; trust_remote_code is required for the custom architecture.
model = AutoModelForCausalLM.from_pretrained(
    "Motif-Technologies/Motif-2.6B",
    trust_remote_code=True,
    _attn_implementation="eager",  # also supports flash_attention_2
).cuda()

tokenizer = AutoTokenizer.from_pretrained(
    "Motif-Technologies/Motif-2.6B",
    trust_remote_code=True,
)

query = "What is the capital city of South Korea?"
input_ids = tokenizer.apply_chat_template(
    [
        {'role': 'system', 'content': 'you are a helpful assistant'},
        {'role': 'user', 'content': query},
    ],
    add_generation_prompt=True,
    return_tensors='pt',
).cuda()

output = model.generate(input_ids, max_new_tokens=128, pad_token_id=tokenizer.eos_token_id)
# Decode only the newly generated tokens, skipping the prompt.
output = tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True)
print(output)
"""
The capital city of South Korea is Seoul. Located in the southern part of the country, Seoul is not only the largest city in South Korea but also one of the largest metropolitan areas in the world.
It is a vibrant and dynamic city known for its rich history, cultural heritage, and modern amenities. Seoul is a major economic, cultural, and political center in East Asia, and it plays a crucial role in the region's politics, economy, and culture.
The city is divided into different administrative districts, each with its own unique characteristics and attractions.
"""
```
✨ Key Features
- Trained from scratch on AMD Instinct™ MI250 GPUs, a first step toward building AI aligned with human values.
- Aims to match the performance of well-known open-source models in the small language model (sLLM) regime.
📚 Detailed Documentation
Training Details
- GPUs: 384× AMD Instinct MI250
- Training duration: 42 days
- Training data: 2.4T tokens (a quick sanity check on these figures follows below)
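As a rough sanity check on the figures above, the widely used 6 × N × D approximation for dense-transformer training FLOPs gives the following back-of-envelope numbers (illustrative only; the actual training setup may differ):

```python
# Back-of-envelope training math from the figures above, assuming the
# common 6 * N * D FLOPs approximation for dense transformers.
params = 2.6e9   # N: model parameters
tokens = 2.4e12  # D: training tokens
gpus = 384       # MI250 GPUs
days = 42        # training duration

total_flops = 6 * params * tokens                      # ~3.7e22 FLOPs
tokens_per_day = tokens / days                         # ~5.7e10 tokens/day
flops_per_gpu_s = total_flops / (gpus * days * 86400)  # ~2.7e13 sustained FLOP/s per GPU

print(f"{total_flops:.2e} total FLOPs")
print(f"{tokens_per_day:.2e} tokens/day")
print(f"{flops_per_gpu_s:.2e} FLOP/s per GPU")
```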
⚠️ Note
A detailed technical report will be released at a later date.
Evaluation
When a model is released, the accompanying technical report or paper typically presents benchmark results under an evaluation setup chosen by the developers. While this is common and understandable practice, it creates challenges when comparing models across organizations: the same model can score differently depending on the evaluation conditions, and the details of those conditions are not always fully disclosed. This lack of standardization can make it difficult for the open-source community to interpret and trust reported results. We therefore cite the official performance scores that each model's developers reported in their respective publications.
To illustrate how much evaluation scores can vary across reports, the evaluation appendix provides concrete examples of score discrepancies on major benchmarks.
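For readers who want to reproduce scores under a single controlled setup, here is a minimal sketch using EleutherAI's lm-evaluation-harness (assuming its v0.4+ `simple_evaluate` API; this is not the configuration behind the numbers reported below):

```python
import lm_eval

# Evaluate under an explicit, controlled configuration. Changing
# num_fewshot alone can shift scores noticeably, which is the point
# made above about non-standardized evaluation setups.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Motif-Technologies/Motif-2.6B,trust_remote_code=True",
    tasks=["hellaswag", "arc_challenge"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```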
Comparison with Mistral AI's Mistral 7B
The benchmarks and corresponding scores in the table below are taken directly from the Mistral 7B technical report.
Benchmark | Metric | Mistral 7B | Motif 2.6B | Improvement |
---|---|---|---|---|
MMLU | 5-shot | 60.1 | 57.93 | -3.61% |
HellaSwag | 0-shot | 81.3 | 61.35 | -24.54% |
WinoG | 0-shot | 75.3 | 59.91 | -20.44% |
PIQA | 0-shot | 83 | 75.95 | -8.49% |
Arc-e | 0-shot | 80 | 87.21 | +9.01% |
Arc-c | 0-shot | 55.5 | 74.2 | +33.69% |
NQ | 5-shot | 28.8 | 11.14 | -61.32% |
TriviaQA | 5-shot | 69.9 | 54.97 | -21.36% |
HumanEval | 0-shot | 30.5 | 68.3 | +123.93% |
MBPP | 3-shot | 47.5 | 60.3 | +26.95% |
MATH | 4-shot, maj@4 | 13.1 | 40.2* | +206.87% |
GSM8K | 8-shot, maj@8 | 52.2 | 80.21 | +53.66% |
Average | | | | +34.25% |
*: We report the 4-shot, maj@1 score rather than the 4-shot, maj@4 score.
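Throughout these tables, the Improvement column appears to be the simple relative difference against the baseline score; a quick check that reproduces the HumanEval row above:

```python
# Relative improvement as used in the comparison tables:
# (Motif score - baseline score) / baseline score * 100.
def improvement(baseline: float, motif: float) -> float:
    return (motif - baseline) / baseline * 100

# HumanEval row from the Mistral 7B table: 30.5 -> 68.3
print(f"{improvement(30.5, 68.3):+.2f}%")  # +123.93%, matching the table
```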
Comparison with Google's Gemma Family
Gemma 1 & 2
The benchmarks and corresponding scores in the table below are taken directly from the Gemma 2 technical report.
⚠️ Note
Although Gemma 2 2B is branded as a "2B" model, it actually has 2.6 billion parameters.
Benchmark | Metric | Gemma 1 2B | Gemma 1 7B | Gemma 2 2B | Gemma 2 9B | Motif 2.6B | Improvement (vs Gemma 1 2B) | Improvement (vs Gemma 1 7B) | Improvement (vs Gemma 2 2B) | Improvement (vs Gemma 2 9B) |
---|---|---|---|---|---|---|---|---|---|---|
MMLU | 5-shot | 42.3 | 64.4 | 52.2 | 71.3 | 57.93 | +36.95% | -10.05% | +10.98% | -18.75% |
ARC-C | 25-shot | 48.5 | 61.1 | 55.7 | 68.4 | 75.08 | +54.80% | +22.88% | +34.79% | +9.77% |
GSM8K | 5-shot | 15.1 | 51.8 | 24.3 | 68.6 | 75.13 | +397.55% | +45.04% | +309.18% | +9.52% |
AGIEval | 3-5-shot | 24.2 | 44.9 | 31.5 | 52.8 | - | - | - | - | - |
DROP | 3-shot, F1 | 48.5 | 56.3 | 51.2 | 69.4 | 29.33 | -39.53% | -47.90% | -42.71% | -57.74% |
BBH | 3-shot, CoT | 35.2 | 59 | 41.9 | 68.2 | 48.56 | +37.95% | -17.69% | +15.89% | -28.80% |
Winogrande | 5-shot | 66.8 | 79 | 71.3 | 80.6 | 67.09 | +0.43% | -15.08% | -5.90% | -16.76% |
HellaSwag | 10-shot | 71.7 | 82.3 | 72.9 | 81.9 | 69.89 | -2.52% | -15.08% | -4.13% | -14.66% |
MATH | 4-shot | 11.8 | 24.3 | 16 | 36.6 | 40.2 | +240.88% | +65.43% | +151.25% | +9.84% |
ARC-e | 0-shot | 73.2 | 81.5 | 80.6 | 88 | 87.21 | +19.14% | +7.01% | +8.20% | -0.90% |
PIQA | 0-shot | 77.3 | 81.2 | 78.4 | 81.7 | 75.95 | -1.75% | -6.47% | -3.13% | -7.04% |
SIQA | 0-shot | 49.7 | 51.8 | 51.9 | 53.4 | 61.97 | +24.69% | +19.63% | +19.40% | +16.05% |
Boolq | 0-shot | 69.4 | 83.2 | 72.7 | 84.2 | 67.76 | -2.36% | -18.56% | -6.80% | -19.52% |
TriviaQA | 5-shot | 53.2 | 63.4 | 60.4 | 76.6 | 54.97 | +3.33% | -13.30% | -8.99% | -28.24% |
Average | | | | | | | +90.79% | +3.44% | +46.17% | -13.45% |
Gemma 3
The benchmarks and corresponding scores in the table below are taken directly from the Gemma 3 technical report.
Benchmark | Metric | Gemma 3 1B | Gemma 3 4B | Motif 2.6B | Improvement (vs Gemma 3 1B) | Improvement (vs Gemma 3 4B) |
---|---|---|---|---|---|---|
HellaS | 10-shot | 62.3 | 77.2 | 69.89 | +12.18% | -9.47% |
BoolQ | 0-shot | 63.2 | 72.3 | 67.76 | +7.22% | -6.28% |
PIQA | 0-shot | 73.8 | 79.6 | 75.59 | +2.43% | -5.04% |
SIQA | 0-shot | 48.9 | 51.9 | 61.97 | +26.73% | +19.40% |
TQA | 5-shot | 39.8 | 65.8 | 54.97 | +38.12% | -16.46% |
NQ | 5-shot | 9.48 | 20 | 10.91 | +15.08% | -45.45% |
ARC-C | 25-shot | 38.4 | 56.2 | 75.08 | +95.52% | +33.59% |
ARC-E | 0-shot | 73 | 82.4 | 87.21 | +19.47% | +5.84% |
WinoG | 5-shot | 58.2 | 64.7 | 67.09 | +15.27% | +3.69% |
BBH | few-shot, CoT | 28.4 | 50.9 | 48.56 | +70.99% | -4.60% |
DROP | 1-shot, F1 | 42.4 | 60.1 | 29.33 | -30.83% | -51.20% |
MMLU | 5-shot | - | 59.6 | 57.93 | - | -2.80% |
MMLUpro | 5-shot, CoT | - | 29.2 | - | - | - |
AGIE | 3-5-shot | - | 42.1 | - | - | - |
MATH | 4-shot, CoT | - | 24.2 | 40.2 | - | +66.12% |
GSM8K | 8-shot, CoT | - | 38.4 | 80.21 | - | +108.88% |
GPQA Diamond | 5-shot, CoT | - | 15 | 31.81 | - | +112.07% |
MBPP | 3-shot | - | 46 | 60.3 | - | +31.09% |
HumanE | 0-shot | - | 36 | 68.3 | - | +89.72% |
Average | +22.04% | | | | | |
Comparison with Meta's Llama Family
Llama 3
The benchmarks and corresponding scores in the table below are taken directly from the Llama 3 technical report.
Benchmark | Metric | Llama 3 8B | Motif 2.6B | Improvement |
---|---|---|---|---|
MMLU | 5-shot | 69.4 | 57.93 | -16.53% |
MMLU | 0-shot, CoT | 73 | 57.95 | -20.62% |
MMLU-Pro | 5-shot, CoT | 48.3 | - | - |
IFEval | - | 80.4 | 74.02 | -7.94% |
HumanEval | 0-shot | 72.6 | 68.3 | -5.92% |
MBPP | 0-shot | 72.8 | 57.93 | -20.43% |
GSM8K | 8-shot, CoT | 84.5 | 80.21 | -5.08% |
MATH | 0-shot, CoT | 51.9 | 49.68 | -4.28% |
ARC Challenge | 0-shot | 83.4 | 74.2 | -11.03% |
GPQA | 0-shot, CoT | 32.8 | 18.53 | -43.51% |
Average | | | | -15.04% |
Llama 3.2
The benchmarks and corresponding scores in the table below are taken directly from the official Llama 3.2 blog post.
Benchmark | Metric | Llama 3.2 1B | Llama 3.2 3B | Motif 2.6B | Improvement (vs Llama 3.2 1B) | Improvement (vs Llama 3.2 3B) |
---|---|---|---|---|---|---|
MMLU | 0-shot | 49.3 | 63.4 | 57.6 | +16.75% | -9.21% |
Open-rewrite eval* | 0-shot, rougeL | 41.6 | 40.1 | - | - | - |
TLDR9+ | test, 1-shot, rougeL | 16.8 | 19 | - | - | - |
IFEval | - | 59.5 | 77.4 | 74.02 | +24.40% | -4.37% |
GSM8K | 8-shot, CoT | 44.4 | 77.7 | 80.21 | +80.65% | +3.23% |
MATH | 0-shot, CoT | 30.6 | 48 | 49.68 | +62.35% | +3.50% |
ARC Challenge | 0-shot | 59.4 | 78.6 | 74.2 | +24.92% | -5.60% |
GPQA | 0-shot | 27.2 | 32.8 | 25.45 | -6.43% | -22.41% |
Hellaswag | 0-shot | 41.2 | 69.8 | 61.35 | +48.91% | -12.11% |
Average | +41.82% | | | | | |
Comparison with Microsoft's Phi Family
The benchmarks and corresponding scores in the table below are taken directly from the Phi-3 technical report.
Benchmark | Metric | Phi-3 3.8B | Phi-3 7B | Phi-2 2.7B | Motif 2.6B | Improvement (vs Phi-3 3.8B) | Improvement (vs Phi-3 7B) | Improvement (vs Phi-2 2.7B) |
---|---|---|---|---|---|---|---|---|
MMLU | 5-shot | 68.8 | 75.7 | 56.3 | 57.93 | -15.80% | -23.47% | +2.90% |
HellaSwag | 5-shot | 76.7 | 77 | 53.6 | 68.97 | -10.08% | -10.43% | +28.68% |
ANLI | 7-shot | 52.8 | 58.1 | 42.5 | 47.99 | -9.11% | -17.40% | +12.92% |
GSM-8K | 8-shot, CoT | 82.5 | 89.6 | 61.1 | 80.21 | -2.78% | -10.48% | +31.28% |
MATH | 0-shot, CoT | 41.3 | 34.6 | - | 49.68 | +20.29% | +43.58% | - |
MedQA | 2-shot | 53.8 | 65.4 | 40.9 | 42.1 | -21.75% | -35.63% | +2.93% |
AGIEval | 0-shot | 37.5 | 45.1 | 29.8 | - | - | - | - |
TriviaQA | 5-shot | 64 | 58.1 | 45.2 | 54.97 | -14.11% | -5.39% | +21.62% |
Arc-C | 10-shot | 84.9 | 90.7 | 75.9 | 75.17 | -11.46% | -17.12% | -0.96% |
Arc-E | 10-shot | 94.6 | 97 | 88.5 | 88.64 | -6.30% | -8.62% | +0.16% |
PIQA | 5-shot | 84.2 | 86.9 | 60.2 | 78.29 | -7.02% | -9.91% | +30.05% |
SociQA | 5-shot | 76.6 | 79.2 | 68.3 | 66.73 | -12.89% | -15.74% | -2.30% |
BigBench-Hard | 3-shot, CoT | 71.7 | 79.1 | 59.4 | 48.56 | -32.27% | -38.61% | -18.25% |
WinoGrande | 5-shot | 70.8 | 81.5 | 54.7 | 67.09 | -5.24% | -17.68% | +22.65% |
OpenBookQA | 10-shot | 83.2 | 88 | 73.6 | 87.8 | +5.53% | -0.23% | +19.29% |
BoolQ | 2-shot | 77.2 | 84.8 | - | 70.7 | -8.42% | -16.63% | - |
CommonSenseQA | 10-shot | 80.2 | 80 | 69.3 | 71.25 | -11.16% | -10.94% | +2.81% |
TruthfulQA | 10-shot | 65 | 70.2 | - | 52.07 | -19.89% | -25.83% | - |
HumanEval | 0-shot | 58.5 | 61 | 59 | 68.29 | +16.74% | +11.95% | +15.75% |
MBPP | 3-shot | 70 | 71.7 | 60.6 | 60.3 | -13.86% | -15.90% | -0.50% |
GPQA | 2-shot, CoT | 32.8 | 34.3 | - | 23.44 | -28.54% | -31.66% | - |
MT Bench | 2R. Avg. | 8.38 | 8.7 | - | 6.77 | -19.21% | -22.18% | - |
Average | | | | | | -9.87% | -13.25% | |
Evaluation Appendix
In the comparisons above, based on the benchmark scores reported in the original technical reports, Motif 2.6B shows average improvements of -15.04% against Llama 3 8B and -13.45% against Gemma 2 9B. However, against the benchmarks and scores reported in the Qwen 2.5 technical report, Motif 2.6B's average improvement is +19.27% relative to Llama 3 8B and +1.68% relative to Gemma 2 9B. Details are in the table below.
Comparison with Llama 3 8B and Gemma 2 9B based on Qwen2.5 technical report scores
Benchmark | Metric | Llama 3 8B | Gemma 2 9B | Motif 2.6B | Improvement (vs Llama 3 8B) | Improvement (vs Gemma 2 9B) |
---|---|---|---|---|---|---|
MMLU | 5-shot | 66.6 | 71.3 | 57.93 | -13.02% | -18.75% |
MMLU-pro | 5-shot | 35.4 | 44.7 | 28.4 | -19.77% | -36.47% |
MMLU-redux | 5-shot | 61.6 | 67.9 | 59.54 | -3.34% | -12.31% |
BBH | 3-shot | 57.7 | 68.2 | 39.28 | -31.92% | -42.40% |
ARC-C | 25-shot | 59.3 | 68.2 | 75.08 | +26.61% | +10.09% |
TruthfulQA | 0-shot | 44 | 45.3 | 41.55 | -5.56% | -8.27% |
Winogrande | 5-shot | 77.4 | 79.5 | 67.09 | -13.32% | -15.61% |
HellaSwag | 10-shot | 82.1 | 81.9 | 69.88 | -14.88% | -14.68% |
GPQA | 5-shot | 25.8 | 32.8 | 29.24 | +13.33% | -10.85% |
TheoremQA | 5-shot | 22.1 | 28.9 | - | - | - |
MATH | 4-shot | 20.5 | 37.7 | 40.2 | +96.10% | +6.63% |
MMLU-stem | 5-shot | 55.3 | 65.1 | 52.9 | -4.34% | -18.74% |
GSM8K | 4-shot | 55.3 | 70.7 | 75.2 | +35.99% | +6.36% |
HumanEval | 0-shot | 33.5 | 37.8 | 68.3 | +103.88% | +80.69% |
HumanEval+ | 0-shot | 29.3 | 30.5 | 62.2 | +112.29% | +103.93% |
MBPP | 0-shot | 53.9 | 62.2 | 60.3 | +11.87% | -3.05% |
MBPP+ | 0-shot | 44.4 | 50.6 | 50.8 | +14.41% | +0.40% |
MultiPL-E | 0-shot | 22.6 | 34.9 | - | - | - |
Average | | | | | +19.27% | +1.68% |
📄 License
This project is released under the motif-license.



