🚀 Motif 2.6B
Motif 2.6B is a 2.6-billion-parameter language model trained from scratch on AMD Instinct™ MI250 GPUs. It marks our first step toward building AI that is aligned with human values, helpful, and reliable. With this initial release, our goal is for Motif 2.6B to match the performance of well-known open-source models such as Gemma, Llama, and Phi, particularly in the small language model (sLLM) regime.
🚀 Quick Start
You can use the Motif 2.6B model as shown in the following example:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model; trust_remote_code is required for the custom architecture.
model = AutoModelForCausalLM.from_pretrained(
    "Motif-Technologies/Motif-2.6B",
    trust_remote_code=True,
    _attn_implementation="eager",  # also supports flash_attention_2
).cuda()

tokenizer = AutoTokenizer.from_pretrained(
    "Motif-Technologies/Motif-2.6B",
    trust_remote_code=True,
)

query = "What is the capital city of South Korea?"
input_ids = tokenizer.apply_chat_template(
    [
        {'role': 'system', 'content': 'you are a helpful assistant'},
        {'role': 'user', 'content': query},
    ],
    add_generation_prompt=True,
    return_tensors='pt',
).cuda()

output = model.generate(input_ids, max_new_tokens=128, pad_token_id=tokenizer.eos_token_id)
# Decode only the newly generated tokens, skipping the prompt.
output = tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True)
print(output)
"""
The capital city of South Korea is Seoul. Located in the southern part of the country, Seoul is not only the largest city in South Korea but also one of the largest metropolitan areas in the world.
It is a vibrant and dynamic city known for its rich history, cultural heritage, and modern amenities. Seoul is a major economic, cultural, and political center in East Asia, and it plays a crucial role in the region's politics, economy, and culture.
The city is divided into different administrative districts, each with its own unique characteristics and attractions.
"""
```
✨ Key Features
- Trained from scratch on AMD Instinct™ MI250 GPUs, a first step toward building AI aligned with human values.
- Aims to match the performance of well-known open-source models in the small language model (sLLM) regime.
📚 Detailed Documentation
Training Details
- GPUs: 384× AMD Instinct MI250
- Training duration: 42 days
- Training data: 2.4T tokens (a quick sanity check on these figures follows below)
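As a rough sanity check on the figures above, the widely used 6 × N × D approximation for dense-transformer training FLOPs gives the following back-of-envelope numbers (illustrative only; the actual training setup may differ):

```python
# Back-of-envelope training math from the figures above, assuming the
# common 6 * N * D FLOPs approximation for dense transformers.
params = 2.6e9   # N: model parameters
tokens = 2.4e12  # D: training tokens
gpus = 384       # MI250 GPUs
days = 42        # training duration

total_flops = 6 * params * tokens                      # ~3.7e22 FLOPs
tokens_per_day = tokens / days                         # ~5.7e10 tokens/day
flops_per_gpu_s = total_flops / (gpus * days * 86400)  # ~2.7e13 sustained FLOP/s per GPU

print(f"{total_flops:.2e} total FLOPs")
print(f"{tokens_per_day:.2e} tokens/day")
print(f"{flops_per_gpu_s:.2e} FLOP/s per GPU")
```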
⚠️ Note
A detailed technical report will be released at a later date.
Evaluation
When a model is released, the accompanying technical report or paper typically presents benchmark results under an evaluation setup chosen by the developers. While this is common and understandable practice, it creates challenges when comparing models across organizations: the same model can score differently depending on the evaluation conditions, and the details of those conditions are not always fully disclosed. This lack of standardization can make it difficult for the open-source community to interpret and trust reported results. We therefore cite the official performance scores that each model's developers reported in their respective publications.
To illustrate how much evaluation scores can vary across reports, the evaluation appendix provides concrete examples of score discrepancies on major benchmarks.
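For readers who want to reproduce scores under a single controlled setup, here is a minimal sketch using EleutherAI's lm-evaluation-harness (assuming its v0.4+ `simple_evaluate` API; this is not the configuration behind the numbers reported below):

```python
import lm_eval

# Evaluate under an explicit, controlled configuration. Changing
# num_fewshot alone can shift scores noticeably, which is the point
# made above about non-standardized evaluation setups.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Motif-Technologies/Motif-2.6B,trust_remote_code=True",
    tasks=["hellaswag", "arc_challenge"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```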
Comparison with Mistral AI's Mistral 7B
The benchmarks and corresponding scores in the table below are taken directly from the Mistral 7B technical report.
Benchmark | Metric | Mistral 7B | Motif 2.6B | Improvement |
---|---|---|---|---|
MMLU | 5-shot | 60.1 | 57.93 | -3.61% |
HellaSwag | 0-shot | 81.3 | 61.35 | -24.54% |
WinoG | 0-shot | 75.3 | 59.91 | -20.44% |
PIQA | 0-shot | 83 | 75.95 | -8.49% |
Arc-e | 0-shot | 80 | 87.21 | +9.01% |
Arc-c | 0-shot | 55.5 | 74.2 | +33.69% |
NQ | 5-shot | 28.8 | 11.14 | -61.32% |
TriviaQA | 5-shot | 69.9 | 54.97 | -21.36% |
HumanEval | 0-shot | 30.5 | 68.3 | +123.93% |
MBPP | 3-shot | 47.5 | 60.3 | +26.95% |
MATH | 4-shot, maj@4 | 13.1 | 40.2* | +206.87% |
GSM8K | 8-shot, maj@8 | 52.2 | 80.21 | +53.66% |
Average | | | | +34.25% |
*: We report the 4-shot, maj@1 score rather than the 4-shot, maj@4 score.
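Throughout these tables, the Improvement column appears to be the simple relative difference against the baseline score; a quick check that reproduces the HumanEval row above:

```python
# Relative improvement as used in the comparison tables:
# (Motif score - baseline score) / baseline score * 100.
def improvement(baseline: float, motif: float) -> float:
    return (motif - baseline) / baseline * 100

# HumanEval row from the Mistral 7B table: 30.5 -> 68.3
print(f"{improvement(30.5, 68.3):+.2f}%")  # +123.93%, matching the table
```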
Comparison with Google's Gemma Family
Gemma 1 & 2
The benchmarks and corresponding scores in the table below are taken directly from the Gemma 2 technical report.
⚠️ Note
Although Gemma 2 2B is branded as a "2B" model, it actually has 2.6 billion parameters.
Benchmark | Metric | Gemma 1 2B | Gemma 1 7B | Gemma 2 2B | Gemma 2 9B | Motif 2.6B | Improvement (vs Gemma 1 2B) | Improvement (vs Gemma 1 7B) | Improvement (vs Gemma 2 2B) | Improvement (vs Gemma 2 9B) |
---|---|---|---|---|---|---|---|---|---|---|
MMLU | 5-shot | 42.3 | 64.4 | 52.2 | 71.3 | 57.93 | +36.95% | -10.05% | +10.98% | -18.75% |
ARC-C | 25-shot | 48.5 | 61.1 | 55.7 | 68.4 | 75.08 | +54.80% | +22.88% | +34.79% | +9.77% |
GSM8K | 5-shot | 15.1 | 51.8 | 24.3 | 68.6 | 75.13 | +397.55% | +45.04% | +309.18% | +9.52% |
AGIEval | 3-5-shot | 24.2 | 44.9 | 31.5 | 52.8 | - | - | - | - | - |
DROP | 3-shot, F1 | 48.5 | 56.3 | 51.2 | 69.4 | 29.33 | -39.53% | -47.90% | -42.71% | -57.74% |
BBH | 3-shot, CoT | 35.2 | 59 | 41.9 | 68.2 | 48.56 | +37.95% | -17.69% | +15.89% | -28.80% |
Winogrande | 5-shot | 66.8 | 79 | 71.3 | 80.6 | 67.09 | +0.43% | -15.08% | -5.90% | -16.76% |
HellaSwag | 10-shot | 71.7 | 82.3 | 72.9 | 81.9 | 69.89 | -2.52% | -15.08% | -4.13% | -14.66% |
MATH | 4-shot | 11.8 | 24.3 | 16 | 36.6 | 40.2 | +240.88% | +65.43% | +151.25% | +9.84% |
ARC-e | 0-shot | 73.2 | 81.5 | 80.6 | 88 | 87.21 | +19.14% | +7.01% | +8.20% | -0.90% |
PIQA | 0-shot | 77.3 | 81.2 | 78.4 | 81.7 | 75.95 | -1.75% | -6.47% | -3.13% | -7.04% |
SIQA | 0-shot | 49.7 | 51.8 | 51.9 | 53.4 | 61.97 | +24.69% | +19.63% | +19.40% | +16.05% |
Boolq | 0-shot | 69.4 | 83.2 | 72.7 | 84.2 | 67.76 | -2.36% | -18.56% | -6.80% | -19.52% |
TriviaQA | 5-shot | 53.2 | 63.4 | 60.4 | 76.6 | 54.97 | +3.33% | -13.30% | -8.99% | -28.24% |
Average | | | | | | | +90.79% | +3.44% | +46.17% | -13.45% |
Gemma 3
The benchmarks and corresponding scores in the table below are taken directly from the Gemma 3 technical report.
Benchmark | Metric | Gemma 3 1B | Gemma 3 4B | Motif 2.6B | Improvement (vs Gemma 3 1B) | Improvement (vs Gemma 3 4B) |
---|---|---|---|---|---|---|
HellaS | 10-shot | 62.3 | 77.2 | 69.89 | +12.18% | -9.47% |
BoolQ | 0-shot | 63.2 | 72.3 | 67.76 | +7.22% | -6.28% |
PIQA | 0-shot | 73.8 | 79.6 | 75.59 | +2.43% | -5.04% |
SIQA | 0-shot | 48.9 | 51.9 | 61.97 | +26.73% | +19.40% |
TQA | 5-shot | 39.8 | 65.8 | 54.97 | +38.12% | -16.46% |
NQ | 5-shot | 9.48 | 20 | 10.91 | +15.08% | -45.45% |
ARC-C | 25-shot | 38.4 | 56.2 | 75.08 | +95.52% | +33.59% |
ARC-E | 0-shot | 73 | 82.4 | 87.21 | +19.47% | +5.84% |
WinoG | 5-shot | 58.2 | 64.7 | 67.09 | +15.27% | +3.69% |
BBH | few-shot, CoT | 28.4 | 50.9 | 48.56 | +70.99% | -4.60% |
DROP | 1-shot, F1 | 42.4 | 60.1 | 29.33 | -30.83% | -51.20% |
MMLU | 5-shot | - | 59.6 | 57.93 | - | -2.80% |
MMLUpro | 5-shot, CoT | - | 29.2 | - | - | - |
AGIE | 3-5-shot | - | 42.1 | - | - | - |
MATH | 4-shot, CoT | - | 24.2 | 40.2 | - | +66.12% |
GSM8K | 8-shot, CoT | - | 38.4 | 80.21 | - | +108.88% |
GPQA Diamond | 5-shot, CoT | - | 15 | 31.81 | - | +112.07% |
MBPP | 3-shot | - | 46 | 60.3 | - | +31.09% |
HumanE | 0-shot | - | 36 | 68.3 | - | +89.72% |
Average | +22.04% | | | | | |
Comparison with Meta's Llama Family
Llama 3
The benchmarks and corresponding scores in the table below are taken directly from the Llama 3 technical report.
Benchmark | Metric | Llama 3 8B | Motif 2.6B | Improvement |
---|---|---|---|---|
MMLU | 5-shot | 69.4 | 57.93 | -16.53% |
MMLU | 0-shot, CoT | 73 | 57.95 | -20.62% |
MMLU-Pro | 5-shot, CoT | 48.3 | - | - |
IFEval | - | 80.4 | 74.02 | -7.94% |
HumanEval | 0-shot | 72.6 | 68.3 | -5.92% |
MBPP | 0-shot | 72.8 | 57.93 | -20.43% |
GSM8K | 8-shot, CoT | 84.5 | 80.21 | -5.08% |
MATH | 0-shot, CoT | 51.9 | 49.68 | -4.28% |
ARC Challenge | 0-shot | 83.4 | 74.2 | -11.03% |
GPQA | 0-shot, CoT | 32.8 | 18.53 | -43.51% |
Average | | | | -15.04% |
Llama 3.2
The benchmarks and corresponding scores in the table below are taken directly from the official Llama 3.2 blog post.
Benchmark | Metric | Llama 3.2 1B | Llama 3.2 3B | Motif 2.6B | Improvement (vs Llama 3.2 1B) | Improvement (vs Llama 3.2 3B) |
---|---|---|---|---|---|---|
MMLU | 0-shot | 49.3 | 63.4 | 57.6 | +16.75% | -9.21% |
Open-rewrite eval* | 0-shot, rougeL | 41.6 | 40.1 | - | - | - |
TLDR9+ | test, 1-shot, rougeL | 16.8 | 19 | - | - | - |
IFEval | - | 59.5 | 77.4 | 74.02 | +24.40% | -4.37% |
GSM8K | 8-shot, CoT | 44.4 | 77.7 | 80.21 | +80.65% | +3.23% |
MATH | 0-shot, CoT | 30.6 | 48 | 49.68 | +62.35% | +3.50% |
ARC Challenge | 0-shot | 59.4 | 78.6 | 74.2 | +24.92% | -5.60% |
GPQA | 0-shot | 27.2 | 32.8 | 25.45 | -6.43% | -22.41% |
Hellaswag | 0-shot | 41.2 | 69.8 | 61.35 | +48.91% | -12.11% |
Average | +41.82% | | | | | |
Comparison with Microsoft's Phi Family
The benchmarks and corresponding scores in the table below are taken directly from the Phi-3 technical report.
Benchmark | Metric | Phi-3 3.8B | Phi-3 7B | Phi-2 2.7B | Motif 2.6B | Improvement (vs Phi-3 3.8B) | Improvement (vs Phi-3 7B) | Improvement (vs Phi-2 2.7B) |
---|---|---|---|---|---|---|---|---|
MMLU | 5-shot | 68.8 | 75.7 | 56.3 | 57.93 | -15.80% | -23.47% | +2.90% |
HellaSwag | 5-shot | 76.7 | 77 | 53.6 | 68.97 | -10.08% | -10.43% | +28.68% |
ANLI | 7-shot | 52.8 | 58.1 | 42.5 | 47.99 | -9.11% | -17.40% | +12.92% |
GSM-8K | 8-shot, CoT | 82.5 | 89.6 | 61.1 | 80.21 | -2.78% | -10.48% | +31.28% |
MATH | 0-shot, CoT | 41.3 | 34.6 | - | 49.68 | +20.29% | +43.58% | - |
MedQA | 2-shot | 53.8 | 65.4 | 40.9 | 42.1 | -21.75% | -35.63% | +2.93% |
AGIEval | 0-shot | 37.5 | 45.1 | 29.8 | - | - | - | - |
TriviaQA | 5-shot | 64 | 58.1 | 45.2 | 54.97 | -14.11% | -5.39% | +21.62% |
Arc-C | 10-shot | 84.9 | 90.7 | 75.9 | 75.17 | -11.46% | -17.12% | -0.96% |
Arc-E | 10-shot | 94.6 | 97 | 88.5 | 88.64 | -6.30% | -8.62% | +0.16% |
PIQA | 5-shot | 84.2 | 86.9 | 60.2 | 78.29 | -7.02% | -9.91% | +30.05% |
SociQA | 5-shot | 76.6 | 79.2 | 68.3 | 66.73 | -12.89% | -15.74% | -2.30% |
BigBench-Hard | 3-shot, CoT | 71.7 | 79.1 | 59.4 | 48.56 | -32.27% | -38.61% | -18.25% |
WinoGrande | 5-shot | 70.8 | 81.5 | 54.7 | 67.09 | -5.24% | -17.68% | +22.65% |
OpenBookQA | 10-shot | 83.2 | 88 | 73.6 | 87.8 | +5.53% | -0.23% | +19.29% |
BoolQ | 2-shot | 77.2 | 84.8 | - | 70.7 | -8.42% | -16.63% | - |
CommonSenseQA | 10-shot | 80.2 | 80 | 69.3 | 71.25 | -11.16% | -10.94% | +2.81% |
TruthfulQA | 10-shot | 65 | 70.2 | - | 52.07 | -19.89% | -25.83% | - |
HumanEval | 0-shot | 58.5 | 61 | 59 | 68.29 | +16.74% | +11.95% | +15.75% |
MBPP | 3-shot | 70 | 71.7 | 60.6 | 60.3 | -13.86% | -15.90% | -0.50% |
GPQA | 2-shot, CoT | 32.8 | 34.3 | - | 23.44 | -28.54% | -31.66% | - |
MT Bench | 2R. Avg. | 8.38 | 8.7 | - | 6.77 | -19.21% | -22.18% | - |
Average | | | | | | -9.87% | -13.25% | |
Evaluation Appendix
In the comparisons above, based on the benchmark scores reported in the original technical reports, Motif 2.6B shows average improvements of -15.04% against Llama 3 8B and -13.45% against Gemma 2 9B. However, against the benchmarks and scores reported in the Qwen 2.5 technical report, Motif 2.6B's average improvement is +19.27% relative to Llama 3 8B and +1.68% relative to Gemma 2 9B. Details are in the table below.
Comparison with Llama 3 8B and Gemma 2 9B based on Qwen2.5 technical report scores
Benchmark | Metric | Llama 3 8B | Gemma 2 9B | Motif 2.6B | Improvement (vs Llama 3 8B) | Improvement (vs Gemma 2 9B) |
---|---|---|---|---|---|---|
MMLU | 5-shot | 66.6 | 71.3 | 57.93 | -13.02% | -18.75% |
MMLU-pro | 5-shot | 35.4 | 44.7 | 28.4 | -19.77% | -36.47% |
MMLU-redux | 5-shot | 61.6 | 67.9 | 59.54 | -3.34% | -12.31% |
BBH | 3-shot | 57.7 | 68.2 | 39.28 | -31.92% | -42.40% |
ARC-C | 25-shot | 59.3 | 68.2 | 75.08 | +26.61% | +10.09% |
TruthfulQA | 0-shot | 44 | 45.3 | 41.55 | -5.56% | -8.27% |
Winogrande | 5-shot | 77.4 | 79.5 | 67.09 | -13.32% | -15.61% |
HellaSwag | 10-shot | 82.1 | 81.9 | 69.88 | -14.88% | -14.68% |
GPQA | 5-shot | 25.8 | 32.8 | 29.24 | +13.33% | -10.85% |
TheoremQA | 5-shot | 22.1 | 28.9 | - | - | - |
MATH | 4-shot | 20.5 | 37.7 | 40.2 | +96.10% | +6.63% |
MMLU-stem | 5-shot | 55.3 | 65.1 | 52.9 | -4.34% | -18.74% |
GSM8K | 4-shot | 55.3 | 70.7 | 75.2 | +35.99% | +6.36% |
HumanEval | 0-shot | 33.5 | 37.8 | 68.3 | +103.88% | +80.69% |
HumanEval+ | 0-shot | 29.3 | 30.5 | 62.2 | +112.29% | +103.93% |
MBPP | 0-shot | 53.9 | 62.2 | 60.3 | +11.87% | -3.05% |
MBPP+ | 0-shot | 44.4 | 50.6 | 50.8 | +14.41% | +0.40% |
MultiPL-E | 0-shot | 22.6 | 34.9 | - | - | - |
Average | | | | | +19.27% | +1.68% |
📄 License
This project is released under the motif-license.



