TinyLlama_v1.1 open-source language model: suited to resource-constrained scenarios, free and easy to deploy

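TinyLlama v1.1

Developed by TinyLlama

TinyLlama is a small language model with 1.1 billion parameters. It uses the same architecture and tokenizer as Llama 2 and is suited to resource-constrained applications.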

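Large language model

Transformers

Language: English | License: Apache-2.0 | #Lightweight Llama #Multi-domain adaptation #Chinese-English bilingual optimization

Downloads: 42.11k

Released: 3/9/2024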

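Model overview

TinyLlama is a lightweight language model with 1.1 billion parameters, designed to run efficiently where compute and memory are limited. Across its variants it covers general text generation, math and code, and Chinese understanding.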

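Model features

Lightweight design

Only 1.1 billion parameters, suited to resource-constrained environments.

Multiple versions

Available in three variants (general, math and code, and Chinese) to meet different needs.

Efficient training

Trained in three stages (basic pretraining, domain-specific continual pretraining, and cooldown) to improve model performance.

Compatibility

Uses the same architecture and tokenizer as Llama 2, so it can be dropped into open-source projects built on Llama.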

Model capabilities

Text generation

Mathematical reasoning

Code generation

Chinese understanding

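Use cases

General text processing

Text generation

Generate coherent text.

Math and code

Math problem solving

Solve mathematical reasoning problems.

Code generation

Generate code snippets.

Chinese processing

Chinese text understanding

Understand and generate Chinese text.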

🚀 TinyLlama-1.1B-v1.1

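TinyLlama-1.1B-v1.1 uses exactly the same architecture and tokenizer as Llama 2, so it can be integrated seamlessly into many open-source projects built on Llama. With only 1.1B parameters, it is also compact enough for applications with a limited compute and memory budget.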

Code repository: github.com/jzhang38/TinyLlama
Technical report: arxiv.org/pdf/2401.02385
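
As a rough illustration of what 1.1B parameters means for memory, the sketch below estimates the size of the weights alone at a few common precisions. It assumes exactly 1.1e9 parameters (the rounded figure quoted above); the function name and the set of precisions are illustrative, and activations, the KV cache, and framework overhead are not counted.

```python
# Weights-only memory estimate for a 1.1B-parameter model.
# Assumption: exactly 1.1e9 parameters (the rounded figure from the text).
N_PARAMS = 1.1e9

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gib(n_params: float, dtype: str) -> float:
    """Return the size of the weights in GiB for the given precision."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype:>8}: {weight_memory_gib(N_PARAMS, dtype):.2f} GiB")
```

At float16, the weights come to roughly 2 GiB under these assumptions, which is why the model fits on modest GPUs.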

🚀 Quick start

You need transformers>=4.31. See the TinyLlama GitHub page for more information.

from transformers import AutoTokenizer
import transformers
import torch

# Hugging Face model ID of the standard (general-purpose) variant
model = "TinyLlama/TinyLlama_v1.1"
tokenizer = AutoTokenizer.from_pretrained(model)

# device_map="auto" requires the accelerate package to be installed
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Sample one continuation of the prompt
sequences = pipeline(
    'The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens. With some proper optimization, we can achieve this within a span of "just" 90 days using 16 A100-40G GPUs 🚀🚀. The training has started on 2023-09-01.',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    repetition_penalty=1.5,
    eos_token_id=tokenizer.eos_token_id,
    max_length=500,  # total length in tokens, prompt included
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
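
The `top_k=10` argument restricts sampling to the 10 most likely next tokens. The sketch below illustrates that idea on a toy distribution in plain Python. It is not the transformers implementation, and the function name and toy logits are made up for illustration.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample an index from the k highest-scoring logits.

    Illustrative only: keeps the k largest logits, applies a softmax
    over them, and draws one index from the resulting distribution.
    """
    # Indices of the k largest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept logits (subtract the max for stability)
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # hypothetical scores for 5 tokens
    rng = random.Random(0)
    draws = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(1000)]
    # Only the two highest-scoring indices (0 and 1) can ever be drawn
    print(sorted(set(draws)))
```

A smaller `k` makes the output more conservative, while a larger `k` admits less likely tokens.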

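✨ Key features

Uses the same architecture and tokenizer as Llama 2, so it works directly in many open-source projects built on Llama.
Has only 1.1B parameters, making it compact enough for applications with a limited compute and memory budget.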

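📦 Installation

You need transformers>=4.31, which can be installed with the following command (the quotes keep the shell from treating > as an output redirect):

pip install "transformers>=4.31"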
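
To confirm that the installed version meets the requirement before loading the model, a small check along these lines can help. It is a minimal sketch: the helper names are hypothetical, and the parsing only looks at the leading major and minor numbers of the version string.

```python
import re
from importlib import metadata

MIN_VERSION = (4, 31)  # from the requirement transformers>=4.31


def major_minor(version: str) -> tuple:
    """Extract (major, minor) from a version string such as '4.31.0.dev0'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def meets_requirement(version: str, minimum: tuple = MIN_VERSION) -> bool:
    """Return True if the version is at least the required major.minor."""
    return major_minor(version) >= minimum


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "OK" if meets_requirement(installed) else "too old"
        print(f"transformers {installed}: {status}")
```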

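📚 Documentation

Overview

In this project, rather than training only a single TinyLlama model, we first train TinyLlama on a corpus of 1.5 trillion tokens to build foundational language capabilities. We then turn this model into three different models through continual pretraining with three distinct data sampling schemes. The figure below illustrates the process.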

[Figure: Overview]

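Pretraining

Because of some issues (bug1, bug2), we retrained TinyLlama to provide a better model. We trained the model on 2T tokens and divided pretraining into three stages:

Basic pretraining: In this initial stage, we train the model on the slimpajama corpus only, to develop its commonsense reasoning capabilities. The model is trained on 1.5T tokens during this stage. Because our cluster has 4 A100-40G GPUs per node and we shard model weights only within a node, the batch size can only be set to approximately 1.8M.
Domain-specific continual pretraining: In this stage, we introduce three kinds of corpora: slimpajama (the same as in the first stage), Math&Code (starcoder and proof pile), and Chinese (skypile). This lets us develop three variant models with specialized capabilities. For roughly the first 6B tokens of this stage, we linearly increase the sampling proportion of the domain-specific corpora (excluding Slimpajama, which is unchanged from the first stage). This warm-up of the sampling proportions gradually shifts the distribution of the pretraining data, which keeps training more stable. After the ramp, we continue pretraining with a fixed sampling strategy until reaching about 1.85T tokens.
Cooldown: A cooldown phase at the end of pretraining has become a key technique for better model convergence. However, because we had already used a cosine learning rate schedule from the start, changing the learning rate for the cooldown as MiniCPM or deepseek do is difficult. We therefore cool down by adjusting the batch size instead: during the cooldown stage, we increase the batch size from 1.8M to 7.2M while keeping the original cosine learning rate schedule.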
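
The three stages can be summarized as a lookup from tokens seen to stage and batch size. The sketch below uses the approximate boundaries given above (1.5T, 1.85T, 2T) and the two batch sizes (1.8M and 7.2M). Treating the batch sizes as token counts, using 1.8M for the second stage, and all function and constant names are assumptions made for illustration; this is not the project's training code.

```python
# Approximate stage boundaries, in tokens seen so far (from the text above).
BASIC_END = 1.5e12        # end of basic pretraining
CONTINUAL_END = 1.85e12   # end of domain-specific continual pretraining
TOTAL_TOKENS = 2.0e12     # end of the cooldown stage

# Assumption: batch sizes are measured in tokens.
BASE_BATCH = 1.8e6
COOLDOWN_BATCH = 7.2e6


def stage_and_batch(tokens_seen: float):
    """Return (stage name, batch size) for a given number of tokens seen."""
    if not 0 <= tokens_seen <= TOTAL_TOKENS:
        raise ValueError("tokens_seen must be within the 2T-token budget")
    if tokens_seen < BASIC_END:
        return "basic", BASE_BATCH
    if tokens_seen < CONTINUAL_END:
        return "continual", BASE_BATCH
    return "cooldown", COOLDOWN_BATCH


def optimizer_steps(start: float, end: float, batch: float) -> int:
    """Approximate number of optimizer steps needed to cover a token range."""
    return round((end - start) / batch)


if __name__ == "__main__":
    print(stage_and_batch(1.0e12))
    print(stage_and_batch(1.9e12))
    print("basic steps    ~", optimizer_steps(0, BASIC_END, BASE_BATCH))
    print("cooldown steps ~", optimizer_steps(CONTINUAL_END, TOTAL_TOKENS, COOLDOWN_BATCH))
```

Because the learning rate schedule is unchanged, the fourfold larger batch is what sets the cooldown apart: each optimizer step averages gradients over four times as much data.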

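TinyLlama model family

After this extensive and detailed pretraining process, we are releasing three specialized versions of the model:

TinyLlama_v1.1: The standard version, for general purposes.
TinyLlama_v1.1_Math&Code: Better at math and code.
TinyLlama_v1.1_Chinese: Good understanding of Chinese.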
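
For scripts that switch between the three variants, a small mapping from use case to model ID keeps the choice in one place. The ID for the standard model is the one used in the quick-start code. The other two IDs are an assumption: they combine the same `TinyLlama/` namespace with the variant names used in the data tables of this page, and should be verified on the Hugging Face Hub before use. The helper name is hypothetical.

```python
# Mapping from use case to model ID.
# Assumption: the math/code and Chinese variants live under the same
# "TinyLlama/" namespace as the standard model; verify before relying on it.
VARIANTS = {
    "general": "TinyLlama/TinyLlama_v1.1",
    "math_code": "TinyLlama/TinyLlama_v1.1_math_code",
    "chinese": "TinyLlama/TinyLlama_v1.1_chinese",
}


def model_id_for(use_case: str) -> str:
    """Return the model ID for a use case, or raise a descriptive error."""
    try:
        return VARIANTS[use_case]
    except KeyError:
        options = ", ".join(sorted(VARIANTS))
        raise ValueError(
            f"unknown use case {use_case!r}; expected one of: {options}"
        ) from None


if __name__ == "__main__":
    print(model_id_for("math_code"))
```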

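Data

The data distribution at each stage, as the sampling proportion (%) of each corpus:

TinyLlama_v1.1

Corpus	Basic pretraining	Domain-specific continual pretraining	Cooldown
Slimpajama	100.0	100.0	100.0

TinyLlama_v1.1_math_code

Corpus	Basic pretraining	Domain-specific continual pretraining	Cooldown
Slimpajama	100.0	75.0	75.0
starcoder	-	15.0	15.0
proof_pile	-	10.0	10.0

TinyLlama_v1.1_chinese

Corpus	Basic pretraining	Domain-specific continual pretraining	Cooldown
Slimpajama	100.0	50.0	50.0
skypile	-	50.0	50.0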
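
The tables above, combined with the linear ramp over roughly the first 6B tokens of the second stage, can be expressed as a function from tokens seen in that stage to sampling weights. The sketch assumes that the domain corpora ramp up from zero and that Slimpajama takes whatever share they do not. The original text does not spell out either detail, and all names here are illustrative.

```python
import random

# Target sampling proportions (%) for stages two and three, from the tables above.
TARGET_MIX = {
    "TinyLlama_v1.1": {"slimpajama": 100.0},
    "TinyLlama_v1.1_math_code": {"slimpajama": 75.0, "starcoder": 15.0, "proof_pile": 10.0},
    "TinyLlama_v1.1_chinese": {"slimpajama": 50.0, "skypile": 50.0},
}

RAMP_TOKENS = 6e9  # approximate length of the sampling warm-up


def sampling_mix(variant: str, tokens_into_stage: float) -> dict:
    """Return sampling proportions (%) after a number of tokens in stage two.

    Assumptions: domain corpora ramp linearly from 0 to their target share,
    and Slimpajama receives the remainder so the mix always sums to 100.
    """
    target = TARGET_MIX[variant]
    progress = min(max(tokens_into_stage / RAMP_TOKENS, 0.0), 1.0)
    mix = {name: share * progress for name, share in target.items() if name != "slimpajama"}
    mix["slimpajama"] = 100.0 - sum(mix.values())
    return mix


def sample_corpus(mix: dict, rng: random.Random) -> str:
    """Pick the corpus for the next training sequence according to the mix."""
    names = list(mix)
    return rng.choices(names, weights=[mix[n] for n in names], k=1)[0]


if __name__ == "__main__":
    for tokens in (0, 3e9, 6e9, 100e9):
        print(f"{tokens:.0e}", sampling_mix("TinyLlama_v1.1_math_code", tokens))
    print(sample_corpus(sampling_mix("TinyLlama_v1.1_chinese", 6e9), random.Random(0)))
```

Halfway through the ramp, for example, the math and code variant would draw 7.5% from starcoder and 5% from proof_pile under these assumptions.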

Evaluation

Model	Pretraining tokens	HellaSwag	OBQA	WinoGrande	ARC_c	ARC_e	BoolQ	PIQA	Average
Pythia-1.0B	300B	47.16	31.40	53.43	27.05	48.99	60.83	69.21	48.30
TinyLlama-1.1B-intermediate-step-1431k-3T	3T	59.20	36.00	59.12	30.12	55.25	57.83	73.29	52.99
TinyLlama-1.1B-v1.1	2T	61.47	36.80	59.43	32.68	55.47	55.99	73.56	53.63
TinyLlama-1.1B-v1_math_code	2T	60.80	36.40	60.22	33.87	55.20	57.09	72.69	53.75
TinyLlama-1.1B-v1.1_chinese	2T	58.23	35.20	59.27	31.40	55.35	61.41	73.01	53.41
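
The average column can be checked by taking the unweighted mean of the seven benchmark scores in each row. That the column is an unweighted mean is an assumption, though it reproduces four of the five listed values to two decimals; for the 3T row the recomputed mean is 52.97 rather than the listed 52.99.

```python
# Benchmark scores per model, in the column order of the table above:
# HellaSwag, OBQA, WinoGrande, ARC_c, ARC_e, BoolQ, PIQA
SCORES = {
    "Pythia-1.0B": [47.16, 31.40, 53.43, 27.05, 48.99, 60.83, 69.21],
    "TinyLlama-1.1B-intermediate-step-1431k-3T": [59.20, 36.00, 59.12, 30.12, 55.25, 57.83, 73.29],
    "TinyLlama-1.1B-v1.1": [61.47, 36.80, 59.43, 32.68, 55.47, 55.99, 73.56],
    "TinyLlama-1.1B-v1_math_code": [60.80, 36.40, 60.22, 33.87, 55.20, 57.09, 72.69],
    "TinyLlama-1.1B-v1.1_chinese": [58.23, 35.20, 59.27, 31.40, 55.35, 61.41, 73.01],
}


def mean_score(scores):
    """Unweighted mean of a list of benchmark scores."""
    return sum(scores) / len(scores)


if __name__ == "__main__":
    for model, scores in SCORES.items():
        print(f"{model}: {mean_score(scores):.2f}")
```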