finance-Llama3-8B open-source finance model: free to deploy, with finance-task performance comparable to or better than Llama3-70B

Finance Llama3 8B

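Developed by instruction-pretrain

A finance-domain model built on Llama3-8B. It uses the Instruction Pre-Training framework to strengthen domain adaptation, and on finance tasks it is comparable to or even outperforms Llama3-70B.

Large language model

Transformers

English #finance-domain-adaptation #instruction-augmented-pretraining #multitask-learning

Downloads: 1,200

Released: 6/18/2024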

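Model overview

The model is trained with the Instruction Pre-Training framework: raw corpora are augmented with synthesized instruction-response pairs and then used for continual pre-training, with the goal of improving performance on finance-domain tasks.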

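Model features

Instruction Pre-Training framework

Augments massive raw corpora with synthesized instruction-response pairs to improve performance on domain tasks.

Domain adaptation

After continual pre-training on the finance domain, the 8B-parameter model is comparable to or even outperforms the original 70B model on finance tasks.

Multi-stage training

Supports both pre-training from scratch and continual pre-training; in both settings it outperforms vanilla pre-training.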

Model capabilities

Financial text understanding

Financial question answering

Financial data analysis

Financial report generation

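Use cases

Financial analysis

Securities information lookup

Parses the securities registration information of listed companies and answers complex queries about debt securities

Can correctly identify which trading symbol corresponds to which type of security

Financial report analysis

Understands financial reports and extracts their key figures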

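🚀 Instruction Pre-Training: Language Models are Supervised Multitask Learners (EMNLP 2024)

This project contains the finance model developed from Llama3-8B in our paper Instruction Pre-Training: Language Models are Supervised Multitask Learners. We explore supervised multitask pre-training by proposing Instruction Pre-Training, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train language models. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. Instruction Pre-Training outperforms vanilla pre-training both in general pre-training from scratch and in domain-adaptive continual pre-training. In pre-training from scratch, Instruction Pre-Training not only improves the pre-trained base models but also benefits more from further instruction tuning. In continual pre-training, Instruction Pre-Training enables Llama3-8B to be comparable to or even outperform Llama3-70B.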

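✨ Key features

Proposes the Instruction Pre-Training framework, which scalably augments raw corpora with instruction-response pairs.
The instruction-response pairs are generated by an efficient instruction synthesizer.
Outperforms vanilla pre-training in both general pre-training and domain-adaptive continual pre-training.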

📦 Installation

To evaluate any Huggingface language model on domain-specific tasks, set up the dependencies as follows:

git clone https://github.com/microsoft/LMOps
cd LMOps/adaptllm
pip install -r requirements.txt

💻 Usage examples

Basic usage

Chat with the finance-Llama3-8B model:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("instruction-pretrain/finance-Llama3-8B")
tokenizer = AutoTokenizer.from_pretrained("instruction-pretrain/finance-Llama3-8B")

# Put your input here, NO prompt template is required
user_input = '''Use this fact to answer the question: Title of each class Trading Symbol(s) Name of each exchange on which registered
Common Stock, Par Value $.01 Per Share MMM New York Stock Exchange
MMM Chicago Stock Exchange, Inc.
1.500% Notes due 2026 MMM26 New York Stock Exchange
1.750% Notes due 2030 MMM30 New York Stock Exchange
1.500% Notes due 2031 MMM31 New York Stock Exchange

Which debt securities are registered to trade on a national securities exchange under 3M's name as of Q2 of 2023?'''

inputs = tokenizer(user_input, return_tensors="pt", add_special_tokens=True).input_ids.to(model.device)
outputs = model.generate(input_ids=inputs, max_new_tokens=400)[0]

answer_start = int(inputs.shape[-1])
pred = tokenizer.decode(outputs[answer_start:], skip_special_tokens=True)

print(pred)
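
The input in the example above follows a simple pattern: a context passage introduced by "Use this fact to answer the question:", a blank line, and then the question, with no prompt template around it. The sketch below wraps that pattern and the prompt-stripping step in two small helpers. The names `build_input` and `strip_prompt` are hypothetical and are not part of the released code.

```python
from typing import List, Sequence


def build_input(fact: str, question: str) -> str:
    """Join a context passage and a question in the style of the example above.

    No chat or prompt template is applied; the model receives plain text.
    """
    return f"Use this fact to answer the question: {fact.strip()}\n\n{question.strip()}"


def strip_prompt(output_ids: Sequence[int], prompt_length: int) -> List[int]:
    """Drop the echoed prompt tokens so that only newly generated ids remain."""
    return list(output_ids[prompt_length:])
```

With these helpers, `user_input` could be produced by `build_input(table_text, question)`, and the slice `outputs[answer_start:]` corresponds to `strip_prompt(outputs, answer_start)`.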

Advanced usage

Evaluate any Huggingface language model on domain-specific tasks (💡 new!):

# Select the domain from ['biomedicine', 'finance']
DOMAIN='finance'
  
# Specify any Huggingface LM name (Not applicable to models requiring specific prompt templates)
MODEL='instruction-pretrain/finance-Llama3-8B'
  
# Model parallelization:
# - Set MODEL_PARALLEL=False if the model fits on a single GPU. 
#   We observe that LMs smaller than 10B always meet this requirement.
# - Set MODEL_PARALLEL=True if the model is too large and encounters OOM on a single GPU.
MODEL_PARALLEL=False
  
# Choose the number of GPUs from [1, 2, 4, 8]
N_GPU=1
  
# Whether to add a BOS token at the beginning of the prompt input:
# - Set to False for AdaptLLM.
# - Set to True for instruction-pretrain models.
# If unsure, we recommend setting it to False, as this is suitable for most LMs.
add_bos_token=True

# Run the evaluation script
bash scripts/inference.sh ${DOMAIN} ${MODEL} ${add_bos_token} ${MODEL_PARALLEL} ${N_GPU}
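
When the evaluation is launched from Python rather than from a shell, the same five positional arguments can be assembled and checked before the script runs. This is a minimal sketch, assuming the argument order shown in the command above; the function name `build_inference_command` and the validation rules are illustrative and do not come from the LMOps repository.

```python
from typing import List

# Allowed values copied from the comments in the shell snippet above.
VALID_DOMAINS = ("biomedicine", "finance")
VALID_GPU_COUNTS = (1, 2, 4, 8)


def build_inference_command(
    domain: str,
    model: str,
    add_bos_token: bool,
    model_parallel: bool,
    n_gpu: int,
) -> List[str]:
    """Return the argv list for scripts/inference.sh (hypothetical helper)."""
    if domain not in VALID_DOMAINS:
        raise ValueError(f"domain must be one of {VALID_DOMAINS}, got {domain!r}")
    if n_gpu not in VALID_GPU_COUNTS:
        raise ValueError(f"n_gpu must be one of {VALID_GPU_COUNTS}, got {n_gpu!r}")
    # Booleans are passed as the literal strings "True" / "False",
    # matching the values used in the shell variables.
    return [
        "bash",
        "scripts/inference.sh",
        domain,
        model,
        str(add_bos_token),
        str(model_parallel),
        str(n_gpu),
    ]
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` from inside the `LMOps/adaptllm` directory.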

📚 Documentation

Resources

🤗 We share our data and models together with usage examples; feel free to open a discussion on this page! 🤗

Thanks to the demo [davanstrien/instruction-synthesizer](https://huggingface.co/spaces/davanstrien/instruction-synthesizer) for implementing our approach.
Context-based instruction synthesizer: [instruction-synthesizer](https://huggingface.co/instruction-pretrain/instruction-synthesizer)
Fine-tuning data for the synthesizer: [ft-instruction-synthesizer-collection](https://huggingface.co/datasets/instruction-pretrain/ft-instruction-synthesizer-collection)
General models pre-trained from scratch (on 100B tokens):
- [InstructLM-500M](https://huggingface.co/instruction-pretrain/InstructLM-500M)
- [InstructLM-1.3B](https://huggingface.co/instruction-pretrain/InstructLM-1.3B)
Domain-specific models pre-trained from Llama3-8B:
- [Finance-Llama3-8B](https://huggingface.co/instruction-pretrain/finance-Llama3-8B)
- [Biomedicine-Llama3-8B](https://huggingface.co/instruction-pretrain/medicine-Llama3-8B)
General instruction-augmented corpora: [general-instruction-augmented-corpora](https://huggingface.co/datasets/instruction-pretrain/general-instruction-augmented-corpora)
Domain-specific instruction-augmented corpora (no finance data, to avoid ethical issues): [medicine-instruction-augmented-corpora](https://huggingface.co/datasets/instruction-pretrain/medicine-instruction-augmented-corpora)

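Domain-adaptive continual pre-training

Following [AdaptLLM](https://huggingface.co/AdaptLLM/finance-chat), we augment the domain-specific raw corpora with instruction-response pairs generated by our [context-based instruction synthesizer](https://huggingface.co/instruction-pretrain/instruction-synthesizer).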
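
The augmentation step can be pictured as appending synthesized pairs to each raw text before it enters the pre-training corpus. The sketch below is purely illustrative: the "Question:/Answer:" layout is a placeholder and is not the authors' template, `toy_synthesizer` stands in for the real synthesizer model, and both function names are hypothetical.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (instruction, response)


def augment_text(raw_text: str, synthesize: Callable[[str], List[Pair]]) -> str:
    """Append synthesized instruction-response pairs to a raw text.

    The "Question:/Answer:" layout is a placeholder, NOT the authors' template.
    """
    pairs = synthesize(raw_text)
    rendered = [f"Question: {q}\nAnswer: {a}" for q, a in pairs]
    return "\n\n".join([raw_text.strip(), *rendered])


def toy_synthesizer(text: str) -> List[Pair]:
    """Stand-in for the real synthesizer: returns one fixed pair per text."""
    first_sentence = text.strip().split(".")[0] + "."
    return [("What is the passage about?", first_sentence)]


if __name__ == "__main__":
    print(augment_text("Bonds pay coupons. They mature.", toy_synthesizer))
```

In a real pipeline the `synthesize` callable would wrap the released instruction synthesizer, and the rendering would follow the templates documented on its model page.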

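FAQ

Q1: Do you use the official Llama3 instruction prompt for pre-training?
No. The provided Llama3 instruction prompt is designed for the [instruction-tuned model](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), whereas our continual pre-training is conducted on the [pre-trained base model](https://huggingface.co/meta-llama/Meta-Llama-3-8B), which only needs the BOS (<|begin_of_text|>) and EOS (<|end_of_text|>) tokens.

Q2: For the general instructions from OpenOrca, do you concatenate each instruction with its output using '\n'?
No. As mentioned in the pre-training suggestions, we use a simple whitespace to concatenate each question with its response for the general instruction data from OpenOrca. OpenOrca's data is already templated with diverse natural-language templates (such as those containing \n), so a whitespace is sufficient. Note that when using our templated instruction-augmented texts, you do not need to add any concatenation.

Q3: What about the system prompts in OpenOrca?
We simply discard the system prompts.

To put it all together, the text before tokenization looks like this: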

general_instruction_response_text = "<|begin_of_text|>{question} {response}<|end_of_text|>"

instruction_augmented_text = "<|begin_of_text|>{instruction augmented text}<|end_of_text|>"

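Then, for tokenization, you do not need to add the BOS and EOS token ids, because both tokens are already present in the text. The tokenization code looks like this: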

text_ids = tokenizer(text, add_special_tokens=False, **kwargs).input_ids
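
The two text formats and the tokenization call above can be combined into a few small helpers. This is a minimal sketch: the function names are hypothetical, and `tokenizer` is assumed to be a Hugging Face tokenizer object such as the one loaded in the basic usage example.

```python
from typing import List

BOS = "<|begin_of_text|>"
EOS = "<|end_of_text|>"


def format_general_instruction(question: str, response: str) -> str:
    """General instruction data: question and response joined by one space."""
    return f"{BOS}{question} {response}{EOS}"


def format_instruction_augmented(text: str) -> str:
    """Instruction-augmented text: already templated, so no joiner is added."""
    return f"{BOS}{text}{EOS}"


def tokenize(tokenizer, text: str) -> List[int]:
    """Tokenize without special tokens, since BOS/EOS are already in the text."""
    return tokenizer(text, add_special_tokens=False).input_ids
```

Writing BOS and EOS into the string and disabling `add_special_tokens` avoids ending up with a duplicated BOS token at the start of each sequence.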

📄 License

This project is released under the Llama3 license.

📖 Citation

If you find our work helpful, please cite us: Instruction Pre-Training (EMNLP 2024)

@article{cheng2024instruction,
  title={Instruction Pre-Training: Language Models are Supervised Multitask Learners},
  author={Cheng, Daixuan and Gu, Yuxian and Huang, Shaohan and Bi, Junyu and Huang, Minlie and Wei, Furu},
  journal={arXiv preprint arXiv:2406.14491},
  year={2024}
}

Adapting Large Language Models to Domains (ICLR 2024)

@inproceedings{
cheng2024adapting,
title={Adapting Large Language Models via Reading Comprehension},
author={Daixuan Cheng and Shaohan Huang and Furu Wei},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=y886UXPEZ0}
}

🔧 Technical details

Model type: a finance model developed from Llama3-8B.
Training data: includes datasets such as Open-Orca/OpenOrca, GAIR/lima, and WizardLM/WizardLM_evol_instruct_V2_196k.

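🌟 Changelog

2024/11/30: Released the multimodal version of the instruction synthesizer: [Visual Instruction Synthesizer](https://huggingface.co/AdaptLLM/Adapt-MLLM-to-Domains)
2024/9/20: Our paper was accepted to the EMNLP 2024 main conference 🎉
2024/9/11: Updated the [FAQ on continual pre-training from Llama3](https://huggingface.co/instruction-pretrain/instruction-synthesizer)
2024/8/29: Updated the [guidelines for evaluating any 🤗Huggingface model on domain-specific tasks](https://huggingface.co/instruction-pretrain/medicine-Llama3-8B)
2024/7/31: Updated the pre-training suggestions in the Advanced usage section of the [instruction synthesizer](https://huggingface.co/instruction-pretrain/instruction-synthesizer)
2024/7/15: We scaled the pre-training tokens from 100B to 250B, bringing the number of synthesized instruction-response pairs to 500M.
[Figure: performance trend on downstream tasks throughout the pre-training process]
2024/6/21: Released the [paper](https://huggingface.co/papers/2406.14491), [code](https://github.com/microsoft/LMOps), and [resources](https://huggingface.co/instruction-pretrain)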