finance-chat-GGUF
This repo contains GGUF format model files for finance-chat, offering high-performance text generation in the finance domain.
🚀 Quick Start
This repository provides GGUF format model files for [finance-chat](https://huggingface.co/AdaptLLM/finance-chat). To get started, choose a suitable way to download and run the model according to your needs.
✨ Features
- Multiple Client Support: compatible with a wide range of clients and libraries, such as llama.cpp, text-generation-webui, and Ollama.
- Quantization Options: offers various quantization methods to balance model size against performance.
- Easy Download: files can be downloaded automatically by some clients or manually with the huggingface-hub library.
📦 Installation
Downloading GGUF Files
⚠️ Important Note
You almost never want to clone the entire repo! Multiple quantisation formats are provided, and most users only need to pick and download a single file.
Some clients/libraries, such as LM Studio, LoLLMS Web UI, and Faraday.dev, will automatically download models for you.
In text-generation-webui
Under Download Model, enter the model repo andrijdavid/finance-chat-GGUF and, below it, a specific filename to download, such as finance-chat-f16.gguf, then click Download.
On the command line
First, install the huggingface-hub Python library:
pip3 install huggingface-hub
Then download an individual model file to the current directory:
huggingface-cli download andrijdavid/finance-chat-GGUF finance-chat-f16.gguf --local-dir . --local-dir-use-symlinks False
You can also download multiple files at once with a pattern:
huggingface-cli download andrijdavid/finance-chat-GGUF --local-dir . --local-dir-use-symlinks False --include='*Q4_K*gguf'
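If you prefer to stay in Python, the same single-file download can be done with the huggingface_hub API. A minimal sketch (the repo and filename match the CLI examples above):

```python
from huggingface_hub import hf_hub_download

# Download one GGUF file into the current directory and get its local path.
model_path = hf_hub_download(
    repo_id="andrijdavid/finance-chat-GGUF",
    filename="finance-chat-f16.gguf",
    local_dir=".",
)
print(model_path)
```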
To accelerate downloads on fast connections (1Gbit/s or higher), install hf_transfer:
pip3 install hf_transfer
And set the environment variable HF_HUB_ENABLE_HF_TRANSFER to 1:
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download andrijdavid/finance-chat-GGUF finance-chat-f16.gguf --local-dir . --local-dir-use-symlinks False
💻 Usage Examples
Basic Usage with llama.cpp
Make sure you are using llama.cpp from commit d0cee0d or later.
./main -ngl 35 -m finance-chat-f16.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "<PROMPT>"
- Change -ngl 35 to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.
- Change -c 4096 to the desired sequence length.
If you want to have a chat-style conversation, replace the -p "<PROMPT>" argument with -i -ins.
Using in text-generation-webui
Further instructions can be found in the text-generation-webui documentation here: [text-generation-webui/docs/04 ‐ Model Tab.md](https://github.com/oobabooga/text-generation-webui/blob/main/docs/04%20%E2%80%90%20Model%20Tab.md#llamacpp).
Using from Python code
You can use GGUF models from Python with the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) library.
Install the package
# Base llama-cpp-python with no GPU acceleration
pip install llama-cpp-python
# With NVIDIA CUDA acceleration
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
# Or with OpenBLAS acceleration
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
# Or with CLBlast acceleration
CMAKE_ARGS="-DLLAMA_CLBLAST=on" pip install llama-cpp-python
# Or with AMD ROCm GPU acceleration (Linux only)
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
# Or with Metal GPU acceleration (macOS only)
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
# On Windows, set CMAKE_ARGS in PowerShell before installing; e.g. for NVIDIA CUDA:
$env:CMAKE_ARGS = "-DLLAMA_CUBLAS=on"
pip install llama-cpp-python
Simple llama-cpp-python example code
from llama_cpp import Llama

# Set n_gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
llm = Llama(
    model_path="./finance-chat-f16.gguf",  # Download the model file first
    n_ctx=32768,  # The max sequence length to use; longer sequence lengths require much more resources
    n_threads=8,  # The number of CPU threads to use; tailor to your system and the resulting performance
    n_gpu_layers=35  # The number of layers to offload to GPU, if you have GPU acceleration available
)
# Simple inference example
output = llm(
    "<PROMPT>",  # Prompt
    max_tokens=512,  # Generate up to 512 tokens
    stop=["</s>"],  # Example stop token - not necessarily correct for this specific model! Please check before using.
    echo=True  # Whether to echo the prompt
)
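# The call returns an OpenAI-style completion dict; the generated text is under choices[0]["text"]:
print(output["choices"][0]["text"])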
# Chat Completion API
llm = Llama(model_path="./finance-chat-f16.gguf", chat_format="llama-2")  # Set chat_format according to the model you are using
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a story writing assistant."},
        {"role": "user", "content": "Write a story about llamas."}
    ]
)
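# The assistant's reply follows the same OpenAI-style schema:
print(response["choices"][0]["message"]["content"])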
Using with LangChain
Guides for using llama-cpp-python and ctransformers with LangChain are available in the LangChain documentation.
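As an illustration, here is a minimal sketch using LangChain's LlamaCpp wrapper from the langchain-community package (the parameters mirror the llama-cpp-python example above; the prompt is just a placeholder):

```python
from langchain_community.llms import LlamaCpp

# Wrap the local GGUF model as a LangChain LLM.
llm = LlamaCpp(
    model_path="./finance-chat-f16.gguf",  # The file downloaded earlier
    n_ctx=4096,        # Context window
    n_gpu_layers=35,   # Layers to offload to GPU; use 0 for CPU-only
    temperature=0.7,
)

# Invoke the model with a prompt string.
print(llm.invoke("What factors drive a company's price-to-earnings ratio?"))
```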
📚 Documentation
About GGUF
GGUF is a new format introduced by the llama.cpp team on August 21st, 2023. It is a replacement for GGML, which is no longer supported by llama.cpp.
Here is an incomplete list of clients and libraries that are known to support GGUF:
- llama.cpp: the source project for GGUF, providing both a command-line interface (CLI) and a server option.
- [text-generation-webui](https://github.com/oobabooga/text-generation-webui): the most widely used web UI, with numerous features, powerful extensions, and GPU acceleration support.
- Ollama: a lightweight and extensible framework for building and running language models locally, featuring a simple API for creating, managing, and executing models, along with a library of pre-built models for various applications.
- KoboldCpp: a comprehensive web UI offering GPU acceleration across all platforms and architectures, particularly renowned for storytelling.
- GPT4All: a free and open-source GUI that runs locally, supporting Windows, Linux, and macOS with full GPU acceleration.
- LM Studio: an intuitive and powerful local GUI for Windows and macOS (Apple Silicon), featuring GPU acceleration.
- [LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui): a notable web UI with a variety of unique features, including a comprehensive model library for easy model selection.
- Faraday.dev: an attractive, user-friendly character-based chat GUI for Windows and macOS (both Silicon and Intel), also offering GPU acceleration.
- [llama-cpp-python](https://github.com/abetlen/llama-cpp-python): a Python library with GPU acceleration, LangChain support, and an OpenAI-compatible API server.
- candle: a Rust-based ML framework focused on performance, including GPU support, and designed for ease of use.
- ctransformers: a Python library with GPU acceleration, LangChain support, and an OpenAI-compatible API server.
- localGPT: an open-source initiative enabling private conversations with documents.
Explanation of Quantisation Methods
The new methods available are:
- GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
- GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw (see the worked example after this list).
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.
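As a sanity check on these figures, the Q4_K number can be reproduced by counting the bits in one 256-weight super-block, assuming (this detail is not stated above) that each super-block additionally stores one fp16 scale and one fp16 min:

```latex
% Q4_K super-block: 8 blocks x 32 weights = 256 weights
% 4-bit quants:                 256 x 4     = 1024 bits
% per-block 6-bit scales/mins:  (8 + 8) x 6 = 96 bits
% fp16 super-block scale and min (assumed): 2 x 16 = 32 bits
\text{bpw} = \frac{1024 + 96 + 32}{256} = \frac{1152}{256} = 4.5
```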
🔧 Technical Details
This domain-specific chat model is developed from LLaMA-2-Chat-7B, using the method described in the paper Adapting Large Language Models via Reading Comprehension. The team explores continued pre-training on domain-specific corpora for large language models. By transforming large-scale pre-training corpora into reading comprehension texts, the method improves prompting performance across tasks in the biomedicine, finance, and law domains. Their 7B model competes with much larger domain-specific models such as BloombergGPT-50B.
Updates
- 2024/1/16: 🎉 Their research paper has been accepted by ICLR 2024! 🎉
- 2023/12/19: Released their [13B base models](https://huggingface.co/AdaptLLM/law-LLM-13B) developed from LLaMA-1-13B.
- 2023/12/8: Released their [chat models](https://huggingface.co/AdaptLLM/law-chat) developed from LLaMA-2-Chat-7B.
- 2023/9/18: Released their paper, code, [data](https://huggingface.co/datasets/AdaptLLM/law-tasks), and [base models](https://huggingface.co/AdaptLLM/law-LLM) developed from LLaMA-1-7B.
Domain-Specific LLaMA-1
LLaMA-1-7B
In the paper, three domain-specific models are developed from LLaMA-1-7B, all of which are available on Hugging Face: [Biomedicine-LLM](https://huggingface.co/AdaptLLM/medicine-LLM), [Finance-LLM](https://huggingface.co/AdaptLLM/finance-LLM), and [Law-LLM](https://huggingface.co/AdaptLLM/law-LLM). Comparisons of AdaptLLM against other domain-specific LLMs are reported in the paper.
LLaMA-1-13B
Moreover, the base model is scaled up to LLaMA-1-13B to test whether the method is similarly effective for larger models.
📄 License
The model uses the llama2 license.

| Property | Details |
|---|---|
| Model Type | GGUF format model for finance-chat |
| Training Data | Open-Orca/OpenOrca, GAIR/lima, WizardLM/WizardLM_evol_instruct_V2_196k |
| Metrics | accuracy |
| Pipeline Tag | text-generation |
| Quantized By | andrijdavid |
| Tags | finance, GGUF |
| License | llama2 |

