dolly-v2-3b Model Card
dolly-v2-3b is an instruction-following large language model developed by Databricks. Based on pythia-2.8b, it is fine-tuned on a custom instruction dataset that enables it to follow instructions effectively. Although it is not a state-of-the-art model, it exhibits surprisingly high-quality instruction-following behavior.
✨ Features
- Commercial Use License: Licensed for commercial use, making it accessible for various business applications.
- Instruction-Following Ability: Fine-tuned on a custom dataset of ~15k instruction/response records, it can handle a wide range of instructions, including brainstorming, classification, and summarization.
- Multiple Model Sizes: Available in several sizes (dolly-v2-3b, dolly-v2-7b, dolly-v2-12b), allowing users to choose according to their specific needs.
📦 Installation
To use the model with the transformers library on a machine with GPUs, first ensure that the transformers and accelerate libraries are installed. In a Databricks notebook, you can run the following command:

```bash
%pip install "accelerate>=0.16.0,<1" "transformers[torch]>=4.28.1,<5" "torch>=1.13.1,<2"
```
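After installation, a quick sanity check can confirm that the pinned versions were picked up. This is an optional sketch, not part of the official instructions:

```python
# Optional sanity check: print the installed versions of the pinned packages
import accelerate
import torch
import transformers

print("transformers:", transformers.__version__)  # expect >=4.28.1,<5
print("accelerate:", accelerate.__version__)      # expect >=0.16.0,<1
print("torch:", torch.__version__)                # expect >=1.13.1,<2
```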
💻 Usage Examples
Basic Usage
The instruction-following pipeline can be loaded using the pipeline function as shown below. This loads a custom InstructionTextGenerationPipeline found in the model repo, which is why trust_remote_code=True is required.

```python
import torch
from transformers import pipeline

# Load the custom instruction-following pipeline shipped with the model repo
generate_text = pipeline(model="databricks/dolly-v2-3b", torch_dtype=torch.bfloat16,
                         trust_remote_code=True, device_map="auto")
```
You can then use the pipeline to answer instructions:

```python
res = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(res[0]["generated_text"])
```
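Generation behavior can also be tuned through the pipeline call. A minimal sketch, assuming the standard transformers generation keyword arguments (max_new_tokens, do_sample, temperature) are forwarded to the underlying model as usual:

```python
# Assumed: standard generation kwargs are forwarded to model.generate()
res = generate_text(
    "Write a tagline for an ice cream shop.",
    max_new_tokens=128,  # cap the length of the generated response
    do_sample=True,      # sample rather than decode greedily
    temperature=0.7,     # soften the token distribution
)
print(res[0]["generated_text"])
```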
Advanced Usage
If you prefer not to use trust_remote_code=True, you can download instruct_pipeline.py, store it alongside your notebook, and construct the pipeline yourself from the loaded model and tokenizer:

```python
import torch
from instruct_pipeline import InstructionTextGenerationPipeline
from transformers import AutoModelForCausalLM, AutoTokenizer

# Left padding is required so generation starts right after the prompt
tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-3b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-3b",
                                             device_map="auto", torch_dtype=torch.bfloat16)

generate_text = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)
```
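The resulting pipeline is then used the same way as the trust_remote_code variant shown above:

```python
res = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(res[0]["generated_text"])
```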
LangChain Usage
To use the pipeline with LangChain, you must set return_full_text=True, as LangChain expects the full text to be returned, while the pipeline's default is to return only the newly generated text.

```python
import torch
from transformers import pipeline

generate_text = pipeline(model="databricks/dolly-v2-3b", torch_dtype=torch.bfloat16,
                         trust_remote_code=True, device_map="auto", return_full_text=True)
```
You can create a prompt that has either an instruction alone or an instruction with context:

```python
from langchain import PromptTemplate, LLMChain
from langchain.llms import HuggingFacePipeline

# Template for an instruction with no context
prompt = PromptTemplate(
    input_variables=["instruction"],
    template="{instruction}")

# Template for an instruction that includes context
prompt_with_context = PromptTemplate(
    input_variables=["instruction", "context"],
    template="{instruction}\n\nInput:\n{context}")

hf_pipeline = HuggingFacePipeline(pipeline=generate_text)

llm_chain = LLMChain(llm=hf_pipeline, prompt=prompt)
llm_context_chain = LLMChain(llm=hf_pipeline, prompt=prompt_with_context)
```
Example prediction using a simple instruction:

```python
print(llm_chain.predict(instruction="Explain to me the difference between nuclear fission and fusion.").lstrip())
```

Example prediction using an instruction with context:

```python
context = """George Washington (February 22, 1732[b] - December 14, 1799) was an American military officer, statesman,
and Founding Father who served as the first president of the United States from 1789 to 1797."""

print(llm_context_chain.predict(instruction="When was George Washington president?", context=context).lstrip())
```
🔧 Technical Details
dolly-v2-3b is a 2.8 billion parameter causal language model created by Databricks. It is derived from EleutherAI's pythia-2.8b and fine-tuned on a ~15k record instruction corpus generated by Databricks employees. The dataset is released under a permissive license (CC-BY-SA).
Model Overview
| Property | Details |
|----------|---------|
| Model Type | Causal Language Model |
| Training Data | Based on pythia-2.8b and fine-tuned on databricks-dolly-15k |
Known Limitations
Performance Limitations
dolly-v2-3b is not a state-of-the-art generative language model. It struggles with syntactically complex prompts, programming problems, mathematical operations, and similar tasks. It also lacks some capabilities present in the original model, such as well-formatted letter writing.
Dataset Limitations
- The Pile: The pre-training corpus of GPT-J contains content mostly collected from the public internet, which may include objectionable content. The model is likely to reflect these shortcomings.
- databricks-dolly-15k: The training data was generated by Databricks employees from March to April 2023. It may contain typos, factual errors, and biases carried over from Wikipedia. It also reflects the interests and semantic choices of Databricks employees.
Benchmark Metrics
Below are the benchmark performances of various models on the EleutherAI LLM Evaluation Harness. Model results are sorted by geometric mean. These results show that dolly-v2-3b is not state-of-the-art.
| model | openbookqa | arc_easy | winogrande | hellaswag | arc_challenge | piqa | boolq | gmean |
|-------|------------|----------|------------|-----------|---------------|------|-------|-------|
| EleutherAI/pythia-2.8b | 0.348 | 0.585859 | 0.589582 | 0.591217 | 0.323379 | 0.73395 | 0.638226 | 0.523431 |
| EleutherAI/pythia-6.9b | 0.368 | 0.604798 | 0.608524 | 0.631548 | 0.343857 | 0.761153 | 0.6263 | 0.543567 |
| databricks/dolly-v2-3b | 0.384 | 0.611532 | 0.589582 | 0.650767 | 0.370307 | 0.742655 | 0.575535 | 0.544886 |
| EleutherAI/pythia-12b | 0.364 | 0.627104 | 0.636148 | 0.668094 | 0.346416 | 0.760065 | 0.673394 | 0.559676 |
| EleutherAI/gpt-j-6B | 0.382 | 0.621633 | 0.651144 | 0.662617 | 0.363481 | 0.761153 | 0.655963 | 0.565936 |
| databricks/dolly-v2-12b | 0.408 | 0.63931 | 0.616417 | 0.707927 | 0.388225 | 0.757889 | 0.568196 | 0.56781 |
| databricks/dolly-v2-7b | 0.392 | 0.633838 | 0.607735 | 0.686517 | 0.406997 | 0.750816 | 0.644037 | 0.573487 |
| databricks/dolly-v1-6b | 0.41 | 0.62963 | 0.643252 | 0.676758 | 0.384812 | 0.773667 | 0.687768 | 0.583431 |
| EleutherAI/gpt-neox-20b | 0.402 | 0.683923 | 0.656669 | 0.7142 | 0.408703 | 0.784004 | 0.695413 | 0.602236 |
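The gmean column is the geometric mean of the seven per-task accuracies. As a quick illustrative check, the first row can be reproduced as follows (the scores are copied from the table; any tiny discrepancy comes from rounding in the table entries):

```python
import math

# Per-task accuracies for EleutherAI/pythia-2.8b, copied from the table above
scores = [0.348, 0.585859, 0.589582, 0.591217, 0.323379, 0.73395, 0.638226]

# Geometric mean: exponentiate the arithmetic mean of the logs
gmean = math.exp(sum(math.log(s) for s in scores) / len(scores))
print(f"{gmean:.6f}")  # ~0.523430, matching the table's 0.523431 up to rounding
```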
📄 License
This project is licensed under the MIT license.
📚 Documentation
For tips on running inference for various GPU configurations, please refer to the dolly GitHub repo.
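As one illustration of the kind of configuration those tips cover, below is a minimal sketch of loading the model in float16 instead of bfloat16. The float16 choice is an assumption here, intended for GPUs without bfloat16 support; defer to the dolly repo for the recommended settings per GPU type:

```python
import torch
from transformers import pipeline

# Assumption: float16 halves weight memory versus float32 and works on GPUs
# that lack bfloat16 support; see the dolly repo for per-GPU recommendations.
generate_text = pipeline(model="databricks/dolly-v2-3b", torch_dtype=torch.float16,
                         trust_remote_code=True, device_map="auto")
```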
📖 Citation
```bibtex
@online{DatabricksBlog2023DollyV2,
    author  = {Mike Conover and Matt Hayes and Ankit Mathur and Jianwei Xie and Jun Wan and Sam Shah and Ali Ghodsi and Patrick Wendell and Matei Zaharia and Reynold Xin},
    title   = {Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM},
    year    = {2023},
    url     = {https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm},
    urldate = {2023-06-30}
}
```
Happy Hacking!