# 🚀 dolly-v2-12b Model Card

`dolly-v2-12b` is an instruction-following large language model developed by Databricks. It was trained on the Databricks machine learning platform and is licensed for commercial use. The model is based on `pythia-12b` and fine-tuned on a dataset of about 15k instruction/response records.
## ✨ Features

- **Commercial Use License**: `dolly-v2-12b` is licensed for commercial use, making it accessible for various business applications.
- **Instruction-Following Capability**: It can follow instructions in multiple domains such as brainstorming, classification, and summarization.
- **Available in Multiple Sizes**: Besides the 12-billion-parameter version, smaller models such as `dolly-v2-7b` and `dolly-v2-3b` are also available.
## 📦 Installation

To use the model with the `transformers` library on a machine with GPUs, first ensure you have the `transformers` and `accelerate` libraries installed. In a Databricks notebook, you can run the following command:

```
%pip install "accelerate>=0.16.0,<1" "transformers[torch]>=4.28.1,<5" "torch>=1.13.1,<2"
```
## 💻 Usage Examples

### Basic Usage

```python
import torch
from transformers import pipeline

generate_text = pipeline(model="databricks/dolly-v2-12b", torch_dtype=torch.bfloat16,
                         trust_remote_code=True, device_map="auto")

res = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(res[0]["generated_text"])
```
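If the custom Dolly pipeline forwards generation keyword arguments the way the stock text-generation pipeline does (an assumption, not something this card states), you can tune the output with standard `transformers` `generate()` arguments. A minimal sketch with illustrative, not recommended, values:

```python
# Standard transformers generate() arguments, assumed to be forwarded
# by the pipeline to model.generate().
res = generate_text(
    "Explain to me the difference between nuclear fission and fusion.",
    max_new_tokens=256,  # cap the length of the generated response
    do_sample=True,      # sample rather than decode greedily
    temperature=0.7,     # soften the sampling distribution
)
print(res[0]["generated_text"])
```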
### Advanced Usage

If you prefer not to use `trust_remote_code=True`, you can download `instruct_pipeline.py`, store it alongside your notebook, and construct the pipeline yourself:

```python
import torch
from instruct_pipeline import InstructionTextGenerationPipeline
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-12b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-12b", device_map="auto", torch_dtype=torch.bfloat16)

generate_text = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)
```
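The resulting pipeline is then called the same way as in the basic example above (assuming it returns the same list-of-records output with a `generated_text` field, as the stock pipeline does):

```python
# Same call pattern as the basic example, assuming matching output format.
res = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(res[0]["generated_text"])
```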
### LangChain Usage

To use the pipeline with LangChain, you must set `return_full_text=True`, because LangChain expects the full text to be returned and the pipeline's default is to return only the newly generated text:

```python
import torch
from transformers import pipeline

generate_text = pipeline(model="databricks/dolly-v2-12b", torch_dtype=torch.bfloat16,
                         trust_remote_code=True, device_map="auto", return_full_text=True)
```
Create prompts, with or without context, and use them for prediction:

```python
from langchain import PromptTemplate, LLMChain
from langchain.llms import HuggingFacePipeline

# Template for an instruction alone
prompt = PromptTemplate(
    input_variables=["instruction"],
    template="{instruction}")

# Template for an instruction that references a context passage
prompt_with_context = PromptTemplate(
    input_variables=["instruction", "context"],
    template="{instruction}\n\nInput:\n{context}")

hf_pipeline = HuggingFacePipeline(pipeline=generate_text)

llm_chain = LLMChain(llm=hf_pipeline, prompt=prompt)
llm_context_chain = LLMChain(llm=hf_pipeline, prompt=prompt_with_context)

print(llm_chain.predict(instruction="Explain to me the difference between nuclear fission and fusion.").lstrip())

context = """George Washington (February 22, 1732[b] - December 14, 1799) was an American military officer, statesman,
and Founding Father who served as the first president of the United States from 1789 to 1797."""

print(llm_context_chain.predict(instruction="When was George Washington president?", context=context).lstrip())
```
## 📚 Documentation

### Model Overview

`dolly-v2-12b` is a 12-billion-parameter causal language model created by Databricks. It is derived from EleutherAI's `pythia-12b` and fine-tuned on a ~15k-record instruction corpus generated by Databricks employees and released under a permissive license (CC-BY-SA).
### Known Limitations

#### Performance Limitations

`dolly-v2-12b` is not a state-of-the-art generative language model. It struggles with syntactically complex prompts, programming problems, mathematical operations, and similar tasks.

#### Dataset Limitations

- **The Pile**: The pre-training corpus contains content from the public internet, which may include objectionable material. The model may reflect these shortcomings.
- **`databricks-dolly-15k`**: The training data was generated by Databricks employees from March to April 2023. It may contain typos, factual errors, and biases drawn from Wikipedia.
### Benchmark Metrics

| model | openbookqa | arc_easy | winogrande | hellaswag | arc_challenge | piqa | boolq | gmean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EleutherAI/pythia-2.8b | 0.348 | 0.585859 | 0.589582 | 0.591217 | 0.323379 | 0.73395 | 0.638226 | 0.523431 |
| EleutherAI/pythia-6.9b | 0.368 | 0.604798 | 0.608524 | 0.631548 | 0.343857 | 0.761153 | 0.6263 | 0.543567 |
| databricks/dolly-v2-3b | 0.384 | 0.611532 | 0.589582 | 0.650767 | 0.370307 | 0.742655 | 0.575535 | 0.544886 |
| EleutherAI/pythia-12b | 0.364 | 0.627104 | 0.636148 | 0.668094 | 0.346416 | 0.760065 | 0.673394 | 0.559676 |
| EleutherAI/gpt-j-6B | 0.382 | 0.621633 | 0.651144 | 0.662617 | 0.363481 | 0.761153 | 0.655963 | 0.565936 |
| databricks/dolly-v2-12b | 0.408 | 0.63931 | 0.616417 | 0.707927 | 0.388225 | 0.757889 | 0.568196 | 0.56781 |
| databricks/dolly-v2-7b | 0.392 | 0.633838 | 0.607735 | 0.686517 | 0.406997 | 0.750816 | 0.644037 | 0.573487 |
| databricks/dolly-v1-6b | 0.41 | 0.62963 | 0.643252 | 0.676758 | 0.384812 | 0.773667 | 0.687768 | 0.583431 |
| EleutherAI/gpt-neox-20b | 0.402 | 0.683923 | 0.656669 | 0.7142 | 0.408703 | 0.784004 | 0.695413 | 0.602236 |
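Scores like these are commonly produced with EleutherAI's lm-evaluation-harness. A minimal reproduction sketch, assuming the harness's v0.4+ Python API (`simple_evaluate`, the `hf` backend, and the task names are harness conventions, not part of this model card, and exact numbers can shift across harness versions):

```python
# Hypothetical benchmark run with EleutherAI's lm-evaluation-harness (v0.4+).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face causal-LM backend
    model_args="pretrained=databricks/dolly-v2-12b,dtype=bfloat16",
    tasks=["openbookqa", "arc_easy", "winogrande", "hellaswag",
           "arc_challenge", "piqa", "boolq"],
)

# Print the per-task metric dictionaries.
for task, metrics in results["results"].items():
    print(task, metrics)
```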
## 📄 License

The model is licensed under the MIT license.
## 📚 Citation

```bibtex
@online{DatabricksBlog2023DollyV2,
    author  = {Mike Conover and Matt Hayes and Ankit Mathur and Jianwei Xie and Jun Wan and Sam Shah and Ali Ghodsi and Patrick Wendell and Matei Zaharia and Reynold Xin},
    title   = {Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM},
    year    = {2023},
    url     = {https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm},
    urldate = {2023-06-30}
}
```
Happy Hacking!