dolly-v2-3b Model Card
dolly-v2-3b is an instruction-following large language model developed by Databricks. Based on pythia-2.8b, it is fine-tuned on a custom instruction dataset that enables it to follow instructions effectively. Although it is not a state-of-the-art model, it exhibits surprisingly high-quality instruction-following behavior.
✨ Features
- Commercial Use License: Licensed for commercial use, making it accessible for various business applications.
- Instruction-Following Ability: Fine-tuned on a custom dataset of ~15k instruction/response records, it can handle a wide range of instructions, including brainstorming, classification, and summarization.
- Multiple Model Sizes: Available in several sizes (dolly-v2-3b, dolly-v2-7b, dolly-v2-12b), allowing users to choose according to their specific needs.
📦 Installation
To use the model with the transformers library on a machine with GPUs, first ensure that the transformers and accelerate libraries are installed. In a Databricks notebook, you can run the following command:

```bash
%pip install "accelerate>=0.16.0,<1" "transformers[torch]>=4.28.1,<5" "torch>=1.13.1,<2"
```
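After installation, a quick sanity check can confirm that the pinned versions were picked up. This is an optional sketch, not part of the official instructions:

```python
# Optional sanity check: print the installed versions of the pinned packages
import accelerate
import torch
import transformers

print("transformers:", transformers.__version__)  # expect >=4.28.1,<5
print("accelerate:", accelerate.__version__)      # expect >=0.16.0,<1
print("torch:", torch.__version__)                # expect >=1.13.1,<2
```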
💻 Usage Examples
Basic Usage
The instruction-following pipeline can be loaded using the pipeline function as shown below. This loads a custom InstructionTextGenerationPipeline found in the model repo, which is why trust_remote_code=True is required.

```python
import torch
from transformers import pipeline

# Load the custom instruction-following pipeline shipped with the model repo
generate_text = pipeline(model="databricks/dolly-v2-3b", torch_dtype=torch.bfloat16,
                         trust_remote_code=True, device_map="auto")
```
You can then use the pipeline to answer instructions:

```python
res = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(res[0]["generated_text"])
```
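Generation behavior can also be tuned through the pipeline call. A minimal sketch, assuming the standard transformers generation keyword arguments (max_new_tokens, do_sample, temperature) are forwarded to the underlying model as usual:

```python
# Assumed: standard generation kwargs are forwarded to model.generate()
res = generate_text(
    "Write a tagline for an ice cream shop.",
    max_new_tokens=128,  # cap the length of the generated response
    do_sample=True,      # sample rather than decode greedily
    temperature=0.7,     # soften the token distribution
)
print(res[0]["generated_text"])
```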
Advanced Usage
If you prefer not to use trust_remote_code=True, you can download instruct_pipeline.py, store it alongside your notebook, and construct the pipeline yourself from the loaded model and tokenizer:

```python
import torch
from instruct_pipeline import InstructionTextGenerationPipeline
from transformers import AutoModelForCausalLM, AutoTokenizer

# Left padding is required so generation starts right after the prompt
tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-3b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-3b",
                                             device_map="auto", torch_dtype=torch.bfloat16)

generate_text = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)
```
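The resulting pipeline is then used the same way as the trust_remote_code variant shown above:

```python
res = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(res[0]["generated_text"])
```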
LangChain Usage
To use the pipeline with LangChain, you must set return_full_text=True, as LangChain expects the full text to be returned, while the pipeline's default is to return only the newly generated text.

```python
import torch
from transformers import pipeline

generate_text = pipeline(model="databricks/dolly-v2-3b", torch_dtype=torch.bfloat16,
                         trust_remote_code=True, device_map="auto", return_full_text=True)
```
You can create a prompt that has either an instruction alone or an instruction with context:

```python
from langchain import PromptTemplate, LLMChain
from langchain.llms import HuggingFacePipeline

# Template for an instruction with no context
prompt = PromptTemplate(
    input_variables=["instruction"],
    template="{instruction}")

# Template for an instruction that includes context
prompt_with_context = PromptTemplate(
    input_variables=["instruction", "context"],
    template="{instruction}\n\nInput:\n{context}")

hf_pipeline = HuggingFacePipeline(pipeline=generate_text)

llm_chain = LLMChain(llm=hf_pipeline, prompt=prompt)
llm_context_chain = LLMChain(llm=hf_pipeline, prompt=prompt_with_context)
```
Example prediction using a simple instruction:

```python
print(llm_chain.predict(instruction="Explain to me the difference between nuclear fission and fusion.").lstrip())
```

Example prediction using an instruction with context:

```python
context = """George Washington (February 22, 1732[b] - December 14, 1799) was an American military officer, statesman,
and Founding Father who served as the first president of the United States from 1789 to 1797."""

print(llm_context_chain.predict(instruction="When was George Washington president?", context=context).lstrip())
```
🔧 Technical Details
dolly-v2-3b is a 2.8 billion parameter causal language model created by Databricks. It is derived from EleutherAI's pythia-2.8b and fine-tuned on a ~15k record instruction corpus generated by Databricks employees. The dataset is released under a permissive license (CC-BY-SA).
Model Overview
| Property | Details |
|----------|---------|
| Model Type | Causal Language Model |
| Training Data | Based on pythia-2.8b and fine-tuned on databricks-dolly-15k |
Known Limitations
Performance Limitations
dolly-v2-3b is not a state-of-the-art generative language model. It struggles with syntactically complex prompts, programming problems, mathematical operations, and similar tasks. It also lacks some capabilities present in the original model, such as well-formatted letter writing.
Dataset Limitations
- The Pile: The pre-training corpus of GPT-J contains content mostly collected from the public internet, which may include objectionable content. The model is likely to reflect these shortcomings.
- databricks-dolly-15k: The training data was generated by Databricks employees from March to April 2023. It may contain typos, factual errors, and biases carried over from Wikipedia. It also reflects the interests and semantic choices of Databricks employees.
Benchmark Metrics
Below are the benchmark performances of various models on the EleutherAI LLM Evaluation Harness. Model results are sorted by geometric mean. These results show that dolly-v2-3b is not state-of-the-art.
| model | openbookqa | arc_easy | winogrande | hellaswag | arc_challenge | piqa | boolq | gmean |
|-------|------------|----------|------------|-----------|---------------|------|-------|-------|
| EleutherAI/pythia-2.8b | 0.348 | 0.585859 | 0.589582 | 0.591217 | 0.323379 | 0.73395 | 0.638226 | 0.523431 |
| EleutherAI/pythia-6.9b | 0.368 | 0.604798 | 0.608524 | 0.631548 | 0.343857 | 0.761153 | 0.6263 | 0.543567 |
| databricks/dolly-v2-3b | 0.384 | 0.611532 | 0.589582 | 0.650767 | 0.370307 | 0.742655 | 0.575535 | 0.544886 |
| EleutherAI/pythia-12b | 0.364 | 0.627104 | 0.636148 | 0.668094 | 0.346416 | 0.760065 | 0.673394 | 0.559676 |
| EleutherAI/gpt-j-6B | 0.382 | 0.621633 | 0.651144 | 0.662617 | 0.363481 | 0.761153 | 0.655963 | 0.565936 |
| databricks/dolly-v2-12b | 0.408 | 0.63931 | 0.616417 | 0.707927 | 0.388225 | 0.757889 | 0.568196 | 0.56781 |
| databricks/dolly-v2-7b | 0.392 | 0.633838 | 0.607735 | 0.686517 | 0.406997 | 0.750816 | 0.644037 | 0.573487 |
| databricks/dolly-v1-6b | 0.41 | 0.62963 | 0.643252 | 0.676758 | 0.384812 | 0.773667 | 0.687768 | 0.583431 |
| EleutherAI/gpt-neox-20b | 0.402 | 0.683923 | 0.656669 | 0.7142 | 0.408703 | 0.784004 | 0.695413 | 0.602236 |
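The gmean column is the geometric mean of the seven per-task accuracies. As a quick illustrative check, the first row can be reproduced as follows (the scores are copied from the table; any tiny discrepancy comes from rounding in the table entries):

```python
import math

# Per-task accuracies for EleutherAI/pythia-2.8b, copied from the table above
scores = [0.348, 0.585859, 0.589582, 0.591217, 0.323379, 0.73395, 0.638226]

# Geometric mean: exponentiate the arithmetic mean of the logs
gmean = math.exp(sum(math.log(s) for s in scores) / len(scores))
print(f"{gmean:.6f}")  # ~0.523430, matching the table's 0.523431 up to rounding
```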
📄 License
This project is licensed under the MIT license.
📚 Documentation
For tips on running inference for various GPU configurations, please refer to the dolly GitHub repo.
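As one illustration of the kind of configuration those tips cover, below is a minimal sketch of loading the model in float16 instead of bfloat16. The float16 choice is an assumption here, intended for GPUs without bfloat16 support; defer to the dolly repo for the recommended settings per GPU type:

```python
import torch
from transformers import pipeline

# Assumption: float16 halves weight memory versus float32 and works on GPUs
# that lack bfloat16 support; see the dolly repo for per-GPU recommendations.
generate_text = pipeline(model="databricks/dolly-v2-3b", torch_dtype=torch.float16,
                         trust_remote_code=True, device_map="auto")
```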
📖 Citation
```bibtex
@online{DatabricksBlog2023DollyV2,
    author  = {Mike Conover and Matt Hayes and Ankit Mathur and Jianwei Xie and Jun Wan and Sam Shah and Ali Ghodsi and Patrick Wendell and Matei Zaharia and Reynold Xin},
    title   = {Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM},
    year    = {2023},
    url     = {https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm},
    urldate = {2023-06-30}
}
```
Happy Hacking!