Piccolo Math 2x7b

Developed by macadeliccc

Piccolo-math-2x7b is a large language model specializing in mathematical and logical reasoning, named in honor of the author's pet dog Klaus. The model demonstrates outstanding performance across multiple benchmarks, particularly in mathematical and code generation tasks.

Large Language Model

Transformers

Open Source License:MIT #Mathematical reasoning #Logical analysis #Multitask evaluation

Downloads 87

Release Time : 1/16/2024

Model Overview

Piccolo-math-2x7b is a large language model based on the Transformer architecture, focusing on mathematical, code generation, and logical reasoning tasks. It supports high-quality text generation and has achieved excellent results on multiple standard evaluation datasets.

Model Features

Mathematical reasoning capability

Achieves 70.13% accuracy on the GSM8k mathematical reasoning benchmark, significantly outperforming similar base models

Multitasking

Demonstrates balanced performance across various tasks including text generation, logical reasoning, and code generation

Efficient inference

Supports 4-bit quantization loading, reducing hardware requirements while maintaining good performance

Model Capabilities

Mathematical problem solving

Code generation

Logical reasoning

Common sense Q&A

Text generation

Use Cases

Education

Math tutoring

Helps students solve math problems and explains solution steps

Achieves 70.13% accuracy on the GSM8k test set

Development assistance

Code generation

Generates code snippets based on natural language descriptions

Examples demonstrate high-quality code generation capability

license: mit model-index:

name: piccolo-math-2x7b results:
- task: type: text-generation name: Text Generation dataset: name: AI2 Reasoning Challenge (25-Shot) type: ai2_arc config: ARC-Challenge split: test args: num_few_shot: 25 metrics:
  - type: acc_norm value: 69.11 name: normalized accuracy source: url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/piccolo-math-2x7b name: Open LLM Leaderboard
- task: type: text-generation name: Text Generation dataset: name: HellaSwag (10-Shot) type: hellaswag split: validation args: num_few_shot: 10 metrics:
  - type: acc_norm value: 87.27 name: normalized accuracy source: url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/piccolo-math-2x7b name: Open LLM Leaderboard
- task: type: text-generation name: Text Generation dataset: name: MMLU (5-Shot) type: cais/mmlu config: all split: test args: num_few_shot: 5 metrics:
  - type: acc value: 63.69 name: accuracy source: url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/piccolo-math-2x7b name: Open LLM Leaderboard
- task: type: text-generation name: Text Generation dataset: name: TruthfulQA (0-shot) type: truthful_qa config: multiple_choice split: validation args: num_few_shot: 0 metrics:
  - type: mc2 value: 63.86 source: url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/piccolo-math-2x7b name: Open LLM Leaderboard
- task: type: text-generation name: Text Generation dataset: name: Winogrande (5-shot) type: winogrande config: winogrande_xl split: validation args: num_few_shot: 5 metrics:
  - type: acc value: 79.87 name: accuracy source: url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/piccolo-math-2x7b name: Open LLM Leaderboard
- task: type: text-generation name: Text Generation dataset: name: GSM8k (5-shot) type: gsm8k config: main split: test args: num_few_shot: 5 metrics:
  - type: acc value: 70.13 name: accuracy source: url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/piccolo-math-2x7b name: Open LLM Leaderboard

Piccolo-math-2x7b

In loving memory of my dog Klaus (Piccolo)

~ Piccolo (Italian): the little one ~

Code Example

Inference and Evaluation colab available here

from transformers import AutoModelForCausalLM, AutoTokenizer

def generate_response(prompt):
    """
    Generate a response from the model based on the input prompt.
    Args:
    prompt (str): Prompt for the model.

    Returns:
    str: The generated response from the model.
    """
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=256, eos_token_id=tokenizer.eos_token_id, pad_token_id=tokenizer.pad_token_id)

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return response

model_id = "macadeliccc/piccolo-math-2x7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id,load_in_4bit=True)

prompt = "What is the best way to train Cane Corsos?"

print("Response:")
print(generate_response(prompt), "\n")

The model is capable of quality code, math, and logical reasoning. Try whatever questions you think of.

Evaluations

Model	AGIEval	GPT4All	TruthfulQA	Bigbench	Average
piccolo-math-2x7b	43.89	74.98	63.96	44.99	56.96

EQ Bench

Benchmark Complete:

2024-01-24 00:00:40
Time taken: 183.3 mins
Prompt Format: Mistral
Model: macadeliccc/piccolo-math-2x7b
Score (v2): 70.74
Parseable: 167.0

Batch completed Time taken: 183.3 mins

AGIEval

Task	Version	Metric	Value		Stderr
agieval_aqua_rat	0	acc	24.41	±	2.70
		acc_norm	24.80	±	2.72
agieval_logiqa_en	0	acc	35.79	±	1.88
		acc_norm	36.71	±	1.89
agieval_lsat_ar	0	acc	23.48	±	2.80
		acc_norm	23.91	±	2.82
agieval_lsat_lr	0	acc	49.22	±	2.22
		acc_norm	50.00	±	2.22
agieval_lsat_rc	0	acc	63.94	±	2.93
		acc_norm	64.31	±	2.93
agieval_sat_en	0	acc	77.18	±	2.93
		acc_norm	76.70	±	2.95
agieval_sat_en_without_passage	0	acc	45.15	±	3.48
		acc_norm	44.66	±	3.47
agieval_sat_math	0	acc	33.64	±	3.19
		acc_norm	30.00	±	3.10

Average: 43.89%

GPT4All

Task	Version	Metric	Value		Stderr
arc_challenge	0	acc	61.86	±	1.42
		acc_norm	62.88	±	1.41
arc_easy	0	acc	84.34	±	0.75
		acc_norm	80.47	±	0.81
boolq	1	acc	86.88	±	0.59
hellaswag	0	acc	68.56	±	0.46
		acc_norm	85.16	±	0.35
openbookqa	0	acc	37.00	±	2.16
		acc_norm	47.80	±	2.24
piqa	0	acc	82.21	±	0.89
		acc_norm	83.68	±	0.86
winogrande	0	acc	77.98	±	1.16

Average: 74.98%

TruthfulQA

Task	Version	Metric	Value		Stderr
truthfulqa_mc	1	mc1	47.37	±	1.75
		mc2	63.96	±	1.57

Average: 63.96%

Bigbench

Task	Version	Metric	Value		Stderr
bigbench_causal_judgement	0	multiple_choice_grade	55.26	±	3.62
bigbench_date_understanding	0	multiple_choice_grade	63.14	±	2.51
bigbench_disambiguation_qa	0	multiple_choice_grade	42.64	±	3.08
bigbench_geometric_shapes	0	multiple_choice_grade	22.84	±	2.22
		exact_str_match	3.34	±	0.95
bigbench_logical_deduction_five_objects	0	multiple_choice_grade	36.60	±	2.16
bigbench_logical_deduction_seven_objects	0	multiple_choice_grade	25.57	±	1.65
bigbench_logical_deduction_three_objects	0	multiple_choice_grade	56.00	±	2.87
bigbench_movie_recommendation	0	multiple_choice_grade	42.40	±	2.21
bigbench_navigate	0	multiple_choice_grade	54.70	±	1.57
bigbench_reasoning_about_colored_objects	0	multiple_choice_grade	62.90	±	1.08
bigbench_ruin_names	0	multiple_choice_grade	53.35	±	2.36
bigbench_salient_translation_error_detection	0	multiple_choice_grade	24.35	±	1.36
bigbench_snarks	0	multiple_choice_grade	62.43	±	3.61
bigbench_sports_understanding	0	multiple_choice_grade	70.28	±	1.46
bigbench_temporal_sequences	0	multiple_choice_grade	41.30	±	1.56
bigbench_tracking_shuffled_objects_five_objects	0	multiple_choice_grade	22.32	±	1.18
bigbench_tracking_shuffled_objects_seven_objects	0	multiple_choice_grade	17.77	±	0.91
bigbench_tracking_shuffled_objects_three_objects	0	multiple_choice_grade	56.00	±	2.87

Average: 44.99%

Average score: 56.96%

Elapsed time: 01:51:53

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric	Value
Avg.	72.32
AI2 Reasoning Challenge (25-Shot)	69.11
HellaSwag (10-Shot)	87.27
MMLU (5-Shot)	63.69
TruthfulQA (0-shot)	63.86
Winogrande (5-shot)	79.87
GSM8k (5-shot)	70.13

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご