# Mamba-7B
Mamba-7B is a 7B parameter model based on the Mamba architecture, trained on the RefinedWeb dataset. It offers strong performance on various natural language benchmarks and serves as a baseline for the paper "Linearizing Large Language Models".
## Quick Start
This model was trained using OpenLM. The weights have been converted to be compatible with HuggingFace `transformers`.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the converted checkpoint and its tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("tri-ml/mamba-7b-rw")
model = AutoModelForCausalLM.from_pretrained("tri-ml/mamba-7b-rw")

# Tokenize a prompt and sample a continuation.
inputs = tokenizer(["The Toyota Supra"], return_tensors="pt")
gen_kwargs = {"max_new_tokens": 50, "top_p": 0.8, "temperature": 0.8, "do_sample": True, "repetition_penalty": 1.1}
output = model.generate(inputs["input_ids"], **gen_kwargs)

# Decode the generated token ids back to text.
output = tokenizer.decode(output[0].tolist(), skip_special_tokens=True)
print(output)
```
## Features
- Innovative Architecture: Based on the Mamba architecture, a state-space model that does not use self-attention, unlike the standard transformer architecture (see the sketch after this list).
- Strong Performance: Demonstrates good performance on various natural language benchmarks.
- Large-Scale Training: Trained on 1.2T tokens of the RefinedWeb dataset.
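For intuition about what "state-space model" means here, the toy sketch below unrolls a plain linear state-space recurrence in NumPy. It is only an illustration of the general idea, not the Mamba implementation: Mamba uses input-dependent (selective) parameters, discretization, and fused scan kernels, none of which appear in this sketch.

```python
# Toy linear state-space recurrence, for intuition only.
# NOT the Mamba implementation: real Mamba uses selective (input-dependent)
# A/B/C parameters and hardware-aware parallel scans.
import numpy as np

def ssm_scan(x, A, B, C):
    """x: (seq_len, d_in); A: (d_state, d_state); B: (d_state, d_in); C: (d_out, d_state)."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                 # sequential scan over the sequence
        h = A @ h + B @ x_t       # state update: h_t = A h_{t-1} + B x_t
        ys.append(C @ h)          # readout:      y_t = C h_t
    return np.stack(ys)

# Example with random parameters and shapes chosen for illustration.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
A = 0.9 * np.eye(16)
B = 0.1 * rng.normal(size=(16, 4))
C = 0.1 * rng.normal(size=(2, 16))
print(ssm_scan(x, A, B, C).shape)  # (8, 2)
```

The key contrast with self-attention is that each output depends on the input only through a fixed-size recurrent state, so inference cost per token does not grow with sequence length.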
## Installation
This model can be used with the HuggingFace `transformers` library. You can install the necessary libraries with the following command:

```bash
pip install transformers
```
## Documentation
### Model Details
| Parameters | Hidden Size | Layers | Vocab Size | Sequence Length |
|------------|-------------|--------|------------|------------------|
| 7B         | 4096        | 64     | 50432      | 2048             |
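If you want to sanity-check these dimensions against the released checkpoint, a quick way is to load and print the Hugging Face config. The exact attribute names depend on the converted config class, so printing the whole object (rather than assuming specific field names) is the safer check:

```python
from transformers import AutoConfig, AutoTokenizer

# Dump the config shipped with the converted checkpoint; it should reflect
# the hidden size, layer count, and vocabulary size listed above.
config = AutoConfig.from_pretrained("tri-ml/mamba-7b-rw")
print(config)

# The tokenizer length should roughly match the vocab size; note that
# vocab sizes are often padded, so a small mismatch is possible.
tokenizer = AutoTokenizer.from_pretrained("tri-ml/mamba-7b-rw")
print(len(tokenizer))
```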
### Training Details
- Mamba-7B was trained using AWS SageMaker on 128 H100 80GB GPUs.
- Training began in March 2024 and lasted three weeks.
| Hyperparameter  | Value    |
|-----------------|----------|
| Precision       | bfloat16 |
| Optimizer       | AdamW    |
| Learning rate   | 3e-4     |
| LR cooldown end | 1e-5     |
| Warmup steps    | 2000     |
| Z-loss          | 1e-4     |
| Batch size      | 2M       |
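As an illustration of how these hyperparameters fit together, the sketch below builds an AdamW optimizer with a 2000-step linear warmup to the 3e-4 peak and a cooldown ending at 1e-5. The cosine shape of the cooldown, the placeholder model, and the total step count are assumptions for illustration; only the warmup length and the two learning-rate endpoints come from the table above, and the actual run used OpenLM's training stack rather than this snippet.

```python
import math
import torch

# Placeholder model and step count, purely for illustration.
model = torch.nn.Linear(4096, 4096)
total_steps = 100_000        # hypothetical; the real step count is not stated here
warmup_steps = 2000          # from the table above
peak_lr, final_lr = 3e-4, 1e-5

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)

def lr_lambda(step):
    # Linear warmup to the peak LR, then a cosine cooldown to the final LR.
    # The cosine shape is an assumption; only the endpoints come from the table.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * min(1.0, progress)))
    return (final_lr + (peak_lr - final_lr) * cosine) / peak_lr

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```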
### Performance Evaluation
Our evaluations were done using the EleutherAI LM Evaluation Harness.
| Model         | HellaSwag | PIQA | Winogrande | ARC-E | ARC-C | MMLU (5-shot) |
|---------------|-----------|------|------------|-------|-------|----------------|
| Mamba-1.4B    | 59.0      | 73.9 | 61.4       | 65.5  | 32.9  | 25.2           |
| Mamba-2.8B    | 71.0      | 78.1 | 65.9       | 68.2  | 41.7  | 26.2           |
| RWKV5-1.7T-7B | 73.0      | 78.6 | 72.9       | 75.8  | 45.6  | 34.9           |
| Llama2-7B     | 76.0      | 79.1 | 69.1       | 76.3  | 46.3  | 45.9           |
| Gemma-7B      | 80.7      | 81.9 | 73.7       | 81.1  | 53.2  | 62.9           |
| Mistral-7B    | 81.0      | 82.1 | 74.0       | 80.9  | 53.8  | 62.4           |
| Mamba-7B      | 77.9      | 81.0 | 71.8       | 77.5  | 46.7  | 33.3           |
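The numbers above can in principle be reproduced with the harness. The sketch below uses its Python entry point; the `simple_evaluate` keyword arguments and task names can differ between harness versions and may not match the exact settings used for the table, so treat it as an outline rather than the exact evaluation command.

```python
import lm_eval

# Evaluate the converted checkpoint on a subset of the benchmarks above.
# Task names and keyword arguments may vary between lm-eval versions.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=tri-ml/mamba-7b-rw,dtype=bfloat16",
    tasks=["hellaswag", "piqa", "winogrande", "arc_easy", "arc_challenge"],
    batch_size=8,
)
print(results["results"])
```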
## Technical Details
This model was trained as a baseline for our paper Linearizing Large Language Models. The Mamba architecture is a state-space model that shows strong performance on natural language tasks without using self-attention.
## License
This model is licensed under the Apache License, Version 2.0.
## How to Cite
If you use this model, please cite our paper on Linearizing Large Language Models.
```bibtex
@article{Mercat2024Linearizing,
  title={Linearizing Large Language Models},
  author={Jean Mercat and Igor Vasiljevic and Sedrick Keh and Kushal Arora and Achal Dave and Adrien Gaidon and Thomas Kollar},
  journal={arXiv preprint arXiv:2405.06640},
  year={2024}
}
```
## Citations
### Mamba
```bibtex
@article{mamba,
  title={Mamba: Linear-Time Sequence Modeling with Selective State Spaces},
  author={Gu, Albert and Dao, Tri},
  journal={arXiv preprint arXiv:2312.00752},
  year={2023}
}
```
### OpenLM
```bibtex
@misc{open_lm,
  author = {Gururangan, Suchin and Wortsman, Mitchell and Gadre, Samir Yitzhak and Dave, Achal and Kilian, Maciej and Shi, Weijia and Mercat, Jean and Smyrnis, Georgios and Ilharco, Gabriel and Jordan, Matt and Heckel, Reinhard and Dimakis, Alex and Farhadi, Ali and Shankar, Vaishaal and Schmidt, Ludwig},
  title = {{open_lm}: a minimal but performative language modeling (LM) repository},
  year = {2023},
  note = {GitHub repository},
  url = {https://github.com/mlfoundations/open_lm/}
}
```