
MoLM-700M-4B

Developed by ibm-research
MoLM is a series of language models based on the Mixture of Experts (MoE) architecture. The 700M-4B version has a total of 4 billion parameters, with a computational cost equivalent to that of a dense model with 700 million parameters.
Downloads: 36
Release Date: 9/13/2023

Model Overview

The MoLM series of language models adopts the Mixture of Experts architecture, maintaining a high parameter count while reducing computational cost through a dynamic activation mechanism, making it suitable for text generation and understanding tasks.
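A minimal usage sketch, assuming the checkpoint is published on Hugging Face under the repository id ibm/MoLM-700M-4B and that, as a custom architecture, it requires trust_remote_code when loading; both details are assumptions rather than facts stated on this card:

```python
# Hypothetical usage sketch: the repository id and the trust_remote_code
# requirement are assumptions, not confirmed by this card.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "ibm/MoLM-700M-4B"  # assumed Hugging Face repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("Mixture of Experts models reduce compute by", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```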

Model Features

Efficient Computing Architecture
Balances high parameter capacity with low computational cost through its Mixture of Experts design.
Modular Inference
Activates only a subset of expert modules per token (this model activates 4 modules); see the routing sketch after this list.
Large-scale Pretraining
Trained on 300 billion tokens of public data.
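To make the modular-inference idea concrete, the following toy sketch shows top-k expert routing, where each token is processed by only a few expert modules. The layer sizes, number of experts, and k = 4 are illustrative assumptions and do not reflect MoLM's actual implementation:

```python
# Toy top-k MoE routing sketch (illustrative only; not MoLM's real architecture).
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=64, num_experts=32, k=4):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)   # scores every expert for each token
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(num_experts)]
        )

    def forward(self, x):                                # x: (tokens, d_model)
        scores = self.router(x)                          # (tokens, num_experts)
        weights, idx = scores.topk(self.k, dim=-1)       # keep only the k best experts per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                       # run only the selected experts
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

layer = ToyMoELayer()
tokens = torch.randn(8, 64)
print(layer(tokens).shape)  # torch.Size([8, 64]); only 4 of 32 experts ran per token
```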

Model Capabilities

Text Generation
Language Understanding
Question Answering

Use Cases

Knowledge Q&A
Open-domain Q&A
Answers various common-sense questions
Achieves 16.49% accuracy in five-shot testing on TriviaQA.
Code Generation
Python Code Completion
Generates Python code snippets based on descriptions
Achieves a 20.27% pass@100 rate on the HumanEval benchmark.
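For context, pass@100 on HumanEval is conventionally computed with the unbiased estimator from the benchmark's original paper; whether the figure above was produced exactly this way is an assumption. A minimal sketch of that estimator:

```python
# Standard unbiased pass@k estimator (Chen et al., 2021):
# pass@k = 1 - C(n - c, k) / C(n, k), with n samples per problem, c of them correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k drawn samples passes, given c of n are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 200 generations per problem, 45 passing, estimating pass@100.
print(round(pass_at_k(200, 45, 100), 4))
```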