BigScience Large Open-science Open-access Multilingual Language Model (BLOOM)
BLOOM is an autoregressive Large Language Model (LLM), trained on vast amounts of text data using industrial-scale computational resources to continue text from a prompt. It can output coherent text in 46 natural languages and 13 programming languages that is hardly distinguishable from human-written text, and it can be instructed to perform text tasks it has not been explicitly trained for by casting them as text generation tasks.
Quick Start
The model can be used for a variety of text-related tasks: provide a text prompt and the model generates a continuation. For more detailed usage, refer to the official documentation and the code examples below.
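A minimal quick-start sketch, assuming the model is loaded through the Hugging Face transformers library (the prompt and the max_new_tokens value are illustrative choices):

```python
from transformers import pipeline

# Load BLOOM as a text-generation pipeline. This downloads the full 176B checkpoint;
# a smaller variant such as "bigscience/bloom-560m" can be substituted for quick tests.
generator = pipeline("text-generation", model="bigscience/bloom")

result = generator("The BigScience workshop was created to", max_new_tokens=30)
print(result[0]["generated_text"])
```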
Features
- Multilingual Support: Capable of generating text in 46 natural languages and 13 programming languages.
- Task Adaptability: Can perform text tasks not explicitly trained for by converting them into text generation tasks.
- High-Quality Output: Produces coherent text that is difficult to distinguish from human-written text.
Installation
No specific installation steps are provided in the original README.
Usage Examples
Basic Usage
The basic way to use the model is to provide a text prompt and let it generate a continuation. For example, for a text-generation task (assuming the transformers library is installed):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="bigscience/bloom")

prompt = "A 'whatpu' is a small, furry animal native to Tanzania. An example of a sentence that uses the word whatpu is: We were traveling in Africa and we saw these very cute whatpus. | To do a 'farduddle' means to jump up and down really fast. An example of a sentence that uses the word farduddle is:"
generated_text = generator(prompt, max_new_tokens=50)[0]["generated_text"]
print(generated_text)
```
Advanced Usage
For more complex tasks, such as code generation or translation, adjust the input prompt to the task. For example, for a code-generation task (reusing the generator pipeline from the Basic Usage example):

```python
prompt = "Do a hello world in different languages: Python: print(\"hello world\") R:"
generated_code = generator(prompt, max_new_tokens=30)[0]["generated_text"]
print(generated_code)
```
Documentation
Basics
This section provides information about the model type, version, license, funders, release date, developers, and contact information. It is useful for anyone who wants to reference the model.
Developed by: BigScience (website)
All collaborators are either volunteers or have an agreement with their employer. (Further breakdown of participants forthcoming.)
Model Type: Transformer-based Language Model
Checkpoints format: transformers
(Megatron-DeepSpeed format available [here](https://huggingface.co/bigscience/bloom-optimizer-states))
Version: 1.0.0
Languages: Multiple; see [training data](#training-data)
License: RAIL License v1.0 (link / [article and FAQ](https://bigscience.huggingface.co/blog/the-bigscience-rail-license))
Release Date Estimate: Monday, 11 July 2022
Send Questions to: bigscience-contact@googlegroups.com
Cite as: BigScience, BigScience Large Open-science Open-access Multilingual (BLOOM) Language Model. International, May 2021 - May 2022
Funded by:
- The French government.
- Hugging Face (website).
- Organizations of contributors. (Further breakdown of organizations forthcoming.)
Technical Specifications
This section includes details about the model objective and architecture, and the compute infrastructure. It is useful for people interested in model development.
Please see [the BLOOM training README](https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml#readme) for full details on replicating training.
Model Architecture and Objective
- Modified from Megatron-LM GPT2 (see paper, [BLOOM Megatron code](https://github.com/bigscience-workshop/Megatron-DeepSpeed)):
- Decoder-only architecture
- Layer normalization applied to word embeddings layer (`StableEmbedding`; see code, paper)
- ALiBi positional encodings (see paper), with GeLU activation functions
- 176,247,271,424 parameters:
- 3,596,615,680 embedding parameters
- 70 layers, 112 attention heads
- Hidden layers are 14336-dimensional
- Sequence length of 2048 tokens used (see BLOOM tokenizer, tokenizer description)
Objective Function: Cross Entropy with mean reduction (see API documentation).
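The hyperparameters above can be cross-checked against the published model configuration. A minimal sketch, assuming the transformers library and access to the Hugging Face Hub (the printed field names follow BloomConfig):

```python
from transformers import AutoConfig

# Fetch only the configuration file; no model weights are downloaded.
config = AutoConfig.from_pretrained("bigscience/bloom")

print(config.n_layer)      # expected: 70 layers
print(config.n_head)       # expected: 112 attention heads
print(config.hidden_size)  # expected: 14336-dimensional hidden layers
print(config.vocab_size)   # size of the tokenizer vocabulary
```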
Compute infrastructure
Jean Zay Public Supercomputer, provided by the French government (see [announcement](https://www.enseignementsup-recherche.gouv.fr/fr/signature-du-marche-d-acquisition-de-l-un-des-supercalculateurs-les-plus-puissants-d-europe-46733)).
Hardware
- 384 A100 80GB GPUs (48 nodes)
- Additional 32 A100 80GB GPUs (4 nodes) in reserve
- 8 GPUs per node, using NVLink 4 inter-GPU connects and 4 OmniPath links
- CPU: AMD
- CPU memory: 512GB per node
- GPU memory: 640GB per node
- Inter - node connect: Omni - Path Architecture (OPA)
- NCCL - communications network: a fully dedicated subnet
- Disc IO network: shared network with other types of nodes
Software
- Megatron-DeepSpeed ([Github link](https://github.com/bigscience-workshop/Megatron-DeepSpeed))
- DeepSpeed (Github link)
- PyTorch (pytorch-1.11 w/ CUDA-11.5; see Github link)
- apex (Github link)
Technical Details
Training
This section provides information about the training data, the speed and size of training elements, and the environmental impact of training. It is useful for people who want to learn more about the model inputs and training footprint.
Training Data
Details for each dataset are provided in individual Data Cards, and the sizes of each of their contributions to the aggregated training data are presented in an [Interactive Corpus Map](https://huggingface.co/spaces/bigscience-catalogue-lm-data/corpus-map).
Training data includes:
- 46 natural languages
- 13 programming languages
- In 1.6TB of pre-processed text, converted into 350B unique tokens (see the tokenizer section for more).
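As an illustrative sketch of how the multilingual tokenizer splits text (assuming the transformers library; the sample strings are arbitrary):

```python
from transformers import AutoTokenizer

# The BLOOM tokenizer is a byte-level BPE with a vocabulary of roughly 250k tokens.
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")

for text in ["The quick brown fox", "Le renard brun rapide", "def hello(): print('hi')"]:
    ids = tokenizer(text)["input_ids"]
    print(len(ids), tokenizer.convert_ids_to_tokens(ids))
```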
Languages
The pie chart shows the distribution of languages in training data.

The following table shows the further distribution of Niger-Congo & Indic languages in the training data.
Distribution of Niger Congo and Indic languages.
| Niger Congo | Percentage | Indic | Percentage |
|---|---|---|---|
| Chi Tumbuka | 0.00002 | Assamese | 0.01 |
| Kikuyu | 0.00004 | Odia | 0.04 |
| Bambara | 0.00004 | Gujarati | 0.04 |
| ... | ... | ... | ... |
CO₂ Equivalent Emissions
- Emissions: 24,700,000 g of CO₂eq (approximately 24.7 tonnes)
- Source: "Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model. https://arxiv.org/abs/2211.02001"
- Training Type: "pre-training"
- Geographical Location: "Orsay, France"
- Hardware Used: "384 A100 80GB GPUs"
Model Index
- Name: bloom
- Results on the openai_humaneval (HumanEval) dataset:

| Metric | Type | Value | Verified |
|---|---|---|---|
| pass@1 | pass@1 | 0.15542682926829265 | false |
| pass@10 | pass@10 | 0.3278356276947017 | false |
| pass@100 | pass@100 | 0.5719815685597749 | false |
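For reference, pass@k figures of this kind are conventionally computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021). The sketch below is illustrative only and is not the exact evaluation harness behind the numbers above:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable running product.
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 generated samples for one problem, 31 of which pass the unit tests.
print(pass_at_k(n=200, c=31, k=1), pass_at_k(n=200, c=31, k=10))
```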
License
The model is licensed under the RAIL License v1.0 (link / [article and FAQ](https://bigscience.huggingface.co/blog/the-bigscience-rail-license)).