BigScience Large Open-science Open-access Multilingual Language Model (BLOOM)
BLOOM is an autoregressive Large Language Model (LLM), trained on vast amounts of text data using industrial-scale computational resources to continue text from a prompt. It can output coherent text in 46 natural languages and 13 programming languages that is hardly distinguishable from human-written text, and it can be instructed to perform text tasks it has not been explicitly trained for by casting them as text generation tasks.
Quick Start
The model can be used for a variety of text-related tasks: provide a text prompt and the model generates a continuation. For more detailed usage, refer to the official documentation and the code examples below.
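A minimal quick-start sketch, assuming the model is loaded through the Hugging Face transformers library (the prompt and the max_new_tokens value are illustrative choices):

```python
from transformers import pipeline

# Load BLOOM as a text-generation pipeline. This downloads the full 176B checkpoint;
# a smaller variant such as "bigscience/bloom-560m" can be substituted for quick tests.
generator = pipeline("text-generation", model="bigscience/bloom")

result = generator("The BigScience workshop was created to", max_new_tokens=30)
print(result[0]["generated_text"])
```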
Features
- Multilingual Support: Capable of generating text in 46 natural languages and 13 programming languages.
- Task Adaptability: Can perform text tasks not explicitly trained for by converting them into text generation tasks.
- High-Quality Output: Produces coherent text that is difficult to distinguish from human-written text.
Installation
No specific installation steps are provided in the original README.
Usage Examples
Basic Usage
The basic way to use the model is to provide a text prompt and let it generate a continuation. For example, for a text-generation task (assuming the transformers library is installed):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="bigscience/bloom")

prompt = "A 'whatpu' is a small, furry animal native to Tanzania. An example of a sentence that uses the word whatpu is: We were traveling in Africa and we saw these very cute whatpus. | To do a 'farduddle' means to jump up and down really fast. An example of a sentence that uses the word farduddle is:"
generated_text = generator(prompt, max_new_tokens=50)[0]["generated_text"]
print(generated_text)
```
Advanced Usage
For more complex tasks, such as code generation or translation, adjust the input prompt to the task. For example, for a code-generation task (reusing the generator pipeline from the Basic Usage example):

```python
prompt = "Do a hello world in different languages: Python: print(\"hello world\") R:"
generated_code = generator(prompt, max_new_tokens=30)[0]["generated_text"]
print(generated_code)
```
Documentation
Basics
This section provides information about the model type, version, license, funders, release date, developers, and contact information. It is useful for anyone who wants to reference the model.
Developed by: BigScience (website)
All collaborators are either volunteers or have an agreement with their employer. (Further breakdown of participants forthcoming.)
Model Type: Transformer-based Language Model
Checkpoints format: transformers
(Megatron-DeepSpeed format available [here](https://huggingface.co/bigscience/bloom-optimizer-states))
Version: 1.0.0
Languages: Multiple; see [training data](#training-data)
License: RAIL License v1.0 (link / [article and FAQ](https://bigscience.huggingface.co/blog/the-bigscience-rail-license))
Release Date Estimate: Monday, 11 July 2022
Send Questions to: bigscience-contact@googlegroups.com
Cite as: BigScience, BigScience Large Open-science Open-access Multilingual (BLOOM) Language Model. International, May 2021 - May 2022
Funded by:
- The French government.
- Hugging Face (website).
- Organizations of contributors. (Further breakdown of organizations forthcoming.)
Technical Specifications
This section includes details about the model objective and architecture, and the compute infrastructure. It is useful for people interested in model development.
Please see [the BLOOM training README](https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml#readme) for full details on replicating training.
Model Architecture and Objective
- Modified from Megatron-LM GPT2 (see paper, [BLOOM Megatron code](https://github.com/bigscience-workshop/Megatron-DeepSpeed)):
- Decoder-only architecture
- Layer normalization applied to word embeddings layer (`StableEmbedding`; see code, paper)
- ALiBi positional encodings (see paper), with GeLU activation functions
- 176,247,271,424 parameters:
- 3,596,615,680 embedding parameters
- 70 layers, 112 attention heads
- Hidden layers are 14336-dimensional
- Sequence length of 2048 tokens used (see BLOOM tokenizer, tokenizer description)
Objective Function: Cross Entropy with mean reduction (see API documentation).
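The hyperparameters above can be cross-checked against the published model configuration. A minimal sketch, assuming the transformers library and access to the Hugging Face Hub (the printed field names follow BloomConfig):

```python
from transformers import AutoConfig

# Fetch only the configuration file; no model weights are downloaded.
config = AutoConfig.from_pretrained("bigscience/bloom")

print(config.n_layer)      # expected: 70 layers
print(config.n_head)       # expected: 112 attention heads
print(config.hidden_size)  # expected: 14336-dimensional hidden layers
print(config.vocab_size)   # size of the tokenizer vocabulary
```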
Compute infrastructure
Jean Zay Public Supercomputer, provided by the French government (see [announcement](https://www.enseignementsup-recherche.gouv.fr/fr/signature-du-marche-d-acquisition-de-l-un-des-supercalculateurs-les-plus-puissants-d-europe-46733)).
Hardware
- 384 A100 80GB GPUs (48 nodes)
- Additional 32 A100 80GB GPUs (4 nodes) in reserve
- 8 GPUs per node, using NVLink 4 inter-GPU connects and 4 OmniPath links
- CPU: AMD
- CPU memory: 512GB per node
- GPU memory: 640GB per node
- Inter - node connect: Omni - Path Architecture (OPA)
- NCCL - communications network: a fully dedicated subnet
- Disc IO network: shared network with other types of nodes
Software
- Megatron-DeepSpeed ([Github link](https://github.com/bigscience-workshop/Megatron-DeepSpeed))
- DeepSpeed (Github link)
- PyTorch (pytorch-1.11 w/ CUDA-11.5; see Github link)
- apex (Github link)
Technical Details
Training
This section provides information about the training data, the speed and size of training elements, and the environmental impact of training. It is useful for people who want to learn more about the model inputs and training footprint.
Training Data
Details for each dataset are provided in individual Data Cards, and the sizes of each of their contributions to the aggregated training data are presented in an [Interactive Corpus Map](https://huggingface.co/spaces/bigscience-catalogue-lm-data/corpus-map).
Training data includes:
- 46 natural languages
- 13 programming languages
- In 1.6TB of pre-processed text, converted into 350B unique tokens (see the tokenizer section for more).
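As an illustrative sketch of how the multilingual tokenizer splits text (assuming the transformers library; the sample strings are arbitrary):

```python
from transformers import AutoTokenizer

# The BLOOM tokenizer is a byte-level BPE with a vocabulary of roughly 250k tokens.
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")

for text in ["The quick brown fox", "Le renard brun rapide", "def hello(): print('hi')"]:
    ids = tokenizer(text)["input_ids"]
    print(len(ids), tokenizer.convert_ids_to_tokens(ids))
```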
Languages
The pie chart shows the distribution of languages in training data.

The following table shows the further distribution of Niger-Congo & Indic languages in the training data.
Distribution of Niger Congo and Indic languages.
| Niger Congo | Percentage | Indic | Percentage |
|---|---|---|---|
| Chi Tumbuka | 0.00002 | Assamese | 0.01 |
| Kikuyu | 0.00004 | Odia | 0.04 |
| Bambara | 0.00004 | Gujarati | 0.04 |
| ... | ... | ... | ... |
CO₂ Equivalent Emissions
- Emissions: 24,700,000 g of CO₂eq (approximately 24.7 tonnes)
- Source: "Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model. https://arxiv.org/abs/2211.02001"
- Training Type: "pre-training"
- Geographical Location: "Orsay, France"
- Hardware Used: "384 A100 80GB GPUs"
Model Index
- Name: bloom
- Results on the openai_humaneval (HumanEval) dataset:

| Metric | Type | Value | Verified |
|---|---|---|---|
| pass@1 | pass@1 | 0.15542682926829265 | false |
| pass@10 | pass@10 | 0.3278356276947017 | false |
| pass@100 | pass@100 | 0.5719815685597749 | false |
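For reference, pass@k figures of this kind are conventionally computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021). The sketch below is illustrative only and is not the exact evaluation harness behind the numbers above:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable running product.
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 generated samples for one problem, 31 of which pass the unit tests.
print(pass_at_k(n=200, c=31, k=1), pass_at_k(n=200, c=31, k=10))
```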
License
The model is licensed under the RAIL License v1.0 (link / [article and FAQ](https://bigscience.huggingface.co/blog/the-bigscience-rail-license)).