Bloom-1b1 Open-Source Multilingual Language Model - Free Support for 46 Natural Languages and 12 Programming Languages

Bloom 1b1

Developed by bigscience

BigScience Large Open-science Open-access Multilingual Language Model, supporting 46 natural languages and 12 programming languages

Tags: Large Language Model, Supports Multiple Languages. Open Source License: OpenRAIL. #Multilingual generation #Ultra-large scale parameters #Open-source research oriented

Downloads 9,128

Release Time : 5/19/2022

Model Overview

BLOOM is a multilingual large language model based on the Transformer architecture, developed by the international collaborative project BigScience, aimed at promoting public research on large language models.

Model Features

Multilingual support

Supports 46 natural languages and 12 programming languages, with special focus on low-resource languages

Open science

Developed as an international collaborative project, adhering to open science principles

Large-scale training

Trained using 384 A100 80GB GPUs; this checkpoint has about 1.07 billion parameters (1,065,314,304)

Eco-friendly computing

Training supercomputer primarily powered by nuclear energy, with waste heat being recycled

Model Capabilities

Text generation

Multilingual text processing

Programming language understanding

In-context learning

Use Cases

Research

Language model research

Exploring the behavior and characteristics of multilingual language models

Cloze test

Used for evaluating language understanding and generation capabilities

Application development

Multilingual applications

Developing text generation applications supporting multiple languages

Downstream task fine-tuning

Serving as a base model for fine-tuning on specific tasks

🚀 BLOOM LM

BigScience Large Open-science Open-access Multilingual Language Model, enabling public research on large language models.

🚀 Quick Start

This README provides a comprehensive overview of the BLOOM LM, including model details, uses, training data, and more.

✨ Features

Multilingual Support: Supports a wide range of languages, including 45 natural languages and 12 programming languages.
Text Generation: Can be used for text generation tasks, such as exploring language characteristics and downstream tasks like information extraction, question answering, and summarization.
Open Science: Created for public research on large language models, promoting open access and collaboration.

📦 Installation

No specific installation steps are provided in the original document.

💻 Usage Examples

No code examples are provided in the original document.
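
Although the original model card ships no code, a minimal sketch of text generation might look like the following. It assumes that the Hugging Face `transformers` library and PyTorch are installed and that the checkpoint is published under the hub name `bigscience/bloom-1b1`. The helper name `generate_text` and the generation settings are illustrative choices, not part of the original card.

```python
def generate_text(prompt: str,
                  model_name: str = "bigscience/bloom-1b1",  # assumed hub id
                  max_new_tokens: int = 30) -> str:
    """Greedy-decode a continuation of `prompt` (illustrative helper)."""
    # Imported lazily so that defining the helper does not require the libraries.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Tokenize the prompt into PyTorch tensors and decode greedily.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_text("La capitale de la France est")` downloads the weights on first use. Given the out-of-scope uses listed in this card, the output should be treated as generated text rather than as a reliable fact.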

📚 Documentation

Model Details

Basics

Developed by: BigScience (website)
Model Type: Transformer-based Language Model
Version: 1.0.0
Languages: Multiple; see training data
License: RAIL License v1.0 (link)
Release Date Estimate: Monday, 11.July.2022
Send Questions to: bigscience-contact@googlegroups.com
Cite as: BigScience, BigScience Language Open-science Open-access Multilingual (BLOOM) Language Model. International, May 2021 - May 2022
Funded by:
- The French government.
- Hugging Face (website).
- Organizations of contributors. (Further breakdown of organizations forthcoming.)

Technical Specifications

Model Architecture: Modified from Megatron-LM GPT2 (see paper, BLOOM Megatron code):
- Decoder-only architecture
- Layer normalization applied to word embeddings layer (StableEmbedding; see code, paper)
- ALiBi positional encodings (see paper), with GeLU activation functions
- 1,065,314,304 parameters:
  - 385,351,680 embedding parameters
  - 24 layers, 16 attention heads
  - Hidden layers are 1536-dimensional
  - Sequence length of 2048 tokens used (see BLOOM tokenizer, tokenizer description)
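
ALiBi replaces learned position embeddings with a per-head linear penalty that is added to the attention logits and grows with the distance between query and key. The sketch below illustrates the idea for the 16 heads listed above. It assumes the geometric slope schedule described in the ALiBi paper for power-of-two head counts; the function names are hypothetical and this is not the BLOOM training code.

```python
def alibi_slopes(num_heads: int) -> list[float]:
    """Per-head slopes 2^(-8*i/num_heads) for i = 1..num_heads.

    Assumes num_heads is a power of two (true for the 16 heads listed above).
    """
    if num_heads <= 0 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    return [2.0 ** (-8.0 * i / num_heads) for i in range(1, num_heads + 1)]


def alibi_bias(slope: float, seq_len: int) -> list[list[float]]:
    """Causal bias matrix for one head: -slope * (query_pos - key_pos).

    Entries for future keys (key_pos > query_pos) are -inf, i.e. masked out.
    """
    return [
        [-slope * (q - k) if k <= q else float("-inf") for k in range(seq_len)]
        for q in range(seq_len)
    ]


slopes = alibi_slopes(16)                 # one slope per attention head
bias = alibi_bias(slopes[0], seq_len=4)   # bias added to head 0's logits
```

Because the penalty depends only on relative distance, no position-specific parameters are learned.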
Objective Function: Cross Entropy with mean reduction (see API documentation).
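
As a rough illustration of the objective, the sketch below computes cross entropy with mean reduction in plain Python: the negative log-probability of each target token under a softmax over the logits, averaged over positions. The function name and the toy inputs are assumptions made for this example; actual training would use a framework implementation.

```python
import math


def cross_entropy_mean(logits: list[list[float]], targets: list[int]) -> float:
    """Mean over positions of -log softmax(logits)[target]."""
    total = 0.0
    for row, target in zip(logits, targets):
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]
    return total / len(targets)


# Uniform logits over 4 classes give a loss of log(4), whatever the targets are.
loss = cross_entropy_mean([[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]], [2, 0])
```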
Compute infrastructure: Jean Zay Public Supercomputer, provided by the French government (see announcement).
- Hardware: 384 A100 80GB GPUs (48 nodes):
  - Additional 32 A100 80GB GPUs (4 nodes) in reserve
  - 8 GPUs per node, using NVLink 4 inter-GPU connects and 4 OmniPath links
  - CPU: AMD
  - CPU memory: 512GB per node
  - GPU memory: 640GB per node
  - Inter-node connect: Omni-Path Architecture (OPA)
  - NCCL-communications network: a fully dedicated subnet
  - Disc IO network: shared network with other types of nodes
- Software:
  - Megatron-DeepSpeed (Github link)
  - DeepSpeed (Github link)
  - PyTorch (pytorch-1.11 w/ CUDA-11.5; see Github link)
  - apex (Github link)
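
The parameter figures listed under Model Architecture can be cross-checked with a little arithmetic. The sketch below assumes a standard GPT-style block (fused query/key/value projection, output projection, an MLP with a 4x hidden expansion, biases everywhere, two LayerNorms per block) and an output layer that shares the embedding matrix; these layout details are assumptions, not statements from the card. Under them, the count reproduces the reported total.

```python
HIDDEN = 1536                      # hidden size (from the card)
LAYERS = 24                        # number of Transformer blocks (from the card)
EMBEDDING_PARAMS = 385_351_680     # from the card
REPORTED_TOTAL = 1_065_314_304     # from the card

# Rows of the embedding matrix implied by the two published figures.
embedding_rows = EMBEDDING_PARAMS // HIDDEN


def layer_norm_params(h: int) -> int:
    """Weight and bias vectors of one LayerNorm."""
    return 2 * h


def block_params(h: int) -> int:
    """Assumed block layout: fused QKV, output projection, 4h MLP, two LayerNorms."""
    attention = (h * 3 * h + 3 * h) + (h * h + h)
    mlp = (h * 4 * h + 4 * h) + (4 * h * h + h)
    return attention + mlp + 2 * layer_norm_params(h)


total = (
    EMBEDDING_PARAMS
    + layer_norm_params(HIDDEN)        # LayerNorm applied to the embeddings
    + LAYERS * block_params(HIDDEN)
    + layer_norm_params(HIDDEN)        # final LayerNorm
)
```

The implied embedding matrix has 250,880 rows, slightly more than the 250,680-entry tokenizer vocabulary; a plausible reading is that the matrix is padded, though the card does not say so.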

Training

Training logs: Tensorboard link
Number of epochs: 1
Dates:
- Started 11th March, 2022 11:42am PST
- Ended 5th July, 2022
Estimated cost of training: Equivalent of $2-5M in cloud computing (including preliminary experiments and other model sizes)
Server training location: Île-de-France, France

Tokenization

The BLOOM tokenizer (link) is a learned subword tokenizer trained using:

- A byte-level Byte Pair Encoding (BPE) algorithm
- A simple pre-tokenization rule, no normalization
- A vocabulary size of 250,680

It was trained on a subset of a preliminary version of the corpus using alpha-weighting per language.
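
To make the byte-level BPE idea concrete, here is a toy sketch: text is first turned into UTF-8 bytes (a base vocabulary of 256 symbols), and the most frequent adjacent pair is then merged repeatedly into a new symbol. It is only an illustration of the algorithm, not the BLOOM tokenizer's training code; pre-tokenization and per-language alpha-weighting are left out, and all function names are hypothetical.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Return the most common adjacent symbol pair across all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(seq, pair, new_symbol):
    """Replace each non-overlapping occurrence of `pair` with `new_symbol`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(texts, num_merges):
    """Learn up to `num_merges` merges over UTF-8 bytes.

    Returns the merge table and the corpus encoded with those merges.
    """
    sequences = [list(t.encode("utf-8")) for t in texts]  # symbols 0..255
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        new_symbol = 256 + step  # ids above 255 denote merged tokens
        merges[pair] = new_symbol
        sequences = [merge_pair(s, pair, new_symbol) for s in sequences]
    return merges, sequences


merges, encoded = train_bpe(["abab", "abc"], num_merges=1)
```

Working on bytes means any string can be encoded without an unknown-token fallback, which matters for a corpus spanning many scripts.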

Environmental Impact

The training supercomputer, Jean Zay (website), uses mostly nuclear energy. The heat generated by it is reused for heating campus housing.

Estimated carbon emissions: (Forthcoming upon completion of training.)
Estimated electricity usage: (Forthcoming upon completion of training.)

Uses

Intended Use

This model is being created in order to enable public research on large language models (LLMs). LLMs are intended to be used for language generation or as a pretrained base model that can be further fine-tuned for specific tasks. Use cases below are not exhaustive.

Direct Use:
- Text generation
- Exploring characteristics of language generated by a language model
  - Examples: Cloze tests, counterfactuals, generations with reframings
Downstream Use:
- Tasks that leverage language models include: Information Extraction, Question Answering, Summarization

Misuse and Out-of-scope Use

See the BLOOM License, Attachment A, for detailed usage restrictions. The list below is non-exhaustive, but covers some easily foreseeable problematic use cases.

Out-of-scope Uses: Using the model in high-stakes settings is out of scope for this model. The model is not designed for critical decisions or for uses with any material consequences on an individual's livelihood or wellbeing. The model can output content that appears factual but is not correct.
- Out-of-scope Uses Include:
  - Usage in biomedical domains, political and legal domains, or finance domains
  - Usage for evaluating or scoring individuals, such as for employment, education, or credit
  - Applying the model for critical automatic decisions, generating factual content, creating reliable summaries, or generating predictions that must be correct
Misuse: Intentionally using the model for harm, violating human rights, or other kinds of malicious activities, is a misuse of this model. This includes:
- Spam generation
- Disinformation and influence operations
- Disparagement and defamation
- Harassment and abuse
- Deception
- Unconsented impersonation and imitation
- Unconsented surveillance
- Generating content without attribution to the model, as specified in the RAIL License, Use Restrictions

Intended Users

Direct Users:
- General Public
- Researchers
- Students
- Educators
- Engineers/developers
- Non-commercial entities
- Community advocates, including human and civil rights groups
Indirect Users:
- Users of derivatives created by Direct Users, such as those using software with an intended use
- Users of Derivatives of the Model, as described in the License
Others Affected (Parties Prenantes):
- People and groups referred to by the LLM
- People and groups exposed to outputs of, or decisions based on, the LLM
- People and groups whose original work is included in the LLM

Training Data

Details for each dataset are provided in individual Data Cards. Training data includes:

- 45 natural languages
- 12 programming languages
- In 1.5TB of pre-processed text, converted into 350B unique tokens (see the tokenizer section for more)
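
The corpus figures imply a rough compression ratio for the tokenizer. The back-of-the-envelope sketch below assumes decimal terabytes and, for the sequence count, that tokens are packed into full 2048-token sequences with no padding; both are assumptions made for this example.

```python
CORPUS_BYTES = 1.5e12    # 1.5TB of pre-processed text, assuming decimal terabytes
CORPUS_TOKENS = 350e9    # 350B tokens
SEQ_LEN = 2048           # sequence length from the technical specifications

# Average number of bytes of text represented by one token.
bytes_per_token = CORPUS_BYTES / CORPUS_TOKENS

# Number of full training sequences in the corpus, assuming perfect packing.
sequences_in_corpus = CORPUS_TOKENS / SEQ_LEN

print(f"{bytes_per_token:.2f} bytes per token")
print(f"{sequences_in_corpus / 1e6:.1f} million sequences")
```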

Languages

The pie chart shows the distribution of languages in training data.

[Figure: pie chart showing the distribution of languages in training data]

The following table shows the further distribution of Niger-Congo and Indic languages in the training data.

Niger-Congo	Percentage	Indic	Percentage
Chi Tumbuka	0.00002	Assamese	0.01
Kikuyu	0.00004	Odia	0.04
Bambara	0.00004	Gujarati	0.04
Akan	0.00007	Marathi	0.05
Xitsonga	0.00007	Punjabi	0.05
Sesotho	0.00007	Kannada	0.06
Chi Chewa	0.0001	Nepali	0.07
Setswana	0.0002	Telugu	0.09
Northern Sotho	0.0002	Malayalam	0.10
Fon	0.0002	Urdu	0.10
Kirundi	0.0003	Tamil	0.20
Wolof	0.0004	Bengali	0.50
Luganda	0.0004	Hindi	0.70
Chi Shona	0.001
Isi Zulu	0.001
Igbo	0.001
Xhosa	0.001
Kinyarwanda	0.003
Yoruba	0.006
Swahili	0.02

The following table shows the distribution of programming languages.

Extension	Language	Number of files
java	Java	5,407,724
php	PHP	4,942,186
cpp	C++	2,503,930
py	Python	2,435,072
js	JavaScript	1,905,518
cs	C#	1,577,347
rb	Ruby	678,413
cc	C++	443,054
hpp	C++	391,048
lua	Lua	352,317
go	Go	227,763
ts	TypeScript	195,254
C	C	134,537
scala	Scala	92,052
hh	C++	67,161
H	C++	55,899
tsx	TypeScript	33,107
rs	Rust	29,693
phpt	PHP	9,702
c++	C++	1,342
h++	C++	791
php3	PHP	540
phps	PHP	270
php5	PHP	166
php4	PHP	29
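
Since the table is keyed by file extension, several rows belong to the same language. The sketch below, with the rows copied from the table above, aggregates the counts per language; the variable names are illustrative.

```python
from collections import defaultdict

# (extension, language, number of files), copied from the table above
ROWS = [
    ("java", "Java", 5_407_724), ("php", "PHP", 4_942_186),
    ("cpp", "C++", 2_503_930), ("py", "Python", 2_435_072),
    ("js", "JavaScript", 1_905_518), ("cs", "C#", 1_577_347),
    ("rb", "Ruby", 678_413), ("cc", "C++", 443_054),
    ("hpp", "C++", 391_048), ("lua", "Lua", 352_317),
    ("go", "Go", 227_763), ("ts", "TypeScript", 195_254),
    ("C", "C", 134_537), ("scala", "Scala", 92_052),
    ("hh", "C++", 67_161), ("H", "C++", 55_899),
    ("tsx", "TypeScript", 33_107), ("rs", "Rust", 29_693),
    ("phpt", "PHP", 9_702), ("c++", "C++", 1_342),
    ("h++", "C++", 791), ("php3", "PHP", 540),
    ("phps", "PHP", 270), ("php5", "PHP", 166),
    ("php4", "PHP", 29),
]

# Sum the file counts of all extensions that map to the same language.
files_per_language = defaultdict(int)
for _extension, language, count in ROWS:
    files_per_language[language] += count

# Languages ordered from most to fewest files.
ranking = sorted(files_per_language.items(), key=lambda item: item[1], reverse=True)

for language, count in ranking:
    print(f"{language:<12}{count:>12,}")
```

Grouped this way, Java and PHP remain the two largest sources, followed by C++ once its seven extensions are combined.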

🔧 Technical Details

The technical details are provided in the "Model Details" section, including model architecture, objective function, compute infrastructure, training, and tokenization.

📄 License

The model is licensed under the RAIL License v1.0 (link).
