🚀 GPT-SW3: A Multilingual Large Language Model
GPT-SW3 is a collection of large decoder-only pretrained transformer language models capable of generating coherent text in multiple languages and programming languages. It offers various model versions and can be instructed to perform diverse text tasks.
🚀 Quick Start
Since this is a private repository, you need to log in with your access token using `huggingface-cli login` before accessing the model from Python. Refer to the HuggingFace Quick Start Guide for more details.
✨ Features
- Multilingual Support: Capable of generating text in Danish, Swedish, English, Norwegian, and Icelandic, as well as 4 programming languages.
- Instruction Following: Can be instructed to perform text tasks not explicitly trained for by casting them as text generation tasks.
- Multiple Model Versions: Offers base models, instruct models, and quantized models with different scales.
📦 Installation
As this is a private repository, you need to log in with your access token to access the model from Python. Use the following command:
```bash
huggingface-cli login
```
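If you prefer to authenticate from Python instead of the CLI, the `huggingface_hub` library provides a `login` helper. A minimal sketch, assuming you have a read-access token (the token string below is a placeholder, not a real credential):

```python
# Programmatic alternative to `huggingface-cli login`.
# Requires: pip install huggingface_hub
from huggingface_hub import login

# Replace with your own access token from https://huggingface.co/settings/tokens
login(token="hf_xxxxxxxxxxxxxxxxxxxx")
```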
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

# Load the model and tokenizer, and move the model to GPU if one is available.
model_name = "AI-Sweden-Models/gpt-sw3-20b-instruct"
device = "cuda:0" if torch.cuda.is_available() else "cpu"
prompt = "Träd är fina för att"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
model.to(device)

# Tokenize the prompt and generate up to 100 new tokens with sampling.
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].to(device)

generated_token_ids = model.generate(
    inputs=input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.6,
    top_p=1,
)[0]

# Decode the generated token ids back into text.
generated_text = tokenizer.decode(generated_token_ids)
print(generated_text)
```
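The instruct checkpoints were fine-tuned on a turn-based chat format; the authoritative template is documented on the model card. As an illustration only, a hypothetical helper that wraps a user message in a simple User/Bot turn structure before generation (the template string is an assumption, not the official format):

```python
# Hypothetical chat-style prompt wrapper for the instruct models (illustrative only).
# Check the model card for the exact chat template used during fine-tuning.
def build_chat_prompt(user_message: str) -> str:
    return (
        "<|endoftext|><s>\n"
        f"User:\n{user_message}\n"
        "<s>\nBot:\n"
    )

chat_prompt = build_chat_prompt("Varför är träd fina?")
input_ids = tokenizer(chat_prompt, return_tensors="pt")["input_ids"].to(device)
output = model.generate(
    inputs=input_ids, max_new_tokens=100, do_sample=True, temperature=0.6, top_p=1
)[0]
print(tokenizer.decode(output))
```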
Using HuggingFace Pipeline
```python
# Reuse the tokenizer and model loaded above via the text-generation pipeline.
generator = pipeline("text-generation", tokenizer=tokenizer, model=model, device=device)

generated = generator(prompt, max_new_tokens=100, do_sample=True, temperature=0.6, top_p=1)[0]["generated_text"]
print(generated)
```
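The 20B instruct checkpoint is large. If GPU memory is tight, a smaller model version or reduced precision may help. A minimal sketch, assuming a CUDA device and that a smaller checkpoint with the name below is available in the AI-Sweden-Models organization (check the hub for the sizes actually published):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumed smaller checkpoint name; verify available sizes on the AI-Sweden-Models hub page.
small_model_name = "AI-Sweden-Models/gpt-sw3-1.3b-instruct"

tokenizer = AutoTokenizer.from_pretrained(small_model_name)
# Load weights in float16 to roughly halve GPU memory use (requires a CUDA device).
model = AutoModelForCausalLM.from_pretrained(small_model_name, torch_dtype=torch.float16).to("cuda:0")
model.eval()
```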
📚 Documentation
Model Description
GPT-SW3 is a collection of large decoder-only pretrained transformer language models developed by AI Sweden in collaboration with RISE and the WASP WARA for Media and Language. It has been trained on a dataset containing 320B tokens in Swedish, Norwegian, Danish, Icelandic, English, and programming code, using a causal language modeling (CLM) objective with the NeMo Megatron GPT implementation. The instruct models were fine-tuned on instruction data in both chat and raw text formats.
Intended Use
GPT-SW3 is an autoregressive large language model capable of generating coherent text in 5 different languages and 4 programming languages. It can also be instructed to perform text tasks it has not been explicitly trained for by casting them as text generation tasks.
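As an illustration of casting a task as text generation, here is a hypothetical few-shot prompt that frames Swedish-to-English translation as plain continuation, reusing the tokenizer, model, and device from the usage example above (the prompt wording and examples are assumptions, not part of the training setup):

```python
# Hypothetical few-shot prompt that casts translation as text generation.
few_shot_prompt = (
    "Svenska: Jag gillar att läsa böcker.\n"
    "English: I like to read books.\n"
    "Svenska: Träd är fina.\n"
    "English:"
)

input_ids = tokenizer(few_shot_prompt, return_tensors="pt")["input_ids"].to(device)
output = model.generate(inputs=input_ids, max_new_tokens=20, do_sample=False)[0]
print(tokenizer.decode(output))
```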
Limitations
Like other large language models, GPT-SW3 has limitations in terms of bias, safety, generation diversity, and hallucination. The model may overrepresent some viewpoints, contain stereotypes, generate inappropriate language, make errors, and produce irrelevant or repetitive outputs.
Compliance
The release of GPT-SW3 consists of model weights, a configuration file, a tokenizer file, and a vocabulary file, none of which contain any personally identifiable information (PII) or copyrighted material.
Model Card
We provide a model card for GPT-SW3 following Mitchell et al. (2018).
Model Details
| Property | Details |
|----------|---------|
| Developer | AI Sweden in collaboration with RISE and the WASP WARA for Media and Language |
| Release Date | 2022-12-20 |
| Model Version | Second generation of GPT-SW3 |
| Model Type | Large decoder-only transformer language model |
| Training Algorithm | Trained with the NeMo Megatron GPT implementation |
| Paper or Resource | N/A |
| License | LICENSE |
| Contact | nlu@ai.se |
Intended Use
- Primary Uses: Pre-release for research and evaluation of large language model capabilities for Nordic languages.
- Primary Users: Organizations and individuals in the Nordic NLP ecosystem who can contribute to model validation and testing.
- Out-of-Scope Use Cases: See the modified RAIL license.
Data, Limitations, and Recommendations
- Data Selection for Training: Training data was selected based on breadth and availability. See the Datasheet for more details.
- Limitations: Similar to other large language models, GPT-SW3 has limitations in bias, safety, generation diversity, and hallucination.
- Recommendations for Future Work: Indirect users should be made aware when the content they are working with was generated by the LLM. Users should be aware of the risks and limitations and include appropriate disclaimers. Models derived from the LLM should ship with an updated Model Card, and users should provide mechanisms for feedback.
Datasheet
We follow the recommendations of Gebru et al. (2021) and provide a datasheet for the dataset used to train GPT-SW3.
Motivation
- Purpose: To train Swedish large language models, a large-scale Swedish dataset of high quality was needed. Since no such dataset existed, data in Nordic and English languages was collected.
- Creator: The NLU research group at AI Sweden, consisting of researchers and developers from AI Sweden and RISE.
- Funding: Funded by the Swedish Innovation Agency (Vinnova) through grants such as 2019-02996 and 2022-00949.
Composition
The dataset consists of textual documents categorized by language and document type, including sources from books, articles, code, conversational data, math, miscellaneous sources, web common crawl, and web sources.
📄 License
The model is released under the modified RAIL license.