🚀 GPT-SW3 Model
GPT-SW3 is a collection of large decoder-only pretrained transformer language models. It can generate coherent text in multiple languages and programming languages, and can be instructed to perform various text tasks.
🚀 Quick Start
Since this is a private repository, you need to log in with your access token to access the model from Python. You can do this with `huggingface-cli login`. See the HuggingFace Quick Start Guide for more information.
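If you prefer to authenticate from Python rather than the shell, the `huggingface_hub` package provides a `login` helper; a minimal sketch (the token value is a placeholder you need to replace with your own access token):

```python
from huggingface_hub import login

# Programmatic alternative to `huggingface-cli login`.
# Replace the placeholder with your personal access token from the Hugging Face Hub.
login(token="hf_xxx_replace_with_your_token")
```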
The following code snippet loads the tokenizer & model, and uses the GPU if available:
```python
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

model_name = "AI-Sweden-Models/gpt-sw3-6.7b-v2-instruct"
# Use the GPU if one is available, otherwise fall back to CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
prompt = "Träd är fina för att"

# Load the tokenizer and model, then move the model to the chosen device.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
model.to(device)
```
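The 6.7B checkpoint is large, so it may not fit in GPU memory in full precision. A common workaround (an assumption about your hardware, not something the card prescribes) is to load the weights in half precision:

```python
# Optional: load the weights in float16 to roughly halve the GPU memory footprint.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()
model.to(device)
```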
✨ Features
- Multilingual Generation: Capable of generating coherent text in 5 different languages and 4 programming languages.
- Instruction-based Tasks: Can be instructed to perform text tasks it hasn't been explicitly trained for by casting them as text generation tasks.
📦 Installation
The usage examples require `torch` and `transformers`; installing them with pip (e.g. `pip install torch transformers`) is enough to run the snippets below.
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

model_name = "AI-Sweden-Models/gpt-sw3-6.7b-v2-instruct"
device = "cuda:0" if torch.cuda.is_available() else "cpu"
prompt = "Träd är fina för att"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
model.to(device)

# Tokenize the prompt and move the input ids to the same device as the model.
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].to(device)

# Sample up to 100 new tokens as a continuation of the prompt.
generated_token_ids = model.generate(
    inputs=input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.6,
    top_p=1,
)[0]

generated_text = tokenizer.decode(generated_token_ids)
```
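Note that `generated_text` contains the prompt followed by the continuation. If you only want the newly generated part, one option (a small sketch, not part of the original example) is to decode just the tokens that come after the prompt:

```python
# Decode only the tokens generated after the prompt.
continuation = tokenizer.decode(generated_token_ids[input_ids.shape[1]:])
print(continuation)
```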
Advanced Usage
Generating text using the `generate` method in the chat format:
```python
# Chat-format prompt: alternating User and Bot turns separated by <s> tokens.
prompt = """
<|endoftext|><s>
User:
Varför är träd fina?
<s>
Bot:
""".strip()

input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].to(device)

generated_token_ids = model.generate(
    inputs=input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.6,
    top_p=1,
)[0]

generated_text = tokenizer.decode(generated_token_ids)
```
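To reuse this chat format for other questions, a small helper can assemble the prompt from a list of turns. The function below is hypothetical (not part of the model card) and simply reproduces the `<|endoftext|>` / `<s>` structure shown above:

```python
# Hypothetical helper: build a chat-format prompt from (speaker, text) turns.
def build_chat_prompt(turns):
    parts = ["<|endoftext|><s>"]
    for speaker, text in turns:
        parts.append(f"{speaker}:\n{text}\n<s>")
    parts.append("Bot:\n")
    return "\n".join(parts).strip()

prompt = build_chat_prompt([("User", "Varför är träd fina?")])
```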
Using the HuggingFace pipeline:
```python
generator = pipeline("text-generation", tokenizer=tokenizer, model=model, device=device)
generated = generator(prompt, max_new_tokens=100, do_sample=True, temperature=0.6, top_p=1)[0]["generated_text"]
```
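The pipeline also accepts a list of prompts and returns one result list per prompt; a short sketch reusing the prompts from the examples above:

```python
# Batch generation: pass several prompts at once to the pipeline.
prompts = ["Träd är fina för att", "Varför är träd fina?"]
results = generator(prompts, max_new_tokens=100, do_sample=True, temperature=0.6, top_p=1)
for result in results:
    print(result[0]["generated_text"])
```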
📚 Documentation
Intended Use
GPT-SW3 is pre-released for research and evaluation of the capabilities of Large Language Models for the Nordic languages. It aims to contribute to knowledge building for LLMs, validate the model, and collect feedback.
Limitations
Like other large language models, GPT-SW3 has limitations in terms of bias, safety, generation diversity, and hallucination. It may overrepresent some viewpoints, contain stereotypes, generate inappropriate language, make errors, and produce irrelevant or repetitive outputs.
Model Details
| Property | Details |
|---|---|
| Person or organization developing model | GPT-SW3 was developed by AI Sweden in collaboration with RISE and the WASP WARA for Media and Language. |
| Model date | GPT-SW3 date of release: 2022-12-20 |
| Model version | This is the second generation of GPT-SW3. |
| Model type | GPT-SW3 is a large decoder-only transformer language model. |
| Information about training algorithms, parameters, fairness constraints or other applied approaches, and features | GPT-SW3 was trained with the NeMo Megatron GPT implementation. |
| Paper or other resource for more information | N/A |
| License | LICENSE |
| Where to send questions or comments about the model | nlu@ai.se |
Intended Use
- Primary intended uses: Research and evaluation of Large Language Models for the Nordic languages.
- Primary intended users: Organizations and individuals in the Nordic NLP ecosystem.
- Out-of-scope use cases: See the modified RAIL license.
Data, Limitations, and Recommendations
- Data selection for training: Training data was selected based on breadth and availability. See the Datasheet for more details.
- Data selection for evaluation: N/A
- Limitations: Similar to other large language models, GPT-SW3 has limitations in bias, safety, generation diversity, and hallucination.
- Recommendations for future work: Indirect users should be aware of LLM-generated content. Users should be aware of risks and limitations and include appropriate disclaimers. Models pretrained with the LLM should have an updated Model Card. Users should provide feedback mechanisms.
GPT-SW3 Datasheet
Motivation
- The dataset was created for pre-training Swedish Large Language Models due to the lack of large-scale high-quality Swedish datasets.
- The dataset was created by the NLU research group at AI Sweden, which consists of researchers and developers from AI Sweden and RISE.
- The Swedish Innovation Agency (Vinnova) funded the work through several grants, including 2019-02996 and 2022-00949.
Composition
The dataset consists of textual documents categorized by language and document type. It includes sources such as books, articles, code, conversational data, math data, miscellaneous data, and web common crawl data.
🔧 Technical Details
GPT-SW3 was trained with the NeMo Megatron GPT implementation on a dataset containing 320B tokens in Swedish, Norwegian, Danish, Icelandic, English, and programming code. The `instruct` models were fine-tuned on instruction data using both chat and raw text formats.
📄 License
The model is released under a modified RAIL license; see the LICENSE for the full terms.