🚀 Tweety-7b-dutch: A Dutch Large Language Model
Tweety-7b-dutch is a foundation model tailored to the Dutch language. It uses a Dutch tokenizer to better understand and generate Dutch text. Built on the Mistral architecture with Flash Attention, it efficiently processes contexts of up to 8192 tokens.
🚀 Quick Start
This section provides a high-level overview of the Tweety-7b-dutch model. For more detailed information, please refer to the subsequent sections.
✨ Features
- Dutch-centric: Incorporates a Dutch tokenizer for enhanced Dutch text understanding and generation (see the sketch after this list).
- Efficient architecture: Based on the Mistral architecture with Flash Attention, enabling efficient processing within an 8192-token context window.
- Clean training data: Trained on the cleaned Dutch mC4 dataset without instruction finetuning.
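A minimal sketch of the Dutch tokenizer at work, assuming the model is published on the Hugging Face Hub; the repository id below is a placeholder, so check the Hub for the exact name:

```python
from transformers import AutoTokenizer

# Placeholder repository id; check the Hugging Face Hub for the exact name.
tokenizer = AutoTokenizer.from_pretrained("Tweety/tweety-7b-dutch")

# A Dutch-specific vocabulary should split common Dutch words into
# fewer subword tokens than a general-purpose tokenizer would.
print(tokenizer.tokenize("verantwoordelijkheidsgevoel"))
```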
📦 Installation
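Since tweety-7b-dutch follows the Mistral architecture, a standard Hugging Face stack should be sufficient to run it; the suggestion below is an assumption on that basis, not an official instruction. Install a recent PyTorch build together with `transformers` (and `accelerate` for automatic device placement), e.g. `pip install torch transformers accelerate`. The optional Flash Attention kernels come from the separate `flash-attn` package.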
💻 Usage Examples
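A minimal generation sketch using the `transformers` library, assuming the model is published on the Hugging Face Hub; the repository id below is a placeholder, so check the Hub for the exact name:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository id; check the Hugging Face Hub for the exact name.
model_id = "Tweety/tweety-7b-dutch"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# The card states the model was trained in bfloat16, so load it in that dtype.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# tweety-7b-dutch is a base model without instruction finetuning,
# so give it plain Dutch text to continue rather than a chat-style prompt.
prompt = "Het Nederlands is een West-Germaanse taal die"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because this is a base model, expect text continuation rather than instruction following; sampling parameters such as `temperature` and `top_p` can be passed to `generate` as usual.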
📚 Documentation
Model Details
Our tweety-7b-dutch model is released under the Apache 2.0 license, which permits applications in research, content creation, and language analysis.
Uses
As a base model, tweety-7b-dutch is suited to direct application in Dutch text generation and understanding.
Technical Specifications
Compute Infrastructure
Training utilized Nvidia H100 and A100 GPUs. Inference is accessible on lower-end hardware: essentially any GPU capable of running Mistral models.
Model Weights
- This model was trained in bfloat16.
- GGUF weights are released by Bram Vanroy (a loading sketch follows below).
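For CPU or low-VRAM inference with those GGUF weights, here is a sketch using the `llama-cpp-python` bindings; the file name below is hypothetical and depends on which quantization you download:

```python
from llama_cpp import Llama

# Hypothetical local file name; use whichever quantization you downloaded
# from the GGUF release.
llm = Llama(model_path="tweety-7b-dutch.Q4_K_M.gguf", n_ctx=8192)

result = llm("Het Nederlands is een West-Germaanse taal die", max_tokens=60)
print(result["choices"][0]["text"])
```

Setting `n_ctx=8192` matches the model's full context window; smaller values reduce memory use.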
Citation
If you use this model, please cite our work as:
@article{tweeties2024,
  title = {Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP},
  author = {François Remy and Pieter Delobelle and Hayastan Avetisyan and Alfiya Khabibullina and Miryam de Lhoneux and Thomas Demeester},
  url = {https://arxiv.org/abs/2408.04303},
  year = {2024},
  note = {Accepted at COLM 2024}
}
Contributors
Pieter Delobelle, François Remy, Miryam de Lhoneux, Thomas Demeester
🇳🇱🇧🇪 A Dutch version of this README is also available.