🚀 Tweety-7b-dutch: A Dutch Large Language Model
Tweety-7b-dutch is a foundation model tailored to the Dutch language. It uses a Dutch tokenizer to better understand and generate Dutch text. Built on the Mistral architecture with Flash Attention, it efficiently processes contexts of up to 8192 tokens.
🚀 Quick Start
This section provides a high-level overview of the Tweety-7b-dutch model. For more detailed information, please refer to the subsequent sections.
✨ Features
- Dutch-centric: Incorporates a Dutch tokenizer for enhanced Dutch text understanding and generation (see the sketch after this list).
- Efficient architecture: Based on the Mistral architecture with Flash Attention, enabling efficient processing within an 8192-token context window.
- Clean training data: Trained on the cleaned Dutch mC4 dataset without instruction finetuning.
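A minimal sketch of the Dutch tokenizer at work, assuming the model is published on the Hugging Face Hub; the repository id below is a placeholder, so check the Hub for the exact name:

```python
from transformers import AutoTokenizer

# Placeholder repository id; check the Hugging Face Hub for the exact name.
tokenizer = AutoTokenizer.from_pretrained("Tweety/tweety-7b-dutch")

# A Dutch-specific vocabulary should split common Dutch words into
# fewer subword tokens than a general-purpose tokenizer would.
print(tokenizer.tokenize("verantwoordelijkheidsgevoel"))
```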
📦 Installation
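Since tweety-7b-dutch follows the Mistral architecture, a standard Hugging Face stack should be sufficient to run it; the suggestion below is an assumption on that basis, not an official instruction. Install a recent PyTorch build together with `transformers` (and `accelerate` for automatic device placement), e.g. `pip install torch transformers accelerate`. The optional Flash Attention kernels come from the separate `flash-attn` package.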
💻 Usage Examples
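A minimal generation sketch using the `transformers` library, assuming the model is published on the Hugging Face Hub; the repository id below is a placeholder, so check the Hub for the exact name:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository id; check the Hugging Face Hub for the exact name.
model_id = "Tweety/tweety-7b-dutch"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# The card states the model was trained in bfloat16, so load it in that dtype.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# tweety-7b-dutch is a base model without instruction finetuning,
# so give it plain Dutch text to continue rather than a chat-style prompt.
prompt = "Het Nederlands is een West-Germaanse taal die"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because this is a base model, expect text continuation rather than instruction following; sampling parameters such as `temperature` and `top_p` can be passed to `generate` as usual.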
📚 Documentation
Model Details
Our tweety-7b-dutch model is released under the Apache 2.0 license, which permits applications in research, content creation, and language analysis.
Uses
As a base model, tweety-7b-dutch is suited to direct application in Dutch text generation and understanding.
Technical Specifications
Compute Infrastructure
Training utilized Nvidia H100 and A100 GPUs. Inference is accessible on lower-end hardware: essentially any GPU capable of running Mistral models.
Model Weights
- This model was trained in bfloat16.
- GGUF weights are released by Bram Vanroy (a loading sketch follows below).
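For CPU or low-VRAM inference with those GGUF weights, here is a sketch using the `llama-cpp-python` bindings; the file name below is hypothetical and depends on which quantization you download:

```python
from llama_cpp import Llama

# Hypothetical local file name; use whichever quantization you downloaded
# from the GGUF release.
llm = Llama(model_path="tweety-7b-dutch.Q4_K_M.gguf", n_ctx=8192)

result = llm("Het Nederlands is een West-Germaanse taal die", max_tokens=60)
print(result["choices"][0]["text"])
```

Setting `n_ctx=8192` matches the model's full context window; smaller values reduce memory use.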
Citation
If you use this model, please cite our work as:
@article{tweeties2024,
  title = {Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP},
  author = {François Remy and Pieter Delobelle and Hayastan Avetisyan and Alfiya Khabibullina and Miryam de Lhoneux and Thomas Demeester},
  url = {https://arxiv.org/abs/2408.04303},
  year = {2024},
  note = {Accepted at COLM 2024}
}
Contributors
Pieter Delobelle, François Remy, Miryam de Lhoneux, Thomas Demeester
🇳🇱🇧🇪 A Dutch version of this README is also available.