BgGPT-7B-Instruct-v0.1 Open-source Model - Efficiently process Bulgarian and support relevant application scenarios

Bggpt 7B Instruct V0.1

Developed by INSAIT-Institute

BgGPT-7B is a large language model for Bulgarian, fine-tuned from Mistral-7B. It targets Bulgarian language processing and the application scenarios that depend on it.

Large Language Model

Transformers

Open Source License: Apache-2.0

#Bulgarian language optimization #Efficient encoding of Cyrillic characters #Bilingual instruction fine-tuning

Downloads 198

Release Time : 2/17/2024

Model Overview

A large Bulgarian language model fine-tuned on Mistral-7B, supporting text generation and understanding tasks in Bulgarian and English.

Model Features

Optimized Bulgarian language processing

Extends the tokenizer to encode Cyrillic characters more efficiently, improving the throughput of Bulgarian text processing

Bilingual support

Retains English capabilities and supports mixed Bulgarian and English input

Instruction fine-tuning

Optimized using the [INST] instruction format, suitable for dialogue and instruction following scenarios

Efficient inference

Supports flash-attention2 acceleration, providing faster inference speed

Model Capabilities

Bulgarian text generation

English text generation

Instruction understanding and execution

Logical reasoning

Question-answering system

Use Cases

Education

Bulgarian history Q&A

Answer questions related to Bulgarian history and culture

In the example, the model is asked when Sofia University was founded and who founded it

Business applications

Bulgarian language customer service robot

Provide localized customer service support for Bulgarian enterprises

🚀 INSAIT-Institute/BgGPT-7B-Instruct-v0.1

Meet BgGPT-7B, a Bulgarian language model trained from mistralai/Mistral-7B-v0.1 and distributed under the Apache 2.0 license.

This model was created by INSAIT Institute, part of Sofia University, in Sofia, Bulgaria.

🚀 Quick Start

Use in Transformers

First, install the direct dependencies:

pip install transformers torch accelerate

If you want faster inference using flash-attention2, you need to install these dependencies:

pip install packaging ninja
pip install flash-attn

Then, load the model in transformers:

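from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "INSAIT-Institute/BgGPT-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,  # passed positionally; from_pretrained() has no `model=` keyword
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # optional; requires flash-attn, omit to use the default
)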
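Because flash-attention2 is an optional dependency, one way to keep a single loading script working on machines with and without it is to choose the attention backend at runtime. The sketch below is an illustration, not part of the official instructions: `pick_attn_implementation` and `load_kwargs` are hypothetical names, and it only checks that the `flash_attn` package is importable, not that the GPU actually supports it.

```python
import importlib.util


def pick_attn_implementation():
    """Return "flash_attention_2" if the flash_attn package is importable, else "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"  # PyTorch's built-in scaled-dot-product attention


# Keyword arguments that could be passed on to AutoModelForCausalLM.from_pretrained().
load_kwargs = {
    "device_map": "auto",
    "attn_implementation": pick_attn_implementation(),
}
print(load_kwargs)
```

The resulting dictionary can be unpacked into the `from_pretrained()` call (for example `from_pretrained(model_id, **load_kwargs)`), assuming a transformers version that accepts the `attn_implementation` argument.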

Use with GGML / llama.cpp

The model is available in GGUF format as INSAIT-Institute/BgGPT-7B-Instruct-v0.1-GGUF.

✨ Features

Trained from mistralai/Mistral-7B-v0.1 to enhance Bulgarian language capabilities.
Distributed under the Apache 2.0 license.
The tokenizer is extended for more efficient encoding of Bulgarian words in Cyrillic, improving throughput and performance.

📦 Installation

Transformers Installation

pip install transformers torch accelerate

Optional Flash-Attention2 Installation

pip install packaging ninja
pip install flash-attn

💻 Usage Examples

Instruction Format

To benefit from instruction fine-tuning, wrap each prompt in [INST] and [/INST] tokens. Only the first instruction should begin with the beginning-of-sentence token <s>; subsequent instructions should not. Each assistant response ends with the end-of-sentence token </s>.

E.g.

text = (
    "<s>[INST] Кога е основан Софийският университет? [/INST]"
    "Софийският университет „Св. Климент Охридски“ е създаден на 1 октомври 1888 г.</s> "
    "[INST] Кой го е основал? [/INST]"
)

This format is available as a chat template via the apply_chat_template() method.
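To make the format concrete, here is a minimal pure-Python sketch that assembles the same string from a list of chat messages. `build_prompt` is a hypothetical helper written for illustration; its whitespace simply mirrors the example above and may not match the model's actual chat template byte for byte, so `apply_chat_template()` remains the safer choice in real code.

```python
BOS, EOS = "<s>", "</s>"


def build_prompt(messages):
    """Render alternating user/assistant messages into the [INST] format.

    `messages` is a list of {"role": ..., "content": ...} dicts that must
    start with a user message and alternate user/assistant.
    """
    parts = [BOS]  # only the very first instruction is preceded by <s>
    for i, msg in enumerate(messages):
        expected = "user" if i % 2 == 0 else "assistant"
        if msg["role"] != expected:
            raise ValueError(f"message {i} should have role {expected!r}")
        if msg["role"] == "user":
            parts.append(f"[INST] {msg['content']} [/INST]")
        else:
            # Assistant turns are closed with the end-of-sentence token.
            parts.append(f"{msg['content']}{EOS} ")
    return "".join(parts)


messages = [
    {"role": "user", "content": "Кога е основан Софийският университет?"},
    {"role": "assistant", "content": "Софийският университет „Св. Климент Охридски“ е създаден на 1 октомври 1888 г."},
    {"role": "user", "content": "Кой го е основал?"},
]
prompt = build_prompt(messages)
print(prompt)
```

If you tokenize a hand-built string like this yourself, check whether your tokenizer also prepends its own beginning-of-sentence token, since that would duplicate the leading `<s>`.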

📚 Documentation

Model description

The model is fine-tuned to improve its Bulgarian language capabilities using multiple datasets, including Bulgarian web crawl data, a range of specialized Bulgarian datasets sourced by INSAIT Institute, and machine translations of popular English datasets. This Bulgarian data was augmented with English datasets to retain English and logical reasoning skills.

The model's tokenizer has been extended to encode Bulgarian words written in Cyrillic more efficiently. This increases throughput on Cyrillic text and also improves model performance.
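The following toy sketch illustrates why vocabulary coverage matters for Cyrillic. It is not the BgGPT tokenizer: `count_tokens` is a hypothetical greedy longest-match tokenizer with a UTF-8 byte fallback, and the two-entry vocabulary is invented for the example. The only hard fact it relies on is that each Cyrillic letter occupies two bytes in UTF-8, so text that falls back to bytes costs two tokens per letter.

```python
def utf8_bytes_per_char(text):
    """Average number of UTF-8 bytes per character."""
    return len(text.encode("utf-8")) / len(text)


def count_tokens(text, vocab):
    """Count tokens using greedy longest match, falling back to one token per UTF-8 byte."""
    count, i = 0, 0
    max_len = max((len(piece) for piece in vocab), default=0)
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in vocab:
                i += size
                count += 1
                break
        else:
            # No vocabulary piece matched: emit one token per byte of this character.
            count += len(text[i].encode("utf-8"))
            i += 1
    return count


word = "университет"
print(utf8_bytes_per_char("university"))  # 1.0
print(utf8_bytes_per_char(word))          # 2.0

print(count_tokens(word, set()))                # 22 (pure byte fallback)
print(count_tokens(word, {"универс", "итет"}))  # 2 (toy extended vocabulary)
```

Real subword tokenizers are learned from data and segment text differently, but the effect is the same in kind: adding Cyrillic pieces to the vocabulary shortens the token sequence the model has to process for the same text.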

Benchmarks

The model is accompanied by a set of benchmarks that are translations of the corresponding English benchmarks. They are available at https://github.com/insait-institute/lm-evaluation-harness-bg

📄 License

This model is distributed under the Apache 2.0 license.

📋 Summary

Property	Details
Finetuned from	mistralai/Mistral-7B-v0.1
Model Type	Causal decoder-only transformer language model
Language	Bulgarian and English
License	Apache 2.0
Contact	bggpt@insait.ai
