UCCIX-Llama2-13B-Instruct Open-source Bilingual Large Model - Efficiently handle translation and communication between Irish and English

UCCIX Llama2 13B Instruct

Developed by ReliableAI

UCCIX-Llama2-13B-Instruct is an Irish-English bilingual large language model, developed based on the Llama 2-13B architecture, with special optimizations for Irish language processing.

Large Language Model

Transformers

Supports Multiple Languages

Open Source License: Apache-2.0

#Irish-English bilingual #Instruction fine-tuning optimization #Low-resource language support


Release Time: 6/11/2024

Model Overview

The model incorporates native Irish language tokens through vocabulary expansion and has undergone continued pre-training and instruction fine-tuning on Irish texts, enabling effective understanding and generation of Irish language content.

Model Features

Bilingual capability

Supports both Irish and English, with excellent performance on Irish language tasks

Instruction following

Fine-tuned with supervised instructions, effectively understanding and executing human instructions

Vocabulary expansion

Enhanced understanding of Irish through the addition of native Irish language tokens

Model Capabilities

Text generation

Bilingual understanding

Instruction following

Irish language processing

Use Cases

Language learning

Irish language learning assistant

Helps learners understand and practice Irish

Provides accurate Irish language explanations and examples

Content creation

Irish language content generation

Generates Irish articles, stories, or other text content

Produces fluent and natural Irish language text

🚀 UCCIX-Llama2-13B-Instruct

UCCIX-Llama2-13B-Instruct is an Irish-English bilingual Large Language Model (LLM). It understands both languages and outperforms much larger models on Irish language tasks. The model is based on Llama 2-13B, with a vocabulary expanded to include native Irish tokens and continued pre-training on a collection of about 520M Irish tokens.

🚀 Quick Start

✨ Features

Bilingual Capability: Understands both English and Irish, excelling in Irish language tasks.
Vocabulary Expansion: Based on Llama 2-13B, with an expanded vocabulary for native Irish tokens.
Pre-training: Continued pre-training on ~520M Irish tokens from https://huggingface.co/datasets/ReliableAI/Irish-Text-Collection.
Instruction Fine-tuning: Supervised instruction fine-tuning to better follow human instructions.
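To give an intuition for why vocabulary expansion helps, here is a toy sketch. It is not the tokenizer used by UCCIX or Llama 2 (those use a learned subword model); the greedy longest-match segmenter and both vocabularies below are invented purely for illustration. It shows that adding whole Irish words or subwords to the vocabulary lets the same text be covered by fewer tokens.

```python
def tokenize(text, vocab):
    """Greedy longest-match segmentation (toy stand-in for a real tokenizer).

    Characters not covered by the vocabulary fall back to single-character tokens.
    """
    tokens, i = [], 0
    max_len = max(len(t) for t in vocab)
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabularies: the base one only knows short fragments,
# the expanded one also contains native Irish pieces.
base_vocab = {"Di", "a", " d", "u", "it"}
expanded_vocab = base_vocab | {"Dia", " duit"}

text = "Dia duit"  # Irish greeting
print(len(tokenize(text, base_vocab)), tokenize(text, base_vocab))
print(len(tokenize(text, expanded_vocab)), tokenize(text, expanded_vocab))
```

Fewer tokens per sentence means more Irish text fits in the context window and each token carries more meaning, which is the general motivation behind expanding a vocabulary for a low-resource language.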

📦 Installation

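The model is loaded through the Hugging Face transformers library, with PyTorch as the backend; device_map="auto" additionally requires the accelerate package. The following code loads the tokenizer and the model: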

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ReliableAI/UCCIX-Llama2-13B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

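model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",    # place the weights on the available devices automatically
    dtype=torch.float16,  # optional: load in 16-bit precision to reduce memory usage
                          # (older transformers releases name this argument torch_dtype)
)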
model.eval()
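As a rough guide to what 16-bit loading saves, the weight memory can be estimated from the parameter count. The sketch below assumes the nominal 13 billion parameters implied by the model name; the real count is slightly higher because of the expanded vocabulary, and activations and the KV cache need additional memory on top of the weights.

```python
# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}

NOMINAL_PARAMS = 13e9  # assumption: nominal size taken from the "13B" in the model name


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Return the approximate size of the model weights in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


for dtype in ("float32", "float16"):
    print(f"{dtype}: ~{weight_memory_gib(NOMINAL_PARAMS, dtype):.1f} GiB")
```

Under these assumptions, half precision needs roughly 24 GiB for the weights, half of what 32-bit loading would take.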

💻 Usage Examples

Basic Usage

The template used to build a prompt for this Instruct model is as follows:

### Instruction:
{system_prompt}

### Input:
{instruction1}

### Response:
{response1}

### Input:
{instruction2}

### Response:
{response2}
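The template above can be assembled programmatically for multi-turn conversations. The following is a minimal sketch: the function name build_prompt and its arguments are illustrative, and the exact blank-line spacing between sections is assumed from the template as shown.

```python
def build_prompt(system_prompt, turns, next_instruction):
    """Assemble a multi-turn prompt following the template layout.

    turns: list of (instruction, response) pairs from earlier exchanges.
    The prompt ends with an open "### Response:" header for the model to complete.
    """
    parts = [f"### Instruction:\n{system_prompt}\n"]
    for instruction, response in turns:
        parts.append(f"### Input:\n{instruction}\n")
        parts.append(f"### Response:\n{response}\n")
    parts.append(f"### Input:\n{next_instruction}\n")
    parts.append("### Response:\n")
    # Joining with a newline leaves one blank line between sections.
    return "\n".join(parts)


history = [("Translate 'hello' into Irish.", "Dia duit.")]
print(build_prompt("You are a helpful assistant.", history, "And 'thank you'?"))
```

With an empty turns list this produces a single-turn prompt: one instruction block, one input block, and an open response header.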

Here is a Python code example to run the model:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ReliableAI/UCCIX-Llama2-13B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

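model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",    # place the weights on the available devices automatically
    dtype=torch.float16,  # optional: load in 16-bit precision to reduce memory usage
                          # (older transformers releases name this argument torch_dtype)
)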
model.eval()

def make_prompt(system_prompt, instruction):
    return f"""### Instruction:
{system_prompt}

### Input:
{instruction}

### Response:
"""

user_input = "Do you know about CloudCIX?"
SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe."
input_prompt = make_prompt(SYSTEM_PROMPT, user_input)

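# Move the inputs to the same device as the model (needed with device_map="auto").
input_ids = tokenizer(input_prompt, return_tensors="pt")["input_ids"].to(model.device)

generated_token_ids = model.generate(
    inputs=input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.6,
    top_p=1,
)[0]

# The decoded text contains the prompt followed by the model's response.
generated_text = tokenizer.decode(generated_token_ids, skip_special_tokens=True)
print(generated_text)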
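Because generate returns the prompt tokens followed by the new tokens, the decoded string repeats the whole prompt. One possible way to keep only the model's answer is sketched below; the helper name extract_response is hypothetical, and it assumes the prompt follows the "### Input:" / "### Response:" template shown earlier.

```python
MARKER = "### Response:"


def extract_response(prompt: str, generated_text: str) -> str:
    """Return only the newly generated answer from the decoded output."""
    # The prompt contains n response headers; the new text follows the n-th one.
    n = prompt.count(MARKER)
    tail = generated_text.split(MARKER, n)[-1]
    # Drop anything after a further "### Input:" header the model may invent.
    tail = tail.split("### Input:", 1)[0]
    # Remove the Llama end-of-sequence marker if special tokens were kept.
    return tail.replace("</s>", "").strip()


prompt = "### Instruction:\nBe helpful.\n\n### Input:\nHi\n\n### Response:\n"
decoded = "<s> " + prompt + "Dia duit!</s>"
print(extract_response(prompt, decoded))
```

An alternative is to slice the generated token ids by the prompt length before decoding; the string-based version is shown here because it runs without loading the model.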

📚 Documentation

UCCIX is a pioneering effort in the development of the first-ever open-source Irish-based LLM. You can find more details at: https://arxiv.org/abs/2405.13010

You can interact with the model live at: https://aine.chat

⚠️ Important Note

As a pioneering effort, the UCCIX model does not have any moderation mechanisms at the moment. We anticipate collaborating with the community to refine the model's adherence to restrictions so that it can be implemented in settings that demand moderated outcomes.

📄 License

The model is licensed under the Apache-2.0 license.

📚 Citation

@misc{tran2024uccix,
      title={UCCIX: Irish-eXcellence Large Language Model}, 
      author={Khanh-Tung Tran and Barry O'Sullivan and Hoang D. Nguyen},
      year={2024},
      eprint={2405.13010},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

📋 Model Information

Property	Details
Base Model	ReliableAI/UCCIX-Llama2-13B
Datasets	ReliableAI/Irish-Text-Collection
Language	English, Irish
License	Apache-2.0
Pipeline Tag	text-generation
