UCCIX-Llama2-13B-Instruct
UCCIX-Llama2-13B-Instruct is an Irish-English bilingual Large Language Model (LLM). It understands both languages and outperforms much larger models on Irish language tasks. The model is based on Llama 2-13B, with a vocabulary expanded to include native Irish tokens and continued pre-training on a collection of about 520M Irish tokens.
Quick Start
Features
- Bilingual Capability: Understands both English and Irish, excelling in Irish language tasks.
- Vocabulary Expansion: Based on Llama 2-13B, with an expanded vocabulary for native Irish tokens.
- Pre-training: Continued pre-training on ~520M Irish tokens from https://huggingface.co/datasets/ReliableAI/Irish-Text-Collection.
- Instruction Fine-tuning: Supervised instruction fine-tuning to better follow human instructions.
Installation
Installation mainly involves the transformers library. The following code loads the model:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "ReliableAI/UCCIX-Llama2-13B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    # Half precision; older transformers releases name this argument torch_dtype
    dtype=torch.float16,
)
model.eval()
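The loading code above requests float16 weights. As a rough back-of-the-envelope sketch of why that matters, the snippet below estimates the memory needed for the weights alone. The helper `estimate_weight_memory_gib` is hypothetical, and the parameter count is the nominal "13B" figure, which is an assumption: the real count differs slightly (the expanded vocabulary adds embedding rows), and activations and the KV cache need additional memory.

```python
def estimate_weight_memory_gib(num_parameters, bytes_per_parameter):
    """Approximate memory for model weights only, in GiB."""
    return num_parameters * bytes_per_parameter / 2**30


# Assumption: nominal parameter count taken from the "13B" in the model name
NOMINAL_PARAMETERS = 13e9

for dtype_name, bytes_per_parameter in [("float32", 4), ("float16", 2)]:
    gib = estimate_weight_memory_gib(NOMINAL_PARAMETERS, bytes_per_parameter)
    print(f"{dtype_name}: about {gib:.1f} GiB for weights")
```

Under these assumptions, half precision halves the weight footprint compared with float32, which is why the example loads the model with `torch.float16`.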
Usage Examples
Basic Usage
The template used to build a prompt for this Instruct model is as follows:
### Instruction:
{system_prompt}
### Input:
{instruction1}
### Response:
{response1}
### Input:
{instruction2}
### Response:
{response2}
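The multi-turn layout of the template can be sketched as a small helper. This is a minimal illustration, not the authors' code: the function name `make_chat_prompt` is hypothetical, and the exact whitespace between turns (a single newline after each response) is an assumption, since the template does not specify it.

```python
def make_chat_prompt(system_prompt, turns, next_instruction):
    """Build a multi-turn prompt following the template above.

    turns: list of (instruction, response) pairs from earlier in the conversation.
    next_instruction: the new user message the model should answer.
    """
    parts = [f"### Instruction:\n{system_prompt}\n"]
    for instruction, response in turns:
        # Assumption: each completed turn ends with a single newline
        parts.append(f"### Input:\n{instruction}\n### Response:\n{response}\n")
    # Leave the final response open so the model continues from here
    parts.append(f"### Input:\n{next_instruction}\n### Response:\n")
    return "".join(parts)


history = [("What is the capital of Ireland?", "Dublin.")]
print(make_chat_prompt("You are a helpful assistant.", history, "And how do you say that in Irish?"))
```

With an empty `turns` list, the helper produces a single-turn prompt with one Instruction, one Input, and an open Response section.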
Here is a Python code example to run the model:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "ReliableAI/UCCIX-Llama2-13B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    # Half precision; older transformers releases name this argument torch_dtype
    dtype=torch.float16,
)
model.eval()
def make_prompt(system_prompt, instruction):
    return f"""### Instruction:
{system_prompt}
### Input:
{instruction}
### Response:
"""
user_input = "Do you know about CloudCIX?"
SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe."
input_prompt = make_prompt(SYSTEM_PROMPT, user_input)
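# Move the input tensor to the same device as the model
input_ids = tokenizer(input_prompt, return_tensors="pt")["input_ids"].to(model.device)
generated_token_ids = model.generate(
    inputs=input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.6,
    top_p=1,
)[0]
# The decoded text contains the prompt followed by the model's answer
generated_text = tokenizer.decode(generated_token_ids)
print(generated_text)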
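Because the decoded output contains the prompt followed by the completion, it is often convenient to keep only the model's answer. The following is a minimal sketch of one way to do that on the decoded string. The helper `extract_response` is hypothetical, and the end-of-sequence string `</s>` is an assumption based on the Llama 2 tokenizer; in practice `tokenizer.eos_token` could be passed instead, or special tokens could be dropped at decode time with `skip_special_tokens=True`.

```python
RESPONSE_MARKER = "### Response:"
INPUT_MARKER = "### Input:"


def extract_response(generated_text, prompt, eos_token="</s>"):
    """Return only the newly generated answer from the decoded output."""
    # Skip as many response markers as the prompt itself contains
    markers_in_prompt = prompt.count(RESPONSE_MARKER)
    tail = generated_text.split(RESPONSE_MARKER, markers_in_prompt)[-1]
    # Cut off any further turn the model may have started on its own
    tail = tail.split(INPUT_MARKER, 1)[0]
    # Assumption: eos_token matches the tokenizer's end-of-sequence string
    return tail.replace(eos_token, "").strip()


example_prompt = "### Instruction:\nBe helpful.\n### Input:\nHello\n### Response:\n"
example_output = "<s> " + example_prompt + "Hi, how can I help?</s>"
print(extract_response(example_output, example_prompt))
```

Counting markers in the prompt, rather than searching for the prompt text itself, keeps the helper tolerant of a leading `<s>` or small whitespace differences introduced by decoding.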
Documentation
UCCIX is a pioneering effort in the development of the first-ever open-source Irish-based LLM. You can find more details at: https://arxiv.org/abs/2405.13010
You can interact with the model live at: https://aine.chat
Important Note
As a pioneering effort, the UCCIX model does not have any moderation mechanisms at the moment. We anticipate collaborating with the community to refine the model's adherence to restrictions so that it can be implemented in settings that demand moderated outcomes.
License
The model is licensed under the Apache-2.0 license.
Citation
@misc{tran2024uccix,
  title={UCCIX: Irish-eXcellence Large Language Model},
  author={Khanh-Tung Tran and Barry O'Sullivan and Hoang D. Nguyen},
  year={2024},
  eprint={2405.13010},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
Model Information
| Property | Details |
|---|---|
| Base Model | ReliableAI/UCCIX-Llama2-13B |
| Datasets | ReliableAI/Irish-Text-Collection |
| Language | English, Irish |
| License | apache-2.0 |
| Pipeline Tag | text-generation |