🚀 BODE
BODE is a Portuguese large language model (LLM) developed from the Llama 2 model through fine-tuning on the Alpaca dataset translated into Portuguese by the authors of Cabrita. It is designed for various natural language processing tasks in Portuguese, such as text generation, machine translation, text summarization, and more.
🚀 Quick Start
We strongly recommend using Kaggle with a GPU. You can easily use BODE with the HuggingFace Transformers library; however, you first need to be granted access to Llama 2. We also provide a Jupyter notebook on Google Colab. Click here to access it.
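Because the Llama 2 base weights are gated, you also need to authenticate with the Hugging Face Hub before downloading them. A minimal sketch, assuming you already have a personal access token (the token value below is a placeholder):

```python
# Authenticate with the Hugging Face Hub so the gated Llama 2 base weights can be downloaded.
# Replace the placeholder with your own token from https://huggingface.co/settings/tokens.
from huggingface_hub import login

login(token="HF_ACCESS_KEY")  # placeholder token
```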
Here is a simple example of how to load the model and generate text:
```python
!pip install transformers
!pip install einops accelerate bitsandbytes
!pip install sentence_transformers
!pip install git+https://github.com/huggingface/peft.git
```
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
from peft import PeftModel, PeftConfig

llm_model = 'recogna-nlp/bode-7b-alpaca-pt-br'
hf_auth = 'HF_ACCESS_KEY'
config = PeftConfig.from_pretrained(llm_model)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    trust_remote_code=True,
    return_dict=True,
    load_in_8bit=True,
    device_map='auto',
    token=hf_auth
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path, token=hf_auth)
model = PeftModel.from_pretrained(model, llm_model)
model.eval()

def generate_prompt(instruction, input=None):
    if input:
        return f"""Below is an instruction that describes a task, along with an input that provides more context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:"""
    else:
        return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:"""

generation_config = GenerationConfig(
    temperature=0.2,
    top_p=0.75,
    num_beams=2,
    do_sample=True
)

def evaluate(instruction, input=None):
    prompt = generate_prompt(instruction, input)
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].cuda()
    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        max_length=300
    )
    for s in generation_output.sequences:
        output = tokenizer.decode(s)
        print("Response:", output.split("### Response:")[1].strip())

evaluate("Answer in detail: What is a goat?")
```
✨ Features
BODE is a large language model (LLM) for Portuguese, developed from the Llama 2 model through fine-tuning on the Alpaca dataset translated into Portuguese by the authors of Cabrita. This model is designed for natural language processing tasks in Portuguese, such as text generation, machine translation, text summarization, and more.
The goal of developing BODE is to address the shortage of LLMs for the Portuguese language. Classic models, such as Llama itself, can respond to prompts in Portuguese, but they are prone to many grammar errors and sometimes generate responses in English. There are still few Portuguese models freely available and, to our knowledge, no models with 13b parameters or more trained specifically on Portuguese data.
See the article for more information about BODE.
The version of the BODE model provided on this page was trained with the internal resources available in Recogna's advanced research laboratory. Once the necessary authorizations are in place, we will also provide the original version of the model, trained on Santos Dumont.
📦 Installation
The installation steps are included in the usage example code above. You need to run the following commands:
```python
!pip install transformers
!pip install einops accelerate bitsandbytes
!pip install sentence_transformers
!pip install git+https://github.com/huggingface/peft.git
```
📚 Documentation
Model Details
| Property | Details |
|----------|---------|
| Model Type | Llama 2 |
| Training Data | Alpaca |
| Language | Portuguese |
Available Versions
| Number of Parameters | PEFT | Model |
|----------------------|------|-------|
| 7b | ✓ | [recogna-nlp/bode-7b-alpaca-pt-br](https://huggingface.co/recogna-nlp/bode-7b-alpaca-pt-br) |
| 13b | ✓ | [recogna-nlp/bode-13b-alpaca-pt-br](https://huggingface.co/recogna-nlp/bode-13b-alpaca-pt-br) |
| 7b | | [recogna-nlp/bode-7b-alpaca-pt-br-no-peft](https://huggingface.co/recogna-nlp/bode-7b-alpaca-pt-br-no-peft) |
| 13b | | [recogna-nlp/bode-13b-alpaca-pt-br-no-peft](https://huggingface.co/recogna-nlp/bode-13b-alpaca-pt-br-no-peft) |
| 7b-gguf | | [recogna-nlp/bode-7b-alpaca-pt-br-gguf](https://huggingface.co/recogna-nlp/bode-7b-alpaca-pt-br-gguf) |
| 13b-gguf | | [recogna-nlp/bode-13b-alpaca-pt-br-gguf](https://huggingface.co/recogna-nlp/bode-13b-alpaca-pt-br-gguf) |
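The no-peft variants ship with the adapter already merged into the base weights, so they can be loaded without `PeftModel`. A minimal sketch, assuming the 7b merged checkpoint and reusing the Alpaca-style prompt template from the Quick Start (the instruction text and generation length are only illustrative):

```python
# Minimal sketch: load a merged (no-PEFT) BODE checkpoint directly.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = 'recogna-nlp/bode-7b-alpaca-pt-br-no-peft'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map='auto')

# Same Alpaca-style prompt template used in the Quick Start example.
prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Answer in detail: What is a goat?

### Response:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_length=300)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```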
Training and Data
The BODE model was trained by fine-tuning the Llama 2 model on the Portuguese Alpaca dataset, an instruction-based dataset. Training was originally conducted on the Santos Dumont supercomputer at LNCC through the Fundunesp project 2019/00697-8, but the version provided here is a replica, trained with the same data and parameters in Recogna's internal environment.
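The released checkpoints include PEFT adapters, so the fine-tuning presumably followed the usual LoRA recipe for Llama 2. The sketch below is only illustrative; the hyperparameters are assumptions, not the values actually used to train BODE:

```python
# Illustrative LoRA fine-tuning setup (hyperparameters are assumptions, not BODE's actual config).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf', device_map='auto')

lora_config = LoraConfig(
    r=8,                                  # adapter rank (assumed)
    lora_alpha=16,                        # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],  # attention projections typically adapted in Llama models
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
# From here, train on the translated Alpaca instruction/response pairs with a standard
# causal language modeling objective (e.g. transformers.Trainer or trl's SFTTrainer).
```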
Citation
If you want to use BODE in your research, please cite the article that describes the model in more detail:
```bibtex
@misc{bode2024,
      title={Introducing Bode: A Fine-Tuned Large Language Model for Portuguese Prompt-Based Task},
      author={Gabriel Lino Garcia and Pedro Henrique Paiola and Luis Henrique Morelli and Giovani Candido and Arnaldo Cândido Júnior and Danilo Samuel Jodas and Luis C. S. Afonso and Ivan Rizzo Guilherme and Bruno Elias Penteado and João Paulo Papa},
      year={2024},
      eprint={2401.02909},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
Contributions
Contributions to improve this model are welcome. Feel free to open issues and pull requests.
Acknowledgments
We thank the National Laboratory for Scientific Computing (LNCC/MCTI, Brazil) for providing the HPC resources of the SDumont supercomputer.
Detailed results from the Open Portuguese LLM Leaderboard can be found [here](https://huggingface.co/datasets/eduagarcia-temp/llm_pt_leaderboard_raw_results/tree/main/recogna-nlp/bode-7b-alpaca-pt-br).
| Metric | Value |
|--------|-------|
| Average | 53.21 |
| ENEM Challenge (No Images) | 34.36 |
| BLUEX (No Images) | 28.93 |
| OAB Exams | 30.84 |
| Assin2 RTE | 79.83 |
| Assin2 STS | 43.47 |
| FaQuAD NLI | 67.45 |
| HateBR Binary | 85.06 |
| PT Hate Speech Binary | 65.73 |
| tweetSentBR | 43.25 |
📄 License
This project is under the MIT license.