🚀 BODE
BODE is a Portuguese large language model (LLM) developed from the Llama 2 model through fine-tuning on the Alpaca dataset translated into Portuguese by the authors of Cabrita. It is designed for various natural language processing tasks in Portuguese, such as text generation, machine translation, text summarization, and more.
🚀 Quick Start
We strongly recommend using Kaggle with a GPU. You can easily use BODE with the HuggingFace Transformers library; however, you first need to be granted access to Llama 2. We also provide a Jupyter notebook on Google Colab. Click here to access it.
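Because the Llama 2 base weights are gated, you also need to authenticate with the Hugging Face Hub before downloading them. A minimal sketch, assuming you already have a personal access token (the token value below is a placeholder):

```python
# Authenticate with the Hugging Face Hub so the gated Llama 2 base weights can be downloaded.
# Replace the placeholder with your own token from https://huggingface.co/settings/tokens.
from huggingface_hub import login

login(token="HF_ACCESS_KEY")  # placeholder token
```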
Here is a simple example of how to load the model and generate text:
```python
!pip install transformers
!pip install einops accelerate bitsandbytes
!pip install sentence_transformers
!pip install git+https://github.com/huggingface/peft.git
```
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
from peft import PeftModel, PeftConfig

llm_model = 'recogna-nlp/bode-7b-alpaca-pt-br'
hf_auth = 'HF_ACCESS_KEY'
config = PeftConfig.from_pretrained(llm_model)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    trust_remote_code=True,
    return_dict=True,
    load_in_8bit=True,
    device_map='auto',
    token=hf_auth
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path, token=hf_auth)
model = PeftModel.from_pretrained(model, llm_model)
model.eval()

def generate_prompt(instruction, input=None):
    if input:
        return f"""Below is an instruction that describes a task, along with an input that provides more context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:"""
    else:
        return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:"""

generation_config = GenerationConfig(
    temperature=0.2,
    top_p=0.75,
    num_beams=2,
    do_sample=True
)

def evaluate(instruction, input=None):
    prompt = generate_prompt(instruction, input)
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].cuda()
    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        max_length=300
    )
    for s in generation_output.sequences:
        output = tokenizer.decode(s)
        print("Response:", output.split("### Response:")[1].strip())

evaluate("Answer in detail: What is a goat?")
```
✨ Features
BODE is a large language model (LLM) for Portuguese, developed from the Llama 2 model through fine-tuning on the Alpaca dataset translated into Portuguese by the authors of Cabrita. This model is designed for natural language processing tasks in Portuguese, such as text generation, machine translation, text summarization, and more.
The goal of developing BODE is to address the shortage of LLMs for the Portuguese language. Classic models, such as Llama itself, can respond to prompts in Portuguese, but they are prone to many grammar errors and sometimes generate responses in English. There are still few Portuguese models freely available and, to our knowledge, no models with 13b parameters or more trained specifically on Portuguese data.
See the article for more information about BODE.
The version of the BODE model provided on this page was trained with the internal resources available in Recogna's advanced research laboratory. Once the necessary authorizations are in place, we will also provide the original version of the model, trained on Santos Dumont.
📦 Installation
The installation steps are included in the usage example code above. You need to run the following commands:
```python
!pip install transformers
!pip install einops accelerate bitsandbytes
!pip install sentence_transformers
!pip install git+https://github.com/huggingface/peft.git
```
📚 Documentation
Model Details
| Property | Details |
|----------|---------|
| Model Type | Llama 2 |
| Training Data | Alpaca |
| Language | Portuguese |
Available Versions
| Number of Parameters | PEFT | Model |
|----------------------|------|-------|
| 7b | ✓ | [recogna-nlp/bode-7b-alpaca-pt-br](https://huggingface.co/recogna-nlp/bode-7b-alpaca-pt-br) |
| 13b | ✓ | [recogna-nlp/bode-13b-alpaca-pt-br](https://huggingface.co/recogna-nlp/bode-13b-alpaca-pt-br) |
| 7b | | [recogna-nlp/bode-7b-alpaca-pt-br-no-peft](https://huggingface.co/recogna-nlp/bode-7b-alpaca-pt-br-no-peft) |
| 13b | | [recogna-nlp/bode-13b-alpaca-pt-br-no-peft](https://huggingface.co/recogna-nlp/bode-13b-alpaca-pt-br-no-peft) |
| 7b-gguf | | [recogna-nlp/bode-7b-alpaca-pt-br-gguf](https://huggingface.co/recogna-nlp/bode-7b-alpaca-pt-br-gguf) |
| 13b-gguf | | [recogna-nlp/bode-13b-alpaca-pt-br-gguf](https://huggingface.co/recogna-nlp/bode-13b-alpaca-pt-br-gguf) |
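The no-peft variants ship with the adapter already merged into the base weights, so they can be loaded without `PeftModel`. A minimal sketch, assuming the 7b merged checkpoint and reusing the Alpaca-style prompt template from the Quick Start (the instruction text and generation length are only illustrative):

```python
# Minimal sketch: load a merged (no-PEFT) BODE checkpoint directly.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = 'recogna-nlp/bode-7b-alpaca-pt-br-no-peft'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map='auto')

# Same Alpaca-style prompt template used in the Quick Start example.
prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Answer in detail: What is a goat?

### Response:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_length=300)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```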
Training and Data
The BODE model was trained by fine-tuning the Llama 2 model on the Portuguese Alpaca dataset, an instruction-based dataset. Training was originally conducted on the Santos Dumont supercomputer at LNCC through the Fundunesp project 2019/00697-8, but the version provided here is a replica, trained with the same data and parameters in Recogna's internal environment.
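The released checkpoints include PEFT adapters, so the fine-tuning presumably followed the usual LoRA recipe for Llama 2. The sketch below is only illustrative; the hyperparameters are assumptions, not the values actually used to train BODE:

```python
# Illustrative LoRA fine-tuning setup (hyperparameters are assumptions, not BODE's actual config).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf', device_map='auto')

lora_config = LoraConfig(
    r=8,                                  # adapter rank (assumed)
    lora_alpha=16,                        # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],  # attention projections typically adapted in Llama models
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
# From here, train on the translated Alpaca instruction/response pairs with a standard
# causal language modeling objective (e.g. transformers.Trainer or trl's SFTTrainer).
```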
Citation
If you want to use BODE in your research, please cite the article that describes the model in more detail:
```bibtex
@misc{bode2024,
      title={Introducing Bode: A Fine-Tuned Large Language Model for Portuguese Prompt-Based Task},
      author={Gabriel Lino Garcia and Pedro Henrique Paiola and Luis Henrique Morelli and Giovani Candido and Arnaldo Cândido Júnior and Danilo Samuel Jodas and Luis C. S. Afonso and Ivan Rizzo Guilherme and Bruno Elias Penteado and João Paulo Papa},
      year={2024},
      eprint={2401.02909},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
Contributions
Contributions to improve this model are welcome. Feel free to open issues and pull requests.
Acknowledgments
We thank the National Laboratory for Scientific Computing (LNCC/MCTI, Brazil) for providing the HPC resources of the SDumont supercomputer.
Detailed results from the Open Portuguese LLM Leaderboard can be found [here](https://huggingface.co/datasets/eduagarcia-temp/llm_pt_leaderboard_raw_results/tree/main/recogna-nlp/bode-7b-alpaca-pt-br).
| Metric | Value |
|--------|-------|
| Average | 53.21 |
| ENEM Challenge (No Images) | 34.36 |
| BLUEX (No Images) | 28.93 |
| OAB Exams | 30.84 |
| Assin2 RTE | 79.83 |
| Assin2 STS | 43.47 |
| FaQuAD NLI | 67.45 |
| HateBR Binary | 85.06 |
| PT Hate Speech Binary | 65.73 |
| tweetSentBR | 43.25 |
📄 License
This project is under the MIT license.