Qwen2.5 0.5B Portuguese V1

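Developed by cnmoro

A Portuguese large language model fine-tuned from Qwen2.5-0.5B-Instruct and specialized for text generation tasks

Large language model

Safetensors

Other  Open-source license: MIT  #Portuguese optimization #Educational exam Q&A #Legal document processing

Downloads: 2,218

Release date: 2/25/2025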

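Model Overview

This is a text generation model optimized for Portuguese. It is fine-tuned from the Qwen2.5 architecture and can be applied to a variety of Portuguese natural language processing tasks.

Model Features

Portuguese optimization

Fine-tuned specifically on Portuguese to improve Portuguese comprehension and generation

Multi-task support

Supports a variety of text generation tasks, including Q&A, reasoning, and sentiment analysis

Efficient inference

At 0.5B parameters, offers high inference efficiency while maintaining performance

Model Capabilities

Portuguese text generation

Q&A systems

Text classification

Semantic similarity computation

Sentiment analysis

Exam question answering

Use Cases

Education

ENEM exam question answering

Answers questions from Brazil's National High School Exam (ENEM)

Accuracy: 37.86%

OAB bar exam

Answers questions from the Brazilian bar exam

Accuracy: 33.12%

Law

Legal document analysis

Processing and analysis of legal documents

Social media analysis

Hate speech detection

Identifies hate speech in Portuguese-language social media

Macro F1: 55.1

Sentiment analysis

Analyzes the sentiment of Portuguese tweets

Macro F1: 45.96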

license: mit
language:
- pt
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
pipeline_tag: text-generation
datasets:
- adalbertojunior/openHermes_portuguese
- cnmoro/smoltalk-555k-ptbr
- cnmoro/RagMixPTBR-Legal-Alpaca-2M
model-index:
- name: Qwen2.5-0.5B-Portuguese-v1
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: ENEM Challenge (No Images)
      type: eduagarcia/enem_challenge
      split: train
      args:
        num_few_shot: 3
    metrics:
    - type: acc
      value: 37.86
      name: accuracy
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=cnmoro/Qwen2.5-0.5B-Portuguese-v1
      name: Open Portuguese LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: BLUEX (No Images)
      type: eduagarcia-temp/BLUEX_without_images
      split: train
      args:
        num_few_shot: 3
    metrics:
    - type: acc
      value: 34.63
      name: accuracy
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=cnmoro/Qwen2.5-0.5B-Portuguese-v1
      name: Open Portuguese LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: OAB Exams
      type: eduagarcia/oab_exams
      split: train
      args:
        num_few_shot: 3
    metrics:
    - type: acc
      value: 33.12
      name: accuracy
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=cnmoro/Qwen2.5-0.5B-Portuguese-v1
      name: Open Portuguese LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Assin2 RTE
      type: assin2
      split: test
      args:
        num_few_shot: 15
    metrics:
    - type: f1_macro
      value: 86.3
      name: f1-macro
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=cnmoro/Qwen2.5-0.5B-Portuguese-v1
      name: Open Portuguese LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Assin2 STS
      type: eduagarcia/portuguese_benchmark
      split: test
      args:
        num_few_shot: 15
    metrics:
    - type: pearson
      value: 54.3
      name: pearson
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=cnmoro/Qwen2.5-0.5B-Portuguese-v1
      name: Open Portuguese LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: FaQuAD NLI
      type: ruanchaves/faquad-nli
      split: test
      args:
        num_few_shot: 15
    metrics:
    - type: f1_macro
      value: 65.33
      name: f1-macro
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=cnmoro/Qwen2.5-0.5B-Portuguese-v1
      name: Open Portuguese LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HateBR Binary
      type: ruanchaves/hatebr
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: f1_macro
      value: 44.06
      name: f1-macro
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=cnmoro/Qwen2.5-0.5B-Portuguese-v1
      name: Open Portuguese LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: PT Hate Speech Binary
      type: hate_speech_portuguese
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: f1_macro
      value: 55.1
      name: f1-macro
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=cnmoro/Qwen2.5-0.5B-Portuguese-v1
      name: Open Portuguese LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: tweetSentBR
      type: eduagarcia/tweetsentbr_fewshot
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: f1_macro
      value: 45.96
      name: f1-macro
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=cnmoro/Qwen2.5-0.5B-Portuguese-v1
      name: Open Portuguese LLM Leaderboard

Qwen2.5-0.5B fine-tuned for proficiency in the Portuguese language and increased intelligence.

https://ollama.com/cnmoro/Qwen2.5-0.5B-Portuguese-v1

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "cnmoro/Qwen2.5-0.5B-Portuguese-v1"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Escreva uma breve introdução sobre LLMs (Large Language Models) e suas aplicações."

# The chat template automatically injects a hardcoded system prompt
# tuned for ideal performance in Portuguese,
# so there is no need to add a system message here.
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
# LLM significa Large Language Models, que são modelos de linguagem computacional
# projetados para simular a inteligência humana no processamento e geração de texto.
# Esses modelos usam técnicas avançadas de aprendizado de máquina e redes neurais para
# compreender e gerar texto com base em dados de entrada. As aplicações de LLM incluem
# tradução automática, análise de sentimento, modelagem de tópicos e resposta a perguntas
# automatizadas. Eles estão sendo cada vez mais utilizados em diversas áreas, como
# saúde, educação e finanças, para melhorar a comunicação, as experiências dos clientes
# e os resultados da pesquisa.
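
The list comprehension in the example above removes the prompt tokens from each generated sequence, since `generate` on a causal language model returns the prompt followed by the continuation. A minimal sketch of that slicing step on plain Python lists follows; the token IDs and the helper name `strip_prompt_tokens` are made up for illustration and are not part of the model card.

```python
def strip_prompt_tokens(input_ids_batch, output_ids_batch):
    """Return only the newly generated tokens for each sequence in the batch."""
    return [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(input_ids_batch, output_ids_batch)
    ]


# Hypothetical token IDs: the first three IDs of the output echo the prompt.
prompt_batch = [[101, 7, 42]]
generated_batch = [[101, 7, 42, 900, 901, 902]]

new_tokens = strip_prompt_tokens(prompt_batch, generated_batch)
print(new_tokens)  # [[900, 901, 902]]
```

Without this step, decoding the full output would repeat the chat-formatted prompt at the start of the response.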

Overall Results

Task	Metric	Value	Stdev
assin2_rte	f1_macro	0.391	0.006
assin2_rte	acc	0.527	0.007
assin2_sts	pearson	0.115	0.014
assin2_sts	mse	1.011	N/A
bluex	acc	0.349	0.010
enem_challenge	acc	0.363	0.007
faquad_nli	f1_macro	0.595	0.017
faquad_nli	acc	0.791	0.011
hatebr_offensive	f1_macro	0.338	0.005
hatebr_offensive	acc	0.502	0.009
oab_exams	acc	0.326	0.006
portuguese_hate_speech	f1_macro	0.412	0.004
portuguese_hate_speech	acc	0.702	0.011
tweetsentbr	f1_macro	0.455	0.005
tweetsentbr	acc	0.594	0.008
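
Accuracy and macro-F1 diverge noticeably for some tasks in the table above (for example, portuguese_hate_speech reports acc 0.702 but f1_macro 0.412). One common reason for this kind of gap is class imbalance: macro-F1 averages the per-class F1 scores with equal weight, so a class the model rarely predicts correctly pulls the score down. The sketch below illustrates this on invented toy labels; it is a textbook definition, and the evaluation harness behind these results may compute the metric differently.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class with no true positives contributes an F1 of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (not from the benchmark): 8 negatives, 2 positives,
# and a classifier that always predicts the majority class.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

toy_acc = accuracy(y_true, y_pred)
toy_f1 = macro_f1(y_true, y_pred)
print(f"acc={toy_acc:.3f} f1_macro={toy_f1:.3f}")  # acc=0.800 f1_macro=0.444
```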

Detailed Results

assin2_rte

Metric	Value	Stdev
f1_macro	0.391	0.006
acc	0.527	0.007

assin2_sts

Metric	Value	Stdev
pearson	0.115	0.014
mse	1.011	N/A

bluex

Exam ID	Metric	Value	Stdev
all	acc	0.349	0.010
USP_2019	acc	0.225	0.038
USP_2024	acc	0.293	0.041
USP_2021	acc	0.423	0.040
UNICAMP_2018	acc	0.241	0.034
UNICAMP_2024	acc	0.444	0.043
USP_2020	acc	0.393	0.038
UNICAMP_2020	acc	0.291	0.035
UNICAMP_2021_1	acc	0.326	0.040
UNICAMP_2022	acc	0.487	0.046
USP_2022	acc	0.388	0.040
UNICAMP_2019	acc	0.280	0.037
UNICAMP_2021_2	acc	0.294	0.037
UNICAMP_2023	acc	0.558	0.044
USP_2023	acc	0.364	0.042
USP_2018	acc	0.278	0.035

enem_challenge

Exam ID	Metric	Value	Stdev
all	acc	0.363	0.007
2016_2	acc	0.390	0.025
2015	acc	0.319	0.025
2011	acc	0.410	0.026
2013	acc	0.398	0.027
2017	acc	0.319	0.025
2022	acc	0.376	0.024
2009	acc	0.226	0.023
2010	acc	0.444	0.026
2012	acc	0.345	0.025
2014	acc	0.339	0.026
2016	acc	0.397	0.026
2023	acc	0.385	0.024

faquad_nli

Metric	Value	Stdev
f1_macro	0.595	0.017
acc	0.791	0.011

hatebr_offensive

Metric	Value	Stdev
f1_macro	0.338	0.005
acc	0.502	0.009

oab_exams

Exam ID	Metric	Value	Stdev
all	acc	0.326	0.006
2018-25	acc	0.400	0.032
2016-20a	acc	0.238	0.027
2011-05	acc	0.400	0.032
2012-08	acc	0.325	0.030
2012-09	acc	0.260	0.029
2014-13	acc	0.325	0.030
2011-03	acc	0.313	0.027
2016-20	acc	0.275	0.029
2012-06a	acc	0.325	0.030
2017-22	acc	0.338	0.031
2015-16	acc	0.325	0.030
2013-12	acc	0.300	0.030
2017-24	acc	0.250	0.028
2012-06	acc	0.238	0.027
2014-14	acc	0.325	0.030
2013-11	acc	0.325	0.030
2013-10	acc	0.413	0.032
2010-02	acc	0.390	0.028
2016-21	acc	0.375	0.031
2015-18	acc	0.300	0.030
2015-17	acc	0.282	0.029
2016-19	acc	0.333	0.031
2012-07	acc	0.388	0.031
2017-23	acc	0.325	0.030
2011-04	acc	0.350	0.031
2010-01	acc	0.282	0.028
2014-15	acc	0.385	0.032

portuguese_hate_speech

Metric	Value	Stdev
f1_macro	0.412	0.004
acc	0.702	0.011

tweetsentbr

Metric	Value	Stdev
f1_macro	0.455	0.005
acc	0.594	0.008

Model Meta Information

Truncated Samples: 3863
Non-Truncated Samples: 10287
Padded Samples: 0
Non-Padded Samples: 14150
Fewshots Truncated: 3863
Has Chat Template: True
Chat Type: system_user_assistant
Number of GPUs: 1
Accelerate Number of Processes: N/A
Model SHA: None
Model Data Type: torch.bfloat16
Model Memory Footprint: 988065664 bytes
Model Number of Parameters: 494032768
Model is Loaded in 4bit: N/A
Model is Loaded in 8bit: N/A
Model is Quantized: N/A
Model Device: cuda:0
Batch Size: 1
Max Length: 512
Max Context Length: 480
Max Generation Tokens: 32
Effective Batch Size: 1.0
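
A few of the figures above can be cross-checked with simple arithmetic. The sketch below assumes 2 bytes per parameter for `torch.bfloat16` and compares the result with the reported footprint; the small remainder presumably comes from non-parameter buffers, which is an assumption and not something the report states. It also derives the share of truncated samples.

```python
# Values copied from the meta information above.
num_parameters = 494_032_768
reported_footprint_bytes = 988_065_664
truncated = 3863
non_truncated = 10287

# bfloat16 stores each value in 2 bytes.
bytes_per_bf16 = 2
parameter_bytes = num_parameters * bytes_per_bf16

# Bytes not explained by the parameters alone (assumed to be buffers).
remainder_bytes = reported_footprint_bytes - parameter_bytes

total_samples = truncated + non_truncated
truncated_share = truncated / total_samples

print(f"parameter bytes:  {parameter_bytes}")   # 988065536
print(f"remainder bytes:  {remainder_bytes}")   # 128
print(f"total samples:    {total_samples}")     # 14150
print(f"truncated share:  {truncated_share:.1%}")
```

The total of 14150 matches the Non-Padded Samples count, and roughly 27% of the samples were truncated under the 512-token maximum length.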

Open Portuguese LLM Leaderboard Evaluation Results

Detailed results can be found here and on the 🚀 Open Portuguese LLM Leaderboard

Metric	Value
Average	50.74
ENEM Challenge (No Images)	37.86
BLUEX (No Images)	34.63
OAB Exams	33.12
Assin2 RTE	86.30
Assin2 STS	54.30
FaQuAD NLI	65.33
HateBR Binary	44.06
PT Hate Speech Binary	55.10
tweetSentBR	45.96
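
The Average row appears to be the unweighted mean of the nine task scores. That reading is inferred from the numbers, not stated in the card. A quick check:

```python
# Scores copied from the leaderboard table above.
leaderboard_scores = {
    "ENEM Challenge (No Images)": 37.86,
    "BLUEX (No Images)": 34.63,
    "OAB Exams": 33.12,
    "Assin2 RTE": 86.30,
    "Assin2 STS": 54.30,
    "FaQuAD NLI": 65.33,
    "HateBR Binary": 44.06,
    "PT Hate Speech Binary": 55.10,
    "tweetSentBR": 45.96,
}

# Unweighted mean across tasks.
average = sum(leaderboard_scores.values()) / len(leaderboard_scores)
print(f"{average:.2f}")  # 50.74
```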