🚀 CantoneseLLMChat-v1.0-32B
CantoneseLLMChat-v1.0-32B is the first-generation Cantonese LLM from hon9kon9ize, excelling in Hong Kong-specific knowledge and Cantonese conversation.

🚀 Quick Start
CantoneseLLMChat v1.0 is the first-generation Cantonese LLM from hon9kon9ize. Building on the success of the v0.5 preview, the model excels in Hong Kong-specific knowledge and Cantonese conversation.
✨ Features
- Enhanced Knowledge: Specialized in Hong Kong-specific knowledge.
- Fluent Conversation: Capable of smooth Cantonese conversations.
📦 Installation
The model card lists no dedicated installation steps; the usage example below only assumes a working PyTorch environment with the 🤗 Transformers library (for example, `pip install torch transformers`), plus `accelerate` for `device_map="auto"`.
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "hon9kon9ize/CantoneseLLMChat-v1.0-32B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

def chat(messages, temperature=0.9, max_new_tokens=200):
    # Build the prompt with the model's chat template and move it to the GPU.
    input_ids = tokenizer.apply_chat_template(conversation=messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda:0")
    # do_sample=True is needed for the temperature setting to take effect.
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens, temperature=temperature, do_sample=True)
    # Decode only the newly generated tokens that follow the prompt.
    response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=False)
    return response

prompt = "邊個係香港特首?"  # "Who is the Chief Executive of Hong Kong?"
messages = [
    {"role": "system", "content": "you are a helpful assistant."},
    {"role": "user", "content": prompt},
]

print(chat(messages))
```
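Running the 32B model in bfloat16 needs roughly 64 GB of accelerator memory for the weights alone. If that is not available, one common option is 4-bit loading through bitsandbytes; the sketch below uses the standard 🤗 Transformers `BitsAndBytesConfig` API and assumes the `bitsandbytes` and `accelerate` packages are installed (quantized loading is not documented on the model card itself):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "hon9kon9ize/CantoneseLLMChat-v1.0-32B"

# 4-bit NF4 weight quantization; computation still runs in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
# The chat() helper above works unchanged with the quantized model,
# at some cost in output quality compared with bfloat16.
```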
📚 Documentation
Model description
The base model is obtained by continuous pre-training of Qwen 2.5 32B on 600 million publicly available Hong Kong news articles and Cantonese websites. The instruction fine-tuned model is then trained on a dataset of 75,000 instruction pairs; 45,000 of these are Cantonese instructions generated by other LLMs and reviewed by humans.
The model was trained with 16 Nvidia H100 96GB HBM2e GPUs on the Genkai Supercomputer.
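The Base Model row in the table below names a separate continued-pretrained checkpoint, hon9kon9ize/CantoneseLLM-v1.0-32B-cpt. If it is available, a minimal sketch of loading it, assuming it exposes the same Transformers causal-LM interface as the chat checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Base (continued-pretrained) checkpoint; it is not instruction-tuned,
# so use plain text completion rather than the chat template shown above.
base_id = "hon9kon9ize/CantoneseLLM-v1.0-32B-cpt"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

inputs = tokenizer("香港係一個", return_tensors="pt").to("cuda:0")  # "Hong Kong is a ..."
outputs = base_model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```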
🔧 Technical Details
The model's performance is evaluated on the HK-Eval benchmark. Among open-source LLMs it shows excellent understanding of Cantonese and Hong Kong culture; however, reasoning models still perform better, and the development team is currently working on reasoning models for v2.
| Property | Details |
|----------|---------|
| Model Type | CantoneseLLMChat-v1.0-32B |
| Base Model | hon9kon9ize/CantoneseLLM-v1.0-32B-cpt |
| Training Data | 600 million publicly available Hong Kong news articles and Cantonese websites for pre-training; a dataset of 75,000 instruction pairs for fine-tuning |
| Training Hardware | 16 Nvidia H100 96GB HBM2e GPUs on the Genkai Supercomputer |
HK-Eval benchmark results:

| Model | HK Culture (zero-shot) | Cantonese Linguistics |
|-------|------------------------|-----------------------|
| CantoneseLLMChat v0.5 6B | 52.0% | 12.8% |
| CantoneseLLMChat v0.5 34B | 72.5% | 54.5% |
| CantoneseLLMChat v1.0 3B | 56.0% | 45.7% |
| CantoneseLLMChat v1.0 7B | 60.3% | 46.5% |
| CantoneseLLMChat v1.0 32B | 69.8% | 52.7% |
| CantoneseLLMChat v1.0 72B | 75.4% | 59.6% |
| Llama 3.1 8B Instruct | 45.6% | 35.1% |
| Llama 3.1 70B Instruct | 63.0% | 50.3% |
| Qwen2.5 7B Instruct | 51.2% | 30.3% |
| Qwen2.5 32B Instruct | 59.9% | 45.1% |
| Qwen2.5 72B Instruct | 65.9% | 45.9% |
| Claude 3.5 Sonnet | 71.7% | 63.2% |
| DeepSeek R1 | 88.8% | 77.5% |
| Gemini 2.0 Flash | 80.2% | 75.3% |
| Gemini 2.5 Pro | 92.1% | 87.3% |
| GPT4o | 77.5% | 63.8% |
| GPT4o-mini | 55.6% | 57.3% |
📄 License
The license is listed as "other".