# 🚀 CantoneseLLMChat-v1.0-7B

The first-generation Cantonese LLM from hon9kon9ize, excelling in Hong Kong-specific knowledge and Cantonese conversation.

CantoneseLLMChat v1.0 is the first-generation Cantonese LLM from hon9kon9ize. Building on the success of the v0.5 preview, the model excels in Hong Kong-specific knowledge and Cantonese conversation.
## ✨ Features

- Specific Knowledge: Specialized in Hong Kong-related knowledge.
- Language Proficiency: Skilled in Cantonese conversation.
## 📚 Documentation

### Model description
The base model was obtained by continuous pre-training of Qwen 2.5 7B on 600 million publicly available Hong Kong news articles and Cantonese websites. The instruction fine-tuned model was then trained on a dataset of 75,000 instruction pairs, 45,000 of which are Cantonese instructions generated by other LLMs and reviewed by humans.

The model was trained on one Nvidia H100 80GB HBM3 GPU on the Genkai Supercomputer.
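The exact fine-tuning data format is not published, so as a rough illustration only, one Cantonese instruction pair can be pictured as a user/assistant message pair in the chat-message structure that the Qwen 2.5 chat template consumes. The question and answer below are hypothetical examples, not taken from the dataset:

```python
# Hypothetical illustration of one Cantonese instruction pair in chat-message
# form; the actual dataset format is not published.
instruction_pair = [
    {"role": "user", "content": "香港一共有幾多個行政分區?"},        # "How many districts does Hong Kong have in total?"
    {"role": "assistant", "content": "香港一共有十八個行政分區。"},  # "Hong Kong has eighteen districts in total."
]
```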
### Model Information

| Property | Details |
|----------|---------|
| Base Model | Qwen 2.5 7B |
| Training Data | 600 million Hong Kong news articles and Cantonese websites for pre-training; 75,000 instruction pairs for fine-tuning (45,000 Cantonese instructions generated by other LLMs and reviewed by humans) |
| Training Hardware | 1 Nvidia H100 80GB HBM3 GPU on the Genkai Supercomputer |
## 💻 Usage Examples

### Basic Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "hon9kon9ize/CantoneseLLMChat-v1.0-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

def chat(messages, temperature=0.9, max_new_tokens=200):
    # Build the prompt with the model's chat template and move it to the model's device.
    input_ids = tokenizer.apply_chat_template(
        conversation=messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    # do_sample=True is needed for `temperature` to take effect.
    output_ids = model.generate(
        input_ids, max_new_tokens=max_new_tokens, temperature=temperature, do_sample=True
    )
    # Decode only the newly generated tokens, not the prompt.
    response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
    return response

prompt = "邊個係香港特首?"  # "Who is the Chief Executive of Hong Kong?"
messages = [
    {"role": "system", "content": "you are a helpful assistant."},
    {"role": "user", "content": prompt},
]

print(chat(messages))
```
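In bfloat16, the 7B weights need roughly 15 GB of GPU memory. If that does not fit, one common option (not specific to this model) is 4-bit quantization via bitsandbytes. The sketch below assumes bitsandbytes is installed and a CUDA GPU is available; quantization may cost some output quality:

```python
# Optional: load the model in 4-bit to roughly quarter its memory footprint.
# Assumes `pip install bitsandbytes` and a CUDA GPU.
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype for the dequantized matmuls
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```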
## 📈 Performance

On the HK-Eval benchmark, this model is the best-in-class open-source LLM of its size at understanding Cantonese and Hong Kong culture. As the results below show, however, reasoning models perform dramatically better than their non-reasoning counterparts; we are currently working on reasoning models for v2.
| Model | HK Culture (zero-shot) | Cantonese Linguistics |
|-------|------------------------|-----------------------|
| CantoneseLLMChat v0.5 6B | 52.0% | 12.8% |
| CantoneseLLMChat v0.5 34B | 72.5% | 54.5% |
| CantoneseLLMChat v1.0 3B | 56.0% | 45.7% |
| CantoneseLLMChat v1.0 7B | 60.3% | 46.5% |
| CantoneseLLMChat v1.0 32B | 69.8% | 52.7% |
| CantoneseLLMChat v1.0 72B | 75.4% | 59.6% |
| Llama 3.1 8B Instruct | 45.6% | 35.1% |
| Llama 3.1 70B Instruct | 63.0% | 50.3% |
| Qwen2.5 7B Instruct | 51.2% | 30.3% |
| Qwen2.5 32B Instruct | 59.9% | 45.1% |
| Qwen2.5 72B Instruct | 65.9% | 45.9% |
| Claude 3.5 Sonnet | 71.7% | 63.2% |
| DeepSeek R1 | 88.8% | 77.5% |
| Gemini 2.0 Flash | 80.2% | 75.3% |
| Gemini 2.5 Pro | 92.1% | 87.3% |
| GPT-4o | 77.5% | 63.8% |
| GPT-4o mini | 55.6% | 57.3% |
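HK-Eval itself is not distributed with this model, but for readers who want to set up a comparable zero-shot evaluation, the sketch below shows one generic way to score a multiple-choice benchmark with the `chat()` helper from the usage example. The `questions` list and the answer-matching rule are placeholders, not the actual HK-Eval data or protocol:

```python
# Generic zero-shot multiple-choice scoring loop (hypothetical; not HK-Eval's
# actual protocol). `questions` is a placeholder list of dicts with a Cantonese
# "prompt" and the expected choice letter under "answer".
def zero_shot_accuracy(questions):
    correct = 0
    for q in questions:
        messages = [{"role": "user", "content": q["prompt"]}]
        # Low temperature keeps the sampled answer close to greedy decoding.
        reply = chat(messages, temperature=0.1, max_new_tokens=10)
        if q["answer"] in reply:  # crude match on the expected choice letter
            correct += 1
    return correct / len(questions)
```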
## 📄 License

The license for this model is "other".