Open-source gemma-2-2b-it-chinese-kyara-dpo model - Enhanced knowledge retrieval, optimized responses in Traditional Chinese

Gemma 2 2b It Chinese Kyara Dpo

Developed by zake7749

Kyara is a language model fine-tuning project enhanced by knowledge retrieval, focusing on improving the model's performance on languages with fewer resources such as Traditional Chinese.

Large Language Model

Transformers

Supports Multiple Languages#Traditional Chinese optimization #Retrieval Augmented Generation #Efficient with small parameters

Downloads 2,334

Release Time : 8/18/2024

Model Overview

Kyara improves the language model through knowledge adaptive retrieval enhancement technology, especially optimizing for languages with fewer resources such as Traditional Chinese, and enhancing the model's knowledge adaptation ability and language understanding ability.

Model Features

Knowledge adaptive retrieval enhancement

Through Retrieval Augmented Generation (RAG) technology, relevant knowledge is incorporated during the supervised fine-tuning stage to enhance the model's knowledge adaptation ability.

Multilingual support

Particularly optimize the performance of Traditional Chinese, while supporting Simplified Chinese and English.

High-quality data training

Integrate multiple high-quality open-source datasets, and perform semantic deduplication and strict quality control.

Preference learning optimization

Adopt Direct Preference Optimization (DPO) technology to make the model's responses more in line with human preferences.

Model Capabilities

Text generation

Knowledge Q&A

Language understanding

Mathematical reasoning

Logical reasoning

Writing assistance

Use Cases

Education

Chinese learning assistance

Help students understand the differences between Traditional Chinese and Simplified Chinese and provide language learning support.

Showed significant improvement in Chinese evaluations

Mathematics problem solving

Solve various mathematical problems, especially Chinese math questions.

Performed excellently in GSM8K and MATH-L5 benchmark tests

Research

Knowledge-intensive Q&A

Answer questions that require professional knowledge, especially in the Chinese field.

Performed better than the original model in knowledge benchmark tests such as TMMLUPlus

Content creation

Chinese writing assistance

Help creators generate or optimize Chinese content.

Scored high in writing ability evaluations

🚀 Kyara: Knowledge Yielding Adaptive Retrieval Augmentation for LLM Fine-tuning

Kyara is an experimental project that aims to enhance language models through knowledge retrieval. It improves the model's knowledge adaptation and language comprehension, especially for underrepresented languages like Traditional Chinese.

🤗 Hugging Face ｜ 🚀Github ｜ 📑 Paper ｜ 📖 English ｜ 📖 Chinese ｜ 💻 Kaggle Notebook

🚀 Quick Start

Kyara (Knowledge Yielding Adaptive Retrieval Augmentation) is an experimental project aimed at improving language models through knowledge retrieval processes. The project seeks to enhance the model’s ability to adapt knowledge and improve language comprehension, particularly in underrepresented languages like Traditional Chinese. Given the relatively scarce availability of Traditional Chinese data compared to the vast corpus of English data used for model training, Kyara addresses this gap by expanding the limited corpus for this language.

To validate Kyara's effectiveness, we conducted full-parameter fine-tuning on Gemma-2-2b-it, resulting in the first iteration of the Kyara model. Initial evaluation results, as detailed in the Benchmark section, demonstrate that Kyara outperforms the original Gemma-2-2b-it across various benchmarks, with notable improvements in Chinese language evaluations.

✨ Features

Retrieval Augmented Generation (Experimental)

Benefiting from Kyara's training method, we incorporated RAG-related content during the SFT phase. You can refer to the following examples to construct task templates.

📚 Documentation

Benchmark

General Benchmark

The following evaluations are based on zero-shot.

Property	Details
TMMLUPlus	Kyara-2b-it: 41.98; Gemma-2-2b-it: 36.73
- STEM	Kyara-2b-it: 43.73; Gemma-2-2b-it: 37.84
- Humanities	Kyara-2b-it: 38.72; Gemma-2-2b-it: 33.40
- Other	Kyara-2b-it: 40.61; Gemma-2-2b-it: 36.00
- Social-Science	Kyara-2b-it: 44.88; Gemma-2-2b-it: 39.69
MMLU-Redux	Kyara-2b-it: 55.44; Gemma-2-2b-it: 51.94
GSM8K	Kyara-2b-it: 54.21; Gemma-2-2b-it: 51.63
MATH-L5	Kyara-2b-it: 8.88; Gemma-2-2b-it: 4.3
CRUX	Kyara-2b-it: 22.75; Gemma-2-2b-it: 21.5
ZebraLogic	Kyara-2b-it: 5.2; Gemma-2-2b-it: 4.2
Chinese-Reason-Bench	Kyara-2b-it: 4.21; Gemma-2-2b-it: 3.44

The aggregation method for the groups in TMMLUPlus is macro average, following the practice in the official implementation.

Open-LLM Leaderboard

As of now, Kyara-2b-it is the leading competitor among all 2b-scale models on the OpenLLM Leaderboard.

Alignment Benchmark

Property	Kyara	Gemma-2-2b-it	ChatGPT-3.5-1106
AlpacaEval-LC	35.35	32.37	19.30
AlpacaEval	43.34	32.94	9.20
MT-Bench-TW	7.43	6.35	7.10
MT-Bench	8.28	8.17	8.32
Chatbot-Arena-Hard	22.60	19.4	18.87

AlignBench

Fold	Kyara-2b-it-CHT	Kyara-2b-it-CHS	Gemma-2-2b-it	ChatGPT-3.5-0613
Fundamental Language Ability	6.72	6.54	6.42	6.92
Advanced Chinese Understanding	5.78	5.24	5.03	5.91
Open-ended Questions	8.16	7.79	7.52	6.47
Writing Ability	7.90	7.24	7.76	7.28
Logical Reasoning	5.26	4.27	4.20	4.79
Mathematics	5.99	5.44	5.05	5.38
Task-oriented Role Play	8.07	8.00	7.42	7.00
Professional Knowledge	6.97	6.86	5.79	6.81
Reasoning AVG.	5.62	4.85	4.63	5.00
Chinese Language AVG.	7.26	6.94	6.66	6.73
Overall	6.44	5.90	5.64	5.91

where the postfixes CHT and CHS represent Traditional Chinese and Simplified Chinese, respectively. To evaluate the performance on Traditional Chinese in AlignBench, we used OpenCC with the s2tw configuration to convert all questions from Simplified Chinese to Traditional Chinese.

Usage

Kyara adopts the same architecture as Gemma2, utilizing identical inference and training methods. We have created a Jupyter Notebook on Kaggle to demonstrate Kyara’s basic functionality. For service-level deployment, we recommend using Sglang or vllm to achieve greater throughput and robustness.

Method

Dataset Summary

We have collected a total of 3.6M conversations, approximately 4.51 billion tokens. The following provides an overview of the language distribution and conversation rounds.

Language:
Conversation Rounds:

Dataset Construction

The data construction for Kyara is divided into two parts: English and Chinese. For the English part, we have incorporated multiple high-quality open-source datasets, such as teknium/OpenHermes-2.5 and arcee-ai/The-Tome, and performing semantic deduplication to drop out near-similar examples. As for the Chinese part, the construction follows the process outlined below.

Base Dataset: Knowledge Injection with Retrieval Augmentation

We developed a knowledge search system using open Chinese knowledge corpora, integrated with QDrant. To construct Supervised Fine-Tuning(SFT) pairs, we followed this process:

Sample documents from the knowledge base and generate knowledge-intensive questions that users might ask based on these texts.
(Optional) Increase instruction complexity using Evol-Instruct.
Apply query expansion on the generated instructions to retrieve additional Top K documents and individually assess their relevance:
- For relevant documents, use an LLM to summarize key information related to the question.
- For irrelevant documents, ignore them.
Let the LLM generate a detailed and comprehensive response according to the original document and K supplementary references.

Besides, we would also ask the LLM to generate a user prompt for high-quality documents, and pair the (generated prompt, original document) as an SFT example.

Chinese Math Dataset

Dataset: zake7749/kyara-chinese-math-sft-s0-30K

While the aforementioned strategy can generate a wide range of knowledge-based texts, it primarily falls within the scope of information-seeking tasks and is not very effective in constructing mathematical and reasoning-related content. To address this, we generated 50,000 math problems based on PersonaHub. We then used Gemini-1.5-Flash to filter out data with obvious errors in calculation and reasoning, thereby creating kyara-chinese-math-sft-s0-30K.

High Quality Dataset: Model Refinement

After completing supervised learning using the base dataset, we will fine-tune the LLM again on a high-quality subset, primarily to address the following three issues:

Some responses in the Base Dataset were generated from small models, which sometimes performed poorly in following instructions.
We used various LLMs in the previous step to introduce knowledge diversity and language adaptability. However, we discovered subtle differences in response templates and reasoning approaches between different LLMs, leading to occasional instability in the trained Chat Model. Therefore, we would like to introduce a high-quality small dataset, using a single strong LLM to generate QA Pairs.
The Base Dataset includes some Q&A Pairs composed of generated queries and original documents. While these data are rich in knowledge, they are relatively weak in terms of instruction following.

To balance data diversity and quality, we adopted a strategy similar to InsTag to classify the data. We then used ArmoRM and an LLM Judge to evaluate data quality, finally extracting the best training data from each category to create the Stage 1 Dataset of about 500K, which was used to fine-tune the Kyara-SFT Model again.

Preference Learning

We introduced Preference Learning in Kyara, which allows the model's responses to better align with human preferences while enhancing programming skills and mathematical reasoning abilities.

Kyara’s preference learning strategy utilizes Direct Preference Optimization (DPO), integrating two custom-built Chinese datasets alongside two English datasets.

Here, we summarize the construction strategy of the Chinese datasets.

Chinese DPO

SPIN/SPPO

We followed the original design, using Kyara-SFT to generate a set of contrastive data for the High Quality Dataset.

RLAIF

Dataset: zake7749/kyara-chinese-preference-dpo-s0-30K

We extracted Chinese Prompts from Magpie-Align/Magpie-Qwen2-Pro-200K-Chinese, hfl/stem_zh_instruction, and FreedomIntelligence/Evol-Instruct-Chinese-GPT4, and distributed the same prompt to four different LLMs. The competitors include:

GPT-4o
GPT-4-0618
ChatGPT-3.5-0513
Claude-Sonnet-3.5
Yi-Large
Mixtral 8x22B
Gemini-Flash
Qwen2-72B-Instruct
DeepSeek V2

After response generation, we ask the LLMs to judge which one is better, using the following prompt:

**[Task]**

Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. Your evaluation should consider correctness and helpfulness.  
1. First, independently solve the user question step-by-step.  
2. Then, compare both assistants’ answers with your answer. Identify and correct any mistakes.  
3. Do not allow the length of the responses to influence your evaluation.  
4. Be as objective as possible.

After providing your explanation, output your final verdict by strictly following this format: "[[A]]" if assistant A is better, "[[B]]" if assistant B is better, and "[[C]]" for a tie or if both A and B are bad.  

If the answers from A and B are very similar in terms of correctness, helpfulness, and relevance, meaning there is no "obvious" winner, judge it as a tie and output [[C]].

**[User Question]**  
{prompt}

---

**[Assistant A’s Answer]**  
{answer}

---

**[Assistant B’s Answer]**  
{prediction}

---

Finally, all four datasets were combined for DPO training.

💻 Usage Examples

Basic Usage

The input example for constructing task templates:

# Reference Document
<reference>
<document>
Document ID: id_27025b13
* Document Title: Flash_memory
* Document Text:
Another limitation of flash memory is its limited number of erase cycles (most commercial SLC flash memory guarantees around 100,000 erase cycles for the "0" zone, but due to manufacturing precision, other blocks are not guaranteed, and some might even have factory defects and be unusable). This limitation is partly offset by firmware or file systems that calculate write counts and perform dynamic remapping to distribute writes across different blocks; this technique is called wear leveling. Another method is known as Bad Block Management (BBM), where blocks are dynamically tested during write operations, and failed blocks are discarded. For most mobile devices, these wear management techniques can extend the life of internal flash memory (sometimes even beyond the device's lifespan). Moreover, partial data loss in these devices may be acceptable. However, for high-reliability data storage applications that require heavy data write cycles, flash memory is not recommended. But this limitation does not apply to read-only applications, such as routers and thin clients, which often only write once or a few times throughout their lifespan.

### Read Disturbance
</document>
<document>
Document ID: id_858b1787
* Document Title: Flash_memory
* Document Text:
* TLC NAND flash memory typically has an endurance of around 1,000 or more cycles (Samsung 840); using multi-layer structures and adopting LDPC correction have extended the endurance.
* QLC NAND flash memory can have an endurance ranging from 500 to 1,000 cycles.
* SLC floating-gate NOR flash memory typically has a write endurance of 100,000 to 1,000,000 cycles (Numonyx M58BW 100k; Spansion S29CD016J 1,000k).
* MLC floating-gate NOR flash memory usually has a write endurance of 100,000 cycles (Numonyx J3 flash).

These values are approximate and depend on the technology and positioning of different manufacturers' products. Finer process technologies can improve read/write performance and capacity, but they may also pose greater challenges in terms of write endurance. Specific algorithms and design examples, such as wear leveling and memory over-provisioning, can be used to adjust storage system endurance to meet specific needs. Wear leveling is essential for ensuring the lifespan of flash memory products, and it is supported in products like USB flash drives and SSDs.

## Flash Memory File Systems
</document>
<document>
Document ID: id_df34eb65
* Document Title: Memory_over-provisioning
* Document Text:
## Basic SSD Operations

Due to the nature of flash memory operations, data cannot be overwritten directly like in hard drives. When data is first written to an SSD, the cells are in an erased state, so the data can be written directly, one page at a time (usually 4 to 8 KB in size). The SSD controller, which manages the flash memory and interfaces with the main control system, uses a logical-to-physical mapping system called Logical Block Addressing (LBA), part of the flash translation layer (FTL). When new data needs to replace old data, the SSD controller writes the new data to a new location and updates the logical mapping to point to the new physical location. The original data becomes invalid and must be erased before it can be rewritten.

Flash memory has a limited number of program/erase (P/E) cycles. Typically, this is expressed as the maximum number of P/E cycles that flash memory can endure over its lifetime. Single-level cell (SLC) flash memory is generally designed for high performance and long life, typically supporting 50,000 to 100,000 cycles. As of 2011, multi-level cell (MLC) flash memory, designed for low-cost applications, has far fewer cycles, usually only 3,000 to 5,000 cycles. Since 2013, triple-level cell (TLC) flash memory has been introduced, with P/E cycles dropping to around 1,000. The lower the write amplification, the better, as it corresponds to fewer P/E cycles, which extends the lifespan of the SSD.
</document>
</reference>

---

# Task Description
Please refer to the content in the <reference> above and answer the user's quest

🔧 Technical Details

Model Information

Property	Details
Library Name	transformers
Base Model	google/gemma-2-2b-it
Datasets	zake7749/kyara-chinese-math-sft-s0-30K, zake7749/kyara-chinese-preference-rl-dpo-s0-30K, zake7749/chinese-sft-stem-zh-hant, zake7749/chinese-sft-stem-zh-hans
Model Name	gemma-2-2b-it-chinese-kyara-dpo
Results - Task Type	text-generation
Results - Dataset Name	IFEval (0-Shot), BBH (3-Shot), MATH Lvl 5 (4-Shot), GPQA (0-shot), MuSR (0-shot), MMLU-PRO (5-shot)
Results - Metrics	strict accuracy (53.82), normalized accuracy (19.06), exact match (6.12), acc_norm (2.24, 16.76), accuracy (17.48)
Results - Source	Open LLM Leaderboard

📄 License

The license for this project is Gemma.

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご