Model Overview
Model Features
Model Capabilities
Use Cases
🚀 Kyara: Knowledge Yielding Adaptive Retrieval Augmentation for LLM Fine-tuning
Kyara is an experimental project that aims to enhance language models through knowledge retrieval. It improves the model's knowledge adaptation and language comprehension, especially for underrepresented languages like Traditional Chinese.
🤗 Hugging Face | 🚀Github | 📑 Paper | 📖 English | 📖 Chinese | 💻 Kaggle Notebook

🚀 Quick Start
Kyara (Knowledge Yielding Adaptive Retrieval Augmentation) is an experimental project aimed at improving language models through knowledge retrieval processes. The project seeks to enhance the model’s ability to adapt knowledge and improve language comprehension, particularly in underrepresented languages like Traditional Chinese. Given the relatively scarce availability of Traditional Chinese data compared to the vast corpus of English data used for model training, Kyara addresses this gap by expanding the limited corpus for this language.
To validate Kyara's effectiveness, we conducted full-parameter fine-tuning on Gemma-2-2b-it
, resulting in the first iteration of the Kyara model. Initial evaluation results, as detailed in the Benchmark section, demonstrate that Kyara outperforms the original Gemma-2-2b-it
across various benchmarks, with notable improvements in Chinese language evaluations.
✨ Features
Retrieval Augmented Generation (Experimental)
Benefiting from Kyara's training method, we incorporated RAG-related content during the SFT phase. You can refer to the following examples to construct task templates.
📚 Documentation
Benchmark
General Benchmark
The following evaluations are based on zero-shot.
Property | Details |
---|---|
TMMLUPlus | Kyara-2b-it: 41.98; Gemma-2-2b-it: 36.73 |
- STEM | Kyara-2b-it: 43.73; Gemma-2-2b-it: 37.84 |
- Humanities | Kyara-2b-it: 38.72; Gemma-2-2b-it: 33.40 |
- Other | Kyara-2b-it: 40.61; Gemma-2-2b-it: 36.00 |
- Social-Science | Kyara-2b-it: 44.88; Gemma-2-2b-it: 39.69 |
MMLU-Redux | Kyara-2b-it: 55.44; Gemma-2-2b-it: 51.94 |
GSM8K | Kyara-2b-it: 54.21; Gemma-2-2b-it: 51.63 |
MATH-L5 | Kyara-2b-it: 8.88; Gemma-2-2b-it: 4.3 |
CRUX | Kyara-2b-it: 22.75; Gemma-2-2b-it: 21.5 |
ZebraLogic | Kyara-2b-it: 5.2; Gemma-2-2b-it: 4.2 |
Chinese-Reason-Bench | Kyara-2b-it: 4.21; Gemma-2-2b-it: 3.44 |
The aggregation method for the groups in TMMLUPlus is macro average, following the practice in the official implementation.
Open-LLM Leaderboard
As of now, Kyara-2b-it is the leading competitor among all 2b-scale models on the OpenLLM Leaderboard.

Alignment Benchmark
Property | Kyara | Gemma-2-2b-it | ChatGPT-3.5-1106 |
---|---|---|---|
AlpacaEval-LC | 35.35 | 32.37 | 19.30 |
AlpacaEval | 43.34 | 32.94 | 9.20 |
MT-Bench-TW | 7.43 | 6.35 | 7.10 |
MT-Bench | 8.28 | 8.17 | 8.32 |
Chatbot-Arena-Hard | 22.60 | 19.4 | 18.87 |
AlignBench
Fold | Kyara-2b-it-CHT | Kyara-2b-it-CHS | Gemma-2-2b-it | ChatGPT-3.5-0613 |
---|---|---|---|---|
Fundamental Language Ability | 6.72 | 6.54 | 6.42 | 6.92 |
Advanced Chinese Understanding | 5.78 | 5.24 | 5.03 | 5.91 |
Open-ended Questions | 8.16 | 7.79 | 7.52 | 6.47 |
Writing Ability | 7.90 | 7.24 | 7.76 | 7.28 |
Logical Reasoning | 5.26 | 4.27 | 4.20 | 4.79 |
Mathematics | 5.99 | 5.44 | 5.05 | 5.38 |
Task-oriented Role Play | 8.07 | 8.00 | 7.42 | 7.00 |
Professional Knowledge | 6.97 | 6.86 | 5.79 | 6.81 |
Reasoning AVG. | 5.62 | 4.85 | 4.63 | 5.00 |
Chinese Language AVG. | 7.26 | 6.94 | 6.66 | 6.73 |
Overall | 6.44 | 5.90 | 5.64 | 5.91 |
where the postfixes CHT and CHS represent Traditional Chinese and Simplified Chinese, respectively. To evaluate the performance on Traditional Chinese in AlignBench, we used OpenCC with the s2tw
configuration to convert all questions from Simplified Chinese to Traditional Chinese.
Usage
Kyara adopts the same architecture as Gemma2, utilizing identical inference and training methods. We have created a Jupyter Notebook on Kaggle to demonstrate Kyara’s basic functionality. For service-level deployment, we recommend using Sglang or vllm to achieve greater throughput and robustness.
Method
Dataset Summary
We have collected a total of 3.6M conversations, approximately 4.51 billion tokens. The following provides an overview of the language distribution and conversation rounds.
-
Language:
-
Conversation Rounds:
Dataset Construction
The data construction for Kyara is divided into two parts: English and Chinese. For the English part, we have incorporated multiple high-quality open-source datasets, such as teknium/OpenHermes-2.5 and arcee-ai/The-Tome, and performing semantic deduplication to drop out near-similar examples. As for the Chinese part, the construction follows the process outlined below.
Base Dataset: Knowledge Injection with Retrieval Augmentation
We developed a knowledge search system using open Chinese knowledge corpora, integrated with QDrant. To construct Supervised Fine-Tuning(SFT) pairs, we followed this process:
- Sample documents from the knowledge base and generate knowledge-intensive questions that users might ask based on these texts.
- (Optional) Increase instruction complexity using Evol-Instruct.
- Apply query expansion on the generated instructions to retrieve additional Top K documents and individually assess their relevance:
- For relevant documents, use an LLM to summarize key information related to the question.
- For irrelevant documents, ignore them.
- Let the LLM generate a detailed and comprehensive response according to the original document and K supplementary references.
Besides, we would also ask the LLM to generate a user prompt for high-quality documents, and pair the (generated prompt, original document) as an SFT example.
Chinese Math Dataset
While the aforementioned strategy can generate a wide range of knowledge-based texts, it primarily falls within the scope of information-seeking tasks and is not very effective in constructing mathematical and reasoning-related content. To address this, we generated 50,000 math problems based on PersonaHub. We then used Gemini-1.5-Flash
to filter out data with obvious errors in calculation and reasoning, thereby creating kyara-chinese-math-sft-s0-30K.
High Quality Dataset: Model Refinement
After completing supervised learning using the base dataset, we will fine-tune the LLM again on a high-quality subset, primarily to address the following three issues:
- Some responses in the Base Dataset were generated from small models, which sometimes performed poorly in following instructions.
- We used various LLMs in the previous step to introduce knowledge diversity and language adaptability. However, we discovered subtle differences in response templates and reasoning approaches between different LLMs, leading to occasional instability in the trained Chat Model. Therefore, we would like to introduce a high-quality small dataset, using a single strong LLM to generate QA Pairs.
- The Base Dataset includes some Q&A Pairs composed of generated queries and original documents. While these data are rich in knowledge, they are relatively weak in terms of instruction following.
To balance data diversity and quality, we adopted a strategy similar to InsTag to classify the data. We then used ArmoRM and an LLM Judge to evaluate data quality, finally extracting the best training data from each category to create the Stage 1 Dataset of about 500K, which was used to fine-tune the Kyara-SFT Model again.
Preference Learning
We introduced Preference Learning in Kyara, which allows the model's responses to better align with human preferences while enhancing programming skills and mathematical reasoning abilities.
Kyara’s preference learning strategy utilizes Direct Preference Optimization (DPO), integrating two custom-built Chinese datasets alongside two English datasets.
Here, we summarize the construction strategy of the Chinese datasets.
Chinese DPO
SPIN/SPPO
We followed the original design, using Kyara-SFT to generate a set of contrastive data for the High Quality Dataset.
RLAIF
Dataset: zake7749/kyara-chinese-preference-dpo-s0-30K
We extracted Chinese Prompts from Magpie-Align/Magpie-Qwen2-Pro-200K-Chinese
, hfl/stem_zh_instruction
, and FreedomIntelligence/Evol-Instruct-Chinese-GPT4
, and distributed the same prompt to four different LLMs. The competitors include:
- GPT-4o
- GPT-4-0618
- ChatGPT-3.5-0513
- Claude-Sonnet-3.5
- Yi-Large
- Mixtral 8x22B
- Gemini-Flash
- Qwen2-72B-Instruct
- DeepSeek V2
After response generation, we ask the LLMs to judge which one is better, using the following prompt:
**[Task]**
Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. Your evaluation should consider correctness and helpfulness.
1. First, independently solve the user question step-by-step.
2. Then, compare both assistants’ answers with your answer. Identify and correct any mistakes.
3. Do not allow the length of the responses to influence your evaluation.
4. Be as objective as possible.
After providing your explanation, output your final verdict by strictly following this format: "[[A]]" if assistant A is better, "[[B]]" if assistant B is better, and "[[C]]" for a tie or if both A and B are bad.
If the answers from A and B are very similar in terms of correctness, helpfulness, and relevance, meaning there is no "obvious" winner, judge it as a tie and output [[C]].
**[User Question]**
{prompt}
---
**[Assistant A’s Answer]**
{answer}
---
**[Assistant B’s Answer]**
{prediction}
---
Finally, all four datasets were combined for DPO training.
💻 Usage Examples
Basic Usage
The input example for constructing task templates:
# Reference Document
<reference>
<document>
Document ID: id_27025b13
* Document Title: Flash_memory
* Document Text:
Another limitation of flash memory is its limited number of erase cycles (most commercial SLC flash memory guarantees around 100,000 erase cycles for the "0" zone, but due to manufacturing precision, other blocks are not guaranteed, and some might even have factory defects and be unusable). This limitation is partly offset by firmware or file systems that calculate write counts and perform dynamic remapping to distribute writes across different blocks; this technique is called wear leveling. Another method is known as Bad Block Management (BBM), where blocks are dynamically tested during write operations, and failed blocks are discarded. For most mobile devices, these wear management techniques can extend the life of internal flash memory (sometimes even beyond the device's lifespan). Moreover, partial data loss in these devices may be acceptable. However, for high-reliability data storage applications that require heavy data write cycles, flash memory is not recommended. But this limitation does not apply to read-only applications, such as routers and thin clients, which often only write once or a few times throughout their lifespan.
### Read Disturbance
</document>
<document>
Document ID: id_858b1787
* Document Title: Flash_memory
* Document Text:
* TLC NAND flash memory typically has an endurance of around 1,000 or more cycles (Samsung 840); using multi-layer structures and adopting LDPC correction have extended the endurance.
* QLC NAND flash memory can have an endurance ranging from 500 to 1,000 cycles.
* SLC floating-gate NOR flash memory typically has a write endurance of 100,000 to 1,000,000 cycles (Numonyx M58BW 100k; Spansion S29CD016J 1,000k).
* MLC floating-gate NOR flash memory usually has a write endurance of 100,000 cycles (Numonyx J3 flash).
These values are approximate and depend on the technology and positioning of different manufacturers' products. Finer process technologies can improve read/write performance and capacity, but they may also pose greater challenges in terms of write endurance. Specific algorithms and design examples, such as wear leveling and memory over-provisioning, can be used to adjust storage system endurance to meet specific needs. Wear leveling is essential for ensuring the lifespan of flash memory products, and it is supported in products like USB flash drives and SSDs.
## Flash Memory File Systems
</document>
<document>
Document ID: id_df34eb65
* Document Title: Memory_over-provisioning
* Document Text:
## Basic SSD Operations
Due to the nature of flash memory operations, data cannot be overwritten directly like in hard drives. When data is first written to an SSD, the cells are in an erased state, so the data can be written directly, one page at a time (usually 4 to 8 KB in size). The SSD controller, which manages the flash memory and interfaces with the main control system, uses a logical-to-physical mapping system called Logical Block Addressing (LBA), part of the flash translation layer (FTL). When new data needs to replace old data, the SSD controller writes the new data to a new location and updates the logical mapping to point to the new physical location. The original data becomes invalid and must be erased before it can be rewritten.
Flash memory has a limited number of program/erase (P/E) cycles. Typically, this is expressed as the maximum number of P/E cycles that flash memory can endure over its lifetime. Single-level cell (SLC) flash memory is generally designed for high performance and long life, typically supporting 50,000 to 100,000 cycles. As of 2011, multi-level cell (MLC) flash memory, designed for low-cost applications, has far fewer cycles, usually only 3,000 to 5,000 cycles. Since 2013, triple-level cell (TLC) flash memory has been introduced, with P/E cycles dropping to around 1,000. The lower the write amplification, the better, as it corresponds to fewer P/E cycles, which extends the lifespan of the SSD.
</document>
</reference>
---
# Task Description
Please refer to the content in the <reference> above and answer the user's quest
🔧 Technical Details
Model Information
Property | Details |
---|---|
Library Name | transformers |
Base Model | google/gemma-2-2b-it |
Datasets | zake7749/kyara-chinese-math-sft-s0-30K, zake7749/kyara-chinese-preference-rl-dpo-s0-30K, zake7749/chinese-sft-stem-zh-hant, zake7749/chinese-sft-stem-zh-hans |
Model Name | gemma-2-2b-it-chinese-kyara-dpo |
Results - Task Type | text-generation |
Results - Dataset Name | IFEval (0-Shot), BBH (3-Shot), MATH Lvl 5 (4-Shot), GPQA (0-shot), MuSR (0-shot), MMLU-PRO (5-shot) |
Results - Metrics | strict accuracy (53.82), normalized accuracy (19.06), exact match (6.12), acc_norm (2.24, 16.76), accuracy (17.48) |
Results - Source | Open LLM Leaderboard |
📄 License
The license for this project is Gemma.

