TAIDE Language Model
The TAIDE project is dedicated to developing generative AI dialogue engine models that suit Taiwan's language and cultural characteristics, while constructing a trustworthy AI environment. By integrating industry, academia, and research resources, it promotes the development of trustworthy generative AI, enhances Taiwan's international competitiveness, boosts industrial development, and reduces dependence on foreign technologies.
Quick Start
The models developed in this project are based on Meta's LLaMA2-7b. They incorporate text and training materials from various Taiwanese domains to improve the models' performance in Traditional Chinese responses and specific tasks. The publicly released models are as follows:
- TAIDE-LX-7B: A model based on LLaMA2-7b, pre-trained only on Traditional Chinese data (continuous pretraining). It is suitable for scenarios where users need to further fine-tune the model. Note that since the pre-trained model has not been fine-tuned or preference-aligned, it may generate malicious or unsafe outputs. Use it with caution.
- TAIDE-LX-7B-Chat: Based on TAIDE-LX-7B, this model uses instruction tuning to improve its handling of common office tasks and multi-round Q&A conversations. It is suitable for chat or task assistance. A 4-bit quantized version is also available. Quantized models are provided for convenience, but quantization may degrade performance and cause unexpected behavior, so use them with that in mind.
Features
- Enhanced Chinese Character Support: Expanded with 24,720 additional Chinese characters and words to strengthen the model's ability to process Traditional Chinese.
- High-Quality Training Data: Strictly screened training data to improve the trustworthiness and applicability of the model's generated content.
- Optimized for Office Tasks: Enhanced performance in common office tasks such as automatic summarization, letter writing, article writing, and Chinese-English and English-Chinese translation.
- Local Knowledge Enriched: Strengthened with knowledge of Taiwan's local culture, language usage, and national conditions.
- Multi-Round Q&A Capability: Capable of handling multi-round question-answer conversations.
Installation
The example code and documentation will be released on GitHub later.
Usage Examples
Basic Usage
- Tokenizer Setup: The tokenizer must be loaded in slow mode (use_fast=False).
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("taide/TAIDE-LX-7B-Chat", use_fast=False)
- Prompt Templates:
- General Q&A:
f"<s>[INST] {question} [/INST]"
Replace {question} with the user's input.
- With System Prompt:
f"<s>[INST] <<SYS>>\n{sys}\n<</SYS>>\n\n{question} [/INST]"
Replace {sys} with an instruction, e.g., "You are an AI assistant from Taiwan named TAIDE. You are willing to help users from a Taiwanese perspective and will answer questions in Traditional Chinese." Replace {question} with the user's question.
- Multi-Round Q&A:
f"<s>[INST] <<SYS>>\n{sys}\n<</SYS>>\n\n{question1} [/INST] {model_answer_1} </s><s>[INST] {question2} [/INST]"
Replace {sys} with an instruction, {question1} with the user's first question, {model_answer_1} with the model's first answer, and {question2} with the user's second question.
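As a minimal sketch, the three templates above can be produced by a single helper. The function name `build_prompt` and its parameters are hypothetical and not part of any TAIDE release; only the string layout is taken from the templates above.

```python
from typing import List, Optional, Tuple


def build_prompt(
    question: str,
    sys: Optional[str] = None,
    history: Optional[List[Tuple[str, str]]] = None,
) -> str:
    """Assemble a prompt string that follows the templates above.

    `history` holds earlier (user_question, model_answer) pairs, oldest first.
    """
    # The system prompt, if any, is wrapped inside the first [INST] block only.
    sys_block = f"<<SYS>>\n{sys}\n<</SYS>>\n\n" if sys else ""

    parts = []
    is_first_turn = True
    for past_question, past_answer in history or []:
        prefix = sys_block if is_first_turn else ""
        # A completed round is closed with </s> before the next <s> starts.
        parts.append(f"<s>[INST] {prefix}{past_question} [/INST] {past_answer} </s>")
        is_first_turn = False

    # The final, unanswered question is left open for the model to complete.
    prefix = sys_block if is_first_turn else ""
    parts.append(f"<s>[INST] {prefix}{question} [/INST]")
    return "".join(parts)


if __name__ == "__main__":
    print(build_prompt("What is TAIDE?", sys="You are an AI assistant from Taiwan named TAIDE."))
```

Calling `build_prompt(question2, sys=sys, history=[(question1, model_answer_1)])` yields the multi-round layout; omitting `sys` and `history` yields the general Q&A layout.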
- Huggingface Chat Templates:
- General Q&A:
chat = [
{"role": "user", "content": "{question}"},
]
prompt = tokenizer.apply_chat_template(chat)
Replace {question} with the user's input.
- With System Prompt:
chat = [
{"role": "system", "content": "{sys}"},
{"role": "user", "content": "{question}"},
]
prompt = tokenizer.apply_chat_template(chat)
Replace {sys} with an instruction and {question} with the user's question.
- Multi-Round Q&A:
chat = [
{"role": "system", "content": "{sys}"},
{"role": "user", "content": "{question1}"},
{"role": "assistant", "content": "{model_answer_1}"},
{"role": "user", "content": "{question2}"},
]
prompt = tokenizer.apply_chat_template(chat)
Replace {sys} with an instruction, {question1} with the user's first question, {model_answer_1} with the model's first answer, and {question2} with the user's second question.
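The chat-template variants above can be combined into one end-to-end sketch. The helper names `build_chat` and `generate_reply` are hypothetical, and the generation settings (`max_new_tokens`, the dtype choice) are illustrative assumptions; only the model ID and `use_fast=False` come from this document. The heavy imports are kept inside `generate_reply`, so building the message list does not require the model to be downloaded.

```python
from typing import Dict, List, Optional, Tuple

MODEL_ID = "taide/TAIDE-LX-7B-Chat"  # model ID from the tokenizer setup above


def build_chat(
    question: str,
    sys: Optional[str] = None,
    history: Optional[List[Tuple[str, str]]] = None,
) -> List[Dict[str, str]]:
    """Build the message list expected by tokenizer.apply_chat_template."""
    chat: List[Dict[str, str]] = []
    if sys:
        chat.append({"role": "system", "content": sys})
    for past_question, past_answer in history or []:
        chat.append({"role": "user", "content": past_question})
        chat.append({"role": "assistant", "content": past_answer})
    chat.append({"role": "user", "content": question})
    return chat


def generate_reply(chat: List[Dict[str, str]], max_new_tokens: int = 256) -> str:
    """Generate one reply; downloads the model on first use."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=False)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=dtype).to(device)

    # Render the chat to a plain string first, then tokenize it.
    prompt = tokenizer.apply_chat_template(chat, tokenize=False)
    # Assumption: the template already writes the BOS token (as Llama-2-style
    # templates do), so special tokens are not added a second time.
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(device)

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Keep only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0, inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

A call such as `generate_reply(build_chat(question, sys=sys))` covers the system-prompt case; passing `history` covers the multi-round case.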
Advanced Usage
Here are some application examples:
Task | User Input | Model Output |
---|---|---|
Article Writing | Write an article based on the following: Compose an essay titled "An Old Photo". The content should choose a photo that impresses you, explain the reason, and describe the image in the photo and the story behind it. Record the growth process, scenarios with others, environmental changes, and beautiful scenery. | Here is an essay created according to the instruction: "An Old Photo" In my long-sealed family photo album, there is an old photo that particularly attracts me. It records the happy gathering of our family members in the past, and the memory is still fresh. This photo was taken by my grandmother at a family reunion more than twenty years ago. The people in the photo include my parents, grandparents, uncles, aunts, and cousins. Everyone's faces were filled with happy smiles, as if time had stood still. The scene in the photo was very warm. Everyone was sitting around a rectangular table filled with delicious food and drinks. The dishes on the table were traditional Chinese cuisine, including fragrant roasted whole fish, tender chicken, and sweet vegetables. These dishes were cooked by my grandmother herself. She was a great cook, and everyone enjoyed the meal. What impressed me most was that the photo also captured my young self. I was sitting at the other end of the table, holding a bowl of soup with a satisfied smile on my face. I was just one year old at that time and was very curious about this grand family gathering. From this photo, I can feel the intimacy and warmth among family members at that time. Even after so many years, that emotion still remains strong. There is a touching story behind this photo. In that era, my grandparents and parents were still in middle school. They would go to the farmland to help after school every day, working hard to support the family. Although the living conditions were not good, they still maintained a positive attitude and love for life. Every autumn, when the harvest season came, they would hold a family reunion and invite all relatives and friends to share the joy of the harvest. As time passed, I grew up and started working and living. Every time I go back to my hometown, I will visit the elderly and listen to their stories from the past, feeling that unforgettable time. And this old photo will always accompany me, becoming a part of my precious memories. It witnesses our growth and changes and records those unforgettable moments. This photo not only shows the family affection and warmth but also reflects the hard-working and simple lifestyle and respect for traditions of people in that era. Nowadays, with the rapid development of society, many traditional virtues and customs are gradually disappearing. We should cherish such memories and cultures and pass them on to future generations so that they can also feel the warmth and inheritance from history. Therefore, I decided to keep this photo well and often take it out to look at. It is not just an old photo but also a symbol of precious memories and family spirit. I hope that in the future, I will have the opportunity to recreate the grand occasion of that year, gather together again, share our lives and stories, and let that family affection and friendship continue to spread in time and never fade away. |
Letter Writing | Write a letter as a brand manager to the creative director of an advertising company, putting forward creative suggestions for the advertising campaign of a new product. | Dear Creative Director, I am writing this letter to propose some creative suggestions for our new product's advertising campaign. Our company is about to launch a brand-new product, and to ensure its success in the market, we hope to attract consumers' attention through a striking advertising campaign. When considering the creativity and strategy of the advertisement, several factors need to be taken into account. First, we need to conduct research on the target audience to understand their needs and preferences. Second, we need to ensure that the advertising content is attractive and persuasive, resonating with the target audience. Finally, we need to utilize various media platforms and digital marketing tools to expand the influence of the advertisement. Based on these factors, I put forward the following creative suggestions: 1. Product Positioning: We can position this new product as a representative of high-end, environmentally friendly, healthy, and fashionable products, emphasizing its unique functions and features. In the advertisement, we can convey these characteristics through vivid visual effects and concise language. 2. Storytelling Marketing: Tell a touching story related to the product's functions in the advertisement, allowing consumers to have a deeper understanding of the changes in life brought by the product. For example, we can tell the story of a busy professional woman who uses our new product to achieve a balance between work and life. 3. Celebrity Endorsement: Invite a popular public figure or opinion leader to be the product spokesperson, using their influence to promote the product. This can not only increase the product's exposure but also make more consumers trust and rely on the product. 4. Cross-border Collaboration: Collaborate with some related industries (such as fashion, environmental protection, health, etc.) to jointly launch limited-edition or co-branded products. This kind of collaboration can not only attract consumers from different fields but also bring more topics and collectible value to the product. 5. Digital Marketing: Make full use of digital tools such as social media, search engines, and email marketing to maintain a consistent message and image throughout the advertising campaign. We can also produce a series of short videos and online activities to increase consumer participation and brand loyalty. 6. Innovative Forms: Try using novel advertising forms such as AR (Augmented Reality), VR (Virtual Reality), and 360-degree videos to give consumers a brand-new advertising experience. This innovative form can not only attract consumers' attention but also improve the product's recognition in the market. 7. Interactive Elements: Add interactive elements such as product trials, lotteries, and coupons to the advertising campaign to encourage consumers to actively participate and share their experiences. This will help with word-of-mouth promotion and further expand the product's influence. Based on the above suggestions, I hope we can formulate a comprehensive and creative advertising campaign plan to maximize the competitiveness of the new product in the market. If you have any questions or need further discussion, please feel free to contact me. Thank you for your attention and support! Best regards! Your Brand Manager (Note: This letter can be adjusted according to the actual situation and product characteristics.) |
Summarization | Please summarize and streamline this article: The amendment to Articles 10-2 and 72 of the Industrial Innovation Act, commonly known as the "Taiwan Semiconductor Act", provides a maximum 25% corporate income tax investment credit for companies in semiconductor, electric vehicle, 5G, and other technology-innovative fields that hold a key position in the international supply chain. The eligibility requirements for enterprises include a certain scale of R&D expenses and R&D intensity in the current year, as well as an effective tax rate reaching a certain ratio. To respond to the minimum tax rate adjustment of the Organization for Economic Co-operation and Development (OECD) countries, the effective tax rate threshold was set at 12% in 2023 and is expected to increase to 15% in 2024, but the implementation of the international minimum tax rate system will also be considered. An official from the Ministry of Economic Affairs said that the negotiation with the Ministry of Finance has entered the final stage. In addition to setting the R&D intensity of enterprises at 6%, it has been confirmed that enterprises with equipment investment of more than 10 billion yuan in advanced manufacturing processes can claim a tax credit. An official from the Ministry of Finance said that during the research process, in-depth research was conducted on Taiwanese industries and their international counterparts. In terms of equipment, since the enterprises applying for the preferential policy under Article 10-2 represent Taiwan in the "international competition", if the investment amount is less than 10 billion yuan, they may not be competitive. As for the much-concerned R&D expense threshold, an official from the Ministry of Economic Affairs said that after close discussions with the Ministry of Finance, the R&D expense threshold is expected to be between 6 billion and 7 billion yuan. An official from the Ministry of Finance pointed out that R&D is crucial for Taiwan's future economic growth. The threshold should not be "unreachable". Although it was initially set at 10 billion yuan, it was lowered to make enterprises feel that they can meet the requirements and thus apply for the tax incentives, so as to have the motivation to continue investing in R&D and maintain their key position in the international supply chain. An official from the Ministry of Economic Affairs said that since the average R&D expenses of manufacturers are 3-4 billion yuan, and for IC design companies, it is between 3 billion and 6 billion yuan. If the threshold is set at 10 billion yuan, fewer enterprises will meet the conditions, and the incentive effect will be insufficient. In addition, if more enterprises meet the application threshold, it will increase the investment in Taiwan and also boost the tax revenue of the Ministry of Finance. IC design companies have recently been vocal about the preferential policy under Article 10-2, hoping to lower the application threshold. With countries competing for supply chain autonomy and increasing subsidies for the semiconductor industry, the Ministry of Economic Affairs and the Ministry of Finance have reached a consensus on Article 10-2, aiming to benefit more enterprises, enhance corporate investment, and consolidate Taiwan's technological position. An official from the Ministry of Finance said that the formulation of tax incentives must be "prudent and effective" and should be based on the principle of achieving the purpose of the incentives. Currently, enterprises in the "domestic competition" can still apply for the tax incentives under Articles 10 and 10-1 of the Industrial Innovation Act to jointly promote Taiwan's economic development. The Ministry of Economic Affairs and the Ministry of Finance are making the final confirmation on the R&D expense threshold. After the subsidiary regulations are announced in the next two days, there will be about 30 days for further discussion and adjustment with the industry. It is hoped that Article 10-2 will come into effect in June. | The amendment to the "Industrial Innovation Act" in Taiwan, also known as the "Taiwan Semiconductor Act", aims to support innovative companies in semiconductor, electric vehicle, and 5G technologies. It offers a maximum 25% income tax credit. Enterprises need to meet requirements such as a certain scale of R&D expenses, an R&D intensity of over 6%, and an effective tax rate. The effective tax rate threshold is expected to be 12% in 2023 and 15% in 2024. After discussions, it is confirmed that an equipment investment of over 10 billion yuan can be credited, and the R&D expense threshold is expected to be between 6-7 billion yuan. The Ministry of Economic Affairs and the Ministry of Finance have reached a consensus to lower the threshold to benefit more enterprises and encourage investment. The implementation of this act is expected to start in June. |
Translation | Translate the following English content into Chinese: Dongshan coffee is famous for its unique position, and the constant refinement of production methods. The flavor is admired by many caffeine afficionados. | Dongshan coffee is well-known for its favorable location and continuously refined production techniques. Its flavor is highly praised by many coffee lovers. |
Documentation
Model Parameters
Property | Details |
---|---|
Parameter Quantity | 7B |
Maximum Context Length | 4K |
Token Quantity of Traditional Chinese Training Data | 41.44B |
Training Time | 1531.82 H100 GPU Hours |
Training Method
- Hardware and Software Specifications:
- H100 GPUs from the National Center for High-Performance Computing.
- Training Framework: PyTorch.
- Data Pre-processing:
- Character Standardization.
- Duplicate Removal.
- Noise Removal:
- HTML tags and JavaScript in web data.
- Non-standard or garbled characters.
- Articles with too few characters.
- Specific formats in articles, such as line breaks for typesetting.
- Personal Information Removal, such as emails and phone numbers.
- Inappropriate Text Removal, such as gambling and pornographic content.
- Character and Word Expansion:
- To enhance the performance of Traditional Chinese input and output, the expanded data includes two parts:
- Chinese characters obtained from the [Standard Character Table of Variant Characters Dictionary of the Ministry of Education](https://dict.variants.moe.edu.tw/appendix.jsp?ID=1&ID=0).
- Chinese words obtained by training a tokenizer on 5 million sentences (2.1G), each longer than 100 characters, extracted from Traditional Chinese Wikipedia, news, and Chinese Common Crawl data.
- Continuous Pretraining (CP):
- Supplement a large amount of reliable Traditional Chinese knowledge.
- Hyperparameters:
- Optimizer: AdamW
- Learning Rate: 1e-4
- Batch Size: 1M tokens
- Epoch: 1
- Fine-Tuning (FT):
- Enable the model to answer questions in Traditional Chinese.
- Hyperparameters:
- Optimizer: AdamW
- Learning Rate: 5e-5
- Batch Size: 256K tokens
- Epoch: 3
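The Data Pre-processing steps listed above can be sketched as a small cleaning pipeline. This is only an illustration under stated assumptions, not the TAIDE team's actual code: NFKC normalization stands in for "character standardization", exact-match comparison stands in for duplicate removal, the regular expressions and the `MIN_CHARS` threshold are hypothetical, and inappropriate-text filtering is not covered.

```python
import html
import re
import unicodedata
from typing import Iterable, Iterator

# Hypothetical patterns and threshold, chosen only for illustration.
TAG_RE = re.compile(r"<script\b.*?</script>|<style\b.*?</style>|<[^>]+>", re.DOTALL | re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"(?<!\d)(?:\+?886|0)\d(?:[-\s]?\d){7,9}(?!\d)")
MIN_CHARS = 100


def clean_text(text: str) -> str:
    """Strip markup and personal data, then normalize characters and whitespace."""
    text = TAG_RE.sub(" ", text)      # HTML tags and inline JavaScript/CSS
    text = html.unescape(text)        # entities such as &amp;
    # Assumption: NFKC as the standardization step. It also folds full-width
    # punctuation to ASCII, which a real Traditional Chinese pipeline may not want.
    text = unicodedata.normalize("NFKC", text)
    text = EMAIL_RE.sub("", text)     # personal information: emails
    text = PHONE_RE.sub("", text)     # personal information: phone numbers
    return re.sub(r"\s+", " ", text).strip()  # typesetting line breaks


def preprocess(docs: Iterable[str]) -> Iterator[str]:
    """Yield cleaned documents, dropping short ones and exact duplicates."""
    seen = set()
    for doc in docs:
        cleaned = clean_text(doc)
        if len(cleaned) < MIN_CHARS or cleaned in seen:
            continue
        seen.add(cleaned)
        yield cleaned


if __name__ == "__main__":
    sample = ["<p>" + "字" * 120 + "</p>", "字" * 120, "too short"]
    print(len(list(preprocess(sample))))
```

Cleaning before deduplication means that two pages differing only in markup collapse to one document, which is one reasonable ordering among several.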
Training Data
- Continuous Pretraining Data (about 140G):
| Dataset | Data Description |
| --- | --- |
| Litigation Data | Civil, criminal, and administrative litigation data from various levels of courts from January 2013 to December 2023 in Judicial Yuan Judgments. |
| Central News Agency | News articles from the Central News Agency from June 1993 to June 2023, covering domestic and international politics, society, finance, culture, education, and life. |
| ETtoday News Cloud | Data from ETtoday News Cloud from October 2011 to December 2023. |
| Legislative Yuan Gazette | Gazette data from the 8th Session, 1st Meeting to the 10th Session, 7th Meeting in Legislative Yuan Gazette. |
| Publishers' Website Book Introductions | Book introductions from publishers' websites such as Suncolor and Gotop. |
| GRB Research Project Abstracts | Abstracts of research projects and their results funded by the government from 1993 to 2023 in GRB, including Chinese and English translations. |
| Academic Conference Paper Abstracts | Academic conference papers held in Taiwan from 1988 to 2009 in Academic Conference Paper Abstract Database. |
| Taiwan Panorama Magazine | Articles from [Taiwan Panorama Magazine](https://www.taiwan-panorama.com/) from July 1993 to June 2023, focusing on Taiwan's culture, tourism, and local conditions. |
| Terminology Network | About 1.87 million academic terms and their translations in liberal arts and science fields in Terminology Network. |
| Government Departments' Data | Partial data from government department websites such as the "State Introduction" on the Executive Yuan website (https://www.ey.gov.tw/state/), the "National Cultural Memory Bank" on the Ministry of Culture website (https://memory.culture.tw/), the "Archive-Supported Teaching Network" on the National Development Council website (https://art.archives.gov.tw/index.aspx), and the "Traffic Safety Portal" on the Ministry of Transportation and Communications website (https://168.motc.gov.tw/). |
| Business Today | Articles from Business Today from January 2008 to July 2023, a weekly magazine focusing on finance. |
| Ministry of Education's Chinese Dictionaries | Including three datasets: [Ministry of Education's "Idiom Dictionary"](https://dict.idioms.moe.edu.tw/search.jsp?webMd=1&la=0) with 5,338 idioms, including definitions, original texts of allusions, vernacular explanations, usage instructions, and examples; [Ministry of Education's "Revised Dictionary of Mandarin"](https://dict.revised.moe.edu.tw/?la=0&powerMode=0) with about 165,539 entries of Chinese single characters and various vocabulary, including pronunciation, radicals, and definitions; [Ministry of Education's "Concise Dictionary of Mandarin"](https://dict.concised.moe.edu.tw/?la=0&powerMode=0), a concise version of the "Revised Dictionary of Mandarin" with 45,247 entries. |
| Science and Technology Park Data | Scientific knowledge and popular science articles on the Science and Technology Park website. |
| iKnow Science and Technology Industry Information Center | Information on Taiwan and global science and technology market trends, strategic analysis, patent knowledge, and technology trading from 2008 to 2023 in iKnow Science and Technology Industry Information Center. |
| Science Development Monthly | Popular science articles from October 2004 to December 2020 in Science Development Monthly. Since 2021, it has been re-launched as the quarterly magazine Charming Science and Technology, providing new knowledge on internationally-concerned science and technology topics. |
| Law Database | Central regulations, administrative rules, draft regulatory orders, and local autonomous regulations issued by various government departments as of October 2023 in Law Database. |
| Local Government Tourism Websites | Partial data from local government tourism websites in some Taiwanese counties and cities. |
| Curriculum Guidelines of the 12-Year National Education | The general outline and curriculum guidelines for different subjects in schools under the 12-year national education system. |
| CNA Translation Archive Database | Translation pairs of Chinese and foreign surnames, names, organizations, and place names collected by the Central News Agency in its news operations. |
| Fairy Tales | A total of 20 fairy tales, including The Adventures of Tom Sawyer, Peter Pan, Alice's Adventures in Wonderland, Daddy-Long-Legs, etc. |
| RedPajama-Data-V2 | English data extracted from the international open-source multi-language corpus [RedPajama-Data-v2](https://github.com/togethercomputer/RedPajama-Data). |
| MathPile-commercial | The international open-source mathematics corpus MathPile-commercial. |
| Chinese Wikipedia | All entries in [Chinese Wikipedia](https://zh.wikipedia.org/zh-tw/%E4%B8%AD%E6%96%87%E7%BB%B4%E5%9F%BA%E7%99%BE%E7%A7%91) as of January 2023. |
| github-code-clean | An open-source code dataset from GitHub, with unlicensed code and documentation removed. |
- Fine-Tuning Data: The TAIDE team trained Llama2-series models to generate fine-tuning data, including about 128K single-round or multi-round Q&A tasks in world knowledge, creative writing, general knowledge, translation, summarization, programming, and Taiwan-related values. The fine-tuning data will be released later.
Model Evaluation
- taide-bench:
- Evaluation Data:
- 500 questions in article writing, letter writing, summarization, English-Chinese translation, and Chinese-English translation.
- Data Link: [taide-bench](https://huggingface.co/datasets/taide/taide-bench).
- Evaluation Method:
- Scored by GPT4.
- Scoring Program: [taide-bench-eval](https://github.com/taide-taiwan/taide-bench-eval).
- Evaluation Scores:
| Model | Chinese-English Translation | English-Chinese Translation | Summarization | Article Writing | Letter Writing | Average |
| --- | --- | --- | --- | --- | --- | --- |
| TAIDE-LX-7B-Chat | 7.165 | 7.685 | 7.720 | 9.635 | 9.110 | 8.263 |
| GPT3.5 | 8.880 | 8.810 | 7.450 | 9.490 | 8.750 | 8.676 |
| LLAMA2 7B | 6.075 | 4.475 | 5.905 | 2.625 | 3.040 | 4.424 |
| LLAMA2 13B | 6.480 | 6.135 | 6.110 | 2.565 | 3.000 | 4.858 |
| LLAMA2 70B | 6.975 | 6.375 | 6.795 | 2.625 | 2.990 | 5.152 |
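The Average column in the scores above appears to be the unweighted mean of the five task scores. The short check below recomputes it from the published per-task numbers; the dictionary layout and variable names are illustrative only.

```python
# Per-task scores in table order: Chinese-English translation,
# English-Chinese translation, summarization, article writing, letter writing.
task_scores = {
    "TAIDE-LX-7B-Chat": [7.165, 7.685, 7.720, 9.635, 9.110],
    "GPT3.5": [8.880, 8.810, 7.450, 9.490, 8.750],
    "LLAMA2 7B": [6.075, 4.475, 5.905, 2.625, 3.040],
    "LLAMA2 13B": [6.480, 6.135, 6.110, 2.565, 3.000],
    "LLAMA2 70B": [6.975, 6.375, 6.795, 2.625, 2.990],
}

# Averages as published in the table.
reported_average = {
    "TAIDE-LX-7B-Chat": 8.263,
    "GPT3.5": 8.676,
    "LLAMA2 7B": 4.424,
    "LLAMA2 13B": 4.858,
    "LLAMA2 70B": 5.152,
}

# Unweighted mean over the five tasks for each model.
computed_average = {
    model: sum(scores) / len(scores) for model, scores in task_scores.items()
}

for model, average in computed_average.items():
    matches = abs(average - reported_average[model]) < 5e-4
    print(f"{model}: computed {average:.3f}, reported {reported_average[model]:.3f}, match={matches}")
```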
License
- [TAIDE L-type Model Community License Agreement](https://drive.google.com/file/d/1FcUZjbUH6jr4xoCyAronN_slLgcdhEUd/view?usp=drive_link)
Technical Details
The model development process involves multiple steps, including data pre-processing, continuous pretraining, and fine-tuning. Each step has specific hyperparameters and operations to ensure the model's performance in Traditional Chinese tasks.
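To make the two training stages easier to compare, the hyperparameters listed under Training Method can be collected in a small configuration object. This is a sketch, not the project's training code: the `StageConfig` class is hypothetical, "1M" and "256K" are read as 1,000,000 and 256,000 tokens, and the step estimate assumes the 41.44B Traditional Chinese tokens make up the whole continuous-pretraining corpus, although the corpus also contains English, math, and code data.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters of one training stage, as listed under Training Method."""
    name: str
    optimizer: str
    learning_rate: float
    batch_tokens: int  # tokens per optimizer step
    epochs: int

    def optimizer_steps(self, dataset_tokens: int) -> int:
        """Rough number of optimizer steps needed to cover the dataset."""
        return math.ceil(dataset_tokens / self.batch_tokens) * self.epochs


# Assumption: "1M" means 1,000,000 tokens and "256K" means 256,000 tokens.
CONTINUOUS_PRETRAINING = StageConfig("continuous pretraining", "AdamW", 1e-4, 1_000_000, 1)
FINE_TUNING = StageConfig("fine-tuning", "AdamW", 5e-5, 256_000, 3)

# Assumption: only the 41.44B Traditional Chinese tokens are counted.
TRADITIONAL_CHINESE_TOKENS = 41_440_000_000
cp_steps = CONTINUOUS_PRETRAINING.optimizer_steps(TRADITIONAL_CHINESE_TOKENS)

if __name__ == "__main__":
    print(f"Estimated continuous-pretraining steps: {cp_steps}")
```

The fine-tuning step count is left uncomputed because the token count of the fine-tuning data is not stated in this document.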
Important Note
Due to limitations of the model architecture and unavoidable biases in the data, responses from the language model do not represent the position of TAIDE. Users should add their own safety mechanisms before use. Responses may also contain incorrect information and should not be trusted without verification.
Related Links
- TAIDE Official Website
- TAIDE on Huggingface
- [TAIDE on Github](https://github.com/taide-taiwan)
- Kuwa AI
Citation

