Llama3 TAIDE Series Models
The TAIDE project is dedicated to developing generative AI dialogue engine models that suit Taiwan's language and cultural characteristics while building a trustworthy AI environment. By integrating industry, academia, and research resources, it promotes the development of trustworthy generative AI, enhances Taiwan's international competitiveness, boosts industrial development, and reduces dependence on foreign technologies.
- The Llama3 TAIDE series of models are based on LLaMA3-8b released by Meta. They incorporate text and training materials applicable to different fields in Taiwan, improving the models' ability to respond in Traditional Chinese and performance in specific tasks. The publicly released models are as follows:
- Llama3-TAIDE-LX-8B-Chat-Alpha1: Based on LLaMA3-8b, this model undergoes continuous pretraining on Traditional Chinese data followed by instruction tuning, which strengthens it on common office tasks and multi-round Q&A conversations. It is suited to chat and task-assistance scenarios. A 4-bit quantized model is also available. It is provided for user convenience, but quantization may reduce performance and cause unexpected issues; please keep this in mind.
Quick Start
You need to agree to the license terms before using this model. Fill in the following information and click "Submit" to indicate your acceptance of the Llama3-TAIDE Model Community License Agreement and the Privacy Policy.
Field | Type |
---|---|
Name | Text |
Date of birth | Date picker |
Country | Country selector |
Affiliation | Text |
Geo (IP location) | Automatically detected |
I accept the terms of the license and privacy policy | Checkbox |
Click "Submit" to proceed.
⨠Features
- Highly reliable training data: Strictly screen the training data to improve the trustworthiness and applicability of the generated content.
- Enhanced office task performance: Strengthen the model's performance in common office tasks such as automatic summarization, letter writing, article writing, Chinese-English translation, and English-Chinese translation.
- Local knowledge enhancement: Incorporate knowledge of Taiwan's local culture, language usage, and national conditions.
- Multi-round Q&A capabilities: Support multi-round conversations to provide more comprehensive assistance.
Usage Examples
Basic Usage
General Q&A
chat = [
{"role": "user", "content": "{question}"},
]
prompt = tokenizer.apply_chat_template(chat)
Replace {question} with the user's input.
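To make the effect of the chat template concrete, the sketch below renders a message list by hand in the standard Llama 3 chat layout. It assumes this model's tokenizer ships that standard template, which the card does not state; the helper name `render_llama3_prompt` and the sample question are hypothetical. In practice, `tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)` should be used to obtain the prompt string rather than this helper.

```python
def render_llama3_prompt(chat, add_generation_prompt=True):
    """Render {"role", "content"} messages in the standard Llama 3 chat layout.

    Illustrative only: assumes the tokenizer uses the stock Llama 3 template.
    """
    parts = ["<|begin_of_text|>"]
    for message in chat:
        # Each turn is a role header, a blank line, the content, then <|eot_id|>.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header tells the model to start its reply.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


chat = [{"role": "user", "content": "What is the capital of Taiwan?"}]
prompt = render_llama3_prompt(chat)
print(prompt)
```

Seeing the rendered string is mainly useful for debugging, for example to check that a system prompt or an earlier turn ended up where expected.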
With System Prompt
chat = [
{"role": "system", "content": "{sys}"},
{"role": "user", "content": "{question}"},
]
prompt = tokenizer.apply_chat_template(chat)
Replace {sys} with an instruction, e.g., "You are an AI assistant from Taiwan named TAIDE. You are willing to help users from a Taiwanese perspective and will answer questions in Traditional Chinese." Replace {question} with the user's question.
Multi-round Q&A
chat = [
{"role": "system", "content": "{sys}"},
{"role": "user", "content": "{question1}"},
{"role": "assistant", "content": "{model_answer_1}"},
{"role": "user", "content": "{question2}"},
]
prompt = tokenizer.apply_chat_template(chat)
Replace {sys} with an instruction, {question1} with the user's first question, {model_answer_1} with the model's first answer, and {question2} with the user's second question.
For more details, please refer to Llama3 Documentation.
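The three patterns above (single question, system prompt, multi-round) differ only in which messages precede the latest user turn. A minimal sketch of a helper that assembles the `chat` list for all three cases follows; `build_chat` and its parameters are hypothetical names introduced here for illustration and are not part of the model's API.

```python
def build_chat(question, history=(), system=None):
    """Assemble a chat list for tokenizer.apply_chat_template.

    question: the latest user message.
    history:  iterable of (user_message, assistant_message) pairs, oldest first.
    system:   optional system instruction.
    """
    chat = []
    if system:
        chat.append({"role": "system", "content": system})
    for user_turn, assistant_turn in history:
        chat.append({"role": "user", "content": user_turn})
        chat.append({"role": "assistant", "content": assistant_turn})
    chat.append({"role": "user", "content": question})
    return chat


chat = build_chat(
    "And the second question?",
    history=[("The first question?", "The model's first answer.")],
    system="You are an AI assistant from Taiwan named TAIDE.",
)
print([message["role"] for message in chat])
```

Keeping the history as explicit pairs makes it harder to produce two consecutive user or assistant turns by accident.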
Advanced Usage
Here are some application examples:
Task | User Input | Model Output |
---|---|---|
Article Writing | Write an article based on the following: Compose an essay titled "An Old Photo." Choose a photo that impresses you, explain the reason, and describe the image in the photo and the story behind it. Record the growth process, scenes with others, environmental changes, and beautiful scenery. | An old photo holds the traces of time and the warmth of memories. I still remember how that photo came into being. It was a yellowed black-and-white photo framed on the wall, telling a story of my family. In the photo, my grandparents stood in the fields, with rolling mountains in the background and a golden rice field in front. This scene impressed me deeply because it was my childhood memory and the most beautiful scenery of my growing-up years. I was born in a rural area in Taiwan and grew up in such a simple environment. Whenever I worked in the fields with my family, I would see such beautiful scenery. The golden rice ears swayed with the gentle breeze, and the mountains stood quietly in the distance, giving a sense of peace and tranquility. The photo was taken in the 1950s when Taiwan's economy had not yet taken off, and agriculture was the main industry. My grandparents were farmers. They worked hard to provide for our family. Whenever I see that photo, I think of the sacrifices and efforts they made for us. Besides the beautiful scenery, there is also a touching story in the photo. When I was very young, there was a fire in our house, which burned down our house and all our belongings. Fortunately, the whole family was safe, but our life became very difficult. We were forced to move to another village and start a new life there. The photo also records that difficult time. Although we lost many things, the relationship between family members became even stronger. We supported each other and faced various challenges in life together. Now, I have grown up and work in the city. But whenever I see that old photo, I think of that simple and beautiful time. The beautiful scenery, the smiles of my relatives, and that challenging but hopeful time will always be engraved in my memory. This old photo is not just a photo but also a precious family history, a story about how we overcame difficulties and cherish the present. I will keep this photo well so that future generations can also know our roots and remember those precious memories. |
Letter Writing | As a brand manager, write a letter to the creative director of an advertising company, proposing creative suggestions for the advertising campaign of a new product. | [Your Name] [Your Title] [Your Company Name] [Email Address] [Phone Number] [Date] [Creative Director's Name] [Advertising Company Name] [Advertising Company Address] Subject: Creative Ideas for the New Product Advertising Campaign Dear [Creative Director's Name], I hope this letter finds you well during your busy days. I am writing to share some creative ideas for our upcoming new product advertising campaign. I firmly believe that these ideas will help our new product stand out in the market and attract consumers' attention. First, I suggest that we use social media platforms such as Facebook, Instagram, and YouTube to create a series of engaging short videos and posts. This content can showcase the features and advantages of the new product and convey our brand's core values of emphasizing environmental protection and health. We can cooperate with influencers or opinion leaders, asking them to try out and share their experiences, which will help increase the credibility and visibility of our product. Second, we can hold a new product tasting/trial event, inviting the public to experience our new product in person. We can hold these events in major department stores, supermarkets, and our stores, allowing participants to taste/trial the new product for free and fill out questionnaires to provide feedback. This interactive way can not only collect valuable market feedback but also create a positive impression of our brand among the public. Third, we can cooperate with local catering businesses, including our new product in their menus. This will expose our product to a wider customer base and prove its value in the cooking and food industries. Fourth, we can hold a new product launch event, inviting the media, industry insiders, and important customers. At this event, we can showcase the features and advantages of the new product, and professional chefs can demonstrate how to incorporate the new product into various dishes. This will be an excellent opportunity to show our brand's innovative spirit and professional ability to the public. Finally, I suggest that we create a series of beautiful packaging and gift boxes to attract consumers to choose our product during festivals and for gifting. Exquisite packaging will enhance the value of our product and convey our brand's commitment to details and quality. I firmly believe that these creative ideas will contribute to the success of our new product in the market. If you have any comments or suggestions on these ideas, please feel free to share them with me. I look forward to working with you to contribute to the growth and success of [Your Company Name]. Thank you for taking the time to read my letter, and I look forward to discussing these ideas with you further. Best regards, [Your Name] [Your Title] [Your Company Name] |
Summarization | Summarize the following article: The amendment to Articles 10-2 and 72 of the Industrial Innovation Act, commonly known as the "Taiwan Semiconductor Act," provides a maximum 25% corporate income tax investment credit for companies in semiconductor, electric vehicle, 5G, and other technology-innovative fields that play a key role in the international supply chain. The requirements for companies to apply include a certain scale of R&D expenses and R&D intensity in the current year, and an effective tax rate reaching a certain ratio. To respond to the adjustment of the minimum tax system by OECD countries, the effective tax rate threshold was set at 12% in 2023 and is expected to increase to 15% in 2024, but it may be adjusted according to the implementation of the international minimum tax system. Officials from the Ministry of Economic Affairs said that the negotiation with the Ministry of Finance is in the final stage. In addition to setting the R&D intensity at 6%, it has been confirmed that companies can offset the investment if they invest more than NT$10 billion in advanced-process equipment. Officials from the Ministry of Finance said that during the discussion, in-depth research was conducted on Taiwan's industries and their similar international counterparts. Regarding equipment investment, since companies applying for Article 10-2 of the Industrial Innovation Act are like a "national team" in the "international competition," if the investment amount is less than NT$10 billion, they may not be competitive. As for the much-discussed R&D expense threshold, officials from the Ministry of Economic Affairs said that after in-depth discussions with the Ministry of Finance, the R&D expense threshold is expected to be between NT$6 billion and NT$7 billion. Officials from the Ministry of Finance pointed out that R&D is crucial for Taiwan's future economic growth. The threshold should not be "unreachable." Although it was initially set at NT$10 billion, it was lowered to encourage companies to reach the threshold, apply for tax incentives, and continue to invest in R&D to maintain their key position in the international supply chain. Officials from the Ministry of Economic Affairs said that since the average R&D expenses of manufacturers are NT$3-4 billion, and those of IC design companies range from NT$3 billion to NT$6 billion, if the threshold is set at NT$10 billion, few companies will meet the requirements, and the incentive will be insufficient. In addition, if more companies meet the application requirements, it will increase the investment in Taiwan, and the tax revenue of the Ministry of Finance will also increase. Recently, IC design companies have frequently called for lowering the application threshold. With countries striving for supply-chain autonomy and increasing subsidies for the semiconductor industry, officials from the Ministry of Economic Affairs said that the Ministry of Economic Affairs and the Ministry of Finance have reached a consensus on Article 10-2 of the Industrial Innovation Act, aiming to benefit more companies, enhance investment, and consolidate Taiwan's technological position. Officials from the Ministry of Finance said that the formulation of tax incentives should be "prudent and purpose-driven," with the highest principle of achieving the purpose of the incentives. Currently, companies operating in Taiwan can still apply for tax incentives under Articles 10 and 10-1 of the Industrial Innovation Act to jointly promote Taiwan's economic development. The Ministry of Economic Affairs and the Ministry of Finance are finalizing the R&D expense threshold. After the subsidiary regulations are announced in the next two days, there will be about 30 days for further discussion and adjustment with the industry, aiming to implement Article 10-2 in June. | The main points of the article are as follows: - The "Taiwan Semiconductor Act" or the amendment to Articles 10-2 and 72 of the Industrial Innovation Act aims to encourage key technology-innovative companies, especially those in semiconductor, electric vehicle, and 5G technology fields that play a key role in the international supply chain. - Companies that meet specific conditions, including a certain scale of R&D expenses and R&D intensity in the current year, can enjoy a maximum 25% corporate income tax credit. The R&D intensity target is set at 6%, and companies must prove that their effective tax rate reaches a certain ratio. The effective tax rate threshold was initially set at 12% in 2023 and is expected to increase to 15% in 2024, but it may be adjusted according to the implementation of the international tax system. - The government considers the adjustment of the OECD minimum tax system when formulating policies and may adjust the effective tax rate threshold accordingly. - The government departments have negotiated with stakeholders and determined some details, such as setting the investment threshold for advanced-process equipment at NT$10 billion for offset. - After discussion, the R&D expense threshold may be lowered from the initially proposed NT$10 billion to between NT$6 billion and NT$7 billion to encourage more companies to meet the requirements and invest in R&D, maintaining Taiwan's key position in the international supply chain. - Officials from the Ministry of Economic Affairs said that the purpose of lowering the threshold is to make more companies feel that they can meet the standards, thus motivating them to apply for tax incentives and potentially increasing investment in Taiwan and government tax revenue. - Since the R&D expenses of IC design companies are generally lower, it is necessary to adjust the threshold. Otherwise, few companies will meet the requirements. - As countries promote supply-chain autonomy and increase subsidies for the semiconductor industry, the Ministry of Economic Affairs is working to ensure that more companies can benefit from Article 10-2 of the Industrial Innovation Act to enhance investment and consolidate Taiwan's technological position. - The government emphasizes that the formulation of tax incentives should meet the incentive goals. At the same time, it will continue to provide other tax incentives for companies operating domestically to promote Taiwan's economic growth. - The final R&D expense threshold will be determined by the Ministry of Economic Affairs and the Ministry of Finance. After the subsidiary regulations are announced, there will be about 30 days for discussion and adjustment, aiming to implement the new policy in June. |
Translation | Translate the following English content into Chinese: Dongshan coffee is famous for its unique position, and the constant refinement of production methods. The flavor is admired by many caffeine afficionados. | Dongshan coffee is famous for its unique location and the continuous improvement of production methods. Its flavor is highly appreciated by many coffee lovers. |
Documentation
Model Parameters
Property | Details |
---|---|
Number of Parameters | 8B |
Maximum Context Length | 8K |
Token Quantity of Traditional Chinese Training Data | 43B |
Training Time | 2336 H100 GPU Hours |
Training Method
Hardware and Software Specifications
- Hardware: H100 GPUs from the National Center for High-Performance Computing
- Training Framework: PyTorch
Data Preprocessing
- Character Standardization: Standardize characters to a unified format.
- Duplicate Removal: Eliminate duplicate data.
- Noise Removal:
  - Remove HTML tags and JavaScript code from web data.
  - Eliminate non-standard or garbled characters.
  - Filter out articles with too few words.
  - Remove specific formatting in articles, such as line breaks for typesetting.
- Personal Information Removal: Remove personal information such as email addresses and phone numbers.
- Inappropriate Content Removal: Eliminate inappropriate text such as gambling and pornographic content.
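The preprocessing steps listed above can be sketched as a small pipeline. This is an illustrative approximation only, not the TAIDE team's actual code: the use of NFKC normalization, the regular expressions, the exact-match deduplication by hash, and the `MIN_CHARS` cutoff are all assumptions made for this example, and inappropriate-content filtering is omitted because the card gives no detail on how it was done.

```python
import hashlib
import re
import unicodedata

# All patterns and thresholds below are illustrative assumptions.
SCRIPT_RE = re.compile(r"<(script|style)\b.*?>.*?</\1>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"(?<!\d)0\d{1,3}[-\s]?\d{3,4}[-\s]?\d{3,4}(?!\d)")
MIN_CHARS = 20  # hypothetical cutoff for "too few words"


def clean_document(text):
    """Normalize characters, strip markup, and remove e-mail and phone strings."""
    text = unicodedata.normalize("NFKC", text)  # character standardization
    text = SCRIPT_RE.sub(" ", text)             # drop JavaScript/CSS blocks
    text = TAG_RE.sub(" ", text)                # drop remaining HTML tags
    text = EMAIL_RE.sub("", text)               # personal information removal
    text = PHONE_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()    # collapse typesetting line breaks


def preprocess(corpus):
    """Clean each document, drop short ones, and drop exact duplicates."""
    seen = set()
    kept = []
    for raw in corpus:
        text = clean_document(raw)
        if len(text) < MIN_CHARS:
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept


sample = [
    "<p>Contact us at info@example.com or 02-1234-5678 for the full schedule.</p>",
    "<p>Contact us at info@example.com or 02-1234-5678 for the full schedule.</p>",
    "<b>short</b>",
]
cleaned = preprocess(sample)
print(cleaned)
```

Deduplicating after cleaning, rather than before, lets documents that differ only in markup or whitespace collapse into one entry.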
Continuous Pretraining (CP)
- Knowledge Enrichment: Supplement a large amount of reliable Traditional Chinese knowledge.
- Hyperparameters:
  - Optimizer: AdamW
  - Learning Rate: 1e-4
  - Batch Size: 1M tokens
  - Epoch: 1
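As a rough sanity check on these settings, the number of optimizer steps implied by the batch size can be estimated. The sketch below assumes that "1M tokens" means 10^6 tokens and that the 43B Traditional Chinese tokens from the Model Parameters table are the tokens seen in one epoch. The card does not confirm either point, and the continuous pretraining corpus also contains English, math, and code data, so the result is best read as a lower bound.

```python
# Figures taken from the model card; their interpretation is an assumption.
TRADITIONAL_CHINESE_TOKENS = 43_000_000_000  # "43B" from Model Parameters
BATCH_TOKENS = 1_000_000                     # "1M tokens", assumed to mean 10**6
EPOCHS = 1

# One optimizer step consumes one batch of tokens.
steps = TRADITIONAL_CHINESE_TOKENS * EPOCHS // BATCH_TOKENS
print(f"approximate optimizer steps: {steps:,}")
```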
Fine-Tuning (FT)
- Task-Specific Adaptation: Enable the model to answer questions in Traditional Chinese.
- Hyperparameters:
  - Optimizer: AdamW
  - Learning Rate: 5e-5
  - Batch Size: 256K tokens
  - Epoch: 3
Training Data
Continuous Pretraining Data (Approximately 140G)
Dataset | Description |
---|---|
Litigation Data | Civil, criminal, and administrative litigation data from various levels of courts in Taiwan from January 2013 to December 2023, sourced from Judicial Yuan Judgments. |
Central News Agency | News articles from the Central News Agency from June 1993 to June 2023, covering domestic and international politics, society, finance, culture, education, and daily life. |
ETtoday News Cloud | News data from ETtoday News Cloud from October 2011 to December 2023. |
Legislative Yuan Gazette | Gazette data from the 8th Session to the 10th Session, 7th Period of the Legislative Yuan, available at Legislative Yuan Gazette. |
Publishers' Book Introductions | Book introductions from publishers such as Suncolor and Gotop. |
GRB Research Project Summaries | Summaries of research projects and reports funded by the government from 1993 to 2023, sourced from GRB, including Chinese-English translations. |
Academic Conference Paper Summaries | Papers from academic conferences held in Taiwan from 1988 to 2009, sourced from Academic Conference Paper Summary Database. |
Taiwan Panorama Magazine | Articles from [Taiwan Panorama Magazine](https://www.taiwan-panorama.com/) from July 1993 to June 2023, focusing on Taiwan's culture, tourism, and local conditions. |
Terminology Database | Approximately 1.87 million academic terms and their translations in liberal arts and science fields, sourced from Terminology Database. |
Government Department Data | Partial data from government department websites such as the "National Conditions Introduction" on the Executive Yuan's website (https://www.ey.gov.tw/state/), the "National Cultural Memory Bank" on the Ministry of Culture's website (https://memory.culture.tw/), the "Archives for Teaching Support Network" on the National Development Council's website (https://art.archives.gov.tw/index.aspx), and the "Traffic Safety Portal" on the Ministry of Transportation and Communications' website (https://168.motc.gov.tw/). |
Business Today | Articles from Business Today from January 2008 to July 2023, mainly focusing on finance. |
Ministry of Education Dictionaries | Three dictionaries from the Ministry of Education: [Idiom Dictionary](https://dict.idioms.moe.edu.tw/search.jsp?webMd=1&la=0) with 5,338 idioms, including definitions, allusions, usage explanations, and examples; [Revised Dictionary of the National Language](https://dict.revised.moe.edu.tw/?la=0&powerMode=0) with about 165,539 Chinese characters and vocabulary entries, including pronunciation, radicals, and definitions; [Concise Dictionary of the National Language](https://dict.concised.moe.edu.tw/?la=0&powerMode=0), a concise version of the Revised Dictionary, with 45,247 entries. |
Science and Technology Park Data | Scientific knowledge and popular science articles from Science and Technology Park Website. |
iKnow Technology Industry Information Center | Information on Taiwan and global technology market trends, strategic analysis, patent knowledge, and technology trading from 2008 to 2023, sourced from iKnow Technology Industry Information Center. |
Science Development Monthly | Popular science articles from Science Development Monthly from October 2004 to December 2020. In 2021 it was relaunched as the quarterly Charming Science and Technology, which publishes articles on science and technology issues of international interest. |
Law Database | Central regulations, administrative rules, draft regulatory orders, and local autonomous regulations issued by various government departments as of October 2023, sourced from Law Database. |
Local Government Tourism Websites | Partial data from the tourism websites of some local governments in Taiwan. |
Curriculum Guidelines of the 12-Year National Education | General guidelines and subject-specific curriculum guidelines for the 12-year national education. |
CNA Translation Database | Translation records of Chinese and foreign surnames, names, organizations, and place names in the news business of the Central News Agency. |
Fairy Tales | 20 fairy tales, including The Adventures of Tom Sawyer, Peter Pan, Alice's Adventures in Wonderland, Daddy-Long-Legs, etc. |
RedPajama-Data-V2 | English data extracted from the foreign open-source multilingual corpus [RedPajama-Data-v2](https://github.com/togethercomputer/RedPajama-Data). |
MathPile-commercial | Foreign open-source mathematics corpus MathPile-commercial. |
Chinese Wikipedia | All entries on [Chinese Wikipedia](https://zh.wikipedia.org/zh-tw/%E4%B8%AD%E6%96%87%E7%BB%B4%E5%9F%BA%E7%99%BE%E7%A7%91) as of January 2023. |
github-code-clean | An open-source code dataset from GitHub, with unlicensed code and documentation removed. |
Fine-Tuning Data
The TAIDE team trained the Llama2 series of models to generate fine-tuning data. The generated tasks include single-round or multi-round Q&A on world knowledge, creative writing, general knowledge, translation, summarization, programming, and Taiwan-related values, totaling 128K records. The fine-tuning data will be released later.
Model Evaluation
taide-bench
- Evaluation Data: 500 questions in article writing, letter writing, summarization, English-Chinese translation, and Chinese-English translation. Data link: [taide-bench](https://huggingface.co/datasets/taide/taide-bench).
- Evaluation Method: Scored by GPT4. Scoring program: [taide-bench-eval](https://github.com/taide-taiwan/taide-bench-eval).
- Evaluation Scores:

Model | Chinese-English Translation | English-Chinese Translation | Summarization | Article Writing | Letter Writing | Average |
---|---|---|---|---|---|---|
Llama3-TAIDE-LX-8B-Chat-Alpha1 | 7.770 | 8.280 | 8.495 | 9.605 | 8.950 | 8.620 |
GPT3.5 | 8.880 | 8.810 | 7.450 | 9.490 | 8.750 | 8.676 |
TAIDE-LX-7B-Chat | 7.165 | 7.685 | 7.720 | 9.635 | 9.110 | 8.263 |
LLAMA2 7B | 6.075 | 4.475 | 5.905 | 2.625 | 3.040 | 4.424 |
LLAMA2 13B | 6.480 | 6.135 | 6.110 | 2.565 | 3.000 | 4.858 |
LLAMA2 70B | 6.975 | 6.375 | 6.795 | 2.625 | 2.990 | 5.152 |
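The card does not say how the Average column is computed. The sketch below checks the natural reading, an unweighted mean of the five task scores, against the reported values. The scores are copied from the evaluation table; the helper name `check_averages` and the tolerance are choices made for this example.

```python
from statistics import mean

# (Chinese-English, English-Chinese, summarization, article writing,
#  letter writing, reported average), copied from the evaluation table.
SCORES = {
    "Llama3-TAIDE-LX-8B-Chat-Alpha1": (7.770, 8.280, 8.495, 9.605, 8.950, 8.620),
    "GPT3.5": (8.880, 8.810, 7.450, 9.490, 8.750, 8.676),
    "TAIDE-LX-7B-Chat": (7.165, 7.685, 7.720, 9.635, 9.110, 8.263),
    "LLAMA2 7B": (6.075, 4.475, 5.905, 2.625, 3.040, 4.424),
    "LLAMA2 13B": (6.480, 6.135, 6.110, 2.565, 3.000, 4.858),
    "LLAMA2 70B": (6.975, 6.375, 6.795, 2.625, 2.990, 5.152),
}


def check_averages(scores, tolerance=5e-4):
    """Return models whose reported average differs from the unweighted mean."""
    mismatches = {}
    for model, (*tasks, reported) in scores.items():
        computed = mean(tasks)
        if abs(computed - reported) > tolerance:
            mismatches[model] = (round(computed, 3), reported)
    return mismatches


mismatches = check_averages(SCORES)
print(mismatches or "all reported averages match the unweighted mean")
```

Because every row matches, the five tasks appear to carry equal weight in the Average column.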
Technical Details
The model is based on the LLaMA3-8b architecture. Continuous pretraining and fine-tuning on a large amount of Traditional Chinese data improve its ability to understand and generate Traditional Chinese text. The data preprocessing steps are intended to ensure the quality of the training data, and training uses the hyperparameters listed under Training Method.
License
- [Llama3-TAIDE Model Community License Agreement](https://drive.google.com/file/d/12-Q0WWSjG0DW6CqJQm_jr5wUGRLeb-8p/view)
Important Note
Because of limitations in the model's design and potential biases in the data, responses from the language model do not represent the stance of TAIDE. Add your own safety mechanisms before use, and be aware that responses may contain incorrect information; do not rely on them without verification.
Development Team
Related Links
- TAIDE Official Website
- TAIDE Huggingface
- [TAIDE Github](https://github.com/taide-taiwan)
- Kuwa AI
Citation

