TAIDE L Language Model
The TAIDE Project is committed to developing a generative AI dialogue engine model that suits the characteristics of Taiwan's language and culture, while constructing a trustworthy AI environment. By integrating the energy of industry, academia, and research, it promotes the development of trustworthy generative AI, enhances Taiwan's position in international competition, promotes industrial development, and avoids over-reliance on foreign technologies.
- The large language models developed in this project are based on LLaMA2-7b released by Meta. By introducing available texts and training materials from different fields in Taiwan, the models' ability to respond in Traditional Chinese and their performance in specific tasks are improved. The publicly released models are as follows:
  - [TAIDE-LX-7B](https://huggingface.co/taide/TAIDE-LX-7B): A model based on LLaMA2-7b, pre-trained (continuous pretraining) only with Traditional Chinese data. It is suitable for scenarios where users will further fine-tune the model. Since the pre-trained model has not been fine-tuned or preference-aligned, it may produce malicious or unsafe outputs. Please use it with caution.
  - [TAIDE-LX-7B-Chat](https://huggingface.co/taide/TAIDE-LX-7B-Chat): Based on TAIDE-LX-7B, it enhances the ability to handle common office tasks and multi-round Q&A dialogues through instruction tuning. It is suitable for chat dialogues or task assistance. TAIDE-LX-7B-Chat also provides a [4-bit quantized model](https://huggingface.co/taide/TAIDE-LX-7B-Chat-4bit). The quantized model is mainly for user convenience, but quantization may reduce performance and cause more unexpected problems. Please keep this in mind.
Quick Start
You need to agree to the license terms before using this model.
License Information
Property | Details |
---|---|
License | TAIDE L Models Community License Agreement |
Gated Heading | You need to agree to the license terms before using this model. |
Gated Fields | Name: text; Date of birth: date_picker; Country: country; Affiliation: text; Geo: ip_location; By clicking Submit below I accept the terms of the license and privacy policy: checkbox |
Gated Prompt | * TAIDE L Models Community License Agreement * Privacy policy |
Gated Button Content | Submit |
⨠Features
- Enhanced Chinese Character Support: An additional 24,720 Chinese characters and words are added to strengthen the model's ability to process Traditional Chinese.
- High - quality Training Data: The training data of the model is strictly screened to improve the trustworthiness and applicability of the generated data.
- Optimized for Office Tasks: The model is enhanced for common office tasks such as automatic summarization, letter writing, article writing, Chinese - to - English translation, and English - to - Chinese translation.
- Cultural and Regional Knowledge: It has strengthened knowledge of Taiwan's local culture, language usage, and national conditions.
- Multi - round Q&A Capability: The model can handle multi - round question - answer dialogues.
Usage Examples
Basic Usage
The example programs and documentation will be released on GitHub later.
Tokenizer Setup
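from transformers import AutoTokenizer

# use_fast=False loads the slow (Python) tokenizer implementation.
tokenizer = AutoTokenizer.from_pretrained("taide/TAIDE-LX-7B-Chat", use_fast=False)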
Prompt Templates
General Q&A
f"<s>[INST] {question} [/INST]"
# Replace {question} with the user's input.
With System Prompt
f"<s>[INST] <<SYS>>\n{sys}\n<</SYS>>\n\n{question} [/INST]"
# Replace {sys} with instructions, e.g., "You are an AI assistant from Taiwan named TAIDE, willing to help users from a Taiwanese perspective and answer questions in Traditional Chinese."
# Replace {question} with the user's question.
Multi-round Q&A
f"<s>[INST] <<SYS>>\n{sys}\n<</SYS>>\n\n{question1} [/INST] {model_answer_1} </s><s>[INST] {question2} [/INST]"
# Replace {sys} with instructions.
# Replace {question1} with the user's first question.
# Replace {model_answer_1} with the model's first answer.
# Replace {question2} with the user's second question.
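The three templates above follow one pattern, so they can be produced by a single helper. The following is a minimal sketch; `build_prompt` and its parameter names are hypothetical and not part of TAIDE's released code.

```python
from typing import List, Optional, Tuple


def build_prompt(question: str,
                 sys: Optional[str] = None,
                 history: Optional[List[Tuple[str, str]]] = None) -> str:
    """Assemble a prompt string in the template format shown above.

    history holds earlier (user_question, model_answer) pairs, oldest first.
    """
    past = list(history or [])
    questions = [q for q, _ in past] + [question]
    answers = [a for _, a in past]

    parts = []
    for i, q in enumerate(questions):
        # The system block is embedded in the first user turn only.
        if i == 0 and sys:
            q = f"<<SYS>>\n{sys}\n<</SYS>>\n\n{q}"
        parts.append(f"<s>[INST] {q} [/INST]")
        # Every turn except the last is followed by the model's answer.
        if i < len(answers):
            parts.append(f" {answers[i]} </s>")
    return "".join(parts)
```

Calling `build_prompt(question2, sys=sys, history=[(question1, model_answer_1)])` yields the multi-round template; omitting `history` and `sys` yields the general Q&A template.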
Hugging Face Chat Templates
General Q&A
chat = [
{"role": "user", "content": "{question}"},
]
prompt = tokenizer.apply_chat_template(chat)
# Replace {question} with the user's input.
With System Prompt
chat = [
{"role": "system", "content": "{sys}"},
{"role": "user", "content": "{question}"},
]
prompt = tokenizer.apply_chat_template(chat)
# Replace {sys} with instructions, e.g., "You are an AI assistant from Taiwan named TAIDE, willing to help users from a Taiwanese perspective and answer questions in Traditional Chinese."
# Replace {question} with the user's question.
Multi-round Q&A
chat = [
{"role": "system", "content": "{sys}"},
{"role": "user", "content": "{question1}"},
{"role": "assistant", "content": "{model_answer_1}"},
{"role": "user", "content": "{question2}"},
]
prompt = tokenizer.apply_chat_template(chat)
# Replace {sys} with instructions.
# Replace {question1} with the user's first question.
# Replace {model_answer_1} with the model's first answer.
# Replace {question2} with the user's second question.
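The chat lists above can also be built from a running conversation history. This is a minimal sketch under that assumption; `build_chat` and its parameters are hypothetical names introduced only for illustration.

```python
from typing import Dict, List, Optional, Tuple


def build_chat(question: str,
               sys: Optional[str] = None,
               history: Optional[List[Tuple[str, str]]] = None) -> List[Dict[str, str]]:
    """Build the message list expected by tokenizer.apply_chat_template.

    history holds earlier (user_question, model_answer) pairs, oldest first.
    """
    chat: List[Dict[str, str]] = []
    if sys:
        chat.append({"role": "system", "content": sys})
    for user_msg, answer in history or []:
        chat.append({"role": "user", "content": user_msg})
        chat.append({"role": "assistant", "content": answer})
    # The newest user question always comes last.
    chat.append({"role": "user", "content": question})
    return chat
```

The result can be passed to `tokenizer.apply_chat_template(chat)`. In the `transformers` library that call tokenizes by default and returns token IDs; passing `tokenize=False` returns the formatted prompt string instead.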
Documentation
Model Parameters
Property | Details |
---|---|
Number of Parameters | 7B |
Maximum Context Length | 4K |
Token Quantity of Traditional Chinese Training Data | 41.44B |
Training Time | 1531.82 H100 GPU Hours |
Application Examples
Task | User Input | Model Output |
---|---|---|
Article Writing | Write an article based on the following: Write a composition titled 'An Old Photo'. The content should select an impressive photo, explain the reason for its impressiveness, and describe the image in the photo and the story behind it. Record the growth process, scenes with others, environmental changes, and beautiful scenery. | Here is a composition created according to the instructions: 'An Old Photo' In my long-forgotten family photo album, there is an old photo that particularly attracts me. It records a happy family gathering of our family members in the past, which is still memorable. This photo was taken by my grandmother at a family gathering more than twenty years ago. The people in the photo include my parents, grandparents, uncles, aunts, and cousins. Everyone's faces were filled with happy smiles, as if time had stood still. The scene in the photo was very warm. Everyone was sitting around a rectangular table filled with delicious food and drinks. The dishes on the table were traditional Chinese cuisine, including a fragrant whole roasted fish, tender chicken, and sweet vegetables. These dishes were cooked by my grandmother herself. She was very good at cooking, and everyone enjoyed the meal. What impressed me most was that I was also in the photo when I was a child. I was sitting at the other end of the table, holding a bowl of soup with a satisfied smile on my face. I was just one year old at that time and was very curious about this grand family gathering. From this photo, I can feel the intimacy and warmth among family members at that time. Even after so many years, that emotion still remains strong. There is a touching story behind this photo. In that era, my grandparents and parents were still in middle school. They would go to the farmland to help after school every day, working hard to support the family. Although the living conditions were not rich, they still maintained an optimistic attitude and love for life. Every autumn, when the harvest season came, they would hold a family gathering and invite all relatives and friends to share the joy of the harvest. As the years passed, I grew up and started working and living. Every time I go back to my hometown, I will take the time to visit the elderly, listen to their stories from the past, and feel that unforgettable time. And this old photo will always accompany me and become a part of my precious memories. It witnesses our growth and changes and records those unforgettable moments. This photo not only shows the family affection and warmth but also reflects the hard-working and simple lifestyle and respect for tradition of people in that era. Nowadays, with the rapid development of society, many traditional virtues and customs are gradually disappearing. We should cherish such memories and cultures and pass them on so that future generations can also feel the warmth and inheritance from history. Therefore, I decided to keep this photo well and often take it out to look at. It is not only an old photo but also a symbol of precious memories and family spirit. I hope that I will have the opportunity to recreate the grand occasion of that year, gather together again, share our lives and stories, and let that family affection and friendship continue to spread in time and never fade away. |
Letter Writing | Write a letter as a brand manager to the creative director of an advertising company, proposing creative suggestions for the advertising campaign of a new product. | Dear Creative Director, I am writing this letter to propose some creative suggestions for the advertising campaign of our new product. Our company is about to launch a brand-new product. To ensure its success in the market, we hope to attract consumers' attention through a striking advertising campaign. When considering the creativity and strategy of the advertisement, several factors need to be taken into account. First, we need to conduct research on the target audience to understand their needs and preferences. Second, we need to ensure that the advertisement content is attractive and persuasive and can resonate with the target audience. Finally, we need to use various media platforms and digital marketing tools to expand the influence of the advertisement. Based on these factors, I propose the following creative suggestions: 1. Product Positioning: We can position this new product as a representative of high-end, environmentally friendly, healthy, and fashionable products, emphasizing its unique functions and features. In the advertisement, we can convey these characteristics through vivid visual effects and concise language. 2. Storytelling Marketing: Tell a touching story related to the product's functions in the advertisement so that consumers can better understand the changes in life brought by the product. For example, we can tell the story of a busy professional woman who uses our new product to balance work and life. 3. Celebrity Endorsement: Invite a popular public figure or opinion leader to be the product spokesperson and use their influence to promote the product. This can not only increase the product's exposure but also make more consumers trust and rely on the product. 4. Cross-industry Collaboration: Collaborate with some related industries (such as fashion, environmental protection, and health) to jointly launch limited-edition or co-branded products. This kind of collaboration can not only attract consumers from different fields but also bring more topics and collectible value to the product. 5. Digital Marketing: Make full use of digital tools such as social media, search engines, and email marketing to maintain a consistent message and image throughout the advertising campaign. We can also produce a series of short videos and online activities to increase consumer participation and brand loyalty. 6. Innovative Forms: Try to use novel advertising forms, such as AR (Augmented Reality), VR (Virtual Reality), and 360-degree videos, to bring consumers a brand-new advertising experience. This innovative form can not only attract consumers' attention but also improve the product's recognition in the market. 7. Interactive Activities: Add interactive elements such as product trials, lotteries, and coupons to the advertising campaign to encourage consumers to actively participate and share their experiences. This will help with word-of-mouth promotion and further expand the product's influence. Based on the above suggestions, I hope that we can develop a comprehensive and creative advertising campaign plan to maximize the competitiveness of the new product in the market. If you have any questions or need further discussion, please feel free to contact me. Thank you for your attention and support! Best regards! Your Brand Manager (Note: This letter can be adjusted according to the actual situation and product characteristics) |
Summarization | Please summarize and streamline this article: The amendment to Articles 10-2 and 72 of the Industrial Innovation Act, commonly known as the 'Taiwan Semiconductor Act', provides a maximum 25% corporate income tax investment credit for companies in semiconductor, electric vehicle, 5G, and other technology-innovative fields that are in a key position in the international supply chain. The requirements for enterprises to apply include a certain scale of R&D expenses and R&D intensity in the current year and an effective tax rate reaching a certain ratio. To respond to the minimum tax system adjustment of OECD countries, the effective tax rate threshold is set at 12% in 2023 and is expected to increase to 15% in 2024, but the implementation of the international minimum tax system can still be considered. The official of the Ministry of Economic Affairs said that the negotiation with the Ministry of Finance has entered the final stage. In addition to setting the R&D intensity of enterprises at 6%, it has been confirmed that enterprises can offset the investment amount of more than 10 billion yuan in advanced-process equipment. The official of the Ministry of Finance said that during the discussion, in-depth research was conducted on Taiwan's industries and similar companies internationally. For equipment, after all, enterprises applying for Article 10-2 of the Industrial Innovation Act are like a 'national team' playing in the 'international cup'. If the investment amount is less than 10 billion yuan, they may not be able to compete. As for the much-concerned R&D expense threshold, the official of the Ministry of Economic Affairs said that after close discussions with the Ministry of Finance, the R&D expense threshold is expected to be between 6 billion and 7 billion yuan. The official of the Ministry of Finance pointed out that R&D is crucial for Taiwan's future economic growth momentum. The threshold should not be 'unreachable'. Although it was initially set at 10 billion yuan, it was lowered to make enterprises feel that they can reach the threshold and then apply for tax incentives, so as to have the motivation to continue investing in R&D and maintain a key position in the international supply chain. The official of the Ministry of Economic Affairs said that since the average R&D expenses of manufacturers are 3-4 billion yuan, and for IC design companies, it is between 3 billion and 6 billion yuan. If the threshold is set at 10 billion yuan, fewer enterprises will meet the conditions, and the incentive will be insufficient. In addition, if more enterprises meet the application threshold, it will increase the investment amount of enterprises in Taiwan, and the tax revenue of the Ministry of Finance will also increase. IC design companies have recently voiced their opinions on Article 10-2 of the Industrial Innovation Act, hoping to lower the application threshold. In addition, as countries are striving for supply-chain autonomy and increasing subsidies for the semiconductor industry, the Ministry of Economic Affairs and the Ministry of Finance have reached a consensus on Article 10-2 of the Industrial Innovation Act, aiming to benefit more enterprises, enhance the investment strength of enterprises, and consolidate Taiwan's technological position. The official of the Ministry of Finance said that the formulation of tax incentives must be 'purposeful and prudent' and take achieving the purpose of the incentives as the highest principle. At present, enterprises playing in the 'domestic cup' can still apply for the tax incentives of Articles 10 and 10-1 of the Industrial Innovation Act to jointly promote Taiwan's economic development. The Ministry of Economic Affairs and the Ministry of Finance are making the final confirmation on the R&D expense threshold. After the subsidiary regulations are announced in the next two days, there will be about 30 days to further discuss and adjust with the industry. It is hoped that Article 10-2 of the Industrial Innovation Act can be implemented in June. | The amendment to the 'Industrial Innovation Act' in Taiwan, also known as the 'Taiwan Semiconductor Act', aims to support innovation in leading enterprises in semiconductor, electric vehicle, and 5G technologies. The new policy provides a maximum 25% income tax credit. Enterprises need to meet the requirements of a certain scale of R&D expenses in the current year, an R&D intensity of more than 6%, and an effective tax rate reaching a certain ratio. In response to the adjustment of the international minimum tax system for enterprises, the thresholds in Taiwan are expected to reach 12% in 2023 and 15% in 2024. After discussions with the Ministry of Finance, it has been determined that an equipment investment of 10 billion yuan can be used as a credit incentive, and the R&D expense threshold is expected to be between 6 billion and 7 billion yuan. Regarding the request to lower the threshold, the Ministry of Economic Affairs and the Ministry of Finance have reached a consensus, aiming to expand the scope of beneficiaries and encourage enterprises to invest to maintain a key position in the supply chain. In addition, the demands of IC design companies are also considered for threshold reduction. It is expected that this bill will be implemented soon, and further discussions and adjustments will be carried out after the announcement period. |
Translation | Translate the following English content into Chinese: Dongshan coffee is famous for its unique position, and the constant refinement of production methods. The flavor is admired by many caffeine afficionados. | Dongshan coffee is famous for its unique geographical location and continuously refined production techniques. Its flavor is highly praised by many coffee lovers. |
Technical Details
Training Methods
Hardware and Software Specifications
- National Center for High-performance Computing H100
- Training Framework: PyTorch
Data Pre-processing
- Character Standardization: Standardize characters.
- Duplicate Removal: Remove duplicate data.
- Noise Removal:
  - Remove HTML tags and JavaScript from web data.
  - Remove non-standard characters or garbled characters.
  - Remove articles with too few words.
  - Remove specific formats in articles, such as line breaks added for typesetting.
- Personal Information Removal: Remove personal information such as emails and phone numbers.
- Inappropriate Text Removal: Remove inappropriate text such as gambling and pornographic content.
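Several of the steps above can be sketched with the Python standard library. This is an illustrative sketch only, not the TAIDE team's pipeline: the use of NFKC for character standardization, the regular expressions, the exact-match SHA-256 deduplication, and the `min_chars` threshold are all assumptions. Inappropriate-text removal is omitted because it would need a word list or classifier that the source does not describe.

```python
import hashlib
import re
import unicodedata
from typing import Iterable, List

# Illustrative patterns (assumptions, not the project's actual rules).
SCRIPT_RE = re.compile(r"<script\b.*?</script>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
PHONE_RE = re.compile(r"(?<!\d)0\d{1,3}[-\s]?\d{3,4}[-\s]?\d{3,4}(?!\d)")


def clean_document(text: str) -> str:
    """Strip scripts and tags, standardize characters, and drop emails/phones."""
    text = SCRIPT_RE.sub(" ", text)             # remove JavaScript blocks
    text = TAG_RE.sub(" ", text)                # remove remaining HTML tags
    text = unicodedata.normalize("NFKC", text)  # character standardization
    text = EMAIL_RE.sub("", text)               # personal information removal
    text = PHONE_RE.sub("", text)
    # Collapse whitespace, including line breaks added for typesetting.
    return re.sub(r"\s+", " ", text).strip()


def clean_corpus(docs: Iterable[str], min_chars: int = 20) -> List[str]:
    """Clean each document, drop very short ones, and remove exact duplicates."""
    seen = set()
    kept = []
    for doc in docs:
        cleaned = clean_document(doc)
        if len(cleaned) < min_chars:
            continue
        digest = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(cleaned)
    return kept
```

Deduplication runs after cleaning so that documents differing only in markup or whitespace collapse to the same hash.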
Character and Word Expansion
- To enhance the performance of Traditional Chinese input and output, the expanded data includes the following two parts:
  - Obtain Chinese characters from the [Standard Chinese Characters List of the Ministry of Education Variant Characters Dictionary](https://dict.variants.moe.edu.tw/appendix.jsp?ID=1&ID=0).
  - Extract 5 million sentences (2.1G) with more than 100 characters from Traditional Chinese Wikipedia, news, and Chinese common crawl data to train the tokenizer for Chinese words.
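The two parts above can be illustrated with a small sketch: filtering sentences by length, and finding which standard characters are absent from an existing vocabulary. Both function names and the set-based vocabulary are hypothetical; the source does not describe how the team implemented these steps.

```python
from typing import Iterable, List, Set


def select_training_sentences(sentences: Iterable[str],
                              min_chars: int = 100) -> List[str]:
    """Keep only sentences with more than min_chars characters."""
    return [s for s in sentences if len(s) > min_chars]


def missing_characters(standard_chars: Iterable[str],
                       vocab: Set[str]) -> List[str]:
    """Return standard characters not yet in the vocabulary.

    Input order is preserved and repeated characters are reported once.
    """
    seen = set()
    missing = []
    for ch in standard_chars:
        if ch not in vocab and ch not in seen:
            seen.add(ch)
            missing.append(ch)
    return missing
```

With Hugging Face `transformers`, new entries are typically registered via `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`; whether TAIDE used exactly this route is not stated in the source.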
Continuous Pretraining (CP)
- Supplement a large amount of reliable Traditional Chinese knowledge.
- Hyperparameters:
  - Optimizer: AdamW
  - Learning Rate: 1e-4
  - Batch Size: 1M tokens
  - Epoch: 1
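As a rough sanity check, the batch size and epoch count imply an approximate number of optimizer steps. The sketch below assumes that "1M tokens" means 10^6 tokens and uses the 41.44B Traditional Chinese token count from the Model Parameters table as the corpus size. The pretraining data also includes English and code, so the real step count is likely higher; treat the result as a lower-bound estimate.

```python
import math


def optimizer_steps(total_tokens: int, batch_tokens: int, epochs: int = 1) -> int:
    """Number of optimizer updates needed to consume the corpus."""
    return math.ceil(total_tokens * epochs / batch_tokens)


# Assumed inputs: 41.44B tokens, 1M-token batches, 1 epoch.
cp_steps = optimizer_steps(41_440_000_000, 1_000_000, epochs=1)
print(cp_steps)  # 41440
```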
Fine-Tuning (FT)
- Enable the model to answer questions in response to Traditional Chinese queries.
- Hyperparameters:
  - Optimizer: AdamW
  - Learning Rate: 5e-5
  - Batch Size: 256K tokens
  - Epoch: 3
Training Data
Continuous Pretraining Data (about 140G)
Dataset | Data Description |
---|---|
Litigation Data | Civil, criminal, and administrative litigation data of various courts from January 2013 to December 2023 from Judicial Yuan Judgments. |
Central News Agency | Data from Central News Agency Chinese News, including daily news articles from June 1993 to June 2023, covering domestic and international politics, society, finance, culture, education, and life. |
ETtoday News Cloud | Data from ETtoday News Cloud, including data from October 2011 to December 2023. |
Legislative Yuan Gazette | Gazette data from the 8th Session, 1st Meeting to the 10th Session, 7th Meeting of the Legislative Yuan Gazette. |
Publishers' Website Book Introductions | Book introductions from publishers' websites such as Suncolor and Gotop. |
GRB Research Project Abstracts | GRB is an information system that collects research projects and their results reports funded by the government. This dataset mainly collects research project abstracts and research report abstracts from 1993 to 2023, including Chinese and their English translations. |
Academic Conference Paper Abstracts | Papers from academic conferences held in Taiwan from 1988 to 2009, collected from the Academic Conference Paper Abstract Database. |
Taiwan Panorama Magazine | Articles from [Taiwan Panorama Magazine](https://www.taiwan-panorama.com/) from July 1993 to June 2023, focusing on Taiwan's culture, tourism, and people's livelihood. |
Terms Database | About 1.87 million academic terms and their translations in liberal arts and science fields from Terms Database. |
Government Departments' Data | Partial data from government departments' websites such as the Executive Yuan's National Conditions Introduction, the Ministry of Culture's National Cultural Memory Bank, the National Development Council's Archives for Teaching Network, and the Ministry of Transportation and Communications' Traffic Safety Portal. |
Business Today | Articles from Business Today, a weekly financial magazine, from January 2008 to July 2023. |
Ministry of Education Dictionaries | Includes the following three datasets: [Ministry of Education Idiom Dictionary](https://dict.idioms.moe.edu.tw/search.jsp?webMd=1&la=0), containing 5,338 idioms with explanations, original allusions, vernacular explanations, usage instructions, and examples. [Ministry of Education Revised National Language Dictionary](https://dict.revised.moe.edu.tw/?la=0&powerMode=0), containing Chinese single characters and various vocabulary, including pronunciation, radicals, and explanations, with about 165,539 entries. [Ministry of Education Concise National Language Dictionary](https://dict.concised.moe.edu.tw/?la=0&powerMode=0), a concise version of the Revised National Language Dictionary, with 45,247 entries. |
Science and Technology Park Data | Scientific news and popular science articles from the Science and Technology Park Website. |
iKnow Science and Technology Industry Information Room | iKnow Science and Technology Industry Information Room provides information on Taiwan and global science and technology market trends, strategic analysis, patent knowledge, and technology trading, focusing on the innovation and development of the science and technology industry from 2008 to 2023. |
Science Development Monthly | Popular science articles from Science Development Monthly from October 2004 to December 2020. Since 2021, it has been relaunched as the quarterly Charming Science and Technology, providing new articles on science and technology topics of international concern. |
Laws and Regulations Database | Central laws, administrative rules, draft regulatory orders, and local autonomous regulations newly issued by various government departments as of October 2023 from the Laws and Regulations Database. |
Local Government Tourism Websites | Partial data from the tourism websites of some local governments in Taiwan. |
Curriculum Guidelines of the 12-year National Education | The general outline of the 12-year national education curriculum guidelines and the curriculum guidelines for different subjects in different levels of schools. |
Central News Agency Translation Database | A database of translations of Chinese and foreign surnames, names, organizations, and place names used in the news business of the Central News Agency. |
Fairy Tales | A total of 20 fairy tales, including 'The Adventures of Tom Sawyer', 'Peter Pan', 'Alice's Adventures in Wonderland', 'Daddy-Long-Legs', etc. |
RedPajama-Data-V2 | English data extracted from the foreign open-source multilingual corpus [RedPajama-Data-v2](https://github.com/togethercomputer/RedPajama-Data). |
MathPile-commercial | Foreign open-source mathematics corpus MathPile-commercial. |
Chinese Wikipedia | The content of all entries in [Chinese Wikipedia](https://zh.wikipedia.org/zh-tw/%E4%B8%AD%E6%96%87%E7%BB%B4%E5%9F%BA%E7%99%BE%E7%A7%91) as of January 2023. |
github-code-clean | An open-source code dataset from GitHub, with unlicensed code and documents removed. |
Fine-Tuning Data
The TAIDE team trained Llama2-series models to generate the fine-tuning data. The generated tasks include single-round or multi-round Q&A dialogues on world knowledge, creative writing, general knowledge, translation, summarization, programming, and Taiwanese values, totaling 128K records. The fine-tuning data will be released later.
License
Disclaimer
Due to the limitations of the LLM design architecture and possible biases in the data, responses from the language model do not represent the stance of TAIDE. A safety protection mechanism should be added before use. The responses may also contain incorrect information, so please do not fully trust them.
Development Team
Related Links
- TAIDE Official Website
- TAIDE Huggingface
- [TAIDE Github](https://github.com/taide-taiwan)
- Kuwa AI
Citation

