🚀 Star Semantic Large Model - TeleChat
Star Semantic Large Model - TeleChat is a large language model developed and trained by China Telecom Artificial Intelligence Technology Co., Ltd. It offers high-performance language processing capabilities, with its 7B and 12B models trained on large-scale, high-quality Chinese and English corpora.
🚀 Quick Start
Latest News
- March 20, 2024: Open-sourced the 12B version chat model and its quantized versions.
- January 11, 2024: Open-sourced a 1T Chinese dataset.
- January 10, 2024: Open-sourced the 7B version chat model and its quantized versions.
Model Introduction
Star Semantic Large Model - TeleChat
- The Star Semantic Large Model TeleChat is a large language model developed and trained by China Telecom Artificial Intelligence Technology Co., Ltd. The 7B model base is trained on 1.5 trillion tokens of high-quality Chinese and English corpora, and the 12B model base is trained on 3 trillion tokens of high-quality Chinese and English corpora.
- We have open-sourced the dialogue models TeleChat-7B-bot and TeleChat-12B-bot, together with their Hugging Face format weight files. We have also open-sourced int8 and int4 quantized versions of both models.
- TeleChat-12B-bot improves on TeleChat-7B-bot in model structure, training data, and training method, and significantly outperforms it on general Q&A, knowledge, code, and math leaderboards.
  - Model structure: we used small-scale models to experiment with combinations of structural choices and selected the best one. Unlike TeleChat-7B-bot, TeleChat-12B-bot decouples the word embedding layer from the output lm_head layer, keeping their parameters separate, which improves training stability and convergence.
  - Training data: we collected a large volume of Chinese and English data covering books, encyclopedias, news, government affairs, law, medicine, patents, papers, mathematics, code, and many other fields. An optimized data-cleaning strategy greatly improved the text cleanliness, viewpoint neutrality, content validity, and format consistency of the data.
  - Training method: we used principled data-ratio learning and curriculum learning. Small-parameter models were fit on various data mixtures to obtain prior estimates of each dataset's difficulty; during training, the current model's loss on every dataset and its generation quality on the evaluation set were evaluated at regular intervals, and the weights of harder-to-learn datasets were increased dynamically so that the model fits each dataset well (a rough sketch of this reweighting idea follows this list).
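The dynamic reweighting described above is summarized here only as a minimal, hypothetical sketch: the function name, the softmax-over-loss boost, and the temperature parameter are illustrative assumptions, not the actual TeleChat training code.

```python
import math

def reweight_datasets(base_weights, measured_losses, temperature=1.0):
    """Boost sampling weights for datasets the model currently fits poorly.

    base_weights:    {dataset_name: prior mixing ratio}  (illustrative)
    measured_losses: {dataset_name: current eval loss}   (illustrative)
    """
    # Softmax over losses: relative difficulty, not absolute scale, drives the boost.
    z = sum(math.exp(loss / temperature) for loss in measured_losses.values())
    boost = {k: math.exp(loss / temperature) / z for k, loss in measured_losses.items()}
    # Harder datasets (higher loss) receive a larger share; renormalize to sum to 1.
    raw = {k: base_weights[k] * (1.0 + boost[k]) for k in base_weights}
    total = sum(raw.values())
    return {k: w / total for k, w in raw.items()}

# Example: code data currently has the highest loss, so its weight is nudged upward.
print(reweight_datasets(
    base_weights={"books": 0.4, "web": 0.4, "code": 0.2},
    measured_losses={"books": 1.8, "web": 2.0, "code": 2.6},
))
```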
Model Structure
We designed TeleChat as a standard decoder-only Transformer and made the following choices in the model architecture:
- Positional Encoding: we use Rotary Embedding, which injects relative position information into self-attention and extrapolates well to longer positions. Rotary Embedding also combines well with FlashAttention v2, speeding up model training by about 20%.
- Activation Function: we use the SwiGLU activation function instead of GELU. To reduce computation, the ffn_hidden_size is set below the 4x hidden size used in the original SwiGLU formulation.
- Layer Normalization: pre-normalization based on RMSNorm.
- Decoupled Word Embedding and Output Layer: in the TeleChat-12B-bot model the word embedding layer and the output lm_head layer do not share parameters, which improves training stability and convergence.

A minimal code sketch of these components follows the hyperparameter table below.
| Model | layer_num | hidden_size | ffn_hidden_size | head_num | tie_word_embeddings |
|---|---|---|---|---|---|
| 7B | 30 | 4096 | 12288 | 32 | Yes |
| 12B | 38 | 5120 | 12288 | 32 | No |
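The sketch below maps the structural choices above to code. It is a simplified illustration under assumed layer names and shapes, not the released modeling code; the attention wiring, biases, and initialization are omitted.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Pre-normalization used in place of LayerNorm."""
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        x = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * self.weight

class SwiGLU(nn.Module):
    """SwiGLU MLP; ffn_hidden_size (12288) is set below 4 * hidden_size."""
    def __init__(self, hidden_size, ffn_hidden_size):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, ffn_hidden_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, ffn_hidden_size, bias=False)
        self.down_proj = nn.Linear(ffn_hidden_size, hidden_size, bias=False)

    def forward(self, x):
        return self.down_proj(nn.functional.silu(self.gate_proj(x)) * self.up_proj(x))

def rotary_embedding(x, base=10000.0):
    """Apply rotary position embedding to a (batch, seq, heads, head_dim) tensor."""
    _, seq_len, _, head_dim = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    angles = torch.einsum("s,f->sf", torch.arange(seq_len, dtype=torch.float32), inv_freq)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # Rotate each even/odd channel pair by its position-dependent angle.
    return torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1).flatten(-2)

# Example with the 7B shapes from the table: 32 heads, hidden_size 4096 -> head_dim 128.
q = rotary_embedding(torch.randn(1, 16, 32, 128))
h = SwiGLU(4096, 12288)(RMSNorm(4096)(torch.randn(1, 16, 4096)))
```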
The open-sourced TeleChat models support:
- DeepSpeed fine-tuning: we have open-sourced DeepSpeed-based training code that supports ZeRO memory optimization and integrates FlashAttention2.
- Multi-turn dialogue: we have open-sourced the multi-turn data construction method and train with a mask loss so that training focuses on the multi-turn answers, improving Q&A quality (a rough illustration follows this list).
- Long-context extrapolation: we have open-sourced an 8K training version that can be extrapolated to a 96K context using NTK-aware extrapolation and attention-scaling extrapolation.
- Long-text generation: the models perform well on long-form writing tasks such as work summaries, work plans, PPT outlines, essays, tender documents, emails, proposals, weekly reports, and job descriptions.
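As a rough illustration of the mask-loss idea (not the released training pipeline), the sketch below masks user turns so that only the assistant's tokens contribute to the loss; the turn format, role names, and the use of -100 as the ignore index are assumptions.

```python
# Minimal sketch of mask-loss construction for multi-turn dialogue training.
IGNORE_INDEX = -100  # ignored by PyTorch's CrossEntropyLoss

def build_multiturn_example(tokenizer, turns, max_length=8192):
    """turns: list of (role, text) pairs with role in {"user", "bot"} (illustrative format)."""
    input_ids, labels = [], []
    for role, text in turns:
        ids = tokenizer.encode(text, add_special_tokens=False)
        input_ids.extend(ids)
        if role == "bot":
            labels.extend(ids)                        # learn to produce the answer
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # do not train on the question tokens
    return {"input_ids": input_ids[:max_length], "labels": labels[:max_length]}
```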
The following table shows the released versions and download links:
| Model Version | Download Link |
|---|---|
| 7B-FP16 | [TeleChat-7B-FP16](https://huggingface.co/Tele-AI/Telechat-7B) |
| 7B-int8 | [TeleChat-7B-int8](https://huggingface.co/Tele-AI/Telechat-7B-int8) |
| 7B-int4 | [TeleChat-7B-int4](https://huggingface.co/Tele-AI/Telechat-7B-int4) |
| 12B-FP16 | [TeleChat-12B-FP16](https://huggingface.co/Tele-AI/TeleChat-12B) |
| 12B-int8 | [TeleChat-12B-int8](https://huggingface.co/Tele-AI/TeleChat-12B-int8) |
| 12B-int4 | [TeleChat-12B-int4](https://huggingface.co/Tele-AI/TeleChat-12B-int4) |
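If you prefer to fetch the weights ahead of time rather than at first load, the standard huggingface_hub download shown below works with the repository IDs from the table above; the local directory is an arbitrary choice.

```python
from huggingface_hub import snapshot_download

# Fetch the 12B FP16 weights listed above into a local directory.
snapshot_download(repo_id="Tele-AI/TeleChat-12B", local_dir="./models/TeleChat-12B")
```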
Data Open - Source
Data Introduction
TeleChat-PTD is a comprehensive large-scale Chinese dataset extracted from the pre-training corpus of the Star Semantic Large Model TeleChat. The data mainly comes from web pages, books, official media, and similar sources. We used a combination of rules and models for filtering, and removed near-duplicate documents, to retain data of the highest possible quality.
The TeleChat-PTD release contains approximately 270 million samples of pure Chinese text. The raw data is about 1 TB; compressed, it is about 480 GB across 189 files. Redundant information has been removed from the dataset.
Data Download
Huggingface download address: [TeleChat-PTD](https://huggingface.co/datasets/Tele-AI/TeleChat-PTD)
China Telecom Cloud Disk download address: Data Download (access code: pkg8)
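Given the dataset's size, streaming it from the Hub is a convenient way to inspect a few records before committing to a full download. This is a minimal sketch using the datasets library; the split name and the record schema are assumptions based on the usual Hub layout.

```python
from datasets import load_dataset

# Stream TeleChat-PTD without downloading all 189 files up front.
ptd = load_dataset("Tele-AI/TeleChat-PTD", split="train", streaming=True)
for i, example in enumerate(ptd):
    print(example)  # the record schema is defined on the dataset card
    if i >= 2:
        break
```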
Effect Evaluation
Compared with models of similar scale, the TeleChat models also perform well in evaluation. Our evaluation set covers datasets such as MMLU, C-Eval, GAOKAO, AGIEval, CMMLU, GSM8K, MATH, HumanEval, and CHID, spanning natural language understanding, knowledge, mathematical calculation and reasoning, code generation, and other capabilities.
The following table shows the evaluation results:
| Model | MMLU | C-Eval | CMMLU | AGIEval | GAOKAO | GSM8K | MATH | HumanEval | CSL | CHID | EPRSTMT | BBH | HellaSwag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 5-shot | 5-shot | 5-shot | zero-shot | zero-shot | 4-shot | 4-shot | zero-shot | zero-shot | zero-shot | zero-shot | 3-shot | zero-shot |
| LLaMA2-7B-chat | 46.2 | 31.9 | 31.5 | 28.5 | 16.1 | 26.3 | 3.9 | 12.2 | 58.8 | 44.1 | 57.5 | 35.6 | 74.1 |
| LLaMA2-13B-chat | 54.6 | 36.2 | 38.7 | 32.3 | 18.6 | 29.6 | 5.0 | 18.9 | 61.2 | 48.0 | 59.4 | 40.2 | 78.2 |
| ChatGLM2-6B-chat | 45.9 | 52.6 | 49.3 | 39.0 | 46.4 | 28.8 | 6.5 | 11.0 | 61.2 | 57.9 | 71.2 | 32.7 | 57.0 |
| ChatGLM3-6B-chat | 51.9 | 53.8 | 54 | 38.9 | 49.3 | 56.7 | 18.7 | 61 | 65.6 | 63.4 | 85 | 44.6 | 62.7 |
| Baichuan2-7B-chat | 52.8 | 55.6 | 54.0 | 35.3 | 39.7 | 32.8 | 6 | 13.4 | 60 | 75.2 | 87.5 | 35.8 | 61.6 |
| Baichuan2-13B-chat | 57 | 56.7 | 58.4 | 40 | 51.4 | 55.3 | 8.6 | 17.7 | 63.1 | 78.2 | 87.5 | 49.9 | 66.9 |
| Qwen-7B-chat | 56.6 | 59.3 | 59.5 | 41.3 | 63.3 | 52.5 | 10.3 | 26.2 | 63.1 | 72.3 | 88.8 | 46.9 | 59.9 |
| Qwen-14B-chat | 66.4 | 71.7 | 70.0 | 47.3 | 76.5 | 61.0 | 26.8 | 36.6 | 55.6 | 72.3 | 91.2 | 58.0 | 65.2 |
| TeleChat-7B-chat | 60.5 | 64.6 | 64.3 | 46.8 | 59 | 36.7 | 10.3 | 20.1 | 66.8 | 88.0 | 87.5 | 19.5 | 36.7 |
| TeleChat-12B-chat | 73.3 | 66.6 | 74.2 | 51.7 | 53.1 | 57.2 | 16.0 | 22.0 | 60.6 | 83.2 | 86.3 | 52.2 | 71.5 |
Note: CMMLU, AGIEval, GAOKAO, CSL, CHID, and EPRSTMT are evaluated with the evaluation method provided by the [OpenCompass](https://github.com/open-compass/OpenCompass/) platform. For the comparison models, we also refer to officially reported results and OpenCompass results. We used our own evaluation script for the MMLU and C-Eval leaderboards; the specific method can be found in the evaluation/ folder.
Model Inference
```python
>>> import os
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
>>> os.environ["CUDA_VISIBLE_DEVICES"] = '0'
>>> tokenizer = AutoTokenizer.from_pretrained('../models/7B')
>>> model = AutoModelForCausalLM.from_pretrained('../models/7B', trust_remote_code=True, device_map="auto", torch_dtype=torch.float16)
>>> generate_config = GenerationConfig.from_pretrained('../models/7B')
>>> question = "What's the difference between light soy sauce and dark soy sauce?"
>>> answer, history = model.chat(tokenizer=tokenizer, question=question, history=[], generation_config=generate_config, stream=False)
>>> print(answer)
Light soy sauce and dark soy sauce are two different types of soy sauce, and their differences are as follows:
1. Different raw materials: Light soy sauce is made from soybeans, wheat, and other grains; while dark soy sauce is made from fermented seasonings such as soybean paste and flour paste.
2. Different production processes: Light soy sauce is made by soaking soybeans in water and then going through processes such as steaming and fermentation; while dark soy sauce is made by adding a certain proportion of salt, sugar, monosodium glutamate, and other seasonings to light soy sauce and then fermenting it.
3. Different tastes and flavors: Light soy sauce has a salty and fresh taste and a relatively refreshing mouthfeel; while dark soy sauce has a special aroma and taste and a relatively heavier mouthfeel.
In general, light soy sauce and dark soy sauce are different types of soy sauce, and they differ in terms of raw materials, production processes, and tastes.
```
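Continuing the example above, the returned history can be passed back in to carry the conversation into a second turn. This reuses only the chat signature already shown; the follow-up question itself is just an illustration.

```python
>>> follow_up = "Which one is better suited for braised dishes?"
>>> answer, history = model.chat(tokenizer=tokenizer, question=follow_up, history=history, generation_config=generate_config, stream=False)
>>> print(answer)
```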
Statement, Agreement, Citation
Statement
We hereby declare that the TeleChat model and its derivative models should not be used for any activities that endanger national and social security or violate the law. At the same time, we also require users not to use the TeleChat model for Internet services without safety review and filing. We hope that all users abide by the above principles to ensure that technological development takes place in a legal and compliant environment.
We have done our best to ensure the compliance of the data used in the model training process. However, due to the complexity of the model and data, there may still be unforeseeable problems. Therefore, we will not be responsible for any problems caused by the use of the open-sourced TeleChat model, including but not limited to data security issues, public opinion risks, or any risks and problems caused by the model being misled, misused, spread, or improperly exploited.
Agreement
The community must follow the TeleChat Model Community License Agreement when using the TeleChat model. The TeleChat model supports commercial use. If you plan to use the TeleChat model or its derivatives for commercial purposes, you need to submit the application materials required by the TeleChat Model Community License Agreement via the contact email tele_ai@chinatelecom.cn. After the review is approved, you will be granted a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable commercial copyright license.
Citation
If you need to cite our work, please use the following reference:
```
@misc{wang2024telechat,
      title={TeleChat Technical Report},
      author={Zihan Wang and Xinzhang Liu and Shixuan Liu and Yitong Yao and Yuyao Huang and Zhongjiang He and Xuelong Li and Yongxiang Li and Zhonghao Che and Zhaoxi Zhang and Yan Wang and Xin Wang and Luwen Pu and Huihan Xu and Ruiyu Fang and Yu Zhao and Jie Zhang and Xiaomeng Huang and Zhilong Lu and Jiaxin Peng and Wenjun Zheng and Shiquan Wang and Bingkai Yang and Xuewei he and Zhuoru Jiang and Qiyi Xie and Yanhan Zhang and Zhongqiu Li and Lingling Shi and Weiwei Fu and Yin Zhang and Zilu Huang and Sishi Xiong and Yuxiang Zhang and Chao Wang and Shuangyong Song},
      year={2024},
      eprint={2401.03804},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
📄 License
This project is licensed under the Apache-2.0 license.

