# 35b-beta-long
## Quick Start
This release, CausalLM/35b-beta-long, represents the culmination of our experience and accumulated training data in fine-tuning large language models. We are open-sourcing these weights to foster development within the open-source community.
## Features
- We chose Cohere's multilingual, long-context, 35B-parameter MHA model [CohereForAI/c4ai-command-r-v01] as our base. In our evaluation, it was the most responsive to training-data quality during Supervised Fine-Tuning, outperforming other open-source LLMs.
- Synthesized over 30 million multi-turn dialogue entries from extensive web-crawled factual content, each grounded in multiple web pages or documents, with substantial human oversight and a high-quality data pipeline.
- Our data synthesis approach addresses limitations in typical LLM training corpora: we focused on generating fact-based data from multiple documents in a long-context setting, leveraging existing SOTA LLMs under human guidance.
- This approach yielded significant improvements during fine-tuning, including fewer hallucinations, stronger long-context capabilities, and better general abilities such as math, coding, and knowledge recall.
- The fine-tuned model shows more robust recall in long-context scenarios without requiring specific document formatting or prompt engineering, and performs comparably to models twice its size on quantifiable benchmarks.
## Usage Examples
The model follows the ChatML conversation format; see the Tokenizer and Chat Template notes under Additional Information.
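Since the chat template is ChatML, a prompt can be assembled as sketched below. This is a minimal illustration only: the special tokens shown are the standard ChatML ones, assumed rather than confirmed for this model, and in practice you should rely on the template bundled with the model's tokenizer.

```python
def format_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} messages in ChatML format.

    Sketch only: the authoritative template is the one shipped with the
    model's tokenizer (tokenizer.apply_chat_template).
    """
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
             for m in messages]
    if add_generation_prompt:
        # Leave the assistant turn open so the model completes it.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = format_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the documents above."},
])
```

With Hugging Face `transformers`, the equivalent string is normally produced by `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`.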
## Documentation
### Model Base
We chose Cohere's multilingual, long-context, 35B-parameter MHA model [CohereForAI/c4ai-command-r-v01] as our base. In our evaluation, it proved the most responsive to training-data quality throughout the Supervised Fine-Tuning process, outperforming other open-source LLMs. Although its initial SFT/RL focuses on specific tasks and it carries a non-commercial license, we believe it is currently the best foundation for personal and internal use cases.
### Data Synthesis
Utilizing extensive factual content from web crawls, we synthesized over 30 million multi-turn dialogue entries, grounded in multiple web pages or documents. This process involved substantial human oversight and a data pipeline designed to ensure high quality. The model was then trained on this data at the full 128K context length in BF16 precision. We also incorporated widely used open-source dialogue datasets to enhance general conversational fluency.
Our data synthesis approach addressed crucial limitations in typical LLM training corpora: LLMs often struggle to extract thematic summaries or key information, or to perform comparisons, at the paragraph or document level. We therefore focused on generating fact-based data from multiple documents within a long-context setting, leveraging existing SOTA LLMs under human guidance to synthesize information through thematic summarization, information extraction, and comparison of source materials.
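The actual synthesis prompts are not published. As an illustrative reconstruction of the idea only (the function name, task labels, and instruction strings below are all hypothetical), grounding one generation request in several source documents might look like:

```python
def build_synthesis_prompt(documents, task="thematic_summary"):
    """Assemble a long-context prompt that grounds one synthesis request in
    multiple source documents (hypothetical sketch of the described pipeline)."""
    instructions = {
        "thematic_summary": "Summarize the common themes across the documents below.",
        "information_extraction": "Extract and consolidate the key facts stated in the documents below.",
        "comparison": "Compare and contrast the claims made by the documents below.",
    }
    sections = [instructions[task]]
    for i, doc in enumerate(documents, start=1):
        # Each crawled page or document becomes a clearly delimited section.
        sections.append(f"### Document {i}\n{doc.strip()}")
    return "\n\n".join(sections)

prompt = build_synthesis_prompt(
    ["First crawled page text ...", "Second crawled page text ..."],
    task="comparison",
)
```

A SOTA LLM's answer to such a prompt, reviewed by humans, would then become one turn of a synthesized dialogue.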
### Model Performance
This approach yielded significant improvements in model performance during fine-tuning. We observed reductions in hallucinations, enhanced long-context capabilities, and improvements in general abilities such as math, coding, and knowledge recall. The training process incorporated both the original source material and the synthesized outputs, further reinforcing the model's ability to recall and utilize abstract concepts embedded within the pre-training data. Our analysis revealed that this combination of original and synthesized data was crucial for achieving a more balanced performance profile. Intermediate checkpoints and models trained solely on synthesized data are also released for research purposes.
Compared to the original task-specific model, our further fine-tuned model demonstrates more robust recall in long-context scenarios without requiring specific document formatting or prompt engineering. It also performs comparably to models twice its size on quantifiable benchmarks.
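Training mixed the original source material with the synthesized outputs. A trivial sketch of such interleaving (the real proportions and sampling scheme are unpublished, so everything here is illustrative):

```python
import random

def mix_training_data(source_docs, synthesized_dialogues, seed=0):
    """Combine original source documents and synthesized dialogues into one
    shuffled training stream (illustrative; actual mixing ratios are unknown)."""
    combined = list(source_docs) + list(synthesized_dialogues)
    random.Random(seed).shuffle(combined)
    return combined

stream = mix_training_data(["doc_a", "doc_b"], ["dlg_1", "dlg_2", "dlg_3"])
```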
### Safety Measures
As this model has only undergone SFT, it may still exhibit biases or generate undesirable content. We implemented basic safety measures using open-source refusal datasets to mitigate outputs related to illegal activities, NSFW content, and violence. However, further Reinforcement Learning is necessary for robust alignment with human values.
## Technical Details
The model uses Cohere's multilingual, long-context, 35B-parameter MHA model [CohereForAI/c4ai-command-r-v01] as its base. It was trained on over 30 million multi-turn dialogue entries synthesized from web-crawled factual content, at the full 128K context length in BF16 precision. The data synthesis process leveraged existing SOTA LLMs under human guidance to address limitations in typical LLM training corpora. Training on the combination of original source material and synthesized outputs improved the model in several respects, such as reducing hallucinations and strengthening long-context capabilities.
## License
The license for this project is WTFPL.
## Additional Information
### Tokenizer and Chat Template
The tokenizer differs from Cohere's original, and the chat template is ChatML.
### Pressure Testing
Long-context pressure testing is from: https://github.com/LeonEricsson/llmcontext
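The linked repository implements needle-in-a-haystack style pressure tests of long-context recall. The sketch below reconstructs only the core idea and is not that repository's actual code:

```python
def build_needle_test(filler_sentences, needle, depth=0.5):
    """Insert a 'needle' fact at a relative depth within filler context,
    as in needle-in-a-haystack pressure tests of long-context recall.
    (Illustrative reconstruction, not the linked repository's code.)"""
    sents = list(filler_sentences)
    pos = min(int(len(sents) * depth), len(sents))
    sents.insert(pos, needle)
    return " ".join(sents)

haystack = build_needle_test(
    ["The sky is blue today."] * 200,
    "The secret passkey is 41732.",
    depth=0.25,
)
# The model is then asked to recover the passkey, and this is repeated
# across different depths and context lengths to map recall robustness.
```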

### Datasets Used
| Property | Details |
|----------|---------|
| Datasets | JosephusCheung/GuanacoDataset, meta-math/MetaMathQA, jondurbin/airoboros-3.1, WizardLM/WizardLM_evol_instruct_V2_196k, RyokoAI/ShareGPT52K, RyokoAI/Fandom23K, milashkaarshif/MoeGirlPedia_wikitext_raw_archive, wikipedia, wiki_lingua, garage-bAInd/Open-Platypus, LDJnr/Puffin, BAAI/COIG, TigerResearch/tigerbot-zhihu-zh-10k, liwu/MNBVC, teknium/openhermes, CausalLM/Refined-Anime-Text, microsoft/orca-math-word-problems-200k, m-a-p/CodeFeedback-Filtered-Instruction |
### Language Support
The model supports English (en), Chinese (zh), Japanese (ja), and German (de).