# 35b-beta-long
## Quick Start
This release, CausalLM/35b-beta-long, represents the culmination of our experience and accumulated training data in fine-tuning large language models. We are open-sourcing these weights to foster development within the open-source community.
## Features
- We chose Cohere's multilingual, long-context, 35B-parameter MHA model [CohereForAI/c4ai-command-r-v01] as our base. In our evaluation, it was the most responsive to training-data quality during Supervised Fine-Tuning, outperforming other open-source LLMs.
- Synthesized over 30 million multi-turn dialogue entries from extensive web-crawled factual content, each grounded in multiple web pages or documents, with substantial human oversight and a high-quality data pipeline.
- Our data synthesis approach addresses limitations in typical LLM training corpora: we focused on generating fact-based data from multiple documents in a long-context setting, leveraging existing SOTA LLMs under human guidance.
- This approach yielded significant improvements during fine-tuning, including fewer hallucinations, stronger long-context capabilities, and better general abilities such as math, coding, and knowledge recall.
- The fine-tuned model shows more robust recall in long-context scenarios without requiring specific document formatting or prompt engineering, and performs comparably to models twice its size on quantifiable benchmarks.
## Usage Examples
The model follows the ChatML conversation format; see the Tokenizer and Chat Template notes under Additional Information.
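Since the chat template is ChatML, a prompt can be assembled as sketched below. This is a minimal illustration only: the special tokens shown are the standard ChatML ones, assumed rather than confirmed for this model, and in practice you should rely on the template bundled with the model's tokenizer.

```python
def format_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} messages in ChatML format.

    Sketch only: the authoritative template is the one shipped with the
    model's tokenizer (tokenizer.apply_chat_template).
    """
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
             for m in messages]
    if add_generation_prompt:
        # Leave the assistant turn open so the model completes it.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = format_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the documents above."},
])
```

With Hugging Face `transformers`, the equivalent string is normally produced by `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`.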
## Documentation
### Model Base
We chose Cohere's multilingual, long-context, 35B-parameter MHA model [CohereForAI/c4ai-command-r-v01] as our base. In our evaluation, it proved the most responsive to training-data quality throughout the Supervised Fine-Tuning process, outperforming other open-source LLMs. Although its initial SFT/RL focuses on specific tasks and it carries a non-commercial license, we believe it is currently the best foundation for personal and internal use cases.
### Data Synthesis
Utilizing extensive factual content from web crawls, we synthesized over 30 million multi-turn dialogue entries, grounded in multiple web pages or documents. This process involved substantial human oversight and a data pipeline designed to ensure high quality. The model was then trained on this data at the full 128K context length in BF16 precision. We also incorporated widely used open-source dialogue datasets to enhance general conversational fluency.
Our data synthesis approach addressed crucial limitations in typical LLM training corpora: LLMs often struggle to extract thematic summaries or key information, or to perform comparisons, at the paragraph or document level. We therefore focused on generating fact-based data from multiple documents within a long-context setting, leveraging existing SOTA LLMs under human guidance to synthesize information through thematic summarization, information extraction, and comparison of source materials.
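The actual synthesis prompts are not published. As an illustrative reconstruction of the idea only (the function name, task labels, and instruction strings below are all hypothetical), grounding one generation request in several source documents might look like:

```python
def build_synthesis_prompt(documents, task="thematic_summary"):
    """Assemble a long-context prompt that grounds one synthesis request in
    multiple source documents (hypothetical sketch of the described pipeline)."""
    instructions = {
        "thematic_summary": "Summarize the common themes across the documents below.",
        "information_extraction": "Extract and consolidate the key facts stated in the documents below.",
        "comparison": "Compare and contrast the claims made by the documents below.",
    }
    sections = [instructions[task]]
    for i, doc in enumerate(documents, start=1):
        # Each crawled page or document becomes a clearly delimited section.
        sections.append(f"### Document {i}\n{doc.strip()}")
    return "\n\n".join(sections)

prompt = build_synthesis_prompt(
    ["First crawled page text ...", "Second crawled page text ..."],
    task="comparison",
)
```

A SOTA LLM's answer to such a prompt, reviewed by humans, would then become one turn of a synthesized dialogue.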
### Model Performance
This approach yielded significant improvements in model performance during fine-tuning. We observed reductions in hallucinations, enhanced long-context capabilities, and improvements in general abilities such as math, coding, and knowledge recall. The training process incorporated both the original source material and the synthesized outputs, further reinforcing the model's ability to recall and utilize abstract concepts embedded within the pre-training data. Our analysis revealed that this combination of original and synthesized data was crucial for achieving a more balanced performance profile. Intermediate checkpoints and models trained solely on synthesized data are also released for research purposes.
Compared to the original task-specific model, our further fine-tuned model demonstrates more robust recall in long-context scenarios without requiring specific document formatting or prompt engineering. It also performs comparably to models twice its size on quantifiable benchmarks.
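Training mixed the original source material with the synthesized outputs. A trivial sketch of such interleaving (the real proportions and sampling scheme are unpublished, so everything here is illustrative):

```python
import random

def mix_training_data(source_docs, synthesized_dialogues, seed=0):
    """Combine original source documents and synthesized dialogues into one
    shuffled training stream (illustrative; actual mixing ratios are unknown)."""
    combined = list(source_docs) + list(synthesized_dialogues)
    random.Random(seed).shuffle(combined)
    return combined

stream = mix_training_data(["doc_a", "doc_b"], ["dlg_1", "dlg_2", "dlg_3"])
```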
### Safety Measures
As this model has only undergone SFT, it may still exhibit biases or generate undesirable content. We implemented basic safety measures using open-source refusal datasets to mitigate outputs related to illegal activities, NSFW content, and violence. However, further Reinforcement Learning is necessary for robust alignment with human values.
## Technical Details
The model uses Cohere's multilingual, long-context, 35B-parameter MHA model [CohereForAI/c4ai-command-r-v01] as its base. It was trained on over 30 million multi-turn dialogue entries synthesized from web-crawled factual content, at the full 128K context length in BF16 precision. The data synthesis process leveraged existing SOTA LLMs under human guidance to address limitations in typical LLM training corpora. Training on the combination of original source material and synthesized outputs improved the model in several respects, such as reducing hallucinations and strengthening long-context capabilities.
## License
The license for this project is WTFPL.
## Additional Information
### Tokenizer and Chat Template
The tokenizer differs from Cohere's original, and the chat template is ChatML.
### Pressure Testing
Long-context pressure testing is from: https://github.com/LeonEricsson/llmcontext
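The linked repository implements needle-in-a-haystack style pressure tests of long-context recall. The sketch below reconstructs only the core idea and is not that repository's actual code:

```python
def build_needle_test(filler_sentences, needle, depth=0.5):
    """Insert a 'needle' fact at a relative depth within filler context,
    as in needle-in-a-haystack pressure tests of long-context recall.
    (Illustrative reconstruction, not the linked repository's code.)"""
    sents = list(filler_sentences)
    pos = min(int(len(sents) * depth), len(sents))
    sents.insert(pos, needle)
    return " ".join(sents)

haystack = build_needle_test(
    ["The sky is blue today."] * 200,
    "The secret passkey is 41732.",
    depth=0.25,
)
# The model is then asked to recover the passkey, and this is repeated
# across different depths and context lengths to map recall robustness.
```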

### Datasets Used
| Property | Details |
|----------|---------|
| Datasets | JosephusCheung/GuanacoDataset, meta-math/MetaMathQA, jondurbin/airoboros-3.1, WizardLM/WizardLM_evol_instruct_V2_196k, RyokoAI/ShareGPT52K, RyokoAI/Fandom23K, milashkaarshif/MoeGirlPedia_wikitext_raw_archive, wikipedia, wiki_lingua, garage-bAInd/Open-Platypus, LDJnr/Puffin, BAAI/COIG, TigerResearch/tigerbot-zhihu-zh-10k, liwu/MNBVC, teknium/openhermes, CausalLM/Refined-Anime-Text, microsoft/orca-math-word-problems-200k, m-a-p/CodeFeedback-Filtered-Instruction |
### Language Support
The model supports English (en), Chinese (zh), Japanese (ja), and German (de).