🚀 Qwen3-8B 4-bit AWQ Quantized Model
A 4-bit AWQ quantized version of Qwen3-8B optimized for efficient inference using the MLX library, designed to handle long-context tasks with reduced resource usage.
🚀 Quick Start
Installation
```shell
pip install transformers torch
# Optional, for MLX-native inference on Apple Silicon:
pip install mlx-lm
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Goraint/Qwen3-8b-192k-Context-6X-Josiefied-Uncensored-MLX-AWQ-4bit",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "Goraint/Qwen3-8b-192k-Context-6X-Josiefied-Uncensored-MLX-AWQ-4bit"
)
```
Example Usage
```python
prompt = "Explain quantum computing in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
✨ Features
- Efficient Inference: 4-bit quantization reduces memory footprint by ~75% vs. FP16.
- Long Context Support: 192k tokens for complex tasks (e.g., document analysis, code generation).
- Apple Silicon Ready: Runs on macOS with MLX acceleration on Apple Silicon.
- Customizable Prompting: Adjust templates for compatibility with tools like LM Studio.
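The ~75% memory-reduction figure follows from simple bits-per-weight arithmetic. A back-of-the-envelope sketch (real savings vary slightly, since quantization scales/zeros add overhead and some layers may stay at higher precision):

```python
# Back-of-the-envelope memory estimate for an ~8.2B-parameter model.
params = 8.2e9  # Qwen3-8B parameter count

fp16_gb = params * 16 / 8 / 1e9  # 16 bits per weight
q4_gb = params * 4 / 8 / 1e9     # 4 bits per weight (ignoring scale/zero overhead)

reduction = 1 - q4_gb / fp16_gb
print(f"FP16: ~{fp16_gb:.1f} GB, 4-bit: ~{q4_gb:.1f} GB, reduction: {reduction:.0%}")
```

The per-group scales and zero points explain why the shipped model is ~4.38 GB rather than the idealized ~4.1 GB.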
📦 Model Details
| Property | Details |
| --- | --- |
| Base Model | Qwen3-8B |
| Quantization | AWQ Q4 (4-bit) via the MLX library |
| Context Length | 192,000 tokens (6x the standard window) |
| Library | MLX (optimized for Apple Silicon, macOS) |
| License | Apache 2.0 |
| Pipeline | text-generation |
| Tags | not-for-all-audiences, conversational, mlx |
📊 Performance Metrics
| Metric | Value |
| --- | --- |
| Model Size | ~4.38 GB (4-bit quantized) |
| Inference Speed (this model) | 30.58 tokens/sec (M1 Max), 112.80 tokens/sec (M3 Ultra) |
| Inference Speed (GGUF Q4_K_S, for comparison) | 8.14 tokens/sec (M1 Max) |
| Context Support | 192,000 tokens |
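For intuition, the decode speeds translate into wall-clock generation time by simple division (this ignores prompt-processing time, which grows with context length):

```python
# Rough time to generate a 500-token response at the measured decode speeds.
speeds = {
    "M1 Max (MLX AWQ 4-bit)": 30.58,
    "M3 Ultra (MLX AWQ 4-bit)": 112.80,
    "M1 Max (GGUF Q4_K_S)": 8.14,
}  # tokens/sec

tokens = 500
for hw, tps in speeds.items():
    print(f"{hw}: ~{tokens / tps:.0f} s for {tokens} tokens")
```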
💻 Usage Examples
Advanced Usage
The model ships with a Qwen3-style ChatML chat template (Jinja), which handles system prompts, tool definitions, multi-turn tool calls, and `<think>` reasoning blocks:

```jinja
{%- if tools %}
    {{- '<|im_start|>system\n' }}
    {%- if messages[0].role == 'system' %}
        {{- messages[0].content + '\n\n' }}
    {%- endif %}
    {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}
    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call>...</tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
    {%- if messages[0].role == 'system' %}
        {{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
    {%- endif %}
{%- endif %}
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
    {%- set index = (messages|length - 1) - loop.index0 %}
    {%- set tool_start = "<tool_response>" %}
    {%- set tool_start_length = tool_start|length %}
    {%- set start_of_message = message.content[:tool_start_length] %}
    {%- set tool_end = "</tool_response>" %}
    {%- set tool_end_length = tool_end|length %}
    {%- set start_pos = (message.content|length) - tool_end_length %}
    {%- if start_pos < 0 %}
        {%- set start_pos = 0 %}
    {%- endif %}
    {%- set end_of_message = message.content[start_pos:] %}
    {%- if ns.multi_step_tool and message.role == "user" and not(start_of_message == tool_start and end_of_message == tool_end) %}
        {%- set ns.multi_step_tool = false %}
        {%- set ns.last_query_index = index %}
    {%- endif %}
{%- endfor %}
{%- for message in messages %}
    {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
    {%- elif message.role == "assistant" %}
        {%- set content = message.content %}
        {%- set reasoning_content = '' %}
        {%- if message.reasoning_content is defined and message.reasoning_content is not none %}
            {%- set reasoning_content = message.reasoning_content %}
        {%- else %}
            {%- if '</think>' in message.content %}
                {%- set content = (message.content.split('</think>')|last).lstrip('\n') %}
                {%- set reasoning_content = (message.content.split('</think>')|first).rstrip('\n') %}
                {%- set reasoning_content = (reasoning_content.split('<think>')|last).lstrip('\n') %}
            {%- endif %}
        {%- endif %}
        {%- if loop.index0 > ns.last_query_index %}
            {%- if loop.last or (not loop.last and reasoning_content) %}
                {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
            {%- else %}
                {{- '<|im_start|>' + message.role + '\n' + content }}
            {%- endif %}
        {%- else %}
            {{- '<|im_start|>' + message.role + '\n' + content }}
        {%- endif %}
        {%- if message.tool_calls %}
            {%- for tool_call in message.tool_calls %}
                {%- if (loop.first and content) or (not loop.first) %}
                    {{- '\n' }}
                {%- endif %}
                {%- if tool_call.function %}
                    {%- set tool_call = tool_call.function %}
                {%- endif %}
                {{- '<tool_call>\n{"name": "' }}
                {{- tool_call.name }}
                {{- '", "arguments": ' }}
                {%- if tool_call.arguments is string %}
                    {{- tool_call.arguments }}
                {%- else %}
                    {{- tool_call.arguments | tojson }}
                {%- endif %}
                {{- '}\n</tool_call>' }}
            {%- endfor %}
        {%- endif %}
        {{- '<|im_end|>\n' }}
    {%- elif message.role == "tool" %}
        {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
            {{- '<|im_start|>user' }}
        {%- endif %}
        {{- '\n<tool_response>\n' }}
        {{- message.content }}
        {{- '\n</tool_response>' }}
        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
            {{- '<|im_end|>\n' }}
        {%- endif %}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
    {%- if enable_thinking is defined and enable_thinking is false %}
        {{- '<think>\n\n</think>\n\n' }}
    {%- endif %}
{%- endif %}
```
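To see the ChatML format the template produces for a plain conversation (no tools, no thinking block), here is a minimal pure-Python sketch. It is illustrative only; in practice, use `tokenizer.apply_chat_template()`:

```python
# Minimal sketch of ChatML rendering for a simple exchange.
# Illustrative only; the real template also handles tools and <think> blocks.
def render_chatml(messages, add_generation_prompt=True):
    out = []
    for m in messages:
        out.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        out.append("<|im_start|>assistant\n")
    return "".join(out)

prompt = render_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```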
📚 Use Cases
- Research: Long-context NLP experiments, model compression studies.
- Development: Edge deployments, real-time chatbots with extended context.
- Enterprise: Cost-effective AI solutions for document processing and code generation.
🔧 Technical Details
Potential Biases
- Trained on diverse data but may inherit societal biases (e.g., gender, cultural assumptions).
- "Not-for-all-audiences" tag indicates potential for generating sensitive content.
Technical Limitations
- 4-bit quantization may slightly reduce accuracy on complex tasks.
- Performance depends on hardware (MLX optimized for Apple Silicon).
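The accuracy caveat can be made concrete with a toy round-trip: mapping values onto 16 levels (4 bits) and back introduces a small but nonzero error. This is plain uniform quantization, not the actual AWQ scheme, which reduces the error further with activation-aware per-channel scaling:

```python
import random

# Toy 4-bit uniform quantization round-trip (NOT the AWQ algorithm itself).
random.seed(0)
weights = [random.gauss(0, 1) for _ in range(10_000)]

lo, hi = min(weights), max(weights)
scale = (hi - lo) / 15  # 4 bits -> 16 levels

def quantize(w):
    return round((w - lo) / scale)  # integer code in [0, 15]

def dequantize(q):
    return q * scale + lo

err = max(abs(w - dequantize(quantize(w))) for w in weights)
print(f"max round-trip error: {err:.4f} (half a step = {scale / 2:.4f})")
```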
Mitigation Strategies
- Review outputs for sensitive content.
- Use in controlled environments with monitoring.
📄 License
Apache 2.0
⚠️ Important Note
This model is a community contribution and may not be officially supported by Alibaba Cloud. Always validate outputs for accuracy and safety in production environments.