Open-source quantized version of Mistral-Small-24B-Instruct-2501: smaller and faster, with little loss in performance

Mistral Small 24B Instruct 2501 GPTQ G128 W4A16 MSE

Developed by ConfidentialMind

This is a 4-bit quantized version of the mistralai/Mistral-Small-24B-Instruct-2501 model, quantized by ConfidentialMind.com. It is smaller and faster than the original, with very little loss in performance.

Large language model

Safetensors

Language: English. Open-source license: Apache-2.0 #4-bit quantization #Efficient inference #English text generation

Downloads: 93

Release date: 2/15/2025

Model overview

A 4-bit quantized model based on Mistral-Small-24B-Instruct-2501. It is used mainly for text generation and suits scenarios that need efficient inference.

Model features

Efficient 4-bit quantization

Uses GPTQ to quantize the weights to 4-bit precision, substantially reducing model size and inference time.
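
As a rough back-of-the-envelope check on the size reduction, the sketch below estimates weight storage for 16-bit and 4-bit formats. It rests on assumptions not stated in this card: a nominal parameter count of exactly 24 billion, every parameter being quantized (in practice embeddings and the output head are often kept at higher precision), and one 16-bit scale plus one 4-bit zero point stored per group of 128 weights. The helper name `gptq_weight_bytes` is hypothetical.

```python
def gptq_weight_bytes(n_params: int, bits: int = 4, group_size: int = 128,
                      scale_bits: int = 16, zero_bits: int = 4) -> float:
    """Approximate bytes for quantized weights plus per-group metadata.

    Assumption: one scale and one zero point are stored per group.
    """
    n_groups = n_params / group_size
    total_bits = n_params * bits + n_groups * (scale_bits + zero_bits)
    return total_bits / 8


# Assumption: nominal "24B" parameter count, all of it quantized.
N_PARAMS = 24_000_000_000

fp16_gb = N_PARAMS * 2 / 1e9                    # 2 bytes per FP16/BF16 weight
w4_gb = gptq_weight_bytes(N_PARAMS) / 1e9       # 4-bit weights, group size 128

print(f"16-bit weights:  ~{fp16_gb:.1f} GB")
print(f"W4 G128 weights: ~{w4_gb:.1f} GB")
print(f"Reduction:       ~{fp16_gb / w4_gb:.1f}x")
```

Under these assumptions the weights shrink from roughly 48 GB to roughly 12.5 GB. Actual file sizes depend on which layers are left unquantized and on the storage layout.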

Group size 128

Quantizes weights in groups of 128, balancing model accuracy against inference efficiency.
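
To illustrate what a group size means, here is a minimal sketch of group-wise 4-bit quantization in pure Python. It uses plain round-to-nearest with a per-group scale and offset. This is a simplification for illustration only: GPTQ additionally compensates rounding error using second-order information, which is not shown, and the function names are hypothetical.

```python
import math
from typing import List, Tuple

Group = Tuple[List[int], float, float]  # (integer codes, scale, offset)


def quantize_group(weights: List[float], bits: int = 4) -> Group:
    """Map one group of weights onto integer codes in [0, 2**bits - 1]."""
    qmax = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    codes = [max(0, min(qmax, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(group: Group) -> List[float]:
    """Reconstruct approximate weights from codes, scale, and offset."""
    codes, scale, lo = group
    return [c * scale + lo for c in codes]


def quantize_row(row: List[float], group_size: int = 128, bits: int = 4) -> List[Group]:
    """Quantize a weight row group by group; each group gets its own scale."""
    return [quantize_group(row[i:i + group_size], bits)
            for i in range(0, len(row), group_size)]


# Two groups with very different magnitudes: a shared scale would waste
# most of the 16 levels on the first group, per-group scales do not.
row = [0.01 * math.sin(i) for i in range(128)] + [math.sin(i) for i in range(128, 256)]
groups = quantize_row(row)
restored = [v for g in groups for v in dequantize_group(g)]
max_err = max(abs(a - b) for a, b in zip(row, restored))

for idx, (_, scale, _) in enumerate(groups):
    print(f"group {idx}: scale = {scale:.6f}")
print(f"max reconstruction error = {max_err:.6f}")
```

Smaller groups track local weight ranges more closely but store more scales, which is the accuracy versus efficiency trade-off behind the choice of group size.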

MSE optimization

Uses MSE (mean squared error) and a higher damping factor to tune the quantization, reducing loss and perplexity.
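
This card does not describe how the MSE option works inside GPTQModel. As a generic illustration of the idea, the sketch below searches over progressively narrower clipping ranges and keeps the one with the lowest mean squared reconstruction error. The grid of 100 steps, the limit of 80 shrink steps, and all function names are assumptions made for this example.

```python
import math
from typing import List, Tuple


def quant_dequant(values: List[float], lo: float, hi: float, bits: int = 4) -> List[float]:
    """Round values onto a 2**bits-level grid spanning [lo, hi]; outliers are clipped."""
    qmax = (1 << bits) - 1
    scale = (hi - lo) / qmax if hi > lo else 1.0
    return [max(0, min(qmax, round((v - lo) / scale))) * scale + lo for v in values]


def mean_squared_error(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def search_clip_range(values: List[float], bits: int = 4,
                      grid: int = 100, max_steps: int = 80) -> Tuple[float, float, float]:
    """Try shrinking both ends of the range toward zero; keep the lowest-MSE range."""
    lo0, hi0 = min(values), max(values)
    best_err = mean_squared_error(values, quant_dequant(values, lo0, hi0, bits))
    best_lo, best_hi = lo0, hi0
    for step in range(1, max_steps + 1):
        p = 1.0 - step / grid           # shrink factor: 0.99, 0.98, ...
        lo, hi = p * lo0, p * hi0
        err = mean_squared_error(values, quant_dequant(values, lo, hi, bits))
        if err < best_err:
            best_err, best_lo, best_hi = err, lo, hi
    return best_err, best_lo, best_hi


# Mostly small weights plus one large outlier.
values = [0.1 * math.sin(i) for i in range(127)] + [1.0]
base_err = mean_squared_error(values, quant_dequant(values, min(values), max(values)))
best_err, best_lo, best_hi = search_clip_range(values)

print(f"min/max range MSE: {base_err:.3e}")
print(f"searched range MSE: {best_err:.3e} with range [{best_lo:.4f}, {best_hi:.4f}]")
```

The search can never do worse than the plain min/max range, because that range is the starting candidate. Whether a narrower range actually wins depends on how heavy the outliers are.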

Single-GPU support

After optimization, the model runs efficiently on a single NVIDIA A100 GPU (80 GB VRAM).

Model capabilities

Text generation

Efficient inference

Quantized model deployment

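Use cases

Efficient text generation

Fast content generation

Quickly generates high-quality text in resource-constrained environments.

Substantially improves inference speed while maintaining high generation quality.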

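Research applications

Quantization research

Serves as a case study for research on quantizing large models.

Demonstrates the effect of applying 4-bit quantization to a large language model.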

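🚀 🔥 Quantized model: Mistral-Small-24B-Instruct-2501_GPTQ_G128_W4A16_MSE 🔥

This model is a 4-bit quantized version of mistralai/Mistral-Small-24B-Instruct-2501, quantized by ConfidentialMind.com 🤖✨. It uses the open-source GPTQModel quantization library to reach 4-bit precision with group size 128, which yields a smaller, faster model with minimal loss in performance.

It was run on a single NVIDIA A100 GPU (80 GB VRAM).

Note: Because the model is small, batch_size is set quite high. You may need to adjust it to fit your GPU's VRAM.
Note 2: Because of the "packed" nature of the mistral-small weights, MSE and a higher damping factor were applied aggressively. This reduced loss and perplexity, but G32 (group size 32) is still recommended.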
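
For readers unfamiliar with the damping factor: in the GPTQ reference implementation, damping adds a fraction of the mean Hessian diagonal to every diagonal entry before inversion, which keeps the matrix well conditioned. The sketch below shows that step in pure Python on a toy 2x2 matrix. It assumes that `damp_percent` in GPTQModel plays this role; the value 0.015 is the one passed to `QuantizeConfig` in the quantization script, while the function name and the toy matrix are made up for illustration.

```python
from typing import List

Matrix = List[List[float]]


def apply_damping(hessian: Matrix, damp_percent: float = 0.015) -> Matrix:
    """Return a copy of `hessian` with damp_percent * mean(diag) added to the diagonal."""
    n = len(hessian)
    mean_diag = sum(hessian[i][i] for i in range(n)) / n
    damp = damp_percent * mean_diag
    damped = [row[:] for row in hessian]   # copy so the input is left untouched
    for i in range(n):
        damped[i][i] += damp
    return damped


def det2(m: Matrix) -> float:
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


# Toy Hessian that is exactly singular (second row is half the first).
H = [[4.0, 2.0], [2.0, 1.0]]
damped = apply_damping(H, damp_percent=0.015)

print(f"determinant before damping: {det2(H)}")
print(f"determinant after damping:  {det2(damped):.6f}")
```

A larger damping value makes the inversion more stable at the cost of biasing the error compensation, which is why it is treated as a tuning knob.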

🚀 Quick start

This quantized model offers a small footprint and fast inference with minimal loss in performance. The sections below cover model details, usage, and installation.

✨ Key features

Smaller model size and faster inference through 4-bit quantization
GPTQ quantization with group size 128
Quantization with minimal loss in performance
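
The quantization script in this card measures performance loss as average perplexity on wikitext. As a reminder of what that number means, here is a minimal sketch that computes perplexity from per-token log-probabilities and then averages over evaluation windows, as the script does with the values returned by GPTQModel's Perplexity utility. The log-probabilities below are made-up inputs, not measurements of this model.

```python
import math
from typing import List


def perplexity(token_log_probs: List[float]) -> float:
    """Perplexity = exp(-mean log-probability) over the tokens in one window."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def average_perplexity(windows: List[List[float]]) -> float:
    """Arithmetic mean of per-window perplexities."""
    values = [perplexity(w) for w in windows]
    return sum(values) / len(values)


# Hypothetical data: a model that assigns probability 0.25 to every token
# in the first window and 0.5 to every token in the second.
window_a = [math.log(0.25)] * 8
window_b = [math.log(0.5)] * 8

ppl_a = perplexity(window_a)
ppl_b = perplexity(window_b)
avg_ppl = average_perplexity([window_a, window_b])

print(f"window A perplexity: {ppl_a:.3f}")
print(f"window B perplexity: {ppl_b:.3f}")
print(f"average perplexity:  {avg_ppl:.3f}")
```

Lower is better: a perplexity of 4 corresponds to the model being, on average, as uncertain as a uniform choice among four tokens.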

📦 Installation

Package versions and installation steps

See pyproject.toml for the exact Python library versions (uv is required).

uv venv
source .venv/bin/activate
uv sync

Environment variables

HF_TOKEN=<YOUR_HF_TOKEN>
TOKENIZERS_PARALLELISM="true"
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
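
Before launching a long quantization run, it can help to confirm these variables are actually set. The following is a small, hypothetical pre-flight check using only the standard library; the function name and the split into required and recommended variables are assumptions made for this example, not part of the original tooling.

```python
import os
from typing import List, Mapping, Optional

# Assumption: HF_TOKEN is treated as required, the other two as recommended.
REQUIRED = ("HF_TOKEN",)
RECOMMENDED = {
    "TOKENIZERS_PARALLELISM": "true",
    "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
}


def check_environment(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return a list of human-readable problems; an empty list means all good."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED:
        if not env.get(name):
            problems.append(f"{name} is not set")
    for name, expected in RECOMMENDED.items():
        if env.get(name) != expected:
            problems.append(f"{name} should be {expected!r}, got {env.get(name)!r}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(f"warning: {problem}")
```

Since the quantization script calls load_dotenv, the same variables can also be placed in a .env file instead of being exported in the shell.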

💻 Usage examples

Basic usage

from gptqmodel import GPTQModel
from transformers import AutoTokenizer

# Use a local directory, or JustJaro/Mistral-Small-24B-Instruct-2501_gptq_g128_4bit once it has been uploaded
quantized_model_id = "/home/jaro/models/quantized/Mistral-Small-24B-Instruct-2501_gptq_g128_4bit"  # or "JustJaro/Mistral-Small-24B-Instruct-2501_gptq_g128_4bit"
tokenizer = AutoTokenizer.from_pretrained(quantized_model_id)
model = GPTQModel.load(quantized_model_id, device="cuda:0")  # or "cpu"

input_text = "This is a test prompt"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

📚 Documentation

Model details

Property	Details
Original model	mistralai/Mistral-Small-24B-Instruct-2501
Quantized model	Mistral-Small-24B-Instruct-2501_gptq_g128_4bit (this repository)
Quantization method	GPTQ (4-bit, group size 128)
Quantization library	GPTQModel
Calibration dataset	neuralmagic/LLM_compression_calibration (512 samples at sequence length 4096)
Quantized by	ConfidentialMind.com
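
The calibration settings in the table interact with how the quantization script selects samples: it extracts text from either a "text" field or a list of chat "messages", keeps only samples whose character length is at least 80% of the sequence length, and then takes the first nsamples of them. The sketch below reproduces that selection logic on in-memory dictionaries so it can be tried without downloading a dataset. The helper name `select_calibration_texts` and the toy examples are hypothetical.

```python
from typing import Dict, List


def get_text_from_example(example: Dict) -> str:
    """Prefer a non-empty "text" field; otherwise join non-empty message contents."""
    if example.get("text"):
        return example["text"]
    if "messages" in example:
        contents = [msg.get("content", "").strip() for msg in example["messages"]]
        return " ".join(s for s in contents if s)
    return ""


def select_calibration_texts(examples: List[Dict], nsamples: int, seqlen: int,
                             min_fraction: float = 0.8) -> List[str]:
    """Keep texts with at least min_fraction * seqlen characters, then take nsamples."""
    min_chars = int(seqlen * min_fraction)
    texts = (get_text_from_example(e) for e in examples)
    kept = [t for t in texts if len(t) >= min_chars]
    return kept[:nsamples]


# Toy data with a tiny seqlen so the threshold (8 characters) is easy to see.
examples = [
    {"text": "short"},
    {"text": "long enough text"},
    {"messages": [{"content": "hello "}, {"content": " world!"}]},
    {"other": "ignored"},
]
selected = select_calibration_texts(examples, nsamples=5, seqlen=10)
print(selected)
```

Note that the threshold is measured in characters, not tokens, so a sample that passes the filter can still tokenize to fewer than seqlen tokens.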

Quantization script

Below is the quantize.py script used to generate this model (the dependency versions are the exact ones used as well).

#!/usr/bin/env python3
"""
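This script loads a source Hugging Face model and a calibration dataset, then
quantizes the model to 4-bit precision with group size 128 using GPTQModel.
The quantized model is saved with the Transformers API in safetensors format (safe serialization)
under ~/models/quantized/. The script then uploads the model, tokenizer, and an auto-generated README.md
to create or update a Hugging Face repository (with the _gptq_g128_4bit suffix).

Usage example: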
    python quantize.py --source-model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
                       --calibration-dataset wikitext/wikitext-2-raw-v1 \
                       --seq-len 1024 --nsamples 256 --hf-token <YOUR_HF_TOKEN>
"""

import os
import shutil
import subprocess
from pathlib import Path
from typing import List

import torch
import typer
from datasets import load_dataset
from dotenv import load_dotenv, find_dotenv
from gptqmodel import GPTQModel, QuantizeConfig
from gptqmodel.utils import Perplexity
# Used later to push to the model hub
from huggingface_hub import HfApi
from transformers import AutoTokenizer, PreTrainedTokenizerBase

load_dotenv(find_dotenv())
HF_TOKEN = os.getenv("HF_TOKEN")

app = typer.Typer()


def get_text_from_example(example: dict) -> str:
    """
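    Returns the text from a dataset sample.
    If the sample has a non-empty "text" field, that text is used.
    Otherwise, if it has a "messages" field (a list of dicts with a "content" key),
    the function returns all non-empty message contents joined with spaces.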
    """
    if "text" in example and example["text"]:
        return example["text"]
    elif "messages" in example:
        contents = [msg.get("content", "").strip() for msg in example["messages"]]
        return " ".join([s for s in contents if s])
    else:
        return ""


def get_calibration_dataset(
    tokenizer: PreTrainedTokenizerBase,
    nsamples: int,
    seqlen: int,
    calibration_dataset: str
    ) -> List[dict]:
    """
    Loads the calibration dataset from the Hugging Face Hub (or a local file).
    Accepts datasets with a single "text" field (like wikitext) or a "messages" field (like the Neural Magic LLM Compression Calibration dataset).
    Only samples whose extracted text is at least 80% of 'seqlen' characters long are kept.
    Each selected sample is tokenized (truncated to 'seqlen' tokens) and returned as a dict.
    """
    ds = None
    try:
        # Try loading from the HF Hub.
        try:
            if "/" in calibration_dataset:
                parts = calibration_dataset.split("/", 1)
                ds = load_dataset(parts[0], parts[1], split="train")
            else:
                ds = load_dataset(calibration_dataset, split="train")
        except Exception as e:
            print(f"Error loading dataset '{calibration_dataset}' via load_dataset: {e}")
            ds = load_dataset(calibration_dataset, split="train")
            print(f"Loaded calibration dataset from full remote path {calibration_dataset}.")


    except Exception as e:
        print(f"Error loading dataset '{calibration_dataset}' via load_dataset: {e}")
        # Fallback: if the given calibration_dataset is a local path, try loading it as JSON Lines.
        if os.path.exists(calibration_dataset):
            try:
                ds = load_dataset("json", data_files=calibration_dataset, split="train")
                print(f"Loaded calibration dataset from local file {calibration_dataset}.")
            except Exception as e2:
                print(f"Error loading local json dataset from '{calibration_dataset}': {e2}")
                return []
        else:
            return []

    print(f"Dataset features: {ds.features}")

    # Keep only samples whose extracted text is at least 80% of 'seqlen' characters long.
    ds = ds.filter(lambda x: len(get_text_from_example(x)) >= int(seqlen*0.8))
    sample_range = min(nsamples, len(ds))
    calibration_data = []
    for i in range(sample_range):
        example = ds[i]
        text = get_text_from_example(example)
        tokenized = tokenizer(text, truncation=True, max_length=seqlen, return_tensors="pt")
        tokenized = {k: v.squeeze(0) for k, v in tokenized.items()}
        calibration_data.append(tokenized)
    return calibration_data


def calculate_avg_ppl(model, tokenizer):
    """
    Computes the average perplexity on the train split of wikitext-2-raw-v1 using GPTQModel's Perplexity utility.
    """
    ppl = Perplexity(
        model=model,
        tokenizer=tokenizer,
        dataset_path="wikitext",
        dataset_name="wikitext-2-raw-v1",
        split="train",
        text_column="text",
    )
    ppl_values = ppl.calculate(n_ctx=512, n_batch=512)
    avg = sum(ppl_values) / len(ppl_values)
    return avg


def get_pinned_package_versions():
    """
    Retrieves pinned package versions using 'uv pip freeze'.
    Returns a dict mapping lowercase package names to versions.
    """
    try:
        result = subprocess.run(["uv", "pip", "freeze"], capture_output=True, text=True, check=True)
        packages_output = result.stdout.strip()
        versions = {}
        for line in packages_output.splitlines():
            if "==" in line:
                package_name, package_version = line.split("==", 1)
                versions[package_name.lower()] = package_version
        return versions
    except subprocess.CalledProcessError as e:
        typer.echo(f"Error running 'uv pip freeze': {e}", err=True)
        return {}
    except FileNotFoundError:
        typer.echo("uv command not found. Make sure uv is installed and in your PATH.", err=True)
        return {}


@app.command()
def main(
    seq_len: int = typer.Option(4096, help="Sequence length for tokenization and calibration."),
    nsamples: int = typer.Option(512, help="Number of samples to use for calibration."),
    source_model: str = typer.Option("mistralai/Mistral-Small-24B-Instruct-2501",
                                     help="Source model HF repository identifier."),
    calibration_dataset: str = typer.Option("wikitext/wikitext-2-raw-v1",
                                              help="Calibration dataset identifier (in 'dataset/config' format) or local file path."),
    hf_token: str = typer.Option(HF_TOKEN,
                                 help="Hugging Face token for creating/updating your repo."),
):
    # Prepare the destination directory and model name.
    model_name = source_model.split("/")[-1]
    quantized_model_name = f"{model_name}_gptq_g128_4bit"
    quantized_model_dir = os.path.expanduser(os.path.join("~/models/quantized", quantized_model_name))
    if not os.path.exists(quantized_model_dir):
        os.makedirs(quantized_model_dir, exist_ok=True)

        typer.echo("Loading tokenizer from source model...")
        tokenizer_obj = AutoTokenizer.from_pretrained(source_model, use_fast=True)

        typer.echo("Loading calibration dataset...")
        typer.echo(f"Calibration dataset: {calibration_dataset}")
        calibration_data = get_calibration_dataset(tokenizer_obj, nsamples, seq_len, calibration_dataset)
        if not calibration_data:
            typer.echo("Calibration dataset is empty. Aborting.", err=True)
            raise typer.Exit(code=1)

        quantize_config = QuantizeConfig(bits=4, group_size=128, mse=0.01, damp_percent=0.015)
        device = "cuda:0" if torch.cuda.is_available() else "cpu"
        typer.echo(f"Loading model in {device} mode...")
        model = GPTQModel.load(source_model, quantize_config)

        typer.echo("Quantizing model...")
        model.quantize(calibration_data, auto_gc=False, batch_size=int(nsamples*0.1))
        # Fetch pinned package versions and Hugging Face user info for README generation.
        package_versions = get_pinned_package_versions()
        username = get_my_user(hf_token)

        script_content = self_read_script()

        typer.echo(f"Saving quantized model to {quantized_model_dir} using Transformers safe serialization...")
        try:
            model.save_pretrained(quantized_model_dir)
            tokenizer_obj.save_pretrained(quantized_model_dir)
        except Exception as ex:
            typer.echo(f"Error during saving with safe_serialization: {ex}. Aborting.")
            raise
        typer.echo(f"Quantized model saved to {quantized_model_dir}")
    else:
        tokenizer_obj = AutoTokenizer.from_pretrained(source_model, use_fast=True)
        package_versions = get_pinned_package_versions()
        username = get_my_user(hf_token)
        script_content = self_read_script()


        device = "cuda:0" if torch.cuda.is_available() else "cpu"
    model = GPTQModel.load(quantized_model_dir, device=device)
    avg_ppl = calculate_avg_ppl(model, tokenizer_obj)
    typer.echo(f"Average perplexity (PPL) on wikitext v2 dataset: {avg_ppl}")
    deps = Path("./pyproject.toml")
    shutil.copy(deps, quantized_model_dir)
    generate_readme(calibration_dataset, nsamples, package_versions, quantized_model_dir,
                    quantized_model_name, script_content, seq_len, source_model, username, avg_ppl)
    # Use the token passed on the command line (it defaults to HF_TOKEN from the environment).
    GPTQModel.push_to_hub(quantized_path=quantized_model_dir, private=False, repo_id=quantized_model_name,
                          token=hf_token)
    typer.echo(f"Model uploaded to Hugging Face repo: {quantized_model_name}")
    demo_input = tokenizer_obj("test is", return_tensors="pt").to(device)
    generated_ids = model.generate(**demo_input)
    output_text = tokenizer_obj.decode(generated_ids[0])
    typer.echo(f"Inference demo output: {output_text}")
    typer.echo(f"Average perplexity (PPL) on wikitext v2 dataset: {avg_ppl}")


def self_read_script():
    try:
        script_path = os.path.abspath(__file__)
        with open(script_path, "r") as f:
            script_content = f.read()
    except Exception as e:
        script_content = "Error reading script content: " + str(e)
    return script_content


def get_my_user(hf_token):
    api = HfApi(token=hf_token)
    try:
        # whoami() can raise (for example on an invalid token), so call it inside the try block.
        user_info = api.whoami()
        username = user_info.get("name") or user_info.get("username")
    except Exception as e:
        typer.echo(f"Error retrieving username from Hugging Face API: {e}. Using default username.")
        username = None
    if not username:
        typer.echo("Could not determine your Hugging Face username from the token, defaulting to hard coded username.",
                   err=True)
        username = "JustJaro"
    return username


def generate_readme(calibration_dataset, nsamples, package_versions, quantized_model_dir,
                    quantized_model_name, script_content, seq_len, source_model, username, avg_ppl):
    readme_content = f"""{MakeYourown}""