Open-Insurance-LLM-Llama3-8B-GGUF open-source model: free to deploy, built for insurance-related inquiries and conversations

Open Insurance LLM Llama3 8B GGUF

Developed by Raj-Maharajwala

A GGUF-quantized version of an insurance domain-specific language model based on NVIDIA's Llama3-ChatQA, fine-tuned for insurance-related queries and conversations.

Large Language Model

Transformers

English #Insurance domain fine-tuning #GGUF quantization #Context-aware conversation

Downloads 130

Release Time : 11/22/2024

Model Overview

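This language model is optimized for the insurance domain. It handles insurance-related queries and conversations, and provides professional insurance policy interpretation and consultation.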

Model Features

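Insurance domain fine-tuning

Fine-tuned specifically for the insurance domain, so it handles insurance-related queries and conversations more effectively.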

Multiple quantization options

Available in 8-bit (Q8_0), 5-bit (Q5_K_M), 4-bit (Q4_K_M), and 16-bit variants to suit different hardware requirements.
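As a rough illustration of choosing among these variants, the sketch below maps a quantization label to the GGUF file names that appear in the inference script on this page. The helper name `pick_model_file` is hypothetical, and the 16-bit variant is omitted because this page does not state its file name.

```python
# File names are taken from the inference script on this page; the 16-bit
# file name is not stated there, so it is deliberately left out.
QUANT_FILES = {
    "Q4_K_M": "open-insurance-llm-q4_k_m.gguf",
    "Q5_K_M": "open-insurance-llm-q5_k_m.gguf",
    "Q8_0": "open-insurance-llm-q8_0.gguf",
}


def pick_model_file(quant: str) -> str:
    """Return the GGUF file name for a quantization label (hypothetical helper)."""
    key = quant.upper()
    if key not in QUANT_FILES:
        raise ValueError(
            f"Unknown quantization {quant!r}; expected one of {sorted(QUANT_FILES)}"
        )
    return QUANT_FILES[key]


print(pick_model_file("q4_k_m"))
```

The returned name could then be assigned to `model_file` in the script's `ModelConfig`.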

Context awareness

Maintains the conversation history to produce context-aware responses and a consistent conversational experience.
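A minimal sketch of how a conversation history can be folded into a prompt, using the same System/User/Assistant layout as the inference script on this page. The function name `build_prompt` and the `max_chars` budget are assumptions for illustration: the original script keeps the full history and does not trim it, and a character count is only a crude stand-in for a token budget.

```python
from typing import Dict, List


def build_prompt(system: str, history: List[Dict[str, str]], question: str,
                 max_chars: int = 6000) -> str:
    """Build a System/User/Assistant prompt, dropping the oldest turns if too long.

    max_chars is a crude stand-in for a token budget (assumption, not a model limit).
    """
    head = f"System: {system}\n\n"
    tail = f"User: {question}\n\nAssistant:"
    turns = [
        f"User: {t['user']}\n\nAssistant: {t['assistant']}\n\n" for t in history
    ]
    # Drop the oldest exchanges until the whole prompt fits the budget.
    while turns and len(head) + sum(map(len, turns)) + len(tail) > max_chars:
        turns.pop(0)
    return head + "".join(turns) + tail


history = [
    {"user": "What is a deductible?",
     "assistant": "The amount you pay before coverage applies."},
    {"user": "And a premium?",
     "assistant": "The price you pay for the policy."},
]
prompt = build_prompt("You are an insurance assistant.", history,
                      "How do they relate?", max_chars=200)
print(prompt)
```

With the small budget used here, the oldest exchange is dropped and only the most recent one is kept ahead of the new question.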

Model Capabilities

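Insurance policy interpretation

Claims processing assistance

Coverage analysis

Insurance terminology clarification

Policy comparison and recommendations

Risk assessment inquiries

Answers to insurance compliance questions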

Use Cases

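Insurance consultation

Understanding insurance policies

Helps users understand the terms and conditions of complex insurance policies.

Provides clear, professional policy interpretation

Claims filing guidance

Helps users understand the claims process and the documents required.

Simplifies the claims process and improves user satisfaction

Risk assessment

Insurance needs assessment

Recommends suitable insurance products based on the user's situation.

Provides personalized insurance recommendations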

🚀 Open-Insurance-LLM-Llama3-8B-GGUF

This model is a GGUF-quantized version of an insurance domain-specific language model based on NVIDIA's Llama3-ChatQA. It is fine-tuned for insurance-related queries and conversations.

🚀 Quick Start

Environment Setup

Windows

python3 -m venv .venv_open_insurance_llm
.\.venv_open_insurance_llm\Scripts\activate

Mac/Linux

python3 -m venv .venv_open_insurance_llm
source .venv_open_insurance_llm/bin/activate

Installation

Mac users with Metal support

export FORCE_CMAKE=1
CMAKE_ARGS="-DGGML_METAL=on" pip install --upgrade --force-reinstall llama-cpp-python==0.3.2 --no-cache-dir

Windows users (CPU only)

pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu

Installing Dependencies

Next, install the dependencies (inference_requirements.txt) attached under Files and Versions.

pip install -r inference_requirements.txt
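After installation, a quick check like the following can confirm that a virtual environment is active and that the modules imported by the inference script are importable. This is a sketch: the module list is taken from the script's import statements, not from the contents of inference_requirements.txt, and `missing_modules` is a hypothetical helper.

```python
import importlib.util
import sys

# Import names used by the inference script on this page.
REQUIRED = ["llama_cpp", "rich", "huggingface_hub"]


def missing_modules(names):
    """Return the import names that cannot be found in the current environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]


in_venv = sys.prefix != sys.base_prefix  # True when a virtual environment is active
print("virtual environment active:", in_venv)
print("missing modules:", missing_modules(REQUIRED) or "none")
```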

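✨ Key Features

Domain-specific fine-tuning: Fine-tuned for the insurance domain, so it handles insurance-related queries and conversations more effectively.
Multiple quantization options: Available in 8-bit (Q8_0), 5-bit (Q5_K_M), 4-bit (Q4_K_M), and 16-bit variants.
Context awareness: Maintains the conversation history and can respond with that context in mind.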

💻 Usage Examples

Basic Usage

# Attached under `Files and Versions` (inference_open-insurance-llm-gguf.py)
import os
import time
from pathlib import Path
from llama_cpp import Llama
from rich.console import Console
from huggingface_hub import hf_hub_download
from dataclasses import dataclass
from typing import List, Dict, Any, Tuple

@dataclass
class ModelConfig:
    # Optimized parameters for coherent responses and efficient performance on devices like MacBook Air M2
    model_name: str = "Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B-GGUF"
    model_file: str = "open-insurance-llm-q4_k_m.gguf"
    # model_file: str = "open-insurance-llm-q8_0.gguf"  # 8-bit quantization; higher precision, better quality, increased resource usage
    # model_file: str = "open-insurance-llm-q5_k_m.gguf"  # 5-bit quantization; balance between performance and resource efficiency
    max_tokens: int = 1000  # Maximum number of tokens to generate in a single output
    temperature: float = 0.1  # Controls randomness in output; lower values produce more coherent responses (performs scaling distribution)
    top_k: int = 15  # After temperature scaling, Consider the top 15 most probable tokens during sampling
    top_p: float = 0.2  # After reducing the set to 15 tokens, Uses nucleus sampling to select tokens with a cumulative probability of 20%
    repeat_penalty: float = 1.2  # Penalize repeated tokens to reduce redundancy
    num_beams: int = 4  # Intended beam count for beam search; llama-cpp-python does not appear to expose beam search, so this value is likely ignored
    n_gpu_layers: int = -2  # Number of layers to offload to GPU; -1 for full GPU utilization, -2 for automatic configuration
    n_ctx: int = 2048  # Context window size; Llama 3 models support up to 8192 tokens context length
    n_batch: int = 256  # Number of tokens to process simultaneously; adjust based on available hardware (suggested 512)
    verbose: bool = False  # True for enabling verbose logging for debugging purposes
    use_mmap: bool = False  # Memory-map model to reduce RAM usage; set to True if running on limited memory systems
    use_mlock: bool = True  # Lock model into RAM to prevent swapping; improves performance on systems with sufficient RAM
    offload_kqv: bool = True  # Offload key, query, value matrices to GPU to accelerate inference


class InsuranceLLM:
    def __init__(self, config: ModelConfig):
        self.config = config
        self.llm_ctx = None
        self.console = Console()
        self.conversation_history: List[Dict[str, str]] = []
        
        self.system_message = (
            "This is a chat between a user and an artificial intelligence assistant. "
            "The assistant gives helpful, detailed, and polite answers to the user's questions based on the context. "
            "The assistant should also indicate when the answer cannot be found in the context. "
            "You are an expert from the Insurance domain with extensive insurance knowledge and "
            "professional writer skills, especially about insurance policies. "
            "Your name is OpenInsuranceLLM, and you were developed by Raj Maharajwala. "
            "You are willing to help answer the user's query with a detailed explanation. "
            "In your explanation, leverage your deep insurance expertise, such as relevant insurance policies, "
            "complex coverage plans, or other pertinent insurance concepts. Use precise insurance terminology while "
            "still aiming to make the explanation clear and accessible to a general audience."
        )

    def download_model(self) -> str:
        try:
            with self.console.status("[bold green]Downloading model..."):
                model_path = hf_hub_download(
                    self.config.model_name,
                    filename=self.config.model_file,
                    local_dir=os.path.join(os.getcwd(), 'gguf_dir')
                )
            return model_path
        except Exception as e:
            self.console.print(f"[red]Error downloading model: {str(e)}[/red]")
            raise

    def load_model(self) -> None:
        try:
            quantized_path = os.path.join(os.getcwd(), "gguf_dir")
            directory = Path(quantized_path)

            try:
                model_path = str(list(directory.glob(self.config.model_file))[0])
            except IndexError:
                model_path = self.download_model()

            with self.console.status("[bold green]Loading model..."):
                self.llm_ctx = Llama(
                    model_path=model_path,
                    n_gpu_layers=self.config.n_gpu_layers,
                    n_ctx=self.config.n_ctx,
                    n_batch=self.config.n_batch,
                    num_beams=self.config.num_beams,
                    verbose=self.config.verbose,
                    use_mlock=self.config.use_mlock,
                    use_mmap=self.config.use_mmap,
                    offload_kqv=self.config.offload_kqv
                )
        except Exception as e:
            self.console.print(f"[red]Error loading model: {str(e)}[/red]")
            raise

    def build_conversation_prompt(self, new_question: str, context: str = "") -> str:
        prompt = f"System: {self.system_message}\n\n"
        
        # Add conversation history
        for exchange in self.conversation_history:
            prompt += f"User: {exchange['user']}\n\n"
            prompt += f"Assistant: {exchange['assistant']}\n\n"
        
        # Add the new question
        if context:
            prompt += f"User: Context: {context}\nQuestion: {new_question}\n\n"
        else:
            prompt += f"User: {new_question}\n\n"
            
        prompt += "Assistant:"
        return prompt

    def generate_response(self, prompt: str) -> Tuple[str, int, float]:
        if not self.llm_ctx:
            raise RuntimeError("Model not loaded. Call load_model() first.")
        
        self.console.print("[bold cyan]Assistant: [/bold cyan]", end="")
        complete_response = ""
        token_count = 0
        start_time = time.time()

        try:
            for chunk in self.llm_ctx.create_completion(
                prompt,
                max_tokens=self.config.max_tokens,
                top_k=self.config.top_k,
                top_p=self.config.top_p,
                temperature=self.config.temperature,
                repeat_penalty=self.config.repeat_penalty,
                stream=True
            ):
                text_chunk = chunk["choices"][0]["text"]
                complete_response += text_chunk
                token_count += 1
                print(text_chunk, end="", flush=True)
            
            elapsed_time = time.time() - start_time
            print()
            return complete_response, token_count, elapsed_time
        except Exception as e:
            self.console.print(f"\n[red]Error generating response: {str(e)}[/red]")
            return "I encountered an error while generating a response. Please try again or ask a different question.", 0, 0.0

    def run_chat(self):
        try:
            self.load_model()
            self.console.print("\n[bold green]Welcome to Open-Insurance-LLM![/bold green]")
            self.console.print("Enter your questions (type '/bye', 'exit', or 'quit' to end the session)\n")
            self.console.print("Optional: You can provide context by typing 'context:' followed by your context, then 'question:' followed by your question\n")
            self.console.print("Your conversation history will be maintained for context-aware responses.\n")
            
            total_tokens = 0
            
            while True:
                try:
                    user_input = self.console.input("[bold cyan]User:[/bold cyan] ").strip()

                    if user_input.lower() in ["exit", "/bye", "quit"]:
                        self.console.print(f"\n[dim]Total tokens: {total_tokens}[/dim]")
                        self.console.print("\n[bold green]Thank you for using OpenInsuranceLLM![/bold green]")
                        break

                    # Reset conversation with command
                    if user_input.lower() == "/reset":
                        self.conversation_history = []
                        self.console.print("[yellow]Conversation history has been reset.[/yellow]")
                        continue

                    context = ""
                    question = user_input
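                    # Match the markers case-insensitively, then slice the original text
                    lowered = user_input.lower()
                    c_idx = lowered.find("context:")
                    q_idx = lowered.find("question:")
                    if 0 <= c_idx < q_idx:
                        context = user_input[c_idx + len("context:"):q_idx].strip()
                        question = user_input[q_idx + len("question:"):].strip()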

                    prompt = self.build_conversation_prompt(question, context)
                    response, tokens, elapsed_time = self.generate_response(prompt)
                    
                    # Add to conversation history
                    self.conversation_history.append({
                        "user": question,
                        "assistant": response
                    })
                    
                    # Update total tokens
                    total_tokens += tokens
                    
                    # Print metrics
                    tokens_per_sec = tokens / elapsed_time if elapsed_time > 0 else 0
                    self.console.print(
                        f"[dim]Tokens: {tokens} || " +
                        f"Time: {elapsed_time:.2f}s || " +
                        f"Speed: {tokens_per_sec:.2f} tokens/sec[/dim]"
                    )
                    print()  # Add a blank line after each response
                    
                except KeyboardInterrupt:
                    self.console.print("\n[yellow]Input interrupted. Type '/bye', 'exit', or 'quit' to quit.[/yellow]")
                    continue
                except Exception as e:
                    self.console.print(f"\n[red]Error processing input: {str(e)}[/red]")
                    continue
        except Exception as e:
            self.console.print(f"\n[red]Fatal error: {str(e)}[/red]")
        finally:
            if self.llm_ctx:
                del self.llm_ctx


def main():
    try:
        config = ModelConfig()
        llm = InsuranceLLM(config)
        llm.run_chat()
    except KeyboardInterrupt:
        print("\nProgram interrupted by user")
    except Exception as e:
        print(f"\nApplication error: {str(e)}")


if __name__ == "__main__":
    main()

python3 inference_open-insurance-llm-gguf.py
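To make the sampling settings in `ModelConfig` more concrete, here is a toy illustration of temperature scaling followed by top-k and top-p filtering. It follows the order described in the script's comments; the sampler order actually used inside llama.cpp may differ. The logits and the function names are made up for the example.

```python
import math
from typing import Dict, List


def softmax(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def filter_candidates(logits: Dict[str, float], temperature: float = 0.1,
                      top_k: int = 15, top_p: float = 0.2) -> List[str]:
    """Apply temperature, then top-k, then top-p; return the surviving tokens."""
    probs = softmax(logits, temperature)
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append(tok)
        cumulative += p
        if cumulative >= top_p:  # stop once the nucleus mass is reached
            break
    return kept


# Made-up logits for four candidate next tokens.
toy_logits = {"premium": 2.0, "deductible": 1.5, "claim": 1.0, "banana": -1.0}
print(filter_candidates(toy_logits))  # settings from ModelConfig
print(filter_candidates(toy_logits, temperature=1.0, top_p=0.9))
```

On this toy distribution, the script's settings (temperature 0.1, top_p 0.2) leave only the single most likely token, which makes generation close to greedy. The looser settings in the second call keep three candidates.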

📚 Documentation

Model Details

Attribute	Details
Model type	Quantized language model (GGUF format)
Base model	nvidia/Llama3-ChatQA-1.5-8B
Fine-tuned model	Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B
Quantized model	Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B-GGUF
Model architecture	Llama
Quantization	8-bit (Q8_0), 5-bit (Q5_K_M), 4-bit (Q4_K_M), 16-bit
Fine-tuning dataset	InsuranceQA (https://github.com/shuzi/insuranceQA)
Developer	Raj Maharajwala
License	llama3
Language	English
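
For a sense of how the quantization level affects memory needs, the back-of-the-envelope sketch below multiplies a parameter count by the nominal bits per weight. Both inputs are assumptions: the parameter count is rounded to 8 billion from the "8B" in the model name, and real GGUF files store additional scale data, so actual file sizes are somewhat larger than these lower bounds.

```python
# Nominal bits per weight for each variant listed above. Real GGUF files
# store extra scale data, so actual sizes are somewhat larger.
NOMINAL_BITS = {"Q4_K_M": 4, "Q5_K_M": 5, "Q8_0": 8, "16-bit": 16}
PARAMS = 8e9  # assumption: roughly 8 billion parameters, per the "8B" in the name


def approx_size_gb(bits_per_weight: float, params: float = PARAMS) -> float:
    """Lower-bound weight size in gigabytes (10**9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


for name, bits in NOMINAL_BITS.items():
    print(f"{name}: at least ~{approx_size_gb(bits):.1f} GB")
```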

NVIDIA Llama3-ChatQA Paper

Arxiv : https://arxiv.org/pdf/2401.10225

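Use Cases

This model is designed for the following scenarios:

Understanding and explaining insurance policies
Claims processing assistance
Coverage analysis
Interpreting insurance terminology
Policy comparison and recommendations
Risk assessment inquiries
Questions about insurance regulations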

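Limitations

The model's knowledge is limited to its training data cutoff date.
It is not a substitute for professional insurance advice.
It may occasionally generate plausible-sounding but incorrect information.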

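Bias and Ethics

Keep the following in mind when using this model:

It may reflect biases present in insurance industry training data.
For important decisions, outputs should be verified by an insurance professional.
Do not use it as the sole basis for insurance decisions.
Treat the model's responses as informational only, not as legal or professional advice.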

Citation and Attribution

If you use the base model or the quantized model in your research or applications, please cite:

@misc{maharajwala2024openinsurance,
  author = {Raj Maharajwala},
  title = {Open-Insurance-LLM-Llama3-8B-GGUF},
  year = {2024},
  publisher = {HuggingFace},
  linkedin = {https://www.linkedin.com/in/raj6800/},
  url = {https://huggingface.co/Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B-GGUF}
}