Open-Insurance-LLM-Llama3-8B-GGUF Open-source Model - Free Deployment, Accurately Address Insurance Queries and Conversations

Open Insurance LLM Llama3 8B GGUF

Developed by Raj-Maharajwala

A GGUF-quantized version of an insurance domain-specific language model based on NVIDIA Llama 3 - ChatQA, fine-tuned for insurance-related queries and conversations.

Large Language Model

Transformers

English #Fine-tuning in the insurance field #GGUF quantization #Context-aware dialogue

Release Time: 11/22/2024

Model Overview

This is a language model optimized for the insurance field, capable of handling insurance-related queries and conversations, and providing professional insurance policy interpretation and consulting services.

Model Features

Fine-tuning in the insurance field

Specifically fine-tuned for the insurance field, it can better handle insurance-related queries and conversations.

Multiple quantization methods

Supports 8-bit (Q8_0), 5-bit (Q5_K_M), and 4-bit (Q4_K_M) quantization, as well as a 16-bit version, to meet different hardware requirements.

Context awareness

Can maintain the conversation history and provide context-aware responses, offering a coherent conversation experience.

Model Capabilities

Insurance policy interpretation

Claims processing assistance

Insurance coverage analysis

Insurance term clarification

Insurance policy comparison and recommendation

Risk assessment queries

Insurance compliance question answering

Use Cases

Insurance consultation

Insurance policy understanding

Help users understand complex insurance policy terms and conditions.

Provide clear and professional policy interpretations

Claims guidance

Assist users in understanding the claims process and required documents.

Simplify the claims process and improve user satisfaction

Risk assessment

Insurance needs assessment

Recommend suitable insurance products based on user circumstances.

Personalized insurance advice

🚀 Open-Insurance-LLM-Llama3-8B-GGUF

This model is a GGUF-quantized version of an insurance domain-specific language model based on Nvidia Llama 3-ChatQA. It is fine-tuned for insurance-related queries and conversations, providing specialized support in the insurance field.

✨ Features

Domain-Specific: Tailored for the insurance industry, capable of handling various insurance-related tasks.
Multiple Quantization Options: Supports 8-bit (Q8_0), 5-bit (Q5_K_M), and 4-bit (Q4_K_M) quantization, as well as a 16-bit version.
Context-Aware: Maintains conversation history for context-aware responses.
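
As a rough illustration of choosing among the quantization options, the sketch below picks a model file from a memory budget. The file names come from the inference script in this card; the `pick_model_file` helper and the memory thresholds are hypothetical assumptions for illustration, not measured requirements or part of the released script.

```python
from typing import List, Tuple

# (assumed minimum free memory in GB, file name), ordered from highest to lowest precision.
# The thresholds are illustrative assumptions, not measured requirements.
QUANT_OPTIONS: List[Tuple[float, str]] = [
    (12.0, "open-insurance-llm-q8_0.gguf"),
    (8.0, "open-insurance-llm-q5_k_m.gguf"),
    (6.0, "open-insurance-llm-q4_k_m.gguf"),
]


def pick_model_file(free_memory_gb: float) -> str:
    """Return the highest-precision file whose assumed threshold fits the budget."""
    for min_gb, file_name in QUANT_OPTIONS:
        if free_memory_gb >= min_gb:
            return file_name
    # Fall back to the smallest option; loading may still fail on very small machines.
    return QUANT_OPTIONS[-1][1]


print(pick_model_file(16.0))
print(pick_model_file(7.0))
```

The returned name could then be assigned to `model_file` in `ModelConfig`; adjust the thresholds to match what you observe on your own hardware.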

📦 Installation

Environment Setup

For Windows

python3 -m venv .venv_open_insurance_llm
.\.venv_open_insurance_llm\Scripts\activate

For Mac/Linux

python3 -m venv .venv_open_insurance_llm
source .venv_open_insurance_llm/bin/activate

Installation

For Mac Users (Metal Support)

export FORCE_CMAKE=1
CMAKE_ARGS="-DGGML_METAL=on" pip install --upgrade --force-reinstall llama-cpp-python==0.3.2 --no-cache-dir

For Windows Users (CPU Support)

pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu

Dependencies

Then install the dependencies listed in inference_requirements.txt (attached under Files and Versions):

pip install -r inference_requirements.txt
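
After installing, it can be useful to confirm that the modules imported by the inference script are actually available in the active environment. The following is a minimal sketch using only the standard library; the `missing_packages` helper is a hypothetical name introduced here, and the module list is taken from the script's import statements.

```python
import importlib.util
from typing import Iterable, List


def missing_packages(module_names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Third-party modules imported by the inference script in this card.
    required = ["llama_cpp", "rich", "huggingface_hub"]
    missing = missing_packages(required)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```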

💻 Usage Examples

Basic Usage

# Attached under `Files and Versions` (inference_open-insurance-llm-gguf.py)
import os
import time
from pathlib import Path
from llama_cpp import Llama
from rich.console import Console
from huggingface_hub import hf_hub_download
from dataclasses import dataclass
from typing import List, Dict, Any, Tuple

@dataclass
class ModelConfig:
    # Optimized parameters for coherent responses and efficient performance on devices like MacBook Air M2
    model_name: str = "Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B-GGUF"
    model_file: str = "open-insurance-llm-q4_k_m.gguf"
    # model_file: str = "open-insurance-llm-q8_0.gguf"  # 8-bit quantization; higher precision, better quality, increased resource usage
    # model_file: str = "open-insurance-llm-q5_k_m.gguf"  # 5-bit quantization; balance between performance and resource efficiency
    max_tokens: int = 1000  # Maximum number of tokens to generate in a single output
    temperature: float = 0.1  # Controls randomness in output; lower values sharpen the token distribution and give more deterministic responses
    top_k: int = 15  # Restrict sampling to the 15 most probable tokens
    top_p: float = 0.2  # Nucleus sampling: keep the smallest set of tokens whose cumulative probability reaches 0.2
    repeat_penalty: float = 1.2  # Penalize repeated tokens to reduce redundancy
    num_beams: int = 4  # Intended beam count for beam search; llama-cpp-python's Llama constructor does not document this option, so it is likely ignored
    n_gpu_layers: int = -2  # Number of layers to offload to GPU; -1 for full GPU utilization, -2 for automatic configuration
    n_ctx: int = 2048  # Context window size; Llama 3 models support up to 8192 tokens context length
    n_batch: int = 256  # Number of tokens to process simultaneously; adjust based on available hardware (suggested 512)
    verbose: bool = False  # True for enabling verbose logging for debugging purposes
    use_mmap: bool = False  # Memory-map model to reduce RAM usage; set to True if running on limited memory systems
    use_mlock: bool = True  # Lock model into RAM to prevent swapping; improves performance on systems with sufficient RAM
    offload_kqv: bool = True  # Offload key, query, value matrices to GPU to accelerate inference


class InsuranceLLM:
    def __init__(self, config: ModelConfig):
        self.config = config
        self.llm_ctx = None
        self.console = Console()
        self.conversation_history: List[Dict[str, str]] = []
        
        self.system_message = (
            "This is a chat between a user and an artificial intelligence assistant. "
            "The assistant gives helpful, detailed, and polite answers to the user's questions based on the context. "
            "The assistant should also indicate when the answer cannot be found in the context. "
            "You are an expert from the Insurance domain with extensive insurance knowledge and "
            "professional writer skills, especially about insurance policies. "
            "Your name is OpenInsuranceLLM, and you were developed by Raj Maharajwala. "
            "You are willing to help answer the user's query with a detailed explanation. "
            "In your explanation, leverage your deep insurance expertise, such as relevant insurance policies, "
            "complex coverage plans, or other pertinent insurance concepts. Use precise insurance terminology while "
            "still aiming to make the explanation clear and accessible to a general audience."
        )

    def download_model(self) -> str:
        try:
            with self.console.status("[bold green]Downloading model..."):
                model_path = hf_hub_download(
                    self.config.model_name,
                    filename=self.config.model_file,
                    local_dir=os.path.join(os.getcwd(), 'gguf_dir')
                )
            return model_path
        except Exception as e:
            self.console.print(f"[red]Error downloading model: {str(e)}[/red]")
            raise

    def load_model(self) -> None:
        try:
            quantized_path = os.path.join(os.getcwd(), "gguf_dir")
            directory = Path(quantized_path)

            try:
                model_path = str(list(directory.glob(self.config.model_file))[0])
            except IndexError:
                model_path = self.download_model()

            with self.console.status("[bold green]Loading model..."):
                self.llm_ctx = Llama(
                    model_path=model_path,
                    n_gpu_layers=self.config.n_gpu_layers,
                    n_ctx=self.config.n_ctx,
                    n_batch=self.config.n_batch,
                    num_beams=self.config.num_beams,
                    verbose=self.config.verbose,
                    use_mlock=self.config.use_mlock,
                    use_mmap=self.config.use_mmap,
                    offload_kqv=self.config.offload_kqv
                )
        except Exception as e:
            self.console.print(f"[red]Error loading model: {str(e)}[/red]")
            raise

    def build_conversation_prompt(self, new_question: str, context: str = "") -> str:
        prompt = f"System: {self.system_message}\n\n"
        
        # Add conversation history
        for exchange in self.conversation_history:
            prompt += f"User: {exchange['user']}\n\n"
            prompt += f"Assistant: {exchange['assistant']}\n\n"
        
        # Add the new question
        if context:
            prompt += f"User: Context: {context}\nQuestion: {new_question}\n\n"
        else:
            prompt += f"User: {new_question}\n\n"
            
        prompt += "Assistant:"
        return prompt

    def generate_response(self, prompt: str) -> Tuple[str, int, float]:
        if not self.llm_ctx:
            raise RuntimeError("Model not loaded. Call load_model() first.")
        
        self.console.print("[bold cyan]Assistant: [/bold cyan]", end="")
        complete_response = ""
        token_count = 0
        start_time = time.time()

        try:
            for chunk in self.llm_ctx.create_completion(
                prompt,
                max_tokens=self.config.max_tokens,
                top_k=self.config.top_k,
                top_p=self.config.top_p,
                temperature=self.config.temperature,
                repeat_penalty=self.config.repeat_penalty,
                stream=True
            ):
                text_chunk = chunk["choices"][0]["text"]
                complete_response += text_chunk
                token_count += 1
                print(text_chunk, end="", flush=True)
            
            elapsed_time = time.time() - start_time
            print()
            return complete_response, token_count, elapsed_time
        except Exception as e:
            self.console.print(f"\n[red]Error generating response: {str(e)}[/red]")
            return "I encountered an error while generating a response. Please try again or ask a different question.", 0, 0.0

    def run_chat(self):
        try:
            self.load_model()
            self.console.print("\n[bold green]Welcome to Open-Insurance-LLM![/bold green]")
            self.console.print("Enter your questions (type '/bye', 'exit', or 'quit' to end the session)\n")
            self.console.print("Optional: You can provide context by typing 'context:' followed by your context, then 'question:' followed by your question\n")
            self.console.print("Your conversation history will be maintained for context-aware responses.\n")
            
            total_tokens = 0
            
            while True:
                try:
                    user_input = self.console.input("[bold cyan]User:[/bold cyan] ").strip()

                    if user_input.lower() in ["exit", "/bye", "quit"]:
                        self.console.print(f"\n[dim]Total tokens: {total_tokens}[/dim]")
                        self.console.print("\n[bold green]Thank you for using OpenInsuranceLLM![/bold green]")
                        break

                    # Reset conversation with command
                    if user_input.lower() == "/reset":
                        self.conversation_history = []
                        self.console.print("[yellow]Conversation history has been reset.[/yellow]")
                        continue

                    context = ""
                    question = user_input
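                    # Match the 'context:' and 'question:' markers case-insensitively
                    lowered = user_input.lower()
                    c_idx = lowered.find("context:")
                    q_idx = lowered.find("question:")
                    if 0 <= c_idx < q_idx:
                        context = user_input[c_idx + len("context:"):q_idx].strip()
                        question = user_input[q_idx + len("question:"):].strip()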

                    prompt = self.build_conversation_prompt(question, context)
                    response, tokens, elapsed_time = self.generate_response(prompt)
                    
                    # Add to conversation history
                    self.conversation_history.append({
                        "user": question,
                        "assistant": response
                    })
                    
                    # Update total tokens
                    total_tokens += tokens
                    
                    # Print metrics
                    tokens_per_sec = tokens / elapsed_time if elapsed_time > 0 else 0
                    self.console.print(
                        f"[dim]Tokens: {tokens} || " +
                        f"Time: {elapsed_time:.2f}s || " +
                        f"Speed: {tokens_per_sec:.2f} tokens/sec[/dim]"
                    )
                    print()  # Add a blank line after each response
                    
                except KeyboardInterrupt:
                    self.console.print("\n[yellow]Input interrupted. Type '/bye', 'exit', or 'quit' to quit.[/yellow]")
                    continue
                except Exception as e:
                    self.console.print(f"\n[red]Error processing input: {str(e)}[/red]")
                    continue
        except Exception as e:
            self.console.print(f"\n[red]Fatal error: {str(e)}[/red]")
        finally:
            if self.llm_ctx:
                del self.llm_ctx


def main():
    try:
        config = ModelConfig()
        llm = InsuranceLLM(config)
        llm.run_chat()
    except KeyboardInterrupt:
        print("\nProgram interrupted by user")
    except Exception as e:
        print(f"\nApplication error: {str(e)}")


if __name__ == "__main__":
    main()
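
To make the sampling settings in `ModelConfig` more concrete, the toy example below applies top-k and top-p (nucleus) filtering to a small hand-made probability table. It is a simplified illustration under stated assumptions: the token names and probabilities are invented, the `top_k_top_p_filter` helper is hypothetical, and the real llama.cpp sampler has more stages than shown here.

```python
from typing import Dict, List, Tuple


def top_k_top_p_filter(
    probs: Dict[str, float], top_k: int, top_p: float
) -> List[Tuple[str, float]]:
    """Keep the top_k most probable tokens, then the smallest prefix reaching top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    kept: List[Tuple[str, float]] = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break  # Nucleus reached: stop adding lower-probability tokens.
    total = sum(p for _, p in kept)
    # Renormalize so the surviving probabilities sum to 1.
    return [(token, p / total) for token, p in kept]


# Invented next-token probabilities, for illustration only.
toy_probs = {"premium": 0.55, "deductible": 0.25, "policy": 0.15, "banana": 0.05}
print(top_k_top_p_filter(toy_probs, top_k=15, top_p=0.2))
```

With `top_p=0.2`, a single dominant token is enough to fill the nucleus in this example, which suggests why these settings tend to behave close to greedy decoding.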

Advanced Usage

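Run the inference script (attached under Files and Versions) from the activated virtual environment to start the interactive chat. Inside the chat, type '/reset' to clear the conversation history, or '/bye', 'exit', or 'quit' to end the session.

python3 inference_open-insurance-llm-gguf.py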
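
The inference script appends every exchange to `conversation_history`, so a long session can eventually exceed the `n_ctx` window (2048 tokens in `ModelConfig`). One possible mitigation, sketched below, drops the oldest exchanges until the history fits a token budget. The `trim_history` and `estimate_tokens` helpers and the four-characters-per-token estimate are assumptions for illustration; they are not part of the released script, and a real tokenizer count would be more accurate.

```python
from typing import Callable, Dict, List


def estimate_tokens(text: str) -> int:
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def trim_history(
    history: List[Dict[str, str]],
    max_history_tokens: int,
    count_tokens: Callable[[str], int] = estimate_tokens,
) -> List[Dict[str, str]]:
    """Keep the most recent exchanges whose combined estimated size fits the budget."""
    kept: List[Dict[str, str]] = []
    used = 0
    # Walk from newest to oldest so recent context is preserved first.
    for exchange in reversed(history):
        cost = count_tokens(exchange["user"]) + count_tokens(exchange["assistant"])
        if used + cost > max_history_tokens:
            break
        kept.append(exchange)
        used += cost
    kept.reverse()  # Restore chronological order.
    return kept
```

For example, assigning `trim_history(self.conversation_history, 1000)` back to `self.conversation_history` before building the prompt would leave room for the system message and the response; the 1000-token budget is likewise only an illustrative choice.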

📚 Documentation

Model Details

Property	Details
Model Type	Quantized Language Model (GGUF format)
Base Model	nvidia/Llama3-ChatQA-1.5-8B
Finetuned Model	Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B
Quantized Model	Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B-GGUF
Model Architecture	Llama
Quantization	8-bit (Q8_0), 5-bit (Q5_K_M), 4-bit (Q4_K_M), 16-bit
Finetuned Dataset	InsuranceQA (https://github.com/shuzi/insuranceQA)
Developer	Raj Maharajwala
License	llama3
Language	English

Nvidia Llama 3 - ChatQA Paper

arXiv: https://arxiv.org/pdf/2401.10225

Use Cases

This model is specifically designed for:

Insurance policy understanding and explanation
Claims processing assistance
Coverage analysis
Insurance terminology clarification
Policy comparison and recommendations
Risk assessment queries
Insurance compliance questions

Limitations

The model's knowledge is limited to its training data cutoff.
Should not be used as a replacement for professional insurance advice.
May occasionally generate plausible-sounding but incorrect information.

Bias and Ethics

This model should be used with awareness that:

It may reflect biases present in insurance industry training data.
Output should be verified by insurance professionals for critical decisions.
It should not be used as the sole basis for insurance decisions.
The model's responses should be treated as informational, not as legal or professional advice.

Citation and Attribution

If you use the base model or the quantized model in your research or applications, please cite:

@misc{maharajwala2024openinsurance,
  author = {Raj Maharajwala},
  title = {Open-Insurance-LLM-Llama3-8B-GGUF},
  year = {2024},
  publisher = {HuggingFace},
  linkedin = {https://www.linkedin.com/in/raj6800/},
  url = {https://huggingface.co/Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B-GGUF}
}

📄 License

The model is licensed under the llama3 license.
