Document Question Answering Model - Kaleidoscope_large_v1
This model is a fine-tuned version of sberbank-ai/ruBert-large, specifically designed for document question answering. It extracts answers from a given document context and was fine-tuned on a custom JSON dataset of context, question, and answer triples.
Quick Start
Prerequisites
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Select the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("LaciaStudio/Kaleidoscope_large_v1")
model = AutoModelForQuestionAnswering.from_pretrained("LaciaStudio/Kaleidoscope_large_v1")
model.to(device)
Usage
# Load the document whose contents will serve as the QA context
file_path = input("Enter document path: ")
with open(file_path, "r", encoding="utf-8") as f:
    context = f.read()

while True:
    question = input("Enter question (or 'exit' to quit): ")
    if question.lower() == "exit":
        break
    # Encode the question and context as a single paired input
    inputs = tokenizer(question, context, return_tensors="pt", truncation=True, max_length=384)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
    # Take the most likely start and end token positions and decode the span
    start_index = torch.argmax(outputs.start_logits)
    end_index = torch.argmax(outputs.end_logits)
    answer_tokens = inputs["input_ids"][0][start_index:end_index + 1]
    answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)
    print("Answer:", answer)
Features
- Objective: Extract answers from documents based on user questions.
- Base Model: sberbank-ai/ruBert-large.
- Dataset: A custom JSON file with fields: context, question, and answer.
- Preprocessing: The input is formed by concatenating the question and the document context, guiding the model to focus on the relevant segments.
Installation
This model is based on the transformers library. You can install it using the following command:

pip install transformers
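The examples in this card also assume torch is installed, and the fine-tuning example further below additionally uses the datasets library:

pip install torch datasets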
Usage Examples
Basic Usage
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("LaciaStudio/Kaleidoscope_large_v1")
model = AutoModelForQuestionAnswering.from_pretrained("LaciaStudio/Kaleidoscope_large_v1")
model.to(device)

context = "I had a red car."
question = "What kind of car did I have?"

# Encode the question and context as a single paired input
inputs = tokenizer(question, context, return_tensors="pt", truncation=True, max_length=384)
inputs = {k: v.to(device) for k, v in inputs.items()}
with torch.no_grad():
    outputs = model(**inputs)

# Take the most likely start and end token positions and decode the span
start_index = torch.argmax(outputs.start_logits)
end_index = torch.argmax(outputs.end_logits)
answer_tokens = inputs["input_ids"][0][start_index:end_index + 1]
answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)
print("Answer:", answer)
Advanced Usage
Advanced usage involves fine-tuning the model on a custom dataset. The following steps show how to fine-tune the model:
- Prepare a custom JSON dataset with context, question, and answer triples.
- Use the transformers library to create a custom training pipeline.
- Set the training parameters such as the number of epochs, batch size, and warmup schedule.
import torch
from transformers import TrainingArguments, Trainer
from datasets import load_dataset

# 'your_custom_dataset.json' is a placeholder for your own JSON file of
# context, question, and answer triples.
dataset = load_dataset('json', data_files='your_custom_dataset.json')

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=20,
    per_device_train_batch_size=4,
    warmup_ratio=0.1,                # warm up over the first 10% of total steps
    fp16=torch.cuda.is_available(),  # mixed precision when a GPU is available
    logging_dir='./logs'
)

# Note: the raw JSON examples must first be converted into token-level
# start/end positions (see the preprocessing sketch below).
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train']
)
trainer.train()
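The Trainer expects token-level features rather than raw strings. Below is a minimal preprocessing sketch, continuing from the snippet above; it assumes a fast tokenizer and that each answer appears verbatim in its context. The actual training script is not published, so names like prepare_features are illustrative.

def prepare_features(example):
    # Tokenize the question/context pair with offset mapping so character
    # positions can be converted to token positions.
    encoding = tokenizer(
        example["question"],
        example["context"],
        truncation="only_second",
        max_length=384,
        return_offsets_mapping=True,
    )
    # Locate the answer span in the raw context (assumes a verbatim match).
    start_char = example["context"].find(example["answer"])
    end_char = start_char + len(example["answer"])

    # Map character offsets to token indices within the context segment.
    start_token, end_token = 0, 0
    for i, (start, end) in enumerate(encoding["offset_mapping"]):
        if encoding.sequence_ids()[i] != 1:
            continue  # skip the question and special tokens
        if start <= start_char < end:
            start_token = i
        if start < end_char <= end:
            end_token = i
    encoding["start_positions"] = start_token
    encoding["end_positions"] = end_token
    encoding.pop("offset_mapping")
    return encoding

tokenized = dataset["train"].map(prepare_features, remove_columns=dataset["train"].column_names)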
Documentation
Training Settings
- Number of epochs: 20.
- Batch size: 4 per device.
- Warmup: 10% of total training steps (warmup_ratio=0.1).
- FP16 training enabled (if CUDA is available).
- Hardware: Training was performed on a single RTX 3070.
Training Process
The model was fine-tuned using the Transformers library with a custom training pipeline. Key aspects of the training process include:
- Custom Dataset: A loader reads a JSON file containing context, question, and answer triples.
- Feature Preparation: The script tokenizes the document and question with a sliding-window approach to handle long texts (see the sliding-window sketch after this list).
- Training Process: Mixed-precision training with the AdamW optimizer.
- Evaluation and Checkpointing: The training script evaluates model performance on a validation set, saves checkpoints, and employs early stopping based on validation loss (see the Trainer sketch after this list).
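The sliding-window step can be illustrated with the Hugging Face fast-tokenizer API; the stride and max_length values here are illustrative, not the published training configuration.

# Split a long document into overlapping 384-token windows so an answer near
# a window boundary is still fully contained in at least one window.
windows = tokenizer(
    question,
    context,
    truncation="only_second",        # never truncate the question
    max_length=384,
    stride=128,                      # overlap between consecutive windows
    return_overflowing_tokens=True,
    padding="max_length",
    return_tensors="pt",
)
# Each row of windows["input_ids"] is one question+window pair; the model
# scores every window and the best-scoring span is taken as the answer.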
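The checkpointing and early-stopping behavior maps onto standard Trainer options. The sketch below uses illustrative values; train_features and validation_features are hypothetical tokenized splits, and the eval_strategy argument is named evaluation_strategy in older transformers releases.

from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

args = TrainingArguments(
    output_dir='./results',
    eval_strategy="epoch",           # evaluate on the validation set each epoch
    save_strategy="epoch",           # save a checkpoint each epoch
    load_best_model_at_end=True,     # restore the best checkpoint after training
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_features,        # hypothetical tokenized train split
    eval_dataset=validation_features,    # hypothetical tokenized validation split
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)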
Evaluation Metrics
- accuracy
- f1
- recall
- exact_match
- precision
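The exact evaluation script is not published; for reference, here is a minimal sketch of how exact match and token-level F1 are commonly computed for extractive QA.

from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    # 1.0 when the normalized strings are identical, else 0.0
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    # Token-overlap F1 between the predicted and reference answers
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)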
Technical Details
The model is based on the transformers library and uses sberbank-ai/ruBert-large as the base model. The custom training pipeline fine-tunes the model on a specific dataset for document question answering. The input preprocessing method of concatenating the question and the document context helps the model focus on the relevant information.
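The effect of this concatenation can be seen directly in the tokenizer output: for a BERT-style model, the question and context are joined into one sequence separated by special tokens.

encoded = tokenizer("What kind of car did I have?", "I had a red car.")
print(tokenizer.decode(encoded["input_ids"]))
# Prints the paired sequence: [CLS] <question tokens> [SEP] <context tokens> [SEP]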
License
This model is licensed under cc-by-nc-4.0.
Additional Information
| Property | Details |
|----------|---------|
| Model Type | Document Question Answering Model |
| Training Data | A custom JSON file with context, question, and answer triples |
| Base Model | sberbank-ai/ruBert-large |
| Library Name | transformers |
Important Note
This model is primarily focused on Russian texts, but it also accepts English-language inputs. Note, however, that its English support has not been tested.

Finetuned by LaciaStudio | LaciaAI