Kaleidoscope_large_v1: Open-Source Document Q&A Model for Answer Extraction in Russian and English

Kaleidoscope Large V1

Developed by LaciaStudio

A document QA model fine-tuned from sberbank-ai/ruBert-large, excelling at extracting answers from documents, supporting Russian and English.

Question Answering System

Transformers

#Supports Multiple Languages #Russian document QA #Contextual answer extraction #Multimodal processing

Downloads 297

Release Time: 2/24/2025

Model Overview

A model designed for document QA: it extracts answers from a provided document context and is well suited to applications such as customer support, document search, and automated QA systems.

Model Features

Multilingual support

Primarily optimized for Russian text, while also supporting English input (not fully tested).

Context understanding

The question and the document context are concatenated into a single input, which guides the model to focus on the relevant passages.

Efficient training

Uses mixed-precision training and the AdamW optimizer; training was completed on a single RTX 3070.

Long text processing

Uses a sliding-window method when tokenizing the document and question, so that long texts can be handled.
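
The sliding-window idea can be illustrated with a minimal sketch. This is not the model's actual preprocessing code: the function name `sliding_windows`, the `stride` value of 128, and the bracketed special-token strings are assumptions for illustration, and `max_length=384` is borrowed from the usage example on this page. Here `stride` means the number of context tokens shared by consecutive windows.

```python
def sliding_windows(context_tokens, question_tokens, max_length=384, stride=128):
    """Split a long context into overlapping windows, each paired with the question.

    Returns a list of (context_offset, window_tokens) tuples.
    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    # Reserve room for the question and three special tokens: [CLS], [SEP], [SEP].
    budget = max_length - len(question_tokens) - 3
    if budget <= 0:
        raise ValueError("question is too long for the chosen max_length")
    if stride >= budget:
        raise ValueError("stride (overlap) must be smaller than the context budget")

    step = budget - stride  # how far the window advances each time
    windows = []
    start = 0
    while True:
        chunk = context_tokens[start:start + budget]
        window = ["[CLS]"] + question_tokens + ["[SEP]"] + chunk + ["[SEP]"]
        windows.append((start, window))
        if start + budget >= len(context_tokens):
            break  # this window already reaches the end of the context
        start += step
    return windows
```

Because consecutive windows overlap, an answer that straddles a window boundary still appears whole in at least one window, provided it is no longer than the overlap.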

Model Capabilities

Document QA

Text understanding

Answer extraction

Multilingual processing

Use Cases

Customer support

Automated customer service

Automatically answering customer questions from product documentation

Improves customer service efficiency, reduces manual intervention

Document retrieval

Enterprise knowledge base query

Quickly finding relevant information from internal company documents

Enhances employee information retrieval efficiency

🚀 Document Question Answering Model - Kaleidoscope_large_v1

This model is a fine-tuned version of sberbank-ai/ruBert-large, specialized for document question answering. It can extract answers from given document contexts.

🚀 Quick Start

The model is a fine-tuned version of sberbank-ai/ruBert-large, tailored for document question answering. It was fine-tuned on a custom JSON dataset of context, question, and answer triples so that it extracts answers from a provided document context.
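
The exact layout of the training file is not published, so the following is only a sketch of what loading such triples might look like. It assumes the file is a JSON array of objects with the keys `context`, `question`, and `answer`; the function name `load_qa_records` and the derived `answer_start` field are hypothetical.

```python
import json

REQUIRED_FIELDS = ("context", "question", "answer")

def load_qa_records(path):
    """Load context/question/answer triples from a JSON file.

    Assumes a top-level JSON array of objects (an assumption, not a documented format).
    """
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)

    examples = []
    for i, rec in enumerate(records):
        missing = [key for key in REQUIRED_FIELDS if key not in rec]
        if missing:
            raise ValueError(f"record {i} is missing fields: {missing}")
        # Extractive QA needs the answer to occur verbatim in the context,
        # so records where it does not are skipped.
        answer_start = rec["context"].find(rec["answer"])
        if answer_start == -1:
            continue
        examples.append({**rec, "answer_start": answer_start})
    return examples
```

Storing the character offset of the answer is a common convention for extractive QA data, since start and end token labels can later be derived from it.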

✨ Features

Objective: Extract answers from documents according to user questions.
Base Model: sberbank-ai/ruBert-large.
Dataset: A custom JSON file with fields: context, question, and answer.
Preprocessing: Concatenate the question and the document context as input to guide the model to focus on relevant segments.
Training Settings:
- Number of epochs: 20.
- Batch size: 4 per device.
- Warmup steps: 10% of total training steps.
- FP16 training: Enabled if CUDA is available.
- Hardware: Trained on a single RTX 3070.
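
To make the settings above concrete, the arithmetic can be sketched as follows. The dataset size is not stated anywhere on this page, so the 1,000 examples used in the usage note are purely hypothetical, as are the function names. The linear decay after warmup is also an assumption: the page states only the warmup fraction, not the schedule shape.

```python
import math

def schedule_summary(num_examples, epochs=20, batch_size=4, warmup_ratio=0.1):
    """Derive step counts from the listed settings (single device assumed)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    warmup_steps = round(total_steps * warmup_ratio)
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }

def lr_multiplier(step, total_steps, warmup_steps):
    """Factor applied to the peak learning rate at a given step.

    Linear warmup from 0 to 1, then linear decay back to 0 (decay shape is assumed).
    """
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
```

For example, `schedule_summary(1000)` gives 250 steps per epoch, 5,000 total steps, and 500 warmup steps, so the learning rate would reach its peak at the end of the second epoch.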

📚 Documentation

The model was fine-tuned using the Transformers library with a custom training pipeline. Key aspects of the training process are as follows:

Custom Dataset: A loader reads a JSON file containing context, question, and answer triples.
Feature Preparation: The script tokenizes the document and question with a sliding window approach to handle long texts.
Training Process: The script uses mixed-precision training and the AdamW optimizer.
Evaluation and Checkpointing: The training script evaluates model performance on a validation set, saves checkpoints, and uses early stopping based on validation loss.
This model is suitable for interactive document question answering tasks, making it a powerful tool for applications such as customer support, document search, and automated Q&A systems.
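
The early-stopping behavior mentioned under evaluation and checkpointing can be sketched as below. The training script itself is not published, so the class name `EarlyStopping`, the `patience` of 3 epochs, and the `min_delta` threshold are assumptions chosen for illustration.

```python
class EarlyStopping:
    """Track validation loss and signal when training should stop.

    Hypothetical helper: patience and min_delta are illustrative defaults.
    """

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one validation result; return (is_best, should_stop)."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
            return True, False        # new best: a good moment to save a checkpoint
        self.bad_epochs += 1
        return False, self.bad_epochs >= self.patience
```

In a training loop, `step` would be called once per evaluation: a checkpoint is saved whenever `is_best` is true, and the loop exits when `should_stop` becomes true.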

⚠️ Important Note

The model is primarily intended for Russian text. It also accepts English input, but English support has not been tested.

💻 Usage Examples

Basic Usage

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Load the tokenizer and model, and move the model to the GPU if one is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("LaciaStudio/Kaleidoscope_large_v1")
model = AutoModelForQuestionAnswering.from_pretrained("LaciaStudio/Kaleidoscope_large_v1")
model.to(device)
model.eval()  # inference mode: disables dropout

file_path = input("Enter document path: ")
with open(file_path, "r", encoding="utf-8") as f:
    context = f.read()

while True:
    question = input("Enter question (or 'exit' to quit): ")
    if question.strip().lower() == "exit":
        break
    # Encode the question and context as one pair. Only the context is truncated,
    # so text beyond the 384-token limit is not seen by the model.
    inputs = tokenizer(question, context, return_tensors="pt", truncation="only_second", max_length=384)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():  # gradients are not needed for inference
        outputs = model(**inputs)
    # Pick the most likely start and end token positions.
    start_index = int(torch.argmax(outputs.start_logits))
    end_index = int(torch.argmax(outputs.end_logits))
    if end_index < start_index:
        print("Answer: (no valid span found)")
        continue
    answer_tokens = inputs["input_ids"][0][start_index:end_index + 1]
    answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)
    print("Answer:", answer)
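
Taking the argmax of the start and end logits independently, as the example above does, can produce an end position that precedes the start. A common alternative is to score start and end candidates jointly and keep the best valid pair. The sketch below is one way to do that on plain Python lists; the function name `best_span` and the defaults `max_answer_len=30` and `top_k=20` are assumptions, not values used by the model's authors.

```python
def best_span(start_logits, end_logits, max_answer_len=30, top_k=20):
    """Return (start, end, score) for the highest-scoring valid span, or None.

    A span is valid when start <= end and it covers at most max_answer_len tokens.
    """
    # Keep only the top_k most likely start and end positions.
    starts = sorted(range(len(start_logits)), key=lambda i: start_logits[i], reverse=True)[:top_k]
    ends = sorted(range(len(end_logits)), key=lambda i: end_logits[i], reverse=True)[:top_k]

    best = None
    for s in starts:
        for e in ends:
            if e < s or e - s + 1 > max_answer_len:
                continue  # skip reversed or overly long spans
            score = start_logits[s] + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best
```

With the usage example above, the inputs could be obtained as `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`. A fuller version would also exclude positions that belong to the question rather than the context.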

Example of answering

Context:

Альберт Эйнштейн разработал теорию относительности.

Question:

Кто разработал теорию относительности?

Answer:

альберт эинштеин

Context:

I had a red car.

Question:

What kind of car did I have?

Answer:

a red car

📄 License

This model is licensed under cc-by-nc-4.0.

Finetuned by LaciaStudio | LaciaAI

Property	Details
Pipeline Tag	document-question-answering
Tags	DocumentQA, QuestionAnswering, NLP, DeepLearning, Transformers, Multimodal, HuggingFace, ruBert, MachineLearning, DeepQA, AIForDocs, Docs, NeuralNetworks, torch, pytorch, large, text-generation-inference
Library Name	transformers
Metrics	accuracy, f1, recall, exact_match, precision
Base Model	ai-forever/ruBert-large
