Kaleidoscope_small_v1 Open-source Document Q&A Model - Supports Russian and English, Accurately Extracts Answers from Documents

Kaleidoscope Small V1

Developed by 2KKLabs

A document question-answering model fine-tuned from sberbank-ai/ruBert-base. It extracts answers from document contexts and supports Russian and English.

Question Answering System

Transformers

Release date: 2/21/2025

Model Overview

This model is specifically designed for document question-answering tasks, fine-tuned with custom JSON datasets, suitable for scenarios like customer support and document search.

Model Features

Multilingual support

Primarily optimized for Russian; English input is also accepted, but English support has not been tested

Context understanding

Processes long documents via sliding window tokenization to effectively capture contextual relationships

Efficient training

Uses mixed-precision training and the AdamW optimizer; fine-tuned for 20 epochs on a single RTX 3070

Model Capabilities

Document content comprehension

Question-answer extraction

Multilingual text processing

Long-context analysis

Use Cases

Customer support

Automated Q&A system

Automatically answers customer questions from product documentation

The example on this page extracts the answer 'Albert Einstein' from the sentence 'Albert Einstein developed the theory of relativity'

Document retrieval

Contract clause lookup

Quickly locates specific clauses in legal/contract documents

🚀 Document Question Answering Model - Kaleidoscope_small_v1

This model is a fine-tuned version of sberbank-ai/ruBert-base. It extracts answers from documents in response to user questions and suits tasks such as customer support and automated Q&A systems.

🚀 Quick Start

The model is a fine-tuned version of sberbank-ai/ruBert-base for document question answering. It extracts answers from a given document context and was fine-tuned on a custom JSON dataset of context, question, and answer triples.
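The card names the three dataset fields but does not publish the file layout. The sketch below assumes a top-level JSON list of objects with string fields; `parse_qa_triples` is a hypothetical helper, not the author's loader. It also checks that each answer appears verbatim in its context, which is an assumption that follows from the model being extractive.

```python
import json
from typing import Dict, List

REQUIRED_FIELDS = ("context", "question", "answer")


def parse_qa_triples(raw_json: str) -> List[Dict[str, str]]:
    """Parse a JSON array of {context, question, answer} records.

    Assumption: the file is a top-level list of objects. The model card
    only names the three fields, not the exact layout.
    """
    records = json.loads(raw_json)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON list of records")
    triples = []
    for i, rec in enumerate(records):
        missing = [f for f in REQUIRED_FIELDS if f not in rec]
        if missing:
            raise ValueError(f"record {i} is missing fields: {missing}")
        # An extractive model can only learn answers that occur in the context.
        if rec["answer"] not in rec["context"]:
            raise ValueError(f"record {i}: answer not found in context")
        triples.append({f: rec[f] for f in REQUIRED_FIELDS})
    return triples


# Illustrative record built from the English example on this page.
sample = json.dumps([
    {
        "context": "I had a red car.",
        "question": "What kind of car did I have?",
        "answer": "a red car",
    }
])

triples = parse_qa_triples(sample)
print(triples[0]["question"], "->", triples[0]["answer"])
```

To read a file instead of a string, pass the result of `open(path, encoding="utf-8").read()` to the same function.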

✨ Features

Objective: Extract answers from documents according to user questions.
Base Model: sberbank-ai/ruBert-base.
Dataset: A custom JSON file with fields: context, question, and answer.
Preprocessing: Concatenate the question and the document context as input to guide the model to focus on relevant parts.
Training Settings:
- Number of epochs: 20.
- Batch size: 4 per device.
- Warmup: 10% of total training steps.
- FP16 training: Enabled when CUDA is available.
- Hardware: Trained on a single RTX 3070.
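The settings above fix the shape of the training schedule once the dataset size is known. The sketch below works through that arithmetic. The dataset size (1,000 examples), the base learning rate, and the linear decay after warmup are assumptions chosen for illustration; the card states only the epoch count, batch size, and warmup fraction. In the Transformers `TrainingArguments`, a fraction like this is typically passed as `warmup_ratio`, though the card does not say whether the author's pipeline does so.

```python
from typing import Tuple


def schedule_steps(num_examples: int, batch_size: int = 4,
                   epochs: int = 20, warmup_fraction: float = 0.1) -> Tuple[int, int]:
    """Return (total_steps, warmup_steps) for a single-device run.

    Assumption: no gradient accumulation, so one batch is one optimizer step.
    """
    steps_per_epoch = -(-num_examples // batch_size)  # ceiling division
    total_steps = steps_per_epoch * epochs
    warmup_steps = round(total_steps * warmup_fraction)
    return total_steps, warmup_steps


def lr_at(step: int, base_lr: float, total_steps: int, warmup_steps: int) -> float:
    """Linear warmup from zero, then linear decay back to zero.

    The decay shape and base_lr are assumptions, not values from the card.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0, total_steps - step) / max(1, total_steps - warmup_steps)


# Hypothetical dataset of 1,000 examples.
total, warmup = schedule_steps(num_examples=1000)
print("total steps:", total, "warmup steps:", warmup)
for step in (0, warmup // 2, warmup, total):
    print(step, lr_at(step, base_lr=3e-5, total_steps=total, warmup_steps=warmup))
```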

📚 Documentation

The model was fine-tuned using the Transformers library with a custom training pipeline. Key aspects of the training process are as follows:

Custom Dataset: A loader reads a JSON file containing context, question, and answer triples.
Feature Preparation: The script tokenizes the document and question with a sliding window approach for long texts.
Training Process: Uses mixed-precision training and the AdamW optimizer.
Evaluation and Checkpointing: The training script evaluates model performance on a validation set, saves checkpoints, and uses early stopping based on validation loss.
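The sliding-window step can be sketched in plain Python. The card does not state the window size or overlap used in training, so the values below (384 tokens, borrowed from the usage example's `max_length`, and an overlap of 128) are assumptions, and both function names are hypothetical. With a Transformers tokenizer, comparable behavior is usually obtained by passing `return_overflowing_tokens=True` together with `stride`.

```python
from typing import List, Optional, Tuple


def sliding_windows(num_tokens: int, window_size: int, overlap: int) -> List[Tuple[int, int]]:
    """Return half-open [start, end) token ranges that cover the context.

    Consecutive windows share `overlap` tokens so that an answer lying on a
    window boundary still appears whole in at least one window.
    """
    if window_size <= 0 or not 0 <= overlap < window_size:
        raise ValueError("need window_size > 0 and 0 <= overlap < window_size")
    windows = []
    start = 0
    while True:
        end = min(start + window_size, num_tokens)
        windows.append((start, end))
        if end >= num_tokens:
            break
        start = end - overlap
    return windows


def label_for_window(window: Tuple[int, int],
                     answer_span: Tuple[int, int]) -> Optional[Tuple[int, int]]:
    """Map an inclusive answer span (full-context token indices) to positions
    relative to the window start, or None if the window does not contain the
    whole answer. Real features also shift these by the question and
    special tokens that precede the context."""
    w_start, w_end = window
    a_start, a_end = answer_span
    if a_start >= w_start and a_end < w_end:
        return a_start - w_start, a_end - w_start
    return None


# A hypothetical 1,000-token document with the answer at tokens 380..390.
windows = sliding_windows(num_tokens=1000, window_size=384, overlap=128)
for w in windows:
    print(w, label_for_window(w, (380, 390)))
```

In this example the answer straddles the end of the first window, so only the second window yields a training label; that is the case the overlap is meant to handle.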

This model is suitable for interactive document question answering tasks, such as customer support, document search, and automated Q&A systems. It targets Russian text primarily; English input is also accepted, but English support has not been tested.

💻 Usage Examples

Basic Usage

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

MODEL_ID = "LaciaStudio/Kaleidoscope_small_v1"

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL_ID)
model.to(device)
model.eval()  # inference only: disable dropout

file_path = input("Enter document path: ")
with open(file_path, "r", encoding="utf-8") as f:
    context = f.read()

while True:
    question = input("Enter question (or 'exit' to quit): ")
    if question.strip().lower() == "exit":
        break
    # Truncate only the context so the question is never cut off.
    # Note: context beyond the 384-token limit is ignored by this basic loop.
    inputs = tokenizer(question, context, return_tensors="pt",
                       truncation="only_second", max_length=384)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():  # gradients are not needed for inference
        outputs = model(**inputs)
    # Most likely start and end token positions of the answer span.
    start_index = int(torch.argmax(outputs.start_logits))
    end_index = int(torch.argmax(outputs.end_logits))
    if end_index < start_index:
        # The two argmaxes are independent and can describe an empty span.
        print("Answer: (no answer found)")
        continue
    answer_tokens = inputs["input_ids"][0][start_index:end_index + 1]
    answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)
    print("Answer:", answer)
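The basic loop takes the argmax of the start and end logits independently, so the chosen end can fall before the chosen start. A common alternative is to score candidate (start, end) pairs jointly and keep the best valid one. The sketch below does this on plain lists of floats; `best_span`, `max_answer_len`, and `top_k` are hypothetical names and defaults, not part of the model or the Transformers API.

```python
from typing import List, Tuple


def best_span(start_logits: List[float], end_logits: List[float],
              max_answer_len: int = 30, top_k: int = 20) -> Tuple[int, int, float]:
    """Return (start, end, score) for the highest-scoring valid span.

    A span is valid when start <= end and it is at most max_answer_len
    tokens long. Only the top_k start and end candidates are compared.
    """
    def top_indices(logits: List[float]) -> List[int]:
        return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]

    best = None
    for s in top_indices(start_logits):
        for e in top_indices(end_logits):
            if e < s or e - s + 1 > max_answer_len:
                continue  # skip reversed or overly long spans
            score = start_logits[s] + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    if best is None:
        raise ValueError("no valid span among the top candidates")
    return best


# Toy logits where the independent argmaxes (start=2, end=0) are reversed.
toy_start = [0.1, 0.2, 5.0, 0.3]
toy_end = [4.0, 0.1, 1.0, 3.0]
start, end, score = best_span(toy_start, toy_end)
print(start, end)
```

To use it with the loop above, convert the model outputs first, for example `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`.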

Example of Answering

Context:

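Альберт Эйнштейн разработал теорию относительности. (English: Albert Einstein developed the theory of relativity.)

Question:

Кто разработал теорию относительности? (English: Who developed the theory of relativity?)

Answer:

альберт эинштеин (English: Albert Einstein)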

Context:

I had a red car.

Question:

What kind of car did I have?

Answer:

a red car

📄 License

This model is licensed under cc-by-nc-4.0.

Finetuned by LaciaStudio | LaciaAI
