Document Question Answering Model - Kaleidoscope_large_v1
This model is a fine-tuned version of sberbank-ai/ruBert-large, specifically designed for document question answering. It extracts answers from a given document context and was fine-tuned on a custom JSON dataset of context, question, and answer triples.
Quick Start
Prerequisites
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Select the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("LaciaStudio/Kaleidoscope_large_v1")
model = AutoModelForQuestionAnswering.from_pretrained("LaciaStudio/Kaleidoscope_large_v1")
model.to(device)
Usage
# Load the document whose contents will serve as the QA context
file_path = input("Enter document path: ")
with open(file_path, "r", encoding="utf-8") as f:
    context = f.read()

while True:
    question = input("Enter question (or 'exit' to quit): ")
    if question.lower() == "exit":
        break
    # Encode the question and context as a single paired input
    inputs = tokenizer(question, context, return_tensors="pt", truncation=True, max_length=384)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
    # Take the most likely start and end token positions and decode the span
    start_index = torch.argmax(outputs.start_logits)
    end_index = torch.argmax(outputs.end_logits)
    answer_tokens = inputs["input_ids"][0][start_index:end_index + 1]
    answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)
    print("Answer:", answer)
Features
- Objective: Extract answers from documents based on user questions.
- Base Model: sberbank-ai/ruBert-large.
- Dataset: A custom JSON file with fields: context, question, and answer.
- Preprocessing: The input is formed by concatenating the question and the document context, guiding the model to focus on the relevant segments.
Installation
This model is based on the transformers library. You can install it using the following command:

pip install transformers
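The examples in this card also assume torch is installed, and the fine-tuning example further below additionally uses the datasets library:

pip install torch datasets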
Usage Examples
Basic Usage
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("LaciaStudio/Kaleidoscope_large_v1")
model = AutoModelForQuestionAnswering.from_pretrained("LaciaStudio/Kaleidoscope_large_v1")
model.to(device)

context = "I had a red car."
question = "What kind of car did I have?"

# Encode the question and context as a single paired input
inputs = tokenizer(question, context, return_tensors="pt", truncation=True, max_length=384)
inputs = {k: v.to(device) for k, v in inputs.items()}
with torch.no_grad():
    outputs = model(**inputs)

# Take the most likely start and end token positions and decode the span
start_index = torch.argmax(outputs.start_logits)
end_index = torch.argmax(outputs.end_logits)
answer_tokens = inputs["input_ids"][0][start_index:end_index + 1]
answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)
print("Answer:", answer)
Advanced Usage
Advanced usage involves fine-tuning the model on a custom dataset. The following steps show how to fine-tune the model:
- Prepare a custom JSON dataset with context, question, and answer triples.
- Use the transformers library to create a custom training pipeline.
- Set the training parameters such as the number of epochs, batch size, and warmup schedule.
import torch
from transformers import TrainingArguments, Trainer
from datasets import load_dataset

# 'your_custom_dataset.json' is a placeholder for your own JSON file of
# context, question, and answer triples.
dataset = load_dataset('json', data_files='your_custom_dataset.json')

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=20,
    per_device_train_batch_size=4,
    warmup_ratio=0.1,                # warm up over the first 10% of total steps
    fp16=torch.cuda.is_available(),  # mixed precision when a GPU is available
    logging_dir='./logs'
)

# Note: the raw JSON examples must first be converted into token-level
# start/end positions (see the preprocessing sketch below).
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train']
)
trainer.train()
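The Trainer expects token-level features rather than raw strings. Below is a minimal preprocessing sketch, continuing from the snippet above; it assumes a fast tokenizer and that each answer appears verbatim in its context. The actual training script is not published, so names like prepare_features are illustrative.

def prepare_features(example):
    # Tokenize the question/context pair with offset mapping so character
    # positions can be converted to token positions.
    encoding = tokenizer(
        example["question"],
        example["context"],
        truncation="only_second",
        max_length=384,
        return_offsets_mapping=True,
    )
    # Locate the answer span in the raw context (assumes a verbatim match).
    start_char = example["context"].find(example["answer"])
    end_char = start_char + len(example["answer"])

    # Map character offsets to token indices within the context segment.
    start_token, end_token = 0, 0
    for i, (start, end) in enumerate(encoding["offset_mapping"]):
        if encoding.sequence_ids()[i] != 1:
            continue  # skip the question and special tokens
        if start <= start_char < end:
            start_token = i
        if start < end_char <= end:
            end_token = i
    encoding["start_positions"] = start_token
    encoding["end_positions"] = end_token
    encoding.pop("offset_mapping")
    return encoding

tokenized = dataset["train"].map(prepare_features, remove_columns=dataset["train"].column_names)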
Documentation
Training Settings
- Number of epochs: 20.
- Batch size: 4 per device.
- Warmup: 10% of total training steps (warmup_ratio=0.1).
- FP16 training enabled (if CUDA is available).
- Hardware: Training was performed on a single RTX 3070.
Training Process
The model was fine-tuned using the Transformers library with a custom training pipeline. Key aspects of the training process include:
- Custom Dataset: A loader reads a JSON file containing context, question, and answer triples.
- Feature Preparation: The script tokenizes the document and question with a sliding-window approach to handle long texts (see the sliding-window sketch after this list).
- Training Process: Mixed-precision training with the AdamW optimizer.
- Evaluation and Checkpointing: The training script evaluates model performance on a validation set, saves checkpoints, and employs early stopping based on validation loss (see the Trainer sketch after this list).
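The sliding-window step can be illustrated with the Hugging Face fast-tokenizer API; the stride and max_length values here are illustrative, not the published training configuration.

# Split a long document into overlapping 384-token windows so an answer near
# a window boundary is still fully contained in at least one window.
windows = tokenizer(
    question,
    context,
    truncation="only_second",        # never truncate the question
    max_length=384,
    stride=128,                      # overlap between consecutive windows
    return_overflowing_tokens=True,
    padding="max_length",
    return_tensors="pt",
)
# Each row of windows["input_ids"] is one question+window pair; the model
# scores every window and the best-scoring span is taken as the answer.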
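The checkpointing and early-stopping behavior maps onto standard Trainer options. The sketch below uses illustrative values; train_features and validation_features are hypothetical tokenized splits, and the eval_strategy argument is named evaluation_strategy in older transformers releases.

from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

args = TrainingArguments(
    output_dir='./results',
    eval_strategy="epoch",           # evaluate on the validation set each epoch
    save_strategy="epoch",           # save a checkpoint each epoch
    load_best_model_at_end=True,     # restore the best checkpoint after training
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_features,        # hypothetical tokenized train split
    eval_dataset=validation_features,    # hypothetical tokenized validation split
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)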
Evaluation Metrics
- accuracy
- f1
- recall
- exact_match
- precision
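The exact evaluation script is not published; for reference, here is a minimal sketch of how exact match and token-level F1 are commonly computed for extractive QA.

from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    # 1.0 when the normalized strings are identical, else 0.0
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    # Token-overlap F1 between the predicted and reference answers
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)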
Technical Details
The model is based on the transformers library and uses sberbank-ai/ruBert-large as the base model. The custom training pipeline fine-tunes the model on a specific dataset for document question answering. The input preprocessing method of concatenating the question and the document context helps the model focus on the relevant information.
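The effect of this concatenation can be seen directly in the tokenizer output: for a BERT-style model, the question and context are joined into one sequence separated by special tokens.

encoded = tokenizer("What kind of car did I have?", "I had a red car.")
print(tokenizer.decode(encoded["input_ids"]))
# Prints the paired sequence: [CLS] <question tokens> [SEP] <context tokens> [SEP]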
License
This model is licensed under cc-by-nc-4.0.
Additional Information
| Property | Details |
|----------|---------|
| Model Type | Document Question Answering Model |
| Training Data | A custom JSON file with context, question, and answer triples |
| Base Model | sberbank-ai/ruBert-large |
| Library Name | transformers |
Important Note
This model is primarily focused on Russian texts, but it also accepts English-language inputs. Note, however, that its English support has not been tested.

Finetuned by LaciaStudio | LaciaAI