CharLLaMa-35M
Developed by inkoziev
CharLLaMa-35M is a miniature language model based on the LLaMa architecture with character-level tokenization, intended for experimental scenarios where BPE tokenization underperforms.
Downloads: 61
Released: 8/31/2023
Model Overview
This model was developed specifically for experiments with Russian poetry and pre-trained on a corpus rich in poetic texts; it has 35,913,600 parameters. It is suited to tasks such as generative spell checking, text classification, text transcription, and spelling error detection.
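A minimal loading-and-generation sketch with the Hugging Face transformers library is shown below. The repository id `inkoziev/charllama-35M` is inferred from the author's handle and the model name, so verify it on the Hub before use; the sampling parameters are illustrative, not the author's recommendation.

```python
# Minimal sketch: load CharLLaMa-35M via transformers and sample a continuation.
# The repo id is an assumption based on the author's handle; check it on the Hub.
MODEL_ID = "inkoziev/charllama-35M"

def generate(prompt: str, max_new_tokens: int = 80) -> str:
    # Imports are local so the sketch reads fine without transformers installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,  # character-level tokens, so budget generously
            do_sample=True,
            top_p=0.9,
            temperature=0.8,
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Because tokens are single characters, a given `max_new_tokens` budget yields noticeably shorter text than it would with a BPE-tokenized model.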
Model Features
Character-level tokenization
Utilizes character-level tokenization, ideal for scenarios where BPE tokenization performs poorly, such as spell checking and text transcription.
Poetic text pre-training
Pre-trained on a large corpus of Russian poetic texts, making it well-suited for poetry-related tasks.
Lightweight model
With only 35,913,600 parameters, it is suitable for resource-constrained experimental scenarios.
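The features above hinge on character-level tokenization, which can be illustrated in a few lines. This is a toy stand-in, not the model's actual tokenizer: each character maps to its own id, so the vocabulary is just the set of characters seen in the corpus.

```python
# Toy illustration of character-level tokenization (not the model's real
# tokenizer): every character is its own token, so the vocabulary is tiny.
def char_tokenize(text: str) -> list[str]:
    return list(text)

corpus = "мороз и солнце"
vocab = sorted(set(corpus))                      # character-sized vocabulary
char_to_id = {ch: i for i, ch in enumerate(vocab)}

ids = [char_to_id[ch] for ch in char_tokenize(corpus)]
assert len(ids) == len(corpus)                   # exactly one token per character
```

A vocabulary this small keeps the embedding table light, which is part of why the whole model fits in ~35.9M parameters.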
Model Capabilities
Text generation
Text classification
Spell checking
Text transcription
Spelling error detection
Use Cases
Text processing
Generative spell checker
Leverages character-level tokenization to detect and correct spelling errors.
Text classification
Can replace a TfidfVectorizer(analyzer='char') baseline in scenarios where character-level n-gram features already perform well.
Text transcription
Suitable for text transcription tasks requiring character-level processing.
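To make the spell-checking use case concrete, here is a toy character-trigram scorer standing in for the character-level LM: trigrams never seen in reference text are treated as likely misspellings. All names (`trigrams`, `suspicious`) and the reference word list are illustrative, not part of the model's API.

```python
from collections import Counter

# Toy character-trigram scorer standing in for a character-level LM:
# trigrams absent from reference text are flagged as likely misspellings.
def trigrams(word: str) -> list[str]:
    padded = f"^{word}$"                 # mark word boundaries
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

reference = ["apple", "apply", "ample", "maple"]
seen = Counter(t for w in reference for t in trigrams(w))

def suspicious(word: str) -> list[str]:
    # Trigrams of `word` that never occur in the reference corpus.
    return [t for t in trigrams(word) if t not in seen]

print(suspicious("aplpe"))               # flags trigrams around the transposition
```

A trained character-level LM plays the same role with real probabilities instead of raw counts, scoring how plausible each character is given its context.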
Poetry generation
Russian poetry generation
Generates Russian poetry, drawing on its pre-training corpus of poetic texts.