🚀 Latvian BERT base model (cased)
A pretrained BERT model for Latvian, trained with the masked language modeling and next sentence prediction objectives. It can be fine-tuned for various natural language understanding tasks and used to compute contextual embeddings.
🚀 Quick Start
A BERT model pretrained on Latvian language data using the masked language modeling and next sentence prediction objectives.
It was introduced in the paper cited below and first released via a GitHub repository.
This Hugging Face repository contains an improved version of LVBERT.
This model is case-sensitive. It is primarily intended to be fine-tuned on downstream natural language understanding tasks such as text classification, named entity recognition, and question answering.
However, the model can also be used as is to compute contextual embeddings for tasks such as text similarity, clustering, and semantic search.
✨ Features
- Pretrained on a diverse set of Latvian language corpora.
- Case-sensitive and suitable for fine-tuning on various NLU tasks.
- Can be used to compute contextual embeddings for multiple applications.
📦 Installation
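The model can be used with the Hugging Face Transformers library. A minimal setup, assuming a recent Python environment with PyTorch as the backend:

```bash
pip install transformers torch
```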
💻 Usage Examples
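A minimal sketch of loading the model for masked language modeling and for extracting contextual embeddings. The repository ID `AiLab-IMCS-UL/lvbert` is an assumption; substitute the actual ID of this repository.

```python
from transformers import AutoModel, AutoTokenizer, pipeline

# Assumed repository ID; replace with the actual ID of this model.
model_id = "AiLab-IMCS-UL/lvbert"

# Masked language modeling with the fill-mask pipeline.
fill_mask = pipeline("fill-mask", model=model_id)
print(fill_mask("Rīga ir Latvijas [MASK]."))

# Contextual embeddings for text similarity, clustering, or semantic search.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("Šis ir piemēra teikums.", return_tensors="pt")
outputs = model(**inputs)

# Mean-pool the last hidden states into a single sentence vector.
sentence_embedding = outputs.last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```

Mean pooling is only one way to derive a sentence vector; the `[CLS]` token embedding can be used instead, depending on the downstream task.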
📚 Documentation
Training data
LVBERT was pretrained on texts from the Balanced Corpus of Modern Latvian, Latvian Wikipedia, Corpus of News Portal Articles, as well as Corpus of News Portal Comments; around 500M tokens in total.
Tokenization
A SentencePiece model was trained on the training dataset, producing a vocabulary of 32,000 tokens.
It was then converted to the WordPiece format used by BERT.
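As an illustration, the converted WordPiece tokenizer can be inspected through the standard Transformers API. This sketch again assumes the repository ID `AiLab-IMCS-UL/lvbert`:

```python
from transformers import AutoTokenizer

# Assumed repository ID; replace with the actual ID of this model.
tokenizer = AutoTokenizer.from_pretrained("AiLab-IMCS-UL/lvbert")

print(tokenizer.vocab_size)                         # expected to be about 32,000
print(tokenizer.tokenize("valodas tehnoloģijas"))   # WordPiece subword pieces
```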
Pretraining
We used the BERT-base configuration: 12 layers, 768 hidden units, 12 attention heads, a maximum sequence length of 512, a mini-batch size of 128, and a 32k token vocabulary.
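For reference, the stated architecture corresponds roughly to the following `BertConfig` sketch; hyperparameters not listed above are assumed to be the BERT-base defaults, and the mini-batch size is a training setting rather than part of the configuration.

```python
from transformers import BertConfig

# Sketch of the stated BERT-base configuration; unlisted values are defaults.
config = BertConfig(
    vocab_size=32000,             # SentencePiece-derived WordPiece vocabulary
    hidden_size=768,              # hidden units
    num_hidden_layers=12,         # transformer layers
    num_attention_heads=12,       # attention heads
    max_position_embeddings=512,  # maximum sequence length
)
print(config)
```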
🔧 Technical Details
The model is based on the BERT-base architecture and pretrained on Latvian language data with the masked language modeling and next sentence prediction objectives. The training data comes from multiple Latvian corpora, and tokenization relies on a SentencePiece model trained on that data and converted to the WordPiece format. Pretraining follows the BERT-base configuration described above.
📄 License
This project is licensed under the Apache-2.0 license.
📚 Citation
Please cite this paper if you use LVBERT:
```bibtex
@inproceedings{Znotins-Barzdins:2020:BalticHLT,
  author    = {Arturs Znotins and Guntis Barzdins},
  title     = {{LVBERT: Transformer-Based Model for Latvian Language Understanding}},
  booktitle = {Human Language Technologies - The Baltic Perspective},
  series    = {Frontiers in Artificial Intelligence and Applications},
  volume    = {328},
  publisher = {IOS Press},
  year      = {2020},
  pages     = {111-115},
  doi       = {10.3233/FAIA200610},
  url       = {http://ebooks.iospress.nl/volumearticle/55531}
}
```