
LayoutLM Wikipedia Ja

Developed by jri-advtechlab
This is a LayoutLM model pre-trained on Japanese text, primarily used for token classification tasks in Japanese documents.
Downloads: 22
Release date: January 31, 2024

Model Overview

This model is a LayoutLM pre-trained on Japanese Wikipedia. It is intended mainly to be fine-tuned for token classification tasks, and it can also be used for masked language modeling.

Model Features

Japanese Text Processing
Pre-trained specifically for Japanese text, suitable for Japanese document processing tasks.
Layout-aware
Models both text content and layout information (e.g., bounding boxes), suitable for document understanding tasks.
BERT-based Architecture
Initialized from the cl-tohoku/bert-base-japanese-v2 model, inheriting BERT's strong language-understanding capabilities.
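Layout awareness means each token is paired with its bounding box on the page. LayoutLM expects these boxes normalized to a 0-1000 coordinate grid regardless of the page's pixel size. A minimal sketch of that convention (the helper name is our own, not part of this model's code):

```python
def normalize_bbox(bbox, page_width, page_height):
    """Scale an (x0, y0, x1, y1) pixel box to LayoutLM's 0-1000 grid."""
    x0, y0, x1, y1 = bbox
    return (
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    )

# e.g. a text box on a page rendered at 595x842 pixels
print(normalize_bbox((119, 84, 297, 105), 595, 842))  # (200, 99, 499, 124)
```

One such normalized box is passed per token alongside the token IDs, which is how the model combines textual and spatial signals.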

Model Capabilities

Token Classification
Masked Language Modeling
Document Layout Understanding

Use Cases

Document Information Extraction
Wikipedia Information Extraction
Extract structured information from Japanese Wikipedia pages
Achieved a macro F1 score of 55.1451 in the SHINRA 2022 shared task
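Macro F1, the metric reported above, is the unweighted mean of per-class F1 scores, so rare entity classes count as much as frequent ones. A small self-contained sketch of the metric on toy token-classification labels (not SHINRA data):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    f1s = []
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

# toy labels: two entity classes plus the "O" (outside) tag
y_true = ["PER", "PER", "LOC", "O", "O", "O"]
y_pred = ["PER", "LOC", "LOC", "O", "O", "PER"]
print(round(macro_f1(y_true, y_pred) * 100, 2))
```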