BnTQA - mBart open-source model - question answering over structured Bengali table data, free of charge

Bntqa Mbart

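Developed by vaishali

BnTQA-mBart is a table question answering model for Bengali, a low-resource language, built on the mBART architecture. It is designed specifically for answering questions over structured Bengali tables.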

Question answering system

PyTorch

Other open-source license: MIT #Bengali table question answering #low-resource language processing #multilingual transfer learning

Downloads: 17

Release date: 9/27/2024

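Model overview

This model specializes in table question answering for low-resource Indic languages, Bengali in particular. It can extract information from a structured table or generate a table as the answer.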

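Model features

Low-resource language support

Designed specifically for low-resource Indic languages such as Bengali, filling a gap in table question answering for these languages.

Fully automatic data generation

Uses a fully automatic, large-scale table question answering data generation process to address the shortage of annotated data for low-resource languages.

Cross-lingual transfer

Supports zero-shot cross-lingual transfer, so it can be adapted to tasks in related languages.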

Model capabilities

Table question answering

Structured data querying

Mathematical reasoning

Cross-lingual transfer

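Use cases

Business intelligence

Financial statement analysis

Extract specific data items from Bengali financial statements

Education

Question answering over learning materials

Answer questions based on the tables in Bengali teaching materials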

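🚀 Low-resource table question answering model (BnTQA - mBart)

This project provides a table question answering system for low-resource languages, Bengali in particular. It uses a Transformer-based model to answer questions about tables of structured information.

🚀 Quick start

To use this model, follow the steps below.

💻 Usage example

Basic usage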

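from typing import Dict, List

import pandas as pd
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, MBartForConditionalGeneration

# use a GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = MBartForConditionalGeneration.from_pretrained("vaishali/BnTQA-mBart").to(device)
# assumption: the tokenizer is loaded from the same checkpoint as the model
tokenizer = AutoTokenizer.from_pretrained("vaishali/BnTQA-mBart", src_lang="bn_IN", tgt_lang="bn_IN")
# id of the Bengali language-code token
forced_bos_id = forced_bos_token_id = tokenizer.convert_tokens_to_ids("bn_IN")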


# linearize table
def process_header(headers: List):
  return "<কলাম> " + " | ".join(headers)

def process_row(row: List, row_index: int):
  # ASCII digit -> Bengali digit; any other character (e.g. "-") is passed through unchanged
  en2bnDigits = {'0': '০',  '1': '১', '2': '২', '3': '৩', '4': '৪', '5': '৫', '6': '৬', '7': '৭', '8': '৮', '9': '৯', '.': '.'}
  row_cell_values = []
  for cell_value in row:
      if isinstance(cell_value, (int, float)):
          # numeric cells are written with Bengali digits
          cell_value = "".join(en2bnDigits.get(c, c) for c in str(cell_value))
      # every cell must be a string before joining
      row_cell_values.append(str(cell_value))
  row_str = " | ".join(row_cell_values)
  # the row marker also carries its 1-based index in Bengali digits
  bn_row_index = "".join(en2bnDigits[c] for c in str(row_index))
  return "<রো " + bn_row_index + "> " + row_str

def process_table(table_content: Dict):
  table_str = process_header(table_content["header"]) + " "
  for i, row_example in enumerate(table_content["rows"]):
      table_str += process_row(row_example, row_index=i + 1) + " "
  return table_str.strip()

# load the dataset
banglatableQA = load_dataset("vaishali/banglaTabQA")

for sample in banglatableQA['train']:
  question = sample['question']
  input_table = pd.read_json(sample['table'], orient='split')
  answer = pd.read_json(sample['answer'], orient='split')

  # create the input sequence: question + linearized input table
  # (the first column of the stored input table is dropped)
  table_content = {"header": list(input_table.columns)[1:], "rows": [list(row.values)[1:] for _, row in input_table.iterrows()]}
  linearized_inp_table = process_table(table_content)
  # target sequence: linearized answer table; column names are used as-is here
  # (the project's `translate_column` helper is not part of this snippet)
  linearized_output_table = process_table({"header": [str(col) for col in answer.columns],
                             "rows": [list(row.values) for _, row in answer.iterrows()]})
  source = question + " " + linearized_inp_table
  target = linearized_output_table
  inputs = tokenizer(source,
                     return_tensors="pt",
                     padding="max_length",
                     truncation="longest_first",
                     max_length=1024,
                     add_special_tokens=True)

  # tokenize the target with `text_target` (replaces the deprecated as_target_tokenizer context)
  labels = tokenizer(text_target=target,
             return_tensors="pt",
             padding="max_length",
             truncation="longest_first",
             max_length=1024,
             add_special_tokens=True).input_ids

  # inference: keep the inputs on the same device as the model
  out = model.generate(inputs["input_ids"].to(model.device),
                       attention_mask=inputs["attention_mask"].to(model.device),
                       num_beams=5, return_dict_in_generate=True,
                       output_scores=True, max_length=1024)
  # decode the predicted linearized answer table
  prediction = tokenizer.batch_decode(out.sequences, skip_special_tokens=True)[0]
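
The target sequences in the example use the same linearized layout as the input tables: a `<কলাম>` header segment followed by `<রো N>` row segments, with cells separated by ` | `. Assuming the decoded model output follows that layout, a minimal sketch for turning it back into a header and rows could look like the following. The names `parse_linearized_table` and `to_ascii_digits` are hypothetical helpers introduced here for illustration and are not part of the project.

```python
import re
from typing import List, Tuple

HEADER_TAG = "<কলাম>"
# a row marker such as "<রো ১২>", with the row index written in Bengali digits
ROW_TAG = re.compile(r"\s*<রো [০-৯]+>\s*")
# translation table from Bengali digits to ASCII digits
BN_TO_ASCII = str.maketrans("০১২৩৪৫৬৭৮৯", "0123456789")


def parse_linearized_table(text: str) -> Tuple[List[str], List[List[str]]]:
    """Split a linearized table string into (header, rows)."""
    parts = ROW_TAG.split(text.strip())
    header_part = parts[0].strip()
    if header_part.startswith(HEADER_TAG):
        header_part = header_part[len(HEADER_TAG):]
    header = [h.strip() for h in header_part.split("|")]
    # every remaining part is one row; skip empty fragments
    rows = [[c.strip() for c in part.split("|")] for part in parts[1:] if part.strip()]
    return header, rows


def to_ascii_digits(value: str) -> str:
    """Replace Bengali digits with ASCII digits, leaving other characters untouched."""
    return value.translate(BN_TO_ASCII)


if __name__ == "__main__":
    header, rows = parse_linearized_table("<কলাম> নাম | শহর <রো ১> রহিম | ঢাকা <রো ২> করিম | খুলনা")
    print(header)
    print(rows)
```

The parsed header and rows can be passed to `pd.DataFrame(rows, columns=header)` when a table object is needed. Cells that themselves contain `|` are not handled by this sketch.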

📚 Detailed documentation

Model information

Attribute	Details
Model type	Table question answering model for low-resource languages
Training data	vaishali/banglaTabQA dataset

BibTeX citation

@inproceedings{pal-etal-2024-table,
    title = "Table Question Answering for Low-resourced {I}ndic Languages",
    author = "Pal, Vaishali  and
      Kanoulas, Evangelos  and
      Yates, Andrew  and
      de Rijke, Maarten",
    editor = "Al-Onaizan, Yaser  and
      Bansal, Mohit  and
      Chen, Yun-Nung",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.5",
    pages = "75--92",
    abstract = "TableQA is the task of answering questions over tables of structured information, returning individual cells or tables as output. TableQA research has focused primarily on high-resource languages, leaving medium- and low-resource languages with little progress due to scarcity of annotated data and neural models. We address this gap by introducing a fully automatic large-scale tableQA data generation process for low-resource languages with limited budget. We incorporate our data generation method on two Indic languages, Bengali and Hindi, which have no tableQA datasets or models. TableQA models trained on our large-scale datasets outperform state-of-the-art LLMs. We further study the trained models on different aspects, including mathematical reasoning capabilities and zero-shot cross-lingual transfer. Our work is the first on low-resource tableQA focusing on scalable data generation and evaluation procedures. Our proposed data generation method can be applied to any low-resource language with a web presence. We release datasets, models, and code (https://github.com/kolk/Low-Resource-TableQA-Indic-languages).",
}