BnTQA-mBart Open-Source Model: Free Question Answering over Structured Bengali Tables


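Developed by vaishali

BnTQA-mBart is a table question answering model for Bengali, a low-resource language. It is built on the mBART architecture and answers questions over structured tables written in Bengali.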

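Question answering

PyTorch

License: MIT (listed as "other open-source license") #Bengali table QA #low-resource language processing #multilingual transfer learning

Downloads: 17

Released: September 27, 2024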

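Model Overview

The model targets table question answering for low-resource Indic languages, Bengali in particular. Given a structured table, it can extract the requested information or generate a table as the answer.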

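Model Features

Low-resource language support

Designed for Bengali and other low-resource Indic languages, which previously had no table QA datasets or models.

Fully automatic data generation

Trained on data from a fully automatic, large-scale table QA data generation process, which addresses the scarcity of annotated data for low-resource languages.

Cross-lingual transfer

Supports zero-shot cross-lingual transfer, so it can be applied to tasks in related languages.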

Model Capabilities

Table question answering

Structured data querying

Mathematical reasoning

Cross-lingual transfer

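Use Cases

Business intelligence

Financial statement analysis

Extract specific data items from financial statements written in Bengali.

Education

Question answering over study materials

Answer questions about the tables in Bengali textbooks.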

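🚀 BnTQA-mBart: Bengali Table Question Answering Model

BnTQA-mBart is a model for Bengali table question answering. It is based on Facebook's mBART-large-50 model and answers questions over Bengali tables, offering an effective solution for table QA research on low-resource languages.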

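🚀 Quick Start

The model can be used for Bengali table question answering: load the pretrained model and the dataset, then generate answers to questions about a table.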

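✨ Key Features

Low-resource language support: optimized for table question answering in Bengali and other low-resource languages.
Large-scale data generation: uses a fully automatic, large-scale table QA data generation process.
Strong performance: according to the accompanying paper, table QA models trained on the large-scale datasets outperform state-of-the-art LLMs.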

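📦 Installation

The original documentation gives no installation steps. Install the libraries used in the example (pandas, datasets, transformers, and their dependencies) by following each library's official instructions.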
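
Since no installation steps are given, a quick check that the libraries imported by the usage example are present can save a failed run. The sketch below is only an aid: the package list is an assumption read off the example's imports (plus `torch`, because the page tags the model as PyTorch), and the printed `pip install` hint is a generic suggestion rather than an instruction from the model authors.

```python
import importlib.util
from typing import Iterable, List

# Assumed requirements, inferred from the usage example's imports
REQUIRED_PACKAGES = ("pandas", "datasets", "transformers", "torch")


def missing_packages(names: Iterable[str] = REQUIRED_PACKAGES) -> List[str]:
    """Return the top-level packages from `names` that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing packages, try: pip install " + " ".join(missing))
else:
    print("All required packages are importable.")
```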

💻 Usage Example

Basic usage

from io import StringIO
from typing import Dict, List

import pandas as pd
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, MBartForConditionalGeneration

# Assumption: the tokenizer is loaded from the same checkpoint as the model
# (the original snippet referenced an undefined args.pretrained_model_name).
model_name = "vaishali/BnTQA-mBart"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = MBartForConditionalGeneration.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="bn_IN", tgt_lang="bn_IN")
# Token id of the Bengali language code, forced as the first generated token
forced_bos_id = tokenizer.convert_tokens_to_ids("bn_IN")

# Latin-to-Bengali digit map used for numeric cells and row indices
en2bnDigits = {'0': '০', '1': '১', '2': '২', '3': '৩', '4': '৪',
               '5': '৫', '6': '৬', '7': '৭', '8': '৮', '9': '৯', '.': '.'}


def convert_engDigit_to_bengali(text: str) -> str:
    # Characters without a mapping (for example "-") are kept unchanged
    return "".join(en2bnDigits.get(c, c) for c in text)


def translate_column(col) -> str:
    # Placeholder: the original snippet calls translate_column without defining it.
    # This identity version leaves the answer column names unchanged.
    return str(col)


# linearize table
def process_header(headers: List) -> str:
    return "<কলাম> " + " | ".join(str(h) for h in headers)


def process_row(row: List, row_index: int) -> str:
    row_cell_values = []
    for cell_value in row:
        if isinstance(cell_value, (int, float)):
            cell_value = convert_engDigit_to_bengali(str(cell_value))
        row_cell_values.append(str(cell_value))
    bn_row_index = convert_engDigit_to_bengali(str(row_index))
    return "<রো " + bn_row_index + "> " + " | ".join(row_cell_values)


def process_table(table_content: Dict) -> str:
    table_str = process_header(table_content["header"]) + " "
    for i, row_example in enumerate(table_content["rows"]):
        table_str += process_row(row_example, row_index=i + 1) + " "
    return table_str.strip()


# load the dataset
banglatableQA = load_dataset("vaishali/banglaTabQA")

for sample in banglatableQA["train"]:
    question = sample["question"]
    # Wrap the JSON strings in StringIO; recent pandas versions deprecate literal JSON input
    input_table = pd.read_json(StringIO(sample["table"]), orient="split")
    answer = pd.read_json(StringIO(sample["answer"]), orient="split")

    # create the input sequence: query + linearized input table
    table_content = {"header": list(input_table.columns)[1:],
                     "rows": [list(row.values)[1:] for _, row in input_table.iterrows()]}
    linearized_inp_table = process_table(table_content)
    linearized_output_table = process_table({
        "name": None,
        "header": [translate_column(col) for col in answer.columns],
        "rows": [list(row.values) for _, row in answer.iterrows()],
    })
    source = question + " " + linearized_inp_table
    target = linearized_output_table
    inputs = tokenizer(source,
                       return_tensors="pt",
                       padding="max_length",
                       truncation="longest_first",
                       max_length=1024,
                       add_special_tokens=True)

    # Tokenize the target; only needed for training or evaluation
    labels = tokenizer(text_target=target,
                       return_tensors="pt",
                       padding="max_length",
                       truncation="longest_first",
                       max_length=1024,
                       add_special_tokens=True).input_ids

    # inference
    with torch.no_grad():
        out = model.generate(inputs["input_ids"].to(device),
                             attention_mask=inputs["attention_mask"].to(device),
                             forced_bos_token_id=forced_bos_id,
                             num_beams=5, return_dict_in_generate=True,
                             output_scores=True, max_length=1024)
    # Decode the generated token ids into the linearized answer table
    prediction = tokenizer.batch_decode(out.sequences, skip_special_tokens=True)[0]
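
The linearization step in the example above can be tried on a toy table without downloading the model or the dataset. The following is a minimal sketch that mirrors the `<কলাম>` header marker and `<রো n>` row markers from that example. The helper names `to_bengali_digits` and `linearize`, and the toy city and temperature table, are illustrative assumptions and not part of the released code.

```python
from typing import List

# Latin-to-Bengali digit map (same mapping as in the usage example)
EN2BN_DIGITS = {str(i): d for i, d in enumerate("০১২৩৪৫৬৭৮৯")}


def to_bengali_digits(text: str) -> str:
    """Replace Latin digits with Bengali digits; other characters are kept."""
    return "".join(EN2BN_DIGITS.get(ch, ch) for ch in text)


def linearize(header: List[str], rows: List[list]) -> str:
    """Flatten a table into '<কলাম> h1 | h2 <রো ১> c1 | c2 ...'."""
    parts = ["<কলাম> " + " | ".join(header)]
    for i, row in enumerate(rows, start=1):
        cells = [
            to_bengali_digits(str(c)) if isinstance(c, (int, float)) else str(c)
            for c in row
        ]
        parts.append("<রো " + to_bengali_digits(str(i)) + "> " + " | ".join(cells))
    return " ".join(parts)


# Hypothetical toy table: city and temperature
toy_header = ["শহর", "তাপমাত্রা"]
toy_rows = [["ঢাকা", 31.5], ["খুলনা", 29]]
linearized = linearize(toy_header, toy_rows)
print(linearized)
# prints: <কলাম> শহর | তাপমাত্রা <রো ১> ঢাকা | ৩১.৫ <রো ২> খুলনা | ২৯
```

The model input is then the question followed by a space and this string, as in the example above.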

📚 Documentation

Model information

Property	Details
Model type	Conditional generation model based on mBART-large-50
Training data	vaishali/banglaTabQA dataset
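
Because this is a conditional generation model trained on linearized target tables, its answer comes back as a single string. The sketch below parses such a string into a header and rows, assuming the generated text follows the same `<কলাম>` / `<রো n>` layout as the training targets in the usage example. The function names and the sample string are hypothetical, and cells that themselves contain `|` are not handled.

```python
import re
from typing import Dict, List

# Bengali-to-Latin digit map, for turning numeric cells back into numbers
BN2EN_DIGITS = {d: str(i) for i, d in enumerate("০১২৩৪৫৬৭৮৯")}


def to_latin_digits(text: str) -> str:
    """Replace Bengali digits with Latin digits; other characters are kept."""
    return "".join(BN2EN_DIGITS.get(ch, ch) for ch in text)


def parse_linearized_table(text: str) -> Dict[str, List]:
    """Split '<কলাম> h1 | h2 <রো ১> c1 | c2 ...' into header and rows."""
    # Row markers look like '<রো ১>'; the chunk before the first one is the header
    chunks = re.split(r"<রো [০-৯]+>", text)
    header_part = chunks[0].replace("<কলাম>", "").strip()
    header = [h.strip() for h in header_part.split("|")] if header_part else []
    rows = [[cell.strip() for cell in chunk.split("|")] for chunk in chunks[1:]]
    return {"header": header, "rows": rows}


# Hypothetical model output
generated = "<কলাম> শহর | তাপমাত্রা <রো ১> ঢাকা | ৩১.৫ <রো ২> খুলনা | ২৯"
table = parse_linearized_table(generated)
print(table["header"])
print(table["rows"])
print(float(to_latin_digits(table["rows"][0][1])))
```

The resulting dictionary can be passed to `pandas.DataFrame(table["rows"], columns=table["header"])` if a DataFrame is preferred.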

Citation

@inproceedings{pal-etal-2024-table,
    title = "Table Question Answering for Low-resourced {I}ndic Languages",
    author = "Pal, Vaishali  and
      Kanoulas, Evangelos  and
      Yates, Andrew  and
      de Rijke, Maarten",
    editor = "Al-Onaizan, Yaser  and
      Bansal, Mohit  and
      Chen, Yun-Nung",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.5",
    pages = "75--92",
    abstract = "TableQA is the task of answering questions over tables of structured information, returning individual cells or tables as output. TableQA research has focused primarily on high-resource languages, leaving medium- and low-resource languages with little progress due to scarcity of annotated data and neural models. We address this gap by introducing a fully automatic large-scale tableQA data generation process for low-resource languages with limited budget. We incorporate our data generation method on two Indic languages, Bengali and Hindi, which have no tableQA datasets or models. TableQA models trained on our large-scale datasets outperform state-of-the-art LLMs. We further study the trained models on different aspects, including mathematical reasoning capabilities and zero-shot cross-lingual transfer. Our work is the first on low-resource tableQA focusing on scalable data generation and evaluation procedures. Our proposed data generation method can be applied to any low-resource language with a web presence. We release datasets, models, and code (https://github.com/kolk/Low-Resource-TableQA-Indic-languages).",
}