JPharmatron-7B-base: an open-source large language model and Japanese-English bilingual tool for pharmaceutical applications and research

Jpharmatron 7B Base

Developed by EQUES

JPharmatron-7B-base is a 7-billion-parameter Japanese and English large language model designed specifically for pharmaceutical applications and research.

Large language model

Transformers

Supports Multiple Languages #Pharmaceutical domain #Japanese-English bilingual #Continual pretraining

Release Time : 4/1/2025

Model Overview

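This model is based on the Qwen2.5-7B architecture and was continually pretrained on 2 billion tokens from Japanese datasets, specializing it for natural language processing tasks in the pharmaceutical domain.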

Model Features

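Domain specialization

Designed specifically for pharmaceutical applications and research, with domain-specific optimization.

Multilingual support

Supports Japanese and English, making it suitable for multilingual pharmaceutical research.

Continual pretraining

Built on Qwen2.5-7B and continually pretrained on 2 billion Japanese pharmaceutical-domain tokens.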

Model Capabilities

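Pharmaceutical text understanding

Cross-lingual terminology normalization

Pharmaceutical knowledge question answering

Pharmaceutical document analysis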

Use Cases

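Pharmaceutical research

Pharmacist licensing exam question answering

Question answering based on the content of Japan's national pharmacist licensing examination

Achieved strong results on the YakugakuQA benchmark

Cross-lingual terminology normalization

Handles normalization of drug synonyms and terminology between Japanese and English

Showed competitive performance on the NayoseQA benchmark

Statement consistency verification

Evaluates consistency reasoning between paired statements

Outperformed some commercial models on the SogoCheck task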
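
One way to score a multiple-choice question-answering setup such as the pharmacist-exam use case is plain accuracy over predicted and reference labels. The sketch below assumes each answer can be reduced to a single choice label; the pair format and the names `normalize_choice` and `accuracy` are hypothetical and are not taken from the benchmarks' own evaluation code.

```python
from typing import Iterable, Tuple


def normalize_choice(text: str) -> str:
    """Reduce a raw answer to a comparable label: first token, punctuation stripped, lowercased."""
    tokens = text.strip().split()
    return tokens[0].strip(".):").lower() if tokens else ""


def accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (prediction, reference) pairs whose normalized labels match."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(normalize_choice(p) == normalize_choice(r) for p, r in pairs)
    return hits / len(pairs)


# Illustrative (prediction, reference) pairs, not real benchmark data.
results = [("2", "2"), ("3.", "3"), ("1", "4")]
print(f"accuracy = {accuracy(results):.2f}")
```

Normalizing before comparison keeps trivial formatting differences, such as a trailing period, from being counted as errors.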

🚀 JPharmatron-7B-base

JPharmatron-7B-base is a 7-billion-parameter large language model specialized for pharmaceutical applications and research.

✨ Key Features

JPharmatron-7B-base is based on Qwen2.5-7B and was continually pretrained on 2 billion tokens from Japanese datasets.

📚 Documentation

Model Details

Developed by: EQUES Inc.
Funded by: GENIAC Project
Model type: Causal decoder-only
Language(s) (NLP): Japanese, English
License: CC-BY-SA-4.0

Model Sources

Repository: https://github.com/EQUES-Inc/pharma-LLM-eval
Paper: A Japanese Language Model and Three New Evaluation Benchmarks for Pharmaceutical NLP

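Usage

This model has not undergone any post-training, including instruction fine-tuning. Using it directly for downstream tasks is therefore not recommended. It has also not been validated for medical use or other high-risk applications.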
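
Because the model has no instruction tuning, it continues text rather than following instructions. A minimal sketch of one way to inspect its behavior is few-shot completion prompting with Hugging Face Transformers. The Hub identifier `EQUES/JPharmatron-7B-base`, the `Question:`/`Answer:` template, and the helper names are assumptions for illustration and do not come from the model's documentation.

```python
from typing import List, Tuple

# Assumed Hub identifier; it is not stated on this page, so verify it before use.
MODEL_ID = "EQUES/JPharmatron-7B-base"


def build_few_shot_prompt(examples: List[Tuple[str, str]], question: str) -> str:
    """Format worked Q/A pairs followed by an open question for text completion."""
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


def complete(prompt: str, max_new_tokens: int = 64) -> str:
    """Generate a continuation with Transformers (downloads the model weights)."""
    # Imported lazily so the prompt helper works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


examples = [
    ("What is the main action of acetaminophen?", "It reduces fever and relieves pain."),
]
prompt = build_few_shot_prompt(examples, "What is the main action of warfarin?")
print(prompt)
```

Calling `complete(prompt)` would then return the model's continuation of the final `Answer:` line. Ending the prompt on an open answer slot is a common way to steer a base model toward the desired output format without relying on instruction following.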

Citation

BibTeX:

@misc{sukeda_japanese_2025,
  title     = {A {Japanese} {Language} {Model} and {Three} {New} {Evaluation} {Benchmarks} for {Pharmaceutical} {NLP}},
  url       = {http://arxiv.org/abs/2505.16661},
  doi       = {10.48550/arXiv.2505.16661},
  abstract  = {We present a Japanese domain-specific language model for the pharmaceutical field, developed through continual pretraining on 2 billion Japanese pharmaceutical tokens and 8 billion English biomedical tokens. To enable rigorous evaluation, we introduce three new benchmarks: YakugakuQA, based on national pharmacist licensing exams; NayoseQA, which tests cross-lingual synonym and terminology normalization; and SogoCheck, a novel task designed to assess consistency reasoning between paired statements. We evaluate our model against both open-source medical LLMs and commercial models, including GPT-4o. Results show that our domain-specific model outperforms existing open models and achieves competitive performance with commercial ones, particularly on terminology-heavy and knowledge-based tasks. Interestingly, even GPT-4o performs poorly on SogoCheck, suggesting that cross-sentence consistency reasoning remains an open challenge. Our benchmark suite offers a broader diagnostic lens for pharmaceutical NLP, covering factual recall, lexical variation, and logical consistency. This work demonstrates the feasibility of building practical, secure, and cost-effective language models for Japanese domain-specific applications, and provides reusable evaluation resources for future research in pharmaceutical and healthcare NLP. Our model, codes, and datasets are released at https://github.com/EQUES-Inc/pharma-LLM-eval.},
  urldate   = {2025-05-30},
  publisher = {arXiv},
  author    = {Sukeda, Issey and Fujii, Takuro and Buma, Kosei and Sasaki, Shunsuke and Ono, Shinnosuke},
  month     = may,
  year      = {2025},
  note      = {arXiv:2505.16661 [cs]},
  annote    = {Comment: 15 pages, 9 tables, 5 figures}
}