# 🚀 Stockmark-2-VL-100B-beta
Stockmark-2-VL-100B-beta is a 100-billion-parameter Japanese-specialized visual language model. It supports Chain-of-Thought (CoT) reasoning for document reading comprehension. The model uses synthetic data from Qwen2.5-VL-72B and is provided under the Qwen license.
## 🚀 Quick Start

### Inference using 🤗 Transformers
Stockmark-2-VL-100B-beta is based on the LLaVA-OneVision architecture. Make sure you have `transformers>=4.45.0` installed:

```bash
pip install "transformers>=4.45.0" accelerate torchvision pillow
```

The following code shows how to use Stockmark-2-VL-100B-beta with pure `transformers`.
```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
from huggingface_hub import hf_hub_download

model_id = "stockmark/Stockmark-2-VL-100B-beta"

# Load the model in bfloat16 and shard it across the available GPUs.
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# System prompt: "You are a sincere and capable Japanese assistant."
# Question: "In the survey responses from employees under 30, which 'usage frequency' had the highest share?"
conversation = [
    {
        "role": "system",
        "content": "あなたは誠実で優秀な日本人のアシスタントです。"
    },
    {
        "role": "user",
        "content": "<image>30歳未満の社員に対するアンケート回答結果で、最も割合が高かった「使用頻度」は何ですか?",
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Download the demo image bundled with the model repository.
img_path = hf_hub_download(repo_id=model_id, filename="assets/demo.png")
raw_image = Image.open(img_path)

inputs = processor(images=raw_image, text=prompt, return_tensors="pt").to("cuda").to(torch.bfloat16)
output_ids = model.generate(**inputs, max_new_tokens=255, do_sample=False)

# Keep only the newly generated tokens by stripping the prompt tokens.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
answer = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)[0].strip()
print(answer)
```
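Loading the full bfloat16 weights of a 100-billion-parameter model needs roughly 200 GB of GPU memory in aggregate. If that is not available, one option to try is 4-bit quantization with bitsandbytes via the standard `transformers` quantization API; the snippet below is only a sketch and has not been validated against this model.

```python
import torch
from transformers import (
    AutoProcessor,
    BitsAndBytesConfig,
    LlavaOnevisionForConditionalGeneration,
)

model_id = "stockmark/Stockmark-2-VL-100B-beta"

# NF4 4-bit quantization; requires the bitsandbytes package to be installed.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```

Quantization trades some accuracy for memory, so compare outputs against the bfloat16 model where possible.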
### Inference using vLLM

The following code demonstrates how to use Stockmark-2-VL-100B-beta with vLLM.
```python
import os

from PIL import Image
from transformers import AutoProcessor
from huggingface_hub import hf_hub_download
from vllm import LLM, SamplingParams

# vLLM workers must be spawned (not forked) once CUDA is initialized in the parent process.
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"


def main():
    model_id = "stockmark/Stockmark-2-VL-100B-beta"
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    # System prompt: "You are a sincere and capable Japanese assistant."
    # Question: "In the survey responses from employees under 30, which 'usage frequency' had the highest share?"
    message = [
        {
            "role": "system",
            "content": "あなたは誠実で優秀な日本人のアシスタントです。"
        },
        {
            "role": "user",
            "content": "<image>30歳未満の社員に対するアンケート回答結果で、最も割合が高かった「使用頻度」は何ですか?"
        }
    ]
    prompt = processor.apply_chat_template(message, add_generation_prompt=True)
    print(prompt)

    # Shard the model across two GPUs; adjust tensor_parallel_size to your hardware.
    llm = LLM(
        model=model_id,
        tensor_parallel_size=2,
        limit_mm_per_prompt={"image": 1},
        trust_remote_code=True,
        dtype="bfloat16",
    )

    # Download the demo image bundled with the model repository.
    img_path = hf_hub_download(repo_id=model_id, filename="assets/demo.png")
    image = Image.open(img_path)

    inputs = {
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    }
    sampling_params = SamplingParams(temperature=0, max_tokens=256)

    outputs = llm.generate(inputs, sampling_params=sampling_params)
    answer = outputs[0].outputs[0].text
    print(answer)


if __name__ == "__main__":
    main()
```
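Besides offline generation, vLLM can also expose the model through its OpenAI-compatible HTTP server. The command below is a sketch; flag syntax (in particular for multimodal limits) varies between vLLM versions, so check `vllm serve --help` for your installation.

```bash
# Sketch only: serve an OpenAI-compatible endpoint across two GPUs.
vllm serve stockmark/Stockmark-2-VL-100B-beta \
  --tensor-parallel-size 2 \
  --dtype bfloat16 \
  --trust-remote-code
```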
### Evaluation using llm-jp-eval-mm
If you want to evaluate Stockmark-2-VL-100B-beta using llm-jp-eval-mm, add the following code to llm-jp-eval-mm.
#### Model class
The following is the model class for Stockmark-2-VL-100B-beta in llm-jp-eval-mm. Place it in the `llm-jp-eval-mm/examples` directory.
```python
# -*- coding: utf-8 -*-
"""
@File        : stockmark_vl.py
@Description : The VLM model class for Stockmark-2-VL-100B-beta.
"""
import torch
from PIL import Image
from transformers import LlavaOnevisionForConditionalGeneration, AutoProcessor

from base_vlm import BaseVLM
from utils import GenerationConfig

DEFAULT_IMAGE_TOKEN = "<image>"


class VLM(BaseVLM):
    def __init__(self, model_id) -> None:
        self.model_id = model_id
        self.model = LlavaOnevisionForConditionalGeneration.from_pretrained(
            self.model_id,
            torch_dtype=torch.bfloat16,
            trust_remote_code=True,
            low_cpu_mem_usage=True,
            device_map="auto",
        )
        self.processor = AutoProcessor.from_pretrained(self.model_id)

    def generate(
        self,
        images: list[Image.Image],
        text: str,
        gen_kwargs: GenerationConfig = GenerationConfig(),
    ) -> str:
        # Prepend one <image> placeholder per input image to the question text.
        content = DEFAULT_IMAGE_TOKEN * len(images) + "\n" + text
        messages = [
            {
                "role": "system",
                "content": "あなたは誠実で優秀な日本人のアシスタントです。"
            },
            {
                "role": "user",
                "content": content,
            },
        ]
        prompt = self.processor.apply_chat_template(
            messages, add_generation_prompt=True
        )
        if len(images) == 0:
            images = None
        inputs = (
            self.processor(images=images, text=prompt, return_tensors="pt")
            .to("cuda")
            .to(torch.bfloat16)
        )
        output_ids = self.model.generate(**inputs, **gen_kwargs.__dict__)
        # Keep only the newly generated tokens by stripping the prompt tokens.
        generated_ids = [
            output_ids[len(input_ids):]
            for input_ids, output_ids in zip(inputs.input_ids, output_ids)
        ]
        answer = self.processor.batch_decode(
            generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
        )[0].strip()
        return answer
```
Make sure an entry for Stockmark-2-VL-100B-beta is included in `MODEL_ID_TO_CLASS_PATH` in `llm-jp-eval-mm/examples/model_table.py`.
```python
MODEL_ID_TO_CLASS_PATH = {
    "stockmark/Stockmark-2-VL-100B-beta": "stockmark_vl.VLM",
}
```
#### Dependency group

Use the following command to create a dependency group for Stockmark-2-VL-100B-beta in llm-jp-eval-mm.
```bash
uv add --group stockmark_vl "transformers>=4.49.0" "torch>=2.5.1" "torchvision>=0.20.1" "flash-attn>=2.7.3" "accelerate>=0.27.2" "sentencepiece>=0.2.0" "pillow>=10.4.0" "protobuf>=5.29.3"
```
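With the model class registered and the dependency group created, an evaluation run typically goes through llm-jp-eval-mm's example runner. The invocation below is only a sketch patterned on that project's `examples/sample.py`; the task ID, metric, and result directory are placeholders, and the exact flags may differ across llm-jp-eval-mm versions.

```bash
# Sketch only: adapt task_id, metrics, and result_dir to the benchmark you want to run.
uv sync --group stockmark_vl
uv run --group stockmark_vl python examples/sample.py \
  --model_id stockmark/Stockmark-2-VL-100B-beta \
  --task_id japanese-heron-bench \
  --metrics heron-bench \
  --judge_model gpt-4o-2024-11-20 \
  --result_dir result
```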
## ✨ Features

- Specialized for Japanese: Tailored for Japanese document reading comprehension tasks.
- Chain-of-Thought reasoning: Supports CoT reasoning for better understanding and answering of complex questions.
- Based on LLaVA-OneVision: Utilizes a well-known architecture for visual language processing.
## 📚 Documentation

### Model architecture

The architecture of Stockmark-2-VL-100B-beta follows the LLaVA-OneVision framework, combining a vision encoder, a projector, and a large language model decoder.
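One way to confirm which vision and language backbones the checkpoint combines (an illustrative sketch, assuming the configuration follows the standard `LlavaOnevisionConfig` layout with `vision_config` and `text_config` sub-configs) is to load only the config:

```python
from transformers import AutoConfig

# Loading the config is cheap; no model weights are downloaded.
config = AutoConfig.from_pretrained(
    "stockmark/Stockmark-2-VL-100B-beta", trust_remote_code=True
)
print(type(config).__name__)            # top-level config class
print(config.vision_config.model_type)  # vision encoder backbone
print(config.text_config.model_type)    # language model backbone
```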
### Evaluation

#### Japanese document reading comprehension

We evaluated the model on three benchmarks:

- JDocQA: 1,175 questions. Evaluated with llm-jp-eval-mm using an LLM-as-a-judge score, with gpt-4o-2024-11-20 as the judge model.
- BusinessSlideVQA: 220 questions that assess comprehension of complex Japanese business slide images. Scored by LLM-as-a-judge, with gpt-4o-2024-11-20 as the judge model.
- JChartQA: 100 questions sampled from ChartQA-val and translated into Japanese.

#### Japanese general domain VQA

We selected three common benchmarks, evaluated using llm-jp-eval-mm with the default generation parameters and gpt-4o-2024-11-20 as the judge model.
## 🔧 Technical Details

The model uses synthetic training data generated with Qwen2.5-VL-72B and is provided under the Qwen license.
## ⚠️ Risks and Limitations

As a beta release, this model has not been fully calibrated to meet social norms, ethical standards, and legal regulations. In addition, because it is a visual reasoning model, it may ignore formatting requirements given in the prompt and include its CoT reasoning in the output.
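If your application needs only a short final answer, you may have to post-process the generated text yourself. The helper below is a hypothetical sketch: it assumes the final answer appears in the last paragraph of the output, which may not hold for every prompt, so validate the heuristic on your own data.

```python
def extract_final_answer(generated_text: str) -> str:
    """Hypothetical heuristic: treat the last non-empty paragraph as the final answer."""
    paragraphs = [p.strip() for p in generated_text.split("\n\n") if p.strip()]
    return paragraphs[-1] if paragraphs else generated_text.strip()

# Example: answer = extract_final_answer(answer)
```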
## 📄 License

Qwen license

## Developed by

Stockmark Inc.