# 🚀 Stockmark-2-VL-100B-beta
Stockmark-2-VL-100B-beta is a 100-billion-parameter Japanese-specialized visual language model. It supports Chain-of-Thought (CoT) reasoning for document reading comprehension. The model uses synthetic data from Qwen2.5-VL-72B and is provided under the Qwen license.
## 🚀 Quick Start

### Inference using 🤗 Transformers
Stockmark-2-VL-100B-beta is based on the LLaVA-OneVision architecture. Make sure you have `transformers>=4.45.0` installed:

```bash
pip install "transformers>=4.45.0" accelerate torchvision pillow
```

The following code shows how to use Stockmark-2-VL-100B-beta with pure `transformers`.
```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
from huggingface_hub import hf_hub_download

model_id = "stockmark/Stockmark-2-VL-100B-beta"

# Load the model in bfloat16 and shard it across the available GPUs.
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# System prompt: "You are a sincere and capable Japanese assistant."
# Question: "In the survey responses from employees under 30, which 'usage frequency' had the highest share?"
conversation = [
    {
        "role": "system",
        "content": "あなたは誠実で優秀な日本人のアシスタントです。"
    },
    {
        "role": "user",
        "content": "<image>30歳未満の社員に対するアンケート回答結果で、最も割合が高かった「使用頻度」は何ですか?",
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Download the demo image bundled with the model repository.
img_path = hf_hub_download(repo_id=model_id, filename="assets/demo.png")
raw_image = Image.open(img_path)

inputs = processor(images=raw_image, text=prompt, return_tensors="pt").to("cuda").to(torch.bfloat16)
output_ids = model.generate(**inputs, max_new_tokens=255, do_sample=False)

# Keep only the newly generated tokens by stripping the prompt tokens.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
answer = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)[0].strip()
print(answer)
```
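Loading the full bfloat16 weights of a 100-billion-parameter model needs roughly 200 GB of GPU memory in aggregate. If that is not available, one option to try is 4-bit quantization with bitsandbytes via the standard `transformers` quantization API; the snippet below is only a sketch and has not been validated against this model.

```python
import torch
from transformers import (
    AutoProcessor,
    BitsAndBytesConfig,
    LlavaOnevisionForConditionalGeneration,
)

model_id = "stockmark/Stockmark-2-VL-100B-beta"

# NF4 4-bit quantization; requires the bitsandbytes package to be installed.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```

Quantization trades some accuracy for memory, so compare outputs against the bfloat16 model where possible.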
### Inference using vLLM

The following code demonstrates how to use Stockmark-2-VL-100B-beta with vLLM.
```python
import os

from PIL import Image
from transformers import AutoProcessor
from huggingface_hub import hf_hub_download
from vllm import LLM, SamplingParams

# vLLM workers must be spawned (not forked) once CUDA is initialized in the parent process.
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"


def main():
    model_id = "stockmark/Stockmark-2-VL-100B-beta"
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    # System prompt: "You are a sincere and capable Japanese assistant."
    # Question: "In the survey responses from employees under 30, which 'usage frequency' had the highest share?"
    message = [
        {
            "role": "system",
            "content": "あなたは誠実で優秀な日本人のアシスタントです。"
        },
        {
            "role": "user",
            "content": "<image>30歳未満の社員に対するアンケート回答結果で、最も割合が高かった「使用頻度」は何ですか?"
        }
    ]
    prompt = processor.apply_chat_template(message, add_generation_prompt=True)
    print(prompt)

    # Shard the model across two GPUs; adjust tensor_parallel_size to your hardware.
    llm = LLM(
        model=model_id,
        tensor_parallel_size=2,
        limit_mm_per_prompt={"image": 1},
        trust_remote_code=True,
        dtype="bfloat16",
    )

    # Download the demo image bundled with the model repository.
    img_path = hf_hub_download(repo_id=model_id, filename="assets/demo.png")
    image = Image.open(img_path)

    inputs = {
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    }
    sampling_params = SamplingParams(temperature=0, max_tokens=256)

    outputs = llm.generate(inputs, sampling_params=sampling_params)
    answer = outputs[0].outputs[0].text
    print(answer)


if __name__ == "__main__":
    main()
```
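Besides offline generation, vLLM can also expose the model through its OpenAI-compatible HTTP server. The command below is a sketch; flag syntax (in particular for multimodal limits) varies between vLLM versions, so check `vllm serve --help` for your installation.

```bash
# Sketch only: serve an OpenAI-compatible endpoint across two GPUs.
vllm serve stockmark/Stockmark-2-VL-100B-beta \
  --tensor-parallel-size 2 \
  --dtype bfloat16 \
  --trust-remote-code
```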
### Evaluation using llm-jp-eval-mm
If you want to evaluate Stockmark-2-VL-100B-beta using llm-jp-eval-mm, add the following code to llm-jp-eval-mm.
#### Model class
The following is the model class for Stockmark-2-VL-100B-beta in llm-jp-eval-mm. Place it in the `llm-jp-eval-mm/examples` directory.
```python
# -*- coding: utf-8 -*-
"""
@File        : stockmark_vl.py
@Description : The VLM model class for Stockmark-2-VL-100B-beta.
"""
import torch
from PIL import Image
from transformers import LlavaOnevisionForConditionalGeneration, AutoProcessor

from base_vlm import BaseVLM
from utils import GenerationConfig

DEFAULT_IMAGE_TOKEN = "<image>"


class VLM(BaseVLM):
    def __init__(self, model_id) -> None:
        self.model_id = model_id
        self.model = LlavaOnevisionForConditionalGeneration.from_pretrained(
            self.model_id,
            torch_dtype=torch.bfloat16,
            trust_remote_code=True,
            low_cpu_mem_usage=True,
            device_map="auto",
        )
        self.processor = AutoProcessor.from_pretrained(self.model_id)

    def generate(
        self,
        images: list[Image.Image],
        text: str,
        gen_kwargs: GenerationConfig = GenerationConfig(),
    ) -> str:
        # Prepend one <image> placeholder per input image to the question text.
        content = DEFAULT_IMAGE_TOKEN * len(images) + "\n" + text
        messages = [
            {
                "role": "system",
                "content": "あなたは誠実で優秀な日本人のアシスタントです。"
            },
            {
                "role": "user",
                "content": content,
            },
        ]
        prompt = self.processor.apply_chat_template(
            messages, add_generation_prompt=True
        )
        if len(images) == 0:
            images = None
        inputs = (
            self.processor(images=images, text=prompt, return_tensors="pt")
            .to("cuda")
            .to(torch.bfloat16)
        )
        output_ids = self.model.generate(**inputs, **gen_kwargs.__dict__)
        # Keep only the newly generated tokens by stripping the prompt tokens.
        generated_ids = [
            output_ids[len(input_ids):]
            for input_ids, output_ids in zip(inputs.input_ids, output_ids)
        ]
        answer = self.processor.batch_decode(
            generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
        )[0].strip()
        return answer
```
Make sure an entry for Stockmark-2-VL-100B-beta is included in `MODEL_ID_TO_CLASS_PATH` in `llm-jp-eval-mm/examples/model_table.py`.
```python
MODEL_ID_TO_CLASS_PATH = {
    "stockmark/Stockmark-2-VL-100B-beta": "stockmark_vl.VLM",
}
```
#### Dependency group

Use the following command to create a dependency group for Stockmark-2-VL-100B-beta in llm-jp-eval-mm.
```bash
uv add --group stockmark_vl "transformers>=4.49.0" "torch>=2.5.1" "torchvision>=0.20.1" "flash-attn>=2.7.3" "accelerate>=0.27.2" "sentencepiece>=0.2.0" "pillow>=10.4.0" "protobuf>=5.29.3"
```
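With the model class registered and the dependency group created, an evaluation run typically goes through llm-jp-eval-mm's example runner. The invocation below is only a sketch patterned on that project's `examples/sample.py`; the task ID, metric, and result directory are placeholders, and the exact flags may differ across llm-jp-eval-mm versions.

```bash
# Sketch only: adapt task_id, metrics, and result_dir to the benchmark you want to run.
uv sync --group stockmark_vl
uv run --group stockmark_vl python examples/sample.py \
  --model_id stockmark/Stockmark-2-VL-100B-beta \
  --task_id japanese-heron-bench \
  --metrics heron-bench \
  --judge_model gpt-4o-2024-11-20 \
  --result_dir result
```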
## ✨ Features

- Specialized for Japanese: Tailored for Japanese document reading comprehension tasks.
- Chain-of-Thought reasoning: Supports CoT reasoning for better understanding and answering of complex questions.
- Based on LLaVA-OneVision: Utilizes a well-known architecture for visual language processing.
## 📚 Documentation

### Model architecture

The architecture of Stockmark-2-VL-100B-beta follows the LLaVA-OneVision framework, combining a vision encoder, a projector, and a large language model decoder.
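One way to confirm which vision and language backbones the checkpoint combines (an illustrative sketch, assuming the configuration follows the standard `LlavaOnevisionConfig` layout with `vision_config` and `text_config` sub-configs) is to load only the config:

```python
from transformers import AutoConfig

# Loading the config is cheap; no model weights are downloaded.
config = AutoConfig.from_pretrained(
    "stockmark/Stockmark-2-VL-100B-beta", trust_remote_code=True
)
print(type(config).__name__)            # top-level config class
print(config.vision_config.model_type)  # vision encoder backbone
print(config.text_config.model_type)    # language model backbone
```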
### Evaluation

#### Japanese document reading comprehension

We evaluated the model on three benchmarks:

- JDocQA: 1,175 questions. Evaluated with llm-jp-eval-mm using an LLM-as-a-judge score, with gpt-4o-2024-11-20 as the judge model.
- BusinessSlideVQA: 220 questions that assess comprehension of complex Japanese business slide images. Scored by LLM-as-a-judge, with gpt-4o-2024-11-20 as the judge model.
- JChartQA: 100 questions sampled from ChartQA-val and translated into Japanese.

#### Japanese general domain VQA

We selected three common benchmarks, evaluated using llm-jp-eval-mm with the default generation parameters and gpt-4o-2024-11-20 as the judge model.
## 🔧 Technical Details

The model uses synthetic training data generated with Qwen2.5-VL-72B and is provided under the Qwen license.
## ⚠️ Risks and Limitations

As a beta release, this model has not been fully calibrated to meet social norms, ethical standards, and legal regulations. In addition, because it is a visual reasoning model, it may ignore formatting requirements given in the prompt and include its CoT reasoning in the output.
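If your application needs only a short final answer, you may have to post-process the generated text yourself. The helper below is a hypothetical sketch: it assumes the final answer appears in the last paragraph of the output, which may not hold for every prompt, so validate the heuristic on your own data.

```python
def extract_final_answer(generated_text: str) -> str:
    """Hypothetical heuristic: treat the last non-empty paragraph as the final answer."""
    paragraphs = [p.strip() for p in generated_text.split("\n\n") if p.strip()]
    return paragraphs[-1] if paragraphs else generated_text.strip()

# Example: answer = extract_final_answer(answer)
```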
## 📄 License

Qwen license

## Developed by

Stockmark Inc.