Aria-sequential_mlp-bnb_nf4
This project provides a BitsAndBytes NF4 quantized model based on Aria-sequential_mlp for image-text-to-text tasks.
Quick Start
The Aria-sequential_mlp-bnb_nf4 model is a BitsAndBytes NF4 quantized version of Aria-sequential_mlp. It requires about 15.5 GB of VRAM and runs on an RTX 3090. It can also run on an RTX 4060 Ti 16 GB, but that is not really practical without device_map="auto". Currently, the model is not sharded into 5 GB files because sharding seems to cause problems when loading serialized BNB models; this may make it impossible to load the model in free-tier Colab.
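For the 16 GB case, the snippet below is one way to let Accelerate spill layers that do not fit onto CPU RAM. It is an illustrative sketch, not part of this card: the max_memory budgets are assumptions and should be tuned to your hardware.
import torch
from transformers import AutoModelForCausalLM

# Illustrative only: cap GPU usage on a 16 GB card and allow overflow to CPU RAM.
# The memory budgets below are assumptions; adjust them for your setup.
model = AutoModelForCausalLM.from_pretrained(
    "leon-se/Aria-sequential_mlp-bnb_nf4",
    device_map="auto",
    max_memory={0: "15GiB", "cpu": "30GiB"},
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)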
Installation
pip install transformers==4.45.0 accelerate==0.34.1 sentencepiece==0.2.0 torchvision requests torch Pillow bitsandbytes
pip install flash-attn --no-build-isolation
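After installing, an optional quick check (not part of the original instructions) confirms the quantization stack imports cleanly:
python -c "import torch, transformers, bitsandbytes; print(torch.__version__, transformers.__version__, bitsandbytes.__version__)"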
Usage Examples
Basic Usage
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, BitsAndBytesConfig
torch.cuda.set_device(0)
model_id_or_path = "leon-se/Aria-sequential_mlp-bnb_nf4"
model = AutoModelForCausalLM.from_pretrained(model_id_or_path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id_or_path, trust_remote_code=True)
image_path = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png"
image = Image.open(requests.get(image_path, stream=True).raw)
messages = [
    {
        "role": "user",
        "content": [
            {"text": None, "type": "image"},
            {"text": "what is the image?", "type": "text"},
        ],
    }
]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.inference_mode(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
    output = model.generate(
        **inputs,
        max_new_tokens=500,
        stop_strings=["<|im_end|>"],
        tokenizer=processor.tokenizer,
        do_sample=True,
        temperature=0.9,
    )
output_ids = output[0][inputs["input_ids"].shape[1]:]
result = processor.decode(output_ids, skip_special_tokens=True)
print(result)
print(f'Max allocated memory: {torch.cuda.max_memory_allocated(device="cuda") / 1024 ** 3:.3f}GiB')
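On an RTX 3090 the reported peak should come out around the ~15.5 GB figure quoted in the Quick Start section.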
Advanced Usage
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "rhymes-ai/Aria-sequential_mlp"
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    llm_int8_enable_fp32_cpu_offload=True,
    llm_int8_skip_modules=["language_model.lm_head", "multi_modal_projector", "vision_tower"],
)
model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config, trust_remote_code=True)
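The card does not document how the uploaded checkpoint was produced, but a model quantized this way can be serialized with the standard transformers call. The path and shard size below are assumptions; the large shard size follows the Quick Start note about avoiding 5 GB shards.
# Assumption: save the freshly quantized weights to a hypothetical local folder.
model_nf4.save_pretrained(
    "./Aria-sequential_mlp-bnb_nf4",
    max_shard_size="30GB",  # keep the weights in a single file; 5 GB shards reportedly break loading
)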
License
This project is licensed under the Apache-2.0 license.
Documentation
Model Information
| Property | Details |
| --- | --- |
| Library Name | transformers |
| Base Model | rhymes-ai/Aria-sequential_mlp, rhymes-ai/Aria |
| Pipeline Tag | image-text-to-text |
| License | apache-2.0 |