GLM 4.1V 9B MLX 4bit
This is an MLX format model converted from THUDM/GLM-4.1V-9B-Thinking, supporting vision-language tasks.
Text-to-Image · Supports Multiple Languages · Open Source License: MIT · #Multimodal Vision-Language #Efficient Inference #Apple Chip Optimization
Downloads: 114
Release date: 7/17/2025
Model Overview
This model is converted from THUDM/GLM-4.1V-9B-Thinking and uses the MLX format, supporting vision-language understanding and generation tasks.
Model Features
MLX Format Support
The model has been converted to the MLX format, making it suitable for Apple silicon devices
4-bit Quantization
The weights are quantized to 4 bits to reduce memory usage (a rough memory estimate follows this feature list)
Vision-Language Capability
Supports image understanding and image-based text generation
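As a back-of-the-envelope check on the memory savings mentioned above: at 4 bits per weight, plus the per-group scale and bias that MLX stores when quantizing (assuming the default group size of 64), the roughly 9B parameters come to about 5 GB of weights, before activations and the KV cache.
# Rough weight-memory estimate for 4-bit quantization (assumptions: ~9B
# parameters, MLX default group size of 64, one fp16 scale and one fp16 bias
# stored per group of 64 weights).
params = 9e9
bits_per_weight = 4 + 2 * 16 / 64   # 4-bit weights + per-group scale/bias overhead
weight_bytes = params * bits_per_weight / 8
print(f"~{weight_bytes / 1024**3:.1f} GiB for weights alone")  # ≈ 4.7 GiB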
Model Capabilities
Vision-Language Understanding
Image Description Generation
Visual Question Answering
Multimodal Inference
Use Cases
Content Generation
Image Description Generation
Generate detailed descriptions based on the input image
Intelligent Question Answering
Visual Question Answering
Answer questions about the image content
🚀 Rainnighttram/GLM-4.1V-9B-MLX-4bit
This project focuses on converting THUDM/GLM-4.1V-9B-Thinking to the MLX format. The converted model, Rainnighttram/GLM-4.1V-9B-MLX-4bit, gives users an option for running vision-language generation locally on Apple silicon.
🚀 Quick Start
Prerequisites
- Ensure you have `pip` installed on your system.
- The model conversion and usage rely on several Python packages, including `mlx-lm`, `mlx-vlm`, `mlx`, and `torchvision`.
Installation
pip install mlx-lm mlx-vlm mlx torchvision
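The configuration step below patches the installed mlx-vlm package, so it helps to know where pip put it. A minimal way to print the `models` directory in which the new model package must be created:
import os
import mlx_vlm

# The glm4v directory created in the Configuration step goes inside this path.
print(os.path.join(os.path.dirname(mlx_vlm.__file__), "models"))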
Configuration
- Create a directory for the model under the `models` directory of the installed `mlx_vlm` package (i.e. `mlx_vlm/models/glm4v`):
mkdir glm4v
cd glm4v
- Create essential model files:
nano __init__.py
# In file: mlx_vlm/models/glm4v/__init__.py
from .glm4v import Model, ModelConfig
from .language import LanguageModel, TextConfig
from .vision import VisionModel, VisionConfig
# save and exit
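The directory name is not arbitrary: mlx-vlm generally picks the model package to import based on the `model_type` field in the converted checkpoint's `config.json`, so the `glm4v` directory should match that value. Assuming the converted weights are available locally (the path below is a placeholder), you can check it like this:
import json

# Placeholder path; point this at your local copy of the converted model.
with open("GLM-4.1V-9B-MLX-4bit/config.json") as f:
    config = json.load(f)

# This value is expected to be "glm4v" and should match the directory name above.
print(config["model_type"])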
nano language.py
# In file: language.py
import inspect
from dataclasses import dataclass
from typing import Any, Optional, Dict, List, Tuple
import mlx.core as mx
import mlx.nn as nn
from ..base import (
create_attention_mask,
scaled_dot_product_attention,
)
# Define the complete output class with all optional attributes the generator might check for.
@dataclass
class CausalLMOutput:
logits: mx.array
cross_attention_states: Optional[Tuple] = None
encoder_outputs: Optional[Tuple] = None
hidden_states: Optional[Tuple] = None
attentions: Optional[Tuple] = None
@dataclass
class TextConfig:
model_type: str
hidden_size: int
num_hidden_layers: int
intermediate_size: int
num_attention_heads: int
attention_bias: bool
rms_norm_eps: float
vocab_size: int
num_key_value_heads: int
partial_rotary_factor: float
rope_theta: float
rope_traditional: bool = True
max_position_embeddings: int = 65536
@classmethod
def from_dict(cls, params):
return cls(
**{
k: v
for k, v in params.items()
if k in inspect.signature(cls).parameters
}
)
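# Fused SwiGLU feed-forward block: gate_up_proj packs the gate and up
# projections into a single quantized matmul, split into two halves in __call__.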
class Glm4MLP(nn.Module):
def __init__(self, args: TextConfig):
super().__init__()
self.gate_up_proj = nn.QuantizedLinear(
args.hidden_size, 2 * args.intermediate_size, bias=False
)
self.down_proj = nn.QuantizedLinear(
args.intermediate_size, args.hidden_size, bias=False
)
def __call__(self, x) -> mx.array:
x = self.gate_up_proj(x)
gate, up_states = mx.split(x, 2, axis=-1)
return self.down_proj(nn.silu(gate) * up_states)
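# Grouped-query attention: n_kv_heads key/value heads are shared across the
# query heads, and RoPE is applied only to the first
# head_dim * partial_rotary_factor dimensions of each head.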
class Glm4Attention(nn.Module):
def __init__(self, args: TextConfig):
super().__init__()
self.head_dim = args.hidden_size // args.num_attention_heads
self.n_heads = args.num_attention_heads
self.n_kv_heads = args.num_key_value_heads
self.scale = self.head_dim ** -0.5
bias = args.attention_bias
q_out = args.num_attention_heads * self.head_dim
kv_out = args.num_key_value_heads * self.head_dim
self.q_proj = nn.QuantizedLinear(args.hidden_size, q_out, bias=bias)
self.k_proj = nn.QuantizedLinear(args.hidden_size, kv_out, bias=bias)
self.v_proj = nn.QuantizedLinear(args.hidden_size, kv_out, bias=bias)
self.o_proj = nn.QuantizedLinear(q_out, args.hidden_size, bias=False)
self.rope = nn.RoPE(
dims=int(self.head_dim * args.partial_rotary_factor),
base=args.rope_theta,
traditional=args.rope_traditional,
)
def __call__(
self, x: mx.array, mask: Optional[mx.array] = None, cache: Optional[Any] = None
) -> mx.array:
B, L, D = x.shape
queries, keys, values = self.q_proj(x), self.k_proj(x), self.v_proj(x)
queries = queries.reshape(B, L, self.n_heads, -1).transpose(0, 2, 1, 3)
keys = keys.reshape(B, L, self.n_kv_heads, -1).transpose(0, 2, 1, 3)
values = values.reshape(B, L, self.n_kv_heads, -1).transpose(0, 2, 1, 3)
if cache is not None:
queries = self.rope(queries, offset=cache.offset)
keys = self.rope(keys, offset=cache.offset)
keys, values = cache.update_and_fetch(keys, values)
else:
queries = self.rope(queries)
keys = self.rope(keys)
output = scaled_dot_product_attention(
queries, keys, values, cache=cache, scale=self.scale, mask=mask
)
output = output.transpose(0, 2, 1, 3).reshape(B, L, -1)
return self.o_proj(output)
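# Decoder block with extra post-sublayer RMSNorms: attention and MLP outputs
# are normalized before being added back to the residual stream.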
class Glm4DecoderLayer(nn.Module):
def __init__(self, args: TextConfig):
super().__init__()
self.self_attn = Glm4Attention(args=args)
self.mlp = Glm4MLP(args)
self.input_layernorm = nn.RMSNorm(args.hidden_size, eps=args.rms_norm_eps)
self.post_attention_layernorm = nn.RMSNorm(
args.hidden_size, eps=args.rms_norm_eps
)
self.post_self_attn_layernorm = nn.RMSNorm(
args.hidden_size, eps=args.rms_norm_eps
)
self.post_mlp_layernorm = nn.RMSNorm(args.hidden_size, eps=args.rms_norm_eps)
def __call__(
self, x: mx.array, mask: Optional[mx.array] = None, cache: Optional[Any] = None
) -> mx.array:
x = x + self.post_self_attn_layernorm(
self.self_attn(self.input_layernorm(x), mask, cache)
)
residual = x
x = (
self.post_mlp_layernorm(self.mlp(self.post_attention_layernorm(x)))
+ residual
)
return x
class Glm4Model(nn.Module):
def __init__(self, args: TextConfig):
super().__init__()
self.embed_tokens = nn.QuantizedEmbedding(args.vocab_size, args.hidden_size)
self.layers = [
Glm4DecoderLayer(args=args) for _ in range(args.num_hidden_layers)
]
self.norm = nn.RMSNorm(args.hidden_size, eps=args.rms_norm_eps)
def __call__(
self,
inputs: mx.array,
mask: Optional[mx.array] = None,
cache: Optional[Any] = None,
inputs_embeds: Optional[mx.array] = None,
):
if inputs_embeds is not None:
h = inputs_embeds
else:
h = self.embed_tokens(inputs)
if mask is None:
mask = create_attention_mask(h, cache)
if cache is None:
cache = [None] * len(self.layers)
for layer, c in zip(self.layers, cache):
h = layer(h, mask, cache=c)
return self.norm(h)
class LanguageModel(nn.Module):
def __init__(self, config: TextConfig):
super().__init__()
self.config = config
self.model_type = config.model_type
self.model = Glm4Model(config)
self.lm_head = nn.QuantizedLinear(config.hidden_size, config.vocab_size, bias=False)
def __call__(
self,
inputs: mx.array,
inputs_embeds: Optional[mx.array] = None,
mask: Optional[mx.array] = None,
cache=None,
):
out = self.model(inputs, inputs_embeds=inputs_embeds, mask=mask, cache=cache)
out = self.lm_head(out)
# Wrap the raw logits in CausalLMOutput so callers always receive a
# consistent output object.
return CausalLMOutput(logits=out)
@property
def layers(self):
return self.model.layers
# save and exit
nano vision.py
# In file: vision.py
import inspect
from dataclasses import dataclass
from typing import Any, Optional, Dict, List, Tuple
import mlx.core as mx
import mlx.nn as nn
from ..base import (
create_attention_mask,
scaled_dot_product_attention,
)
# Define the complete output class with all optional attributes the generator might check for.
@dataclass
class CausalLMOutput:
logits: mx.array
cross_attention_states: Optional[Tuple] = None
encoder_outputs: Optional[Tuple] = None
hidden_states: Optional[Tuple] = None
attentions: Optional[Tuple] = None
@dataclass
class TextConfig:
model_type: str
hidden_size: int
num_hidden_layers: int
intermediate_size: int
num_attention_heads: int
attention_bias: bool
rms_norm_eps: float
vocab_size: int
num_key_value_heads: int
partial_rotary_factor: float
rope_theta: float
rope_traditional: bool = True
max_position_embeddings: int = 65536
@classmethod
def from_dict(cls, params):
return cls(
**{
k: v
for k, v in params.items()
if k in inspect.signature(cls).parameters
}
)
class Glm4MLP(nn.Module):
def __init__(self, args: TextConfig):
super().__init__()
self.gate_up_proj = nn.QuantizedLinear(
args.hidden_size, 2 * args.intermediate_size, bias=False
)
self.down_proj = nn.QuantizedLinear(
args.intermediate_size, args.hidden_size, bias=False
)
def __call__(self, x) -> mx.array:
x = self.gate_up_proj(x)
gate, up_states = mx.split(x, 2, axis=-1)
return self.down_proj(nn.silu(gate) * up_states)
class Glm4Attention(nn.Module):
def __init__(self, args: TextConfig):
super().__init__()
self.head_dim = args.hidden_size // args.num_attention_heads
self.n_heads = args.num_attention_heads
self.n_kv_heads = args.num_key_value_heads
self.scale = self.head_dim ** -0.5
bias = args.attention_bias
q_out = args.num_attention_heads * self.head_dim
kv_out = args.num_key_value_heads * self.head_dim
self.q_proj = nn.QuantizedLinear(args.hidden_size, q_out, bias=bias)
self.k_proj = nn.QuantizedLinear(args.hidden_size, kv_out, bias=bias)
self.v_proj = nn.QuantizedLinear(args.hidden_size, kv_out, bias=bias)
self.o_proj = nn.QuantizedLinear(q_out, args.hidden_size, bias=False)
self.rope = nn.RoPE(
dims=int(self.head_dim * args.partial_rotary_factor),
base=args.rope_theta,
traditional=args.rope_traditional,
)
def __call__(
self, x: mx.array, mask: Optional[mx.array] = None, cache: Optional[Any] = None
) -> mx.array:
B, L, D = x.shape
queries, keys, values = self.q_proj(x), self.k_proj(x), self.v_proj(x)
queries = queries.reshape(B, L, self.n_heads, -1).transpose(0, 2, 1, 3)
keys = keys.reshape(B, L, self.n_kv_heads, -1).transpose(0, 2, 1, 3)
values = values.reshape(B, L, self.n_kv_heads, -1).transpose(0, 2, 1, 3)
if cache is not None:
queries = self.rope(queries, offset=cache.offset)
keys = self.rope(keys, offset=cache.offset)
keys, values = cache.update_and_fetch(keys, values)
else:
queries = self.rope(queries)
keys = self.rope(keys)
output = scaled_dot_product_attention(
queries, keys, values, cache=cache, scale=self.scale, mask=mask
)
output = output.transpose(0, 2, 1, 3).reshape(B, L, -1)
return self.o_proj(output)
class Glm4DecoderLayer(nn.Module):
def __init__(self, args: TextConfig):
super().__init__()
self.self_attn = Glm4Attention(args=args)
self.mlp = Glm4MLP(args)
self.input_layernorm = nn.RMSNorm(args.hidden_size, eps=args.rms_norm_eps)
self.post_attention_layernorm = nn.RMSNorm(
args.hidden_size, eps=args.rms_norm_eps
)
self.post_self_attn_layernorm = nn.RMSNorm(
args.hidden_size, eps=args.rms_norm_eps
)
self.post_mlp_layernorm = nn.RMSNorm(args.hidden_size, eps=args.rms_norm_eps)
def __call__(
self, x: mx.array, mask: Optional[mx.array] = None, cache: Optional[Any] = None
) -> mx.array:
x = x + self.post_self_attn_layernorm(
self.self_attn(self.input_layernorm(x), mask, cache)
)
residual = x
x = (
self.post_mlp_layernorm(self.mlp(self.post_attention_layernorm(x)))
+ residual
)
return x
class Glm4Model(nn.Module):
def __init__(self, args: TextConfig):
super().__init__()
self.embed_tokens = nn.QuantizedEmbedding(args.vocab_size, args.hidden_size)
self.layers = [
Glm4DecoderLayer(args=args) for _ in range(args.num_hidden_layers)
]
self.norm = nn.RMSNorm(args.hidden_size, eps=args.rms_norm_eps)
def __call__(
self,
inputs: mx.array,
mask: Optional[mx.array] = None,
cache: Optional[Any] = None,
inputs_embeds: Optional[mx.array] = None,
):
if inputs_embeds is not None:
h = inputs_embeds
else:
h = self.embed_tokens(inputs)
if mask is None:
mask = create_attention_mask(h, cache)
if cache is None:
cache = [None] * len(self.layers)
for layer, c in zip(self.layers, cache):
h = layer(h, mask, cache=c)
return self.norm(h)
class LanguageModel(nn.Module):
def __init__(self, config: TextConfig):
super().__init__()
self.config = config
self.model_type = config.model_type
self.model = Glm4Model(config)
self.lm_head = nn.QuantizedLinear(config.hidden_size, config.vocab_size, bias=False)
def __call__(
self,
inputs: mx.array,
inputs_embeds: Optional[mx.array] = None,
mask: Optional[mx.array] = None,
cache=None,
):
out = self.model(inputs, inputs_embeds=inputs_embeds, mask=mask, cache=cache)
out = self.lm_head(out)
# Wrap the raw logits in CausalLMOutput so callers always receive a
# consistent output object.
return CausalLMOutput(logits=out)
@property
def layers(self):
return self.model.layers
# save and exit
nano glm4v.py
# In file: glm4v.py
import inspect
from dataclasses import dataclass
from typing import Any, Optional, Dict, List, Tuple
import mlx.core as mx
import mlx.nn as nn
from ..base import (
create_attention_mask,
scaled_dot_product_attention,
)
# Define the complete output class with all optional attributes the generator might check for.
@dataclass
class CausalLMOutput:
logits: mx.array
cross_attention_states: Optional[Tuple] = None
encoder_outputs: Optional[Tuple] = None
hidden_states: Optional[Tuple] = None
attentions: Optional[Tuple] = None
@dataclass
class TextConfig:
model_type: str
hidden_size: int
num_hidden_layers: int
intermediate_size: int
num_attention_heads: int
attention_bias: bool
rms_norm_eps: float
vocab_size: int
num_key_value_heads: int
partial_rotary_factor: float
rope_theta: float
rope_traditional: bool = True
max_position_embeddings: int = 65536
@classmethod
def from_dict(cls, params):
return cls(
**{
k: v
for k, v in params.items()
if k in inspect.signature(cls).parameters
}
)
class Glm4MLP(nn.Module):
def __init__(self, args: TextConfig):
super().__init__()
self.gate_up_proj = nn.QuantizedLinear(
args.hidden_size, 2 * args.intermediate_size, bias=False
)
self.down_proj = nn.QuantizedLinear(
args.intermediate_size, args.hidden_size, bias=False
)
def __call__(self, x) -> mx.array:
x = self.gate_up_proj(x)
gate, up_states = mx.split(x, 2, axis=-1)
return self.down_proj(nn.silu(gate) * up_states)
class Glm4Attention(nn.Module):
def __init__(self, args: TextConfig):
super().__init__()
self.head_dim = args.hidden_size // args.num_attention_heads
self.n_heads = args.num_attention_heads
self.n_kv_heads = args.num_key_value_heads
self.scale = self.head_dim ** -0.5
bias = args.attention_bias
q_out = args.num_attention_heads * self.head_dim
kv_out = args.num_key_value_heads * self.head_dim
self.q_proj = nn.QuantizedLinear(args.hidden_size, q_out, bias=bias)
self.k_proj = nn.QuantizedLinear(args.hidden_size, kv_out, bias=bias)
self.v_proj = nn.QuantizedLinear(args.hidden_size, kv_out, bias=bias)
self.o_proj = nn.QuantizedLinear(q_out, args.hidden_size, bias=False)
self.rope = nn.RoPE(
dims=int(self.head_dim * args.partial_rotary_factor),
base=args.rope_theta,
traditional=args.rope_traditional,
)
def __call__(
self, x: mx.array, mask: Optional[mx.array] = None, cache: Optional[Any] = None
) -> mx.array:
B, L, D = x.shape
queries, keys, values = self.q_proj(x), self.k_proj(x), self.v_proj(x)
queries = queries.reshape(B, L, self.n_heads, -1).transpose(0, 2, 1, 3)
keys = keys.reshape(B, L, self.n_kv_heads, -1).transpose(0, 2, 1, 3)
values = values.reshape(B, L, self.n_kv_heads, -1).transpose(0, 2, 1, 3)
if cache is not None:
queries = self.rope(queries, offset=cache.offset)
keys = self.rope(keys, offset=cache.offset)
keys, values = cache.update_and_fetch(keys, values)
else:
queries = self.rope(queries)
keys = self.rope(keys)
output = scaled_dot_product_attention(
queries, keys, values, cache=cache, scale=self.scale, mask=mask
)
output = output.transpose(0, 2, 1, 3).reshape(B, L, -1)
return self.o_proj(output)
class Glm4DecoderLayer(nn.Module):
def __init__(self, args: TextConfig):
super().__init__()
self.self_attn = Glm4Attention(args=args)
self.mlp = Glm4MLP(args)
self.input_layernorm = nn.RMSNorm(args.hidden_size, eps=args.rms_norm_eps)
self.post_attention_layernorm = nn.RMSNorm(
args.hidden_size, eps=args.rms_norm_eps
)
self.post_self_attn_layernorm = nn.RMSNorm(
args.hidden_size, eps=args.rms_norm_eps
)
self.post_mlp_layernorm = nn.RMSNorm(args.hidden_size, eps=args.rms_norm_eps)
def __call__(
self, x: mx.array, mask: Optional[mx.array] = None, cache: Optional[Any] = None
) -> mx.array:
x = x + self.post_self_attn_layernorm(
self.self_attn(self.input_layernorm(x), mask, cache)
)
residual = x
x = (
self.post_mlp_layernorm(self.mlp(self.post_attention_layernorm(x)))
+ residual
)
return x
class Glm4Model(nn.Module):
def __init__(self, args: TextConfig):
super().__init__()
self.embed_tokens = nn.QuantizedEmbedding(args.vocab_size, args.hidden_size)
self.layers = [
Glm4DecoderLayer(args=args) for _ in range(args.num_hidden_layers)
]
self.norm = nn.RMSNorm(args.hidden_size, eps=args.rms_norm_eps)
def __call__(
self,
inputs: mx.array,
mask: Optional[mx.array] = None,
cache: Optional[Any] = None,
inputs_embeds: Optional[mx.array] = None,
):
if inputs_embeds is not None:
h = inputs_embeds
else:
h = self.embed_tokens(inputs)
if mask is None:
mask = create_attention_mask(h, cache)
if cache is None:
cache = [None] * len(self.layers)
for layer, c in zip(self.layers, cache):
h = layer(h, mask, cache=c)
return self.norm(h)
class LanguageModel(nn.Module):
def __init__(self, config: TextConfig):
super().__init__()
self.config = config
self.model_type = config.model_type
self.model = Glm4Model(config)
self.lm_head = nn.QuantizedLinear(config.hidden_size, config.vocab_size, bias=False)
def __call__(
self,
inputs: mx.array,
inputs_embeds: Optional[mx.array] = None,
mask: Optional[mx.array] = None,
cache=None,
):
out = self.model(inputs, inputs_embeds=inputs_embeds, mask=mask, cache=cache)
out = self.lm_head(out)
# Wrap the raw logits in CausalLMOutput so callers always receive a
# consistent output object.
return CausalLMOutput(logits=out)
@property
def layers(self):
return self.model.layers
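With the package files in place, you can attempt inference through the mlx-vlm Python API. The `load`/`generate` helpers exist in recent mlx-vlm releases, but their exact signatures and prompt handling vary between versions, and (see the note below) loading this model may still need manual fixes, so treat the following as a sketch rather than a verified recipe; the image path and prompt are placeholders:
from mlx_vlm import load, generate

# Hugging Face repo id, or a local path to the converted 4-bit weights.
model_path = "Rainnighttram/GLM-4.1V-9B-MLX-4bit"
model, processor = load(model_path)

# Image description; swap the prompt for a question to do visual QA.
# Depending on the mlx-vlm version, the prompt may first need to be formatted
# with the model's chat template (e.g. via mlx_vlm.prompt_utils).
output = generate(
    model,
    processor,
    prompt="Describe this image in detail.",
    image="example.jpg",   # placeholder path to a local image
    max_tokens=256,
)
print(output)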
⚠️ Important Note
- This is not an official repository for the model, so there is no official support.
- Loading the model requires manual adjustment of the MLX-VLM package.
- The conversion and model-loading workflow is still rough and may run into issues.
📄 License
This project is licensed under the MIT License.