CLIP ViT-B/16 Test-Time Registers
A vision-language model based on the OpenCLIP-ViT-B-16 architecture. It introduces registers at test time to clean up the model's internal representations, which addresses the artifacts that otherwise appear in its feature maps.
Text-to-Image
Transformers
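
The description above does not say how the test-time registers work. The following is a toy sketch of one plausible reading, and it is an assumption rather than this model's actual implementation: a few neurons produce outlier activations on some patch tokens, and an extra "register" token is appended at inference so that those activations land there instead of on the patches. The function name `shift_to_register`, the helper `l2_norm`, the `neuron_idx` argument, and the toy numbers are all hypothetical.

```python
def shift_to_register(acts, neuron_idx):
    """Toy illustration (not the model's real code).

    acts: list of rows, one row of neuron activations per token.
    neuron_idx: indices of the neurons assumed to cause the artifacts.
    Returns a new list with one extra register row appended at the end.
    """
    num_neurons = len(acts[0])
    out = [list(row) for row in acts]   # copy so the input is left untouched
    register = [0.0] * num_neurons      # the appended register token
    for j in neuron_idx:
        # The register takes the largest activation of neuron j ...
        register[j] = max(row[j] for row in out)
        # ... and the original tokens no longer carry it.
        for row in out:
            row[j] = 0.0
    out.append(register)
    return out


def l2_norm(row):
    """Euclidean norm of one token's activation vector."""
    return sum(v * v for v in row) ** 0.5


# Four patch tokens, three neurons; token 2 has an outlier in neuron 1.
toy_acts = [
    [0.1, 0.2, 0.3],
    [0.2, 0.1, 0.1],
    [0.3, 9.0, 0.2],
    [0.1, 0.3, 0.2],
]

shifted = shift_to_register(toy_acts, neuron_idx=[1])
patch_norms_before = [l2_norm(r) for r in toy_acts]
patch_norms_after = [l2_norm(r) for r in shifted[:-1]]
```

In this toy setting the high-norm patch token disappears from `patch_norms_after`, while the last row of `shifted` holds the outlier. In a real model, which neurons to redirect and at which layers would have to be determined empirically.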