SAIL is a single Transformer model specifically designed for vision and language, serving as a unified Multimodal Large Language Model (MLLM) that seamlessly integrates raw pixel encoding and language decoding within a single architecture.
Image-to-Text
Transformers