In the previous post of Gemma explained, you reviewed the RecurrentGemma architecture. In this blog post, you will explore the PaliGemma architecture. Let's dive into it!
PaliGemma 3B
PaliGemma is a lightweight open vision-language model (VLM) inspired by PaLI-3, and based on open components such as the SigLIP vision model and the Gemma language model. PaLI stands for Pathways Language and Image model. As the name implies, this model is able to take both image and text inputs and produce a text response, as you can see in this fine-tuning guide.
PaliGemma Architecture
PaliGemma adds an additional vision model to the base Gemma model, which consists of an image encoder. This encoder, together with the text tokens, is passed to a specialized Gemma 2B model. Both the vision model and the Gemma model are trained in various stages, both independently and jointly, to produce the final joint architecture. For full details, see Section 3.2 of the PaLI-3 paper.
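As a minimal sketch (assuming the Hugging Face transformers library and the google/paligemma-3b-mix-224 checkpoint, which this post uses for its examples), the module printout below can be reproduced like this:

# Sketch: load PaliGemma with Hugging Face transformers and inspect its module tree.
from transformers import PaliGemmaForConditionalGeneration

model = PaliGemmaForConditionalGeneration.from_pretrained("google/paligemma-3b-mix-224")
print(model)  # prints the architecture dump shown below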
PaliGemmaForConditionalGeneration(
(vision_tower): SiglipVisionModel(
(vision_model): SiglipVisionTransformer(
(embeddings): SiglipVisionEmbeddings(
(patch_embedding): Conv2d(3, 1152, kernel_size=(14, 14), stride=(14, 14), padding=valid)
(position_embedding): Embedding(256, 1152)
)
(encoder): SiglipEncoder(
(layers): ModuleList(
(0-26): 27 x SiglipEncoderLayer(
(self_attn): SiglipAttention(
(k_proj): Linear(in_features=1152, out_features=1152, bias=True)
(v_proj): Linear(in_features=1152, out_features=1152, bias=True)
(q_proj): Linear(in_features=1152, out_features=1152, bias=True)
(out_proj): Linear(in_features=1152, out_features=1152, bias=True)
)
(layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
(mlp): SiglipMLP(
(activation_fn): PytorchGELUTanh()
(fc1): Linear(in_features=1152, out_features=4304, bias=True)
(fc2): Linear(in_features=4304, out_features=1152, bias=True)
)
(layer_norm2): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
)
)
)
(post_layernorm): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
)
)
(multi_modal_projector): PaliGemmaMultiModalProjector(
(linear): Linear(in_features=1152, out_features=2048, bias=True)
)
(language_model): GemmaForCausalLM(
(model): GemmaModel(
(embed_tokens): Embedding(257216, 2048, padding_idx=0)
(layers): ModuleList(
(0-17): 18 x GemmaDecoderLayer(
(self_attn): GemmaSdpaAttention(
(q_proj): Linear(in_features=2048, out_features=2048, bias=False)
(k_proj): Linear(in_features=2048, out_features=256, bias=False)
(v_proj): Linear(in_features=2048, out_features=256, bias=False)
(o_proj): Linear(in_features=2048, out_features=2048, bias=False)
(rotary_emb): GemmaRotaryEmbedding()
)
(mlp): GemmaMLP(
(gate_proj): Linear(in_features=2048, out_features=16384, bias=False)
(up_proj): Linear(in_features=2048, out_features=16384, bias=False)
(down_proj): Linear(in_features=16384, out_features=2048, bias=False)
(act_fn): PytorchGELUTanh()
)
(input_layernorm): GemmaRMSNorm()
(post_attention_layernorm): GemmaRMSNorm()
)
)
(norm): GemmaRMSNorm()
)
(lm_head): Linear(in_features=2048, out_features=257216, bias=False)
)
)
vision_tower (SiglipVisionModel)
This component is responsible for processing the input image.
It uses SiglipVisionTransformer, which is a type of transformer architecture designed for vision tasks.
embeddings (SiglipVisionEmbeddings)
PaliGemma takes one or more images as input, which are turned into “soft tokens” by the SigLIP encoder.
It breaks the image into smaller patches, similar to how a text model processes words in a sentence. The model then learns to capture relationships between these patches, effectively understanding the image’s visual content.
patch_embedding
It uses a convolutional layer (Conv2d) with the following parameters.
- 3: The input has 3 channels (for RGB images)
- 1152: The output has 1152 channels, which is the embedding dimension of each patch
- kernel_size=(14, 14): Each patch is a 14×14 pixel square
- stride=(14, 14): The patches are taken with no overlap (the convolutional filter moves 14 pixels at a time)
- padding='valid': No padding is applied, so the output size will be smaller than the input size.
position_embedding
Position embeddings are added to each patch embedding to encode spatial information (i.e., where each patch was located in the original image).
This is done using a learned embedding layer (Embedding) that takes the position of each patch as input (up to 256 positions) and outputs a vector of size 1152 (the same as the patch embedding dimension).
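As a rough sketch of what these two layers do together (assuming a 224 x 224 input image, which yields a 16 x 16 grid of 256 patches):

# Sketch of the SiglipVisionEmbeddings step: patchify a 224x224 RGB image
# with a strided convolution, then add learned position embeddings.
import torch
import torch.nn as nn

patch_embedding = nn.Conv2d(3, 1152, kernel_size=14, stride=14)  # 'valid' padding by default
position_embedding = nn.Embedding(256, 1152)

image = torch.randn(1, 3, 224, 224)                    # (batch, channels, height, width)
patches = patch_embedding(image)                       # (1, 1152, 16, 16): a 16x16 grid of patches
patches = patches.flatten(2).transpose(1, 2)           # (1, 256, 1152): one embedding per patch
positions = torch.arange(256)                          # patch indices 0..255
embeddings = patches + position_embedding(positions)   # add spatial information
print(embeddings.shape)                                # torch.Size([1, 256, 1152])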
encoder (SiglipEncoder)
The embeddings pass through a series of SiglipEncoderLayer blocks, each consisting of self-attention and feed-forward neural networks. This helps the model capture relationships between different parts of the image.
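A simplified sketch of what each of the 27 encoder layers computes, using PyTorch's built-in multi-head attention as a stand-in for SiglipAttention and the dimensions from the printout above:

# Sketch of one encoder layer: pre-norm self-attention and a pre-norm MLP,
# each wrapped in a residual connection.
import torch.nn as nn

class EncoderLayerSketch(nn.Module):
    def __init__(self, dim=1152, mlp_dim=4304, num_heads=16):
        super().__init__()
        self.layer_norm1 = nn.LayerNorm(dim, eps=1e-6)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.layer_norm2 = nn.LayerNorm(dim, eps=1e-6)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim),
            nn.GELU(approximate="tanh"),
            nn.Linear(mlp_dim, dim),
        )

    def forward(self, x):
        h = self.layer_norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]  # residual around attention
        x = x + self.mlp(self.layer_norm2(x))                   # residual around the MLP
        return x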
multi_modal_projector (PaliGemmaMultiModalProjector)
This component projects the output of the vision tower into a multi-modal space. This is achieved using a simple linear layer, and it allows the vision and language representations to be combined effectively.
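Conceptually it is just one matrix multiplication; a minimal sketch with the dimensions from the printout above:

# Sketch of PaliGemmaMultiModalProjector: map each 1152-dim SigLIP patch embedding
# into Gemma's 2048-dim token-embedding space.
import torch
import torch.nn as nn

projector = nn.Linear(1152, 2048)
image_tokens = torch.randn(1, 256, 1152)   # output of the vision tower
projected = projector(image_tokens)        # (1, 256, 2048), ready to sit alongside text embeddings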
language_model (GemmaForCausalLM)
This component is a language model based on the Gemma 2B model.
It takes the multi-modal representation from the projector as input and generates text output.
For the text input, each checkpoint was trained with various sequence lengths. For example, paligemma-3b-mix-224 was trained with sequence length 256 (input text + output text, tokenized by Gemma's tokenizer).
PaliGemma uses the Gemma tokenizer with 256,000 tokens, but extends its vocabulary with 1024 entries that represent coordinates in normalized image space (<loc0000>…<loc1023>), and another 128 entries (<seg000>…<seg127>) that are codewords used by a lightweight referring-expression segmentation vector-quantized variational auto-encoder (VQ-VAE) (256000 + 1024 + 128 = 257152; the embedding layer above is slightly larger, at 257216, as the vocabulary is padded with a few extra unused entries).
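A quick sketch of how these extra tokens are laid out:

# Sketch: the extended vocabulary adds 1024 <locXXXX> and 128 <segXXX> tokens
# on top of Gemma's 256000 regular tokens.
loc_tokens = [f"<loc{i:04d}>" for i in range(1024)]   # <loc0000> ... <loc1023>
seg_tokens = [f"<seg{i:03d}>" for i in range(128)]    # <seg000> ... <seg127>
print(loc_tokens[0], loc_tokens[-1], seg_tokens[0], seg_tokens[-1])
print(256000 + len(loc_tokens) + len(seg_tokens))     # 257152 base + extension tokens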
Object Segmentation Example
Extra tokens encode object detection and image segmentation. Below is an example output from paligemma-3b-mix-224. You can try it yourself in the HuggingFace live demo.
Output from PaliGemma with the prompt “segment floor;cat;person;\n”
The outputs from the model are unintuitive to decode if you are not familiar with ML and computer vision tasks.
The first 4 location tokens represent the coordinates of the bounding box, ranging from 0 to 1023. These coordinates are independent of the aspect ratio, since the image is assumed to be resized to 1024 x 1024.
For example, the output shows the cat's location within the coordinates (382, 637) and (696, 784). In this coordinate system, the top-left corner is denoted as (0, 0), and the vertical coordinate is listed before the horizontal coordinate.
The mask is encoded with the next 16 segmentation tokens. A neural network model (VQ-VAE) can reconstruct masks from quantized representations (codebook indices) by decoding these values. You can find the exact code here.
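A minimal sketch of turning the four location tokens back into a box in the original image (the token string and the 640 x 480 image size below are hypothetical, just to illustrate the format described above):

# Sketch: convert four <locXXXX> tokens into pixel coordinates of the original image.
# The model emits (y_min, x_min, y_max, x_max) on a 1024x1024 grid, top-left = (0, 0).
import re

def decode_box(token_str, image_width, image_height):
    y_min, x_min, y_max, x_max = [int(v) for v in re.findall(r"<loc(\d{4})>", token_str)[:4]]
    scale_x, scale_y = image_width / 1024, image_height / 1024
    return (x_min * scale_x, y_min * scale_y, x_max * scale_x, y_max * scale_y)

# Hypothetical detection string using the cat coordinates mentioned above.
print(decode_box("<loc0382><loc0637><loc0696><loc0784> cat", 640, 480))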
In the end, you can obtain this beautiful result from the output of PaliGemma.
Summary
In this article, you learned about PaliGemma.
The Gemma family presents a unique opportunity to understand modern large language model systems by offering a set of open-weights models with similar core architectures but designed for different use cases. These models, released by Google for researchers, developers, and end users, span various functionalities and complexities.
We hope this overview provides a concise understanding of the Gemma model family, highlighting its versatility and suitability for a wide range of tasks.
The Google Developer Community Discord server is an excellent platform to showcase your projects, connect with fellow developers, and engage in interactive discussions. Consider joining the server to explore these exciting opportunities.
Thanks for reading!