In the previous post of the Gemma explained series, we discussed the Gemma architecture. In this post, you'll explore the latest model, Gemma 2. Let's get started!
Gemma 2
Recently, we released Gemma 2, our groundbreaking new suite of open models, setting a new standard for performance and accessibility. Available in 2B, 9B, and 27B parameter sizes, Gemma 2 has quickly made its mark. Our 27B model rapidly climbed the LMSYS Chatbot Arena leaderboard, surpassing even popular models more than twice its size in engaging, real-world conversations, establishing itself as one of the highest-ranking and most useful open models. Meanwhile, the Gemma 2 2B model shows exceptional conversational AI ability by outperforming all GPT-3.5 models on the Chatbot Arena, at a size that can run on edge devices.
Developers can access robust tuning capabilities with Gemma 2 across platforms and tools. Fine-tuning Gemma 2 is simplified with cloud-based solutions like Google Cloud and community tools like Axolotl. Seamless integration with partners such as Hugging Face and NVIDIA TensorRT-LLM, as well as our own JAX and Keras, enables performance optimization and efficient deployment across a wide range of hardware configurations.
Here are the core parameters of the new models:
Key Differences
Gemma 2 shares the same architectural foundation as the original Gemma models, including the implementation of Rotary Position Embeddings (RoPE) and the approximated GeGLU non-linearity. However, it introduces several architectural innovations that set it apart from its predecessors.
Alternating Local and Global Attention
Instead of attending to all words in a text at once, the model sometimes focuses on a small window of words (local attention) and sometimes considers all words (global attention). Alternating between the two helps the model capture both the immediate context and the overall meaning of the text efficiently.
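To make the difference concrete, here is a minimal sketch (not Gemma 2's actual implementation) contrasting a full causal mask, used by the global-attention layers, with a sliding-window mask, used by the local-attention layers; seq_len and window_size are illustrative values.

import torch

seq_len, window_size = 8, 4  # illustrative values

# Global attention: a standard causal mask, so each token sees every earlier token.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Local attention: additionally restrict each token to the last `window_size` tokens.
offsets = torch.arange(seq_len).unsqueeze(1) - torch.arange(seq_len).unsqueeze(0)
local_mask = causal_mask & (offsets < window_size)

# Gemma 2 alternates these patterns across decoder layers: roughly half the layers
# use the sliding-window mask and the other half use the full causal mask.
print(causal_mask.int())
print(local_mask.int())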
Logit Soft-Capping
Imagine you're training a model to predict the next word in a sentence. Sometimes the model becomes overly confident about a particular word, even when it isn't the best choice. Logit soft-capping prevents this by limiting how confident the model can be about its predictions, leading to better overall performance.
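Soft-capping simply squashes the logits through a tanh so they can never exceed a fixed cap. A minimal sketch (the cap value of 30.0 is illustrative; the released Gemma 2 configuration defines its own attention and final-logit caps):

import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # Rescale, squash with tanh (bounded to [-1, 1]), then scale back up,
    # so every logit ends up strictly inside (-cap, cap).
    return cap * torch.tanh(logits / cap)

logits = torch.tensor([3.0, 80.0, -200.0])
print(soft_cap(logits, cap=30.0))  # extreme values saturate near ±30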
RMSNorm for Pre- and Post-Normalization
Think of this as a way to keep the model's calculations from becoming too large or too small during training. Just as we might adjust the volume on a speaker to prevent distortion, RMSNorm keeps the activations flowing through the model within a reasonable range, leading to more stable and effective training.
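For reference, here is a minimal RMSNorm sketch (the epsilon and tensor shapes are illustrative and not taken from the Gemma 2 source):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable per-channel scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Divide by the root mean square of the features (no mean subtraction,
        # unlike LayerNorm), then apply the learned scale.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

x = torch.randn(2, 4, 16)
print(RMSNorm(16)(x).shape)  # torch.Size([2, 4, 16])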
Grouped-Query Attention (GQA)
This technique helps the model process information more efficiently, especially when dealing with large amounts of text. It improves upon traditional multi-head attention (MHA) by grouping query heads so they share key and value heads, enabling faster processing, especially for large models. It's like dividing a big job into smaller, more manageable chunks, allowing the model to understand the relationships between words faster without sacrificing accuracy.
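Here is a minimal GQA sketch in which pairs of query heads share a single key/value head (the head counts and dimensions are illustrative, not the Gemma 2 values):

import torch
import torch.nn.functional as F

batch, seq, head_dim = 1, 6, 8
num_q_heads, num_kv_heads = 4, 2          # two query heads share each K/V head
group_size = num_q_heads // num_kv_heads

q = torch.randn(batch, num_q_heads, seq, head_dim)
k = torch.randn(batch, num_kv_heads, seq, head_dim)
v = torch.randn(batch, num_kv_heads, seq, head_dim)

# Repeat each K/V head so it serves its whole group of query heads.
k = k.repeat_interleave(group_size, dim=1)
v = v.repeat_interleave(group_size, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 4, 6, 8])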
Gemma 27B
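The module tree below is the printed structure of the 27B model. Assuming you have access to the Hugging Face checkpoint (and have accepted the Gemma license), a similar dump can be produced with something like:

from transformers import AutoModelForCausalLM

# Loads the 27B checkpoint, then prints the module tree shown below.
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-27b")
print(model)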
Gemma2ForCausalLM(
  (model): Gemma2Model(
    (embed_tokens): Embedding(256000, 4608, padding_idx=0)
    (layers): ModuleList(
      (0-45): 46 x Gemma2DecoderLayer(
        (self_attn): Gemma2SdpaAttention(
          (q_proj): Linear(in_features=4608, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4608, out_features=2048, bias=False)
          (v_proj): Linear(in_features=4608, out_features=2048, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4608, bias=False)
          (rotary_emb): Gemma2RotaryEmbedding()
        )
        (mlp): Gemma2MLP(
          (gate_proj): Linear(in_features=4608, out_features=36864, bias=False)
          (up_proj): Linear(in_features=4608, out_features=36864, bias=False)
          (down_proj): Linear(in_features=36864, out_features=4608, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): Gemma2RMSNorm()
        (post_attention_layernorm): Gemma2RMSNorm()
        (pre_feedforward_layernorm): Gemma2RMSNorm()
        (post_feedforward_layernorm): Gemma2RMSNorm()
      )
    )
    (norm): Gemma2RMSNorm()
  )
  (lm_head): Linear(in_features=4608, out_features=256000, bias=False)
)
self_attn
In the self-attention mechanism, Gemma 2 uses Grouped-Query Attention (GQA).
k_proj and v_proj share a head dimension of 128 with 16 heads each (128 x 16 = 2048). In contrast, q_proj and o_proj have 32 heads (128 x 32 = 4096) in parallel.
Note that the Gemma 2 9B model uses the same GQA scheme but with a different number of heads (8 for k_proj and v_proj, 16 for q_proj and o_proj) and a head dimension of 256:
(self_attn): Gemma2SdpaAttention(
  (q_proj): Linear(in_features=3584, out_features=4096, bias=False)
  (k_proj): Linear(in_features=3584, out_features=2048, bias=False)
  (v_proj): Linear(in_features=3584, out_features=2048, bias=False)
  (o_proj): Linear(in_features=4096, out_features=3584, bias=False)
  (rotary_emb): Gemma2RotaryEmbedding()
)
The 2B model uses 4 heads for k_proj and v_proj, 8 heads for q_proj and o_proj, and a head dimension of 256.
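The projection sizes in the dumps above follow directly from heads × head_dim. A quick sanity check using the numbers quoted above:

configs = {
    # model size: (num_query_heads, num_kv_heads, head_dim), as described above
    "27B": (32, 16, 128),
    "9B": (16, 8, 256),
    "2B": (8, 4, 256),
}

for name, (q_heads, kv_heads, head_dim) in configs.items():
    q_out = q_heads * head_dim    # out_features of q_proj (and in_features of o_proj)
    kv_out = kv_heads * head_dim  # out_features of k_proj and v_proj
    print(f"{name}: q/o = {q_out}, k/v = {kv_out}")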
pre_feedforward_layernorm and post_feedforward_layernorm
Another important difference is the inclusion of additional RMSNorm layers in Gemma 2, before and after the feed-forward block, which improves the stability of training.
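A minimal sketch of how these norms wrap each sub-block inside a decoder layer (linear layers and LayerNorm are stand-ins for the real attention/MLP blocks and Gemma2RMSNorm; only the placement of the norms is the point):

import torch
import torch.nn as nn

class SketchDecoderLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Stand-ins for the real sub-blocks and norms.
        self.attn, self.mlp = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.input_layernorm = nn.LayerNorm(dim)
        self.post_attention_layernorm = nn.LayerNorm(dim)
        self.pre_feedforward_layernorm = nn.LayerNorm(dim)
        self.post_feedforward_layernorm = nn.LayerNorm(dim)

    def forward(self, x):
        # Attention sub-block: normalize before and after, then add the residual.
        x = x + self.post_attention_layernorm(self.attn(self.input_layernorm(x)))
        # Feed-forward sub-block wrapped by the extra pre/post norms added in Gemma 2.
        return x + self.post_feedforward_layernorm(self.mlp(self.pre_feedforward_layernorm(x)))

print(SketchDecoderLayer(16)(torch.randn(2, 4, 16)).shape)  # torch.Size([2, 4, 16])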
Key Findings
Our technical report provides in-depth details, but here is a quick summary of Gemma 2's main findings:
Distillation vs. Training from Scratch:
We trained the 2B and 9B models with knowledge distillation from the larger 27B model.
Distilling knowledge from a larger model, even with an equal number of training tokens, leads to significant performance improvements (a generic distillation loss is sketched after this list).
Grouped-Query Attention vs. Multi-Head Attention:
Replacing MHA with GQA results in comparable performance while offering parameter efficiency and faster inference, making GQA the preferred choice.
Model Depth vs. Width:
A deeper model shows slightly better performance than a wider model with the same parameter count.
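For the distillation finding above, here is a generic knowledge distillation loss sketch (KL divergence between the teacher's and student's next-token distributions; the temperature and tensor sizes are illustrative, and this is not necessarily the exact objective used for Gemma 2):

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 1.0):
    # The student is trained to match the teacher's softened next-token distribution.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

student_logits = torch.randn(4, 32)  # (tokens, vocab size) -- illustrative
teacher_logits = torch.randn(4, 32)
print(distillation_loss(student_logits, teacher_logits))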
What's Next?
In this article, you learned about Gemma 2, the next generation of Gemma models.
In our next series of posts, you will learn about RecurrentGemma, an open model based on Griffin.
If you want to delve into the fascinating world of AI and gain insights from the experts shaping its development, head over to goo.gle/ai-podcast or search for the show "People of AI Podcast" on any podcast platform.
Stay tuned, and thanks for reading!