Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models.
Different variations of Gemma are designed for different use cases and modalities, such as:
- Single modality (Text in, Text out)
- Specialization for coding use cases
- Multi modality (Text and Image in, Text out)
- Varying sizes for different hardware types, inference needs, and other constraints.
- “Novel” architectures
Because all these models share a similar DNA, the Gemma family presents a unique way to learn about the architectures and design choices that are available in modern LLM systems. We hope this contributes to a rich ecosystem of open models and promotes a greater understanding of how LLM systems work.
This series will cover:
- Gemma 1 (2B, 7B) – Transformer based text-to-text models.
- CodeGemma (2B and 7B) – A fine-tuned version of Gemma, optimized for code completion and generation.
- Gemma 2 (2B, 9B, 27B) – Updated text-to-text models trained with a newer architecture, with the 2B and 9B versions trained through distillation from larger models.
- RecurrentGemma (2B, 9B) – A model built on the novel Griffin architecture. This architecture uses a mixture of local attention and linear recurrences to achieve fast inference when generating long sequences.
- PaliGemma (3B) – A vision-language model that can take in text and images and provide a text output.
How to use this guide
In this series, we will
- Collate the specific architectures of various models
- Explain how these parameters affect model generations (e.g. num embeddings, Multi Query vs Multi Head vs Grouped Query)
- Provide code examples of the models for further exploration
To show information about a model, we load it with Hugging Face Transformers and print it, as in the simple code below.
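from transformers import AutoModelForCausalLM

# Loads the Gemma 7B checkpoint (downloaded from the Hugging Face Hub on first use)
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b")
print(model)  # lists every module together with its shape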
You can also explore inside the model with torchinfo or summary() in the Keras Model class API.
What this guide is not
This guide is not an introduction to AI. It assumes working knowledge of neural networks, Transformers and associated terms like tokens. If you need a refresher on these concepts, here are some resources to get you started:
A hands-on neural network learning tool that works in the browser
An introduction to transformers
Gemma
Gemma is an open weight LLM. It comes in both instruction-tuned and raw, pretrained variants at various parameter sizes. It is based on the LLM architecture introduced by Google Research in the Attention Is All You Need paper. Its primary function is to generate text token by token, based on a prompt provided by a user. In tasks like translation, Gemma takes a sentence from one language as input and outputs its equivalent in another language.
As you will soon see, Gemma is a great model by itself and also lends itself to custom extensions to meet different user needs.
Gemma Architecture
First, let's see the transformer decoder that Gemma models are based on.
Unlike the original encoder-decoder transformer model architecture introduced in “Attention Is All You Need”, Gemma is solely a “decoder-only” model.
The core parameters of the architecture are summarized in the table below.
Models are trained on a context length of 8192 tokens. This means they can process up to approximately 6144 words (using the rule of thumb of 100 tokens ~= 75 words) at a time.
It's worth noting that the practical input limit can vary based on the task and usage. This is because text generation consumes tokens within the context window, effectively reducing space for new input. Although the technical input limit remains constant, generated output becomes part of the subsequent input, influencing further generations.
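The token budget described above can be sketched with a few lines of arithmetic. This is only an illustration: the helper names are hypothetical, and the 75-words-per-100-tokens ratio is the rule of thumb quoted in the text, not an exact property of the tokenizer.

```python
# Figures taken from the text: 8192-token context, ~75 words per 100 tokens.
CONTEXT_LENGTH = 8192
WORDS_PER_TOKEN = 75 / 100


def approx_words(num_tokens: int) -> int:
    """Rough word count for a token count, using the rule of thumb."""
    return int(num_tokens * WORDS_PER_TOKEN)


def remaining_budget(prompt_tokens: int, generated_tokens: int) -> int:
    """Tokens still free in the window after the prompt and the generated text."""
    used = prompt_tokens + generated_tokens
    return max(CONTEXT_LENGTH - used, 0)


print(approx_words(CONTEXT_LENGTH))   # 6144
print(remaining_budget(6000, 1500))   # 692
```

Because generated tokens count against the same window, a long prompt leaves correspondingly less room for output.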
d_model (2B: 2048, 7B: 3072)
d_model represents the size of the embeddings (vector representations of words or subwords, a.k.a. tokens) used as input to the decoder. It also determines the size of the internal representation within the decoder layers.
The product “d_model x Num heads x Head size” gives the number of weights in the query projection (q_proj) of self_attn.
A larger d_model value means the model has more “space” to represent the nuances of different words and their relationships. This can lead to better performance, especially for complex language tasks. However, increasing d_model also makes the model larger and more computationally expensive to train and use.
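To make the d_model x Num heads x Head size product concrete, here is a small sketch that evaluates it for both model sizes using the values listed in this section. The function name is hypothetical, and the count covers only one bias-free query projection, not the whole attention block.

```python
def q_proj_weight_count(d_model: int, num_heads: int, head_size: int) -> int:
    """Weights in a bias-free query projection:
    d_model inputs x (num_heads * head_size) outputs."""
    return d_model * num_heads * head_size


gemma_2b_q = q_proj_weight_count(d_model=2048, num_heads=8, head_size=256)
gemma_7b_q = q_proj_weight_count(d_model=3072, num_heads=16, head_size=256)

print(f"2B q_proj weights: {gemma_2b_q:,}")  # 4,194,304
print(f"7B q_proj weights: {gemma_7b_q:,}")  # 12,582,912
```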
Layers (2B: 18, 7B: 28)
Transformers consist of multiple stacked layers. Deeper models have more layers, and therefore more parameters, and can learn more intricate patterns. However, these additional parameters mean they are also more prone to overfitting and require more computational resources.
This augmented representational capacity might result in the model learning noise or specific training data patterns that lack the ability to generalize to novel examples.
Furthermore, deeper models often need more training data to avoid overfitting. In cases where available data is limited, the model might lack sufficient examples to learn a generalizable representation, leading to the memorization of training data instead.
Feedforward hidden dims (2B: 32768, 7B: 49152)
Each Transformer layer includes a feedforward network after the attention mechanism. This network has its own dimensionality, often larger than the d_model size to increase the model's expressive power.
It is implemented as a multi-layer perceptron (MLP), a kind of neural network, to further transform the embeddings and extract more intricate patterns.
In Gemma, the standard ReLU non-linearity is replaced by the GeGLU activation function, a variation of GLU (Gated Linear Unit). GeGLU divides the activation into two parts: a sigmoidal part and a linear projection. The output of the sigmoidal part is element-wise multiplied with the linear projection, resulting in a non-linear activation function.
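The gating described above can be sketched in plain Python. This is a toy illustration, not Gemma's implementation: the weights are made-up 2-to-3-to-2 matrices, and the gate uses the tanh approximation of GELU, an assumption based on the PytorchGELUTanh activation that appears in the Gemma 7B printout further down.

```python
import math


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU, used here as the gate non-linearity."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def matvec(weights, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def geglu_mlp(x, w_gate, w_up, w_down):
    """Gated feedforward block: down(GELU(gate(x)) * up(x))."""
    gate = [gelu_tanh(g) for g in matvec(w_gate, x)]   # non-linear gate branch
    up = matvec(w_up, x)                               # linear branch
    hidden = [g * u for g, u in zip(gate, up)]         # element-wise product
    return matvec(w_down, hidden)                      # back to the input width


# Toy weights (hypothetical): d_model = 2, hidden size = 3.
w_gate = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]
w_up = [[0.3, 0.7], [-0.6, 0.2], [0.9, -0.1]]
w_down = [[0.2, -0.5, 0.4], [0.6, 0.1, -0.3]]

print(geglu_mlp([1.0, -2.0], w_gate, w_up, w_down))
```

The output has the same width as the input, which is what allows the block to be stacked layer after layer.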
Num heads (2B: 8, 7B: 16)
Each Transformer layer contains multiple attention mechanisms working in parallel. These “heads” allow the model to focus on different aspects of the input sequence simultaneously. Increasing the number of heads can enhance the model's ability to capture diverse relationships in the data.
Num KV heads (2B: 1, 7B: 16)
The 7B model uses multi-head attention (MHA), while the 2B model uses multi-query attention (MQA). MQA shares the same key and value projections across heads, which means each head focuses on the same underlying representation but with different query projections.
The original MHA offers richer representation learning but comes with higher computational costs. MQA provides an efficient alternative that has been shown to be effective.
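One place where the difference between MHA and MQA shows up is the key/value cache kept during generation. The sketch below estimates its size from the layer, KV-head and head-size values listed in this article. The function names are hypothetical, and the 2 bytes per value (a 16-bit data type) is an assumption, not something the article states.

```python
def kv_values_per_token(num_layers: int, num_kv_heads: int, head_size: int) -> int:
    """Cached numbers per token: one key and one value vector
    per KV head in every layer."""
    return 2 * num_layers * num_kv_heads * head_size


def kv_cache_mib(values_per_token: int, seq_len: int, bytes_per_value: int = 2) -> float:
    """Cache size in MiB for a sequence; bytes_per_value=2 assumes 16-bit storage."""
    return values_per_token * seq_len * bytes_per_value / 2 ** 20


gemma_2b = kv_values_per_token(num_layers=18, num_kv_heads=1, head_size=256)   # MQA
gemma_7b = kv_values_per_token(num_layers=28, num_kv_heads=16, head_size=256)  # MHA

print(gemma_2b, gemma_7b)                        # 9216 229376
print(f"{kv_cache_mib(gemma_2b, 8192):.0f} MiB")  # 144 MiB
print(f"{kv_cache_mib(gemma_7b, 8192):.0f} MiB")  # 3584 MiB
```

Under these assumptions, a single shared KV head keeps the 2B cache far smaller per token than the 7B cache, which is the efficiency argument for MQA.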
Head size (2B: 256, 7B: 256)
It refers to the dimensionality of each attention head within the multi-head attention mechanism. It is commonly calculated by dividing the embedding dimension by the number of heads. For example, if the embedding dimension is 2048 and there are 8 heads, then each head has a size of 256. Gemma 7B does not follow this rule: it keeps a head size of 256 with 16 heads, so its attention projections are 4096 wide (256 x 16) rather than 3072.
Vocab size (2B: 256128, 7B: 256128)
It defines the number of unique tokens (words, sub words or characters) that the model understands and can process. The Gemma tokenizer is based on SentencePiece. The size of the vocabulary is predetermined before training. SentencePiece then learns the optimal subword segmentation based on the chosen vocabulary size and the training data. Gemma's large 256k vocabulary allows it to handle diverse text inputs and potentially improve performance on various tasks, e.g. handling multilingual text inputs.
Gemma 7B
GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 3072, padding_idx=0)
    (layers): ModuleList(
      (0-27): 28 x GemmaDecoderLayer(
        (self_attn): GemmaSdpaAttention(
          (q_proj): Linear(in_features=3072, out_features=4096, bias=False)
          (k_proj): Linear(in_features=3072, out_features=4096, bias=False)
          (v_proj): Linear(in_features=3072, out_features=4096, bias=False)
          (o_proj): Linear(in_features=4096, out_features=3072, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear(in_features=3072, out_features=24576, bias=False)
          (up_proj): Linear(in_features=3072, out_features=24576, bias=False)
          (down_proj): Linear(in_features=24576, out_features=3072, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): GemmaRMSNorm()
        (post_attention_layernorm): GemmaRMSNorm()
      )
    )
    (norm): GemmaRMSNorm()
  )
  (lm_head): Linear(in_features=3072, out_features=256000, bias=False)
)
embed_tokens (Embedding Layer)
This layer converts the input tokens (words or subwords) into dense numerical representations (embeddings) that the model can process. It has a vocabulary size of 256,000 and creates embeddings of dimension 3072.
layers
This is the heart of the model, consisting of 28 stacked GemmaDecoderLayer blocks. Each of these layers refines the token embeddings to capture complex relationships between words and their context.
self_attn
In the self-attention mechanism, the model assigns different weights to the words in the input when creating the next word. Leveraging a scaled dot-product attention mechanism, the model employs linear projections (q_proj, k_proj, v_proj, and o_proj) to generate query, key, value, and output representations.
All out_features values are the same 4096 for q_proj, k_proj and v_proj, as this model uses Multi Head Attention (MHA). They have 16 heads with a size of 256 in parallel, totaling 4096 (256 x 16).
Furthermore, the model leverages positional information more effectively by using rotary_emb (GemmaRotaryEmbedding) for positional encoding (a.k.a. RoPE).
Finally, the o_proj layer projects the attention output back to the original dimension (3072).
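The rotary positional encoding mentioned above can be illustrated with a minimal sketch. It follows the textbook formulation that rotates consecutive pairs of dimensions by a position-dependent angle. Library implementations, GemmaRotaryEmbedding included, may pair the dimensions differently, and the base of 10000 is an assumed default. The sketch shows that the query-key dot product depends only on the relative distance between positions.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by position * base**(-i / dim)."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        angle = position * base ** (-i / dim)
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_a - y * sin_a, x * sin_a + y * cos_a])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy query and key vectors (hypothetical values, head size 4).
q = [1.0, 0.0, 0.5, -0.5]
k = [0.2, 0.3, -0.1, 0.4]

# Same relative offset (2 positions) at two different absolute positions.
near = dot(rope(q, 3), rope(k, 1))
far = dot(rope(q, 10), rope(k, 8))
print(near, far)  # the two scores agree up to floating-point error
```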
Note that the Gemma 2B model uses Multi Query Attention (MQA).
k_proj and v_proj share the same head with a size of 256, resulting in out_features of 256. In contrast, q_proj and o_proj have 8 heads (256 x 8 = 2048) in parallel.
(self_attn): GemmaSdpaAttention(
  (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
  (k_proj): Linear(in_features=2048, out_features=256, bias=False)
  (v_proj): Linear(in_features=2048, out_features=256, bias=False)
  (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
  (rotary_emb): GemmaRotaryEmbedding()
)
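The projection widths in both printouts follow from the head counts. As a quick check, the sketch below derives them from Num heads, Num KV heads and Head size. The AttnShapes type and the function name are hypothetical helpers, not part of any library.

```python
from typing import NamedTuple


class AttnShapes(NamedTuple):
    q_out: int   # out_features of q_proj
    k_out: int   # out_features of k_proj
    v_out: int   # out_features of v_proj
    o_in: int    # in_features of o_proj


def attention_shapes(num_heads: int, num_kv_heads: int, head_size: int) -> AttnShapes:
    """Queries use every head; keys and values use only the KV heads."""
    q_width = num_heads * head_size
    kv_width = num_kv_heads * head_size
    return AttnShapes(q_width, kv_width, kv_width, q_width)


print(attention_shapes(num_heads=16, num_kv_heads=16, head_size=256))  # Gemma 7B (MHA)
print(attention_shapes(num_heads=8, num_kv_heads=1, head_size=256))    # Gemma 2B (MQA)
```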
mlp
It uses gate_proj and up_proj for a gating mechanism, followed by down_proj to reduce the dimension back to 3072.
input_layernorm, post_attention_layernorm and norm
These normalization layers stabilize training and improve the model's ability to learn effectively.
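For readers unfamiliar with RMSNorm, here is a minimal sketch of the operation. It rests on two assumptions that the article does not state: the learned weight is applied as (1 + weight), and eps is 1e-6. Treat it as an illustration of the idea rather than the exact GemmaRMSNorm code.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x so its root-mean-square is about 1, then apply a learned gain.

    Assumption: the gain is (1 + weight), so zero weights leave the
    normalized vector unchanged.
    """
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [v * scale * (1.0 + w) for v, w in zip(x, weight)]


normalized = rms_norm([3.0, 4.0], weight=[0.0, 0.0])
rms = math.sqrt(sum(v * v for v in normalized) / len(normalized))
print(normalized, rms)  # rms is very close to 1.0
```

Unlike LayerNorm, this operation does not subtract the mean; it only rescales.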
lm_head
This final layer maps the refined embeddings (3072) back to a probability distribution for the next token over the vocabulary space (256000).
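The shapes in the Gemma 7B printout are enough to estimate the total parameter count. The sketch below adds them up under two assumptions not stated in the article: each GemmaRMSNorm holds one weight vector of size d_model, and lm_head may or may not share its weights with embed_tokens, so both totals are shown.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights in a Linear layer with bias=False."""
    return in_features * out_features


# Values read from the Gemma 7B printout.
VOCAB, D_MODEL, ATTN_DIM, FFN_DIM, NUM_LAYERS = 256000, 3072, 4096, 24576, 28

# q_proj, k_proj, v_proj map 3072 -> 4096; o_proj maps 4096 -> 3072.
attention = 3 * linear_params(D_MODEL, ATTN_DIM) + linear_params(ATTN_DIM, D_MODEL)
# gate_proj and up_proj map 3072 -> 24576; down_proj maps 24576 -> 3072.
mlp = 2 * linear_params(D_MODEL, FFN_DIM) + linear_params(FFN_DIM, D_MODEL)
# Assumption: each of the two per-layer norms has d_model weights.
norms = 2 * D_MODEL

per_layer = attention + mlp + norms
embedding = VOCAB * D_MODEL
final_norm = D_MODEL

total_tied = embedding + NUM_LAYERS * per_layer + final_norm
total_untied = total_tied + linear_params(D_MODEL, VOCAB)

print(f"per layer:              {per_layer:,}")     # 276,830,208
print(f"total (lm_head shared): {total_tied:,}")    # 8,537,680,896
print(f"total (lm_head apart):  {total_untied:,}")  # 9,324,112,896
```

Under these assumptions, the embedding table alone accounts for roughly 0.79 billion parameters, a direct consequence of the large vocabulary.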
CodeGemma (2B and 7B)
CodeGemma models are fine-tuned Gemma models that are optimized for code completion and coding chat assistance. CodeGemma models are trained on more than 500 billion tokens of primarily code. In addition, CodeGemma adds fill-in-the-middle (FIM) capability, allowing completions that occur between two pieces of existing text.
CodeGemma highlights the finetunability of the Gemma checkpoints. Through additional training, the models become specialized at a certain task, learning a more complex completion than pure suffix completion.
CodeGemma Usage
You can use 4 user-defined tokens – 3 for FIM and a “<|file_separator|>” token for multi-file context support.
BEFORE_CURSOR = "<|fim_prefix|>"
AFTER_CURSOR = "<|fim_suffix|>"
AT_CURSOR = "<|fim_middle|>"
FILE_SEPARATOR = "<|file_separator|>"
Imagine that you are trying to complete the code shown on the screen below.
The input prompt should then look like this:
<|fim_prefix|>import <|fim_suffix|>if __name__ == "__main__":\n    sys.exit(0)<|fim_middle|>
The model will provide “sys” as the suggested code completion.
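A prompt in this format can be assembled with a small helper. The sketch below only builds the string from the tokens listed above. The function name is hypothetical, and sending the prompt to a model is left out.

```python
BEFORE_CURSOR = "<|fim_prefix|>"
AFTER_CURSOR = "<|fim_suffix|>"
AT_CURSOR = "<|fim_middle|>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Wrap the code before and after the cursor in the FIM control tokens."""
    return f"{BEFORE_CURSOR}{prefix}{AFTER_CURSOR}{suffix}{AT_CURSOR}"


prompt = build_fim_prompt(
    prefix="import ",
    suffix='if __name__ == "__main__":\n    sys.exit(0)',
)
print(prompt)
```

The model is expected to generate the missing middle part after the final token, which is why `<|fim_middle|>` comes last.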
You can explore more about CodeGemma on CodeGemma / Quickstart.
What's Next?
This article discussed the Gemma architecture.
In our next series of posts, you will explore the latest model, Gemma 2. With substantial enhancements in safety measures, this model surpasses its predecessor in terms of performance and efficiency during inference.
Stay tuned and thank you for reading!
📋 The complete Gemma architecture series
- Gemma explained: An overview of Gemma model family architectures
- Gemma explained: What's new in Gemma 2
- Gemma explained: RecurrentGemma architecture
- Gemma explained: PaliGemma architecture