Run a single forward pass of the GPT model for one token position.
This follows GPT-2 with minor differences: RMSNorm in place of LayerNorm, no bias terms, and ReLU in place of GELU.
Token id to embed.
Position index within the sequence.
Per-layer KV cache for keys (mutated: new key appended).
Per-layer KV cache for values (mutated: new value appended).
Model weights.
Model hyperparameters.
Logits vector of length vocabSize.
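The description above can be sketched as a minimal single-token forward pass. This is an illustrative sketch, not the actual model's API: the weight field names (`wte`, `wpe`, `wq`, `wo`, etc.), the hyperparameter struct, and the RMSNorm epsilon are all assumptions.

```typescript
type Vec = number[];

// Assumed hyperparameter and weight layouts for illustration.
interface Config { nLayer: number; nHead: number; nEmbd: number; vocabSize: number; }
interface Layer {
  attnNorm: Vec; wq: Vec[]; wk: Vec[]; wv: Vec[]; wo: Vec[];
  mlpNorm: Vec; wUp: Vec[]; wDown: Vec[];
}
interface Weights { wte: Vec[]; wpe: Vec[]; layers: Layer[]; finalNorm: Vec; lmHead: Vec[]; }

// RMSNorm: scale by reciprocal root-mean-square, then by per-channel gain (no bias).
function rmsNorm(x: Vec, g: Vec): Vec {
  const ms = x.reduce((s, v) => s + v * v, 0) / x.length;
  const inv = 1 / Math.sqrt(ms + 1e-5);
  return x.map((v, i) => v * inv * g[i]);
}

function matVec(m: Vec[], x: Vec): Vec {  // m is [out][in], no bias term
  return m.map(row => row.reduce((s, w, i) => s + w * x[i], 0));
}

function softmax(x: Vec): Vec {
  const mx = Math.max(...x);
  const e = x.map(v => Math.exp(v - mx));
  const s = e.reduce((a, b) => a + b, 0);
  return e.map(v => v / s);
}

// Forward pass for one token position; mutates the per-layer KV caches.
function forward(token: number, pos: number,
                 kCache: Vec[][], vCache: Vec[][],
                 w: Weights, cfg: Config): Vec {
  const headDim = cfg.nEmbd / cfg.nHead;
  // Token embedding plus position embedding.
  let x = w.wte[token].map((v, i) => v + w.wpe[pos][i]);

  w.layers.forEach((layer, li) => {
    // Attention block: pre-RMSNorm, project to q/k/v, append k and v to the cache.
    const xn = rmsNorm(x, layer.attnNorm);
    const q = matVec(layer.wq, xn);
    kCache[li].push(matVec(layer.wk, xn));  // new key appended
    vCache[li].push(matVec(layer.wv, xn));  // new value appended

    const attnOut: Vec = new Array(cfg.nEmbd).fill(0);
    for (let h = 0; h < cfg.nHead; h++) {
      const off = h * headDim;
      // Scaled dot-product scores against every cached key for this head.
      const scores = kCache[li].map(k => {
        let s = 0;
        for (let i = 0; i < headDim; i++) s += q[off + i] * k[off + i];
        return s / Math.sqrt(headDim);
      });
      const p = softmax(scores);
      vCache[li].forEach((v, t) => {
        for (let i = 0; i < headDim; i++) attnOut[off + i] += p[t] * v[off + i];
      });
    }
    const attnProj = matVec(layer.wo, attnOut);
    x = x.map((v, i) => v + attnProj[i]);  // residual connection

    // MLP block: pre-RMSNorm, up-projection, ReLU, down-projection, residual.
    const mn = rmsNorm(x, layer.mlpNorm);
    const hidden = matVec(layer.wUp, mn).map(v => Math.max(0, v));
    const mlpOut = matVec(layer.wDown, hidden);
    x = x.map((v, i) => v + mlpOut[i]);
  });

  // Final RMSNorm then unembedding; result has length vocabSize.
  return matVec(w.lmHead, rmsNorm(x, w.finalNorm));
}
```

Note the causal mask falls out for free: at position `pos` the caches only contain keys and values for positions up to `pos`, so attending over the whole cache is exactly attending over the allowed prefix.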