Token ID sequence for one document (including BOS sentinels).
Flat array of all model parameters.
Model weights.
Model hyperparameters.
Adam optimizer hyperparameters.
Mutable Adam moment buffers.
Current training step (0-indexed), used for bias correction and LR decay.
Total number of training steps, used for linear LR decay.
The scalar loss value for this step.
Perform a single training step: forward pass, loss computation, backward pass, and Adam parameter update.
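
The Adam update at the end of the step, together with the bias correction and linear LR decay mentioned in the step descriptions, might look roughly like the sketch below. It is an illustration, not the original implementation: the function name `adam_step`, the default hyperparameter values, and the exact decay formula `lr * (1 - step / num_steps)` are all assumptions, and the forward and backward passes are omitted, so gradients are passed in directly.

```python
import math

def adam_step(params, grads, m, v, step, num_steps,
              lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Hypothetical Adam update over a flat parameter list, in place.

    params, grads: flat lists of floats of equal length.
    m, v: mutable first/second moment buffers (same length, start at 0).
    step: current training step, 0-indexed.
    num_steps: total number of training steps.
    Returns the learning rate used for this step.
    """
    # Assumed linear decay: full lr at step 0, reaching 0 at num_steps.
    lr_t = lr * (1.0 - step / num_steps)
    # Bias correction needs a 1-indexed step count.
    t = step + 1
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1.0 - beta1) * g
        v[i] = beta2 * v[i] + (1.0 - beta2) * g * g
        m_hat = m[i] / (1.0 - beta1 ** t)
        v_hat = v[i] / (1.0 - beta2 ** t)
        params[i] -= lr_t * m_hat / (math.sqrt(v_hat) + eps)
    return lr_t

# Usage: one update on a single parameter with gradient 2.0.
params, m, v = [1.0], [0.0], [0.0]
used_lr = adam_step(params, [2.0], m, v, step=0, num_steps=10, lr=0.1)
```

Because the step is 0-indexed, the sketch uses `step + 1` in the bias-correction terms; using `step` directly would divide by zero on the first update.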