Rotary Positional Embeddings in nanoGPT
Author: Shuvayan Brahmachary — Sep 21, 2025
Introduction
Language models must not only learn the relationships between tokens but also capture their order. In character-level modeling tasks, where the unit is a single character, the ability to represent sequential positions becomes even more crucial.
For this study, I use the enwik8 dataset — a standard 100M character benchmark widely used in compression and modeling tasks. The metric of interest is Bits per Character (BPC), which directly measures predictive performance: lower BPC means better compression and thus stronger modeling ability.
Example: Prediction as Compression
Let’s walk through a simple example with only two characters: A and B.
If we know nothing about the text, we have to treat them equally. That means we assign the same cost to each character, for example 1 bit each. On average, every character in the dataset costs 1 bit to store.
Now suppose our dataset looks very skewed: A appears 90% of the time and B only 10% of the time. A language model trained on this dataset quickly picks up that pattern and starts predicting:
p(A) = 0.9
p(B) = 0.1
Here’s the key idea: if something is very likely (like A), it doesn’t need much space to describe, because the model already expects it. In contrast, if something is rare (like B), it requires a longer description to make sure we don’t confuse it with A. This trade-off — short for common, long for rare — reduces the average cost.
When we do the math (an optimal code spends −log₂(p) bits on a symbol of probability p):
- A works out to about 0.15 bits per occurrence (−log₂ 0.9)
- B works out to about 3.32 bits per occurrence (−log₂ 0.1)
- Weighted by frequency, the overall cost is 0.9 × 0.15 + 0.1 × 3.32 ≈ 0.47 bits per character
So instead of 1.0 bits per character, the model only needs about 0.47 bits per character on average.
This is exactly what Bits per Character (BPC) measures: the average number of bits required to store text when compressed using the model’s predictions. Better prediction leads to fewer wasted bits, which means lower BPC.
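To make the arithmetic concrete, here is a minimal sketch in plain Python that reproduces the numbers above (the two-symbol probabilities are hard-coded from the example):

import math

# The model's predicted probabilities from the skewed two-character example
p = {"A": 0.9, "B": 0.1}

# An optimal code spends -log2(p) bits on a symbol of probability p
cost = {ch: -math.log2(prob) for ch, prob in p.items()}
print(round(cost["A"], 2), round(cost["B"], 2))   # 0.15 and 3.32 bits per occurrence

# Average cost, weighted by how often each character occurs
bpc = sum(prob * cost[ch] for ch, prob in p.items())
print(round(bpc, 2))                              # ~0.47 bits per character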
My baseline is nanoGPT, a minimal GPT implementation by Andrej Karpathy.
I introduce Rotary Positional Embeddings (RoPE) into this architecture and evaluate whether it improves model efficiency under a fixed parameter budget (~44M parameters).
Character-Level Language Models
A character-level LM predicts the next character given a sequence of previous characters. Unlike word-level models, it must capture both fine-grained patterns (e.g., “th” → “e”) and long-range dependencies across thousands of characters.
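To make “predicts the next character” concrete, here is a minimal sampling sketch. It assumes a hypothetical model that maps a batch of character ids to logits over the vocabulary, plus char↔id lookup tables stoi/itos; none of these names are nanoGPT’s actual API.

import torch
import torch.nn.functional as F

def predict_next_char(model, stoi, itos, context: str) -> str:
    """Sample the next character given a string context.
    `model`, `stoi`, `itos` are placeholders: a network mapping (B, T) char ids
    to (B, T, V) logits, and char <-> id lookup tables."""
    idx = torch.tensor([[stoi[c] for c in context]])    # (1, T) batch of one sequence
    logits = model(idx)                                 # (1, T, V) logits for every position
    probs = F.softmax(logits[0, -1], dim=-1)            # distribution for the last position
    next_id = torch.multinomial(probs, num_samples=1)   # sample one character id
    return itos[next_id.item()]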
Why Positional Embeddings Matter
Transformers process sequences in parallel, so they need explicit positional information.
- Absolute embeddings: add a learned or sinusoidal vector to each token embedding.
- Relative embeddings: represent positions by differences, allowing generalization to unseen lengths.
- RoPE (Rotary Positional Embedding), introduced by Su et al. in RoFormer (2021), provides a neat trick: instead of adding positional vectors, it rotates the query and key vectors inside self-attention. This ensures dot products naturally reflect relative positions.
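To make the rotation idea concrete, here is a minimal sketch of RoPE applied to a query or key tensor, following the interleaved-pair formulation of Su et al. (2021). It is a simplified illustration, not nanoGPT’s or RoFormer’s actual code.

import torch

def apply_rope(x: torch.Tensor) -> torch.Tensor:
    """Rotate pairs of channels of x by a position-dependent angle.
    x: (B, n_head, T, d_head) query or key tensor; d_head must be even."""
    B, H, T, D = x.shape
    # One rotation frequency per channel pair, base 10000 as in Su et al. (2021)
    inv_freq = 1.0 / (10000 ** (torch.arange(0, D, 2).float() / D))   # (D/2,)
    pos = torch.arange(T).float()                                     # (T,)
    angles = torch.einsum("t,f->tf", pos, inv_freq)                   # (T, D/2)
    cos, sin = angles.cos(), angles.sin()

    x1, x2 = x[..., 0::2], x[..., 1::2]                               # even / odd channels
    out = torch.empty_like(x)
    # 2-D rotation of each (x1, x2) pair by its position's angle
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

Because each channel pair is rotated by an angle proportional to its absolute position, the dot product between a rotated query at position i and a rotated key at position j depends only on the offset i − j, which is the relative-position property described above.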
nanoGPT in a Nutshell
As noted in the introduction, nanoGPT is Andrej Karpathy's minimal GPT implementation; it keeps just the essentials for autoregressive text modeling:
- Token & Positional Embeddings – turn characters into vectors
- Transformer Blocks (12 in this study), each with
  - LayerNorm
  - Multi-Head Self-Attention
  - Feed-Forward MLP
- Final Linear Layer – maps hidden states back to vocabulary logits
Despite its simplicity, nanoGPT is strong enough to train on enwik8 and makes a clean research baseline.
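For readers who prefer code to bullet points, here is a simplified PyTorch sketch of one such transformer block. It mirrors the structure above but is not the actual nanoGPT source; names like Block, qkv, and proj are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    """Sketch of one nanoGPT-style block: pre-LayerNorm self-attention and a
    position-wise MLP, each wrapped in a residual connection."""
    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        self.n_head = n_head
        self.ln_1 = nn.LayerNorm(n_embd)
        self.qkv = nn.Linear(n_embd, 3 * n_embd)   # fused Q, K, V projection
        self.proj = nn.Linear(n_embd, n_embd)      # output projection W_O
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(                  # feed-forward: Linear -> GELU -> Linear
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def attention(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # (B, T, C) -> (B, n_head, T, d_head)
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) for t in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # causal attention
        return self.proj(y.transpose(1, 2).reshape(B, T, C))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attention(self.ln_1(x))  # attention sub-layer + residual
        x = x + self.mlp(self.ln_2(x))        # MLP sub-layer + residual
        return x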
What enwik8 text looks like
enwik8 is the first 100 million bytes of a raw English Wikipedia XML dump, so the text mixes article prose with wiki markup, links, and tags.
Why it matters:
- The model sees every single character (letters, digits, whitespace, punctuation, markup).
- Vocabulary is small (~205 symbols: letters (A-Z), digits (0-9), punctuation (.,!?;:"'()[]{}…), and so on), but sequences are long and structured, testing whether the LM can capture dependencies across thousands of characters.
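A minimal sketch of what character-level tokenization over this data looks like (the file path and variable names are illustrative, not nanoGPT's data-prep script; enwik8 is read as raw bytes here):

# Build a character-level vocabulary from the raw text and encode it as ids.
with open("enwik8", "rb") as f:            # path is illustrative
    data = f.read()

chars = sorted(set(data))                  # distinct byte values (~205 in enwik8)
stoi = {b: i for i, b in enumerate(chars)} # byte -> id
itos = {i: b for b, i in stoi.items()}     # id -> byte

encode = lambda s: [stoi[b] for b in s]              # bytes -> list of ids
decode = lambda ids: bytes(itos[i] for i in ids)     # list of ids -> bytes

ids = encode(b"th")
print(ids, decode(ids))                    # two small integers, then b'th'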
raw text
↓ tokenize (char → id, per the character vocabulary)
X: (B, T) = (1, 32) e.g. [19, 8, 5, 3, 42, 53, 12, ...]
# integers in [0..V-1]
↓ token embedding lookup (row-wise table lookup)
Embedding table Wte: (V, d_model) = (205, 16)
TokEmb = Wte[X]: (B, T, d_model) = (1, 32, 16)
↓ positional information (Absolute Positional Embeddings)
Positional table Wpe: (T_max, d_model) = (32, 16)
PosEmb = broadcast Wpe[0:T] → (B, T, d_model) = (1, 32, 16)
↓ add position to token representation (the block’s input)
TokIn = TokEmb + PosEmb # (1, 32, 16)
↓ Transformer Block (×1 layer shown here for illustration; the full model uses 12)
- LayerNorm over last dim (per token)
H0 = LN(TokIn) # (1, 32, 16)
- Self-Attention (per head)
# Linear projections from H0
Q_lin = H0 @ W_Q + b_Q # (1, 32, 16)
K_lin = H0 @ W_K + b_K # (1, 32, 16)
V_lin = H0 @ W_V + b_V # (1, 32, 16)
# reshape to heads
Q = reshape(Q_lin) → (B, n_head, T, d_head) = (1, 1, 32, 16)
K = reshape(K_lin) → (1, 1, 32, 16)
V = reshape(V_lin) → (1, 1, 32, 16)
# scaled dot-product attention with causal mask
scores = Q @ K^T / sqrt(d_head) # (1, 1, 32, 32)
scores[j > i] = -inf (causal mask)
weights = softmax(scores, axis=-1) # (1, 1, 32, 32)
AttnOut = weights @ V # (1, 1, 32, 16)
merge heads → concat over head dim # (1, 32, 16)
# output projection
AttnProj = AttnOut @ W_O + b_O # (1, 32, 16)
# residual connection
H1 = TokIn + AttnProj # (1, 32, 16)
- MLP (position-wise feed-forward: Linear → GELU → Linear)
H2_in = LN(H1) # (1, 32, 16)
MLP_hidden = GELU(H2_in @ W1 + b1) # (1, 32, d_ff)
MLP_out = MLP_hidden @ W2 + b2 # (1, 32, 16)
H2 = H1 + MLP_out # (1, 32, 16)
↓ final LayerNorm
Hf = LN(H2) # (1, 32, 16)
↓ linear head to vocab logits
Logits = Hf @ W_vocab + b_vocab # (1, 32, V) = (1, 32, 205)
↓ cross-entropy vs next-token targets Y
# Y is X shifted left by 1 position (teacher forcing)
loss = CE(Logits, Y) # scalar
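One last practical note: if the cross-entropy above is computed with the natural logarithm (as PyTorch's cross_entropy does), converting the reported loss into the BPC metric used throughout this post is just a change of logarithm base. A minimal sketch:

import math

def bpc_from_ce(ce_loss_nats: float) -> float:
    """Convert mean cross-entropy per character (in nats) to bits per character."""
    return ce_loss_nats / math.log(2)

# Illustrative numbers only: a loss of 0.80 nats/char corresponds to ~1.15 BPC
print(round(bpc_from_ce(0.80), 2))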