Your Sentence Has a Secret Structure. Here’s How GPT Sees It.

Last Updated on March 3, 2026 by Editorial Team Author(s): Rohini Joshi Originally published on Towards AI.

Image generated by ChatGPT

The sentences "dog bites man" and "man bites dog" contain exactly the same words. A Transformer without positional encoding would treat them as identical. Here's how modern LLMs learn word order and then decide which words actually matter.

The previous article explained how embeddings convert words into numbers: vectors in a high-dimensional space where distance reflects meaning. But embeddings alone have a problem. They represent individual words in isolation. They do not capture where a word appears in a sentence, or how it relates to the words around it.

Two mechanisms fix this. Positional encoding tells the model where each word sits. Attention tells the model which words matter for understanding each other word. Together, they are what make Transformers work.

Part 1: Positional Encoding: Teaching Word Order to a Model

The Problem: Without Order, Words Are Just a Bag

Recurrent neural networks (RNNs and LSTMs) process words one at a time, left to right. Word order is built into the architecture: the model sees "the" before "cat" before "sat" because it literally processes them in sequence.

Transformers do not work this way. They process all words simultaneously, in parallel. This makes them much faster to train, but it creates a fundamental problem: without intervention, a Transformer has no idea that "the" comes before "cat", which comes before "sat". Every word is just a floating vector with no address.

Consider these two sentences:

The cat sat on the mat
The mat sat on the cat

The word embeddings are identical in both cases. The same words appear the same number of times. Without positional information, these two sentences are mathematically indistinguishable to the model. That is obviously unacceptable: one describes a normal cat, the other a very unusual mat.
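To see the problem concretely, here is a minimal sketch using made-up 3-dimensional toy vectors (not real embeddings). Any order-insensitive combination of the word vectors, such as summing them, produces exactly the same result for both sentences:

```python
import numpy as np

# Toy 3-dimensional embeddings, made up purely for illustration
emb = {
    "dog": np.array([0.9, 0.1, 0.0]),
    "bites": np.array([0.2, 0.8, 0.3]),
    "man": np.array([0.7, 0.2, 0.1]),
}

sentence_a = ["dog", "bites", "man"]
sentence_b = ["man", "bites", "dog"]

# Without positions, the model starts from the same multiset of vectors.
# Any order-insensitive summary (here: the sum) is identical for both.
bag_a = sum(emb[w] for w in sentence_a)
bag_b = sum(emb[w] for w in sentence_b)

print(np.allclose(bag_a, bag_b))  # True: the two sentences are indistinguishable
```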
The Solution: Add Position to the Embedding

The fix is elegant. Before feeding embeddings into the Transformer, a positional encoding vector is added to each word's embedding. This vector encodes the word's position in the sequence. After the addition, the embedding for "cat" in position 2 is numerically different from "cat" in position 5, even though the word is the same.

final_embedding = word_embedding + positional_encoding

That's it. One addition. But the details of how the positional encoding is constructed make all the difference.

Sinusoidal Positional Encoding

The original "Attention Is All You Need" paper used a mathematical approach based on sine and cosine waves at different frequencies. For each position and each dimension (each of the 300 numbers in the embedding vector from the previous article), the encoding is computed as:

PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

where pos is the word's position, i is the dimension index, and d is the total embedding dimension. This looks abstract, but the intuition is simple: each dimension oscillates at a different frequency. Low dimensions change slowly (capturing broad position information), while high dimensions change rapidly (capturing fine-grained position). Together, they create a unique fingerprint for every position.
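A quick sketch of that one addition, using a toy 4-dimensional encoding and a made-up vector for "cat" (all values illustrative):

```python
import numpy as np

def pe(pos, d):
    """Sinusoidal positional encoding for a single position (toy dimension d)."""
    enc = np.zeros(d)
    i = np.arange(0, d, 2)
    enc[0::2] = np.sin(pos / 10000 ** (i / d))  # even dimensions: sine
    enc[1::2] = np.cos(pos / 10000 ** (i / d))  # odd dimensions: cosine
    return enc

cat = np.array([0.5, -0.2, 0.8, 0.1])  # made-up embedding for "cat"

cat_at_2 = cat + pe(2, d=4)
cat_at_5 = cat + pe(5, d=4)

# Same word, different positions -> different final embeddings
print(np.allclose(cat_at_2, cat_at_5))  # False
```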
```python
import numpy as np
import matplotlib.pyplot as plt

def sinusoidal_positional_encoding(max_len, d_model):
    """Generate positional encodings as described in 'Attention Is All You Need'."""
    pe = np.zeros((max_len, d_model))
    position = np.arange(max_len)[:, np.newaxis]  # shape: (max_len, 1)
    # Compute the division term: 10000^(2i/d_model)
    div_term = 10000 ** (np.arange(0, d_model, 2) / d_model)
    # Apply sin to even indices, cos to odd indices
    pe[:, 0::2] = np.sin(position / div_term)
    pe[:, 1::2] = np.cos(position / div_term)
    return pe

# Generate encodings for 50 positions in a 64-dimensional space
pe = sinusoidal_positional_encoding(max_len=50, d_model=64)

plt.figure(figsize=(14, 6))
plt.imshow(pe, cmap="RdBu", aspect="auto")
plt.xlabel("Embedding Dimension")
plt.ylabel("Word Position in Sentence")
plt.title("Sinusoidal Positional Encoding — Each Position Gets a Unique Pattern")
plt.colorbar(label="Value")
plt.tight_layout()
plt.savefig("positional_encoding_heatmap.png", dpi=150, bbox_inches="tight")
plt.show()
```

Sinusoidal positional encoding for 50 positions across 64 dimensions. Slow waves on the left, fast oscillations on the right; together, they give every position a unique pattern.

Each row is one word position. The left side (low dimensions) shows wide, slow-changing waves, capturing broad position. The right side (high dimensions) shows tight, rapid stripes, capturing exact position. Every row has a unique pattern, which is exactly what the model needs to distinguish positions.

Why Sine and Cosine?

Three properties make this design effective:

Unique positions. No two positions get the same encoding. The model can always tell position 3 from position 17.

Relative distance is learnable. The relationship between position 5 and position 8 is consistent regardless of where in the sentence they occur. This is because sinusoidal functions have a mathematical property: PE(pos + k) can be expressed as a linear function of PE(pos).
The model can learn to detect "3 positions apart" as a pattern.

Generalizes to unseen lengths. Since the encoding is computed from a formula (not looked up from a table), it works for sequences longer than anything seen during training.

```python
# Demonstrating that relative distances are captured
pos_5 = pe[5]
pos_8 = pe[8]
pos_15 = pe[15]
pos_18 = pe[18]

# Distance between position 5 and 8
dist_5_8 = np.linalg.norm(pos_5 - pos_8)
# Distance between position 15 and 18 (same gap, different location)
dist_15_18 = np.linalg.norm(pos_15 - pos_18)

print(f"Distance between position 5 and 8: {dist_5_8:.4f}")
print(f"Distance between position 15 and 18: {dist_15_18:.4f}")
print(f"Difference: {abs(dist_5_8 - dist_15_18):.4f}")

# Distance between adjacent positions vs. far-apart positions
dist_1_2 = np.linalg.norm(pe[1] - pe[2])
dist_1_30 = np.linalg.norm(pe[1] - pe[30])
print(f"\nAdjacent positions (1,2): {dist_1_2:.4f}")
print(f"Far-apart positions (1,30): {dist_1_30:.4f}")
```

```
Distance between position 5 and 8: 3.5813
Distance between position 15 and 18: 3.5813
Difference: 0.0000

Adjacent positions (1,2): 1.4718
Far-apart positions (1,30): 5.6980
```

Nearby positions have smaller distances than far-apart positions. And the same gap (3 positions apart) produces similar distances regardless of absolute position. This is exactly the structure the model needs.

Learned vs. Sinusoidal Encodings

The original Transformer used the fixed sinusoidal approach described above. But modern models like BERT and GPT use learned positional embeddings instead: they treat position as another parameter that gets optimized during training, just like word embeddings.

Both approaches work. The sinusoidal version is mathematically principled and generalizes to longer sequences. The learned version is more flexible and can capture position patterns specific to the training data. In practice, learned encodings tend to perform […]
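A learned positional embedding is just a trainable lookup table with one row per position. Here is a minimal sketch of the idea in NumPy; the table is randomly initialized here, whereas in a real model its rows would be updated by gradient descent, and all sizes and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
max_len, d_model = 512, 64

# A trainable table: one d_model-dimensional row per position.
# In a real model this is a parameter updated by backpropagation,
# exactly like the word-embedding table.
learned_pe = rng.normal(scale=0.02, size=(max_len, d_model))

token_positions = np.arange(5)                    # positions of a 5-token sentence
word_embeddings = rng.normal(size=(5, d_model))   # stand-in word vectors

# Same addition as before, but the encoding is looked up, not computed
final_embeddings = word_embeddings + learned_pe[token_positions]

print(final_embeddings.shape)  # (5, 64)
```

Unlike the sinusoidal formula, this table simply has no rows beyond max_len, which is one reason learned encodings do not extrapolate to sequences longer than those seen in training.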