
ANATOMY OF A CHOICE

How does an LLM decide what comes next? A sequence of precise mathematical transformations.

The Prompt

“The weather today is very…”

The Goal: The model must predict the next token (word) to complete the sentence. It isn’t magic; it is a calculation from raw numbers to a final word.

01

Logits (The Raw Scores)

The journey begins when the model scans its entire vocabulary (often 50,000+ words) and assigns a raw score to every possible next token. These scores are called Logits.

Logits are unnormalized real numbers. They can be positive or negative and have no upper limit. A higher logit means the model thinks the word is more likely, but these numbers do not represent percentages yet.

Hot: 12.5
Cold: 11.0
Cloudy: 6.0
Frog: -2.0

Insight: These are raw outputs from the neural network’s final layer.
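
To make this concrete, here is a minimal Python sketch using the illustrative scores above (in a real model, the logit vector spans the whole vocabulary, not four words):

```python
# Raw logits straight out of the model's final layer (illustrative values).
# They are plain real numbers: unbounded, possibly negative, and they do not sum to 1.
logits = {
    "Hot": 12.5,
    "Cold": 11.0,
    "Cloudy": 6.0,
    "Frog": -2.0,
}
```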

02

Temperature (Controlling Creativity)

Before converting scores to probabilities, we can scale them using the Temperature parameter. We divide all logits by the temperature value.

New_Logit = Old_Logit / Temperature

The Physics of Choice:

  • Conservative (Temp < 1): Differences between numbers are exaggerated. The “winner” becomes much stronger.
  • Creative (Temp > 1): Differences are flattened. Outliers (like “Frog”) get a better fighting chance.
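
In code, temperature scaling is just one division applied before softmax. A minimal sketch continuing the example above (the helper name apply_temperature is illustrative):

```python
def apply_temperature(logits, temperature):
    """Divide every logit by the temperature before softmax is applied."""
    return {token: score / temperature for token, score in logits.items()}

# Temperature < 1 widens the gaps between scores (more conservative);
# Temperature > 1 narrows them, giving outliers like "Frog" a better chance.
conservative = apply_temperature(logits, 0.5)
creative = apply_temperature(logits, 1.5)
```
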
03

Softmax (Enter Probability)

The Softmax function is the translator. It takes the arbitrary logit values and squashes them into a normalized probability distribution.

Softmax(x) = exp(x) / ∑ exp(all_x)

After Softmax, every number is between 0 and 1, and the sum of all numbers is exactly 1 (100%). Now we know the mathematical likelihood of each word:

Hot: 70%
Cold: 20%
Cloudy: 8%
Frog: 2%
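
Here is a minimal softmax sketch in Python, continuing the running example (the percentages above are rounded, illustrative values; the exact numbers depend on the logits and the temperature used):

```python
import math

def softmax(logits):
    """Squash raw logits into probabilities that lie between 0 and 1 and sum to 1."""
    # Subtracting the maximum logit first is a standard trick for numerical stability.
    max_logit = max(logits.values())
    exps = {token: math.exp(score - max_logit) for token, score in logits.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}

# Using the logits from step 01 (optionally after temperature scaling from step 02).
probs = softmax(logits)
```
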
04 & 05

Filtering (Sort & Nucleus Sampling)

First, we Sort the vocabulary from highest probability to lowest. Then, we apply Top-p (Nucleus) Sampling.

This technique sets a cumulative probability threshold (e.g., P = 0.90). Moving down the sorted list, we keep tokens until their cumulative probability reaches 90% and aggressively discard the rest. This prevents the model from choosing nonsensical words from the “tail” of the distribution.

Filtering Logic (Target P ≥ 0.90)

Hot (0.70): cumulative 0.70 → KEEP
Cold (0.20): cumulative 0.90 → KEEP
Cloudy (0.08): threshold already reached → DISCARD
Frog (0.02): threshold already reached → DISCARD
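
Putting the last two steps together, here is a minimal sketch of sorting plus nucleus sampling (the function names top_p_filter and sample_next_token are illustrative, not from any particular library):

```python
import random

def top_p_filter(probs, p=0.90):
    """Keep the smallest set of top-ranked tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break  # threshold reached; the remaining "tail" is discarded
    # Renormalize the survivors so they form a valid distribution again.
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

def sample_next_token(probs, p=0.90):
    """Draw one token at random from the nucleus-filtered distribution."""
    filtered = top_p_filter(probs, p)
    tokens, weights = zip(*filtered.items())
    return random.choices(tokens, weights=weights, k=1)[0]

# With the example distribution, only "Hot" and "Cold" survive (0.70 + 0.20 >= 0.90),
# so the final draw is between those two tokens.
probs = {"Hot": 0.70, "Cold": 0.20, "Cloudy": 0.08, "Frog": 0.02}
print(sample_next_token(probs))
```

The random draw at the end is why the same prompt can produce different completions from one run to the next.
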
Ali Reza Rashidi is a BI analyst with over nine years of experience and the author of three books on data and management.
