Illustrating how Neural Networks train

2025-12-05 AI

Neural network training is the source of many common misunderstandings about how training works and what a neural network actually does. This is exacerbated by the mysticism that surrounds large language models (LLMs) and modern AI systems.

It is important to understand that, at the time of writing, AI models do not understand the context in which they operate. They are simply approximations of a statistical distribution. What people often fail to grasp is just how much textual data already exists and how much computational cost has been sunk into these models in a desperate attempt not to run out of money before reaching a useful state.

Brief Overview of Important Concepts

Supervised learning, and particularly formal learning theory, is well studied, and there are plenty of resources for understanding it. In the context of a large language model, it can be understood as asking "is this a good continuation of the sentence?" when the model is fed some text examples. This is not intuitive to many people, and it is also somewhat deceiving, considering how many other concepts go into LLMs, such as prompts and, in particular, additional training steps that do not directly involve text, or not only text but other modalities as well.
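To make the "good continuation" framing concrete, here is a minimal sketch in Python. The three-word vocabulary and the hand-written probabilities are made up for illustration rather than coming from a trained model; a real LLM produces such a distribution from a neural network, and training pushes those probabilities toward the continuations actually seen in the data.

```python
import math

# Toy "model": hand-written next-token probabilities for a tiny vocabulary.
NEXT_TOKEN_PROBS = {
    ("the",): {"cat": 0.6, "dog": 0.3, "the": 0.1},
    ("the", "cat"): {"sat": 0.7, "dog": 0.1, "the": 0.2},
}

def continuation_log_likelihood(prefix, continuation):
    """Score how 'good' a continuation is: the sum of log-probabilities
    the model assigns to each of its tokens, given what came before."""
    context = tuple(prefix)
    total = 0.0
    for token in continuation:
        probs = NEXT_TOKEN_PROBS.get(context, {})
        p = probs.get(token, 1e-9)  # unseen continuations get ~zero probability
        total += math.log(p)
        context = context + (token,)
    return total

# "the cat sat" is judged a better continuation of "the" than "the the dog".
print(continuation_log_likelihood(["the"], ["cat", "sat"]))
print(continuation_log_likelihood(["the"], ["the", "dog"]))
```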

AlphaGo Zero (Silver et al. (2017)) is a famous reinforcement learning example that is also remarkably easy to understand: given just the rules of the game and the "incentive" that winning is good and losing is bad, they trained a computer to play against itself without any human data. Given that games tend to provide good feedback in terms of winning or losing, and that the action space (meaning the possible moves) is constrained by a written-down rule set, it already seems plausible how a computer might learn this.
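A heavily simplified sketch of the core idea, keeping only the "the outcome becomes the training signal" part and leaving out the search, the network, and the actual game: every position seen during a self-play game is labelled with the final result from the point of view of the player to move, and those labels become the targets the model is trained toward.

```python
def value_targets(positions_with_player, winner):
    """Turn one finished self-play game into training targets.

    positions_with_player: list of (position, player_to_move) seen during the game
    winner: the player who won, or None for a draw
    Returns (position, target) pairs: +1 for the eventual winner's positions,
    -1 for the loser's, 0 for a draw.
    """
    targets = []
    for position, player in positions_with_player:
        if winner is None:
            target = 0.0
        else:
            target = 1.0 if player == winner else -1.0
        targets.append((position, target))
    return targets

# One imaginary game where player "X" ends up winning.
game = [("start", "X"), ("after move 1", "O"), ("after move 2", "X")]
print(value_targets(game, winner="X"))
# [('start', 1.0), ('after move 1', -1.0), ('after move 2', 1.0)]
```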

Take DeepSeek (not peer reviewed, but Hou et al. (2025)), on the other hand. Its learning modalities appear strange at first, but there are rules that can be derived from what we expect the output to look like. The reasoning component of LLMs can even be tested. That does not mean they have an inherent understanding of reasoning, but here we also have well-defined structures for what "reasoning" looks like.
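A hedged sketch of what such a rule-derived reward could look like (the exact checks DeepSeek used are more involved; the function name and the output format here are my own illustration): the output is scored purely by whether it follows the expected structure and ends in the right answer, with no notion of "understanding" anywhere.

```python
import re

def reasoning_reward(output: str, expected_answer: str) -> float:
    """Score a model output with simple, fully mechanical rules:
    reward a visible reasoning block and a correct final answer."""
    reward = 0.0
    # Rule 1: the output should contain an explicit reasoning section.
    if re.search(r"<think>.*</think>", output, flags=re.DOTALL):
        reward += 0.5
    # Rule 2: the final answer after the reasoning must match exactly.
    match = re.search(r"Answer:\s*(.+)\s*$", output.strip())
    if match and match.group(1).strip() == expected_answer:
        reward += 1.0
    return reward

good = "<think>2 apples plus 3 apples is 5 apples.</think>\nAnswer: 5"
bad = "The answer is probably five."
print(reasoning_reward(good, "5"))  # 1.5: well-formed and correct
print(reasoning_reward(bad, "5"))   # 0.0: neither rule satisfied
```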

Interactive demos

Supervised: next-iteration training

Points are coloured by their true label (blue / red). The marker outline shows the network's prediction. X (horizontal) and Y (vertical) both range from -1 to 1.

Interactive feedback (simple RL-like)

Use the four (vibe-based) buttons to signal whether the current separator looks correct. This feedback nudges the network's parameters.

Additional sample: next-iteration

Same as the reinforcement sample, but this time we compute the score and you just press next, which is arguably much easier and less vibes-based.

Applied reinforcement score

Current score (mapped from supervised loss):

When you press "Next iteration" on the left, the score shown here is passed to selectNextAction for the reinforcement learner.

Basic description

This demo shows you the simplest unit of a neural network: a perceptron. The neuron metaphor for perceptrons has always bugged me, but if it helps you understand the concept, think of it like a brain cell (even though it is much less than that). A perceptron takes multiple inputs, applies learnable weights to them, sums them up, and puts the result through an activation function. This part is actually somewhat brain-like: neurons fire together when they recognize a concept. However, neurons in the human brain are interconnected in a more complex way, as there are many avenues for "in" and "out", and the flow is not as unidirectional as in current neural networks, even if you take more complex architectures into account.
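In code, a perceptron is only a few lines. Here is a minimal Python sketch (the demo itself presumably runs in the browser; the particular weights and the sigmoid are just one common choice of activation, not necessarily what the demo uses):

```python
import math

def perceptron(inputs, weights, bias):
    """One 'neuron': a weighted sum of the inputs plus a bias,
    squashed by an activation function (here: the sigmoid)."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-weighted_sum))  # sigmoid activation

# A 2D point (x, y) in [-1, 1] x [-1, 1], as in the demo above.
point = (0.3, -0.8)
weights = (1.5, -2.0)   # learnable parameters
bias = 0.1              # learnable parameter
# Output near 1 is one class, near 0 the other (label mapping is arbitrary).
print(perceptron(point, weights, bias))
```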

What you do is press buttons. For the "we have data" case it is quite straightforward: you have points with labels, and you just press "next iteration" as the neural network learns based on a loss (meaning error) function. You do not have much control over what exactly it learns, but you can choose to continue or stop at any time.
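For the supervised demo, each press of "next iteration" amounts to something like the following sketch (assuming the same single sigmoid perceptron as above and a squared-error loss; the demo's actual loss and learning rate may differ): compute the error on the labelled points, and move every weight a little bit in the direction that reduces it.

```python
import math

def predict(x, y, weights, bias):
    """Sigmoid perceptron, as in the previous sketch."""
    s = weights[0] * x + weights[1] * y + bias
    return 1.0 / (1.0 + math.exp(-s))

def train_step(points, labels, weights, bias, lr=0.5):
    """One supervised 'next iteration': gradient descent on squared error."""
    grad_w = [0.0, 0.0]
    grad_b = 0.0
    for (x, y), label in zip(points, labels):
        pred = predict(x, y, weights, bias)
        err = pred - label                    # how wrong we are on this point
        d = err * pred * (1.0 - pred)         # chain rule through the sigmoid
        grad_w[0] += d * x
        grad_w[1] += d * y
        grad_b += d
    n = len(points)
    new_weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    new_bias = bias - lr * grad_b / n
    return new_weights, new_bias

points = [(-0.5, 0.2), (0.7, 0.9), (-0.8, -0.6), (0.4, -0.3)]
labels = [0, 1, 0, 1]                         # "blue" = 0, "red" = 1
weights, bias = [0.0, 0.0], 0.0
for _ in range(200):                          # 200 presses of "next iteration"
    weights, bias = train_step(points, labels, weights, bias)
print(weights, bias)
```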

The next graph is slightly more complex. Here we train the neural network through policies. A policy, in reinforcement learning terms, is a set of rules by which our learner (a specific program that manipulates the neural network) chooses modifications. Sometimes these modifications are more complex, since the general idea is still to find a good representation of the real world in terms of a neural network, but without access to direct backpropagation of errors to our weights. Instead, we sample and explore, and based on the scoring of our output, we try to nudge our network in a better direction.
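One very simple way to implement that idea (this is my own sketch of a score-guided random search, not necessarily what the demo does): propose a small random change to the parameters, keep it if the external score improves, and discard it otherwise. No gradients flow from the score back to the weights; the score only decides which proposals survive.

```python
import random

def propose(params, step=0.1):
    """Sample one candidate modification: jitter every parameter a little."""
    return [p + random.gauss(0.0, step) for p in params]

def rl_like_training(score_fn, params, iterations=500):
    """Learn from an external score alone, with no backpropagation.
    score_fn(params) plays the role of the human pressing the feedback buttons."""
    best_score = score_fn(params)
    for _ in range(iterations):
        candidate = propose(params)
        candidate_score = score_fn(candidate)
        if candidate_score > best_score:       # keep only improvements
            params, best_score = candidate, candidate_score
    return params

# Toy score: how many of four labelled points a linear separator gets right.
points = [(-0.5, 0.2, 0), (0.7, 0.9, 1), (-0.8, -0.6, 0), (0.4, -0.3, 1)]
def score(params):
    w0, w1, b = params
    return sum(int((w0 * x + w1 * y + b > 0) == bool(label)) for x, y, label in points)

print(rl_like_training(score, [0.0, 0.0, 0.0]))
```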

It might be a little frustrating for you to use this graph. Of course, this is usually automated, and compared to the supervised case, where you just press a button, it is more work, particularly on your part. However, it is already quite interesting to see that even if we fully decouple the "rating" from the network, we can still learn a representation of what we see in the real world.

Try it and experiment

Click "Next iteration" several times in the supervised demo to watch the decision boundary approach the true separator. In the reinforcement case, there are four buttons you can use to indicate what you think the current success of the last change (or rather, of the current outcome) is.

A note on the sampling

We employ the Thompson sampling strategy, which you can read about in Yang et al. (2024), a literature review of Thompson sampling. What this "sampling" does is attempt to get a good view of the action space, meaning the space of potential modifications the learning algorithm can make to its current best guess at what we want from the data. You can see directly that this kind of sampling is much more prone to getting stuck or falling into local minima, as our data is sparse, and while we can give nudges here and there, it is not comparable to the full backpropagation of errors we have in the supervised case. However, particularly if we already have very robust pre-training and our loss function is sufficiently well informed, such as for rational behavior in LLMs, this kind of feedback can be very powerful. DeepSeek was trained specifically on reasoning tasks, which have a very well-defined structure, so the feedback could be very targeted.
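For completeness, here is a minimal Beta-Bernoulli Thompson sampling sketch over a small, discrete set of candidate "modifications" (the demo's actual action space and reward mapping are its own; treat the numbers here as placeholders). Each action keeps a Beta posterior over its success probability; we sample from every posterior, play the action with the highest sample, and update that action's posterior with the observed feedback.

```python
import random

class ThompsonSampler:
    """Beta-Bernoulli Thompson sampling over a fixed set of actions."""
    def __init__(self, n_actions):
        # One Beta(successes + 1, failures + 1) posterior per action.
        self.successes = [0] * n_actions
        self.failures = [0] * n_actions

    def select_next_action(self):
        # Sample a plausible success rate for every action, pick the best sample.
        samples = [random.betavariate(s + 1, f + 1)
                   for s, f in zip(self.successes, self.failures)]
        return max(range(len(samples)), key=lambda i: samples[i])

    def update(self, action, reward):
        # Binary feedback: 1 = "that looked good", 0 = "that looked bad".
        if reward:
            self.successes[action] += 1
        else:
            self.failures[action] += 1

# Three candidate modifications with hidden success probabilities.
true_probs = [0.2, 0.5, 0.8]
sampler = ThompsonSampler(len(true_probs))
for _ in range(1000):
    a = sampler.select_next_action()
    sampler.update(a, random.random() < true_probs[a])
print(sampler.successes)  # the third action ends up tried (and rewarded) most often
```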

neural-networks education interactive