In the previous post, we saw how the MLP organizes neurons into layers and how data flows through the feedforward pass. By the end, the network was producing predictions with fixed weights and we were measuring the error — but the network wasn’t doing anything with it.
Now let’s fix that: how does the network use the error to learn?
Optimization
Training a neural network is, at its core, an optimization problem: the goal is to find the set of weights that produces the smallest possible error between the network’s output and the expected result. Without an optimization algorithm, weights would remain static and the network would be incapable of learning.
There are several strategies for finding the ideal weights:
- Brute Force: test every possible weight combination. Computationally infeasible: weights are continuous values, so even a coarse grid over a small network’s weights produces an astronomical number of combinations.
- Simulated Annealing: a probabilistic method that searches for the global minimum by temporarily accepting worse solutions to avoid getting trapped in local minima.
- Genetic Algorithms: inspired by biological evolution, using mutation and crossover. Different populations of weights are tested, and the sets with the lowest error are selected to breed the next generation.
- Particle Swarm Optimization (PSO): modeled on the social behavior of animal groups. Several “particles” (weight sets) explore the error space, sharing the best positions they’ve found to converge toward the goal.
In this post, we go with Gradient Descent. The choice comes down to its directionality and scalability. By using differential calculus to estimate the direction of error reduction — rather than relying purely on random variation — the process becomes far faster and more viable for large networks.
Gradient Descent
After measuring the error, we use Gradient Descent to adjust the weights. The goal is to find the direction in which to change each weight so that the network’s overall error is minimized.
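To make the idea concrete before moving to a full network, here is a minimal sketch of Gradient Descent on a single weight. The error curve `(w - 3)^2`, the starting value, and the learning rate are made up for illustration; none of them come from the XOR network used later.

```python
# Toy error curve with its minimum at w = 3 (hypothetical example).
def error(w):
    return (w - 3.0) ** 2


def error_slope(w):
    # Derivative of (w - 3)^2 with respect to w
    return 2.0 * (w - 3.0)


w = 0.0               # arbitrary starting weight
learning_rate = 0.1   # step size

for step in range(100):
    # Step against the slope, i.e. downhill on the error curve
    w -= learning_rate * error_slope(w)

print(f"w = {w:.4f}, error = {error(w):.8f}")
```

Each step shrinks the distance to the minimum by a constant factor, so `w` ends up very close to 3. A real network does the same thing, only with one slope per weight.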
The Sigmoid Derivative
To adjust the weights, we need to know how a neuron’s output responds to small changes in its input. Since we’re using the Sigmoid function, its derivative can be computed efficiently using the activation result $y$ itself: $$ d = y \cdot(1-y) $$ This formula tells us the neuron’s sensitivity:
- If $y$ is close to 0.5, the derivative reaches its maximum value ($0.25$). The neuron is in a high-sensitivity zone — small weight changes have a large impact on the output.
- If $y$ is close to 0 or 1, the derivative approaches zero. The neuron is saturated — weight adjustments will have little effect on the output.
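The two cases above can be checked numerically. This is a small sketch with illustrative raw inputs (-6, 0 and 6); note that `sigmoid_derivative` takes the activation $y$, not the raw input.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def sigmoid_derivative(y):
    # y is the sigmoid output, not the raw input
    return y * (1 - y)


raw_inputs = np.array([-6.0, 0.0, 6.0])   # illustrative values
activations = sigmoid(raw_inputs)
sensitivities = sigmoid_derivative(activations)

for x, y, d in zip(raw_inputs, activations, sensitivities):
    print(f"x = {x:+.1f} | y = {y:.4f} | derivative = {d:.4f}")
```

The middle row sits at $y = 0.5$ with a derivative of 0.25, while the two saturated rows have derivatives below 0.01.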
Delta
The Delta represents the local gradient of a neuron. It combines the error assigned to the neuron with its sensitivity (the derivative). However, the Delta calculation varies depending on the neuron’s position in the network:
- Output Layer: since we have the expected value (target), the calculation is direct: $\delta_{output} = (target - prediction) \cdot \text{derivative}$
- Hidden Layer: since we have no expected value for intermediate neurons, the error is estimated based on how much the hidden neuron contributed to the next layer. The error is “carried back” through the weights: $\delta_{hidden} = \text{derivative} \cdot (\delta_{output} \cdot weight)$. When the next layer has more than one neuron, the $\delta \cdot weight$ terms from all of them are summed.
This distinction is the foundation of the backpropagation algorithm, allowing the output error to be distributed proportionally across all neurons in the network.
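As a worked example of the two formulas, consider a single hidden neuron feeding a single output neuron. All numbers below are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
def sigmoid_derivative(y):
    # Derivative of the sigmoid, written in terms of its output y
    return y * (1 - y)


# Hypothetical values for one hidden neuron feeding one output neuron
target = 1.0
prediction = 0.8          # activation of the output neuron
hidden_activation = 0.6   # activation of the hidden neuron
weight = 0.5              # connection from the hidden neuron to the output neuron

# Output layer: the error is known directly from the target
delta_output = (target - prediction) * sigmoid_derivative(prediction)

# Hidden layer: the output delta is carried back through the weight
delta_hidden = sigmoid_derivative(hidden_activation) * (delta_output * weight)

print(f"delta_output = {delta_output:.5f}")
print(f"delta_hidden = {delta_hidden:.5f}")
```

The hidden delta comes out much smaller than the output delta, because it is scaled by both the connecting weight and the hidden neuron’s own sensitivity.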
Weight Update
This is where the network “acts” on the calculated error, modifying its internal structure to improve performance on the next iteration. To control this process, we use an essential parameter: the learning rate.
The update formula used in this project is: $$ weight_{n+1} = weight_{n} + (input \cdot \delta \cdot \eta) $$ Where:
- $\eta$ (Learning Rate): controls the magnitude of each adjustment. Values that are too high cause instability; values that are too low make convergence painfully slow.
- $input$: the signal coming from the previous layer.
- $\delta$: the local gradient calculated for the destination neuron.
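Plugging numbers into the formula shows how small a single adjustment is. The values here are hypothetical and stand in for one connection between a hidden neuron and an output neuron.

```python
# Hypothetical values for a single connection
weight = 0.5          # current weight
input_signal = 0.6    # activation coming from the previous layer
delta = 0.032         # local gradient of the destination neuron
learning_rate = 0.5   # eta

# weight_{n+1} = weight_n + (input * delta * eta)
new_weight = weight + (input_signal * delta * learning_rate)

print(f"old weight = {weight:.4f}, new weight = {new_weight:.4f}")
```

The weight moves by less than 0.01, which is why training needs many repetitions of this step.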
There’s also a technique called momentum, but it won’t be used in the code below. In classic momentum, we store a “velocity” for each weight based on previous adjustments — this is different from simply multiplying the current weight by a constant.
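For reference only, here is a rough sketch of what classic momentum could look like on top of the update rule above. The momentum coefficient of 0.9 and the repeated `adjustments` (stand-ins for $input \cdot \delta$ on three consecutive iterations) are assumptions for illustration, not part of this project’s code.

```python
import numpy as np

learning_rate = 0.5
momentum = 0.9   # hypothetical coefficient, not used elsewhere in this post

weights = np.array([0.5, -0.3])
velocity = np.zeros_like(weights)   # one velocity entry per weight

# Stand-ins for input * delta over three consecutive iterations
adjustments = [
    np.array([0.02, -0.01]),
    np.array([0.02, -0.01]),
    np.array([0.02, -0.01]),
]

for adjustment in adjustments:
    # Keep a fraction of the previous step and add the new one
    velocity = momentum * velocity + learning_rate * adjustment
    weights += velocity

print("velocity:", velocity)
print("weights:", weights)
```

Because the adjustments keep pointing in the same direction, the velocity grows from step to step, so the weights move faster than they would with the plain update.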
Batch vs. Stochastic
An important decision in neural network training is how often to update the weights relative to the volume of data. There are three main approaches:
- Batch Gradient Descent: the algorithm computes the error over the entire dataset before making a single weight update. The most stable method, but can be slow and memory-intensive on large datasets. This is the technique used in this post.
- Stochastic Gradient Descent (SGD): weights are updated after processing each individual sample. Very fast, and frequent updates can help the network escape local minima, but convergence is noisier.
- Mini-batch Gradient Descent: data is split into small groups, and weights are updated after each group. The best balance between stability and speed — the standard approach in most modern neural networks.
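In code, the three approaches differ mainly in how many samples go into each weight update. The sketch below only counts updates per pass over the data; `iterate_batches` is a hypothetical helper, the dataset size of 8 is arbitrary, and the shuffling that SGD and mini-batch training normally apply is left out for brevity.

```python
import numpy as np


def iterate_batches(num_samples, batch_size):
    """Yield index arrays that together cover the dataset once."""
    indices = np.arange(num_samples)
    for start in range(0, num_samples, batch_size):
        yield indices[start:start + batch_size]


num_samples = 8  # hypothetical dataset size

# Each yielded index array corresponds to one weight update
batch_updates = list(iterate_batches(num_samples, batch_size=num_samples))
sgd_updates = list(iterate_batches(num_samples, batch_size=1))
mini_batch_updates = list(iterate_batches(num_samples, batch_size=4))

print("Batch updates per pass:     ", len(batch_updates))
print("SGD updates per pass:       ", len(sgd_updates))
print("Mini-batch updates per pass:", len(mini_batch_updates))
```

With 8 samples, Batch makes 1 update per pass, SGD makes 8, and Mini-batch with groups of 4 makes 2.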
Backpropagation
Backpropagation is the process that puts Gradient Descent into practice: it propagates the output error back to the initial layers, computing the gradients needed to adjust each weight. While feedforward is the execution step, backpropagation is the correction step.
At each iteration (epoch), the network runs this cycle:
- Feedforward to generate a prediction.
- Calculate the error by comparing with the target.
- Backpropagate to find the gradients.
- Update the weights to reduce the error next time.
Epochs
In the practical examples below, you’ll notice a variable called epochs. In machine learning, an epoch represents one complete pass of the entire dataset through the neural network (feedforward and backpropagation).
As we saw with Gradient Descent, weight adjustments happen in small steps controlled by the learning rate. A single pass moves the weights only slightly, so training usually takes many epochs: the scripts below run hundreds of thousands to a million.
It’s through repetition that the weights slowly converge to their ideal values, minimizing the error until the network can solve the problem at hand.
Matrix Transposition
Looking at the code below, you’ll notice frequent use of the transposition operation (.T or np.transpose). In linear algebra, to multiply two matrices (which is what np.dot does for 2-D arrays), the number of columns in the first matrix must equal the number of rows in the second.
Transposition inverts rows and columns, aligning the dimensions and making backpropagation and gradient calculations possible. Without it, the network would be unable to map the output error back to the correct connections in earlier layers.
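A quick shape check makes this tangible. The arrays below are filled with ones and only mimic the shapes of a network with 4 samples, 3 hidden neurons and 1 output; the variable names mirror the ones used in the training script.

```python
import numpy as np

hidden_layer_output = np.ones((4, 3))     # 4 samples, 3 hidden neurons
weights_hidden_output = np.ones((3, 1))   # 3 hidden neurons -> 1 output
delta_output = np.ones((4, 1))            # one delta per sample

# Carry the output delta back to the hidden layer: (4, 1) x (1, 3) -> (4, 3)
error_at_hidden = np.dot(delta_output, weights_hidden_output.T)

# Gradient for the hidden -> output weights: (3, 4) x (4, 1) -> (3, 1)
output_gradient = np.dot(hidden_layer_output.T, delta_output)

print(error_at_hidden.shape, output_gradient.shape)

try:
    # (4, 1) x (3, 1): columns of the first do not match rows of the second
    np.dot(delta_output, weights_hidden_output)
except ValueError as exc:
    print("Without transposition:", exc)
```

The gradient ends up with the same shape as the weight matrix it updates, which is exactly what the update step needs.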
Practice: Training XOR
The script below runs the full training cycle to solve the XOR problem, iterating through many epochs until the network minimizes the error and learns the correct logic.
import numpy as np

np.random.seed(42)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # Sigmoid derivative expressed in terms of its output
    return x * (1 - x)

# 1. Dataset Setup (XOR)
inputs = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])
expected_outputs = np.array([[0], [1], [1], [0]])

# 2. Weight Initialization
# Input (2) -> Hidden Layer (3)
weights_input_hidden = 2 * np.random.random((2, 3)) - 1
# Hidden Layer (3) -> Output (1)
weights_hidden_output = 2 * np.random.random((3, 1)) - 1

# 3. Training Parameters
epochs = 1_000_000
learning_rate = 0.6

if __name__ == '__main__':
    for epoch in range(epochs):
        # --- FORWARD PROPAGATION ---
        hidden_layer_input = np.dot(inputs, weights_input_hidden)
        hidden_layer_output = sigmoid(hidden_layer_input)
        output_layer_input = np.dot(hidden_layer_output, weights_hidden_output)
        predictions = sigmoid(output_layer_input)

        # --- ERROR CALCULATION ---
        error = expected_outputs - predictions
        mean_absolute_error = np.mean(np.abs(error))

        # Log progress every 10,000 epochs
        if epoch % 10000 == 0:
            print(f"Epoch {epoch} | MAE: {mean_absolute_error:.6f}")

        # --- BACKPROPAGATION ---
        # 1. Output Delta
        output_derivative = sigmoid_derivative(predictions)
        delta_output = error * output_derivative

        # 2. Propagate error to the Hidden Layer
        # Transpose weights to align dimensions and carry the delta back
        weights_hidden_output_transposed = weights_hidden_output.T
        delta_hidden_layer = np.dot(delta_output, weights_hidden_output_transposed) * sigmoid_derivative(hidden_layer_output)

        # --- WEIGHT UPDATE ---
        # Adjust: Hidden Layer -> Output
        # Transpose the activation to align with the delta in the dot product
        output_gradient = np.dot(hidden_layer_output.T, delta_output)
        weights_hidden_output += output_gradient * learning_rate

        # Adjust: Input -> Hidden Layer
        # Transpose the input to align with the hidden layer delta
        hidden_gradient = np.dot(inputs.T, delta_hidden_layer)
        weights_input_hidden += hidden_gradient * learning_rate

    print("\n--- Training Complete ---")
    print(f"Final predictions:\n{predictions}")
    print(f"Final Error (MAE): {mean_absolute_error:.6f}")
In this example, the network runs one million iterations. At each cycle, it senses the error through backpropagation and uses Gradient Descent to nudge the weights toward the solution.
By the end, the predictions for all four XOR combinations should be very close to the target values (0, 1, 1, 0), showing that adding a hidden layer and iterative learning overcomes the original Perceptron’s limitation.
Bias
So far, we’ve seen the neuron multiply its inputs by weights and sum the results. But if all inputs are zero, the sum is necessarily zero too, whatever the weights are, so a sigmoid neuron is stuck at an output of 0.5. The neuron has no flexibility to learn patterns that don’t pass exactly through the origin.
To solve this, we use bias, an extra value added to the calculation that sets the starting point for each neuron.
Think of it this way: imagine the neuron is deciding whether to buy Mass Effect 4 on launch day. The inputs are the price and the review score, and the weights are how much you care about each. But what about your personal taste? That’s where bias comes in:
- Negative Bias (skeptic): you’re a certain friend of mine. Even if the game is cheap and well-reviewed, you still need the inputs to do a lot of convincing.
- Positive Bias (fan): you’re like me, already in love with the franchise. Even if the price is high or the score is middling, you’re already inclined to buy.
In the neural network, bias gives the neuron that freedom — allowing it to fire even when inputs are low, or stay silent even when inputs are high. During training, the network learns the ideal bias value for each neuron, adjusting its “threshold” until it finds the best configuration to solve the problem.
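Here is a small sketch of that effect on a single sigmoid neuron whose inputs are all zero. The weights and the bias values of -3 and +3 are arbitrary, picked only to mirror the skeptic and the fan from the analogy.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


inputs = np.array([0.0, 0.0])
weights = np.array([0.8, -0.4])   # arbitrary illustrative weights

weighted_sum = np.dot(inputs, weights)   # always 0 when every input is 0

without_bias = sigmoid(weighted_sum)
skeptic = sigmoid(weighted_sum + (-3.0))   # negative bias
fan = sigmoid(weighted_sum + 3.0)          # positive bias

print(f"no bias: {without_bias:.4f}")
print(f"skeptic: {skeptic:.4f}")
print(f"fan:     {fan:.4f}")
```

Without bias the neuron sits at exactly 0.5 no matter what the weights are. The bias is what lets it lean toward 0 or toward 1 for the same inputs.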
Practice: XOR with Bias
In the script below, we add a bias vector for the hidden layer and another for the output layer. Like the weights, they’re initialized randomly once and then updated every epoch, allowing the network to find the ideal “baseline” for each neuron.
import numpy as np

np.random.seed(42)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    return x * (1 - x)

# XOR Dataset
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
expected_outputs = np.array([[0], [1], [1], [0]])

# Weight and Bias Initialization
weights_input_hidden = np.random.uniform(size=(2, 3))
weights_hidden_output = np.random.uniform(size=(3, 1))

# Bias initialized as a vector for each layer
bias_hidden = np.random.uniform(size=(1, 3))
bias_output = np.random.uniform(size=(1, 1))

epochs = 500_000
learning_rate = 0.3

if __name__ == '__main__':
    for epoch in range(epochs):
        # --- FORWARD PROPAGATION ---
        # Add bias to the dot product
        hidden_layer_input = np.dot(inputs, weights_input_hidden) + bias_hidden
        hidden_layer_output = sigmoid(hidden_layer_input)
        output_layer_input = np.dot(hidden_layer_output, weights_hidden_output) + bias_output
        predictions = sigmoid(output_layer_input)

        # --- ERROR CALCULATION ---
        error = expected_outputs - predictions
        if epoch % 50000 == 0:
            print(f"Epoch {epoch} | MAE: {np.mean(np.abs(error)):.6f}")

        # --- BACKPROPAGATION ---
        delta_output = error * sigmoid_derivative(predictions)
        delta_hidden_layer = np.dot(delta_output, weights_hidden_output.T) * sigmoid_derivative(hidden_layer_output)

        # --- WEIGHT AND BIAS UPDATE ---
        # Bias update follows the same logic as weights, but its "input" is always 1
        weights_hidden_output += np.dot(hidden_layer_output.T, delta_output) * learning_rate
        bias_output += np.sum(delta_output, axis=0) * learning_rate
        weights_input_hidden += np.dot(inputs.T, delta_hidden_layer) * learning_rate
        bias_hidden += np.sum(delta_hidden_layer, axis=0) * learning_rate

    print("\n--- Training with Bias Complete ---")
    print(f"Final predictions:\n{predictions}")
In the next post, we’ll take this knowledge to real-world problems and explore how to better measure error, handle multiple outputs, and understand the limits and extensions of the MLP architecture.