In the previous post, we explored the Perceptron: we saw that it can learn and make simple decisions, but we also discovered it has a hard limit. Being a linear classifier, it can only solve problems where the data can be separated by a straight line.
The most iconic limitation in the history of neural networks is the logical XOR (Exclusive OR) operator. Since the points of that problem can’t be split by a single line, a lone Perceptron never converges: it keeps adjusting its weights indefinitely without ever learning the rule.
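To see that limit concretely, here is a minimal sketch that trains a single Perceptron on XOR and, for comparison, on AND (which is linearly separable). It assumes the classic Perceptron learning rule with a step activation. The function name `train_perceptron`, the learning rate, and the epoch cap are illustrative choices, not code from the previous post.

```python
import numpy as np

# Truth-table inputs shared by both problems
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])
y_and = np.array([0, 0, 0, 1])


def train_perceptron(X, y, epochs=1000, lr=1.0):
    """Classic Perceptron rule (illustrative helper).

    Returns the index of the first epoch with zero mistakes,
    or None if that never happens within `epochs`.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for epoch in range(epochs):
        mistakes = 0
        for xi, target in zip(X, y):
            pred = 1 if np.dot(w, xi) + b > 0 else 0
            update = lr * (target - pred)
            if update != 0:
                w += update * xi
                b += update
                mistakes += 1
        if mistakes == 0:
            return epoch
    return None


xor_result = train_perceptron(X, y_xor)
and_result = train_perceptron(X, y_and)

print("XOR converged at epoch:", xor_result)  # None: no mistake-free epoch exists
print("AND converged at epoch:", and_result)
```

The epoch cap only exists so the XOR run terminates. Raising it does not help, because no single line classifies all four XOR points correctly.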
To overcome this obstacle, two fronts had to evolve: how neurons are organized, and how they “fire” information forward. In this post, we’ll understand how Multilayer Perceptrons (MLPs) tackle that complexity.
Introduction to Multilayer Networks
A Multilayer Perceptron (MLP) is nothing more than a collection of Perceptrons organized into a more robust structure. While the original model had only inputs and a direct output, the MLP introduces a crucial element: hidden layers.
Structure
In a multilayer network, neurons are organized into three types of layers:
- Input Layer: where the data (the features) enters the network.
- Hidden Layers: where intermediate processing happens. A network can have one or several of these layers.
- Output Layer: where the network delivers its final answer.
The most important point here is that each neuron in a hidden layer performs the same calculation as a regular Perceptron: it receives inputs, multiplies them by weights, sums everything up, and passes that sum through an activation function.
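As a rough sketch, a single hidden neuron can be written as a weighted sum followed by an activation. The sigmoid used here is the one introduced later in this post. The `neuron` helper and its optional `bias` parameter are assumptions for illustration (the network built in this post has no bias terms), and the weights are the first column of the input-to-hidden matrix from the practice section.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def neuron(inputs, weights, bias=0.0):
    """One neuron: weighted sum of the inputs, then an activation (hypothetical helper)."""
    weighted_sum = np.dot(inputs, weights) + bias
    return sigmoid(weighted_sum)


# Input [0, 1] with weights [0.234, 0.435]: the weighted sum is 0.435
neuron_output = neuron(np.array([0.0, 1.0]), np.array([0.234, 0.435]))
print(neuron_output)
```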
The Hidden Layer
Think of the hidden layer as a “translator”: it receives the raw data and rewrites it in a form that the output layer can understand.
If the XOR problem can’t be solved with a straight line in the original data, the hidden layer learns an intermediate representation of that data. In this new representation, the output layer can finally separate the classes.
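To make the idea of an intermediate representation tangible, the sketch below solves XOR with weights picked by hand rather than learned. It is an illustration under assumptions that differ from the example later in this post: it uses a step activation and bias terms, and every value in `W_hidden`, `b_hidden`, `w_out`, and `b_out` is chosen manually so that one hidden neuron behaves like OR and the other like AND.

```python
import numpy as np


def step(x):
    # 1 where the value is above zero, 0 otherwise
    return (x > 0).astype(int)


X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: column 0 acts like OR, column 1 acts like AND (hand-picked values)
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])
b_hidden = np.array([-0.5, -1.5])
hidden = step(X @ W_hidden + b_hidden)

# Output layer: fires when OR is on and AND is off
w_out = np.array([1.0, -1.0])
b_out = -0.5
output = step(hidden @ w_out + b_out)

print(hidden)  # rows: [0 0], [1 0], [1 0], [1 1]
print(output)  # [0 1 1 0]
```

In the hidden representation, both inputs that should produce 1 land on the same point `[1, 0]`, and a single line separates that point from `[0, 0]` and `[1, 1]`.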
Feedforward
The process of information entering the network, passing through the hidden layers, and reaching the output is called feedforward. It’s a one-way flow: each layer processes the values and passes them to the next.
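The one-way flow can be sketched as a loop over the weight matrices, where each layer's output becomes the next layer's input. This is a simplified illustration: the `feedforward` helper is a hypothetical name, the weights are random placeholders, bias terms are omitted, and every layer uses the sigmoid activation described later in this post.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def feedforward(x, weight_matrices):
    """Pass x through each layer in order (hypothetical helper, no biases)."""
    activation = x
    for weights in weight_matrices:
        activation = sigmoid(np.dot(activation, weights))
    return activation


# Placeholder weights: 2 inputs -> 3 hidden neurons -> 1 output
rng = np.random.default_rng(0)
layers = [rng.normal(size=(2, 3)), rng.normal(size=(3, 1))]

result = feedforward(np.array([[0, 1]]), layers)
print(result.shape)  # (1, 1)
```

Adding another hidden layer only means appending another matrix to the list, as long as its row count matches the previous layer's column count.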
Activation Functions
As we saw with the Perceptron, after summing the weighted inputs, the neuron needs to decide what value to pass forward. That “filter” is the Activation Function.
There are many activation functions, each better suited to particular scenarios. Three classic ones are covered below.
Step Function
Used in the classic Perceptron, the step function is binary: if the sum exceeds a threshold, it returns 1; otherwise, it returns 0. Simple as it is, it’s too rigid for multilayer networks: a small change in a weight usually leaves the output unchanged, so the network gets no gradual signal about whether an adjustment helped.
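A minimal sketch of the step function, assuming a threshold of 0 (the `threshold` parameter and its default are illustrative). Notice that a sum barely above the threshold and a sum far above it produce the same output, which is the rigidity described above.

```python
import numpy as np


def step(x, threshold=0.0):
    # 1 where the sum exceeds the threshold, 0 otherwise
    return np.where(x > threshold, 1, 0)


sums = np.array([-0.01, 0.01, 5.0])
step_outputs = step(sums)
print(step_outputs)  # [0 1 1]
```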
Sigmoid Function
Unlike the step function, the Sigmoid produces a smooth curve, returning values between 0 and 1. This means the neuron doesn’t just “fire or not” — it transmits a level of intensity.
The Sigmoid formula is:
$$ y = \frac{1}{1 + e^{-x}} $$
Where $e$ is Euler’s number, a mathematical constant approximately equal to $2.718$ that shows up whenever growth or decay processes are described. In practice:
- If $x$ is a large positive value, the Sigmoid returns something close to 1.
- If $x$ is a large negative value, it returns something close to 0.
- If $x$ is 0, it returns exactly 0.5.
- It never returns negative values.
In Python, we can implement it using the NumPy library:
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))
Due to its clarity and good results in didactic examples, this is the function used throughout this post.
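A quick way to check the Sigmoid's behavior is to evaluate it at a very low value, at zero, and at a very high value. The sample inputs below are arbitrary illustrative choices.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


samples = np.array([-10.0, 0.0, 10.0])
values = sigmoid(samples)

# Close to 0, exactly 0.5, and close to 1, in that order
print(values)
```

Because NumPy applies `np.exp` element by element, the same function works on a single number, a vector, or a whole matrix of weighted sums.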
Hyperbolic Tangent (Tanh)
Very similar to the Sigmoid, but with an important difference: the Hyperbolic Tangent returns values between -1 and 1. Because its output is centered on zero, it can be useful in scenarios where negative values help the network learn faster.
$$ y = \frac{e^x - e^{-x}}{e^x + e^{-x}} $$
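The formula translates directly into NumPy. The sketch below implements it by hand and compares it against NumPy's built-in `np.tanh`. It also checks the identity $\tanh(x) = 2\,\sigma(2x) - 1$, which shows that Tanh is a rescaled and shifted Sigmoid. The hand-written `tanh` is for illustration only; in practice `np.tanh` is the safer choice, since the manual version can overflow for large inputs.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def tanh(x):
    # Direct translation of the formula above
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))


x = np.linspace(-3, 3, 7)

matches_numpy = np.allclose(tanh(x), np.tanh(x))
matches_sigmoid_identity = np.allclose(tanh(x), 2 * sigmoid(2 * x) - 1)

print(matches_numpy)             # True
print(matches_sigmoid_identity)  # True
```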
Practice
To visualize the data flow in a multilayer network, let’s implement the feedforward process. In this example, we’ll set weights manually to observe how the network processes XOR inputs and how we quantify the output error.
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


# Input set (XOR)
inputs = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])

# Expected outputs
expected_outputs = np.array([[0], [1], [1], [0]])

# Weight Matrix: Input -> Hidden Layer (2 inputs -> 3 neurons)
weights_input_hidden = np.array([
    [0.234, -0.740, -0.371],
    [0.435, 0.547, -0.469]
])

# Weight Matrix: Hidden Layer -> Output (3 neurons -> 1 output)
weights_hidden_output = np.array([
    [0.107],
    [-0.893],
    [0.130]
])

if __name__ == '__main__':
    # 1. Hidden Layer Processing
    # Dot product of inputs and weights, followed by sigmoid activation
    hidden_layer_input = np.dot(inputs, weights_input_hidden)
    hidden_layer_output = sigmoid(hidden_layer_input)

    # 2. Output Layer Processing
    # The hidden layer result feeds into the final layer
    output_layer_input = np.dot(hidden_layer_output, weights_hidden_output)
    predictions = sigmoid(output_layer_input)

    # 3. Error Evaluation
    # Difference between expected and predicted values
    error = expected_outputs - predictions

    # Mean Absolute Error (MAE)
    mean_absolute_error = np.mean(np.abs(error))

    print(f"Network predictions:\n{predictions}")
    print(f"\nMean Absolute Error: {mean_absolute_error:.4f}")
Architecture and Weight Matrices
Unlike the traditional Perceptron, the MLP uses weight matrices to connect its layers. The dimensions of these matrices define the network’s structure:
- weights_input_hidden (2x3): each of the 2 inputs is connected to each of the 3 hidden neurons, and every value in the matrix represents the weight of that connection.
- weights_hidden_output (3x1): the 3 hidden neurons converge their outputs into a single final neuron.
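The way these dimensions chain together can be checked with placeholder matrices. The sketch below uses zero-filled arrays with the same shapes as in the example, since only the shapes matter here, and skips the activation function because it doesn't change a matrix's shape.

```python
import numpy as np

# Same shapes as in the example; the zero values are placeholders
inputs = np.zeros((4, 2))                 # 4 samples, 2 features each
weights_input_hidden = np.zeros((2, 3))   # 2 inputs -> 3 hidden neurons
weights_hidden_output = np.zeros((3, 1))  # 3 hidden neurons -> 1 output

hidden = np.dot(inputs, weights_input_hidden)
output = np.dot(hidden, weights_hidden_output)

print(hidden.shape)  # (4, 3): one row per sample, one column per hidden neuron
print(output.shape)  # (4, 1): one prediction per sample
```

The inner dimensions must match at each step: the 2 columns of the inputs meet the 2 rows of the first matrix, and the 3 columns of the hidden result meet the 3 rows of the second.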
Error Calculation
After running the feedforward, we compare the results against the expected values. To measure the network’s overall performance, we use the Mean Absolute Error in two steps:
- np.abs(error): the absolute value ensures that errors in opposite directions (positive and negative) don’t cancel each other out. Each difference becomes a positive distance measure.
- np.mean(...): averaging those distances gives a single numerical indicator of the network’s accuracy across the entire dataset.
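The cancellation problem is easy to reproduce. The error values below are made up for illustration (they are not the output of the network above): every prediction is off by 0.5, yet the plain average suggests a perfect result.

```python
import numpy as np

# Illustrative errors: each prediction misses by 0.5, in alternating directions
error = np.array([[0.5], [-0.5], [0.5], [-0.5]])

plain_mean = np.mean(error)
mean_absolute_error = np.mean(np.abs(error))

print(plain_mean)           # 0.0, the errors cancel out
print(mean_absolute_error)  # 0.5, the actual average miss
```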
Measuring error is the fundamental prerequisite for a neural network to learn. In this example, the weights are static — the network doesn’t “learn”, it just processes data with pre-defined values.
In the next post, that error magnitude will be used to recalculate the network’s weights at each iteration — that’s what allows it to converge toward solving non-linear problems like XOR.