In the previous posts, we built the MLP foundation: its structure, the feedforward pass, and how it learns through backpropagation. Now let’s apply that knowledge to more realistic problems and look ahead to what comes next.
Other Ways to Measure Error
Up to now, we’ve used MAE (Mean Absolute Error) to monitor the network. It’s excellent for its simplicity: it just tells us the average distance between prediction and reality. However, monitoring error and training the network don’t have to use the exact same measure. Depending on the problem, other loss functions can generate more useful gradients for learning.
MAE: Mean Absolute Error
$$ MAE = \frac{1}{n} \sum |target - prediction| $$
The average of absolute differences. Useful for interpreting the mean distance between prediction and target, without giving extra weight to large errors.
mae = np.mean(np.abs(expected - predicted))
MSE: Mean Squared Error
$$ MSE = \frac{1}{n} \sum (target - prediction)^2 $$
MSE squares the differences before averaging. By squaring, large errors become proportionally much larger than small ones, forcing the network to pay far more attention to correcting gross mistakes and pushing learning to be more precise where the failure is greatest.
mse = np.mean(np.power(expected - predicted, 2))
RMSE: Root Mean Squared Error
$$ RMSE = \sqrt{\frac{1}{n} \sum (target - prediction)^2} $$
Simply the square root of MSE. Since MSE delivers a “squared” value that can be hard to interpret, RMSE brings it back to the original data’s scale while keeping the characteristic of being much stricter with large errors than MAE.
rmse = np.sqrt(np.mean(np.power(expected - predicted, 2)))
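To make the difference between the three measures concrete, here is a small sketch that computes them on the same data. The `expected` and `predicted` values are made up for illustration, and the last prediction is a deliberate outlier.

```python
import numpy as np

# Hypothetical targets and predictions; the last prediction is a deliberate outlier
expected = np.array([1.0, 2.0, 3.0, 4.0])
predicted = np.array([1.1, 1.9, 3.2, 8.0])

diff = expected - predicted

mae = np.mean(np.abs(diff))   # average absolute distance
mse = np.mean(diff ** 2)      # average squared distance
rmse = np.sqrt(mse)           # back on the original scale

print(f"MAE:  {mae:.4f}")
print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
```

In this toy data, the outlier accounts for 4.0 of the 4.4 total absolute error, but for 16.0 of the 16.06 total squared error, which is why MSE and RMSE react to it much more strongly than MAE.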
Other Approaches
- Binary Cross-Entropy: widely used in yes/no problems (like XOR). It treats the output as a probability and heavily penalizes confident wrong predictions.
- Huber Loss: a middle ground between MAE and MSE. For small errors it behaves like MSE; for large errors it behaves more like MAE, reducing the impact of outliers.
The choice of loss function defines the character of the network’s learning: whether it will be forgiving of small mistakes or strict with any deviation from the target.
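The two losses from the list above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names, the `eps` clipping value, the `delta` threshold, and the sample data are all assumptions chosen for the example.

```python
import numpy as np

def binary_cross_entropy(expected, predicted, eps=1e-12):
    # Clip probabilities so log(0) never happens; eps is an illustrative choice
    p = np.clip(predicted, eps, 1 - eps)
    return -np.mean(expected * np.log(p) + (1 - expected) * np.log(1 - p))

def huber(expected, predicted, delta=1.0):
    diff = np.abs(expected - predicted)
    quadratic = 0.5 * diff ** 2             # MSE-like branch for small errors
    linear = delta * (diff - 0.5 * delta)   # MAE-like branch for large errors
    return np.mean(np.where(diff <= delta, quadratic, linear))

# Hypothetical binary labels and predicted probabilities
labels = np.array([1.0, 0.0, 1.0, 0.0])
probs = np.array([0.9, 0.2, 0.6, 0.4])
bce_value = binary_cross_entropy(labels, probs)
print(f"Binary Cross-Entropy: {bce_value:.4f}")

# Hypothetical regression data with one outlier in the last position
targets = np.array([1.0, 2.0, 3.0, 4.0])
preds = np.array([1.1, 1.9, 3.2, 8.0])
huber_value = huber(targets, preds)
print(f"Huber Loss: {huber_value:.4f}")
```

With `delta=1.0`, the outlier (an error of 4.0) is charged linearly instead of quadratically, so it dominates the result far less than it would under MSE.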
Real-World Applications
Note that the datasets below are fictional and simplified. They’re meant to illustrate the learning process, not to make real credit or medical decisions.
Practice: Credit Risk Analysis
In this scenario, the network analyzes three client variables: Salary, Debt, and Credit History. The goal, for teaching purposes, is to predict the probability of loan approval. Since real data can have different scales, we use normalized values (between 0 and 1) to help convergence.
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # x is expected to be a sigmoid output (an activation), not a raw weighted sum
    return x * (1 - x)

# Dataset: [Salary, Debt, Credit History]
# 1.0 = High/Excellent, 0.0 = Low/Poor
inputs = np.array([
    [0.8, 0.1, 1.0],  # High salary, low debt, great history -> Approve (1)
    [0.2, 0.8, 0.1],  # Low salary, high debt, poor history -> Deny (0)
    [0.5, 0.5, 0.5],  # Average across the board -> Risk (0)
    [0.7, 0.6, 0.9],  # High salary, but high debt and good history -> Approve (1)
])
expected_outputs = np.array([[1], [0], [0], [1]])

# Architecture (3 inputs -> 4 hidden -> 1 output) with Bias
np.random.seed(42)
weights_0 = np.random.uniform(size=(3, 4))
weights_1 = np.random.uniform(size=(4, 1))
bias_0 = np.random.uniform(size=(1, 4))
bias_1 = np.random.uniform(size=(1, 1))

epochs = 100_000
learning_rate = 0.2

for epoch in range(epochs):
    # Feedforward
    layer_1 = sigmoid(np.dot(inputs, weights_0) + bias_0)
    layer_2 = sigmoid(np.dot(layer_1, weights_1) + bias_1)

    # Backpropagation
    error = expected_outputs - layer_2
    d_layer_2 = error * sigmoid_derivative(layer_2)
    d_layer_1 = np.dot(d_layer_2, weights_1.T) * sigmoid_derivative(layer_1)

    # Weight and bias updates
    weights_1 += np.dot(layer_1.T, d_layer_2) * learning_rate
    bias_1 += np.sum(d_layer_2, axis=0) * learning_rate
    weights_0 += np.dot(inputs.T, d_layer_1) * learning_rate
    bias_0 += np.sum(d_layer_1, axis=0) * learning_rate

# Test: Client with High Salary (0.9), Medium Debt (0.4) and Good History (0.7)
new_client = np.array([[0.9, 0.4, 0.7]])
l1 = sigmoid(np.dot(new_client, weights_0) + bias_0)
result = sigmoid(np.dot(l1, weights_1) + bias_1)
print(f"Approval Probability: {result[0][0]:.4f}")
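The credit example assumes inputs that are already scaled between 0 and 1. As a rough sketch of how raw values could be brought to that range, here is min-max normalization applied column by column. The raw figures (monthly salary, total debt, credit score) are hypothetical and only serve to show the mechanics.

```python
import numpy as np

# Hypothetical raw client data: [monthly salary, total debt, credit score]
raw = np.array([
    [8000.0,  1000.0, 850.0],
    [2000.0, 12000.0, 400.0],
    [5000.0,  6000.0, 600.0],
])

# Min-max scaling per column: the smallest value maps to 0, the largest to 1
col_min = raw.min(axis=0)
col_max = raw.max(axis=0)
normalized = (raw - col_min) / (col_max - col_min)

print(normalized)
```

When scoring a new client, the same `col_min` and `col_max` computed from the training data should be reused, so that the new values land on the scale the network was trained on.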
Practice: Medical Screening — Diabetes Risk
Here, the network processes four fictional health indicators: Glucose, BMI, Blood Pressure, and Age. The hidden layer allows the network to represent correlations between these factors, but this example should not be interpreted as real medical screening.
import numpy as np

np.random.seed(42)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # x is expected to be a sigmoid output (an activation), not a raw weighted sum
    return x * (1 - x)

# Dataset: [Glucose, BMI, Blood Pressure, Age]
inputs = np.array([
    [0.9, 0.8, 0.7, 0.6],  # All indicators high -> High Risk (1)
    [0.2, 0.3, 0.4, 0.2],  # Low indicators -> Low Risk (0)
    [0.6, 0.9, 0.5, 0.4],  # Average glucose but very high BMI -> High Risk (1)
    [0.4, 0.2, 0.3, 0.8],  # Elderly with good indicators -> Low Risk (0)
])
expected_outputs = np.array([[1], [0], [1], [0]])

# Architecture (4 inputs -> 5 hidden -> 1 output) with Bias
weights_in = np.random.uniform(size=(4, 5))
weights_out = np.random.uniform(size=(5, 1))
b_hidden = np.random.uniform(size=(1, 5))
b_out = np.random.uniform(size=(1, 1))

learning_rate = 0.1

for epoch in range(200_000):
    # Feedforward
    l1 = sigmoid(np.dot(inputs, weights_in) + b_hidden)
    l2 = sigmoid(np.dot(l1, weights_out) + b_out)

    # Backpropagation
    err = expected_outputs - l2
    d_l2 = err * sigmoid_derivative(l2)
    d_l1 = np.dot(d_l2, weights_out.T) * sigmoid_derivative(l1)

    # Weight and bias updates
    weights_out += np.dot(l1.T, d_l2) * learning_rate
    b_out += np.sum(d_l2, axis=0) * learning_rate
    weights_in += np.dot(inputs.T, d_l1) * learning_rate
    b_hidden += np.sum(d_l1, axis=0) * learning_rate

# Test: Patient with High Glucose (0.85) and other normal indicators (0.5)
patient = np.array([[0.85, 0.5, 0.5, 0.5]])
hidden = sigmoid(np.dot(patient, weights_in) + b_hidden)
res = sigmoid(np.dot(hidden, weights_out) + b_out)
risk = res[0][0]  # extract the scalar before comparing and formatting
print(f"Diabetes Risk: {'High' if risk > 0.5 else 'Low'} ({risk:.4f})")
Architecture Design
Once you’ve mastered the learning cycle, design questions come up: how do you structure the network so it works well on real problems? This section covers the most important decisions in that process.
Networks with Multiple Outputs
So far, we’ve focused on networks that deliver a single number as output. But what if we need the network to identify several attributes at once — like “has cat”, “has dog”, and “has bird” in an image? For that, we use networks with multiple output neurons.
The main change is in the matrix dimensions. If our hidden layer has 3 neurons and we need 2 outputs, the final weight matrix will have dimensions 3x2. Each output neuron has its own set of weights to receive what the hidden layer processed.
With multiple outputs, the error is no longer a single number — it becomes a vector. With 2 outputs, we get 2 errors. In backpropagation, each error generates its own Delta ($\delta$), which travels back to adjust the weights feeding that specific neuron.
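A small sketch of that idea for a single sample with 2 output neurons follows. The prediction, target, and hidden activation values are hypothetical; the point is only the shapes and the fact that each column of the weight gradient depends on its own delta.

```python
import numpy as np

def sigmoid_derivative(activated):
    # Expects a value that already went through the sigmoid
    return activated * (1 - activated)

# Hypothetical values for one sample with 2 output neurons
predictions = np.array([[0.62, 0.59]])
targets = np.array([[1.0, 0.0]])

# One error and one delta per output neuron
errors = targets - predictions                     # shape (1, 2)
deltas = errors * sigmoid_derivative(predictions)  # shape (1, 2)

# Hypothetical hidden layer activation (3 neurons)
hidden_layer_output = np.array([[0.6, 0.1, 0.9]])

# Column j of this matrix only uses deltas[0, j], so each output
# neuron adjusts only the weights that feed it
weight_gradient = np.dot(hidden_layer_output.T, deltas)  # shape (3, 2)

print("Errors:", errors)
print("Deltas:", deltas)
print("Weight gradient shape:", weight_gradient.shape)
```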
When classes are mutually exclusive, like “cat or dog or bird”, it’s common to use softmax activation at the output. To keep the focus on the MLP, the example below sticks with Sigmoid.
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Weights: Hidden (3 neurons) -> Output (2 neurons), note the (3, 2) shape
weights_hidden_output = np.array([
    [0.1, 0.5],
    [0.8, -0.2],
    [0.4, 0.1],
])

# Simulated hidden layer activation (3 neurons)
hidden_layer_output = np.array([[0.6, 0.1, 0.9]])

# The result will be a vector with 2 predictions
predictions = sigmoid(np.dot(hidden_layer_output, weights_hidden_output))
print(f"Predictions (2 output neurons): {predictions}")
# Expected output: approximately [[0.6225 0.5915]]
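For the mutually exclusive case mentioned earlier, a softmax output could look like the sketch below. The three raw scores are hypothetical values for "cat", "dog", and "bird"; unlike independent Sigmoid outputs, the resulting probabilities always sum to 1.

```python
import numpy as np

def softmax(z):
    # Subtracting the row maximum avoids overflow in exp without changing the result
    shifted = z - np.max(z, axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=1, keepdims=True)

# Hypothetical raw scores (before activation) for: cat, dog, bird
raw_scores = np.array([[2.0, 1.0, 0.1]])

probabilities = softmax(raw_scores)
predicted_class = int(np.argmax(probabilities, axis=1)[0])

print(f"Probabilities: {probabilities}")
print(f"Sum: {probabilities.sum():.4f}")
print(f"Predicted class index: {predicted_class}")
```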
How Many Layers and Neurons?
One of the most common questions when designing a neural network is: how do you choose the ideal number of layers and neurons? There’s no exact mathematical formula, but there are technical guidelines based on experimentation (heuristics).
- One hidden layer: often enough for simpler problems. According to the Universal Approximation Theorem, a single hidden layer with enough neurons can approximate continuous functions under certain conditions.
- Multiple layers: useful when learning representations at different levels of abstraction, such as in images, audio, or text.
The key is finding the right balance: too few neurons prevent the network from learning (underfitting), while too many cause the network to memorize the training data and lose the ability to generalize (overfitting). Start simple and increase complexity gradually as you monitor the error behavior.
Cross-Validation
A neural network can sometimes show a low error simply because it got “lucky” with the train/test split. Cross-Validation reduces this risk by dividing the data into K parts (folds).
The model is trained and tested K times, rotating which fold serves as the test set while the remaining folds are used for training. The final result is the average of these tests, which gives a more reliable estimate of whether the model learned the general pattern rather than memorizing a specific subset of the data.
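The rotation can be sketched with plain NumPy. The helper `k_fold_indices` is a hypothetical name written for this example, and the "model" is a deliberately trivial stand-in (it predicts the mean of the training targets) so the focus stays on how the folds rotate. In a real experiment, that line would be replaced by training the MLP on the training indices.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs, using each fold once as the test set."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Hypothetical targets; a real dataset would also have input features
targets = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0])

scores = []
for train_idx, test_idx in k_fold_indices(len(targets), k=5):
    # Stand-in for training: "learn" the mean of the training targets
    baseline = targets[train_idx].mean()
    # Evaluate on the held-out fold using MAE
    fold_mae = np.mean(np.abs(targets[test_idx] - baseline))
    scores.append(fold_mae)

print(f"MAE per fold: {np.round(scores, 3)}")
print(f"Average MAE:  {np.mean(scores):.3f}")
```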
AutoML (Automated Machine Learning)
Choosing the number of layers and neurons can be an exhausting trial-and-error process. AutoML aims to automate that search.
Through specialized algorithms, the system automatically tests various combinations of architectures, learning rates, and other hyperparameters to find the best-performing model, which can save a great deal of manual tuning.
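The simplest form of this automated search is a grid search: try every combination from a small list and keep the best one. The sketch below does that for a one-hidden-layer MLP on XOR. Real AutoML tools use far more sophisticated strategies; the function `train_and_score`, the candidate values, and the epoch count are assumptions made for illustration.

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def train_and_score(inputs, targets, hidden_size, learning_rate, epochs=2000, seed=42):
    """Train a one-hidden-layer MLP and return its final MAE on the given data."""
    rng = np.random.default_rng(seed)
    w0 = rng.uniform(size=(inputs.shape[1], hidden_size))
    w1 = rng.uniform(size=(hidden_size, 1))
    b0 = rng.uniform(size=(1, hidden_size))
    b1 = rng.uniform(size=(1, 1))
    for _ in range(epochs):
        hidden = sigmoid(inputs @ w0 + b0)
        out = sigmoid(hidden @ w1 + b1)
        d_out = (targets - out) * out * (1 - out)
        d_hidden = (d_out @ w1.T) * hidden * (1 - hidden)
        w1 += (hidden.T @ d_out) * learning_rate
        b1 += d_out.sum(axis=0) * learning_rate
        w0 += (inputs.T @ d_hidden) * learning_rate
        b0 += d_hidden.sum(axis=0) * learning_rate
    final = sigmoid(sigmoid(inputs @ w0 + b0) @ w1 + b1)
    return float(np.mean(np.abs(targets - final)))

# XOR dataset
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
targets = np.array([[0], [1], [1], [0]], dtype=float)

# Candidate values to try (illustrative choices)
hidden_sizes = [2, 4, 8]
learning_rates = [0.1, 0.5]

results = {}
for hidden_size, lr in itertools.product(hidden_sizes, learning_rates):
    results[(hidden_size, lr)] = train_and_score(inputs, targets, hidden_size, lr)

best_config = min(results, key=results.get)
print(f"Best (hidden_size, learning_rate): {best_config}")
print(f"MAE: {results[best_config]:.4f}")
```

To keep the example short, each configuration is scored on the same data it was trained on. In practice, the score should come from a validation set or from cross-validation, otherwise the search tends to reward memorization.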
The Vanishing Gradient Problem
At this point, it’s worth mentioning one of the greatest historical obstacles in training deep neural networks: the vanishing gradient problem.
In backpropagation, the error travels from the output back toward the input, and at each layer it is multiplied by that layer’s activation derivative (along with the weights). Since the Sigmoid derivative has a maximum value of only 0.25, stacking many hidden layers means repeatedly multiplying by factors of at most 0.25 ($0.25 \cdot 0.25 \cdot 0.25 \dots$).
The result is that the gradient becomes so small that, by the time it reaches the first layers, weight adjustments are practically zero — the early layers stop learning, effectively stalling the network’s development.
This problem was one of the main reasons deep neural networks were so hard to train for years. Today, it’s mitigated by using other activation functions like ReLU (which doesn’t saturate for positive values), and more advanced weight initialization techniques.
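A quick numerical sketch shows how fast that product shrinks. It only tracks the activation-derivative factor and ignores the weights, which also multiply the gradient in a real network, so treat it as an illustration of the trend rather than an exact gradient. The helper names are chosen for this example.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid with respect to its input
    s = sigmoid(x)
    return s * (1 - s)

def relu_grad(x):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise
    return (np.asarray(x) > 0).astype(float)

# The sigmoid derivative peaks at x = 0
print(f"Max sigmoid derivative: {sigmoid_grad(0.0)}")

# Best case for sigmoid (0.25 per layer) vs. an active ReLU unit (1 per layer)
for depth in (1, 5, 10, 20):
    sigmoid_factor = sigmoid_grad(0.0) ** depth
    relu_factor = float(relu_grad(1.0)) ** depth
    print(f"{depth:>2} layers | sigmoid: {sigmoid_factor:.2e} | relu: {relu_factor:.1f}")
```

Even in the sigmoid’s best case, ten layers already shrink the factor below one millionth, while an active ReLU unit passes the gradient through unchanged.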
Beyond MLPs: The Neural Network Ecosystem
Multilayer Neural Networks form the conceptual foundation for far more complex architectures. While MLPs are powerful, they treat inputs as flat vectors without directly exploiting spatial or temporal structure. For specific problems, specialized variations emerged that now underpin much of modern AI:
Convolutional Neural Networks (CNN)
Specialized in grid-structured data, like images. CNNs use filters that slide across the image to detect spatial patterns — edges, textures, objects — instead of connecting each pixel to a neuron independently. They power facial recognition, computer vision, and driver assistance systems.
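To give a feel for what “a filter sliding across the image” means, here is a minimal sketch using a hand-written vertical edge filter on a tiny 5x5 image. In a real CNN the filter values are learned during training rather than written by hand, and the function name `slide_filter` is just an illustrative choice.

```python
import numpy as np

def slide_filter(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the current window by the kernel and sum the result
            output[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return output

# 5x5 image: dark on the left (0), bright on the right (1)
image = np.array([
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
], dtype=float)

# Hand-written filter that responds to dark-to-bright vertical edges
kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

feature_map = slide_filter(image, kernel)
print(feature_map)
```

The output is zero where the window covers a uniform region and positive where it overlaps the dark-to-bright transition. The same 9 weights are reused at every position, instead of one weight per pixel as in an MLP.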
Recurrent Neural Networks (RNN)
Designed to handle sequences and temporal data. Unlike MLPs, RNNs have internal loops that allow information to persist: they carry a kind of “memory” of what happened in previous steps, making them useful for speech processing, text, and time series.
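The “internal loop” can be sketched as a single function applied at every time step, receiving the current input together with the previous hidden state. The sizes, the random weights, and the names `W_x`, `W_h`, and `rnn_step` are assumptions for illustration; no training happens here.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4

# Randomly initialized parameters (untrained, for illustration only)
W_x = rng.normal(scale=0.5, size=(input_size, hidden_size))   # input -> hidden
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))  # hidden -> hidden (the loop)
b = np.zeros((1, hidden_size))

def rnn_step(x_t, h_prev):
    # The new state mixes the current input with the previous state
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

# Hypothetical sequence with 5 time steps, one sample per step
sequence = rng.normal(size=(5, 1, input_size))

h = np.zeros((1, hidden_size))  # initial "memory" is empty
for x_t in sequence:
    h = rnn_step(x_t, h)

print(f"Final hidden state: {h}")
```

The same weights are reused at every step, and the final `h` depends on the whole sequence, which is the “memory” an MLP lacks.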
Transformers
The architecture that enabled modern language models (LLMs). Transformers use a mechanism called Attention, which lets the network focus on the most relevant parts of a data sequence regardless of how far apart they are.
They replaced RNNs in many text tasks by being more parallelizable and scalable, and they are the backbone of technologies like ChatGPT.
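The core of Attention fits in a few lines. The sketch below implements scaled dot-product attention for a short random sequence. Real Transformers derive the queries, keys, and values from learned projections and use multiple heads; here the same matrix `X` is reused for all three to keep the example small, which is a simplification.

```python
import numpy as np

def softmax(z):
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    # Similarity between every pair of positions, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # Each row becomes a distribution over all positions in the sequence
    weights = softmax(scores)
    # Each output is a weighted mix of the value vectors
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8

# Hypothetical sequence of 4 token vectors
X = rng.normal(size=(seq_len, d_model))

# Simplification: reuse X as queries, keys, and values (no learned projections)
output, weights = scaled_dot_product_attention(X, X, X)

print(f"Attention weights shape: {weights.shape}")
print(f"Output shape: {output.shape}")
```

Every position attends to every other position in a single matrix product, regardless of distance, and nothing has to be processed step by step as in an RNN.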
Generative Adversarial Networks (GANs)
Networks that don’t just classify, but create new data. In GANs, two networks compete: a generator tries to create fake data, and a discriminator tries to detect the fraud. This rivalry can lead to extremely realistic content, such as the photorealistic faces produced by StyleGAN.
Conclusion
The transition from the simple Perceptron to Multilayer Networks addressed the non-linearity problem. The secret to handling non-linear problems isn’t just adding more neurons, but letting hidden layers learn intermediate representations before the final decision.
Through the use of differentiable functions like the Sigmoid and the application of backpropagation, we turned learning into a continuous process of error minimization. Concepts we covered here — like bias for activation flexibility and Gradient Descent for weight adjustment — remain at the core of the most advanced deep learning architectures, showing that the mathematical foundation established decades ago still underpins much of modern artificial intelligence.
With this understanding in place, we’re ready to explore how these networks can be optimized to process massive volumes of information and tackle even deeper challenges.
Thanks for reading — see you in the next post!