I’m assuming you understand the basic idea of neural networks. This essay focuses purely on the backpropagation algorithm itself.
What is Backpropagation?
Backpropagation is an algorithm that computes how much each weight and bias should change to reduce the loss.
It tells us not just whether parameters should go up or down, but by how much, based on their actual impact on the loss function. Calculus gives us the tools to figure that out.
The Key Players
Before we dive in, let’s identify what we’re working with:
- Input data - what we feed into the network
- Parameters - weights (w) and biases (b) that we need to adjust
- Neurons - the computational units
- Loss - measures how wrong our predictions are
- Target - what we’re trying to predict
What is a Neuron?
A neuron performs two simple computations:
z = w · a + b
a_out = σ(z)
Where:
- a is the input activation (from the previous layer)
- w is the weight
- b is the bias
- z is the weighted sum (pre-activation); for multiple inputs, z = Σᵢ wᵢ·aᵢ + b
- σ is the activation function (sigmoid, ReLU, etc.)
- a_out is the output activation
Think of it like this: the neuron takes inputs, weighs them, adds a bias, then applies a non-linear function. That’s it.
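Those two computations can be sketched in a few lines of Python. The input, weights, and bias below are made-up illustration values:

```python
import math

def neuron(a, w, b):
    """One neuron: weighted sum of inputs, then sigmoid activation."""
    z = sum(wi * ai for wi, ai in zip(w, a)) + b  # z = w · a + b
    return 1 / (1 + math.exp(-z))                 # a_out = sigma(z)

# Made-up example values: z = 0.5*1.0 + (-0.3)*2.0 + 0.1 = 0.0
a_out = neuron(a=[1.0, 2.0], w=[0.5, -0.3], b=0.1)
print(a_out)  # sigma(0.0) = 0.5
```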
The Forward Pass
Training involves:
- Forward pass - feed input through the network to get a prediction
- Compute loss - measure how wrong the prediction is
- Backward pass - figure out how to adjust weights to reduce loss
- Update weights - make the adjustments
- Repeat
The forward pass is straightforward. The backward pass is where backpropagation comes in.
The Goal: Minimize Loss
We want to reduce the loss. The tool we use is gradient descent.
Why gradient descent?
Gradients tell us how a function (in our case, the loss function) changes with respect to some variable (in our case, the parameters).
Gradient descent is the idea that we move in the direction opposite to the gradient vector to minimize the loss.
Why the opposite direction? Because the gradient vector points in the direction of steepest increase.
Gradient Descent Intuition
Imagine you’re blindfolded on a hilly terrain trying to reach the lowest valley.
What would you do?
- Feel the slope under your feet
- Determine which direction goes downhill most steeply
- Take a small step in that direction
- Repeat
In mathematical terms:
- The terrain = Loss function (how wrong your model is)
- Your position = Current parameter values
- Feeling the slope = Computing the gradient ∂L/∂w
- Taking a step = Updating: w ← w - η · ∂L/∂w
The gradient ∂L/∂w tells us which direction is “downhill” for the loss.
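The update rule above can be watched in action on a toy one-dimensional loss, L(w) = (w - 3)², chosen here purely for illustration (its gradient is 2(w - 3), and its minimum is at w = 3):

```python
def grad(w):
    """Derivative of the toy loss L(w) = (w - 3)**2."""
    return 2 * (w - 3)

w = 0.0      # starting position on the "terrain"
eta = 0.1    # learning rate (step size)
for _ in range(100):
    w -= eta * grad(w)   # step opposite the gradient

print(w)  # converges toward 3, the minimum of L
```

Each step shrinks the distance to the minimum by a constant factor, which is the "walking downhill" intuition made concrete.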
The Problem: We Can’t Compute ∂L/∂w Directly
Here’s the issue: the loss doesn’t directly depend on w. What do I mean? Suppose the network has hundreds of layers, and pick the third neuron of, say, the 101st layer. We want the derivative of the loss with respect to THAT neuron’s weight, but the weight and the loss are just too far away from each other.
The dependency chain looks like this:
w → z → a → (more layers) → prediction → loss
The loss is computed way at the end, but w is buried deep in the network. They’re connected through many intermediate computations.
This is why we need the chain rule.
The Chain Rule Solution
To compute ∂L/∂w, we break it into pieces:
∂L/∂w = ∂L/∂a · ∂a/∂z · ∂z/∂w
Let’s understand each term:
∂z/∂w - “How does z change when w changes?” Looking at z = w·a + b, we get ∂z/∂w = a. This is local and easy to compute.
∂a/∂z - “How does activation change when z changes?”
This depends on the activation function:
- Sigmoid: σ’(z) = σ(z)(1 - σ(z))
- ReLU: σ’(z) = 1 if z > 0, else 0
This is also local and easy.
∂L/∂a - “How does loss change when this activation changes?” This is the tricky one, and this is where the interaction between layers happens! For hidden layers, we don’t know this directly. We have to get it from the layer ahead.
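For concreteness, here are the three factors multiplied out for a single sigmoid output neuron. The input, weight, bias, and target are made-up values, and the loss is the squared error used later in this essay:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

a_in, w, b, target = 2.0, 0.5, 0.0, 1.0  # made-up values

# Forward pass
z = w * a_in + b          # z = 1.0
a = sigmoid(z)            # a ~= 0.731

# The three chain-rule factors
dL_da = 2 * (a - target)  # from L = (a - target)**2
da_dz = a * (1 - a)       # sigmoid'(z)
dz_dw = a_in              # from z = w * a_in + b

dL_dw = dL_da * da_dz * dz_dw
print(dL_dw)  # negative here, so increasing w would decrease the loss
```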
The Backpropagation Algorithm
Backpropagation works backwards through the network:
Step 1: Output Layer (Easy Case)
At the output layer, ∂L/∂a can be computed directly from the loss function.
For example, if Loss = (prediction - target)², then:
∂L/∂a^L = 2(a^L - target)
Step 2: Compute Local Gradients
For each neuron at this layer:
- Compute ∂a/∂z = σ’(z)
- Compute ∂z/∂w = a (the input to this neuron)
Step 3: Multiply Using Chain Rule
∂L/∂w = ∂L/∂a · ∂a/∂z · ∂z/∂w
Now we have the gradient for this weight!
Step 4: Pass Gradients Backward
For the previous layer to compute its gradients, it needs ∂L/∂a.
We compute it using:
∂L/∂a^prev = w · ∂L/∂z (summed over all neurons in the current layer that this activation feeds into)
Where ∂L/∂z = ∂L/∂a · ∂a/∂z (combining the first two terms).
Step 5: Repeat
Move to the previous layer and repeat steps 2-4, using the ∂L/∂a we just computed.
Continue until you’ve computed gradients for all weights in all layers.
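Steps 1-5 can be traced by hand on the smallest possible network: one hidden neuron feeding one output neuron, with scalar made-up weights. Every line below maps to one of the steps above:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

x, target = 1.0, 0.0
w1, b1, w2, b2 = 0.6, 0.0, -0.4, 0.0  # made-up weights and biases

# Forward pass, keeping every intermediate value
z1 = w1 * x + b1
a1 = sigmoid(z1)
z2 = w2 * a1 + b2
a2 = sigmoid(z2)
L = (a2 - target) ** 2

# Step 1: output layer, dL/da known directly from the loss
dL_da2 = 2 * (a2 - target)
# Steps 2-3: local gradients, combined with the chain rule
dL_dz2 = dL_da2 * a2 * (1 - a2)
dL_dw2 = dL_dz2 * a1
# Step 4: pass dL/da back to the hidden layer
dL_da1 = w2 * dL_dz2
# Step 5: repeat steps 2-3 one layer earlier
dL_dz1 = dL_da1 * a1 * (1 - a1)
dL_dw1 = dL_dz1 * x

print(dL_dw1, dL_dw2)
```

Note how the hidden layer never touches the loss directly: everything it needs arrives in dL_da1, handed back by the output layer.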
The Complete Picture
Forward pass: Input → Layer 1 → Layer 2 → … → Output → Loss
Backward pass: Loss → ∂L/∂w_output → ∂L/∂w_layer2 → … → ∂L/∂w_input
Each layer:
- Receives ∂L/∂a from the next layer
- Computes its own ∂L/∂w using the chain rule
- Passes ∂L/∂a_prev to the previous layer
Update the Weights
Once we have ∂L/∂w for every weight:
w ← w - η · ∂L/∂w
b ← b - η · ∂L/∂b
This nudges each parameter in the direction that reduces loss.
Implementation Note: PyTorch
While trying to implement a Tensor class myself, I learned that in PyTorch, after calling backward(), each tensor created with requires_grad=True holds ∂L/∂(that tensor) in its .grad attribute.
x = torch.tensor(2.0, requires_grad=True)
y = x * 3
y.backward()
print(x.grad)  # tensor(3.), since dy/dx = 3
One Shot
Backpropagation is elegant:
- Do a forward pass and compute loss
- Start at the output where ∂L/∂a is known
- Use the chain rule to compute ∂L/∂w locally
- Pass ∂L/∂a backwards to the previous layer
- Repeat until all gradients are computed
- Update all weights using gradient descent
The “back” in backpropagation refers to this backward flow of gradients through the network, from output to input. The algorithm is also efficient: each layer reuses the ∂L/∂a already computed by the layer after it instead of rederiving it from scratch.
Neural Net
We now know that a neural network is simply a set of parameters optimized to minimize a loss function. A good exercise to internalize this idea is the following.
You are given the XOR problem:
// XOR problem.
// X -> inputs, y -> true values
float X[4][2] = {{0,0}, {0,1}, {1,0}, {1,1}};
float y[4][1] = {{0}, {1}, {1}, {0}};
// Random weights and biases.
float w1[2][2];
float b1[2][1];
float w2[2][1];
float b2[1][1];
Writing backpropagation for this problem is what helped me develop an intuitive understanding of how backpropagation works. The task is simple: after training, the loss function (which can be any reasonable choice) should be minimized, meaning the forward pass produces accurate outputs for the XOR problem.
For the sake of simplicity, here is one possible solution in Python:

import numpy as np

# Activation function and its derivative
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

def loss(yt, y):
    return np.mean((yt - y) ** 2)

X = np.array([[0,0],[0,1],[1,0],[1,1]])
y = np.array([[0],[1],[1],[0]])  # XOR

W1 = np.random.randn(2,2)
b1 = np.zeros((1,2))
W2 = np.random.randn(2,1)
b2 = np.zeros((1,1))
lr = 0.1

print("Initial weights and biases:")
print(f"W1:\n{W1}\nb1:\n{b1}\nW2:\n{W2}\nb2:\n{b2}")
print("*" * 60)

for epoch in range(5000):
    # Feedforward
    z1 = X @ W1 + b1
    a1 = sigmoid(z1)
    z2 = a1 @ W2 + b2
    y_hat = sigmoid(z2)

    # Calculate loss
    epoch_loss = loss(y_hat, y)

    # Backprop. (y_hat - y) is the standard shortcut for dL/dz2 with a
    # sigmoid output (it is exact for cross-entropy loss; the exact MSE
    # gradient would carry an extra sigmoid_prime(z2) factor).
    dz2 = y_hat - y
    dW2 = a1.T @ dz2
    db2 = np.sum(dz2, axis=0, keepdims=True)
    dz1 = (dz2 @ W2.T) * sigmoid_prime(z1)
    dW1 = X.T @ dz1
    db1 = np.sum(dz1, axis=0, keepdims=True)

    # Update weights
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

    if epoch % 1000 == 0:
        print(f"Epoch {epoch}: Loss = {epoch_loss:.6f}")

print("\n" + "=" * 60)
print("Final predictions:")
for i in range(len(X)):
    print(f"Input: {X[i]} -> Predicted: {y_hat[i][0]:.4f}, Actual: {y[i][0]}")
Thanks for reading
~ Aayushya Tiwari
REFERENCES
- Andrej Karpathy’s micrograd video, or 3Blue1Brown’s backpropagation videos. They are the best for intuition.
- NN in NumPy book (PDF): my first reference for backprop.
- The original backpropagation paper (Rumelhart, Hinton & Williams, 1986).