A Detailed Derivation of Backpropagation

September 7, 2022

Outline

There are several approaches to calculating backpropagation (BP) for multi-layer perceptrons (MLPs). Our goal is to compute the gradient of the loss with respect to the weight matrix for each layer, which represents a scalar-to-matrix derivative. We can use the following methods:

  • Calculate $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}}$ directly using matrix-to-matrix gradients. (We won't use this approach because matrix-to-matrix gradients are difficult to compute.)
  • Calculate $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}}$ directly while avoiding vector-to-matrix gradients. (We won't use this approach either, as it's still quite challenging.)
  • Calculate $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}_{i,j}}$ for each element and then assemble the results into a matrix. (This is the approach we'll adopt.)

For our chosen method, computing $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}_{i,j}}$ is sufficient to determine all gradients, and you could write a for-loop to complete the update, but this approach isn't efficient. Modern accelerators like GPUs can significantly speed up matrix multiplication operations. Therefore, we still need to assemble the scalar results into matrix form.

We'll first examine the gradient calculation for a single example. Since the SGD algorithm requires a batch of examples, we'll then extend our results to handle batches of data.

Preliminaries

To make our calculations more straightforward, we'll first introduce three key concepts you should understand: denominator layout, the multivariable chain rule, and matrix assembly.

Denominator Layout: In the deep learning community, derivatives are computed using the denominator layout by default. This means that a scalar-to-matrix gradient results in a matrix with the same shape as the original matrix. If the result instead has the shape of the transpose of the original matrix, then we're using the numerator layout. You can learn more about these layout conventions from the Wikipedia article on matrix calculus.
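
For example, under the denominator layout, if $\mathcal{L}=\mathbf{x}^\top\mathbf{W}\mathbf{y}$ with $\mathbf{W}\in\mathbb{R}^{m\times n}$, $\mathbf{x}\in\mathbb{R}^{m}$, and $\mathbf{y}\in\mathbb{R}^{n}$, then $\frac{\partial \mathcal{L}}{\partial \mathbf{W}_{i,j}}=\mathbf{x}_i\mathbf{y}_j$, so

$$
\frac{\partial \mathcal{L}}{\partial \mathbf{W}}=\mathbf{x}\mathbf{y}^\top \in \mathbb{R}^{m\times n},
$$

which has the same shape as $\mathbf{W}$; the numerator layout would instead give the transpose $\mathbf{y}\mathbf{x}^\top$.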

Multivariable Chain Rule: We need to understand the multivariable chain rule. If $x=x(t)$, $y=y(t)$, and $z=f(x,y)$, then we have $\frac{\partial z}{\partial t}=\frac{\partial f}{\partial x}\frac{\partial x}{\partial t}+\frac{\partial f}{\partial y}\frac{\partial y}{\partial t}$. Here's an article on the multivariable chain rule. Since we calculate gradients in a scalar manner, a function that accepts a vector or matrix becomes a multivariable function. For example, $f(\mathbf{x})$ becomes $f(\mathbf{x}_0, \mathbf{x}_1, \cdots, \mathbf{x}_n)$. Therefore, in the following derivation, we'll always use the multivariable chain rule.
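
As a concrete check of this rule, take $z=xy$ with $x=\cos t$ and $y=\sin t$. The rule gives

$$
\frac{\partial z}{\partial t}=\frac{\partial f}{\partial x}\frac{\partial x}{\partial t}+\frac{\partial f}{\partial y}\frac{\partial y}{\partial t}=y\cdot(-\sin t)+x\cdot\cos t=\cos^2 t-\sin^2 t=\cos 2t,
$$

which matches differentiating $z=\sin t\cos t=\frac{1}{2}\sin 2t$ directly.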

Matrix Assembly: Finally, we need to understand how to assemble a vector or matrix. For a matrix $\mathbf{W}$, we can write it as $\mathbf{W}=[\mathbf{W}_{i,j}]_{i,j}$, where the indices $i$ and $j$ indicate that we iterate through each row and column. Similarly, for a vector, we have $\mathbf{v}=[\mathbf{v}_i]_i$. Assembly can be thought of as the reverse process of matrix multiplication. For example, the outer product of two column vectors is $\mathbf{x}\mathbf{y}^\top=[\mathbf{x}_i\mathbf{y}_j]_{i,j}$. Thus, if we obtain a scalar result where $\mathbf{W}_{i,j}=\mathbf{x}_i\mathbf{y}_j$, we can assemble it into the matrix $\mathbf{W}=\mathbf{x}\mathbf{y}^\top$.
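
As a quick numerical illustration (a minimal NumPy sketch, not part of the derivation), computing $\mathbf{W}_{i,j}=\mathbf{x}_i\mathbf{y}_j$ element by element gives exactly the matrix assembled by the outer product:

import numpy as np

# Element-wise construction of W_ij = x_i * y_j ...
x = np.array([1.0, 2.0, 3.0])      # column vector x with 3 entries
y = np.array([4.0, 5.0])           # column vector y with 2 entries
W_elementwise = np.array([[x[i] * y[j] for j in range(len(y))]
                          for i in range(len(x))])

# ... equals the assembled outer product x y^T, shape [3, 2]
W_assembled = np.outer(x, y)
assert np.allclose(W_elementwise, W_assembled)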

Definitions

Notation:

  • Scalars: $x, y, \cdots$
  • Vectors: $\mathbf{x}, \mathbf{y}, \cdots$
  • Matrices: $\mathbf{X}, \mathbf{Y}, \cdots$
  • Subscript notation: $\mathbf{x}_i$ represents the $i$-th element of vector $\mathbf{x}$, which is a scalar.
  • Indicator function: $\mathbf{1}_{ij}$ equals 1 if $i=j$ and 0 otherwise.

Network Architecture:

  • Input: $\mathbf{x}$ with shape $[n, 1]$
  • Label: $\mathbf{y}$ with shape $[c, 1]$ (one-hot encoded)
  • Number of layers: $L$
  • Number of classes: $c$
  • Linear transformation: $\mathbf{Wx+b}$
  • Weight matrix at layer $l$: $\mathbf{W}^{(l)}$

Weight Matrix Shapes:

  • First layer: $\mathbf{W}^{(1)}$ with shape $[m^{(1)}, m^{(0)}=n]$
  • Hidden layers (from $2$ to $L-1$): $\mathbf{W}^{(l)}$ with shape $[m^{(l)}, m^{(l-1)}]$
  • Last layer: $\mathbf{W}^{(L)}$ with shape $[c, m^{(L-1)}]$

Activation Function and Activations:

  • Activation function: $f$
  • Activation at layer $l$: $\mathbf{a}^{(l)}$

Activation Shapes:

  • Input: $\mathbf{a}^{(0)}$ with shape $[n, 1]$
  • Hidden layer activations (from $1$ to $L-1$): $\mathbf{a}^{(l)}$ with shape $[m^{(l)}, 1]$
  • Output (last layer, softmax probabilities): $\mathbf{a}^{(L)}$ with shape $[c, 1]$

Forward Pass

$$
\begin{align*}
\mathbf{a}^{(0)} &= \mathbf{x} \\
\mathbf{z}^{(1)} &= \mathbf{W}^{(1)}\mathbf{a}^{(0)} + \mathbf{b}^{(1)} \\
\mathbf{a}^{(1)} &= f(\mathbf{z}^{(1)}) \\
\vdots & \quad\vdots \\
\mathbf{z}^{(l)} &= \mathbf{W}^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)} \\
\mathbf{a}^{(l)} &= f(\mathbf{z}^{(l)}) \\
\vdots & \quad\vdots \\
\mathbf{z}^{(L)} &= \mathbf{W}^{(L)}\mathbf{a}^{(L-1)} + \mathbf{b}^{(L)} \\
\mathbf{a}^{(L)} &= \text{softmax}(\mathbf{z}^{(L)}) \\
\mathcal{L} &= \text{CE}(\mathbf{a}^{(L)}, \mathbf{y})
\end{align*}
$$

where the cross-entropy loss is $\mathcal{L}=\text{CE}(\mathbf{a}^{(L)},\mathbf{y})=-\sum_{i=1}^c \mathbf{y}_i\log(\mathbf{a}_i^{(L)})$.
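
As a reference for the equations above, here is a minimal single-example forward pass in NumPy (a sketch with hypothetical helper names `forward` and `cross_entropy`, hypothetical layer sizes, and $\tanh$ standing in for the hidden activation $f$):

import numpy as np

def forward(x, Ws, bs, f=np.tanh):
    a = x                                  # a^(0) = x, shape [n, 1]
    for W, b in zip(Ws[:-1], bs[:-1]):
        z = W @ a + b                      # z^(l) = W^(l) a^(l-1) + b^(l)
        a = f(z)                           # a^(l) = f(z^(l))
    z = Ws[-1] @ a + bs[-1]                # last-layer logits z^(L)
    e = np.exp(z - z.max())
    return e / e.sum()                     # a^(L) = softmax(z^(L)), shape [c, 1]

def cross_entropy(a_L, y):                 # y is one-hot with shape [c, 1]
    return -float(np.sum(y * np.log(a_L)))

rng = np.random.default_rng(0)
Ws = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]   # W^(1), W^(2)
bs = [rng.standard_normal((4, 1)), rng.standard_normal((2, 1))]   # b^(1), b^(2)
x = rng.standard_normal((3, 1))            # input, n = 3
y = np.array([[1.0], [0.0]])               # one-hot label, c = 2
print(cross_entropy(forward(x, Ws, bs), y))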

Our goal is to calculate the gradient of $\mathcal{L}$ with respect to $\mathbf{W}^{(l)}$ and $\mathbf{b}^{(l)}$. In this derivation, we'll focus on computing the gradient with respect to $\mathbf{W}^{(l)}$. The gradient with respect to $\mathbf{b}^{(l)}$ follows a similar pattern.

The Last Layer

Since the last layer differs from the other layers (it uses softmax instead of a regular activation function), we'll calculate its gradient separately.

In the forward pass, $\mathbf{a}^{(L)}_i=\frac{e^{\mathbf{z}^{(L)}_i}}{\sum_{k=1}^c e^{\mathbf{z}^{(L)}_k}}$ represents the normalized probability output by the softmax layer. A straightforward calculation of the softmax gradient yields $\frac{\partial \mathbf{a}^{(L)}_i}{\partial \mathbf{z}^{(L)}_j}=\mathbf{a}^{(L)}_i(\mathbf{1}_{ij}-\mathbf{a}^{(L)}_j)$.
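
If you want to convince yourself of this Jacobian, here is a small NumPy check (a sketch, not part of the derivation) that compares $\mathbf{a}_i(\mathbf{1}_{ij}-\mathbf{a}_j)$ against central finite differences:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())                # shift for numerical stability
    return e / e.sum()

z = np.array([0.3, -1.2, 2.0, 0.5])
a = softmax(z)
analytic = np.diag(a) - np.outer(a, a)     # J[i, j] = a_i * (1{i=j} - a_j)

eps = 1e-6
numeric = np.zeros((len(z), len(z)))
for j in range(len(z)):
    dz = np.zeros_like(z)
    dz[j] = eps
    numeric[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-6)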

To calculate the gradient $\frac{\partial \mathcal{L}}{\partial \mathbf{z}_j^{(L)}}$, we use the chain rule by introducing the activations $\mathbf{a}^{(L)}$. An important point here is that when we work in scalar form, most functions we encounter are multivariable functions, so we need to use the multivariable chain rule. For example, $\mathcal{L}$ is the result of the cross-entropy function applied to $\mathbf{a}_1^{(L)}, \mathbf{a}_2^{(L)}, \dots, \mathbf{a}_{c}^{(L)}$. According to the multivariable chain rule, we have $\frac{\partial \mathcal{L}}{\partial \mathbf{z}_j^{(L)}}=\sum_{i=1}^c\frac{\partial \mathcal{L}}{\partial \mathbf{a}_i^{(L)}}\frac{\partial \mathbf{a}_i^{(L)}}{\partial \mathbf{z}_j^{(L)}}$.

Thus, we have

$$
\begin{align*}
\frac{\partial \mathcal{L}}{\partial \mathbf{z}_j^{(L)}} &= \sum_{i=1}^c\frac{\partial \mathcal{L}}{\partial \mathbf{a}_i^{(L)}}\frac{\partial \mathbf{a}_i^{(L)}}{\partial \mathbf{z}_j^{(L)}} \\
&= \sum_{i=1}^c\left(-\frac{\mathbf{y}_i}{\mathbf{a}_i^{(L)}}\right)\frac{\partial \mathbf{a}_i^{(L)}}{\partial \mathbf{z}_j^{(L)}} \\
&= -\sum_{i=1}^c\mathbf{y}_i(\mathbf{1}_{ij}-\mathbf{a}^{(L)}_j) \\
&= -\sum_{i=1}^c\mathbf{y}_i\mathbf{1}_{ij}+\mathbf{a}_j^{(L)}\sum_{i=1}^c\mathbf{y}_i \\
&= \mathbf{a}^{(L)}_j-\mathbf{y}_j
\end{align*}
$$

where the last step uses the fact that $\sum_{i=1}^c\mathbf{y}_i=1$ because $\mathbf{y}$ is one-hot.

Now we assemble $\frac{\partial \mathcal{L}}{\partial \mathbf{z}_j^{(L)}}$ into a vector: $\frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(L)}}=\mathbf{a}^{(L)}-\mathbf{y}$. The detailed assembly process is $\frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(L)}}=[\frac{\partial \mathcal{L}}{\partial \mathbf{z}_j^{(L)}}]_{j=1}^{c}=[\mathbf{a}^{(L)}_j-\mathbf{y}_j]_{j=1}^{c}=[\mathbf{a}^{(L)}_j]_{j=1}^{c}-[\mathbf{y}_j]_{j=1}^{c}=\mathbf{a}^{(L)}-\mathbf{y}$.
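
The combined softmax-plus-cross-entropy gradient $\mathbf{a}^{(L)}-\mathbf{y}$ can also be checked numerically (a small sketch under the same one-hot assumption):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(z, y):                            # cross-entropy on top of softmax
    return -np.sum(y * np.log(softmax(z)))

z = np.array([0.1, 1.5, -0.7])
y = np.array([0.0, 1.0, 0.0])              # one-hot label
analytic = softmax(z) - y                  # a^(L) - y

eps = 1e-6
numeric = np.array([(loss(z + eps * np.eye(3)[j], y)
                     - loss(z - eps * np.eye(3)[j], y)) / (2 * eps)
                    for j in range(3)])
assert np.allclose(analytic, numeric, atol=1e-6)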

Gradients for Weights

We can define the error term as $\mathbf{\delta}^{(l)}=\frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l)}}$, which is a column vector that simplifies our chain rule calculations. For the last layer, $\mathbf{\delta}^{(L)}=\mathbf{a}^{(L)}-\mathbf{y}$.

If we can calculate $\mathbf{\delta}^{(l)}$ for all layers $l$, then we can calculate $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}}$ for all layers. Since we want to use the chain rule to connect $\mathcal{L}$ and $\mathbf{W}^{(l)}$ through $\mathbf{z}^{(l)}$, and $\mathcal{L}$ depends on every component of $\mathbf{z}^{(l)}$, we again need the multivariable chain rule: $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}_{i,j}}=\sum_{k}\frac{\partial \mathcal{L}}{\partial \mathbf{z}_k^{(l)}}\frac{\partial \mathbf{z}_k^{(l)}}{\partial \mathbf{W}^{(l)}_{i,j}}$.

Let's recall the matrix multiplication in $\mathbf{z}^{(l)}= \mathbf{W}^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$. We have $\mathbf{z}^{(l)}_i=\sum_{j=1}^{m^{(l-1)}} \mathbf{W}^{(l)}_{i,j}\mathbf{a}^{(l-1)}_j+\mathbf{b}^{(l)}_i$. From this, we can see that $\frac{\partial \mathbf{z}^{(l)}_i}{\partial \mathbf{W}^{(l)}_{i,j}}=\mathbf{a}^{(l-1)}_j$, and $\frac{\partial \mathbf{z}^{(l)}_k}{\partial \mathbf{W}^{(l)}_{i,j}}=0$ for $k\neq i$ (that is, $\mathbf{W}_{i,j}^{(l)}$ only affects the calculation of $\mathbf{z}_i^{(l)}$). Therefore, we have:

$$
\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}_{i,j}}=\frac{\partial \mathcal{L}}{\partial \mathbf{z}_i^{(l)}}\frac{\partial \mathbf{z}_i^{(l)}}{\partial \mathbf{W}^{(l)}_{i,j}}=\mathbf{\delta}_i^{(l)}\mathbf{a}_j^{(l-1)}
$$

Now let's assemble the result into matrix form. We have $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}}=[\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}_{i,j}}]_{i,j}=[\mathbf{\delta}_i^{(l)}\mathbf{a}_j^{(l-1)}]_{i,j}=\mathbf{\delta}^{(l)}\mathbf{a}^{(l-1)\top}$.

For the last layer specifically, we have $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(L)}_{i,j}}=(\mathbf{a}^{(L)}_i-\mathbf{y}_i)\mathbf{a}_j^{(L-1)}$, and $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(L)}}=(\mathbf{a}^{(L)}-\mathbf{y})\mathbf{a}^{(L-1)\top}$.
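
In NumPy terms, $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}}$ is simply an outer product, as in this small shape sketch with hypothetical sizes $m^{(l)}=4$ and $m^{(l-1)}=3$:

import numpy as np

m_l, m_prev = 4, 3
delta_l = np.random.randn(m_l, 1)          # delta^(l), shape [m(l), 1]
a_prev = np.random.randn(m_prev, 1)        # a^(l-1), shape [m(l-1), 1]

grad_W = delta_l @ a_prev.T                # outer product delta^(l) a^(l-1)^T
assert grad_W.shape == (m_l, m_prev)       # same shape as W^(l)
assert np.isclose(grad_W[1, 2], delta_l[1, 0] * a_prev[2, 0])   # entry (i, j) = delta_i * a_j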

Backpropagation Through Layers

Now the only remaining task is to calculate $\mathbf{\delta}^{(l)}$ for all layers $l$, which we can do recursively using the chain rule:

$$
\begin{align*}
\mathbf{\delta}_j^{(l-1)} &= \frac{\partial \mathcal{L}}{\partial \mathbf{z}_j^{(l-1)}} \\
&= \sum_{k}\frac{\partial \mathcal{L}}{\partial \mathbf{z}_k^{(l)}}\frac{\partial \mathbf{z}_k^{(l)}}{\partial \mathbf{z}_j^{(l-1)}} \\
&= \sum_{k}\frac{\partial \mathcal{L}}{\partial \mathbf{z}_k^{(l)}}\frac{\partial \mathbf{z}_k^{(l)}}{\partial \mathbf{a}_j^{(l-1)}}\frac{\partial \mathbf{a}_j^{(l-1)}}{\partial \mathbf{z}_j^{(l-1)}}
\end{align*}
$$

The second step follows because $\mathbf{z}_j^{(l-1)}$ contributes to every $\mathbf{z}_k^{(l)}$ through the linear transformation. The third step follows because $\mathbf{z}_j^{(l-1)}$ only affects $\mathbf{a}_j^{(l-1)}$ through the nonlinear transformation.

Since $\mathbf{a}_j^{(l-1)}=f(\mathbf{z}_j^{(l-1)})$, we have $\frac{\partial \mathbf{a}_j^{(l-1)}}{\partial \mathbf{z}_j^{(l-1)}}=f'(\mathbf{z}_j^{(l-1)})$. Since $\mathbf{z}^{(l)}=\mathbf{W}^{(l)}\mathbf{a}^{(l-1)}+\mathbf{b}^{(l)}$, from the matrix multiplication we have $\frac{\partial \mathbf{z}_k^{(l)}}{\partial \mathbf{a}_j^{(l-1)}}=\mathbf{W}^{(l)}_{k,j}$. Therefore, we have:

$$
\mathbf{\delta}_j^{(l-1)}=\sum_{k}\mathbf{\delta}_k^{(l)}\mathbf{W}^{(l)}_{k,j}f'(\mathbf{z}_j^{(l-1)})
$$

Now let's assemble the result into vector form. We have:

$$
\begin{align*}
\mathbf{\delta}^{(l-1)} &= [\mathbf{\delta}_j^{(l-1)}]_{j=1}^{m^{(l-1)}}=\left[\sum_{k}\mathbf{\delta}_k^{(l)}\mathbf{W}^{(l)}_{k,j}f'(\mathbf{z}_j^{(l-1)})\right]_{j=1}^{m^{(l-1)}} \\
&= \left[\sum_{k}\mathbf{\delta}_k^{(l)}\mathbf{W}^{(l)}_{k,j}\right]_{j=1}^{m^{(l-1)}}\odot\left[f'(\mathbf{z}_j^{(l-1)})\right]_{j=1}^{m^{(l-1)}} \\
&= \left[(\mathbf{W}^{(l)\top}\mathbf{\delta}^{(l)})_j\right]_{j=1}^{m^{(l-1)}}\odot f'(\mathbf{z}^{(l-1)}) \\
&= \mathbf{W}^{(l)\top}\mathbf{\delta}^{(l)}\odot f'(\mathbf{z}^{(l-1)})
\end{align*}
$$
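
The recursion is easy to check numerically. Here is a sketch of one backward step with hypothetical sizes and $\tanh$ standing in for $f$ (so $f'(z)=1-\tanh^2(z)$), verifying that one entry of the assembled vector matches the element-wise sum:

import numpy as np

m_l, m_prev = 4, 3
W_l = np.random.randn(m_l, m_prev)         # W^(l), shape [m(l), m(l-1)]
delta_l = np.random.randn(m_l, 1)          # delta^(l), shape [m(l), 1]
z_prev = np.random.randn(m_prev, 1)        # z^(l-1), shape [m(l-1), 1]
d_f = lambda z: 1.0 - np.tanh(z) ** 2      # f'(z) for f = tanh

delta_prev = (W_l.T @ delta_l) * d_f(z_prev)          # assembled form, shape [m(l-1), 1]
j = 0
manual = sum(delta_l[k, 0] * W_l[k, j] for k in range(m_l)) * d_f(z_prev[j, 0])
assert np.isclose(delta_prev[j, 0], manual)           # matches the element-wise formula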

Batch Processing

We can extend the above calculations to handle batches of data. In our previous discussion, each example was represented as a column vector, but in deep learning programming, we typically represent examples as rows of a matrix. We have $\mathbf{X}=[\mathbf{x}_1,\mathbf{x}_2,\cdots,\mathbf{x}_b]^\top$ with shape $[b, n]$, where each $\mathbf{x}_i$ is a column vector. Similarly, we have $\mathbf{Y}=[\mathbf{y}_1,\mathbf{y}_2,\cdots,\mathbf{y}_b]^\top$ with shape $[b, c]$, where each $\mathbf{y}_i$ is a column vector.

The total loss is $\mathcal{L}=\frac{1}{b}\sum_{i=1}^b\mathcal{L}(\mathbf{x}_i,\mathbf{y}_i)$. The vectors $\mathbf{a}^{(l)},\mathbf{z}^{(l)},\mathbf{\delta}^{(l)}$ become matrices $\mathbf{A}^{(l)},\mathbf{Z}^{(l)},\mathbf{\Delta}^{(l)}$, respectively, all with shape $[b, m^{(l)}]$. For the linear transformation, we have:

$$
\mathbf{Z}^{(l)}=\mathbf{A}^{(l-1)}\mathbf{W}^{(l)\top}+\mathbf{B}^{(l)}
$$

where $\mathbf{B}^{(l)}$ is formed by stacking the bias vector $\mathbf{b}^{(l)}$ across all examples (every row of $\mathbf{B}^{(l)}$ is $\mathbf{b}^{(l)\top}$). For the nonlinear transformation, we have $\mathbf{A}^{(l)}=f(\mathbf{Z}^{(l)})$. (Alternatively, you could define the weight matrix $\mathbf{W}$ as the transpose of our original definition to avoid the transpose in the linear transformation.)
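
In code, the stacked bias is usually handled by broadcasting rather than by materializing $\mathbf{B}^{(l)}$. A small NumPy sketch with hypothetical sizes ($b=8$, $m^{(l-1)}=3$, $m^{(l)}=4$):

import numpy as np

b, m_prev, m_l = 8, 3, 4
A_prev = np.random.randn(b, m_prev)        # A^(l-1), shape [b, m(l-1)]
W_l = np.random.randn(m_l, m_prev)         # W^(l), shape [m(l), m(l-1)]
bias = np.random.randn(m_l)                # b^(l), broadcast over the batch (plays the role of B^(l))

Z_l = A_prev @ W_l.T + bias                # Z^(l) = A^(l-1) W^(l)^T + B^(l), shape [b, m(l)]
assert Z_l.shape == (b, m_l)
k = 0
assert np.allclose(Z_l[k], W_l @ A_prev[k] + bias)    # row k equals the per-example vector form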

Note that since our previous discussion for single vectors was done element-wise, the derivation for the matrix case with batches of data follows a similar pattern.

Loss and Softmax

For the loss and softmax, $\frac{\partial \mathcal{L}}{\partial \mathbf{Z}^{(L)}_{i,j}}=\frac{1}{b}(\mathbf{A}^{(L)}_{i,j}-\mathbf{Y}_{i,j})$. The assembly process is straightforward and leads to $\frac{\partial \mathcal{L}}{\partial \mathbf{Z}^{(L)}}=\frac{1}{b}(\mathbf{A}^{(L)}-\mathbf{Y})$.

Gradients for Weights

For the gradients of the weights, the linear transformation is equivalent to $\mathbf{Z}^{(l)\top}=\mathbf{W}^{(l)}\mathbf{A}^{(l-1)\top}+\mathbf{B}^{(l)\top}$, which is closer to the vector form. Since the weight matrix is involved in the calculation for every example, using the multivariable chain rule, we have:

$$
\frac{\partial\mathcal{L}}{\partial \mathbf{W}_{i,j}^{(l)}}=\sum_{k=1}^{b}\frac{\partial \mathcal{L}}{\partial \mathbf{Z}_{k,i}^{(l)}}\frac{\partial \mathbf{Z}_{k,i}^{(l)}}{\partial \mathbf{W}_{i,j}^{(l)}}=\sum_{k=1}^b\mathbf{\Delta}_{k,i}^{(l)}\mathbf{A}_{k,j}^{(l-1)}
$$

where $k$ indexes the examples in the batch, so $\mathbf{Z}^{(l)}_{k,i}$ is the $i$-th unit of example $k$.

After assembly, we have:

$$
\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}}=\mathbf{\Delta}^{(l)\top}\mathbf{A}^{(l-1)}
$$

Let's perform a quick shape check to verify our result. $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}}$ should have the same shape as $\mathbf{W}^{(l)}$, which is $[m^{(l)}, m^{(l-1)}]$. $\mathbf{\Delta}^{(l)\top}$ has shape $[m^{(l)}, b]$, and $\mathbf{A}^{(l-1)}$ has shape $[b, m^{(l-1)}]$. Thus, $\mathbf{\Delta}^{(l)\top}\mathbf{A}^{(l-1)}$ has shape $[m^{(l)}, m^{(l-1)}]$, which matches $\mathbf{W}^{(l)}$.
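
Another quick sanity check (a NumPy sketch with hypothetical sizes): the batch formula $\mathbf{\Delta}^{(l)\top}\mathbf{A}^{(l-1)}$ is exactly the sum of the per-example outer products $\mathbf{\delta}^{(l)}\mathbf{a}^{(l-1)\top}$:

import numpy as np

b, m_l, m_prev = 8, 4, 3
Delta = np.random.randn(b, m_l)            # rows are the per-example delta^(l)
A_prev = np.random.randn(b, m_prev)        # rows are the per-example a^(l-1)

per_example_sum = sum(np.outer(Delta[k], A_prev[k]) for k in range(b))
assert np.allclose(per_example_sum, Delta.T @ A_prev)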

Backpropagation Through Layers

Finally, for the update from $\mathbf{\Delta}^{(l)}$ to $\mathbf{\Delta}^{(l-1)}$, since each example is independent of the others, it's easy to see that $\mathbf{\Delta}^{(l-1)\top}=\mathbf{W}^{(l)\top}\mathbf{\Delta}^{(l)\top}\odot f'(\mathbf{Z}^{(l-1)\top})$, which means $\mathbf{\Delta}^{(l-1)}=\mathbf{\Delta}^{(l)}\mathbf{W}^{(l)}\odot f'(\mathbf{Z}^{(l-1)})$. This is a direct extension of the vector form.
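
To see that this really is the row-wise application of the vector recursion, here is a small sketch (hypothetical sizes, $\tanh$ standing in for $f$):

import numpy as np

b, m_l, m_prev = 8, 4, 3
W_l = np.random.randn(m_l, m_prev)         # W^(l)
Delta_l = np.random.randn(b, m_l)          # Delta^(l)
Z_prev = np.random.randn(b, m_prev)        # Z^(l-1)
d_f = lambda z: 1.0 - np.tanh(z) ** 2      # f'(z) for f = tanh

batch = Delta_l @ W_l * d_f(Z_prev)        # Delta^(l-1), all examples at once
rows = np.stack([(W_l.T @ Delta_l[k]) * d_f(Z_prev[k]) for k in range(b)])
assert np.allclose(batch, rows)            # matches the per-example vector recursion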

Implementation

Now we have the complete derivation of the backpropagation algorithm:

$$
\begin{align*}
\mathbf{\Delta}^{(L)} &= \frac{1}{b}(\mathbf{A}^{(L)}-\mathbf{Y}) \\
\mathbf{\Delta}^{(l)} &= \mathbf{\Delta}^{(l+1)}\mathbf{W}^{(l+1)}\odot f'(\mathbf{Z}^{(l)}) \\
\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} &= \mathbf{\Delta}^{(l)\top}\mathbf{A}^{(l-1)}
\end{align*}
$$

Now we can implement the backward pass easily. The NumPy code for a multilayer perceptron without bias is as follows. Note that each matrix in $\mathbf{W}$ strictly follows the definition above and is consistent with PyTorch's nn.Linear (shape [out_features, in_features]).

# PyTorch-style API implementation (NumPy)
# W is a list of weight matrices; W[l-1] corresponds to W^(l) in the derivation
# and has shape [m(l), m(l-1)], the same convention as PyTorch's nn.Linear.
import numpy as np

def f(z):                          # hidden activation (ReLU here)
    return np.maximum(z, 0)

def d_f(z):                        # derivative f'(z)
    return (z > 0).astype(z.dtype)

def softmax(z):                    # row-wise softmax
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward_pass(X, Y, W):
    L = len(W)
    A = [X] + [None] * L           # A[l] holds A^(l); A[0] is the input X
    Z = [None] * (L + 1)           # Z[l] holds Z^(l); Z[0] is unused
    for l in range(1, L):          # hidden layers 1 .. L-1
        Z[l] = A[l-1] @ W[l-1].T
        A[l] = f(Z[l])
    Z[L] = A[L-1] @ W[L-1].T       # last-layer logits
    A[L] = softmax(Z[L])
    loss = -np.sum(Y * np.log(A[L])) / Y.shape[0]   # batch-averaged cross-entropy
    return loss, A, Z

def backward_pass(Y, A, Z, W):
    L = len(W)
    Delta = [None] * (L + 1)
    W_g = [None] * L
    Delta[L] = (A[L] - Y) / Y.shape[0]              # Delta^(L) = (A^(L) - Y) / b
    for l in range(L, 0, -1):
        W_g[l-1] = Delta[l].T @ A[l-1]              # dL/dW^(l) = Delta^(l)^T A^(l-1)
        if l > 1:
            Delta[l-1] = Delta[l] @ W[l-1] * d_f(Z[l-1])   # Delta^(l-1)
    return W_g

loss, A, Z = forward_pass(X, Y, W)
W_g = backward_pass(Y, A, Z, W)    # per-layer gradients, analogous to PyTorch's loss.backward()