Gradient descent with Binary Cross-Entropy for single layer perceptron

by Henrique Andrade   Last Updated September 11, 2019 18:19 PM - source

I'm implementing a Single Layer Perceptron for binary classification in python. I'm using binary Cross-Entropy loss function and gradient descent.

The gradient descent is not converging, may be I'm doing it wrong. Here what I did:

We have a vector of weights $W = \begin{pmatrix} w_1, \dots, w_M \end{pmatrix}$, a matrix of $N$ samples $X = \begin{bmatrix} x_{11} & \dots & x_{N,1} \\ \vdots & \ddots & \vdots\\ x_{1M} & \dots & x_{NM} \end{bmatrix}$, with each column representing a sample, a sigmoid $\sigma(x) = \frac{1}{1 + e^{-x}}$ as activation function and vector of $N$ targets $Y = \begin{pmatrix} y_1, \dots, y_N \end{pmatrix}$.

On forward step we have $V = WX$ and the output is $Z = \sigma(V)$.

On backpropagation we update the weights as $W(n+1) = W(n) - \eta \frac{\partial L}{\partial W}$, where $L$ is the binary Cross-Entropy loss function: $L(Y, Z)-\frac{1}{N}\sum_{k = 1}^{N} y_k \log(z_k) + (1 - y_k) \log(1 - z_k)$. In matrix notation this function can be rewritten as $L(Y,Z) = -\frac{1}{N} \left (Y (\log(Z))^T + (1_N - Y) (\log(1_N - Z))^T \right )$.

I think that all is correct so far and may be I messed up on the derivates.

Applying chain rule: $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial Z} \frac{\partial Z}{\partial V} \frac{\partial V}{\partial W}$

$\frac{\partial V}{\partial W}$ is straightforward: $\frac{\partial V}{\partial W} = X$.

$\frac{\partial Z}{\partial V} = \sigma '(V)$

$\frac{\partial L}{\partial Z} = -\frac{1}{N} \left (Y\begin{pmatrix}\frac{1}{z_1}, \dots, \frac{1}{z_N}\end{pmatrix}^T + (1_N - Y) \left (1_N - \begin{pmatrix}\frac{1}{z_1 - 1}, \dots, \frac{1}{z_N - 1}\end{pmatrix} \right )^T \right)$

So finally, $W(n+1) = W(n) - \eta \left( -\frac{1}{N} \left (Y\begin{pmatrix}\frac{1}{z_1}, \dots, \frac{1}{z_N}\end{pmatrix}^T + (1_N - Y) \left (1_N - \begin{pmatrix}\frac{1}{z_1 - 1}, \dots, \frac{1}{z_N - 1}\end{pmatrix} \right )^T \right) \sigma '(V) X \right)$

Are those derivates correct?

Related Questions

Gradient of softmax with cross entropy loss

Updated March 04, 2017 10:19 AM

Gradient of the cross entropy loss function

Updated November 25, 2018 11:19 AM