# Gradient descent with Binary Cross-Entropy for single layer perceptron

by Henrique Andrade   Last Updated September 11, 2019 18:19 PM - source

I'm implementing a Single Layer Perceptron for binary classification in python. I'm using binary Cross-Entropy loss function and gradient descent.

The gradient descent is not converging, may be I'm doing it wrong. Here what I did:

We have a vector of weights $$W = \begin{pmatrix} w_1, \dots, w_M \end{pmatrix}$$, a matrix of $$N$$ samples $$X = \begin{bmatrix} x_{11} & \dots & x_{N,1} \\ \vdots & \ddots & \vdots\\ x_{1M} & \dots & x_{NM} \end{bmatrix}$$, with each column representing a sample, a sigmoid $$\sigma(x) = \frac{1}{1 + e^{-x}}$$ as activation function and vector of $$N$$ targets $$Y = \begin{pmatrix} y_1, \dots, y_N \end{pmatrix}$$.

On forward step we have $$V = WX$$ and the output is $$Z = \sigma(V)$$.

On backpropagation we update the weights as $$W(n+1) = W(n) - \eta \frac{\partial L}{\partial W}$$, where $$L$$ is the binary Cross-Entropy loss function: $$L(Y, Z)-\frac{1}{N}\sum_{k = 1}^{N} y_k \log(z_k) + (1 - y_k) \log(1 - z_k)$$. In matrix notation this function can be rewritten as $$L(Y,Z) = -\frac{1}{N} \left (Y (\log(Z))^T + (1_N - Y) (\log(1_N - Z))^T \right )$$.

I think that all is correct so far and may be I messed up on the derivates.

Applying chain rule: $$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial Z} \frac{\partial Z}{\partial V} \frac{\partial V}{\partial W}$$

$$\frac{\partial V}{\partial W}$$ is straightforward: $$\frac{\partial V}{\partial W} = X$$.

$$\frac{\partial Z}{\partial V} = \sigma '(V)$$

$$\frac{\partial L}{\partial Z} = -\frac{1}{N} \left (Y\begin{pmatrix}\frac{1}{z_1}, \dots, \frac{1}{z_N}\end{pmatrix}^T + (1_N - Y) \left (1_N - \begin{pmatrix}\frac{1}{z_1 - 1}, \dots, \frac{1}{z_N - 1}\end{pmatrix} \right )^T \right)$$

So finally, $$W(n+1) = W(n) - \eta \left( -\frac{1}{N} \left (Y\begin{pmatrix}\frac{1}{z_1}, \dots, \frac{1}{z_N}\end{pmatrix}^T + (1_N - Y) \left (1_N - \begin{pmatrix}\frac{1}{z_1 - 1}, \dots, \frac{1}{z_N - 1}\end{pmatrix} \right )^T \right) \sigma '(V) X \right)$$

Are those derivates correct?

Tags :