Why is the stochastic gradient of a layer almost orthogonal to its weight?

In the paper Fixup Initialization: Residual Learning Without Normalization, on page 5, when discussing the effects of multipliers, the authors write:

Specifically, as the stochastic gradient of a layer is typically almost orthogonal to its weight, learning rate decay tends to cause the weight norm equilibrium to shrink when combined with L2 weight decay.
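
For reference, this is a minimal sketch I would use to check the orthogonality part of that sentence empirically (assuming PyTorch; the layer shape, batch size, and random data are arbitrary placeholders, not anything from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy setup: a single linear layer on a random mini-batch, just to
# measure the angle between the layer's weight and its stochastic
# gradient. Shapes and batch size are arbitrary choices.
layer = nn.Linear(256, 10)
x = torch.randn(64, 256)          # random "mini-batch"
y = torch.randint(0, 10, (64,))   # random labels

loss = F.cross_entropy(layer(x), y)
loss.backward()

w = layer.weight.flatten()
g = layer.weight.grad.flatten()
cos = torch.dot(w, g) / (w.norm() * g.norm())
print(f"cosine(weight, grad) = {cos.item():.4f}")  # near-orthogonal means close to 0
```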

I have two questions for this statement:

  1. Why is the stochastic gradient of a layer almost orthogonal to its weight?
  2. I can understand that L2 weight decay causes the weight norm to shrink, but why does learning rate decay tend to cause the weight norm equilibrium to shrink when combined with L2 weight decay? (My attempt at the update-norm arithmetic is sketched below the list.)
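
For question 2, this is how I currently read the argument; it assumes the near-orthogonality from question 1 and plain SGD with L2 weight decay, so the notation ($w_t$, $g_t$, $\eta_t$, $\lambda$) is mine, not the paper's:

```latex
% One SGD step with L2 weight decay (my notation, not the paper's):
%   w_{t+1} = (1 - \eta_t \lambda)\, w_t - \eta_t g_t
\begin{align*}
\|w_{t+1}\|^2 &= (1 - \eta_t\lambda)^2 \|w_t\|^2
               - 2\eta_t(1 - \eta_t\lambda)\,\langle w_t, g_t\rangle
               + \eta_t^2 \|g_t\|^2 \\
              &\approx (1 - \eta_t\lambda)^2 \|w_t\|^2 + \eta_t^2 \|g_t\|^2
               \qquad \text{(orthogonality: } \langle w_t, g_t\rangle \approx 0)
\end{align*}
% Equilibrium: set \|w_{t+1}\| = \|w_t\| and drop the O(\eta_t^2\lambda^2) term:
\begin{equation*}
\|w_{\mathrm{eq}}\|^2 \approx \frac{\eta_t \|g_t\|^2}{2\lambda}
\end{equation*}
% so, if this reading is right, decaying \eta_t shrinks the equilibrium weight norm.
```

Is this the intended reasoning, or am I missing something?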
