In the paper *Fixup Initialization: Residual Learning Without Normalization*, on page 5, when discussing the effects of multipliers, the authors mention:

> Specifically, as the stochastic gradient of a layer is typically almost orthogonal to its weight, learning rate decay tends to cause the weight norm equilibrium to shrink when combined with L2 weight decay.

I have two questions about this statement:

- Why is the stochastic gradient of a layer almost orthogonal to its weight?
- I can understand that L2 weight decay causes the weight norm to shrink. But why does learning rate decay tend to shrink the weight norm *equilibrium* when combined with L2 weight decay?
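For context on what I have tried so far: the two claims can be probed numerically. In high dimensions, two independent random vectors are nearly orthogonal (cosine similarity on the order of 1/√d), and if the gradient is orthogonal to the weight, an SGD step with L2 decay updates the squared norm as ||w||² → (1 − η·λ)²·||w||² + η²·||g||², whose fixed point ||w||² ≈ η·G²/(2λ) shrinks as the learning rate η decays. A minimal sketch (the learning rates, decay coefficient `wd`, and gradient norm `G` below are illustrative choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Part 1: in high dimensions, a random "gradient" direction is almost
# orthogonal to the weight vector (cosine similarity ~ 1/sqrt(d)).
d = 10_000
w = rng.standard_normal(d)
g = rng.standard_normal(d)  # stand-in for a stochastic gradient
cos = w @ g / (np.linalg.norm(w) * np.linalg.norm(g))
print(f"cosine(w, g) = {cos:.4f}")  # close to 0

# Part 2: SGD with L2 decay and a gradient of fixed norm G orthogonal to w.
# Update: w <- (1 - lr*wd)*w - lr*g, with g ⟂ w, so
#   ||w||^2 -> (1 - lr*wd)^2 * ||w||^2 + lr^2 * G^2.
# Fixed point: ||w||^2 ≈ lr * G^2 / (2 * wd), which shrinks as lr decays.
def equilibrium_norm(lr, wd=1e-2, G=1.0, steps=500_000):
    n2 = 1.0  # ||w||^2, starting from norm 1
    for _ in range(steps):
        n2 = (1 - lr * wd) ** 2 * n2 + (lr * G) ** 2
    return np.sqrt(n2)

for lr in (0.1, 0.01, 0.001):
    predicted = np.sqrt(lr / (2 * 1e-2))  # sqrt(lr * G^2 / (2 * wd))
    print(f"lr={lr:<6} simulated ||w|| ≈ {equilibrium_norm(lr):.3f}, "
          f"predicted ≈ {predicted:.3f}")
```

The simulated equilibrium norm tracks the √(η·G²/(2λ)) prediction, which is (as I understand it) the mechanism behind the quoted sentence: decaying η lowers the equilibrium that the orthogonal gradient and the L2 shrinkage balance out at. I would still like a more rigorous explanation of why real gradients are nearly orthogonal to the weights.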
