Best practices for applying layer normalization in recurrent networks

by Aziz   Last Updated October 06, 2018 12:19 PM

I'm trying to add layer normalization at the encoder level of the Listen, Attend and Spell model for speech recognition.

I have run several experiments to make the model converge faster than my baseline, and all of them failed:

  • I used LayerNormBasicLSTMCell to enable layer normalization inside the gates of my LSTM cells.
  • I tried layer normalization between the encoder layers, both with and without also applying it inside the LSTM cells.
  • I tried adding and removing dropout, both before and after the layer normalization between the encoder layers.
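To make the "between the layers" variant concrete, here is a minimal plain-Python sketch of what I mean. It is an illustration only, not the code from my experiments: the function names (`layer_norm`, `dropout`, `between_layers`) and the toy values are hypothetical, and the order shown (normalization first, then dropout) is just one of the orderings I tried. Each time step is normalized independently across its feature dimension.

```python
import math
import random


def layer_norm(features, gain=None, bias=None, eps=1e-5):
    """Normalize one time step's feature vector to zero mean and unit variance."""
    n = len(features)
    mean = sum(features) / n
    var = sum((x - mean) ** 2 for x in features) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    # Learnable gain/bias default to the identity transform in this sketch.
    gain = gain if gain is not None else [1.0] * n
    bias = bias if bias is not None else [0.0] * n
    return [g * (x - mean) * inv_std + b for x, g, b in zip(features, gain, bias)]


def dropout(features, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in features]


def between_layers(sequence, rate=0.0, rng=None):
    """Apply layer norm, then dropout, to every time step of a layer's outputs."""
    rng = rng or random.Random(0)
    return [dropout(layer_norm(step), rate, rng) for step in sequence]


# Hypothetical outputs of one encoder layer: 2 time steps, 4 features each.
outputs = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 30.0, 30.0]]
normalized = between_layers(outputs, rate=0.0)
```

With `rate=0.0` the dropout step is a no-op, so `normalized` holds the purely normalized time steps. Swapping the two calls inside `between_layers` gives the dropout-before-normalization ordering.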

Since none of these experiments converged faster than the baseline, I would like to ask what your best practices are for applying layer normalization in sequence-to-sequence models.

For reference, I have attached the architecture of the Listen, Attend and Spell model below.

Listen-attend-spell architecture
