I'm trying to add layer normalization (at the encoder level) to the Listen, Attend and Spell (LAS) model for speech recognition.
To do so, I have run a number of experiments (all of which failed) trying to make the model converge faster than my baseline:
- I tried layer normalization between the encoder layers, both with and without also applying it inside the LSTM cells.
- I tried adding/removing dropout before and after the layer normalization between the encoder layers (both variants are sketched below).
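For concreteness, here is a minimal PyTorch sketch of the two variants I experimented with. The class and parameter names (`LNEncoderLayer`, `LayerNormLSTMCell`, etc.) are my own, for illustration only, and this is a simplification rather than my full encoder code:

```python
import torch
import torch.nn as nn

class LNEncoderLayer(nn.Module):
    """One encoder layer: BiLSTM -> dropout -> LayerNorm between layers."""
    def __init__(self, input_size, hidden_size, dropout=0.1):
        super().__init__()
        self.blstm = nn.LSTM(input_size, hidden_size,
                             batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(2 * hidden_size)  # BiLSTM doubles the feature width

    def forward(self, x):
        # x: (batch, time, input_size)
        out, _ = self.blstm(x)
        # dropout before the normalization; I also tried the reverse order
        return self.norm(self.dropout(out))

class LayerNormLSTMCell(nn.Module):
    """LSTM cell with layer normalization on the gate pre-activations
    and on the cell state, in the style of Ba et al. (2016)."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.ih = nn.Linear(input_size, 4 * hidden_size, bias=False)
        self.hh = nn.Linear(hidden_size, 4 * hidden_size, bias=False)
        self.ln_ih = nn.LayerNorm(4 * hidden_size)
        self.ln_hh = nn.LayerNorm(4 * hidden_size)
        self.ln_c = nn.LayerNorm(hidden_size)

    def forward(self, x, state):
        h, c = state
        # normalize input-to-hidden and hidden-to-hidden pre-activations separately
        gates = self.ln_ih(self.ih(x)) + self.ln_hh(self.hh(h))
        i, f, g, o = gates.chunk(4, dim=-1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(self.ln_c(c))
        return h, c
```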
Since all of these experiments failed, I'm asking you to share your best practices for applying layer normalization in sequence-to-sequence models.
For further details, I have attached the architecture of the Listen, Attend and Spell model below.