Why do we need the temperature in the Gumbel-Softmax trick?

by user3639557   Last Updated September 14, 2018 15:19 PM

Assuming a discrete variable $z_j$ with unnormalized probability $\alpha_j$, one way to sample is to draw from the categorical distribution given by softmax($\log\alpha_j$); another is the Gumbel trick, argmax($\log\alpha_j+g_j$), where $g_j$ is Gumbel noise. This second approach is useful if we want to do something like variational autoencoding. For example, if the goal is to have the full distribution over possible outcomes for $z_j$, we can apply a softmax transformation on top of the perturbation with Gumbel noise:

$$\pi_j = \frac{e^{\log \alpha_j+g_j}}{\sum_{k=1}^{K}e^{\log \alpha_k+g_k}}\ \ \ \text{where}\ \ g_k=-\log(-\log \epsilon_k),\ \ \epsilon_k\sim {U}(0,1).$$

Why isn't this enough? Why do we need to include the temperature term $\tau$ and rewrite it as

$$\pi_j = \frac{e^{\frac{\log\alpha_j+g_j}{\tau}}}{\sum_{k=1}^{K}e^{\frac{\log \alpha_k+g_k}{\tau}}}\ \ \ \text{where}\ \ g_k=-\log(-\log \epsilon_k),\ \ \epsilon_k\sim {U}(0,1)?$$

I understand that the temperature makes the vector $\pi=[\pi_1, ...,\pi_K]$ smoother or sharper (i.e., a high temperature pushes all the $\pi_j$ toward the same value and gives a flatter distribution), but why do we need it in practice? All we want (e.g., in a VAE) is to decouple the stochastic aspect of the sampling (i.e., move the stochastic part to the input), which the Gumbel trick achieves, and then somehow replace the one-hot vector draw with a continuous vector, which the first equation, softmax($\log\alpha_j+g_j$), already gives us. I am sure I am missing something fundamental, but I can't see what it is...
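To make the comparison concrete, here is a minimal NumPy sketch of the two formulas above. It is only an illustration under stated assumptions: the values in `alpha`, the helper names `sample_gumbel` and `gumbel_softmax`, the seed, and the three temperatures are all hypothetical choices, not taken from any particular library or paper. It reuses the same Gumbel noise at every temperature, so the only thing that changes is $\tau$.

```python
import numpy as np


def sample_gumbel(shape, rng):
    """Draw Gumbel(0, 1) noise as g = -log(-log(eps)), eps ~ U(0, 1)."""
    # Keep eps away from 0 so the inner log stays finite.
    eps = rng.uniform(low=1e-12, high=1.0, size=shape)
    return -np.log(-np.log(eps))


def gumbel_softmax(log_alpha, tau, g):
    """Relaxed sample pi = softmax((log_alpha + g) / tau) over the last axis."""
    y = (log_alpha + g) / tau
    y = y - y.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)

# Hypothetical unnormalized probabilities for K = 3 outcomes.
alpha = np.array([1.0, 2.0, 7.0])
log_alpha = np.log(alpha)

# One batch of noise, shared by every temperature below.
g = sample_gumbel((10000, 3), rng)

# Exact (hard) Gumbel trick: argmax of the perturbed log-probabilities.
hard = np.argmax(log_alpha + g, axis=-1)

# tau = 1.0 corresponds to the first equation (no temperature).
pis = {tau: gumbel_softmax(log_alpha, tau, g) for tau in (5.0, 1.0, 0.1)}

for tau, pi in pis.items():
    same_argmax = np.mean(np.argmax(pi, axis=-1) == hard)
    mean_max = pi.max(axis=-1).mean()
    print(f"tau={tau}: argmax agrees with hard sample {same_argmax:.3f}, "
          f"mean largest component {mean_max:.3f}")
```

In this sketch the index of the largest component of $\pi$ does not depend on $\tau$, since dividing by a positive constant preserves the ordering. What $\tau$ changes is how close each relaxed vector is to a one-hot vector: the mean largest component moves toward 1 as $\tau$ shrinks, and $\tau = 1$ is just one fixed point on that scale.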
