Word2Vec: Reliance on only one out of two weight matrices

by Adam   Last Updated September 10, 2019 17:19 PM

Word2Vec is commonly used to identify words similar to a given input term. My understanding is that Word2Vec is particularly efficient at this task because it uses the word embeddings learned during training and held in the hidden layer, which have a much lower dimensionality than the network's output.
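To make the lookup concrete, here is a minimal sketch of how I understand the similarity search to work on the embedding matrix alone. The function name `most_similar` and the toy values in `W` are my own assumptions for illustration, not taken from any particular library.

```python
import numpy as np

def most_similar(W, index, top_n=3):
    """Return indices of the rows of W closest to row `index` by cosine similarity."""
    # Normalize every embedding to unit length so a dot product equals the cosine.
    unit = W / np.linalg.norm(W, axis=1, keepdims=True)
    sims = unit @ unit[index]
    sims[index] = -np.inf  # exclude the query word itself
    return np.argsort(-sims)[:top_n]

# Toy embedding matrix: 5 words, 3 dimensions (values made up for illustration).
W = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
    [0.0, 0.0, 1.0],
])

neighbors = most_similar(W, 0, top_n=2)
print(neighbors)
```

Note that only W appears here; W' plays no role in the lookup, which is exactly what my question below is about.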

I have read that using these word embeddings, rather than the output, does not lead to a loss of accuracy because the neural network aims to assign similar word vectors to (semantically) similar words (i.e., words appearing in similar contexts).

However, the Word2Vec (Skip-gram) architecture contains two weight matrices, W and W' (see, e.g., figure 1 here). To obtain the output, the word vector of the input word (from W) is multiplied by the weights of every word in the vocabulary (from W').
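As a rough sketch of the forward pass I am describing, assuming a plain softmax output layer (as far as I know, practical implementations usually replace it with negative sampling or hierarchical softmax), with W stored as one row per word and W' as one column per word. The sizes, the random initialization, and the name `skipgram_forward` are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 6, 3  # toy sizes, assumed for illustration

W = rng.normal(size=(vocab_size, embed_dim))        # input embeddings, one row per word
W_prime = rng.normal(size=(embed_dim, vocab_size))  # output weights, one column per word

def skipgram_forward(W, W_prime, input_index):
    """Return the predicted context-word distribution for one input word."""
    h = W[input_index]        # hidden layer: equivalent to one-hot input times W
    scores = h @ W_prime      # one score per vocabulary word
    scores -= scores.max()    # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # softmax over the vocabulary

probs = skipgram_forward(W, W_prime, 2)
print(probs.shape, probs.sum())
```

The output for a given input word depends on a single row of W but on all of W', which is why I am unsure that W alone carries all the relevant information.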

My question is: Since the output layer depends just as much on information from W as from W', how can we say that no accuracy is lost when limiting our evaluation of word similarities to W?

I also have trouble understanding how the information contained in W differs from that in W'. While some ideas are provided here, any further references or suggestions in this regard would be very helpful.
