One-hot encoding is inefficient when vectorizing a corpus of words because a one-hot vector is sparse: for each word we would create a zero vector as long as the entire vocabulary, with a single one at the index corresponding to that word. The vast majority of the elements in each vector are zero, so vectorizing a group of sentences this way wastes far more space and time than necessary. A more efficient way to vectorize words is with word embeddings, which support analogical reasoning: they capture how words and word pairings relate to each other based on the vector differences between pairs of words. Word embeddings often use cosine similarity to define the similarity between two vectors. Cosine similarity works by taking the inner product of the two vectors (if they are similar this value will be large) and dividing it by the product of their Euclidean norms (magnitudes), which yields the cosine of the angle between the two vectors. Because word embeddings operate on analogical reasoning, they are favorable due to the generality of these analogous relations between words.
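As a rough sketch of the idea (using NumPy and made-up example vectors, not the actual embeddings from the exercise):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between vectors a and b."""
    # Inner product, normalized by the product of the vectors' magnitudes.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional embeddings for three words.
king = np.array([0.8, 0.3, 0.1, 0.9])
queen = np.array([0.7, 0.4, 0.2, 0.8])
apple = np.array([0.1, 0.9, 0.8, 0.0])

print(cosine_similarity(king, queen))  # close to 1: similar words
print(cosine_similarity(king, apple))  # much smaller: unrelated words
```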
Here you can see that while the validation accuracy is relatively high (around 88%), the validation loss is also incredibly high; although it initially goes down, it eventually increases drastically. Moreover, even though the validation accuracy is high, the training accuracy is much higher, and while the validation loss is high, the training loss is low. All of these factors together indicate that the model is very overfit.
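For reference, graphs like these can be produced from the object returned by Keras's `model.fit` (calling it `history` here is an assumption, not necessarily the name used in the exercise):

```python
import matplotlib.pyplot as plt

# `history` is assumed to be the return value of model.fit(...).
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)

plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(epochs, acc, label='Training accuracy')
plt.plot(epochs, val_acc, label='Validation accuracy')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(epochs, loss, label='Training loss')
plt.plot(epochs, val_loss, label='Validation loss')
plt.legend()
plt.show()
```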
This visualization shows how the words in the IMDB reviews dataset relate to each other. The data has been spherized and its dimensionality reduced with PCA. As you can see, there is quite a separation/polarization between the words: one end of the sphere holds words that all relate to each other in that they seem to contribute to/indicate a negative review, while the other end holds all of the words associated with the positive reviews. Between the two poles are words that could go either way. A lot of these words were what I would consider neutral descriptors/objective facts or connector words like “with” or “double.”
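For anyone reproducing this, the trained embeddings can be exported for the Embedding Projector roughly like this (a sketch that assumes the model's embedding layer is named "embedding" and that `vectorize_layer` is the `TextVectorization` layer holding the vocabulary, as in the standard TensorFlow word-embeddings tutorial):

```python
import io

# Assumes `model` contains an Embedding layer named "embedding" and
# `vectorize_layer` is the TextVectorization layer used on the reviews.
weights = model.get_layer('embedding').get_weights()[0]
vocab = vectorize_layer.get_vocabulary()

out_v = io.open('vectors.tsv', 'w', encoding='utf-8')
out_m = io.open('metadata.tsv', 'w', encoding='utf-8')

for index, word in enumerate(vocab):
    if index == 0:
        continue  # skip the padding token
    vec = weights[index]
    out_v.write('\t'.join(str(x) for x in vec) + '\n')
    out_m.write(word + '\n')

out_v.close()
out_m.close()
# Load vectors.tsv and metadata.tsv at https://projector.tensorflow.org/
```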
Before adding the LSTM layers, the accuracy and loss graphs look largely the same as the graphs from the word embedding exercise: the training and validation accuracy are both relatively high, though the training accuracy is higher than the validation accuracy, and the training loss is low while the validation loss is incredibly high, indicating that the model is very overfit. As seen below, even when you add the LSTM layers there is very little change in the accuracy and loss, so the model remains a bit overfit.
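For context, an LSTM variant of the model might look roughly like this (a minimal sketch; the vocabulary size, embedding dimension, and layer widths are placeholders, not the exact values from the exercise):

```python
import tensorflow as tf

# Placeholder hyperparameters; the actual exercise values may differ.
vocab_size = 10000
embedding_dim = 16

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    # Stacked LSTM layers: return_sequences=True passes the full
    # sequence of hidden states into the next LSTM layer.
    tf.keras.layers.LSTM(32, return_sequences=True),
    tf.keras.layers.LSTM(16),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1),  # single logit: positive vs. negative review
])

model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'],
)
```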