We split the data into training and test sets so that the model can learn from one portion and be evaluated on another. During training, the model predicts answers for the training data and checks them against the correct answers, adjusting itself to become as accurate as possible. Once the model has been trained on the training data, we measure how well it does on test data that it has not seen before. This shows whether the model has learned the underlying rules of the data rather than being overfit (i.e., having just memorized the training data without understanding the rules behind it).
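The data in this exercise is already divided into training and test sets, but the idea can be sketched by hand. The following is a minimal illustration using NumPy; the helper name `train_test_split`, the 20% test fraction, and the toy arrays are assumptions made for the example, not part of the original exercise.

```python
import numpy as np

def train_test_split(features, labels, test_fraction=0.2, seed=0):
    """Shuffle the rows, then hold out a fraction of them as the test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    n_test = int(len(features) * test_fraction)
    test_idx, train_idx = order[:n_test], order[n_test:]
    return (features[train_idx], labels[train_idx],
            features[test_idx], labels[test_idx])

# Toy data (hypothetical): 50 samples with 2 features each.
features = np.arange(100).reshape(50, 2)
labels = np.arange(50)

x_train, y_train, x_test, y_test = train_test_split(features, labels)
print(x_train.shape, x_test.shape)
```

The model would only ever be fitted on `x_train` and `y_train`; the held-out `x_test` and `y_test` are used afterwards to check how it performs on unseen data.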
The activation arguments specify the function that each layer applies to a neuron's output before passing it on to the next layer. The relu function essentially says that if the output of a neuron is less than zero, the output is set to zero, and otherwise it is passed through unchanged, so that negative values won't skew things downstream and cancel positive values being produced elsewhere. The softmax function helps the machine find the most likely candidate by converting the outputs of the last layer into probabilities that each lie between zero and one and together sum to one, so the neuron with the largest value receives the highest probability.
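To make the two activation functions concrete, here is a small NumPy sketch. The function names and the sample values are chosen for illustration only; in practice the library supplies its own implementations.

```python
import numpy as np

def relu(x):
    """Replace negative values with zero; leave the rest unchanged."""
    return np.maximum(0, x)

def softmax(x):
    """Turn raw scores into probabilities that sum to one."""
    shifted = x - np.max(x)   # subtracting the max avoids overflow in exp
    exps = np.exp(shifted)
    return exps / exps.sum()

raw = np.array([-2.0, 0.5, 3.0])   # hypothetical neuron outputs

activated = relu(raw)    # the -2.0 becomes 0.0
probs = softmax(raw)     # largest raw value gets the largest probability
print(activated)
print(probs, probs.sum())
```

Note that softmax does not discard the smaller values: every neuron keeps a non-zero probability, and picking the single winner is a separate step (taking the index of the largest probability).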
In this example there were 10 neurons in the last layer because there were ten possible categories that an image could be classified as. Each neuron represents one category, and the image is assigned the category of the neuron with the highest value.
The loss function and the optimizer work together to help the model learn the rules of the data. The loss function calculates how far the machine's guesses are from the correct answers, so a lower loss means a better guess. The optimizer then uses that result to adjust the model so that its next guess is more accurate. The new guess is re-evaluated by the loss function, whose result is again fed to the optimizer, and the cycle repeats until training ends (for example, after a set number of epochs or once the loss stops improving).
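The loss/optimizer cycle can be sketched with a model that has a single weight. This is only an illustration: mean squared error and plain gradient descent are assumed here for simplicity and are not necessarily the loss and optimizer used in the exercise, and the data and learning rate are made up.

```python
import numpy as np

# Toy data (hypothetical): the true rule is y = 2 * x.
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = 2.0 * xs

def mse_loss(w):
    """Loss: average squared gap between the guesses w * x and the answers."""
    return np.mean((w * xs - ys) ** 2)

def loss_gradient(w):
    """Slope of the loss with respect to w; tells the optimizer which way to move."""
    return np.mean(2 * (w * xs - ys) * xs)

w = 0.0                # initial guess for the weight
learning_rate = 0.05   # step size (assumed value)
losses = []

for epoch in range(50):
    losses.append(mse_loss(w))               # loss function scores the current guess
    w -= learning_rate * loss_gradient(w)    # optimizer nudges w to reduce the loss

print(w, losses[0], losses[-1])
```

Each pass through the loop is one turn of the cycle described above: the loss is measured, the weight is adjusted, and the loss shrinks as `w` approaches 2.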
There are 60,000 images in the training set and each image is 28 by 28 pixels.
There are 60,000 labels in the training set.
The test set has 10,000 images and each image is 28 by 28 pixels.
The array of probabilities for image 9,026 in the test set is: [1.0489175e-09, 2.6143069e-13, 2.9991051e-07, 1.8468669e-11, 1.7452842e-06, 2.5827162e-13, 1.3445548e-13, 2.4090318e-06, 1.7083970e-09, 9.9999559e-01]
The most likely numeral for image 9,026 in the test set is 9, because the last entry (position 9, counting from zero) has by far the highest probability.
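The step from the probability array to the predicted numeral can be checked directly. This short sketch copies the ten probabilities reported above into a NumPy array (the variable names are illustrative) and picks the position of the largest one.

```python
import numpy as np

# Probabilities reported above, one per numeral 0 through 9.
probabilities = np.array([
    1.0489175e-09, 2.6143069e-13, 2.9991051e-07, 1.8468669e-11,
    1.7452842e-06, 2.5827162e-13, 1.3445548e-13, 2.4090318e-06,
    1.7083970e-09, 9.9999559e-01,
])

# argmax returns the index of the largest entry, which is the predicted numeral.
predicted = int(np.argmax(probabilities))
print(predicted)  # 9
```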