At the start of every Deep Learning problem, we have a data set and a largely untrained model. Now there are multiple open questions about how and how often we present this data to the model and I will address some of them in this post.
Table of Contents
- Training Cycle
- Batch Sizes
In general, we have a database from which we get features (for example, images) and labels, also called ground truths or targets (for example, a description of what is in the image).
We send the features into the network and let it make a prediction on that data. We then compare the prediction with the true labels from the database by calculating the loss.
Behind the scenes in PyTorch or TensorFlow, the calculations up to this point are being tracked, so we can automatically get the gradient of the loss. The gradient tells us in which direction to change the weights so that the network gets better at the data it has just seen. That is roughly what happens during Backpropagation. Normally, we then adjust the weights of the network with this new information after each prediction.
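To make the cycle concrete, here is a minimal sketch in plain Python. It is not how PyTorch or TensorFlow do it internally: the "network" is a single weight `w`, the data and the learning rate are made up for illustration, and the gradient is derived by hand instead of being tracked automatically.

```python
# Toy stand-in for the training cycle: fit y = w * x with a single weight.
# Assumption: data, learning rate and the hand-written gradient are illustrative only.
features = [1.0, 2.0, 3.0, 4.0]
labels = [2.0, 4.0, 6.0, 8.0]  # ground truths follow y = 2 * x

w = 0.0               # the largely untrained "model"
learning_rate = 0.05


def predict(weight, x):
    return weight * x


def loss_fn(prediction, label):
    # Squared error between prediction and ground truth.
    return (prediction - label) ** 2


def grad_fn(weight, x, label):
    # d/dw (w * x - label)^2 = 2 * (w * x - label) * x
    return 2.0 * (weight * x - label) * x


losses = []
for x, y in zip(features, labels):
    prediction = predict(w, x)          # 1. send features through the model
    losses.append(loss_fn(prediction, y))  # 2. compare with the label
    gradient = grad_fn(w, x, y)         # 3. get the gradient of the loss
    w = w - learning_rate * gradient    # 4. adjust the weight

print("weight after one pass:", w)
print("losses:", losses)
```

Each pass through the loop body is one prediction followed by one weight update. The step goes against the gradient, because the gradient itself points towards higher loss.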
Because every for-loop or iterative update can be called an iteration, there is no clear-cut definition as far as I could tell. You can say that one iteration is finished every time the network weights are updated. Or you can say that one iteration is finished every time we make a prediction, which would be one iteration of the for-loop that goes through the data loader.
In most cases these two are the same, so don’t worry if you don’t really understand the distinction. You could however do Gradient Descent and accumulate gradients until you have seen the whole data set before making an update, in which case these two definitions would be different.
If you are a beginner, just think of an iteration as one loop through the training cycle discussed above.
Also, a lot of research literature I’ve read talks about “training steps” to specifically mean the times when we update the neural network weights, so I would encourage you to use that term instead because it’s clearer.
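The difference between the two definitions can be seen in a small counting sketch. The numbers (`num_batches`, `accumulate_every`) are hypothetical, and the gradient is just a placeholder value; the point is only to count loop iterations and weight updates separately.

```python
# Counting loop iterations versus training steps when gradients are accumulated.
# Assumption: 8 batches per pass and an update after every 4th batch (made-up values).
num_batches = 8
accumulate_every = 4

loop_iterations = 0
training_steps = 0
accumulated_gradient = 0.0

for batch_index in range(num_batches):
    loop_iterations += 1
    batch_gradient = 1.0                 # placeholder for a real gradient
    accumulated_gradient += batch_gradient

    if (batch_index + 1) % accumulate_every == 0:
        # Only here would the weights actually be updated.
        training_steps += 1
        accumulated_gradient = 0.0

print("loop iterations:", loop_iterations)
print("training steps:", training_steps)
```

With `accumulate_every = 1` both counters agree, which is the usual case described above.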
An epoch is finished once the neural network has seen all the data once.
Typically, we are not finished after that, because Gradient Descent variants only take small update steps and we usually need more updates than are possible within one epoch to reach a well-performing model.
This means that we train for multiple epochs. However, it’s hard to give a number because it’s very dependent on the network size and training data. It’s often more than 40 epochs. I personally have trained some small image classification models and I used between 5 and 40 epochs – just to throw this out as a super rough estimate.
How long an epoch takes in terms of time also varies a lot. It could be just seconds or even hours per epoch; again, that depends heavily on the type of network being trained and on the data set.
Overall epochs are a nice metric to describe training length, because they are independent from the batch size and from the concrete layout of iterations.
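As a quick worked example of how epochs, batch size and training steps relate, the sketch below computes the number of training steps for a hypothetical data set of 50,000 examples. It assumes one weight update per batch and that the last, smaller batch is kept.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    # One training step per batch; the last batch may be smaller.
    return math.ceil(num_examples / batch_size)


def total_training_steps(num_examples, batch_size, num_epochs):
    return steps_per_epoch(num_examples, batch_size) * num_epochs


num_examples = 50_000  # hypothetical data set size
num_epochs = 10        # hypothetical training length

for batch_size in (32, 256):
    per_epoch = steps_per_epoch(num_examples, batch_size)
    total = total_training_steps(num_examples, batch_size, num_epochs)
    print(f"batch size {batch_size}: {per_epoch} steps per epoch, {total} steps in total")
```

In both cases the network sees every example ten times, but the number of weight updates differs by roughly a factor of eight. That is why "10 epochs" describes the training length without having to mention the batch size.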
We have talked about Iterations and Epochs now, but what about Batch Size? There is a bit more to talk about here.
As a basic principle, we need to show the training data to the network and one of the ways we can do this is with…
Classic full batch Gradient Descent.
For this algorithm we use the full data set at once.
The pros are that it is the most accurate at minimizing the training loss function. After all, we want the network to become better at all the training examples and not only some of them, so it needs to take all of them into account when optimizing.
We can also calculate many of the example predictions in parallel on GPUs, which makes batch descent possible in the first place.
The cons, however, are firstly logistical: the memory we have available is often too small. Modern neural networks are already quite big in themselves, and data sets can be many gigabytes in size, so they will not fit into memory completely. As I mentioned before, we could split the data into batches, accumulate the gradients and only do an update at the end. However, that would take extremely long, because Gradient Descent needs many training steps to get good, and if every training step takes a long time, training is simply inefficient.
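The accumulation workaround can be sketched as follows. The example assumes a toy one-weight model with a squared-error loss and made-up data; it only shows that summing gradients chunk by chunk gives the same full-batch gradient as processing everything at once, while holding only one chunk at a time.

```python
# Full-batch gradient computed at once versus accumulated over smaller chunks.
# Assumption: toy model y = w * x with squared-error loss and illustrative data.
features = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = [2.0 * x for x in features]
w = 0.5


def gradient_sum(weight, xs, ys):
    # Sum of d/dw (w * x - y)^2 over the given examples.
    return sum(2.0 * (weight * x - y) * x for x, y in zip(xs, ys))


# Variant 1: the whole data set in one go.
full_batch_gradient = gradient_sum(w, features, labels) / len(features)

# Variant 2: process chunks that fit into memory, accumulate, update once at the end.
chunk_size = 2
accumulated = 0.0
for start in range(0, len(features), chunk_size):
    xs = features[start:start + chunk_size]
    ys = labels[start:start + chunk_size]
    accumulated += gradient_sum(w, xs, ys)
accumulated_gradient = accumulated / len(features)

print(full_batch_gradient, accumulated_gradient)
```

Both variants lead to the same single update, so accumulation solves the memory problem but not the speed problem: there is still only one training step per pass through the data.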
The main takeaway from this is that you need to choose the batch size at least small enough that you don’t run out of memory.
Now let’s look at what happens if we don’t wait with our network update until we have seen the whole data set:
(Mini-batch) Stochastic Gradient Descent
In Stochastic Gradient Descent (SGD) we randomly select only one data point for each training update. In mini-batch SGD, by contrast, we choose a random subset of points from the data set. How many points depends on the batch size that we choose for our problem.
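One common way to get random mini-batches is to shuffle the indices of the data set once per epoch and then cut them into slices. The sketch below assumes this shuffle-and-slice variant; the helper `minibatches` is a hypothetical name and not part of any library.

```python
import random


def minibatches(num_examples, batch_size, rng):
    """Yield lists of example indices, covering every example once per epoch."""
    indices = list(range(num_examples))
    rng.shuffle(indices)  # a new random order for this epoch
    for start in range(0, num_examples, batch_size):
        yield indices[start:start + batch_size]


rng = random.Random(0)  # fixed seed so the run is repeatable
batches = list(minibatches(num_examples=10, batch_size=4, rng=rng))

for batch in batches:
    print(batch)
```

With `batch_size=1` this reduces to plain SGD, and with `batch_size=num_examples` it is full batch Gradient Descent. Note that the last batch is smaller when the batch size does not divide the number of examples.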
The original main advantage of this algorithm was speed: we can make more frequent updates, even without having complete information about the whole data set, and therefore train faster.
But what happens to our accuracy? First, it is not as good as with Gradient Descent, because we make updates before having all the information. In the above plot you can see three different functions, f1, f2 and f3, each representing the loss of just one data point. The average of those three is shown in red, which would be the loss over all data points. Ultimately, we want to minimize the red line, which has its minimum at almost x = -0.5.
However, if we look at what would happen if we follow one of the other lines to their minimum, then we see that they would lead us in different directions sometimes even in the opposite direction of where we want to go. But since we only take small steps and switch the leading line after every step, they still roughly average out over time. Just not exactly to the average, so that is something to keep in mind.
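This averaging effect can be reproduced with a small sketch. The three quadratic losses below are hypothetical stand-ins chosen so that their average has its minimum at x = -0.5; they are not the functions from the plot. The learning rate, starting point and number of steps are also made up.

```python
import random

# Hypothetical per-data-point losses f_i(x) = (x - c_i)^2, one per center c_i.
centers = [-2.0, 0.0, 0.5]   # the average loss has its minimum at mean(centers) = -0.5


def point_gradient(x, center):
    # d/dx (x - center)^2
    return 2.0 * (x - center)


def average_gradient(x):
    return sum(point_gradient(x, c) for c in centers) / len(centers)


# At x = 0.2 the average gradient is positive, but the third point disagrees.
x0 = 0.2
gradients_at_x0 = [point_gradient(x0, c) for c in centers]
print("per-point gradients:", gradients_at_x0)
print("average gradient:", average_gradient(x0))

# SGD: follow one randomly chosen point per step, with a small step size.
rng = random.Random(0)
x = 3.0
learning_rate = 0.001
for _ in range(20_000):
    center = rng.choice(centers)
    x -= learning_rate * point_gradient(x, center)

print("x after SGD:", x)  # close to -0.5, but typically not exactly -0.5
```

The single-point gradients disagree in sign, yet many small steps with a randomly switching "leading line" still end up near the minimum of the average.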
Similarly to this view, SGD introduces noise, and in this image here I tried to showcase the path we would go with Gradient Descent as the solid line and the path we would take with SGD as the dotted line. This is quite a mild view, because here we never go into the opposite direction of GD, but it is still a good mental image to have of the process. You can also see that we do not end up in exactly the same spot as Gradient Descent does.
There is ongoing discussion, and there are several theories, suggesting that this noise might actually help us train the network to some extent. That is a topic for its own post, and it is not fully understood in research yet. I still wanted to mention it, because the randomness in training is not purely negative, which is part of why SGD is used successfully in most applications.
- Mini-batches are most often used, so that means not the whole data set at once but also not just single points.
- The exact batch size depends on your project, and you should try out different ones to see which one works best for your case.
- A good guideline is to choose powers of 2 (e.g. 16, 32, 64, …) for your batch size, as these sizes tend to make good use of GPU memory.