How to NOT overfit in Deep Learning

Abolo Samuel Isaac


What is overfitting?

When you're using deep learning, sometimes your neural network can perform really well during training, but not as well when it comes to testing, or in real world scenarios. This is overfitting, and it can be really frustrating, because it's not always easy to fix. It happens when the neural network has learned patterns that only appear in the data you're training it with, but not in the actual problem you're trying to solve. These patterns are like noise - they disrupt your final solution.

Overfitting illustration from Data Science Stack Exchange

In the above image, notice how the model performs really well on the training data but poorly on the evaluation data. We could say, in simple terms, that the model has become too familiar with the training data.

It's similar to how students can get so caught up in practicing past exam questions that they can't answer any question that wasn't among them.
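The gap between training and evaluation performance can be reproduced on a toy problem. The sketch below is only an illustration under stated assumptions: it uses polynomial curve fitting with NumPy as a stand-in for a neural network, and the sine-wave data, the noise level, and the helper names `make_data` and `mse` are all made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical data: a sine wave (the real pattern) plus random noise.
    x = np.linspace(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

def mse(model, x, y):
    # Mean squared error of the fitted model on the given points.
    return float(np.mean((model(x) - y) ** 2))

x_train, y_train = make_data(10)   # small training set
x_eval, y_eval = make_data(200)    # held-out evaluation set

results = {}
for degree in (3, 9):
    # A degree-9 polynomial has enough freedom to pass through all 10 training points.
    model = np.polynomial.Polynomial.fit(x_train, y_train, degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_eval, y_eval))
    train_mse, eval_mse = results[degree]
    print(f"degree {degree}: train MSE = {train_mse:.4f}, eval MSE = {eval_mse:.4f}")
```

The more flexible model drives its training error to almost zero by fitting the noise in the ten training points, while its error on the held-out points stays clearly higher. That gap is the signature of overfitting.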

Why does overfitting happen?

Complex network structure

The simplest and most obvious reason why overfitting occurs is the complexity of our network. The more layers and units we add, the more capacity our neural network has to fit patterns in the training data, including patterns that are really just noise.

Insufficient data

Another reason why overfitting occurs is trying to learn from a small amount of data, or from data that all looks the same. Running too many "epochs" (repeated training cycles) with a big network on too little data makes the problem worse. For example, let's say you're training a neural network to recognize the difference between a cat and a dog. If all the dogs in your training set are bulldogs, the network might have trouble recognizing a chihuahua.

Another problem that can happen when you have limited or similar data is that your network can stop learning anything useful too soon, and will only find patterns that exist in that specific dataset and not in the problem you're trying to solve.

Poorly synthesized data

A technique to fix the problem of "homogeneous" data is called data synthesis. This means creating new data by combining different sources. For example, if you're training a neural network to recognize a specific word (like "Hey Siri"), but you want it to also work well in a noisy environment, you can make new data by combining sounds of that specific word with sounds of different noisy places. This will improve the network’s ability to recognize the word in different environments.

Using data synthesis ensures your data is diverse and not too similar, but it can also cause problems if not performed correctly. Overfitting can occur if the network starts to learn the noise that was used to create the new data; in other words, if the network starts to focus on the wrong things instead of the main feature you're trying to train it on.
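The keyword-plus-noise idea can be sketched in a few lines of NumPy. This is a minimal sketch, not a production pipeline: the helper `mix_at_snr`, the synthetic sine-wave "keyword", and the random "background" clips are all hypothetical stand-ins for real recordings. Mixing each keyword with several different backgrounds at several loudness levels is one way to make it harder for the network to latch onto any single noise clip.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Overlay noise on a clean signal at the requested signal-to-noise ratio (in dB).

    Assumes `noise` is at least as long as `clean`.
    """
    noise = noise[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that clean_power / scaled_noise_power == 10 ** (snr_db / 10).
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 16000, endpoint=False)

# Hypothetical stand-ins: a tone for the recorded keyword, random noise for
# three different environments (for example a cafe, a street, a car).
keyword = np.sin(2 * np.pi * 440 * t)
backgrounds = [rng.normal(size=16000) for _ in range(3)]

# Every background is mixed in at several noise levels.
synthetic = [
    mix_at_snr(keyword, background, snr_db)
    for background in backgrounds
    for snr_db in (0, 10, 20)
]
print(len(synthetic), "synthetic clips created")
```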

Techniques for preventing overfitting

L2 regularization

L2 regularization is an idea borrowed from classical machine learning and regression. It works by adding a regularization term to the loss function. This term is the squared Frobenius norm (the sum of the squared entries) of the weight matrices of the neural network, multiplied by a value called lambda.
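One common way to write the regularized loss is shown below. The symbols are assumptions for illustration: J(W) is the original loss, W^[l] is the weight matrix of layer l, and lambda is the regularization strength. Scaling conventions vary between texts (some also divide the penalty by the number of training examples).

```latex
J_{\text{reg}}(W) = J(W) + \frac{\lambda}{2} \sum_{l} \lVert W^{[l]} \rVert_F^2,
\qquad
\lVert W^{[l]} \rVert_F^2 = \sum_{i} \sum_{j} \left( w_{ij}^{[l]} \right)^2
```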

The lambda value is a number that can be adjusted to change how much regularization is used to prevent overfitting. Regularization is a technique that helps the network not to focus too much on the noise in the data. If the lambda value is zero, then regularization is not applied at all. If the lambda value is very high, then regularization becomes too strong, and the network will not learn enough from the data, which is called underfitting.

So the intuition is that the higher the value of lambda, the stronger the effect L2 regularization has on our network.
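To make the effect of lambda concrete, here is a small NumPy sketch of the penalty and its gradient. It assumes the common lambda/2 scaling; the function names `l2_penalty` and `l2_penalty_grads`, the random weights, and the placeholder `data_loss` value are made up for this example.

```python
import numpy as np

def l2_penalty(weights, lam):
    """lam / 2 times the sum of squared entries over all weight matrices."""
    return 0.5 * lam * sum(np.sum(W ** 2) for W in weights)

def l2_penalty_grads(weights, lam):
    """Gradient of the penalty with respect to each weight matrix: lam * W."""
    return [lam * W for W in weights]

rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(3, 1))]  # a tiny two-layer network
data_loss = 0.25  # placeholder for the unregularized loss on a batch

for lam in (0.0, 0.01, 1.0):
    total = data_loss + l2_penalty(weights, lam)
    print(f"lambda = {lam}: total loss = {total:.4f}")
```

Because the gradient of the penalty is lambda times the weight itself, every update step pulls each weight toward zero in proportion to lambda. With lambda at zero the pull disappears, and with a very large lambda it overwhelms the data loss.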

Dropout regularization

The main idea behind dropout regularization is to randomly and temporarily shut off a small percentage of the units within the layers of our neural network during training. This reduces the chances of the network depending on strong signals from any single unit.

We should, however, be careful not to shut off too large a portion of the units in our layers, because that can cause our network to underfit our data.
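Here is a minimal NumPy sketch of dropout applied to one layer's activations. It uses the "inverted dropout" variant, which is an assumption on my part rather than something stated above: the surviving units are scaled up so the expected activation stays the same, and nothing is dropped at test time. The function name `dropout` and the all-ones activations are made up for this example.

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    """Randomly zero out a fraction `drop_prob` of units (inverted dropout)."""
    if not training or drop_prob == 0.0:
        return activations  # at test time every unit stays on
    keep_prob = 1.0 - drop_prob
    # Each unit is kept independently with probability keep_prob.
    mask = rng.random(activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((1000, 50))  # hypothetical activations of one hidden layer
dropped = dropout(hidden, drop_prob=0.2, rng=rng)

print("fraction of units shut off:", np.mean(dropped == 0))
print("mean activation after dropout:", dropped.mean())
```

A fresh mask is drawn on every call, so a different random subset of units is switched off at each training step.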

Other things that can help

Batch normalization

Batch normalization has a regularization effect that can help prevent overfitting.

The main benefit it offers is that it puts the values flowing into each layer on the same scale (i.e., roughly the same mean and standard deviation) by normalizing them over each mini-batch, helping your neural network to converge faster and better. Your neural network would also be less likely to depend on features that are on a larger scale than others.
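The normalization step itself is short. The sketch below shows only the training-time forward pass in NumPy and leaves out the running statistics that real implementations keep for inference. The function name `batch_norm`, the parameters `gamma` and `beta` (the learnable scale and shift), and the two made-up features are assumptions for this example.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then rescale and shift."""
    mean = x.mean(axis=0)                     # per-feature mean over the batch
    var = x.var(axis=0)                       # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
# Two hypothetical features on wildly different scales.
x = np.column_stack([
    rng.normal(loc=0.0, scale=1.0, size=256),
    rng.normal(loc=50.0, scale=1000.0, size=256),
])

out = batch_norm(x, gamma=np.ones(2), beta=np.zeros(2))
print("std before:", x.std(axis=0))
print("std after: ", out.std(axis=0))
```

After normalization both features have roughly the same spread, so the large-scale feature no longer dominates just because of its units.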

Diversify your data

One simple way to fix overfitting is to add more diverse data to your dataset. By having more varied data, your model will learn to recognize features that appear throughout the dataset instead of just in a small part of it. However, this might not always be possible or easy.

Transfer learning

This is a way of addressing the issue of overfitting caused by not having enough data. You take a neural network that has already been trained on an abundance of data and reuse it to solve a similar problem. Often, all you need to do is retrain the output layers of the network (and the input layer, if your inputs have a different format) while keeping the rest of the network frozen.
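The freeze-and-retrain idea can be sketched with plain NumPy. This is a toy illustration under heavy assumptions: the "pretrained" layer is just fixed random weights standing in for a network trained on abundant data, the small dataset is synthetic, only a new output layer is trained, and all names (`W_pretrained`, `features`, `loss_and_grads`) are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained hidden layer. In practice these weights would be
# loaded from a model trained on a large dataset; here they are random.
W_pretrained = rng.normal(size=(2, 16))
b_pretrained = rng.normal(size=16)
W_before = W_pretrained.copy()

def features(x):
    # Frozen layer: used for the forward pass, never updated.
    return np.tanh(x @ W_pretrained + b_pretrained)

# Small hypothetical dataset for the new task.
x = rng.normal(size=(40, 2))
y = (x[:, 0] * x[:, 1] > 0).astype(float)

# New output layer, trained from scratch.
w_out = np.zeros(16)
b_out = 0.0

def loss_and_grads(w, b):
    h = features(x)
    p = 1.0 / (1.0 + np.exp(-(h @ w + b)))   # sigmoid output
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    err = (p - y) / len(y)
    return loss, h.T @ err, err.sum()

initial_loss, _, _ = loss_and_grads(w_out, b_out)
for _ in range(500):
    _, grad_w, grad_b = loss_and_grads(w_out, b_out)
    w_out -= 0.1 * grad_w   # only the output layer's parameters change
    b_out -= 0.1 * grad_b
final_loss, _, _ = loss_and_grads(w_out, b_out)

print(f"loss before: {initial_loss:.4f}, loss after: {final_loss:.4f}")
```

Because only the 17 output-layer parameters are trained, there is far less freedom to memorize the 40 examples than if the whole network were trained from scratch.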