
Learning Rate Scheduler in Keras

Moklesur Rahman
7 min read · Jan 23, 2023


The learning rate is considered one of the most important hyperparameters for training deep learning models, but choosing it can be quite hard. Rather than simply using a fixed learning rate, it is common to use a learning rate scheduler.

Image credit: pyimagesearch.com

When training a deep learning model with Keras or TensorFlow using an optimizer such as Adam, the learning rate stays constant by default throughout training. Adjusting the learning rate across steps, batches, or epochs can boost the model’s performance. There are many ways to decrease the learning rate over the course of training; this is known as “learning rate scheduling” or “learning rate annealing”. Keras includes several built-in schedulers that can be used to anneal the learning rate over time. In the following subsections, different learning rate scheduling approaches are discussed and implemented. To demonstrate the different methods, I use a common deep learning model trained on the MNIST dataset.
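
As a quick preview of what a built-in schedule looks like, an object such as ExponentialDecay can be passed to an optimizer in place of a fixed learning rate. This is a minimal sketch; the initial rate, decay_steps, and decay_rate values below are arbitrary placeholders, not the values used later in this article.

from tensorflow import keras

# A schedule that multiplies the learning rate by decay_rate every decay_steps steps.
# The numbers here are placeholders for illustration only.
lr_schedule = keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3,
    decay_steps=1000,
    decay_rate=0.9)

# The schedule replaces a fixed learning rate when constructing the optimizer
optimizer = keras.optimizers.Adam(learning_rate=lr_schedule)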

Importing the MNIST dataset:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import datasets
from tensorflow.keras.utils import to_categorical
import numpy as np

# Load MNIST and add a channel dimension so the images fit Conv2D layers
(X_train, Y_train), (X_test, Y_test) = datasets.mnist.load_data()
X_train, X_test = X_train.reshape(-1, 28, 28, 1), X_test.reshape(-1, 28, 28, 1)

# Determine the set of classes from the integer labels before one-hot encoding
classes = np.unique(Y_train)

# One-hot encode the labels for categorical cross-entropy
Y_train, Y_test = to_categorical(Y_train), to_categorical(Y_test)

X_train.shape, X_test.shape, Y_train.shape, Y_test.shape

Deep Learning Model:

To illustrate the learning rate schedules, we designed a simple convolutional neural network. The model is composed of three convolution and max-pooling layers, followed by a dense (fully connected) layer. The convolution layers have 16, 32, and 64 filters, respectively, each using a 3×3 kernel, and each is followed by a ReLU (rectified linear unit) activation. The output of the third convolution block is flattened and fed into a dense layer with 10 output units (one per class), whose output is converted into class probabilities with the softmax activation function.
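
A minimal Keras implementation of the model described above might look like the following sketch. The filter counts, kernel size, and output units come from the description; the pooling sizes, padding, and compile settings are assumptions.

from tensorflow.keras import layers, models

# Three Conv2D + MaxPooling2D blocks with 16, 32, and 64 filters and 3x3 kernels,
# followed by a flatten and a 10-unit softmax output layer.
model = models.Sequential([
    layers.Conv2D(16, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation='softmax')
])

# Categorical cross-entropy matches the one-hot encoded labels prepared above
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()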


Written by Moklesur Rahman

PhD student | Computer Science | University of Milan | Data science | AI in Cardiology | Writer | Researcher
