Custom Optimizers in PyTorch

In PyTorch, an optimizer is a specific implementation of an optimization algorithm that is used to update the parameters of a neural network. The optimizer updates the parameters in such a way that the loss of the neural network is minimized. PyTorch provides various built-in optimizers such as SGD, Adam, and Adagrad that can be used out of the box. However, in some cases the built-in optimizers may not be suitable for a particular problem or may not perform well. In such cases, we can create our own custom optimizer.

A custom optimizer in PyTorch is a class that inherits from the torch.optim.Optimizer base class. The custom optimizer should implement the __init__ and step methods. The __init__ method is used to initialize the optimizer’s internal state, and the step method is used to update the parameters of the model.

Creating a Custom Optimizer:

In PyTorch, creating a custom optimizer comes down to creating a class that inherits from the torch.optim.Optimizer class and overriding the following methods (a minimal skeleton is sketched after this list):

  • __init__(self, params): This method initializes the optimizer and registers the model parameters; the base class stores them in the param_groups attribute.
  • step(): This method is used to perform a single optimization step. It should update the model parameters based on the current gradients.
  • zero_grad(): This method sets the gradients of all parameters to zero. It is already provided by the torch.optim.Optimizer base class, so it normally does not need to be overridden.
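
Putting these methods together, a bare-bones custom optimizer has roughly the shape sketched below. This is only a minimal sketch: the class name MyCustomOptimizer and its plain gradient-descent update rule are placeholders for illustration, not part of the PyTorch API.

Python3

import torch

class MyCustomOptimizer(torch.optim.Optimizer):

    def __init__(self, params, lr=1e-3):
        # Hyperparameters go into `defaults`; the base class builds param_groups
        super().__init__(params, defaults={'lr': lr})

    def step(self):
        # Placeholder update rule: plain gradient descent
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                p.data -= group['lr'] * p.grad.data

    # zero_grad() is inherited from torch.optim.Optimizer and needs no override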

The __init__ Method:

The __init__ method is used to initialize the optimizer’s internal state. In this method, we define the hyperparameters of the optimizer and set up the internal state. For example, let’s say we want to create a custom optimizer that implements the Momentum optimization algorithm. The __init__ method for this optimizer would look like the one in the code below.

In the example below, we define the hyperparameters of the optimizer to be the learning rate lr and the momentum. We then call super().__init__() to initialize the internal state of the optimizer. We also set up a state dictionary that we will use to store the velocity vector for each parameter.

Python3

import torch
import torch.nn as nn


class MomentumOptimizer(torch.optim.Optimizer):

    def __init__(self, params, lr=1e-3, momentum=0.9):
        super(MomentumOptimizer, self).__init__(params, defaults={'lr': lr})
        self.momentum = momentum
        # State dictionary holding the velocity vector for each parameter
        self.state = dict()
        for group in self.param_groups:
            for p in group['params']:
                self.state[p] = dict(mom=torch.zeros_like(p.data))

    def step(self):
        for group in self.param_groups:
            for p in group['params']:
                # Skip parameters that did not receive a gradient
                if p.grad is None:
                    continue
                if p not in self.state:
                    self.state[p] = dict(mom=torch.zeros_like(p.data))
                mom = self.state[p]['mom']
                # v = momentum * v - lr * grad
                mom = self.momentum * mom - group['lr'] * p.grad.data
                # Store the updated velocity so it accumulates across steps
                self.state[p]['mom'] = mom
                p.data += mom

The Step Method:

The step method is used to update the parameters of the model. It takes no arguments and updates both the internal state and the model parameters. For our MomentumOptimizer, the step method is shown in the code above.

In the above example, we iterate over all the parameters in the model and check whether they are in the state dictionary. If they are not, we add them with an initial velocity vector of zero. We then compute the new velocity vector from the momentum and the learning rate, store it back in the state dictionary so that it accumulates across steps, and update the parameter’s value using this velocity vector.

Using the custom optimizer is similar to using the built-in optimizers, in that we instantiate it and pass in the model’s parameters and the hyperparameters.
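
For example, a minimal usage sketch, assuming model is an nn.Module that has already been defined:

Python3

optimizer = MomentumOptimizer(model.parameters(), lr=1e-3, momentum=0.9)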

Illustration 1:

Let’s create a simple training loop that shows how to use the custom optimizer to train a model. The loop would perform the following steps:

  1. Initialize the gradients of the model’s parameters to zero using the optimizer’s zero_grad method.
  2. Compute the forward pass of the model on some input data and calculate the loss.
  3. Compute the gradients of the model’s parameters with respect to the loss using the backward method.
  4. Call the step method of the optimizer to update the model’s parameters based on the current gradients and the optimizer’s internal state.

Step 1. Import the necessary libraries:

Python3

import torch
import torch.nn as nn
import matplotlib.pyplot as plt

Step 2: Define a custom optimizer class that inherits from torch.optim.Optimizer. In this example, we will create a custom optimizer that implements the Momentum optimization algorithm.

Python3

class MomentumOptimizer(torch.optim.Optimizer):

    def __init__(self, params, lr=1e-3, momentum=0.9):
        super(MomentumOptimizer, self).__init__(params, defaults={'lr': lr})
        self.momentum = momentum
        # State dictionary holding the velocity vector for each parameter
        self.state = dict()
        for group in self.param_groups:
            for p in group['params']:
                self.state[p] = dict(mom=torch.zeros_like(p.data))

    def step(self):
        for group in self.param_groups:
            for p in group['params']:
                # Skip parameters that did not receive a gradient
                if p.grad is None:
                    continue
                if p not in self.state:
                    self.state[p] = dict(mom=torch.zeros_like(p.data))
                mom = self.state[p]['mom']
                # v = momentum * v - lr * grad
                mom = self.momentum * mom - group['lr'] * p.grad.data
                # Store the updated velocity so it accumulates across steps
                self.state[p]['mom'] = mom
                p.data += mom

Step 3: Define a simple model and a loss function, and initialize an instance of the custom optimizer:

Python3

model = nn.Linear(2, 2)

criterion = nn.MSELoss()

optimizer = MomentumOptimizer(model.parameters(), lr=1e-3, momentum=0.9)

Step 4: Generate some random data to train the model

Python3

X = torch.randn(100, 2)
# Targets match the model's output shape of (100, 2)
y = torch.randn(100, 2)

Step 5: Train the model with the custom optimizer and plot the training loss.

Python3

for i in range(2500):
    optimizer.zero_grad()
    y_pred = model(X)
    loss = criterion(y_pred, y)

    # Record the loss every 100 iterations
    if i % 100 == 0:
        plt.plot(i, loss.item(), 'ro-')

    loss.backward()
    optimizer.step()

plt.title('Losses over iterations')
plt.xlabel('iterations')
plt.ylabel('Losses')
plt.show()

Output:

[Plot: Losses over iterations]

You should see that the custom optimizer correctly updates the model’s parameters and minimizes the loss function.

Note: The above loop is an example of how to use the custom optimizer and should help you understand how the optimizer’s step method works.

Customizing Optimizers:

There are many ways to customize optimizers in PyTorch. Some of them are as follows:

Changing the learning rate schedule:

The learning rate of the optimizer can be changed during training using a learning rate scheduler. PyTorch provides several built-in schedulers such as torch.optim.lr_scheduler.StepLR and torch.optim.lr_scheduler.ExponentialLR. We can also create our own scheduler by inheriting from the torch.optim.lr_scheduler._LRScheduler class (a minimal sketch of a custom scheduler follows the StepLR example below).

In the code below, we use the torch.optim.lr_scheduler.StepLR scheduler, which multiplies the learning rate by a factor of gamma every step_size iterations.

Python3

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

num_epochs = 200
for i in range(num_epochs):
    optimizer.zero_grad()

    y_pred = model(X)
    loss = criterion(y_pred, y)

    loss.backward()
    optimizer.step()

    # Update the learning rate once per epoch
    scheduler.step()
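
As mentioned above, a custom schedule can also be written by subclassing torch.optim.lr_scheduler._LRScheduler and overriding get_lr(). The sketch below is only illustrative: the class name HalvingLR and its rule of halving the learning rate every halve_every calls to step() are made up for this example.

Python3

class HalvingLR(torch.optim.lr_scheduler._LRScheduler):

    def __init__(self, optimizer, halve_every=50, last_epoch=-1):
        # Set custom attributes before calling the base constructor,
        # because it triggers an initial call to get_lr()
        self.halve_every = halve_every
        super().__init__(optimizer, last_epoch)

    def get_lr(self):
        # base_lrs holds the initial learning rate of each param group
        factor = 0.5 ** (self.last_epoch // self.halve_every)
        return [base_lr * factor for base_lr in self.base_lrs]

scheduler = HalvingLR(optimizer, halve_every=50)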

Adding regularization:

To add regularization to the optimizer, we can modify the step() method to include the regularization term in the update of the model parameters. For example, we can add L1 or L2 regularization by modifying the step() method to include a term that penalizes the absolute or squared values of the parameters respectively.

Python3

import math

class MyAdam(torch.optim.Adam):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), weight_decay=0):
        super().__init__(params, lr=lr, betas=betas)
        self.weight_decay = weight_decay

    def step(self):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                if grad.is_sparse:
                    raise RuntimeError("Adam does not support sparse gradients")

                state = self.state[p]

                # Initialize the optimizer state on the first step
                if len(state) == 0:
                    state["step"] = 0
                    state["exp_avg"] = torch.zeros_like(p.data)
                    state["exp_avg_sq"] = torch.zeros_like(p.data)

                exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]
                beta1, beta2 = group["betas"]

                state["step"] += 1

                # Weight decay (L2 regularization): add weight_decay * p to the gradient
                if self.weight_decay != 0:
                    grad = grad.add(p.data, alpha=self.weight_decay)

                # Update biased first and second moment estimates
                exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

                denom = exp_avg_sq.sqrt().add_(group["eps"])

                bias_correction1 = 1 - beta1 ** state["step"]
                bias_correction2 = 1 - beta2 ** state["step"]
                step_size = group["lr"] * math.sqrt(bias_correction2) / bias_correction1

                p.data.addcdiv_(exp_avg, denom, value=-step_size)

optimizer = MyAdam(model.parameters(), weight_decay=0.00002)

In the above code, we create a custom Adam optimizer that includes weight decay regularization by adding a weight_decay parameter to the optimizer and modifying the step() method to include the weight decay term in the parameter update. The weight decay term is applied to the gradients by grad = grad.add(p.data, alpha=self.weight_decay); this penalizes large parameter values by shrinking them at every update.
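
For comparison, the built-in torch.optim.Adam also accepts a weight_decay argument; writing the subclass by hand is mainly useful for seeing exactly where the decay term enters the update, or for swapping in a non-standard penalty such as an L1 term.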

Implementing a new optimization algorithm: 

PyTorch provides several built-in optimization algorithms, such as SGD, Adam, and Adagrad. However, there are many other optimization algorithms that are not included in the library. By creating a custom optimizer, we can implement any optimization algorithm that we want.

Python3

class MyOptimizer(torch.optim.Optimizer):
    def __init__(self, params, lr=0.01):
        defaults = dict(lr=lr)
        super(MyOptimizer, self).__init__(params, defaults)

    def step(self):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                # Update using the squared gradient instead of the gradient itself
                p.data = p.data - group['lr'] * p.grad.data ** 2

optimizer = MyOptimizer(model.parameters(), lr=0.001)

In this example, we created a new optimization algorithm called MyOptimizer that updates the parameters based on the squared gradient values instead of the gradients themselves.

Using multiple optimizers:

 In some cases, we may want to use different optimizers for different parts of the model. For example, we may want to use Adam for the parameters of the convolutional layers, and SGD for the parameters of the fully-connected layers. This can be achieved by creating multiple instances of the optimizer, one for each set of parameters.

Python3

params1 = model.conv_layers.parameters()
params2 = model.fc_layers.parameters()

optimizer1 = torch.optim.Adam(params1)
optimizer2 = torch.optim.SGD(params2, lr=0.01)

for i in range(num_epochs):
    ...
    optimizer1.zero_grad()
    optimizer2.zero_grad()
    loss.backward()
    optimizer1.step()
    optimizer2.step()

In this example, we are using Adam optimizer for the parameters of the convolutional layers, and SGD optimizer with a fixed learning rate of 0.01 for the parameters of the fully-connected layers. This can help fine-tune the training of specific parts of the model.
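
Note that when the update rule is the same for both parts of the model, a similar effect can be obtained by passing per-parameter-group options (for example, different learning rates) to a single optimizer; separate optimizer instances are needed when the algorithms themselves differ, as in this example.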

Illustration 2: 

Build a handwritten digit classification model using a custom optimizer.

Step 1: Import the necessary libraries.

Python3

import torch
import torch.nn as nn
from torch.optim import Optimizer
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
from torch.utils.tensorboard import SummaryWriter
import math
import matplotlib.pyplot as plt

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Step 2: Now, we’ll load the MNIST dataset and create a data loader for it.

Python3

dataset = MNIST(root='.', train=True, download=True, transform=ToTensor())
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
print(dataloader.dataset)

Output:

Dataset MNIST
    Number of datapoints: 60000
    Root location: .
    Split: Train
    StandardTransform
Transform: ToTensor()

Step 3: Let’s visualize the first batch of our dataset.

Python3

# Plot the first batch of images with their labels
for i, batch in enumerate(dataloader):
    figure = plt.figure(figsize=(16, 16))
    img, label = batch
    for j in range(img.shape[0]):
        figure.add_subplot(8, 8, j + 1)
        plt.imshow(img[j].squeeze(), cmap="gray")
        plt.title(label[j].item())
        plt.axis("off")

    plt.show()
    break

Output:

First batch input images - Geeksforgeeks

First batch input images

Step 4: Next, we’ll define our model architecture: a simple fully connected network with two hidden layers.

Python3

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(28*28, 512)
        self.fc2 = nn.Linear(512, 512)
        self.fc3 = nn.Linear(512, 10)

    def forward(self, x):
        x = x.view(-1, 28*28)
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x


model = Net().to(device)

Step 5: Define the loss function; in this case, we’ll use the cross-entropy loss.

Python3

loss_fn = nn.CrossEntropyLoss()

Step 6: Next, we’ll define our custom optimizer (the same MyAdam class shown earlier).

Python3

class MyAdam(torch.optim.Adam):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), weight_decay=0):
        super().__init__(params, lr=lr, betas=betas)
        self.weight_decay = weight_decay

    def step(self):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                if grad.is_sparse:
                    raise RuntimeError("Adam does not support sparse gradients")

                state = self.state[p]

                # Initialize the optimizer state on the first step
                if len(state) == 0:
                    state["step"] = 0
                    state["exp_avg"] = torch.zeros_like(p.data)
                    state["exp_avg_sq"] = torch.zeros_like(p.data)

                exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]
                beta1, beta2 = group["betas"]

                state["step"] += 1

                # Weight decay (L2 regularization): add weight_decay * p to the gradient
                if self.weight_decay != 0:
                    grad = grad.add(p.data, alpha=self.weight_decay)

                # Update biased first and second moment estimates
                exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

                denom = exp_avg_sq.sqrt().add_(group["eps"])

                bias_correction1 = 1 - beta1 ** state["step"]
                bias_correction2 = 1 - beta2 ** state["step"]
                step_size = group["lr"] * math.sqrt(bias_correction2) / bias_correction1

                p.data.addcdiv_(exp_avg, denom, value=-step_size)

optimizer = MyAdam(model.parameters(), weight_decay=0.00001)

Step 7: Now, train the model with the custom optimizer and plot the training loss.

Python3

num_epochs = 10
for i in range(num_epochs):
    for inputs, labels in dataloader:
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model(inputs)
        loss = loss_fn(outputs, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Record the last loss of each epoch
    plt.plot(i, loss.item(), 'ro-')
    print(i, '>> Loss :', loss.item())

plt.title('Losses over iterations')
plt.xlabel('iterations')
plt.ylabel('Losses')
plt.show()

Output:

0 >> Loss : nan
1 >> Loss : 1.2611686178923354e-44
2 >> Loss : nan
3 >> Loss : 8.407790785948902e-45
4 >> Loss : nan
5 >> Loss : 1.401298464324817e-45
6 >> Loss : nan
7 >> Loss : 0.0
8 >> Loss : nan
9 >> Loss : 1.401298464324817e-45
[Plot: Losses over iterations]

Note: Losses will differ across devices and runs.

Conclusion:

Creating custom optimizers in PyTorch is a powerful technique that allows us to fine-tune the training process of a machine learning model. By inheriting from the torch.optim.Optimizer class and implementing the __init__ and step methods (zero_grad is inherited from the base class), we can implement our own optimization algorithm, add regularization, change the learning rate schedule, or combine multiple optimizers. Custom optimizers can help improve the performance of a model and make it more suitable for a specific problem.
