Training a CNN on MNIST using evotorch¶
This example demonstrates the application of the evotorch.neuroevolution.SupervisedNE Problem class to training a CNN on MNIST. It follows the set-up described in the recent DeepMind paper [1].
Note that torchvision must be installed to run this example.
Setting up the Problem class¶
First we will define the model. For this example, we will use the 'MNIST-30K' model from the paper, which is defined below. Note that Table 1 of the paper has a typo: the second convolution should have a 5x5 kernel rather than a 2x2 kernel. With that correction, the parameter count matches the value the authors list.
import torch
from torch import nn
from evotorch.neuroevolution.net import count_parameters
class MNIST30K(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        # The first convolution uses a 5x5 kernel and has 16 filters
        self.conv1 = nn.Conv2d(1, 16, kernel_size = 5, stride = 1, padding = 2)
        # Then max pooling is applied with a kernel size of 2
        self.pool1 = nn.MaxPool2d(kernel_size = 2)
        # The second convolution uses a 5x5 kernel and has 32 filters
        self.conv2 = nn.Conv2d(16, 32, kernel_size = 5, stride = 1, padding = 2)
        # Another max pooling is applied with a kernel size of 2
        self.pool2 = nn.MaxPool2d(kernel_size = 2)

        # Apply layer normalization after the second pool
        self.norm = nn.LayerNorm(1568, elementwise_affine=False)

        # A final linear layer maps outputs to the 10 target classes
        self.out = nn.Linear(1568, 10)

        # All activations are ReLU
        self.act = nn.ReLU()

    def forward(self, data: torch.Tensor) -> torch.Tensor:
        # Apply the first conv + pool
        data = self.pool1(self.act(self.conv1(data)))
        # Apply the second conv + pool
        data = self.pool2(self.act(self.conv2(data)))
        # Flatten and apply layer norm
        data = self.norm(data.flatten(start_dim = 1))
        # Apply the output linear layer
        data = self.out(data)
        return data
network = MNIST30K()
print(f'Network has {count_parameters(network)} parameters')
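As an optional sanity check (not part of the original example), a forward pass on a dummy MNIST-sized input should produce one logit per class; after two 2x2 poolings the 28x28 input becomes 7x7, so the flattened feature size is 32 * 7 * 7 = 1568, matching the layer norm and output layer above.

dummy = torch.zeros(1, 1, 28, 28)  # a single blank MNIST-sized image
print(network(dummy).shape)        # expected: torch.Size([1, 10])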
Now let's pull the dataset, using the standard transforms.
from torchvision import datasets, transforms
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])

train_dataset = datasets.MNIST('../data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST('../data', train=False, transform=transform)
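As another optional check (our addition, not part of the original set-up), each sample should now be a normalized 1x28x28 tensor paired with an integer digit label:

sample, label = train_dataset[0]
print(sample.shape, label)  # expected: torch.Size([1, 28, 28]) and a label in 0-9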
Now we are ready to create the problem instance. The configuration below uses 4 actors and divides the available GPUs between them. You can scale this up to dozens or even hundreds of CPUs and GPUs on a ray cluster simply by modifying the num_actors parameter.
from evotorch.neuroevolution import SupervisedNE
mnist_problem = SupervisedNE(
    train_dataset,              # Using the dataset specified earlier
    MNIST30K,                   # Training the MNIST30K module designed earlier
    nn.CrossEntropyLoss(),      # Minimizing CrossEntropyLoss
    minibatch_size = 1024,      # With a minibatch size of 1024
    common_minibatch = True,    # Always using the same minibatch across all solutions on an actor
    num_actors = 4,             # The total number of CPUs used
    num_gpus_per_actor = 'max', # Dividing all available GPUs between the 4 actors
    subbatch_size = 50,         # Evaluating solutions in sub-batches of size 50 ensures we won't run out of GPU memory for individual workers
)
Training¶
Now we can set up the searcher.
In the paper, the authors used SNES with, effectively, default parameters and a standard deviation of 1. They achieved 98%+ accuracy with a population size of only 1k, although this value can be pushed further. Note that by using the distributed = True keyword argument, we obtain semi-updates from the individual actors, which are then averaged.
In our example, we use PGPE with a population size of 3200. The hyperparameter configuration can be seen below:
from evotorch.algorithms import PGPE
searcher = PGPE(
    mnist_problem,
    radius_init=2.25,           # Initial radius of the search distribution
    center_learning_rate=1e-2,  # Learning rate used by adam optimizer
    stdev_learning_rate=0.1,    # Learning rate for the standard deviation
    popsize=3200,
    distributed=True,           # Gradients are computed locally at actors and averaged
    optimizer="adam",           # Using the adam optimizer
    ranking_method=None,        # No rank-based fitness shaping is used
)
Let's create some loggers. We'll run evolution for quite a long time, so it's worth reducing the log frequency (by increasing the interval argument) if the output becomes too verbose.
from evotorch.logging import StdOutLogger, PandasLogger
stdout_logger = StdOutLogger(searcher, interval = 1)
pandas_logger = PandasLogger(searcher, interval = 1)
Running evolution for 400 generations (note that in the paper, it was 10k generations)...
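The snippet below is a minimal way to do this, using the searcher's run method with the 400 generations mentioned above:

searcher.run(400)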
We can visualize the progress:
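A minimal sketch of one way to do this, assuming the mean_eval status item reported by PGPE and using matplotlib for plotting:

import matplotlib.pyplot as plt

progress = pandas_logger.to_dataframe()  # one row per logged generation
progress['mean_eval'].plot()             # mean population loss per generation
plt.xlabel('Generation')
plt.ylabel('Mean evaluation (cross-entropy loss)')
plt.show()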
And of course, it is worthwhile to measure the test performance:
net = mnist_problem.make_net(searcher.status["center"]).cpu()
loss = torch.nn.CrossEntropyLoss()
net.eval()
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size = 256, shuffle = False)
test_loss = 0
correct = 0
with torch.no_grad():
    for data, target in test_loader:
        output = net(data)
        test_loss += loss(output, target).item() * data.shape[0]
        pred = output.data.max(1, keepdim=True)[1]
        correct += pred.eq(target.data.view_as(pred)).sum()

test_loss /= len(test_loader.dataset)

print('Test set: Avg. loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
    test_loss, correct, len(test_loader.dataset),
    100. * correct / len(test_loader.dataset)))
References¶
[1] Lenc, Karel, et al. "Non-differentiable supervised learning with evolution strategies and hybrid methods." arXiv preprint arXiv:1906.03139 (2019).