
Training MNIST30K

Training a CNN on MNIST using evotorch

This example demonstrates how to apply the evotorch.neuroevolution.SupervisedNE Problem class to training a CNN on MNIST. It follows the set-up described in the DeepMind paper [1].

To run this example, make sure that torchvision is installed.

Setting up the Problem class

First we will define the model. For this example, we will use the 'MNIST-30k' model from the paper, which is defined below. Note that Table 1 of the paper has a typo: the second convolution should have a 5x5 kernel rather than a 2x2 kernel. With the 5x5 kernel, the model has the number of parameters the authors listed.

import torch
from torch import nn
from evotorch.neuroevolution.net import count_parameters

class MNIST30K(nn.Module):
    def __init__(self) -> None:
        # nn.Module must be initialized before any submodules are assigned
        super().__init__()
        # The first convolution uses a 5x5 kernel and has 16 filters
        self.conv1 = nn.Conv2d(1, 16, kernel_size = 5, stride = 1, padding=2)
        # Then max pooling is applied with a kernel size of 2
        self.pool1 = nn.MaxPool2d(kernel_size = 2)
        # The second convolution uses a 5x5 kernel and has 32 filters
        self.conv2 = nn.Conv2d(16, 32, kernel_size = 5, stride = 1, padding = 2)
        # Another max pooling is applied with a kernel size of 2
        self.pool2 = nn.MaxPool2d(kernel_size = 2)

        # Apply layer normalization after the second pool
        self.norm = nn.LayerNorm(1568, elementwise_affine=False)

        # A final linear layer maps outputs to the 10 target classes
        self.out = nn.Linear(1568, 10)

        # All activations are ReLU
        self.act = nn.ReLU()

    def forward(self, data: torch.Tensor) -> torch.Tensor:
        # Apply the first conv + pool
        data = self.pool1(self.act(self.conv1(data)))
        # Apply the second conv + pool
        data = self.pool2(self.act(self.conv2(data)))

        # Flatten, then apply layer norm
        data = self.norm(data.flatten(start_dim = 1))

        # Apply the output linear layer
        data = self.out(data)

        return data

network = MNIST30K()
print(f'Network has {count_parameters(network)} parameters')
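As a sanity check on the kernel-size remark, the parameter count and the 1568 flattened features can be worked out by hand. The sketch below uses only plain Python. The helper names conv2d_params, linear_params and conv_out_size are made up for this illustration and are not part of evotorch or PyTorch.

```python
def conv2d_params(in_ch, out_ch, kernel):
    # One kernel x kernel filter per (input, output) channel pair, plus one bias per output channel
    return in_ch * out_ch * kernel * kernel + out_ch


def linear_params(in_features, out_features):
    # Weight matrix plus one bias per output
    return in_features * out_features + out_features


def conv_out_size(size, kernel, stride=1, padding=0):
    # Standard output-size formula for a convolution along one spatial dimension
    return (size + 2 * padding - kernel) // stride + 1


side = 28                                        # MNIST images are 28x28
side = conv_out_size(side, kernel=5, padding=2)  # conv1 keeps 28
side = side // 2                                 # pool1 halves it to 14
side = conv_out_size(side, kernel=5, padding=2)  # conv2 keeps 14
side = side // 2                                 # pool2 halves it to 7
flat_features = 32 * side * side                 # 32 channels of 7x7

# LayerNorm with elementwise_affine=False has no learnable parameters
total_5x5 = (
    conv2d_params(1, 16, 5)
    + conv2d_params(16, 32, 5)
    + linear_params(flat_features, 10)
)

print("flattened features:", flat_features)
print("conv2 parameters with a 5x5 kernel:", conv2d_params(16, 32, 5))
print("conv2 parameters with a 2x2 kernel:", conv2d_params(16, 32, 2))
print("total parameters with 5x5 kernels:", total_5x5)
```

The total with 5x5 kernels comes out just under 30k, which matches the model's name. A 2x2 second convolution would contribute far fewer parameters.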

Now let's pull the dataset and apply the standard MNIST transforms.

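from torchvision import datasets, transforms

# Convert images to tensors, then normalize with the usual MNIST mean and standard deviation
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])
train_dataset = datasets.MNIST('../data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST('../data', train=False, transform=transform)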
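For intuition about the Normalize step: ToTensor scales pixel values to the range [0, 1], and Normalize then subtracts the mean and divides by the standard deviation. The following plain-Python sketch applies that formula to single pixel values. The normalize helper is a hypothetical stand-in written for this example, not the torchvision implementation.

```python
MNIST_MEAN = 0.1307  # mean passed to transforms.Normalize
MNIST_STD = 0.3081   # standard deviation passed to transforms.Normalize


def normalize(pixel):
    # Same per-value formula that Normalize applies: (x - mean) / std
    return (pixel - MNIST_MEAN) / MNIST_STD


black = normalize(0.0)  # a fully black pixel after ToTensor
white = normalize(1.0)  # a fully white pixel after ToTensor
print(f"black -> {black:.4f}, white -> {white:.4f}")
```

Since most MNIST pixels are black, most normalized inputs sit slightly below zero, with the digit strokes mapped to clearly positive values.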

Now we are ready to create a custom problem class. The problem below is configured to use 4 actors and to divide the available GPUs between them. You can scale this up to dozens or even hundreds of CPUs and GPUs on a ray cluster simply by modifying the num_actors parameter.

from evotorch.neuroevolution import SupervisedNE

mnist_problem = SupervisedNE(
    train_dataset,  # Using the dataset specified earlier
    MNIST30K,  # Training the MNIST30K module designed earlier
    nn.CrossEntropyLoss(),  # Minimizing CrossEntropyLoss
    minibatch_size = 1024,  # With a minibatch size of 1024
    common_minibatch = True,  # Always using the same minibatch across all solutions on an actor
    num_actors = 4,  # The total number of CPUs used
    num_gpus_per_actor = 'max',  # Dividing all available GPUs between the 4 actors
    subbatch_size = 50,  # Evaluating solutions in sub-batches of size 50 ensures we won't run out of GPU memory for individual workers
)

Now we can set up the searcher.

In the paper, the authors used SNES with, effectively, default parameters and a standard deviation of 1. They achieved 98%+ with a population size of only 1k, but this value can be pushed further. Note that with the distributed = True keyword argument, each actor computes a semi-update locally, and these semi-updates are then averaged.

In our example, we use PGPE with a population size of 3200. Hyperparameter configuration can be seen below:

from evotorch.algorithms import PGPE

searcher = PGPE(
    mnist_problem, # The problem defined earlier
    popsize=3200, # Population size of 3200
    radius_init=2.25, # Initial radius of the search distribution
    center_learning_rate=1e-2, # Learning rate used by adam optimizer
    stdev_learning_rate=0.1, # Learning rate for the standard deviation
    distributed=True, # Gradients are computed locally at actors and averaged
    optimizer="adam", # Using the adam optimizer
    ranking_method=None, # No rank-based fitness shaping is used
)
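To make the idea behind distributed=True more concrete, here is a toy sketch in plain Python. Each simulated actor samples its own sub-population and forms a local gradient estimate for the center of the search distribution, and the main loop averages those estimates. This only illustrates the averaging idea and is not evotorch's implementation. The toy objective, the function names, the plain gradient step (instead of adam) and the fixed standard deviation are all assumptions made for this example.

```python
import random


def fitness(x):
    # Toy objective to maximize, with its peak at the origin
    return -sum(v * v for v in x)


def local_gradient(center, stdev, num_pairs, rng):
    # PGPE-style estimate from symmetric (mirrored) samples drawn by one actor
    grad = [0.0] * len(center)
    for _ in range(num_pairs):
        eps = [rng.gauss(0.0, stdev) for _ in center]
        f_plus = fitness([c + e for c, e in zip(center, eps)])
        f_minus = fitness([c - e for c, e in zip(center, eps)])
        for i, e in enumerate(eps):
            grad[i] += e * (f_plus - f_minus) / 2.0
    return [g / num_pairs for g in grad]


def averaged_gradient(center, stdev, num_actors, pairs_per_actor, rng):
    # Each actor computes its own estimate; the estimates are then averaged
    grads = [
        local_gradient(center, stdev, pairs_per_actor, rng)
        for _ in range(num_actors)
    ]
    return [sum(column) / num_actors for column in zip(*grads)]


rng = random.Random(0)
center = [2.0, -1.5, 0.5]
start_fitness = fitness(center)

for _ in range(100):
    grad = averaged_gradient(center, stdev=0.1, num_actors=4, pairs_per_actor=25, rng=rng)
    # Plain gradient ascent step on the center of the search distribution
    center = [c + 5.0 * g for c, g in zip(center, grad)]

end_fitness = fitness(center)
print(f"fitness before: {start_fitness:.3f}, after: {end_fitness:.6f}")
```

Because each actor only needs to send back a gradient estimate rather than every evaluated solution, this style of update keeps communication between the actors and the main process small.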

Let's create some loggers. We'll run evolution for quite a long time, so it may be worth reducing the log frequency by raising interval above the value of 1 used here, which logs every generation.

from evotorch.logging import StdOutLogger, PandasLogger
stdout_logger = StdOutLogger(searcher, interval = 1)
pandas_logger = PandasLogger(searcher, interval = 1)

Running evolution for 400 generations (note that the paper used 10k generations):

searcher.run(400)

We can visualize the progress:

pandas_logger.to_dataframe().mean_eval.plot()


And of course, it is worthwhile to measure the test performance:

#net = mnist_problem.parameterize_net(searcher.status['center']).cpu()
net = mnist_problem.make_net(searcher.status["center"]).cpu()

loss = torch.nn.CrossEntropyLoss()
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size = 256, shuffle = False)
test_loss = 0
correct = 0

with torch.no_grad():
    for data, target in test_loader:
        output = net(data)
        # The loss is a mean over the batch, so weight it by the batch size
        test_loss += loss(output, target).item() * data.shape[0]
        # The predicted class is the index of the largest output
        pred = output.max(1, keepdim=True)[1]
        correct += pred.eq(target.view_as(pred)).sum().item()

test_loss /= len(test_loader.dataset)
print('Test set: Avg. loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
    test_loss, correct, len(test_loader.dataset),
    100. * correct / len(test_loader.dataset)))
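The bookkeeping in the evaluation loop boils down to two things: an argmax per row compared against the labels, and a batch-size-weighted average of per-batch mean losses. The plain-Python sketch below shows both on made-up numbers. The logits, targets and batch losses are invented purely for illustration.

```python
# Made-up model outputs for four samples and three classes
logits = [
    [0.1, 2.3, -1.0],
    [1.5, 0.2, 0.3],
    [-0.4, 0.0, 0.9],
    [0.7, 0.6, 0.1],
]
targets = [1, 0, 2, 1]


def argmax(row):
    # Index of the largest value, i.e. the predicted class
    return max(range(len(row)), key=row.__getitem__)


predictions = [argmax(row) for row in logits]
correct = sum(int(p == t) for p, t in zip(predictions, targets))
accuracy = 100.0 * correct / len(targets)

# Made-up (mean loss, batch size) pairs; the last batch is smaller than the rest
batch_losses = [(0.5, 256), (0.7, 16)]
total_samples = sum(size for _, size in batch_losses)
# Weighting each mean by its batch size recovers the true per-sample average
average_loss = sum(mean * size for mean, size in batch_losses) / total_samples

print(predictions, correct, accuracy)
print(round(average_loss, 4))
```

Weighting by batch size matters because the final batch of the test set is usually smaller than the others, so a plain mean of batch means would overweight it.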


[1] Lenc, Karel, et al. "Non-differentiable supervised learning with evolution strategies and hybrid methods." arXiv preprint arXiv:1906.03139 (2019).
