Gym Experiments with PGPE and CoSyNE

Training Policies for Gym using PGPE and CoSyNE

This example demonstrates how you can train policies using EvoTorch and Gym. To execute this example, you will need to install Gym's subpackages with:

    pip install 'gym[box2d,mujoco]'
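
If EvoTorch itself is not yet installed in your environment, it is available from PyPI and can be installed the usual way:

    pip install evotorch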

This example is based on our paper [1] where we describe the ClipUp optimiser and compare it to the Adam optimiser. In particular, we will re-implement the experiment for the "LunarLanderContinuous-v2" environment.

Defining the Problem

To begin with, we will need to create the Problem instance. To do this, we first define the policy we wish to use. All experiments in [1], except "HumanoidBulletEnv-v0", used a linear policy, so let's define this as a torch module. Additionally, whether the policy used a bias varied across experiments, so we'll expose it as a parameter of the module that you can freely play with.

import torch
from torch import nn

class LinearPolicy(nn.Module):

    def __init__(
        self,
        obs_length: int,  # Number of observations from the environment
        act_length: int,  # Number of actions of the environment
        bias: bool = True,  # Whether the policy should use biases
        **kwargs,  # Anything else that is passed
    ):
        super().__init__()  # Always call super().__init__() for nn.Module subclasses
        self.linear = nn.Linear(obs_length, act_length, bias=bias)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # The forward pass simply applies the linear layer to the observations
        return self.linear(obs)
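
As a quick sanity check (not part of the original experiment), we can instantiate this policy and pass a dummy observation through it. The dimensions below assume "LunarLanderContinuous-v2", which has an 8-dimensional observation space and a 2-dimensional action space:

# Sanity check: assumed dimensions for LunarLanderContinuous-v2 (8 observations, 2 actions)
policy = LinearPolicy(obs_length=8, act_length=2, bias=False)
dummy_obs = torch.randn(8)
print(policy(dummy_obs).shape)  # Expected: torch.Size([2])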

Now we're ready to define the problem. Let's start with the "LunarLanderContinuous-v2" environment.

from evotorch.neuroevolution import GymNE

problem = GymNE(
    env_name="LunarLanderContinuous-v2",  # Name of the environment
    network=LinearPolicy,  # The linear policy that we defined earlier
    network_args={'bias': False},  # The linear policy should not use biases
    num_actors=4,  # Use 4 actors. You can modify this value, or use 'max' to exploit all available CPUs
    observation_normalization=False,  # Observation normalization was not used in the Lunar Lander experiments
)
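
At this point the problem already knows the shape of the search space. As a small check (a sketch using the standard solution_length attribute of EvoTorch problems), the number of evolved parameters should equal obs_length * act_length, since the policy has no bias:

# For a bias-free linear policy on LunarLanderContinuous-v2 this should be
# 8 observations * 2 actions = 16 parameters
print(problem.solution_length)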

Creating the searcher

With our problem created, we're ready to create the searcher. We're using PGPE and ClipUp with the parameters described in [1]:

from evotorch.algorithms import PGPE

radius_init = 4.5  # (approximate) radius of initial hypersphere that we will sample from
max_speed = radius_init / 15.  # Rule-of-thumb from the paper
center_learning_rate = max_speed / 2.

searcher = PGPE(
    problem,
    popsize=200,  # For now we use a static population size
    radius_init=radius_init,  # The searcher can be initialised directly with an initial radius, rather than a stdev
    center_learning_rate=center_learning_rate,
    stdev_learning_rate=0.1,  # stdev learning rate of 0.1 was used across all experiments
    optimizer="clipup",  # Using the ClipUp optimiser
    optimizer_config = {
        'max_speed': max_speed,  # with the defined max speed 
        'momentum': 0.9,  # and momentum fixed to 0.9
    }
)

Training the policy

Now we're ready to train. We'll run evolution for 50 generations, and use the StdOutLogger logger to track progress.

from evotorch.logging import StdOutLogger

StdOutLogger(searcher)
searcher.run(50)
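
As an aside, if you would rather collect the progress into a dataframe for later inspection or plotting than print it to stdout, EvoTorch also provides a PandasLogger. A brief sketch (the logger is attached before running; it assumes pandas is installed, plus matplotlib for the final plot):

from evotorch.logging import PandasLogger

pandas_logger = PandasLogger(searcher)  # Attach the logger before calling run(...)
searcher.run(50)
progress = pandas_logger.to_dataframe()  # One row of status values per generation
progress["mean_eval"].plot()  # Mean population fitness over generations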

With our agent trained, it is straightforward to visualize the learned behaviour. For this, we will use \(\mu\), the learned center of the search distribution, as a 'best estimate' of a good policy for the environment.

center_solution = searcher.status["center"]  # Get mu
policy_net = problem.to_policy(center_solution)  # Instantiate a policy from mu
for _ in range(10):  # Visualize 10 episodes
    result = problem.visualize(policy_net)
    print('Visualised episode has cumulative reward:', result['cumulative_reward'])
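
Because the return of a single episode can be noisy, it may also help to aggregate the cumulative rewards over the visualized episodes rather than inspecting them one by one. A small variation of the loop above:

# Collect cumulative rewards over several episodes and report their mean
returns = []
for _ in range(10):
    result = problem.visualize(policy_net)
    returns.append(result['cumulative_reward'])
print('Mean cumulative reward over 10 episodes:', sum(returns) / len(returns))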

Training with CoSyNE

As an alternative, we consider training the policy with the CoSyNE [2] algorithm, using a configuration close to that of the pole-balancing experiments in [2]. Because the algorithm is more sensitive to evaluation noise, we'll use additional evaluation repeats. To begin with, we ensure that the actors of the previous GymNE instance are killed, and then define a new GymNE instance.

problem.kill_actors()
problem = GymNE(
    env_name="LunarLanderContinuous-v2",
    network=LinearPolicy,
    network_args={'bias': False},
    num_actors=4,
    observation_normalization=False,
    num_episodes=3,  # Average the fitness over 3 episodes to reduce evaluation noise
    initial_bounds=(-0.3, 0.3),
)

When defining the algorithm configuration, we aim to keep the overall number of evaluations per generation roughly the same, so we use 50 individuals per generation. Additionally, we keep 1 elite individual per generation to encourage exploitation.

from evotorch.algorithms import Cosyne

searcher = Cosyne(
    problem,
    num_elites=1,
    popsize=50,
    tournament_size=4,
    mutation_stdev=0.3,
    mutation_probability=0.5,
    permute_all=True,
)
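
To see why this configuration keeps the per-generation evaluation budget roughly comparable to the PGPE run above, here is a rough back-of-the-envelope count (it ignores any additional re-evaluations either algorithm may perform internally):

# Rough number of episode rollouts per generation
pgpe_episodes_per_gen = 200 * 1   # popsize 200, 1 episode per evaluation
cosyne_episodes_per_gen = 50 * 3  # popsize 50, num_episodes=3
print(pgpe_episodes_per_gen, cosyne_episodes_per_gen)  # 200 vs. 150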

Once again running for 50 generations with a StdOutLogger attached to output the progress:

StdOutLogger(searcher)
searcher.run(50)
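
Before visualizing, you can also check the fitness reached by the best member of the final population. This assumes the usual status items reported by EvoTorch's population-based searchers (such as pop_best_eval):

# Fitness of the best solution in the final population
# (cumulative reward, averaged over num_episodes=3)
print(searcher.status["pop_best_eval"])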

And once again we can visualize the learned policy. As CoSyNE is population-based, it does not maintain a 'best estimate' of a good policy. Instead, we simply take the best-performing solution from the current population.

best_solution = searcher.status["pop_best"]  # Get the best solution in the population
policy_net = problem.to_policy(best_solution)  # Instantiate the policy from the best solution
for _ in range(10): # Visualize 10 episodes
    result = problem.visualize(policy_net)
    print('Visualised episode has cumulative reward:', result['cumulative_reward'])

References

[1] Toklu, Nihat Engin, et al. "ClipUp: A Simple and Powerful Optimizer for Distribution-Based Policy Evolution." International Conference on Parallel Problem Solving from Nature. Springer, Cham, 2020.

[2] Gomez, Faustino, et al. "Accelerated Neural Evolution through Cooperatively Coevolved Synapses." Journal of Machine Learning Research 9.5 (2008).


See this notebook on GitHub