Training Policies for Gym using PGPE and CoSyNE
This example demonstrates how you can train policies using EvoTorch and Gym. To execute this example, you will need to install the Gym subpackages required by the environment, for example pip install 'gym[box2d]' for the Box2D-based Lunar Lander environment used here.
This example is based on our paper [1] where we describe the ClipUp optimiser and compare it to the Adam optimiser. In particular, we will re-implement the experiment for the "LunarLanderContinuous-v2" environment.
Defining the Problem
To begin with, we will need to create the problem instance. To do this, we will first define the policy we wish to use. All experiments in [1], except "HumanoidBulletEnv-v0", used a linear policy, so let's define one as a torch module. The presence of a bias varied across experiments, so we'll expose it as a constructor argument that you can freely experiment with.
import torch
from torch import nn


class LinearPolicy(nn.Module):
    def __init__(
        self,
        obs_length: int,  # Number of observations from the environment
        act_length: int,  # Number of actions of the environment
        bias: bool = True,  # Whether the policy should use biases
        **kwargs  # Anything else that is passed
    ):
        super().__init__()  # Always call super init for nn Modules
        self.linear = nn.Linear(obs_length, act_length, bias=bias)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Forward pass of model simply applies linear layer to observations
        return self.linear(obs)
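As a quick, optional sanity check (not part of the original experiment), we can instantiate the policy with the dimensions of "LunarLanderContinuous-v2" (8 observations, 2 actions) and pass a random observation through it:

policy = LinearPolicy(obs_length=8, act_length=2, bias=False)  # Dimensions of LunarLanderContinuous-v2
dummy_obs = torch.randn(8)  # A random observation, only used to check shapes
print(policy(dummy_obs).shape)  # Expected output: torch.Size([2])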
Now we're ready to define the problem. Let's start with the "LunarLanderContinuous-v2" environment.
from evotorch.neuroevolution import GymNE

problem = GymNE(
    env_name="LunarLanderContinuous-v2",  # Name of the environment
    network=LinearPolicy,  # Linear policy that we defined earlier
    network_args={'bias': False},  # Linear policy should not use biases
    num_actors=4,  # Use 4 available CPUs. Note that you can modify this value, or use 'max' to exploit all available CPUs
    observation_normalization=False,  # Observation normalization was not used in Lunar Lander experiments
)
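If you wish to verify the setup, the optional sketch below assumes the solution_length property of an EvoTorch Problem and prints the number of parameters being evolved; for a bias-free linear policy mapping 8 observations to 2 actions, this should be 16.

print(problem.solution_length)  # 8 observations * 2 actions = 16 parameters for this policy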
Creating the searcher
With our problem created, we're ready to create the searcher. We're using PGPE and ClipUp with the parameters described in [1]:
from evotorch.algorithms import PGPE

radius_init = 4.5  # (Approximate) radius of the initial hypersphere that we will sample from
max_speed = radius_init / 15.0  # Rule-of-thumb from the paper
center_learning_rate = max_speed / 2.0

searcher = PGPE(
    problem,
    popsize=200,  # For now we use a static population size
    radius_init=radius_init,  # The searcher can be initialised directly with an initial radius, rather than a stdev
    center_learning_rate=center_learning_rate,
    stdev_learning_rate=0.1,  # A stdev learning rate of 0.1 was used across all experiments
    optimizer="clipup",  # Using the ClipUp optimiser
    optimizer_config={
        'max_speed': max_speed,  # with the defined max speed
        'momentum': 0.9,  # and momentum fixed to 0.9
    },
)
Training the policy
Now we're ready to train. We'll run evolution for 50 generations, using a StdOutLogger to track progress.
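A minimal sketch of this training step, assuming EvoTorch's StdOutLogger (from evotorch.logging) and the searcher's run method:

from evotorch.logging import StdOutLogger

StdOutLogger(searcher)  # Report the searcher's status at the end of each generation
searcher.run(50)  # Run evolution for 50 generations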
With our agent trained, it is straightforward to visualize the learned behaviour. For this, we will use \(\mu\), the learned center of the search distribution, as a 'best estimate' of a good policy for the environment.
center_solution = searcher.status["center"]  # Get mu
policy_net = problem.to_policy(center_solution)  # Instantiate a policy from mu

for _ in range(10):  # Visualize 10 episodes
    result = problem.visualize(policy_net)
    print('Visualised episode has cumulative reward:', result['cumulative_reward'])
Training with CoSyNE
As an alternative, we consider training the policy with the CoSyNE [2] algorithm, using a configuration close to the one used for the pole-balancing experiments in [2]. Because the algorithm is more sensitive to noise, we'll use additional evaluation repeats (num_episodes). To begin with, we'll make sure the actors of the previous GymNE instance are killed (see the sketch below), and then define a new GymNE instance.
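A sketch of that cleanup step, assuming the kill_actors method available on EvoTorch problems:

problem.kill_actors()  # Release the actors held by the previous GymNE instance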
problem = GymNE(
    env_name="LunarLanderContinuous-v2",
    network=LinearPolicy,
    network_args={'bias': False},
    num_actors=4,
    observation_normalization=False,
    num_episodes=3,
    initial_bounds=(-0.3, 0.3),
)
When defining the algorithm configuration, we aim to keep the overall number of evaluations per generation roughly the same, so we use 50 individuals per generation. Additionally, we'll keep 1 elite individual per generation to encourage exploitation.
from evotorch.algorithms import Cosyne

searcher = Cosyne(
    problem,
    num_elites=1,
    popsize=50,
    tournament_size=4,
    mutation_stdev=0.3,
    mutation_probability=0.5,
    permute_all=True,
)
Once again, we run for 50 generations with a StdOutLogger attached to output the progress:
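As before, a minimal sketch of this step, assuming StdOutLogger and run as above:

from evotorch.logging import StdOutLogger

StdOutLogger(searcher)  # Report the searcher's status at the end of each generation
searcher.run(50)  # Run evolution for 50 generations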
And once again we can visualize the learned policy. As CoSyNE is population-based, it does not maintain a 'best estimate' of a good policy. Instead, we simply take the best-performing solution from the current population.
pop_best_solution = searcher.status["pop_best"]  # Get the best solution in the population
policy_net = problem.to_policy(pop_best_solution)  # Instantiate the policy from the best solution

for _ in range(10):  # Visualize 10 episodes
    result = problem.visualize(policy_net)
    print('Visualised episode has cumulative reward:', result['cumulative_reward'])
References
[1] Toklu, et al. "ClipUp: A Simple and Powerful Optimizer for Distribution-Based Policy Evolution." International Conference on Parallel Problem Solving from Nature. Springer, Cham, 2020.
[2] Gomez, Faustino, et al. "Accelerated Neural Evolution through Cooperatively Coevolved Synapses." Journal of Machine Learning Research 9.5 (2008).