
Neuroevolution for gym Environments

Overview of GymNE

gym environments are a mainstay of the reinforcement learning literature. When training agents for these environments with neuroevolution, we typically use an episodic reward. For a given environment and policy, evaluation of the policy typically follows:

episodic_reward = 0.0
done = False
observation = environment.reset()
while not done:
    action = policy(observation)
    observation, step_reward, done, info = environment.step(action)
    episodic_reward += step_reward

where episodic_reward is the value that we aim to maximize.

EvoTorch provides direct support for neuroevolution of agents for gym environments through the GymNE class. This class uses the gym.make function, meaning that you can create a reinforcement learning problem simply by passing the name of the environment. For example,

from evotorch.neuroevolution import GymNE

problem = GymNE(
    env_name="LunarLanderContinuous-v2",  # Name of the environment
    network="Linear(obs_length, act_length)",  # Linear policy mapping observations to actions
    num_actors=4,  # Use 4 CPUs. You can modify this value, or pass 'max' to use all available CPUs
)

will create a GymNE instance for the "LunarLanderContinuous-v2" environment with a Linear policy that takes obs_length inputs (the number of observations) and returns act_length outputs (the number of actions). In general, GymNE automatically provides obs_length, act_length and obs_space (the observation space of the environment) to the instantiation of the policy, meaning that you can also define policy classes with respect to the dimensions of the environment:

from gym.spaces import Space
import torch

class CustomPolicy(torch.nn.Module):

    def __init__(self, obs_length: int, act_length: int, obs_space: Space):
        # obs_length, act_length and obs_space are provided automatically by GymNE
        super().__init__()
        self.lin1 = torch.nn.Linear(obs_length, 32)
        self.act = torch.nn.Tanh()
        self.lin2 = torch.nn.Linear(32, act_length)

    def forward(self, data):
        return self.lin2(self.act(self.lin1(data)))

problem = GymNE(
    env_name="LunarLanderContinuous-v2",
    network=CustomPolicy,
    num_actors=4,
)

You can specify additional arguments to pass to the instantiation of the environment, just as you would pass keyword arguments to gym.make, using the env_config dictionary. For example:

problem = GymNE(
    env_name="LunarLanderContinuous-v2",
    env_config={
        'gravity': -1e-5,
    },
    network=CustomPolicy,
)

will effectively disable gravity in the "LunarLanderContinuous-v2" environment.

It should be noted that GymNE has its own function, to_policy, which you should generally use instead of parameterize_net. This function wraps parameterize_net, but additionally attaches any layers for observation normalization and action clipping, as specified by the problem and environment.
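For example (a minimal sketch; the all-zeros parameter vector below is purely illustrative, and in practice you would pass a solution vector produced by a search algorithm):

parameters = problem.make_zeros(problem.solution_length)

bare_network = problem.parameterize_net(parameters)  # the policy network alone
policy = problem.to_policy(parameters)  # the policy network, plus any normalization and clipping layers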

GymNE has a number of useful arguments that will help you to recreate experiments from neuroevolution literature:

Controlling the Number of Episodes

Firstly, there is the num_episodes argument, which allows you to evaluate individual networks repeatedly and have their episodic rewards averaged. This is particularly useful when studying noisy environments, and when using population-based evolutionary algorithms whose selection procedures and elitism mechanisms may be more sensitive to noise. For example, instantiating the problem

problem = GymNE(
    env_name="LunarLanderContinuous-v2",
    network=CustomPolicy,
    num_actors=4,
    num_episodes=5,
)

will specify that each solution should be evaluated \(5\) times and its episodic rewards averaged, rather than the default behaviour of evaluating the reward on a single episode.

Using Observation Normalization

In recent neuroevolution studies, observation normalization has been observed to be particularly helpful. Observation normalization tracks the expectation \(\mathbb{E}[o_i]\) and variance \(\mathbb{V}[o_i]\) of each observation variable \(o_i\) as observations are drawn from the environment. The observation passed to the policy is then the normalized value:

\[o_i' = \frac{o_i - \mathbb{E}[o_i]}{\sqrt{\mathbb{V}[o_i]}}\]
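To make this concrete, below is a minimal, self-contained sketch of a running normalizer based on Welford's online algorithm; it is purely illustrative and is not GymNE's internal implementation:

import torch

class RunningObsNormalizer:
    """Illustrative running observation normalizer (not GymNE's internal code)."""

    def __init__(self, obs_length: int, eps: float = 1e-8):
        self.count = 0
        self.mean = torch.zeros(obs_length)  # Running estimate of E[o_i]
        self.m2 = torch.zeros(obs_length)    # Running sum of squared deviations
        self.eps = eps                       # Small constant for numerical stability

    def update(self, obs: torch.Tensor) -> None:
        # Welford's online update of the mean and variance statistics
        self.count += 1
        delta = obs - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (obs - self.mean)

    def normalize(self, obs: torch.Tensor) -> torch.Tensor:
        # o_i' = (o_i - E[o_i]) / sqrt(V[o_i]), with eps added for stability
        var = self.m2 / max(self.count, 1)
        return (obs - self.mean) / torch.sqrt(var + self.eps)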

While in practice this means that the problem is non-stationary, as the expectation and variance of each variable are updated as new observations are drawn, the normalizing effect on the observations generally makes successful configuration of neuroevolution substantially easier. You can enable observation normalization using the boolean flag observation_normalization, e.g.

problem = GymNE(
    env_name="LunarLanderContinuous-v2",
    network=CustomPolicy,
    num_actors=4,
    observation_normalization=True,
)

And if you then evaluate a batch (to ensure the observation normalization statistics are initialized) and print the problem's policy:

problem.evaluate(problem.generate_batch(2))
print(problem.to_policy(problem.make_zeros(problem.solution_length)))
Output
Sequential(
    (0): ObsNormLayer()
    (1): CustomPolicy(
        (lin1): Linear(in_features=8, out_features=32, bias=True)
        (act): Tanh()
        (lin2): Linear(in_features=32, out_features=2, bias=True)
    )
    (2): ActClipLayer()
)

you will observe that the policy contains an ObsNormLayer which automatically applies observation normalization to the input to the policy, and an ActClipLayer which automatically clips the actions to the space of the environment.
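As a usage sketch (the rollout loop below is not part of GymNE; it simply reuses the classic 4-tuple gym step API from the beginning of this section, with an all-zeros parameter vector for illustration), such a policy can be fed raw observations directly, since normalization and clipping happen inside the module:

import gym
import torch

policy = problem.to_policy(problem.make_zeros(problem.solution_length))

env = gym.make("LunarLanderContinuous-v2")
observation = env.reset()
done = False
episodic_reward = 0.0
while not done:
    with torch.no_grad():
        # Raw observations go straight in; ObsNormLayer and ActClipLayer handle the rest
        action = policy(torch.as_tensor(observation, dtype=torch.float32)).numpy()
    observation, step_reward, done, info = env.step(action)
    episodic_reward += step_reward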

Modifying the Step Reward

A number of gym environments use an alive_bonus: a scalar value that is added to the step_reward at each step to encourage RL agents to survive for longer. In evolutionary RL, however, it has been observed that this alive_bonus is detrimental and creates unhelpful local optima. While you can of course disable particular rewards with the env_config argument when the environment supports it, we also provide direct support for decreasing the step_reward by a scalar amount.

For example, the "Humanoid-v4" environment has an alive_bonus value of 5. You can easily offset this using the decrease_rewards_by keyword argument:

problem = GymNE(
    env_name="Humanoid-v4",
    network=CustomPolicy,
    decrease_rewards_by=5.0,
)

which will cause each step to return 5.0 less reward.