Index
Toplevel package for evotorch.
algorithms
special
¶
This namespace contains the implementations of various evolutionary algorithms.
cmaes
¶
This namespace contains the CMAES class
CMAES (SearchAlgorithm, SinglePopulationAlgorithmMixin)
¶
This is a GPUaccelerated and vectorized implementation, based on pycma (version r3.2.2) and the below references.
References:
Nikolaus Hansen, Youhei Akimoto, and Petr Baudis.
CMAES/pycma on Github. Zenodo, DOI:10.5281/zenodo.2559634,
February 2019.
<https://github.com/CMAES/pycma>
Nikolaus Hansen, Andreas Ostermeier (2001).
Completely Derandomized SelfAdaptation in Evolution Strategies.
Nikolaus Hansen (2016).
The CMA Evolution Strategy: A Tutorial.
Source code in evotorch/algorithms/cmaes.py
class CMAES(SearchAlgorithm, SinglePopulationAlgorithmMixin):
"""
CMAES: Covariance Matrix Adaptation Evolution Strategy.
This is a GPUaccelerated and vectorized implementation, based on pycma (version r3.2.2)
and the below references.
References:
Nikolaus Hansen, Youhei Akimoto, and Petr Baudis.
CMAES/pycma on Github. Zenodo, DOI:10.5281/zenodo.2559634,
February 2019.
<https://github.com/CMAES/pycma>
Nikolaus Hansen, Andreas Ostermeier (2001).
Completely Derandomized SelfAdaptation in Evolution Strategies.
Nikolaus Hansen (2016).
The CMA Evolution Strategy: A Tutorial.
"""
def __init__(
self,
problem: Problem,
*,
stdev_init: Real,
popsize: Optional[int] = None,
center_init: Optional[Vector] = None,
c_m: Real = 1.0,
c_sigma: Optional[Real] = None,
c_sigma_ratio: Real = 1.0,
damp_sigma: Optional[Real] = None,
damp_sigma_ratio: Real = 1.0,
c_c: Optional[Real] = None,
c_c_ratio: Real = 1.0,
c_1: Optional[Real] = None,
c_1_ratio: Real = 1.0,
c_mu: Optional[Real] = None,
c_mu_ratio: Real = 1.0,
active: bool = True,
csa_squared: bool = False,
stdev_min: Optional[Real] = None,
stdev_max: Optional[Real] = None,
separable: bool = False,
limit_C_decomposition: bool = True,
obj_index: Optional[int] = None,
):
"""
`__init__(...)`: Initialize the CMAES solver.
Args:
problem (Problem): The problem object which is being worked on.
stdev_init (Real): Initial stepsize
popsize: Population size. Can be specified as an int,
or can be left as None in which case the CMAES rule of thumb is applied:
popsize = 4 + floor(3 log d) where d is the dimension
center_init: Initial center point of the search distribution.
Can be given as a Solution or as a 1D array.
If left as None, an initial center point is generated
with the help of the problem object's `generate_values(...)`
method.
c_m (Real): Learning rate for updating the mean
of the search distribution. By default the value is 1.
c_sigma (Optional[Real]): Learning rate for updating the step size. If None,
then the CMAES rules of thumb will be applied.
c_sigma_ratio (Real): Multiplier on the learning rate for the step size.
if c_sigma has been left as None, can be used to rescale the default c_sigma value.
damp_sigma (Optional[Real]): Damping factor for updating the step size. If None,
then the CMAES rules of thumb will be applied.
damp_sigma_ratio (Real): Multiplier on the damping factor for the step size.
if damp_sigma has been left as None, can be used to rescale the default damp_sigma value.
c_c (Optional[Real]): Learning rate for updating the rank1 evolution path.
If None, then the CMAES rules of thumb will be applied.
c_c_ratio (Real): Multiplier on the learning rate for the rank1 evolution path.
if c_c has been left as None, can be used to rescale the default c_c value.
c_1 (Optional[Real]): Learning rate for the rank1 update to the covariance matrix.
If None, then the CMAES rules of thumb will be applied.
c_1_ratio (Real): Multiplier on the learning rate for the rank1 update to the covariance matrix.
if c_1 has been left as None, can be used to rescale the default c_1 value.
c_mu (Optional[Real]): Learning rate for the rankmu update to the covariance matrix.
If None, then the CMAES rules of thumb will be applied.
c_mu_ratio (Real): Multiplier on the learning rate for the rankmu update to the covariance matrix.
if c_mu has been left as None, can be used to rescale the default c_mu value.
active (bool): Whether to use Active CMAES. Defaults to True, consistent with the tutorial paper and pycma.
csa_squared (bool): Whether to use the squared rule ("CSA_squared" in pycma) for the stepsize adapation.
This effectively corresponds to taking the natural gradient for the evolution path on the step size,
rather than the default CMAES rule of thumb.
stdev_min (Optional[Real]): Minimum allowed standard deviation of the search
distribution. Leaving this as None means that no such
boundary is to be used.
Can be given as None or as a scalar.
stdev_max (Optional[Real]): Maximum allowed standard deviation of the search
distribution. Leaving this as None means that no such
boundary is to be used.
Can be given as None or as a scalar.
separable (bool): Provide this as True if you would like the problem
to be treated as a separable one. Treating a problem
as separable means to adapt only the diagonal parts
of the covariance matrix and to keep the nondiagonal
parts 0. High dimensional problems result in large
covariance matrices on which operating is computationally
expensive. Therefore, for such high dimensional problems,
setting `separable` as True might be useful.
limit_C_decomposition (bool): Whether to limit the frequency of decomposition of the shape matrix C
Setting this to True (default) means that C will not be decomposed every generation
This degrades the quality of the sampling and updates, but provides a guarantee of O(d^2) time complexity.
This option can be used with separable=True (e.g. for experimental reasons) but the performance will only degrade
without timecomplexity benefits.
obj_index (Optional[int]): Objective index according to which evaluation
of the solution will be done.
"""
# Initialize the base class
SearchAlgorithm.__init__(self, problem, center=self._get_center, stepsize=self._get_sigma)
# Ensure that the problem is numeric
problem.ensure_numeric()
# Store the objective index
self._obj_index = problem.normalize_obj_index(obj_index)
# Track d = solution length for reference in initialization of hyperparameters
d = self._problem.solution_length
# === Initialize population ===
if not popsize:
# Default value used in CMAES literature 4 + floor(3 log n)
popsize = 4 + int(np.floor(3 * np.log(d)))
self.popsize = int(popsize)
# Half popsize, referred to as mu in CMAES literature
self.mu = int(np.floor(popsize / 2))
self._population = problem.generate_batch(popsize=popsize)
# === Initialize search distribution ===
self.separable = separable
# If `center_init` is not given, generate an initial solution
# with the help of the problem object.
# If it is given as a Solution, then clone the solution's values
# as a PyTorch tensor.
# Otherwise, use the given initial solution as the starting
# point in the search space.
if center_init is None:
center_init = self._problem.generate_values(1)
elif isinstance(center_init, Solution):
center_init = center_init.values.clone()
# Store the center
self.m = self._problem.make_tensor(center_init)
# Store the initial step size
self.sigma = self._problem.make_tensor(stdev_init)
if separable:
# Initialize C as the diagonal vector. Note that when separable, the eigendecomposition is not needed
self.C = self._problem.make_ones(d)
# In this case A is simply the square root of elements of C
self.A = self._problem.make_ones(d)
else:
# Initialize C = AA^T all diagonal.
self.C = self._problem.make_I(d)
self.A = self.C.clone()
# === Initialize raw weights ===
# Conditioned on popsize
# w_i = log((lambda + 1) / 2)  log(i) for i = 1 ... lambda
raw_weights = self.problem.make_tensor(np.log((popsize + 1) / 2)  torch.log(torch.arange(popsize) + 1))
# positive valued weights are the first mu
positive_weights = raw_weights[: self.mu]
negative_weights = raw_weights[self.mu :]
# Variance effective selection mass of positive weights
# Not affected by future updates to raw_weights
self.mu_eff = torch.sum(positive_weights).pow(2.0) / torch.sum(positive_weights.pow(2.0))
# === Initialize search parameters ===
# Conditioned on weights
# Store fixed information
self.c_m = c_m
self.active = active
self.csa_squared = csa_squared
self.stdev_min = stdev_min
self.stdev_max = stdev_max
# Learning rate for stepsize adaption
if c_sigma is None:
c_sigma = (self.mu_eff + 2.0) / (d + self.mu_eff + 3)
self.c_sigma = c_sigma_ratio * c_sigma
# Damping factor for stepsize adapation
if damp_sigma is None:
damp_sigma = 1 + 2 * max(0, torch.sqrt((self.mu_eff  1) / (d + 1))  1) + self.c_sigma
self.damp_sigma = damp_sigma_ratio * damp_sigma
# Learning rate for evolution path for rank1 update
if c_c is None:
# Branches on separability
if separable:
c_c = (1 + (1 / d) + (self.mu_eff / d)) / (d**0.5 + (1 / d) + 2 * (self.mu_eff / d))
else:
c_c = (4 + self.mu_eff / d) / (d + (4 + 2 * self.mu_eff / d))
self.c_c = c_c_ratio * c_c
# Learning rate for rank1 update to covariance matrix
if c_1 is None:
# Branches on separability
if separable:
c_1 = 1.0 / (d + 2.0 * np.sqrt(d) + self.mu_eff / d)
else:
c_1 = min(1, popsize / 6) * 2 / ((d + 1.3) ** 2.0 + self.mu_eff)
self.c_1 = c_1_ratio * c_1
# Learning rate for rankmu update to covariance matrix
if c_mu is None:
# Branches on separability
if separable:
c_mu = (0.25 + self.mu_eff + (1.0 / self.mu_eff)  2) / (d + 4 * np.sqrt(d) + (self.mu_eff / 2.0))
else:
c_mu = min(
1  self.c_1, 2 * ((0.25 + self.mu_eff  2 + (1 / self.mu_eff)) / ((d + 2) ** 2.0 + self.mu_eff))
)
self.c_mu = c_mu_ratio * c_mu
# The 'variance aware' coefficient used for the additive component of the evolution path for sigma
self.variance_discount_sigma = torch.sqrt(self.c_sigma * (2  self.c_sigma) * self.mu_eff)
# The 'variance aware' coefficient used for the additive component of the evolution path for rank1 updates
self.variance_discount_c = torch.sqrt(self.c_c * (2  self.c_c) * self.mu_eff)
# === Finalize weights ===
# Conditioned on search parameters and raw weights
# Positive weights always sum to 1
positive_weights = positive_weights / torch.sum(positive_weights)
if self.active:
# Active CMAES: negative weights sum to alpha
# Get the variance effective selection mass of negative weights
mu_eff_neg = torch.sum(negative_weights).pow(2.0) / torch.sum(negative_weights.pow(2.0))
# Alpha is the minimum of the following 3 terms
alpha_mu = 1 + self.c_1 / self.c_mu
alpha_mu_eff = 1 + 2 * mu_eff_neg / (self.mu_eff + 2)
alpha_pos_def = (1  self.c_mu  self.c_1) / (d * self.c_mu)
alpha = min([alpha_mu, alpha_mu_eff, alpha_pos_def])
# Rescale negative weights
negative_weights = alpha * negative_weights / torch.sum(torch.abs(negative_weights))
else:
# Negative weights are simply zero
negative_weights = torch.zeros_like(negative_weights)
# Concatenate final weights
self.weights = torch.cat([positive_weights, negative_weights], dim=1)
# === Some final setup ===
# Initialize the evolution paths
self.p_sigma = 0.0
self.p_c = 0.0
# Hansen's approximation to the expectation of x x ~ N(0, I_d).
# Note that we could use the exact formulation with Gamma functions, but we'll retain this form for consistency
self.unbiased_expectation = np.sqrt(d) * (1  (1 / (4 * d)) + 1 / (21 * d**2))
self.last_ex = None
# How often to decompose C
if limit_C_decomposition:
self.decompose_C_freq = max(1, int(np.floor(_safe_divide(1, 10 * d * (self.c_1.cpu() + self.c_mu.cpu())))))
else:
self.decompose_C_freq = 1
# Use the SinglePopulationAlgorithmMixin to enable additional status reports regarding the population.
SinglePopulationAlgorithmMixin.__init__(self)
@property
def population(self) > SolutionBatch:
"""Population generated by the CMAES algorithm"""
return self._population
def _get_center(self) > torch.Tensor:
"""Get the center of search distribution, m"""
return self.m
def _get_sigma(self) > float:
"""Get the stepsize of the search distribution, sigma"""
return float(self.sigma.cpu())
@property
def obj_index(self) > int:
"""Index of the objective being focused on"""
return self._obj_index
def sample_distribution(self, num_samples: Optional[int] = None) > Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
"""Sample the population. All 3 representations of solutions are returned for easy calculations of updates.
Note that the computation time of this operation of O(d^2 num_samples) unless separable, in which case O(d num_samples)
Args:
num_samples (Optional[int]): The number of samples to draw. If None, then the population size is used
Returns:
zs (torch.Tensor): A tensor of shape [num_samples, d] of samples from the local coordinate space e.g. z_i ~ N(0, I_d)
ys (torch.Tensor): A tensor of shape [num_samples, d] of samples from the shaped coordinate space e.g. y_i ~ N(0, C)
xs (torch.Tensor): A tensor of shape [num_samples, d] of samples from the search space e.g. x_i ~ N(m, sigma^2 C)
"""
if num_samples is None:
num_samples = self.popsize
# Generate z values
zs = self._problem.make_gaussian(num_solutions=num_samples)
# Construct ys = A zs
if self.separable:
# In the separable case A is diagonal so is represented as a single vector
ys = self.A.unsqueeze(0) * zs
else:
ys = (self.A @ zs.T).T
# Construct xs = m + sigma ys
xs = self.m.unsqueeze(0) + self.sigma * ys
return zs, ys, xs
def get_population_weights(self, xs: torch.Tensor) > torch.Tensor:
"""Get the assigned weights of the population (e.g. evaluate, rank and return)
Args:
xs (torch.Tensor): The population samples drawn from N(mu, sigma^2 C)
Returns:
assigned_weights (torch.Tensor): A [popsize, ] dimensional tensor of ordered weights
"""
# Computation is O(popsize * F_time) where F_time is the evalutation time per sample
# Fill the population
self._population.set_values(xs)
# Evaluate the current population
self.problem.evaluate(self._population)
# Sort the population
indices = self._population.argsort(obj_index=self.obj_index)
# Invert the sorting of the population to obtain the ranks
# Note that these ranks start at zero, but this is fine as we are just using them for indexing
ranks = torch.zeros_like(indices)
ranks[indices] = torch.arange(self.popsize, dtype=indices.dtype, device=indices.device)
# Get weights corresponding to each rank
assigned_weights = self.weights[ranks]
return assigned_weights
def update_m(self, zs: torch.Tensor, ys: torch.Tensor, assigned_weights: torch.Tensor) > torch.Tensor:
"""Update the center of the search distribution m
With zs and ys retained from sampling, this operation is O(popsize d), as it involves summing across popsize ddimensional vectors.
Args:
zs (torch.Tensor): A tensor of shape [popsize, d] of samples from the local coordinate space e.g. z_i ~ N(0, I_d)
ys (torch.Tensor): A tensor of shape [popsize, d] of samples from the shaped coordinate space e.g. y_i ~ N(0, C)
assigned_weights (torch.Tensor): A [popsize, ] dimensional tensor of ordered weights
Returns:
local_m_displacement (torch.Tensor): A tensor of shape [d], corresponding to the local transformation of m,
(1/sigma) (C^1/2) (m'  m) where m' is the updated m
shaped_m_displacement (torch.Tensor): A tensor of shape [d], corresponding to the shaped transformation of m,
(1/sigma) (m'  m) where m' is the updated m
"""
# Get the topmu weights
top_mu = torch.topk(assigned_weights, k=self.mu)
top_mu_weights = top_mu.values
top_mu_indices = top_mu.indices
# Compute the weighted recombination in local coordinate space
local_m_displacement = torch.sum(top_mu_weights.unsqueeze(1) * zs[top_mu_indices], dim=0)
# Compute the weighted recombination in shaped coordinate space
shaped_m_displacement = torch.sum(top_mu_weights.unsqueeze(1) * ys[top_mu_indices], dim=0)
# Update m
self.m = self.m + self.c_m * self.sigma * shaped_m_displacement
# Return the weighted recombinations
return local_m_displacement, shaped_m_displacement
def update_p_sigma(self, local_m_displacement: torch.Tensor) > None:
"""Update the evolution path for sigma, p_sigma
This operation is bounded O(d), as is simply the sum of vectors
Args:
local_m_displacement (torch.Tensor): The weighted recombination of local samples zs, corresponding to
(1/sigma) (C^1/2) (m'  m) where m' is the updated m
"""
self.p_sigma = (1  self.c_sigma) * self.p_sigma + self.variance_discount_sigma * local_m_displacement
def update_sigma(self) > None:
"""Update the step size sigma according to its evolution path p_sigma
This operation is bounded O(d), with the most expensive component being the norm of the evolution path, a ddimensional vector.
"""
d = self._problem.solution_length
# Compute the exponential update
if self.csa_squared:
# Exponential update based on natural gradient maximizing squared norm of p_sigma
exponential_update = (torch.norm(self.p_sigma).pow(2.0) / d  1) / 2
else:
# Exponential update increasing likelihood p_sigma having expected norm
exponential_update = torch.norm(self.p_sigma) / self.unbiased_expectation  1
# Rescale exponential update based on learning rate + damping factor
exponential_update = (self.c_sigma / self.damp_sigma) * exponential_update
# Multiplicative update to sigma
self.sigma = self.sigma * torch.exp(exponential_update)
def update_p_c(self, shaped_m_displacement: torch.Tensor, h_sig: torch.Tensor) > None:
"""Update the evolution path for rank1 update, p_c
This operation is bounded O(d), as is simply the sum of vectors
Args:
local_m_displacement (torch.Tensor): The weighted recombination of shaped samples ys, corresponding to
(1/sigma) (m'  m) where m' is the updated m
h_sig (torch.Tensor): Whether to stall the update based on the evolution path on sigma, p_sigma, expressed as a torch float
"""
self.p_c = (1  self.c_c) * self.p_c + h_sig * self.variance_discount_c * shaped_m_displacement
def update_C(self, zs: torch.Tensor, ys: torch.Tensor, assigned_weights: torch.Tensor, h_sig: torch.Tensor) > None:
"""Update the covariance shape matrix C based on rank1 and rankmu updates
This operation is bounded O(d^2 popsize), which is associated with computing the rankmu update (summing across popsize d*d matrices)
Args:
zs (torch.Tensor): A tensor of shape [popsize, d] of samples from the local coordinate space e.g. z_i ~ N(0, I_d)
ys (torch.Tensor): A tensor of shape [popsize, d] of samples from the shaped coordinate space e.g. y_i ~ N(0, C)
assigned_weights (torch.Tensor): A [popsize, ] dimensional tensor of ordered weights
h_sig (torch.Tensor): Whether to stall the update based on the evolution path on sigma, p_sigma, expressed as a torch float
"""
d = self._problem.solution_length
# If using Active CMAES, reweight negative weights
if self.active:
assigned_weights = torch.where(
assigned_weights > 0, assigned_weights, d * assigned_weights / torch.norm(zs, dim=1).pow(2.0)
)
c1a = self.c_1 * (1  (1  h_sig**2) * self.c_c * (2  self.c_c)) # adjust for variance loss
weighted_pc = (self.c_1 / (c1a + 1e23)) ** 0.5
if self.separable:
# Rank1 update
r1_update = c1a * (self.p_c.pow(2.0)  self.C)
# Rankmu update
rmu_update = self.c_mu * torch.sum(
assigned_weights.unsqueeze(1) * (ys.pow(2.0)  self.C.unsqueeze(0)), dim=0
)
else:
# Rank1 update
r1_update = c1a * (torch.outer(weighted_pc * self.p_c, weighted_pc * self.p_c)  self.C)
# Rankmu update
rmu_update = self.c_mu * (
torch.sum(assigned_weights.unsqueeze(1).unsqueeze(1) * (ys.unsqueeze(1) * ys.unsqueeze(2)), dim=0)
 torch.sum(self.weights) * self.C
)
# Update C
self.C = self.C + r1_update + rmu_update
def decompose_C(self) > None:
"""Perform the decomposition C = AA^T using a cholesky decomposition
Note that traditionally CMAES uses the eigendecomposition C = BDDB^1. In our case,
we keep track of zs, ys and xs when sampling, so we never need C^1/2.
Therefore, a cholesky decomposition is all that is necessary. This generally requires
O(d^3/3) operations, rather than the more costly O(d^3) operations associated with the eigendecomposition.
"""
if self.separable:
self.A = self.C.pow(0.5)
else:
self.A = torch.linalg.cholesky(self.C)
def _step(self):
"""Perform a step of the CMAES solver"""
# === Sampling, evaluation and ranking ===
# Sample the search distribution
zs, ys, xs = self.sample_distribution()
# Get the weights assigned to each solution
assigned_weights = self.get_population_weights(xs)
# === Center adaption ===
local_m_displacement, shaped_m_displacement = self.update_m(zs, ys, assigned_weights)
# === Step size adaption ===
# Update evolution path p_sigma
self.update_p_sigma(local_m_displacement)
# Update sigma
self.update_sigma()
# Compute h_sig, a boolean flag for stalling the update to p_c
h_sig = _h_sig(self.p_sigma, self.c_sigma, self._steps_count)
# === Unscaled covariance adapation ===
# Update evolution path p_c
self.update_p_c(shaped_m_displacement, h_sig)
# Update the covariance shape C
self.update_C(zs, ys, assigned_weights, h_sig)
# === Poststep corrections ===
# Limit elementwise standard deviation of sigma^2 C
if self.stdev_min is not None or self.stdev_max is not None:
self.C = _limit_stdev(self.sigma, self.C, self.stdev_min, self.stdev_max)
# Decompose C
if (self._steps_count + 1) % self.decompose_C_freq == 0:
self.decompose_C()
obj_index: int
property
readonly
¶
Index of the objective being focused on
population: SolutionBatch
property
readonly
¶
Population generated by the CMAES algorithm
__init__(self, problem, *, stdev_init, popsize=None, center_init=None, c_m=1.0, c_sigma=None, c_sigma_ratio=1.0, damp_sigma=None, damp_sigma_ratio=1.0, c_c=None, c_c_ratio=1.0, c_1=None, c_1_ratio=1.0, c_mu=None, c_mu_ratio=1.0, active=True, csa_squared=False, stdev_min=None, stdev_max=None, separable=False, limit_C_decomposition=True, obj_index=None)
special
¶
__init__(...)
: Initialize the CMAES solver.
Parameters:
Name  Type  Description  Default 

problem 
Problem 
The problem object which is being worked on. 
required 
stdev_init 
Real 
Initial stepsize 
required 
popsize 
Optional[int] 
Population size. Can be specified as an int, or can be left as None in which case the CMAES rule of thumb is applied: popsize = 4 + floor(3 log d) where d is the dimension 
None 
center_init 
Union[Iterable[float], torch.Tensor] 
Initial center point of the search distribution.
Can be given as a Solution or as a 1D array.
If left as None, an initial center point is generated
with the help of the problem object's 
None 
c_m 
Real 
Learning rate for updating the mean of the search distribution. By default the value is 1. 
1.0 
c_sigma 
Optional[Real] 
Learning rate for updating the step size. If None, then the CMAES rules of thumb will be applied. 
None 
c_sigma_ratio 
Real 
Multiplier on the learning rate for the step size. if c_sigma has been left as None, can be used to rescale the default c_sigma value. 
1.0 
damp_sigma 
Optional[Real] 
Damping factor for updating the step size. If None, then the CMAES rules of thumb will be applied. 
None 
damp_sigma_ratio 
Real 
Multiplier on the damping factor for the step size. if damp_sigma has been left as None, can be used to rescale the default damp_sigma value. 
1.0 
c_c 
Optional[Real] 
Learning rate for updating the rank1 evolution path. If None, then the CMAES rules of thumb will be applied. 
None 
c_c_ratio 
Real 
Multiplier on the learning rate for the rank1 evolution path. if c_c has been left as None, can be used to rescale the default c_c value. 
1.0 
c_1 
Optional[Real] 
Learning rate for the rank1 update to the covariance matrix. If None, then the CMAES rules of thumb will be applied. 
None 
c_1_ratio 
Real 
Multiplier on the learning rate for the rank1 update to the covariance matrix. if c_1 has been left as None, can be used to rescale the default c_1 value. 
1.0 
c_mu 
Optional[Real] 
Learning rate for the rankmu update to the covariance matrix. If None, then the CMAES rules of thumb will be applied. 
None 
c_mu_ratio 
Real 
Multiplier on the learning rate for the rankmu update to the covariance matrix. if c_mu has been left as None, can be used to rescale the default c_mu value. 
1.0 
active 
bool 
Whether to use Active CMAES. Defaults to True, consistent with the tutorial paper and pycma. 
True 
csa_squared 
bool 
Whether to use the squared rule ("CSA_squared" in pycma) for the stepsize adapation. This effectively corresponds to taking the natural gradient for the evolution path on the step size, rather than the default CMAES rule of thumb. 
False 
stdev_min 
Optional[Real] 
Minimum allowed standard deviation of the search distribution. Leaving this as None means that no such boundary is to be used. Can be given as None or as a scalar. 
None 
stdev_max 
Optional[Real] 
Maximum allowed standard deviation of the search distribution. Leaving this as None means that no such boundary is to be used. Can be given as None or as a scalar. 
None 
separable 
bool 
Provide this as True if you would like the problem
to be treated as a separable one. Treating a problem
as separable means to adapt only the diagonal parts
of the covariance matrix and to keep the nondiagonal
parts 0. High dimensional problems result in large
covariance matrices on which operating is computationally
expensive. Therefore, for such high dimensional problems,
setting 
False 
limit_C_decomposition 
bool 
Whether to limit the frequency of decomposition of the shape matrix C Setting this to True (default) means that C will not be decomposed every generation This degrades the quality of the sampling and updates, but provides a guarantee of O(d^2) time complexity. This option can be used with separable=True (e.g. for experimental reasons) but the performance will only degrade without timecomplexity benefits. 
True 
obj_index 
Optional[int] 
Objective index according to which evaluation of the solution will be done. 
None 
Source code in evotorch/algorithms/cmaes.py
def __init__(
self,
problem: Problem,
*,
stdev_init: Real,
popsize: Optional[int] = None,
center_init: Optional[Vector] = None,
c_m: Real = 1.0,
c_sigma: Optional[Real] = None,
c_sigma_ratio: Real = 1.0,
damp_sigma: Optional[Real] = None,
damp_sigma_ratio: Real = 1.0,
c_c: Optional[Real] = None,
c_c_ratio: Real = 1.0,
c_1: Optional[Real] = None,
c_1_ratio: Real = 1.0,
c_mu: Optional[Real] = None,
c_mu_ratio: Real = 1.0,
active: bool = True,
csa_squared: bool = False,
stdev_min: Optional[Real] = None,
stdev_max: Optional[Real] = None,
separable: bool = False,
limit_C_decomposition: bool = True,
obj_index: Optional[int] = None,
):
"""
`__init__(...)`: Initialize the CMAES solver.
Args:
problem (Problem): The problem object which is being worked on.
stdev_init (Real): Initial stepsize
popsize: Population size. Can be specified as an int,
or can be left as None in which case the CMAES rule of thumb is applied:
popsize = 4 + floor(3 log d) where d is the dimension
center_init: Initial center point of the search distribution.
Can be given as a Solution or as a 1D array.
If left as None, an initial center point is generated
with the help of the problem object's `generate_values(...)`
method.
c_m (Real): Learning rate for updating the mean
of the search distribution. By default the value is 1.
c_sigma (Optional[Real]): Learning rate for updating the step size. If None,
then the CMAES rules of thumb will be applied.
c_sigma_ratio (Real): Multiplier on the learning rate for the step size.
if c_sigma has been left as None, can be used to rescale the default c_sigma value.
damp_sigma (Optional[Real]): Damping factor for updating the step size. If None,
then the CMAES rules of thumb will be applied.
damp_sigma_ratio (Real): Multiplier on the damping factor for the step size.
if damp_sigma has been left as None, can be used to rescale the default damp_sigma value.
c_c (Optional[Real]): Learning rate for updating the rank1 evolution path.
If None, then the CMAES rules of thumb will be applied.
c_c_ratio (Real): Multiplier on the learning rate for the rank1 evolution path.
if c_c has been left as None, can be used to rescale the default c_c value.
c_1 (Optional[Real]): Learning rate for the rank1 update to the covariance matrix.
If None, then the CMAES rules of thumb will be applied.
c_1_ratio (Real): Multiplier on the learning rate for the rank1 update to the covariance matrix.
if c_1 has been left as None, can be used to rescale the default c_1 value.
c_mu (Optional[Real]): Learning rate for the rankmu update to the covariance matrix.
If None, then the CMAES rules of thumb will be applied.
c_mu_ratio (Real): Multiplier on the learning rate for the rankmu update to the covariance matrix.
if c_mu has been left as None, can be used to rescale the default c_mu value.
active (bool): Whether to use Active CMAES. Defaults to True, consistent with the tutorial paper and pycma.
csa_squared (bool): Whether to use the squared rule ("CSA_squared" in pycma) for the stepsize adapation.
This effectively corresponds to taking the natural gradient for the evolution path on the step size,
rather than the default CMAES rule of thumb.
stdev_min (Optional[Real]): Minimum allowed standard deviation of the search
distribution. Leaving this as None means that no such
boundary is to be used.
Can be given as None or as a scalar.
stdev_max (Optional[Real]): Maximum allowed standard deviation of the search
distribution. Leaving this as None means that no such
boundary is to be used.
Can be given as None or as a scalar.
separable (bool): Provide this as True if you would like the problem
to be treated as a separable one. Treating a problem
as separable means to adapt only the diagonal parts
of the covariance matrix and to keep the nondiagonal
parts 0. High dimensional problems result in large
covariance matrices on which operating is computationally
expensive. Therefore, for such high dimensional problems,
setting `separable` as True might be useful.
limit_C_decomposition (bool): Whether to limit the frequency of decomposition of the shape matrix C
Setting this to True (default) means that C will not be decomposed every generation
This degrades the quality of the sampling and updates, but provides a guarantee of O(d^2) time complexity.
This option can be used with separable=True (e.g. for experimental reasons) but the performance will only degrade
without timecomplexity benefits.
obj_index (Optional[int]): Objective index according to which evaluation
of the solution will be done.
"""
# Initialize the base class
SearchAlgorithm.__init__(self, problem, center=self._get_center, stepsize=self._get_sigma)
# Ensure that the problem is numeric
problem.ensure_numeric()
# Store the objective index
self._obj_index = problem.normalize_obj_index(obj_index)
# Track d = solution length for reference in initialization of hyperparameters
d = self._problem.solution_length
# === Initialize population ===
if not popsize:
# Default value used in CMAES literature 4 + floor(3 log n)
popsize = 4 + int(np.floor(3 * np.log(d)))
self.popsize = int(popsize)
# Half popsize, referred to as mu in CMAES literature
self.mu = int(np.floor(popsize / 2))
self._population = problem.generate_batch(popsize=popsize)
# === Initialize search distribution ===
self.separable = separable
# If `center_init` is not given, generate an initial solution
# with the help of the problem object.
# If it is given as a Solution, then clone the solution's values
# as a PyTorch tensor.
# Otherwise, use the given initial solution as the starting
# point in the search space.
if center_init is None:
center_init = self._problem.generate_values(1)
elif isinstance(center_init, Solution):
center_init = center_init.values.clone()
# Store the center
self.m = self._problem.make_tensor(center_init)
# Store the initial step size
self.sigma = self._problem.make_tensor(stdev_init)
if separable:
# Initialize C as the diagonal vector. Note that when separable, the eigendecomposition is not needed
self.C = self._problem.make_ones(d)
# In this case A is simply the square root of elements of C
self.A = self._problem.make_ones(d)
else:
# Initialize C = AA^T all diagonal.
self.C = self._problem.make_I(d)
self.A = self.C.clone()
# === Initialize raw weights ===
# Conditioned on popsize
# w_i = log((lambda + 1) / 2)  log(i) for i = 1 ... lambda
raw_weights = self.problem.make_tensor(np.log((popsize + 1) / 2)  torch.log(torch.arange(popsize) + 1))
# positive valued weights are the first mu
positive_weights = raw_weights[: self.mu]
negative_weights = raw_weights[self.mu :]
# Variance effective selection mass of positive weights
# Not affected by future updates to raw_weights
self.mu_eff = torch.sum(positive_weights).pow(2.0) / torch.sum(positive_weights.pow(2.0))
# === Initialize search parameters ===
# Conditioned on weights
# Store fixed information
self.c_m = c_m
self.active = active
self.csa_squared = csa_squared
self.stdev_min = stdev_min
self.stdev_max = stdev_max
# Learning rate for stepsize adaption
if c_sigma is None:
c_sigma = (self.mu_eff + 2.0) / (d + self.mu_eff + 3)
self.c_sigma = c_sigma_ratio * c_sigma
# Damping factor for stepsize adapation
if damp_sigma is None:
damp_sigma = 1 + 2 * max(0, torch.sqrt((self.mu_eff  1) / (d + 1))  1) + self.c_sigma
self.damp_sigma = damp_sigma_ratio * damp_sigma
# Learning rate for evolution path for rank1 update
if c_c is None:
# Branches on separability
if separable:
c_c = (1 + (1 / d) + (self.mu_eff / d)) / (d**0.5 + (1 / d) + 2 * (self.mu_eff / d))
else:
c_c = (4 + self.mu_eff / d) / (d + (4 + 2 * self.mu_eff / d))
self.c_c = c_c_ratio * c_c
# Learning rate for rank1 update to covariance matrix
if c_1 is None:
# Branches on separability
if separable:
c_1 = 1.0 / (d + 2.0 * np.sqrt(d) + self.mu_eff / d)
else:
c_1 = min(1, popsize / 6) * 2 / ((d + 1.3) ** 2.0 + self.mu_eff)
self.c_1 = c_1_ratio * c_1
# Learning rate for rankmu update to covariance matrix
if c_mu is None:
# Branches on separability
if separable:
c_mu = (0.25 + self.mu_eff + (1.0 / self.mu_eff)  2) / (d + 4 * np.sqrt(d) + (self.mu_eff / 2.0))
else:
c_mu = min(
1  self.c_1, 2 * ((0.25 + self.mu_eff  2 + (1 / self.mu_eff)) / ((d + 2) ** 2.0 + self.mu_eff))
)
self.c_mu = c_mu_ratio * c_mu
# The 'variance aware' coefficient used for the additive component of the evolution path for sigma
self.variance_discount_sigma = torch.sqrt(self.c_sigma * (2  self.c_sigma) * self.mu_eff)
# The 'variance aware' coefficient used for the additive component of the evolution path for rank1 updates
self.variance_discount_c = torch.sqrt(self.c_c * (2  self.c_c) * self.mu_eff)
# === Finalize weights ===
# Conditioned on search parameters and raw weights
# Positive weights always sum to 1
positive_weights = positive_weights / torch.sum(positive_weights)
if self.active:
# Active CMAES: negative weights sum to alpha
# Get the variance effective selection mass of negative weights
mu_eff_neg = torch.sum(negative_weights).pow(2.0) / torch.sum(negative_weights.pow(2.0))
# Alpha is the minimum of the following 3 terms
alpha_mu = 1 + self.c_1 / self.c_mu
alpha_mu_eff = 1 + 2 * mu_eff_neg / (self.mu_eff + 2)
alpha_pos_def = (1  self.c_mu  self.c_1) / (d * self.c_mu)
alpha = min([alpha_mu, alpha_mu_eff, alpha_pos_def])
# Rescale negative weights
negative_weights = alpha * negative_weights / torch.sum(torch.abs(negative_weights))
else:
# Negative weights are simply zero
negative_weights = torch.zeros_like(negative_weights)
# Concatenate final weights
self.weights = torch.cat([positive_weights, negative_weights], dim=1)
# === Some final setup ===
# Initialize the evolution paths
self.p_sigma = 0.0
self.p_c = 0.0
# Hansen's approximation to the expectation of x x ~ N(0, I_d).
# Note that we could use the exact formulation with Gamma functions, but we'll retain this form for consistency
self.unbiased_expectation = np.sqrt(d) * (1  (1 / (4 * d)) + 1 / (21 * d**2))
self.last_ex = None
# How often to decompose C
if limit_C_decomposition:
self.decompose_C_freq = max(1, int(np.floor(_safe_divide(1, 10 * d * (self.c_1.cpu() + self.c_mu.cpu())))))
else:
self.decompose_C_freq = 1
# Use the SinglePopulationAlgorithmMixin to enable additional status reports regarding the population.
SinglePopulationAlgorithmMixin.__init__(self)
decompose_C(self)
¶
Perform the decomposition C = AA^T using a cholesky decomposition Note that traditionally CMAES uses the eigendecomposition C = BDDB^1. In our case, we keep track of zs, ys and xs when sampling, so we never need C^½. Therefore, a cholesky decomposition is all that is necessary. This generally requires O(d^3/3) operations, rather than the more costly O(d^3) operations associated with the eigendecomposition.
Source code in evotorch/algorithms/cmaes.py
def decompose_C(self) > None:
"""Perform the decomposition C = AA^T using a cholesky decomposition
Note that traditionally CMAES uses the eigendecomposition C = BDDB^1. In our case,
we keep track of zs, ys and xs when sampling, so we never need C^1/2.
Therefore, a cholesky decomposition is all that is necessary. This generally requires
O(d^3/3) operations, rather than the more costly O(d^3) operations associated with the eigendecomposition.
"""
if self.separable:
self.A = self.C.pow(0.5)
else:
self.A = torch.linalg.cholesky(self.C)
get_population_weights(self, xs)
¶
Get the assigned weights of the population (e.g. evaluate, rank and return)
Parameters:
Name  Type  Description  Default 

xs 
torch.Tensor 
The population samples drawn from N(mu, sigma^2 C) 
required 
Returns:
Type  Description 

assigned_weights (torch.Tensor) 
A [popsize, ] dimensional tensor of ordered weights 
Source code in evotorch/algorithms/cmaes.py
def get_population_weights(self, xs: torch.Tensor) > torch.Tensor:
"""Get the assigned weights of the population (e.g. evaluate, rank and return)
Args:
xs (torch.Tensor): The population samples drawn from N(mu, sigma^2 C)
Returns:
assigned_weights (torch.Tensor): A [popsize, ] dimensional tensor of ordered weights
"""
# Computation is O(popsize * F_time) where F_time is the evalutation time per sample
# Fill the population
self._population.set_values(xs)
# Evaluate the current population
self.problem.evaluate(self._population)
# Sort the population
indices = self._population.argsort(obj_index=self.obj_index)
# Invert the sorting of the population to obtain the ranks
# Note that these ranks start at zero, but this is fine as we are just using them for indexing
ranks = torch.zeros_like(indices)
ranks[indices] = torch.arange(self.popsize, dtype=indices.dtype, device=indices.device)
# Get weights corresponding to each rank
assigned_weights = self.weights[ranks]
return assigned_weights
sample_distribution(self, num_samples=None)
¶
Sample the population. All 3 representations of solutions are returned for easy calculations of updates. Note that the computation time of this operation of O(d^2 num_samples) unless separable, in which case O(d num_samples)
Parameters:
Name  Type  Description  Default 

num_samples 
Optional[int] 
The number of samples to draw. If None, then the population size is used 
None 
Returns:
Type  Description 

zs (torch.Tensor) 
A tensor of shape [num_samples, d] of samples from the local coordinate space e.g. z_i ~ N(0, I_d) ys (torch.Tensor): A tensor of shape [num_samples, d] of samples from the shaped coordinate space e.g. y_i ~ N(0, C) xs (torch.Tensor): A tensor of shape [num_samples, d] of samples from the search space e.g. x_i ~ N(m, sigma^2 C) 
Source code in evotorch/algorithms/cmaes.py
def sample_distribution(self, num_samples: Optional[int] = None) > Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
"""Sample the population. All 3 representations of solutions are returned for easy calculations of updates.
Note that the computation time of this operation of O(d^2 num_samples) unless separable, in which case O(d num_samples)
Args:
num_samples (Optional[int]): The number of samples to draw. If None, then the population size is used
Returns:
zs (torch.Tensor): A tensor of shape [num_samples, d] of samples from the local coordinate space e.g. z_i ~ N(0, I_d)
ys (torch.Tensor): A tensor of shape [num_samples, d] of samples from the shaped coordinate space e.g. y_i ~ N(0, C)
xs (torch.Tensor): A tensor of shape [num_samples, d] of samples from the search space e.g. x_i ~ N(m, sigma^2 C)
"""
if num_samples is None:
num_samples = self.popsize
# Generate z values
zs = self._problem.make_gaussian(num_solutions=num_samples)
# Construct ys = A zs
if self.separable:
# In the separable case A is diagonal so is represented as a single vector
ys = self.A.unsqueeze(0) * zs
else:
ys = (self.A @ zs.T).T
# Construct xs = m + sigma ys
xs = self.m.unsqueeze(0) + self.sigma * ys
return zs, ys, xs
update_C(self, zs, ys, assigned_weights, h_sig)
¶
Update the covariance shape matrix C based on rank1 and rankmu updates This operation is bounded O(d^2 popsize), which is associated with computing the rankmu update (summing across popsize d*d matrices)
Parameters:
Name  Type  Description  Default 

zs 
torch.Tensor 
A tensor of shape [popsize, d] of samples from the local coordinate space e.g. z_i ~ N(0, I_d) 
required 
ys 
torch.Tensor 
A tensor of shape [popsize, d] of samples from the shaped coordinate space e.g. y_i ~ N(0, C) 
required 
assigned_weights 
torch.Tensor 
A [popsize, ] dimensional tensor of ordered weights 
required 
h_sig 
torch.Tensor 
Whether to stall the update based on the evolution path on sigma, p_sigma, expressed as a torch float 
required 
Source code in evotorch/algorithms/cmaes.py
def update_C(self, zs: torch.Tensor, ys: torch.Tensor, assigned_weights: torch.Tensor, h_sig: torch.Tensor) > None:
"""Update the covariance shape matrix C based on rank1 and rankmu updates
This operation is bounded O(d^2 popsize), which is associated with computing the rankmu update (summing across popsize d*d matrices)
Args:
zs (torch.Tensor): A tensor of shape [popsize, d] of samples from the local coordinate space e.g. z_i ~ N(0, I_d)
ys (torch.Tensor): A tensor of shape [popsize, d] of samples from the shaped coordinate space e.g. y_i ~ N(0, C)
assigned_weights (torch.Tensor): A [popsize, ] dimensional tensor of ordered weights
h_sig (torch.Tensor): Whether to stall the update based on the evolution path on sigma, p_sigma, expressed as a torch float
"""
d = self._problem.solution_length
# If using Active CMAES, reweight negative weights
if self.active:
assigned_weights = torch.where(
assigned_weights > 0, assigned_weights, d * assigned_weights / torch.norm(zs, dim=1).pow(2.0)
)
c1a = self.c_1 * (1  (1  h_sig**2) * self.c_c * (2  self.c_c)) # adjust for variance loss
weighted_pc = (self.c_1 / (c1a + 1e23)) ** 0.5
if self.separable:
# Rank1 update
r1_update = c1a * (self.p_c.pow(2.0)  self.C)
# Rankmu update
rmu_update = self.c_mu * torch.sum(
assigned_weights.unsqueeze(1) * (ys.pow(2.0)  self.C.unsqueeze(0)), dim=0
)
else:
# Rank1 update
r1_update = c1a * (torch.outer(weighted_pc * self.p_c, weighted_pc * self.p_c)  self.C)
# Rankmu update
rmu_update = self.c_mu * (
torch.sum(assigned_weights.unsqueeze(1).unsqueeze(1) * (ys.unsqueeze(1) * ys.unsqueeze(2)), dim=0)
 torch.sum(self.weights) * self.C
)
# Update C
self.C = self.C + r1_update + rmu_update
update_m(self, zs, ys, assigned_weights)
¶
Update the center of the search distribution m With zs and ys retained from sampling, this operation is O(popsize d), as it involves summing across popsize ddimensional vectors.
Parameters:
Name  Type  Description  Default 

zs 
torch.Tensor 
A tensor of shape [popsize, d] of samples from the local coordinate space e.g. z_i ~ N(0, I_d) 
required 
ys 
torch.Tensor 
A tensor of shape [popsize, d] of samples from the shaped coordinate space e.g. y_i ~ N(0, C) 
required 
assigned_weights 
torch.Tensor 
A [popsize, ] dimensional tensor of ordered weights 
required 
Returns:
Type  Description 

local_m_displacement (torch.Tensor) 
A tensor of shape [d], corresponding to the local transformation of m, (1/sigma) (C^½) (m'  m) where m' is the updated m shaped_m_displacement (torch.Tensor): A tensor of shape [d], corresponding to the shaped transformation of m, (1/sigma) (m'  m) where m' is the updated m 
Source code in evotorch/algorithms/cmaes.py
def update_m(self, zs: torch.Tensor, ys: torch.Tensor, assigned_weights: torch.Tensor) > torch.Tensor:
"""Update the center of the search distribution m
With zs and ys retained from sampling, this operation is O(popsize d), as it involves summing across popsize ddimensional vectors.
Args:
zs (torch.Tensor): A tensor of shape [popsize, d] of samples from the local coordinate space e.g. z_i ~ N(0, I_d)
ys (torch.Tensor): A tensor of shape [popsize, d] of samples from the shaped coordinate space e.g. y_i ~ N(0, C)
assigned_weights (torch.Tensor): A [popsize, ] dimensional tensor of ordered weights
Returns:
local_m_displacement (torch.Tensor): A tensor of shape [d], corresponding to the local transformation of m,
(1/sigma) (C^1/2) (m'  m) where m' is the updated m
shaped_m_displacement (torch.Tensor): A tensor of shape [d], corresponding to the shaped transformation of m,
(1/sigma) (m'  m) where m' is the updated m
"""
# Get the topmu weights
top_mu = torch.topk(assigned_weights, k=self.mu)
top_mu_weights = top_mu.values
top_mu_indices = top_mu.indices
# Compute the weighted recombination in local coordinate space
local_m_displacement = torch.sum(top_mu_weights.unsqueeze(1) * zs[top_mu_indices], dim=0)
# Compute the weighted recombination in shaped coordinate space
shaped_m_displacement = torch.sum(top_mu_weights.unsqueeze(1) * ys[top_mu_indices], dim=0)
# Update m
self.m = self.m + self.c_m * self.sigma * shaped_m_displacement
# Return the weighted recombinations
return local_m_displacement, shaped_m_displacement
update_p_c(self, shaped_m_displacement, h_sig)
¶
Update the evolution path for rank1 update, p_c This operation is bounded O(d), as is simply the sum of vectors
Parameters:
Name  Type  Description  Default 

local_m_displacement 
torch.Tensor 
The weighted recombination of shaped samples ys, corresponding to (1/sigma) (m'  m) where m' is the updated m 
required 
h_sig 
torch.Tensor 
Whether to stall the update based on the evolution path on sigma, p_sigma, expressed as a torch float 
required 
Source code in evotorch/algorithms/cmaes.py
def update_p_c(self, shaped_m_displacement: torch.Tensor, h_sig: torch.Tensor) > None:
"""Update the evolution path for rank1 update, p_c
This operation is bounded O(d), as is simply the sum of vectors
Args:
local_m_displacement (torch.Tensor): The weighted recombination of shaped samples ys, corresponding to
(1/sigma) (m'  m) where m' is the updated m
h_sig (torch.Tensor): Whether to stall the update based on the evolution path on sigma, p_sigma, expressed as a torch float
"""
self.p_c = (1  self.c_c) * self.p_c + h_sig * self.variance_discount_c * shaped_m_displacement
update_p_sigma(self, local_m_displacement)
¶
Update the evolution path for sigma, p_sigma This operation is bounded O(d), as is simply the sum of vectors
Parameters:
Name  Type  Description  Default 

local_m_displacement 
torch.Tensor 
The weighted recombination of local samples zs, corresponding to (1/sigma) (C^½) (m'  m) where m' is the updated m 
required 
Source code in evotorch/algorithms/cmaes.py
def update_p_sigma(self, local_m_displacement: torch.Tensor) > None:
"""Update the evolution path for sigma, p_sigma
This operation is bounded O(d), as is simply the sum of vectors
Args:
local_m_displacement (torch.Tensor): The weighted recombination of local samples zs, corresponding to
(1/sigma) (C^1/2) (m'  m) where m' is the updated m
"""
self.p_sigma = (1  self.c_sigma) * self.p_sigma + self.variance_discount_sigma * local_m_displacement
update_sigma(self)
¶
Update the step size sigma according to its evolution path p_sigma This operation is bounded O(d), with the most expensive component being the norm of the evolution path, a ddimensional vector.
Source code in evotorch/algorithms/cmaes.py
def update_sigma(self) > None:
"""Update the step size sigma according to its evolution path p_sigma
This operation is bounded O(d), with the most expensive component being the norm of the evolution path, a ddimensional vector.
"""
d = self._problem.solution_length
# Compute the exponential update
if self.csa_squared:
# Exponential update based on natural gradient maximizing squared norm of p_sigma
exponential_update = (torch.norm(self.p_sigma).pow(2.0) / d  1) / 2
else:
# Exponential update increasing likelihood p_sigma having expected norm
exponential_update = torch.norm(self.p_sigma) / self.unbiased_expectation  1
# Rescale exponential update based on learning rate + damping factor
exponential_update = (self.c_sigma / self.damp_sigma) * exponential_update
# Multiplicative update to sigma
self.sigma = self.sigma * torch.exp(exponential_update)
distributed
special
¶
gaussian
¶
CEM (GaussianSearchAlgorithm)
¶
The crossentropy method (CEM) (Rubinstein, 1999).
This CEM implementation is focused on continuous optimization, and follows the variant explained in Duan et al. (2016).
The adaptive population size mechanism explained in Toklu et al. (2020)
(and previously used in the accompanying source code of the study
Salimans et al. (2017)) is supported, where the population size in an
iteration keeps increasing until a certain numberof interactions with
the simulator of the reinforcement learning environment is made.
See the initialization arguments num_interactions
, popsize_max
.
References:
Rubinstein, R. (1999). The crossentropy method for combinatorial
and continuous optimization.
Methodology and computing in applied probability, 1(2), 127190.
Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P. (2016).
Benchmarking deep reinforcement learning for continuous control.
International conference on machine learning. PMLR, 2016.
Salimans, T., Ho, J., Chen, X., Sidor, S. and Sutskever, I. (2017).
Evolution Strategies as a Scalable Alternative to
Reinforcement Learning.
Toklu, N.E., Liskowski, P., Srivastava, R.K. (2020).
ClipUp: A Simple and Powerful Optimizer
for Distributionbased Policy Evolution.
Parallel Problem Solving from Nature (PPSN 2020).
Source code in evotorch/algorithms/distributed/gaussian.py
class CEM(GaussianSearchAlgorithm):
"""
The crossentropy method (CEM) (Rubinstein, 1999).
This CEM implementation is focused on continuous optimization,
and follows the variant explained in Duan et al. (2016).
The adaptive population size mechanism explained in Toklu et al. (2020)
(and previously used in the accompanying source code of the study
Salimans et al. (2017)) is supported, where the population size in an
iteration keeps increasing until a certain numberof interactions with
the simulator of the reinforcement learning environment is made.
See the initialization arguments `num_interactions`, `popsize_max`.
References:
Rubinstein, R. (1999). The crossentropy method for combinatorial
and continuous optimization.
Methodology and computing in applied probability, 1(2), 127190.
Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P. (2016).
Benchmarking deep reinforcement learning for continuous control.
International conference on machine learning. PMLR, 2016.
Salimans, T., Ho, J., Chen, X., Sidor, S. and Sutskever, I. (2017).
Evolution Strategies as a Scalable Alternative to
Reinforcement Learning.
Toklu, N.E., Liskowski, P., Srivastava, R.K. (2020).
ClipUp: A Simple and Powerful Optimizer
for Distributionbased Policy Evolution.
Parallel Problem Solving from Nature (PPSN 2020).
"""
DISTRIBUTION_TYPE = SeparableGaussian
DISTRIBUTION_PARAMS = NotImplemented # To be filled by the CEM instance
def __init__(
self,
problem: Problem,
*,
popsize: int,
parenthood_ratio: float,
stdev_init: Optional[RealOrVector] = None,
radius_init: Optional[RealOrVector] = None,
num_interactions: Optional[int] = None,
popsize_max: Optional[int] = None,
center_init: Optional[RealOrVector] = None,
stdev_min: Optional[RealOrVector] = None,
stdev_max: Optional[RealOrVector] = None,
stdev_max_change: Optional[Union[float, RealOrVector]] = None,
obj_index: Optional[int] = None,
distributed: bool = False,
popsize_weighted_grad_avg: Optional[bool] = None,
):
"""
`__init__(...)`: Initialize the search algorithm.
Args:
problem: The problem object to work on.
popsize: The population size.
parenthood_ratio: Expected as a float larger than 0 and smaller
than 1. For example, setting this value to 0.1 means that
the top 10% of the population will be declared as the parents,
and those parents will be used for updating the population.
The amount of parents is always computed according to the
specified `popsize`, not according to the adapted population
size, and not according to `popsize_max`.
stdev_init: The initial standard deviation of the search
distribution, expressed as a scalar or as an array.
Determines the initial coverage area of the search
distribution.
If one wishes to configure the coverage area via the
argument `radius_init` instead, then `stdev_init` is expected
as None.
radius_init: The initial radius of the search distribution,
expressed as a scalar.
Determines the initial coverage area of the search
distribution.
Here, "radius" is defined as the norm of the search
distribution.
If one wishes to configure the coverage area via the
argument `stdev_init` instead, then `radius_init` is expected
as None.
num_interactions: When given as an integer n,
it is ensured that a population has interacted with
the GymProblem's environment n times. If this target
has not been reached yet, then the population is declared
too small, and gets extended with more samples,
until n amount of interactions is reached.
When given as None, popsize is the only configuration
affecting the size of a population.
popsize_max: Having `num_interactions` set as an integer
might cause the effective population size jump to
unnecesarily large numbers. To prevent this,
one can set `popsize_max` to specify an upper
bound for the effective population size.
center_init: The initial center solution.
Can be left as None.
stdev_min: The minimum value for the standard deviation
values of the Gaussian search distribution.
Can be left as None (which is the default),
or can be given as a scalar or as a 1dimensional array.
stdev_max: The maximum value for the standard deviation
values of the Gaussian search distribution.
Can be left as None (which is the default),
or can be given as a scalar or as a 1dimensional array.
stdev_max_change: The maximum update ratio allowed on the
standard deviation. Expected as None if no such limiter
is needed, or as a real number within 0.0 and 1.0 otherwise.
In the PGPE implementation of Ha (2017, 2018), a value of
0.2 (20%) was used.
For this CEM implementation, the default is None.
obj_index: Index of the objective according to which the
gradient estimations will be done.
For singleobjective problems, this can be left as None.
distributed: Whether or not the gradient computation will
be distributed. If `distributed` is given as False and
the problem is not parallelized, then everything will
be centralized (i.e. the entire computation will happen
in the main process).
If `distributed` is given as False, and the problem
is parallelized, then the population will be created
in the main process and then sent to remote workers
for parallelized evaluation, and then the remote fitnesses
will be collected by the main process again for computing
the search gradients.
If `distributed` is given as True, and the problem
is parallelized, then the search algorithm itself will
be distributed, in the sense that each remote actor will
generate its own population (such that the total population
size across all these actors becomes equal to `popsize`)
and will compute its own gradient, and then the main process
will collect these gradients, compute the averaged gradients
and update the main search distribution.
Nondistributed mode has the advantage of keeping the
population in the main process, which is good when one wishes
to do detailed monitoring during the evolutionary process,
but has the disadvantage of having to pass the solutions to
the remote actors and having to collect fitnesses, which
might result in increased interprocess communication traffic.
On the other hand, while it is not possible to monitor the
population in distributed mode, the distributed mode has the
advantage of significantly reducing the interprocess
communication traffic, since the only things communicated
with the remote actors are the search distributions (not the
solutions) and the gradients.
popsize_weighted_grad_avg: Only to be used in distributed mode.
(where being in distributed mode means `distributed` is given
as True). In distributed mode, each actor remotely samples
its own solution batches and computes its own gradients.
These gradients are then collected, and a final average
gradient is computed.
If `popsize_weighted_grad_avg` is True, then, while averaging
over the gradients, each gradient will have its own weight
that is computed according to how many solutions were sampled
by the actor that produced the gradient.
If `popsize_weighted_grad_avg` is False, then, there will not
be weighted averaging (or, each gradient will have equal
weight).
If `popsize_weighted_grad_avg` is None, then, the gradient
weights will be equal a value for `num_interactions` is given
(because `num_interactions` affects the number of solutions
according to the episode lengths, and popsizeweighting the
gradients could be misleading); and the gradient weights will
be weighted according to the subpopulation (i.e. subbatch)
sizes if `num_interactions` is left as None.
The default value for `popsize_weighted_grad_avg` is None.
When the distributed mode is disabled (i.e. when `distributed`
is False), then the argument `popsize_weighted_grad_avg` is
expected as None.
"""
self.DISTRIBUTION_PARAMS = {"parenthood_ratio": float(parenthood_ratio)}
super().__init__(
problem,
popsize=popsize,
center_learning_rate=1.0,
stdev_learning_rate=1.0,
stdev_init=stdev_init,
radius_init=radius_init,
popsize_max=popsize_max,
num_interactions=num_interactions,
optimizer=None,
optimizer_config=None,
ranking_method=None,
center_init=center_init,
stdev_min=stdev_min,
stdev_max=stdev_max,
stdev_max_change=stdev_max_change,
obj_index=obj_index,
distributed=distributed,
popsize_weighted_grad_avg=popsize_weighted_grad_avg,
)
DISTRIBUTION_TYPE (Distribution)
¶
Separable Multivariate Gaussian, as used by PGPE
Source code in evotorch/algorithms/distributed/gaussian.py
class SeparableGaussian(Distribution):
"""Separable Multivariate Gaussian, as used by PGPE"""
MANDATORY_PARAMETERS = {"mu", "sigma"}
OPTIONAL_PARAMETERS = {"divide_mu_grad_by", "divide_sigma_grad_by", "parenthood_ratio"}
def __init__(
self,
parameters: dict,
*,
solution_length: Optional[int] = None,
device: Optional[Device] = None,
dtype: Optional[DType] = None,
):
[mu_length] = parameters["mu"].shape
[sigma_length] = parameters["sigma"].shape
if solution_length is None:
solution_length = mu_length
else:
if solution_length != mu_length:
raise ValueError(
f"The argument `solution_length` does not match the length of `mu` provided in `parameters`."
f" solution_length={solution_length},"
f' parameters["mu"]={mu_length}.'
)
if mu_length != sigma_length:
raise ValueError(
f"The tensors `mu` and `sigma` provided within `parameters` have mismatching lengths."
f' parameters["mu"]={mu_length},'
f' parameters["sigma"]={sigma_length}.'
)
super().__init__(
solution_length=solution_length,
parameters=parameters,
device=device,
dtype=dtype,
)
@property
def mu(self) > torch.Tensor:
return self.parameters["mu"]
@mu.setter
def mu(self, new_mu: Iterable):
self.parameters["mu"] = torch.as_tensor(new_mu, dtype=self.dtype, device=self.device)
@property
def sigma(self) > torch.Tensor:
return self.parameters["sigma"]
@sigma.setter
def sigma(self, new_sigma: Iterable):
self.parameters["sigma"] = torch.as_tensor(new_sigma, dtype=self.dtype, device=self.device)
def _fill(self, out: torch.Tensor, *, generator: Optional[torch.Generator] = None):
self.make_gaussian(out=out, center=self.mu, stdev=self.sigma, generator=generator)
def _divide_grad(self, param_name: str, grad: torch.Tensor, weights: torch.Tensor) > torch.Tensor:
option = f"divide_{param_name}_grad_by"
if option in self.parameters:
div_by_what = self.parameters[option]
if div_by_what == "num_solutions":
[num_solutions] = weights.shape
grad = grad / num_solutions
elif div_by_what == "num_directions":
[num_solutions] = weights.shape
num_directions = num_solutions // 2
grad = grad / num_directions
elif div_by_what == "total_weight":
total_weight = torch.sum(torch.abs(weights))
grad = grad / total_weight
elif div_by_what == "weight_stdev":
weight_stdev = torch.std(weights)
grad = grad / weight_stdev
else:
raise ValueError(f"The parameter {option} has an unrecognized value: {div_by_what}")
return grad
def _compute_gradients_via_parenthood_ratio(self, samples: torch.Tensor, weights: torch.Tensor) > dict:
[num_samples, _] = samples.shape
num_elites = math.floor(num_samples * self.parameters["parenthood_ratio"])
elite_indices = weights.argsort(descending=True)[:num_elites]
elites = samples[elite_indices, :]
return {
"mu": torch.mean(elites, dim=0)  self.parameters["mu"],
"sigma": torch.std(elites, dim=0)  self.parameters["sigma"],
}
def _compute_gradients(self, samples: torch.Tensor, weights: torch.Tensor, ranking_used: Optional[str]) > dict:
if "parenthood_ratio" in self.parameters:
return self._compute_gradients_via_parenthood_ratio(samples, weights)
else:
mu = self.mu
sigma = self.sigma
# Compute the scaled noises, that is, the noise vectors which
# were used for generating the solutions
# (solution = scaled_noise + center)
scaled_noises = samples  mu
# Make sure that the weights (utilities) are 0centered
# (Otherwise the formulations would have to consider a bias term)
if ranking_used not in ("centered", "normalized"):
weights = weights  torch.mean(weights)
mu_grad = self._divide_grad(
"mu",
total(dot(weights, scaled_noises)),
weights,
)
sigma_grad = self._divide_grad(
"sigma",
total(dot(weights, ((scaled_noises**2)  (sigma**2)) / sigma)),
weights,
)
return {
"mu": mu_grad,
"sigma": sigma_grad,
}
def update_parameters(
self,
gradients: dict,
*,
learning_rates: Optional[dict] = None,
optimizers: Optional[dict] = None,
) > "SeparableGaussian":
mu_grad = gradients["mu"]
sigma_grad = gradients["sigma"]
new_mu = self.mu + self._follow_gradient("mu", mu_grad, learning_rates=learning_rates, optimizers=optimizers)
new_sigma = self.sigma + self._follow_gradient(
"sigma", sigma_grad, learning_rates=learning_rates, optimizers=optimizers
)
return self.modified_copy(mu=new_mu, sigma=new_sigma)
def relative_entropy(dist_0: "SeparableGaussian", dist_1: "SeparableGaussian") > float:
mu_0 = dist_0.parameters["mu"]
mu_1 = dist_1.parameters["mu"]
sigma_0 = dist_0.parameters["sigma"]
sigma_1 = dist_1.parameters["sigma"]
cov_0 = sigma_0.pow(2.0)
cov_1 = sigma_1.pow(2.0)
mu_delta = mu_1  mu_0
trace_cov = torch.sum(cov_0 / cov_1)
k = dist_0.solution_length
scaled_mu = torch.sum(mu_delta.pow(2.0) / cov_1)
log_det = torch.sum(torch.log(cov_1))  torch.sum(torch.log(cov_0))
return 0.5 * (trace_cov  k + scaled_mu + log_det)
update_parameters(self, gradients, *, learning_rates=None, optimizers=None)
¶Do an update on the distribution by following the given gradients.
It is expected that the inheriting class has its own implementation for this method.
Parameters:
Name  Type  Description  Default 

gradients 
dict 
Gradients, as a dictionary, which will be used for computing the necessary updates. 
required 
learning_rates 
Optional[dict] 
A dictionary which contains learning rates for parameters that will be updated using a learning rate coefficient. 
None 
optimizers 
Optional[dict] 
A dictionary which contains optimizer objects for parameters that will be updated using an adaptive optimizer. 
None 
Returns:
Type  Description 

SeparableGaussian 
The updated copy of the distribution. 
Source code in evotorch/algorithms/distributed/gaussian.py
def update_parameters(
self,
gradients: dict,
*,
learning_rates: Optional[dict] = None,
optimizers: Optional[dict] = None,
) > "SeparableGaussian":
mu_grad = gradients["mu"]
sigma_grad = gradients["sigma"]
new_mu = self.mu + self._follow_gradient("mu", mu_grad, learning_rates=learning_rates, optimizers=optimizers)
new_sigma = self.sigma + self._follow_gradient(
"sigma", sigma_grad, learning_rates=learning_rates, optimizers=optimizers
)
return self.modified_copy(mu=new_mu, sigma=new_sigma)
__init__(self, problem, *, popsize, parenthood_ratio, stdev_init=None, radius_init=None, num_interactions=None, popsize_max=None, center_init=None, stdev_min=None, stdev_max=None, stdev_max_change=None, obj_index=None, distributed=False, popsize_weighted_grad_avg=None)
special
¶
__init__(...)
: Initialize the search algorithm.
Parameters:
Name  Type  Description  Default 

problem 
Problem 
The problem object to work on. 
required 
popsize 
int 
The population size. 
required 
parenthood_ratio 
float 
Expected as a float larger than 0 and smaller
than 1. For example, setting this value to 0.1 means that
the top 10% of the population will be declared as the parents,
and those parents will be used for updating the population.
The amount of parents is always computed according to the
specified 
required 
stdev_init 
Union[float, Iterable[float], torch.Tensor] 
The initial standard deviation of the search
distribution, expressed as a scalar or as an array.
Determines the initial coverage area of the search
distribution.
If one wishes to configure the coverage area via the
argument 
None 
radius_init 
Union[float, Iterable[float], torch.Tensor] 
The initial radius of the search distribution,
expressed as a scalar.
Determines the initial coverage area of the search
distribution.
Here, "radius" is defined as the norm of the search
distribution.
If one wishes to configure the coverage area via the
argument 
None 
num_interactions 
Optional[int] 
When given as an integer n, it is ensured that a population has interacted with the GymProblem's environment n times. If this target has not been reached yet, then the population is declared too small, and gets extended with more samples, until n amount of interactions is reached. When given as None, popsize is the only configuration affecting the size of a population. 
None 
popsize_max 
Optional[int] 
Having 
None 
center_init 
Union[float, Iterable[float], torch.Tensor] 
The initial center solution. Can be left as None. 
None 
stdev_min 
Union[float, Iterable[float], torch.Tensor] 
The minimum value for the standard deviation values of the Gaussian search distribution. Can be left as None (which is the default), or can be given as a scalar or as a 1dimensional array. 
None 
stdev_max 
Union[float, Iterable[float], torch.Tensor] 
The maximum value for the standard deviation values of the Gaussian search distribution. Can be left as None (which is the default), or can be given as a scalar or as a 1dimensional array. 
None 
stdev_max_change 
Union[float, Iterable[float], torch.Tensor] 
The maximum update ratio allowed on the standard deviation. Expected as None if no such limiter is needed, or as a real number within 0.0 and 1.0 otherwise. In the PGPE implementation of Ha (2017, 2018), a value of 0.2 (20%) was used. For this CEM implementation, the default is None. 
None 
obj_index 
Optional[int] 
Index of the objective according to which the gradient estimations will be done. For singleobjective problems, this can be left as None. 
None 
distributed 
bool 
Whether or not the gradient computation will
be distributed. If 
False 
popsize_weighted_grad_avg 
Optional[bool] 
Only to be used in distributed mode.
(where being in distributed mode means 
None 
Source code in evotorch/algorithms/distributed/gaussian.py
def __init__(
self,
problem: Problem,
*,
popsize: int,
parenthood_ratio: float,
stdev_init: Optional[RealOrVector] = None,
radius_init: Optional[RealOrVector] = None,
num_interactions: Optional[int] = None,
popsize_max: Optional[int] = None,
center_init: Optional[RealOrVector] = None,
stdev_min: Optional[RealOrVector] = None,
stdev_max: Optional[RealOrVector] = None,
stdev_max_change: Optional[Union[float, RealOrVector]] = None,
obj_index: Optional[int] = None,
distributed: bool = False,
popsize_weighted_grad_avg: Optional[bool] = None,
):
"""
`__init__(...)`: Initialize the search algorithm.
Args:
problem: The problem object to work on.
popsize: The population size.
parenthood_ratio: Expected as a float larger than 0 and smaller
than 1. For example, setting this value to 0.1 means that
the top 10% of the population will be declared as the parents,
and those parents will be used for updating the population.
The amount of parents is always computed according to the
specified `popsize`, not according to the adapted population
size, and not according to `popsize_max`.
stdev_init: The initial standard deviation of the search
distribution, expressed as a scalar or as an array.
Determines the initial coverage area of the search
distribution.
If one wishes to configure the coverage area via the
argument `radius_init` instead, then `stdev_init` is expected
as None.
radius_init: The initial radius of the search distribution,
expressed as a scalar.
Determines the initial coverage area of the search
distribution.
Here, "radius" is defined as the norm of the search
distribution.
If one wishes to configure the coverage area via the
argument `stdev_init` instead, then `radius_init` is expected
as None.
num_interactions: When given as an integer n,
it is ensured that a population has interacted with
the GymProblem's environment n times. If this target
has not been reached yet, then the population is declared
too small, and gets extended with more samples,
until n amount of interactions is reached.
When given as None, popsize is the only configuration
affecting the size of a population.
popsize_max: Having `num_interactions` set as an integer
might cause the effective population size jump to
unnecesarily large numbers. To prevent this,
one can set `popsize_max` to specify an upper
bound for the effective population size.
center_init: The initial center solution.
Can be left as None.
stdev_min: The minimum value for the standard deviation
values of the Gaussian search distribution.
Can be left as None (which is the default),
or can be given as a scalar or as a 1dimensional array.
stdev_max: The maximum value for the standard deviation
values of the Gaussian search distribution.
Can be left as None (which is the default),
or can be given as a scalar or as a 1dimensional array.
stdev_max_change: The maximum update ratio allowed on the
standard deviation. Expected as None if no such limiter
is needed, or as a real number within 0.0 and 1.0 otherwise.
In the PGPE implementation of Ha (2017, 2018), a value of
0.2 (20%) was used.
For this CEM implementation, the default is None.
obj_index: Index of the objective according to which the
gradient estimations will be done.
For singleobjective problems, this can be left as None.
distributed: Whether or not the gradient computation will
be distributed. If `distributed` is given as False and
the problem is not parallelized, then everything will
be centralized (i.e. the entire computation will happen
in the main process).
If `distributed` is given as False, and the problem
is parallelized, then the population will be created
in the main process and then sent to remote workers
for parallelized evaluation, and then the remote fitnesses
will be collected by the main process again for computing
the search gradients.
If `distributed` is given as True, and the problem
is parallelized, then the search algorithm itself will
be distributed, in the sense that each remote actor will
generate its own population (such that the total population
size across all these actors becomes equal to `popsize`)
and will compute its own gradient, and then the main process
will collect these gradients, compute the averaged gradients
and update the main search distribution.
Nondistributed mode has the advantage of keeping the
population in the main process, which is good when one wishes
to do detailed monitoring during the evolutionary process,
but has the disadvantage of having to pass the solutions to
the remote actors and having to collect fitnesses, which
might result in increased interprocess communication traffic.
On the other hand, while it is not possible to monitor the
population in distributed mode, the distributed mode has the
advantage of significantly reducing the interprocess
communication traffic, since the only things communicated
with the remote actors are the search distributions (not the
solutions) and the gradients.
popsize_weighted_grad_avg: Only to be used in distributed mode.
(where being in distributed mode means `distributed` is given
as True). In distributed mode, each actor remotely samples
its own solution batches and computes its own gradients.
These gradients are then collected, and a final average
gradient is computed.
If `popsize_weighted_grad_avg` is True, then, while averaging
over the gradients, each gradient will have its own weight
that is computed according to how many solutions were sampled
by the actor that produced the gradient.
If `popsize_weighted_grad_avg` is False, then, there will not
be weighted averaging (or, each gradient will have equal
weight).
If `popsize_weighted_grad_avg` is None, then, the gradient
weights will be equal a value for `num_interactions` is given
(because `num_interactions` affects the number of solutions
according to the episode lengths, and popsizeweighting the
gradients could be misleading); and the gradient weights will
be weighted according to the subpopulation (i.e. subbatch)
sizes if `num_interactions` is left as None.
The default value for `popsize_weighted_grad_avg` is None.
When the distributed mode is disabled (i.e. when `distributed`
is False), then the argument `popsize_weighted_grad_avg` is
expected as None.
"""
self.DISTRIBUTION_PARAMS = {"parenthood_ratio": float(parenthood_ratio)}
super().__init__(
problem,
popsize=popsize,
center_learning_rate=1.0,
stdev_learning_rate=1.0,
stdev_init=stdev_init,
radius_init=radius_init,
popsize_max=popsize_max,
num_interactions=num_interactions,
optimizer=None,
optimizer_config=None,
ranking_method=None,
center_init=center_init,
stdev_min=stdev_min,
stdev_max=stdev_max,
stdev_max_change=stdev_max_change,
obj_index=obj_index,
distributed=distributed,
popsize_weighted_grad_avg=popsize_weighted_grad_avg,
)
GaussianSearchAlgorithm (SearchAlgorithm, SinglePopulationAlgorithmMixin)
¶
Base class for search algorithms based on Gaussian distribution.
Source code in evotorch/algorithms/distributed/gaussian.py
class GaussianSearchAlgorithm(SearchAlgorithm, SinglePopulationAlgorithmMixin):
"""
Base class for search algorithms based on Gaussian distribution.
"""
DISTRIBUTION_TYPE = NotImplemented
DISTRIBUTION_PARAMS = NotImplemented
def __init__(
self,
problem: Problem,
*,
popsize: int,
center_learning_rate: float,
stdev_learning_rate: float,
stdev_init: Optional[RealOrVector] = None,
radius_init: Optional[RealOrVector] = None,
num_interactions: Optional[int] = None,
popsize_max: Optional[int] = None,
optimizer=None,
optimizer_config: Optional[dict] = None,
ranking_method: Optional[str] = None,
center_init: Optional[RealOrVector] = None,
stdev_min: Optional[RealOrVector] = None,
stdev_max: Optional[RealOrVector] = None,
stdev_max_change: Optional[RealOrVector] = None,
obj_index: Optional[int] = None,
distributed: bool = False,
popsize_weighted_grad_avg: Optional[bool] = None,
ensure_even_popsize: bool = False,
):
# Ensure that the problem is numeric
problem.ensure_numeric()
# The distributionbased algorithms we consider here cannot handle strict lower and upper bound constraints.
# Therefore, we ensure that the given problem is unbounded.
problem.ensure_unbounded()
# Initialize the SearchAlgorithm, which is the parent class
SearchAlgorithm.__init__(
self,
problem,
center=self._get_mu,
stdev=self._get_sigma,
mean_eval=self._get_mean_eval,
)
self._ensure_even_popsize = bool(ensure_even_popsize)
if not distributed:
# self.add_status_getters({"median_eval": self._get_median_eval})
if num_interactions is not None:
self.add_status_getters({"popsize": self._get_popsize})
if self._ensure_even_popsize:
if (popsize % 2) != 0:
raise ValueError(
f"`popsize` was expected as an even number. However, the received `popsize` is {popsize}."
)
if center_init is None:
# If a starting point for the search distribution is not given,
# then we use the problem object to generate us one.
mu = problem.generate_values(1).reshape(1)
else:
# If a starting point for the search distribution is given,
# then we make sure that its length, dtype, and device
# are correct.
mu = problem.ensure_tensor_length_and_dtype(center_init, allow_scalar=False, about="center_init")
# Get the standard deviation or the radius configuration from the arguments
stdev_init = to_stdev_init(
solution_length=problem.solution_length, stdev_init=stdev_init, radius_init=radius_init
)
# Make sure that the provided initial standard deviation is
# of correct length, dtype, and device.
sigma = problem.ensure_tensor_length_and_dtype(stdev_init, about="stdev_init", allow_scalar=False)
# Create the distribution
dist_cls = self.DISTRIBUTION_TYPE
dist_params = deepcopy(self.DISTRIBUTION_PARAMS) if self.DISTRIBUTION_PARAMS is not None else {}
dist_params.update({"mu": mu, "sigma": sigma})
self._distribution: Distribution = dist_cls(dist_params, dtype=problem.dtype, device=problem.device)
# Store the following keyword arguments to use later
self._popsize = int(popsize)
self._popsize_max = None if popsize_max is None else int(popsize_max)
self._num_interactions = None if num_interactions is None else int(num_interactions)
self._center_learning_rate = float(center_learning_rate)
self._stdev_learning_rate = float(stdev_learning_rate)
self._optimizer = self._initialize_optimizer(self._center_learning_rate, optimizer, optimizer_config)
self._ranking_method = None if ranking_method is None else str(ranking_method)
self._stdev_min = (
None
if stdev_min is None
else problem.ensure_tensor_length_and_dtype(stdev_min, about="stdev_min", allow_scalar=True)
)
self._stdev_max = (
None
if stdev_max is None
else problem.ensure_tensor_length_and_dtype(stdev_max, about="stdev_max", allow_scalar=True)
)
self._stdev_max_change = (
None
if stdev_max_change is None
else problem.ensure_tensor_length_and_dtype(stdev_max_change, about="stdev_max_change", allow_scalar=True)
)
self._obj_index = problem.normalize_obj_index(obj_index)
if distributed and (problem.num_actors > 0):
# If the algorithm is initialized in distributed mode, and also if the problem is configured
# for parallelization, then the _step method becomes an alias for _step_distributed
self._step = self._step_distributed
else:
# Otherwise, the _step method becomes an alias for _step_non_distributed
self._step = self._step_non_distributed
if popsize_weighted_grad_avg is None:
self._popsize_weighted_grad_avg = num_interactions is None
else:
if not distributed:
raise ValueError(
"The argument `popsize_weighted_grad_avg` can only be used in distributed mode."
" (i.e. when the argument `distributed` is given as True)."
" When `distributed` is False, please leave `popsize_weighted_grad_avg` as None."
)
self._popsize_weighted_grad_avg = bool(popsize_weighted_grad_avg)
self._mean_eval: Optional[float] = None
self._population: Optional[SolutionBatch] = None
self._first_iter: bool = True
# We would like to add the reporting capabilities of the mixin class `singlePopulationAlgorithmMixin`.
# However, we exclude "mean_eval" from the reporting services requested from `SinglePopulationAlgorithmMixin`
# because this class has its own reporting mechanism for `mean_eval`.
# Additionally, we enable the reporting services of `SinglePopulationAlgorithmMixin` only when we are
# in the nondistributed mode. This is because we do not have a centrally stored population at all in the
# distributed mode.
SinglePopulationAlgorithmMixin.__init__(self, exclude="mean_eval", enable=(not distributed))
def _initialize_optimizer(
self, learning_rate: float, optimizer=None, optimizer_config: Optional[dict] = None
) > object:
if optimizer is None:
return None
elif isinstance(optimizer, str):
center_optim_cls = get_optimizer_class(optimizer, optimizer_config)
return center_optim_cls(
stepsize=float(learning_rate),
dtype=self._distribution.dtype,
solution_length=self._distribution.solution_length,
device=self._distribution.device,
)
else:
return optimizer
def _step(self):
raise NotImplementedError
def _step_distributed(self):
# Use the problem object's `sample_and_compute_gradients` method
# to do parallelized and distributed gradient computation
fetched = self.problem.sample_and_compute_gradients(
self._distribution,
self._popsize,
popsize_max=self._popsize_max,
obj_index=self._obj_index,
num_interactions=self._num_interactions,
ranking_method=self._ranking_method,
ensure_even_popsize=self._ensure_even_popsize,
)
# The method `sample_and_compute_gradients(...)` returns a list of dictionaries, each dictionary being
# the result of a different remote computation.
# For each remote computation, the list will contain a dictionary that looks like this:
# {"gradients": <gradients dictionary here>, "num_solutions": ..., "mean_eval": ...}
# We will now accumulate all the gradients, num_solutions, and mean_evals in their own lists.
# So, in the end, we will have a list of gradients, a list of num_solutions, and a list of
# mean_eval.
# These lists will be stored by the following temporary class:
class list_of:
gradients = []
num_solutions = []
mean_eval = []
# We are now filling the lists declared above
n = len(fetched)
for i in range(n):
list_of.gradients.append(fetched[i]["gradients"])
list_of.num_solutions.append(fetched[i]["num_solutions"])
list_of.mean_eval.append(fetched[i]["mean_eval"])
# Here, we get the keys of our gradient dictionaries.
# For most simple Gaussian distributions, grad_keys should be {"mu", "sigma"}.
grad_keys = set(list_of.gradients[0].keys())
# We now find the total number of solutions and the overall average mean_eval.
# The overall average mean will be reported to the user.
total_num_solutions = 0
total_weighted_eval = 0
for i in range(n):
total_num_solutions += list_of.num_solutions[i]
total_weighted_eval += float(list_of.num_solutions[i] * list_of.mean_eval[i])
avg_mean_eval = total_weighted_eval / total_num_solutions
# For each gradient (in most cases among 'mu' and 'sigma'), we allocate a new 0filled tensor.
avg_gradients = {}
for key in grad_keys:
avg_gradients[key] = self._distribution.make_zeros(num_solutions=1).reshape(1)
# Below, we iterate over all collected results and add their gradients, in a weighted manner, onto the
# `avg_gradients` we allocated above.
# At the end, `avg_gradients` will store the weightedaveraged gradients to be followed by the algorithm.
for i in range(n):
# For each collected result, we compute a weight for the gradient, which is the number of solutions
# sampled divided by the total number of sampled solutions.
num_solutions = list_of.num_solutions[i]
if self._popsize_weighted_grad_avg:
# If we are to weigh each gradient by its popsize (i.e. by its sample size)
# then the its weight is computed as its number of solutions divided by the
# total number of solutions
weight = num_solutions / total_num_solutions
else:
# If we are NOT to weigh each gradient by its popsize (i.e. by its sample size)
# then the weight of this gradient simply becomes 1 divided by the number of gradients.
weight = 1 / n
for key in grad_keys:
grad = list_of.gradients[i][key]
avg_gradients[key] += weight * grad
self._update_distribution(avg_gradients)
self._mean_eval = avg_mean_eval
def _step_non_distributed(self):
# First, we define an inner function which fills the current population by sampling from the distribution.
def fill_and_eval_pop():
# This inner function is responsible for filling the main population with samples
# and evaluate them.
if self._num_interactions is None:
# If num_interactions is configured as None, this means that we are not going to adapt
# the population size according to the number of simulation interactions reported
# by the problem object.
# We first make sure that the population (which is to be of constant size, since we are
# not in the adaptive population size mode) is allocated.
if self._population is None:
self._population = SolutionBatch(
self.problem, popsize=self._popsize, device=self._distribution.device, empty=True
)
# Now, we do inplace sampling on the population.
self._distribution.sample(out=self._population.access_values(), generator=self.problem)
# Finally, here, the solutions are evaluated.
self.problem.evaluate(self._population)
else:
# If num_interactions is not None, then this means that we have a threshold for the number
# of simulator interactions to reach before declaring the phase of sampling complete.
# In other words, we have to adapt our population size according to the number of simulator
# interactions reported by the problem object.
# The 'total_interaction_count' status reported by the problem object shows the global interaction count.
# Therefore, to properly count the simulator interactions we made during this generation, we need
# to get the interaction count before starting our sampling and evaluation operations.
first_num_interactions = self.problem.status.get("total_interaction_count", 0)
# We will keep allocating and evaluating new populations until the interaction count threshold is reached.
# These newly allocated populations will eventually concatenated into one.
# The notyetconcatenated populations and the total allocated population size will be stored below:
populations = []
total_popsize = 0
# Below, we repeatedly allocate, sample, and evaluate, until our thresholds are reached.
while True:
# Allocate a new population
newpop = SolutionBatch(
self.problem,
popsize=self._popsize,
like=self._population,
empty=True,
)
# Update the total population size
total_popsize += len(newpop)
# Sample new solutions within the newly allocated population
self._distribution.sample(out=newpop.access_values(), generator=self.problem)
# Evaluate the new population
self.problem.evaluate(newpop)
# Add the newly allocated and evaluated population into the populations list
populations.append(newpop)
# In addition to the num_interactions threshold, we might also have a popsize_max threshold.
# We now check this threshold.
if (self._popsize_max is not None) and (total_popsize >= self._popsize_max):
# If the popsize_max threshold is reached, we leave the loop.
break
# We now compute the number of interactions we have made during this while loop.
interactions_made = self.problem.status["total_interaction_count"]  first_num_interactions
if interactions_made > self._num_interactions:
# If the number of interactions exceeds our threshold, we leave the loop.
break
# Finally, we concatenate all our populations into one.
self._population = SolutionBatch.cat(populations)
if self._first_iter:
# If we are computing the first generation, we just sample from our distribution and evaluate
# the solutions.
fill_and_eval_pop()
self._first_iter = False
else:
# If we are computing next generations, then we need to compute the gradients of the last
# generation, sample a new population, and evaluate the new population's solutions.
samples = self._population.access_values(keep_evals=True)
fitnesses = self._population.access_evals()[:, self._obj_index]
obj_sense = self.problem.senses[self._obj_index]
ranking_method = self._ranking_method
gradients = self._distribution.compute_gradients(
samples, fitnesses, objective_sense=obj_sense, ranking_method=ranking_method
)
self._update_distribution(gradients)
fill_and_eval_pop()
def _update_distribution(self, gradients: dict):
# This is where we follow the gradients with the help of the stored Distribution object.
# First, we check whether or not we will need to do a controlled update on the
# standard deviation (do we have imposed lower and upper bounds for the standard deviation,
# and do we have a maximum change limiter?)
controlled_stdev_update = (
(self._stdev_min is not None) or (self._stdev_max is not None) or (self._stdev_max_change is not None)
)
if controlled_stdev_update:
# If the standard deviation update needs to be controlled, we store the standard deviation just before
# the update. We will use this later.
old_sigma = self._distribution.sigma
# Here, we determine for which distribution parameter we have a learning rate and for which distribution
# parameter we have an optimizer.
learning_rates = {}
optimizers = {}
if self._optimizer is not None:
# If there is an optimizer, then we declare that "mu" has an optimizer
optimizers["mu"] = self._optimizer
else:
# If we do not have an optimizer, then we declare that "mu" has a raw learning rate coefficient
learning_rates["mu"] = self._center_learning_rate
# Here, we declare that "sigma" has a learning rate
learning_rates["sigma"] = self._stdev_learning_rate
# With the help of the Distribution object's `update_parameters(...)` method, we follow the gradients
updated_dist = self._distribution.update_parameters(
gradients, learning_rates=learning_rates, optimizers=optimizers
)
if controlled_stdev_update:
# If our standard deviation update needs to be controlled, then, considering the preupdate
# standard deviation, we ensure that the update constraints (lower and upper bounds and maximum change)
# are not violated.
updated_dist = updated_dist.modified_copy(
sigma=modify_tensor(
old_sigma,
updated_dist.sigma,
lb=self._stdev_min,
ub=self._stdev_max,
max_change=self._stdev_max_change,
)
)
# Now we can declare that our main distribution is the updated one
self._distribution = updated_dist
def _get_mu(self) > torch.Tensor:
return self._distribution.parameters["mu"]
def _get_sigma(self) > torch.Tensor:
return self._distribution.parameters["sigma"]
def _get_mean_eval(self) > Optional[float]:
if self._population is None:
return self._mean_eval
else:
return float(torch.mean(self._population.evals[:, self._obj_index]))
# def _get_median_eval(self) > Optional[float]:
# if self._population is None:
# return None
# else:
# return float(torch.median(self._population.evals[:, self._obj_index]))
def _get_popsize(self) > int:
return 0 if self._population is None else len(self._population)
@property
def optimizer(self):
"""
Get the optimizer used by this search algorithm.
If an optimizer is not being used, the result will be `None`.
If a PyTorch optimizer is being used, the result will be an instance of
`torch.optim.Optimizer`.
If the returned optimizer is "clipup", then the returned object will be
an instance of `evotorch.optimizers.ClipUp`.
The returned optimizer object can be used for reading/writing the
hyperparameters. For example, to read the learning of the optimizer,
one can do:
```python
learning_rate = my_search_algorithm.optimizer.param_groups[0]["lr"]
```
One can also update the learning rate like this:
```python
my_search_algorithm.optimizer.param_groups[0]["lr"] = new_learning_rate
```
**Note for when updating the learning rate of ClipUp.**
At the moment of initialization, if one provides `center_learning_rate`
but the maximum speed is not specified (i.e. the search algorithm is not
given something like `optimizer_config={"max_speed": ...}`), then, the
maximum speed is initialized as `2*center_learning_rate`. However, when
this `center_learning_rate` is later modified (via
`my_search_algorithm.optimizer.param_groups[0]["lr"] = new_center_learning_rate`
the maximum speed will NOT be automatically adjusted.
Therefore, when updating the center learning rate of ClipUp, consider
also adjusting the maximum speed of ClipUp via:
`my_search_algorithm.optimizer.param_groups[0]["max_speed"] = ...`
"""
return None if self._optimizer is None else self._optimizer.contained_optimizer
@property
def population(self) > Optional[SolutionBatch]:
"""
The population, represented by a SolutionBatch.
If the population is not initialized yet, the retrieved value will
be None.
Also note that, if this algorithm is in distributed mode, the
retrieved value will be None, since the distributed mode causes the
population to be generated in the remote actors, and not in the main
process.
"""
return self._population
@property
def obj_index(self) > int:
"""
Index of the focused objective
"""
return self._obj_index
obj_index: int
property
readonly
¶
Index of the focused objective
optimizer
property
readonly
¶
Get the optimizer used by this search algorithm.
If an optimizer is not being used, the result will be None
.
If a PyTorch optimizer is being used, the result will be an instance of
torch.optim.Optimizer
.
If the returned optimizer is "clipup", then the returned object will be
an instance of evotorch.optimizers.ClipUp
.
The returned optimizer object can be used for reading/writing the hyperparameters. For example, to read the learning of the optimizer, one can do:
One can also update the learning rate like this:
Note for when updating the learning rate of ClipUp.
At the moment of initialization, if one provides center_learning_rate
but the maximum speed is not specified (i.e. the search algorithm is not
given something like optimizer_config={"max_speed": ...}
), then, the
maximum speed is initialized as 2*center_learning_rate
. However, when
this center_learning_rate
is later modified (via
my_search_algorithm.optimizer.param_groups[0]["lr"] = new_center_learning_rate
the maximum speed will NOT be automatically adjusted.
Therefore, when updating the center learning rate of ClipUp, consider
also adjusting the maximum speed of ClipUp via:
my_search_algorithm.optimizer.param_groups[0]["max_speed"] = ...
population: Optional[evotorch.core.SolutionBatch]
property
readonly
¶
The population, represented by a SolutionBatch.
If the population is not initialized yet, the retrieved value will be None. Also note that, if this algorithm is in distributed mode, the retrieved value will be None, since the distributed mode causes the population to be generated in the remote actors, and not in the main process.
PGPE (GaussianSearchAlgorithm)
¶
This implementation is the symmetricsampling variant proposed in the paper Sehnke et al. (2010).
Inspired by the PGPE implementations used in the studies of Ha (2017, 2019), and by the evolution strategy variant of Salimans et al. (2017), this PGPE implementation uses 0centered ranking by default. The default optimizer for this PGPE implementation is ClipUp (Toklu et al., 2020).
References:
Frank Sehnke, Christian Osendorfer, Thomas Ruckstiess,
Alex Graves, Jan Peters, Jurgen Schmidhuber (2010).
Parameterexploring Policy Gradients.
Neural Networks 23(4), 551559.
David Ha (2017). Evolving Stable Strategies.
<http://blog.otoro.net/2017/11/12/evolvingstablestrategies/>
Salimans, T., Ho, J., Chen, X., Sidor, S. and Sutskever, I. (2017).
Evolution Strategies as a Scalable Alternative to
Reinforcement Learning.
David Ha (2019). Reinforcement Learning for Improving Agent Design.
Artificial life 25 (4), 352365.
Toklu, N.E., Liskowski, P., Srivastava, R.K. (2020).
ClipUp: A Simple and Powerful Optimizer
for Distributionbased Policy Evolution.
Parallel Problem Solving from Nature (PPSN 2020).
Source code in evotorch/algorithms/distributed/gaussian.py
class PGPE(GaussianSearchAlgorithm):
"""
PGPE: Policy gradient with parameterbased exploration.
This implementation is the symmetricsampling variant proposed
in the paper Sehnke et al. (2010).
Inspired by the PGPE implementations used in the studies
of Ha (2017, 2019), and by the evolution strategy variant of
Salimans et al. (2017), this PGPE implementation uses 0centered
ranking by default.
The default optimizer for this PGPE implementation is ClipUp
(Toklu et al., 2020).
References:
Frank Sehnke, Christian Osendorfer, Thomas Ruckstiess,
Alex Graves, Jan Peters, Jurgen Schmidhuber (2010).
Parameterexploring Policy Gradients.
Neural Networks 23(4), 551559.
David Ha (2017). Evolving Stable Strategies.
<http://blog.otoro.net/2017/11/12/evolvingstablestrategies/>
Salimans, T., Ho, J., Chen, X., Sidor, S. and Sutskever, I. (2017).
Evolution Strategies as a Scalable Alternative to
Reinforcement Learning.
David Ha (2019). Reinforcement Learning for Improving Agent Design.
Artificial life 25 (4), 352365.
Toklu, N.E., Liskowski, P., Srivastava, R.K. (2020).
ClipUp: A Simple and Powerful Optimizer
for Distributionbased Policy Evolution.
Parallel Problem Solving from Nature (PPSN 2020).
"""
DISTRIBUTION_TYPE = NotImplemented # To be filled by the PGPE instance
DISTRIBUTION_PARAMS = NotImplemented # To be filled by the PGPE instance
def __init__(
self,
problem: Problem,
*,
popsize: int,
center_learning_rate: float,
stdev_learning_rate: float,
stdev_init: Optional[RealOrVector] = None,
radius_init: Optional[RealOrVector] = None,
num_interactions: Optional[int] = None,
popsize_max: Optional[int] = None,
optimizer="clipup",
optimizer_config: Optional[dict] = None,
ranking_method: Optional[str] = "centered",
center_init: Optional[RealOrVector] = None,
stdev_min: Optional[RealOrVector] = None,
stdev_max: Optional[RealOrVector] = None,
stdev_max_change: Optional[RealOrVector] = 0.2,
symmetric: bool = True,
obj_index: Optional[int] = None,
distributed: bool = False,
popsize_weighted_grad_avg: Optional[bool] = None,
):
"""
`__init__(...)`: Initialize the PGPE algorithm.
Args:
problem: The problem object which is being worked on.
The problem must have its dtype defined
(which means it works on Solution objects,
not with custom Solution objects).
Also, the problem must be singleobjective.
popsize: The population size.
In the case of PGPE, `popsize` is expected as an even number
in nondistributed mode. In distributed mode, PGPE will
ensure that each subpopulation size assigned to a remote
actor is an even number.
This behavior is because PGPE does symmetric sampling
(i.e. solutions are sampled in pairs).
center_learning_rate: The learning rate for the center
of the search distribution.
stdev_learning_rate: The learning rate for the standard
deviation values of the search distribution.
stdev_init: The initial standard deviation of the search
distribution, expressed as a scalar or as an array.
Determines the initial coverage area of the search
distribution.
If one wishes to configure the coverage area via the
argument `radius_init` instead, then `stdev_init` is expected
as None.
radius_init: The initial radius of the search distribution,
expressed as a scalar.
Determines the initial coverage area of the search
distribution.
Here, "radius" is defined as the norm of the search
distribution.
If one wishes to configure the coverage area via the
argument `stdev_init` instead, then `radius_init` is expected
as None.
num_interactions: When given as an integer n,
it is ensured that a population has interacted with
the GymProblem's environment n times. If this target
has not been reached yet, then the population is declared
too small, and gets extended with more samples,
until n amount of interactions is reached.
When given as None, popsize is the only configuration
affecting the size of a population.
popsize_max: Having `num_interactions` set as an integer
might cause the effective population size jump to
unnecesarily large numbers. To prevent this,
one can set `popsize_max` to specify an upper
bound for the effective population size.
optimizer: The optimizer to be used while following the
estimated the gradients.
Can be given as None if a momentumbased optimizer
is not required.
Otherwise, can be given as a str containing the name
of the optimizer (e.g. 'adam', 'clipup');
or as an instance of evotorch.optimizers.TorchOptimizer
or evotorch.optimizers.ClipUp.
The default is 'clipup'.
Note that, for ClipUp, the default maximum speed is set
as twice the given `center_learning_rate`.
This maximum speed can be configured by passing
`{"max_speed": ...}` to `optimizer_config`.
optimizer_config: Configuration which will be passed
to the optimizer as keyword arguments.
See `evotorch.optimizers` for details about
which optimizer accepts which keyword arguments.
ranking_method: Which ranking method will be used for
fitness shaping. See the documentation of
`evotorch.ranking.rank(...)` for details.
As in the study of Salimans et al. (2017),
the default is 'centered'.
Can be given as None if no such ranking is required.
center_init: The initial center solution.
Can be left as None.
stdev_min: Lower bound for the standard deviation value/array.
Can be given as a real number, or as an array of real numbers.
stdev_max: Upper bound for the standard deviation value/array.
Can be given as a real number, or as an array of real numbers.
stdev_max_change: The maximum update ratio allowed on the
standard deviation. Expected as None if no such limiter
is needed, or as a real number within 0.0 and 1.0 otherwise.
Like in the implementation of Ha (2017, 2018),
the default value for this setting is 0.2, meaning that
the update on the standard deviation values can not be
more than 20% of their original values.
symmetric: Whether or not the solutions will be sampled
in a symmetric/mirrored/antithetic manner.
The default is True.
obj_index: Index of the objective according to which the
gradient estimations will be done.
For singleobjective problems, this can be left as None.
distributed: Whether or not the gradient computation will
be distributed. If `distributed` is given as False and
the problem is not parallelized, then everything will
be centralized (i.e. the entire computation will happen
in the main process).
If `distributed` is given as False, and the problem
is parallelized, then the population will be created
in the main process and then sent to remote workers
for parallelized evaluation, and then the remote fitnesses
will be collected by the main process again for computing
the search gradients.
If `distributed` is given as True, and the problem
is parallelized, then the search algorithm itself will
be distributed, in the sense that each remote actor will
generate its own population (such that the total population
size across all these actors becomes equal to `popsize`)
and will compute its own gradient, and then the main process
will collect these gradients, compute the averaged gradients
and update the main search distribution.
Nondistributed mode has the advantage of keeping the
population in the main process, which is good when one wishes
to do detailed monitoring during the evolutionary process,
but has the disadvantage of having to pass the solutions to
the remote actors and having to collect fitnesses, which
might result in increased interprocess communication traffic.
On the other hand, while it is not possible to monitor the
population in distributed mode, the distributed mode has the
advantage of significantly reducing the interprocess
communication traffic, since the only things communicated
with the remote actors are the search distributions (not the
solutions) and the gradients.
popsize_weighted_grad_avg: Only to be used in distributed mode.
(where being in distributed mode means `distributed` is given
as True). In distributed mode, each actor remotely samples
its own solution batches and computes its own gradients.
These gradients are then collected, and a final average
gradient is computed.
If `popsize_weighted_grad_avg` is True, then, while averaging
over the gradients, each gradient will have its own weight
that is computed according to how many solutions were sampled
by the actor that produced the gradient.
If `popsize_weighted_grad_avg` is False, then, there will not
be weighted averaging (or, each gradient will have equal
weight).
If `popsize_weighted_grad_avg` is None, then, the gradient
weights will be equal a value for `num_interactions` is given
(because `num_interactions` affects the number of solutions
according to the episode lengths, and popsizeweighting the
gradients could be misleading); and the gradient weights will
be weighted according to the subpopulation (i.e. subbatch)
sizes if `num_interactions` is left as None.
The default value for `popsize_weighted_grad_avg` is None.
When the distributed mode is disabled (i.e. when `distributed`
is False), then the argument `popsize_weighted_grad_avg` is
expected as None.
"""
if symmetric:
self.DISTRIBUTION_TYPE = SymmetricSeparableGaussian
divide_by = "num_directions"
else:
self.DISTRIBUTION_TYPE = SeparableGaussian
divide_by = "num_solutions"
self.DISTRIBUTION_PARAMS = {"divide_mu_grad_by": divide_by, "divide_sigma_grad_by": divide_by}
super().__init__(
problem,
popsize=popsize,
center_learning_rate=center_learning_rate,
stdev_learning_rate=stdev_learning_rate,
stdev_init=stdev_init,
radius_init=radius_init,
popsize_max=popsize_max,
num_interactions=num_interactions,
optimizer=optimizer,
optimizer_config=optimizer_config,
ranking_method=ranking_method,
center_init=center_init,
stdev_min=stdev_min,
stdev_max=stdev_max,
stdev_max_change=stdev_max_change,
obj_index=obj_index,
distributed=distributed,
popsize_weighted_grad_avg=popsize_weighted_grad_avg,
ensure_even_popsize=symmetric,
)
__init__(self, problem, *, popsize, center_learning_rate, stdev_learning_rate, stdev_init=None, radius_init=None, num_interactions=None, popsize_max=None, optimizer='clipup', optimizer_config=None, ranking_method='centered', center_init=None, stdev_min=None, stdev_max=None, stdev_max_change=0.2, symmetric=True, obj_index=None, distributed=False, popsize_weighted_grad_avg=None)
special
¶
__init__(...)
: Initialize the PGPE algorithm.
Parameters:
Name  Type  Description  Default 

problem 
Problem 
The problem object which is being worked on. The problem must have its dtype defined (which means it works on Solution objects, not with custom Solution objects). Also, the problem must be singleobjective. 
required 
popsize 
int 
The population size.
In the case of PGPE, 
required 
center_learning_rate 
float 
The learning rate for the center of the search distribution. 
required 
stdev_learning_rate 
float 
The learning rate for the standard deviation values of the search distribution. 
required 
stdev_init 
Union[float, Iterable[float], torch.Tensor] 
The initial standard deviation of the search
distribution, expressed as a scalar or as an array.
Determines the initial coverage area of the search
distribution.
If one wishes to configure the coverage area via the
argument 
None 
radius_init 
Union[float, Iterable[float], torch.Tensor] 
The initial radius of the search distribution,
expressed as a scalar.
Determines the initial coverage area of the search
distribution.
Here, "radius" is defined as the norm of the search
distribution.
If one wishes to configure the coverage area via the
argument 
None 
num_interactions 
Optional[int] 
When given as an integer n, it is ensured that a population has interacted with the GymProblem's environment n times. If this target has not been reached yet, then the population is declared too small, and gets extended with more samples, until n amount of interactions is reached. When given as None, popsize is the only configuration affecting the size of a population. 
None 
popsize_max 
Optional[int] 
Having 
None 
optimizer 
The optimizer to be used while following the
estimated the gradients.
Can be given as None if a momentumbased optimizer
is not required.
Otherwise, can be given as a str containing the name
of the optimizer (e.g. 'adam', 'clipup');
or as an instance of evotorch.optimizers.TorchOptimizer
or evotorch.optimizers.ClipUp.
The default is 'clipup'.
Note that, for ClipUp, the default maximum speed is set
as twice the given 
'clipup' 

optimizer_config 
Optional[dict] 
Configuration which will be passed
to the optimizer as keyword arguments.
See 
None 
ranking_method 
Optional[str] 
Which ranking method will be used for
fitness shaping. See the documentation of

'centered' 
center_init 
Union[float, Iterable[float], torch.Tensor] 
The initial center solution. Can be left as None. 
None 
stdev_min 
Union[float, Iterable[float], torch.Tensor] 
Lower bound for the standard deviation value/array. Can be given as a real number, or as an array of real numbers. 
None 
stdev_max 
Union[float, Iterable[float], torch.Tensor] 
Upper bound for the standard deviation value/array. Can be given as a real number, or as an array of real numbers. 
None 
stdev_max_change 
Union[float, Iterable[float], torch.Tensor] 
The maximum update ratio allowed on the standard deviation. Expected as None if no such limiter is needed, or as a real number within 0.0 and 1.0 otherwise. Like in the implementation of Ha (2017, 2018), the default value for this setting is 0.2, meaning that the update on the standard deviation values can not be more than 20% of their original values. 
0.2 
symmetric 
bool 
Whether or not the solutions will be sampled in a symmetric/mirrored/antithetic manner. The default is True. 
True 
obj_index 
Optional[int] 
Index of the objective according to which the gradient estimations will be done. For singleobjective problems, this can be left as None. 
None 
distributed 
bool 
Whether or not the gradient computation will
be distributed. If 
False 
popsize_weighted_grad_avg 
Optional[bool] 
Only to be used in distributed mode.
(where being in distributed mode means 
None 
Source code in evotorch/algorithms/distributed/gaussian.py
def __init__(
self,
problem: Problem,
*,
popsize: int,
center_learning_rate: float,
stdev_learning_rate: float,
stdev_init: Optional[RealOrVector] = None,
radius_init: Optional[RealOrVector] = None,
num_interactions: Optional[int] = None,
popsize_max: Optional[int] = None,
optimizer="clipup",
optimizer_config: Optional[dict] = None,
ranking_method: Optional[str] = "centered",
center_init: Optional[RealOrVector] = None,
stdev_min: Optional[RealOrVector] = None,
stdev_max: Optional[RealOrVector] = None,
stdev_max_change: Optional[RealOrVector] = 0.2,
symmetric: bool = True,
obj_index: Optional[int] = None,
distributed: bool = False,
popsize_weighted_grad_avg: Optional[bool] = None,
):
"""
`__init__(...)`: Initialize the PGPE algorithm.
Args:
problem: The problem object which is being worked on.
The problem must have its dtype defined
(which means it works on Solution objects,
not with custom Solution objects).
Also, the problem must be singleobjective.
popsize: The population size.
In the case of PGPE, `popsize` is expected as an even number
in nondistributed mode. In distributed mode, PGPE will
ensure that each subpopulation size assigned to a remote
actor is an even number.
This behavior is because PGPE does symmetric sampling
(i.e. solutions are sampled in pairs).
center_learning_rate: The learning rate for the center
of the search distribution.
stdev_learning_rate: The learning rate for the standard
deviation values of the search distribution.
stdev_init: The initial standard deviation of the search
distribution, expressed as a scalar or as an array.
Determines the initial coverage area of the search
distribution.
If one wishes to configure the coverage area via the
argument `radius_init` instead, then `stdev_init` is expected
as None.
radius_init: The initial radius of the search distribution,
expressed as a scalar.
Determines the initial coverage area of the search
distribution.
Here, "radius" is defined as the norm of the search
distribution.
If one wishes to configure the coverage area via the
argument `stdev_init` instead, then `radius_init` is expected
as None.
num_interactions: When given as an integer n,
it is ensured that a population has interacted with
the GymProblem's environment n times. If this target
has not been reached yet, then the population is declared
too small, and gets extended with more samples,
until n amount of interactions is reached.
When given as None, popsize is the only configuration
affecting the size of a population.
popsize_max: Having `num_interactions` set as an integer
might cause the effective population size jump to
unnecesarily large numbers. To prevent this,
one can set `popsize_max` to specify an upper
bound for the effective population size.
optimizer: The optimizer to be used while following the
estimated the gradients.
Can be given as None if a momentumbased optimizer
is not required.
Otherwise, can be given as a str containing the name
of the optimizer (e.g. 'adam', 'clipup');
or as an instance of evotorch.optimizers.TorchOptimizer
or evotorch.optimizers.ClipUp.
The default is 'clipup'.
Note that, for ClipUp, the default maximum speed is set
as twice the given `center_learning_rate`.
This maximum speed can be configured by passing
`{"max_speed": ...}` to `optimizer_config`.
optimizer_config: Configuration which will be passed
to the optimizer as keyword arguments.
See `evotorch.optimizers` for details about
which optimizer accepts which keyword arguments.
ranking_method: Which ranking method will be used for
fitness shaping. See the documentation of
`evotorch.ranking.rank(...)` for details.
As in the study of Salimans et al. (2017),
the default is 'centered'.
Can be given as None if no such ranking is required.
center_init: The initial center solution.
Can be left as None.
stdev_min: Lower bound for the standard deviation value/array.
Can be given as a real number, or as an array of real numbers.
stdev_max: Upper bound for the standard deviation value/array.
Can be given as a real number, or as an array of real numbers.
stdev_max_change: The maximum update ratio allowed on the
standard deviation. Expected as None if no such limiter
is needed, or as a real number within 0.0 and 1.0 otherwise.
Like in the implementation of Ha (2017, 2018),
the default value for this setting is 0.2, meaning that
the update on the standard deviation values can not be
more than 20% of their original values.
symmetric: Whether or not the solutions will be sampled
in a symmetric/mirrored/antithetic manner.
The default is True.
obj_index: Index of the objective according to which the
gradient estimations will be done.
For singleobjective problems, this can be left as None.
distributed: Whether or not the gradient computation will
be distributed. If `distributed` is given as False and
the problem is not parallelized, then everything will
be centralized (i.e. the entire computation will happen
in the main process).
If `distributed` is given as False, and the problem
is parallelized, then the population will be created
in the main process and then sent to remote workers
for parallelized evaluation, and then the remote fitnesses
will be collected by the main process again for computing
the search gradients.
If `distributed` is given as True, and the problem
is parallelized, then the search algorithm itself will
be distributed, in the sense that each remote actor will
generate its own population (such that the total population
size across all these actors becomes equal to `popsize`)
and will compute its own gradient, and then the main process
will collect these gradients, compute the averaged gradients
and update the main search distribution.
Nondistributed mode has the advantage of keeping the
population in the main process, which is good when one wishes
to do detailed monitoring during the evolutionary process,
but has the disadvantage of having to pass the solutions to
the remote actors and having to collect fitnesses, which
might result in increased interprocess communication traffic.
On the other hand, while it is not possible to monitor the
population in distributed mode, the distributed mode has the
advantage of significantly reducing the interprocess
communication traffic, since the only things communicated
with the remote actors are the search distributions (not the
solutions) and the gradients.
popsize_weighted_grad_avg: Only to be used in distributed mode.
(where being in distributed mode means `distributed` is given
as True). In distributed mode, each actor remotely samples
its own solution batches and computes its own gradients.
These gradients are then collected, and a final average
gradient is computed.
If `popsize_weighted_grad_avg` is True, then, while averaging
over the gradients, each gradient will have its own weight
that is computed according to how many solutions were sampled
by the actor that produced the gradient.
If `popsize_weighted_grad_avg` is False, then, there will not
be weighted averaging (or, each gradient will have equal
weight).
If `popsize_weighted_grad_avg` is None, then, the gradient
weights will be equal a value for `num_interactions` is given
(because `num_interactions` affects the number of solutions
according to the episode lengths, and popsizeweighting the
gradients could be misleading); and the gradient weights will
be weighted according to the subpopulation (i.e. subbatch)
sizes if `num_interactions` is left as None.
The default value for `popsize_weighted_grad_avg` is None.
When the distributed mode is disabled (i.e. when `distributed`
is False), then the argument `popsize_weighted_grad_avg` is
expected as None.
"""
if symmetric:
self.DISTRIBUTION_TYPE = SymmetricSeparableGaussian
divide_by = "num_directions"
else:
self.DISTRIBUTION_TYPE = SeparableGaussian
divide_by = "num_solutions"
self.DISTRIBUTION_PARAMS = {"divide_mu_grad_by": divide_by, "divide_sigma_grad_by": divide_by}
super().__init__(
problem,
popsize=popsize,
center_learning_rate=center_learning_rate,
stdev_learning_rate=stdev_learning_rate,
stdev_init=stdev_init,
radius_init=radius_init,
popsize_max=popsize_max,
num_interactions=num_interactions,
optimizer=optimizer,
optimizer_config=optimizer_config,
ranking_method=ranking_method,
center_init=center_init,
stdev_min=stdev_min,
stdev_max=stdev_max,
stdev_max_change=stdev_max_change,
obj_index=obj_index,
distributed=distributed,
popsize_weighted_grad_avg=popsize_weighted_grad_avg,
ensure_even_popsize=symmetric,
)
SNES (GaussianSearchAlgorithm)
¶
Inspired by the implementation at: http://schaul.site44.com/code/snes.py
Reference:
Schaul, T., Glasmachers, T., Schmidhuber, J. (2011).
High Dimensions and Heavy Tails for Natural Evolution Strategies.
Proceedings of the 13th annual conference on Genetic and evolutionary
computation (GECCO 2011).
Source code in evotorch/algorithms/distributed/gaussian.py
class SNES(GaussianSearchAlgorithm):
"""
SNES: Separable Natural Evolution Strategies
Inspired by the implementation at: http://schaul.site44.com/code/snes.py
Reference:
Schaul, T., Glasmachers, T., Schmidhuber, J. (2011).
High Dimensions and Heavy Tails for Natural Evolution Strategies.
Proceedings of the 13th annual conference on Genetic and evolutionary
computation (GECCO 2011).
"""
DISTRIBUTION_TYPE = ExpSeparableGaussian
DISTRIBUTION_PARAMS = None
def __init__(
self,
problem: Problem,
*,
stdev_init: Optional[RealOrVector] = None,
radius_init: Optional[RealOrVector] = None,
popsize: Optional[int] = None,
center_learning_rate: Optional[float] = None,
stdev_learning_rate: Optional[float] = None,
scale_learning_rate: bool = True,
num_interactions: Optional[int] = None,
popsize_max: Optional[int] = None,
optimizer=None,
optimizer_config: Optional[dict] = None,
ranking_method: Optional[str] = "nes",
center_init: Optional[RealOrVector] = None,
stdev_min: Optional[RealOrVector] = None,
stdev_max: Optional[RealOrVector] = None,
stdev_max_change: Optional[RealOrVector] = None,
obj_index: Optional[int] = None,
distributed: bool = False,
popsize_weighted_grad_avg: Optional[bool] = None,
):
"""
`__init__(...)`: Initialize the SNES algorithm.
Args:
problem: The problem object which is being worked on.
stdev_init: The initial standard deviation of the search
distribution, expressed as a scalar or as an array.
Determines the initial coverage area of the search
distribution.
If one wishes to configure the coverage area via the
argument `radius_init` instead, then `stdev_init` is expected
as None.
radius_init: The initial radius of the search distribution,
expressed as a scalar.
Determines the initial coverage area of the search
distribution.
Here, "radius" is defined as the norm of the search
distribution.
If one wishes to configure the coverage area via the
argument `stdev_init` instead, then `radius_init` is expected
as None.
popsize: Population size. Can be specified as an int,
or can be left as None to let the solver decide.
In the case of SNES, `popsize` can be left as None,
in which case the default `popsize` will be computed
as `4 + floor(3 * log(n))` where `n` is the length
of a solution.
center_learning_rate: Learning rate for updating the mean
of the search distribution. Default value is 1.0
stdev_learning_rate: Learning rate for updating the covariance
matrix of the search distribution.
The default value is `0.2 * (3 + log(n)) / sqrt(n)`
where `n` is the length of a solution.
scale_learning_rate: For SNES, there is a default standard
deviation learning rate value which is computed as
`0.2 * (3 + log(n)) / sqrt(n)` (where `n` is the solution
length).
If scale_learning_rate is True (which is the default),
then the effective learning rate for the standard deviation
becomes the provided `stdev_learning_rate` multiplied by this
default value. If `scale_learning_rate` is False, then the
effective standard deviation learning rate becomes
equal to the provided `stdev_learning_rate` value.
num_interactions: When given as an integer n,
it is ensured that a population has interacted with
the GymProblem's environment n times. If this target
has not been reached yet, then the population is declared
too small, and gets extended with more samples,
until n amount of interactions is reached.
When given as None, popsize is the only configuration
affecting the size of a population.
popsize_max: Having `num_interactions` set as an integer
might cause the effective population size jump to
unnecesarily large numbers. To prevent this,
one can set `popsize_max` to specify an upper
bound for the effective population size.
num_interactions: When given as an integer n,
it is ensured that a population has interacted with
the GymProblem's environment n times. If this target
has not been reached yet, then the population is declared
too small, and gets extended with more samples,
until n amount of interactions is reached.
When given as None, popsize is the only configuration
affecting the size of a population.
popsize_max: Having `num_interactions` set as an integer
might cause the effective population size jump to
unnecesarily large numbers. To prevent this,
one can set `popsize_max` to specify an upper
bound for the effective population size.
optimizer: The optimizer to be used while following the
estimated the gradients.
Can be given as None if a momentumbased optimizer
is not required.
Otherwise, can be given as a str containing the name
of the optimizer (e.g. 'adam', 'clipup');
or as an instance of evotorch.optimizers.TorchOptimizer
or evotorch.optimizers.ClipUp.
The default is None.
Note that, for ClipUp, the default maximum speed is set
as twice the given `center_learning_rate`.
This maximum speed can be configured by passing
`{"max_speed": ...}` to `optimizer_config`.
optimizer_config: Configuration which will be passed
to the optimizer as keyword arguments.
See `evotorch.optimizers` for details about
which optimizer accepts which keyword arguments.
ranking_method: Which ranking method will be used for
fitness shaping. See the documentation of
`evotorch.ranking.rank(...)` for details.
The default is 'nes'.
Can be given as None if no such ranking is required.
center_init: The initial center solution.
Can be left as None.
stdev_min: Minimum values for the standard deviation.
Expected as a 1dimensional array to serve as a limiter
to the diagonals of the covariance matrix's square root.
stdev_max: Maximum values for the standard deviation.
Expected as a 1dimensional array to serve as a limiter
to the diagonals of the covariance matrix's square root.
stdev_max_change: Maximum change allowed for when updating
the square roort of the covariance matrix.
obj_index: Index of the objective according to which the
gradient estimations will be done.
For singleobjective problems, this can be left as None.
distributed: Whether or not the gradient computation will
be distributed. If `distributed` is given as False and
the problem is not parallelized, then everything will
be centralized (i.e. the entire computation will happen
in the main process).
If `distributed` is given as False, and the problem
is parallelized, then the population will be created
in the main process and then sent to remote workers
for parallelized evaluation, and then the remote fitnesses
will be collected by the main process again for computing
the search gradients.
If `distributed` is given as True, and the problem
is parallelized, then the search algorithm itself will
be distributed, in the sense that each remote actor will
generate its own population (such that the total population
size across all these actors becomes equal to `popsize`)
and will compute its own gradient, and then the main process
will collect these gradients, compute the averaged gradients
and update the main search distribution.
Nondistributed mode has the advantage of keeping the
population in the main process, which is good when one wishes
to do detailed monitoring during the evolutionary process,
but has the disadvantage of having to pass the solutions to
the remote actors and having to collect fitnesses, which
might result in increased interprocess communication traffic.
On the other hand, while it is not possible to monitor the
population in distributed mode, the distributed mode has the
advantage of significantly reducing the interprocess
communication traffic, since the only things communicated
with the remote actors are the search distributions (not the
solutions) and the gradients.
popsize_weighted_grad_avg: Only to be used in distributed mode.
(where being in distributed mode means `distributed` is given
as True). In distributed mode, each actor remotely samples
its own solution batches and computes its own gradients.
These gradients are then collected, and a final average
gradient is computed.
If `popsize_weighted_grad_avg` is True, then, while averaging
over the gradients, each gradient will have its own weight
that is computed according to how many solutions were sampled
by the actor that produced the gradient.
If `popsize_weighted_grad_avg` is False, then, there will not
be weighted averaging (or, each gradient will have equal
weight).
If `popsize_weighted_grad_avg` is None, then, the gradient
weights will be equal a value for `num_interactions` is given
(because `num_interactions` affects the number of solutions
according to the episode lengths, and popsizeweighting the
gradients could be misleading); and the gradient weights will
be weighted according to the subpopulation (i.e. subbatch)
sizes if `num_interactions` is left as None.
The default value for `popsize_weighted_grad_avg` is None.
When the distributed mode is disabled (i.e. when `distributed`
is False), then the argument `popsize_weighted_grad_avg` is
expected as None.
"""
if popsize is None:
popsize = int(4 + math.floor(3 * math.log(problem.solution_length)))
if center_learning_rate is None:
center_learning_rate = 1.0
def default_stdev_lr():
n = problem.solution_length
return 0.2 * (3 + math.log(n)) / math.sqrt(n)
if stdev_learning_rate is None:
stdev_learning_rate = default_stdev_lr()
else:
stdev_learning_rate = float(stdev_learning_rate)
if scale_learning_rate:
stdev_learning_rate *= default_stdev_lr()
super().__init__(
problem,
popsize=popsize,
center_learning_rate=center_learning_rate,
stdev_learning_rate=stdev_learning_rate,
stdev_init=stdev_init,
radius_init=radius_init,
popsize_max=popsize_max,
num_interactions=num_interactions,
optimizer=optimizer,
optimizer_config=optimizer_config,
ranking_method=ranking_method,
center_init=center_init,
stdev_min=stdev_min,
stdev_max=stdev_max,
stdev_max_change=stdev_max_change,
obj_index=obj_index,
distributed=distributed,
popsize_weighted_grad_avg=popsize_weighted_grad_avg,
)
DISTRIBUTION_TYPE (SeparableGaussian)
¶
exponentialseparable Multivariate Gaussian, as used by SNES
Source code in evotorch/algorithms/distributed/gaussian.py
class ExpSeparableGaussian(SeparableGaussian):
"""exponentialseparable Multivariate Gaussian, as used by SNES"""
MANDATORY_PARAMETERS = {"mu", "sigma"}
OPTIONAL_PARAMETERS = set()
def _compute_gradients(self, samples: torch.Tensor, weights: torch.Tensor, ranking_used: Optional[str]) > dict:
if ranking_used != "nes":
weights = weights / torch.sum(torch.abs(weights))
scaled_noises = samples  self.mu
raw_noises = scaled_noises / self.sigma
mu_grad = total(dot(weights, scaled_noises))
sigma_grad = total(dot(weights, (raw_noises**2)  1))
return {"mu": mu_grad, "sigma": sigma_grad}
def update_parameters(
self,
gradients: dict,
*,
learning_rates: Optional[dict] = None,
optimizers: Optional[dict] = None,
) > "ExpSeparableGaussian":
mu_grad = gradients["mu"]
sigma_grad = gradients["sigma"]
new_mu = self.mu + self._follow_gradient("mu", mu_grad, learning_rates=learning_rates, optimizers=optimizers)
new_sigma = self.sigma * torch.exp(
0.5 * self._follow_gradient("sigma", sigma_grad, learning_rates=learning_rates, optimizers=optimizers)
)
return self.modified_copy(mu=new_mu, sigma=new_sigma)
update_parameters(self, gradients, *, learning_rates=None, optimizers=None)
¶Do an update on the distribution by following the given gradients.
It is expected that the inheriting class has its own implementation for this method.
Parameters:
Name  Type  Description  Default 

gradients 
dict 
Gradients, as a dictionary, which will be used for computing the necessary updates. 
required 
learning_rates 
Optional[dict] 
A dictionary which contains learning rates for parameters that will be updated using a learning rate coefficient. 
None 
optimizers 
Optional[dict] 
A dictionary which contains optimizer objects for parameters that will be updated using an adaptive optimizer. 
None 
Returns:
Type  Description 

ExpSeparableGaussian 
The updated copy of the distribution. 
Source code in evotorch/algorithms/distributed/gaussian.py
def update_parameters(
self,
gradients: dict,
*,
learning_rates: Optional[dict] = None,
optimizers: Optional[dict] = None,
) > "ExpSeparableGaussian":
mu_grad = gradients["mu"]
sigma_grad = gradients["sigma"]
new_mu = self.mu + self._follow_gradient("mu", mu_grad, learning_rates=learning_rates, optimizers=optimizers)
new_sigma = self.sigma * torch.exp(
0.5 * self._follow_gradient("sigma", sigma_grad, learning_rates=learning_rates, optimizers=optimizers)
)
return self.modified_copy(mu=new_mu, sigma=new_sigma)
__init__(self, problem, *, stdev_init=None, radius_init=None, popsize=None, center_learning_rate=None, stdev_learning_rate=None, scale_learning_rate=True, num_interactions=None, popsize_max=None, optimizer=None, optimizer_config=None, ranking_method='nes', center_init=None, stdev_min=None, stdev_max=None, stdev_max_change=None, obj_index=None, distributed=False, popsize_weighted_grad_avg=None)
special
¶
__init__(...)
: Initialize the SNES algorithm.
Parameters:
Name  Type  Description  Default 

problem 
Problem 
The problem object which is being worked on. 
required 
stdev_init 
Union[float, Iterable[float], torch.Tensor] 
The initial standard deviation of the search
distribution, expressed as a scalar or as an array.
Determines the initial coverage area of the search
distribution.
If one wishes to configure the coverage area via the
argument 
None 
radius_init 
Union[float, Iterable[float], torch.Tensor] 
The initial radius of the search distribution,
expressed as a scalar.
Determines the initial coverage area of the search
distribution.
Here, "radius" is defined as the norm of the search
distribution.
If one wishes to configure the coverage area via the
argument 
None 
popsize 
Optional[int] 
Population size. Can be specified as an int,
or can be left as None to let the solver decide.
In the case of SNES, 
None 
center_learning_rate 
Optional[float] 
Learning rate for updating the mean of the search distribution. Default value is 1.0 
None 
stdev_learning_rate 
Optional[float] 
Learning rate for updating the covariance
matrix of the search distribution.
The default value is 
None 
scale_learning_rate 
bool 
For SNES, there is a default standard
deviation learning rate value which is computed as

True 
num_interactions 
Optional[int] 
When given as an integer n, it is ensured that a population has interacted with the GymProblem's environment n times. If this target has not been reached yet, then the population is declared too small, and gets extended with more samples, until n amount of interactions is reached. When given as None, popsize is the only configuration affecting the size of a population. 
None 
popsize_max 
Optional[int] 
Having 
None 
num_interactions 
Optional[int] 
When given as an integer n, it is ensured that a population has interacted with the GymProblem's environment n times. If this target has not been reached yet, then the population is declared too small, and gets extended with more samples, until n amount of interactions is reached. When given as None, popsize is the only configuration affecting the size of a population. 
None 
popsize_max 
Optional[int] 
Having 
None 
optimizer 
The optimizer to be used while following the
estimated the gradients.
Can be given as None if a momentumbased optimizer
is not required.
Otherwise, can be given as a str containing the name
of the optimizer (e.g. 'adam', 'clipup');
or as an instance of evotorch.optimizers.TorchOptimizer
or evotorch.optimizers.ClipUp.
The default is None.
Note that, for ClipUp, the default maximum speed is set
as twice the given 
None 

optimizer_config 
Optional[dict] 
Configuration which will be passed
to the optimizer as keyword arguments.
See 
None 
ranking_method 
Optional[str] 
Which ranking method will be used for
fitness shaping. See the documentation of

'nes' 
center_init 
Union[float, Iterable[float], torch.Tensor] 
The initial center solution. Can be left as None. 
None 
stdev_min 
Union[float, Iterable[float], torch.Tensor] 
Minimum values for the standard deviation. Expected as a 1dimensional array to serve as a limiter to the diagonals of the covariance matrix's square root. 
None 
stdev_max 
Union[float, Iterable[float], torch.Tensor] 
Maximum values for the standard deviation. Expected as a 1dimensional array to serve as a limiter to the diagonals of the covariance matrix's square root. 
None 
stdev_max_change 
Union[float, Iterable[float], torch.Tensor] 
Maximum change allowed for when updating the square roort of the covariance matrix. 
None 
obj_index 
Optional[int] 
Index of the objective according to which the gradient estimations will be done. For singleobjective problems, this can be left as None. 
None 
distributed 
bool 
Whether or not the gradient computation will
be distributed. If 
False 
popsize_weighted_grad_avg 
Optional[bool] 
Only to be used in distributed mode.
(where being in distributed mode means 
None 
Source code in evotorch/algorithms/distributed/gaussian.py
def __init__(
self,
problem: Problem,
*,
stdev_init: Optional[RealOrVector] = None,
radius_init: Optional[RealOrVector] = None,
popsize: Optional[int] = None,
center_learning_rate: Optional[float] = None,
stdev_learning_rate: Optional[float] = None,
scale_learning_rate: bool = True,
num_interactions: Optional[int] = None,
popsize_max: Optional[int] = None,
optimizer=None,
optimizer_config: Optional[dict] = None,
ranking_method: Optional[str] = "nes",
center_init: Optional[RealOrVector] = None,
stdev_min: Optional[RealOrVector] = None,
stdev_max: Optional[RealOrVector] = None,
stdev_max_change: Optional[RealOrVector] = None,
obj_index: Optional[int] = None,
distributed: bool = False,
popsize_weighted_grad_avg: Optional[bool] = None,
):
"""
`__init__(...)`: Initialize the SNES algorithm.
Args:
problem: The problem object which is being worked on.
stdev_init: The initial standard deviation of the search
distribution, expressed as a scalar or as an array.
Determines the initial coverage area of the search
distribution.
If one wishes to configure the coverage area via the
argument `radius_init` instead, then `stdev_init` is expected
as None.
radius_init: The initial radius of the search distribution,
expressed as a scalar.
Determines the initial coverage area of the search
distribution.
Here, "radius" is defined as the norm of the search
distribution.
If one wishes to configure the coverage area via the
argument `stdev_init` instead, then `radius_init` is expected
as None.
popsize: Population size. Can be specified as an int,
or can be left as None to let the solver decide.
In the case of SNES, `popsize` can be left as None,
in which case the default `popsize` will be computed
as `4 + floor(3 * log(n))` where `n` is the length
of a solution.
center_learning_rate: Learning rate for updating the mean
of the search distribution. Default value is 1.0
stdev_learning_rate: Learning rate for updating the covariance
matrix of the search distribution.
The default value is `0.2 * (3 + log(n)) / sqrt(n)`
where `n` is the length of a solution.
scale_learning_rate: For SNES, there is a default standard
deviation learning rate value which is computed as
`0.2 * (3 + log(n)) / sqrt(n)` (where `n` is the solution
length).
If scale_learning_rate is True (which is the default),
then the effective learning rate for the standard deviation
becomes the provided `stdev_learning_rate` multiplied by this
default value. If `scale_learning_rate` is False, then the
effective standard deviation learning rate becomes
equal to the provided `stdev_learning_rate` value.
num_interactions: When given as an integer n,
it is ensured that a population has interacted with
the GymProblem's environment n times. If this target
has not been reached yet, then the population is declared
too small, and gets extended with more samples,
until n amount of interactions is reached.
When given as None, popsize is the only configuration
affecting the size of a population.
popsize_max: Having `num_interactions` set as an integer
might cause the effective population size jump to
unnecesarily large numbers. To prevent this,
one can set `popsize_max` to specify an upper
bound for the effective population size.
num_interactions: When given as an integer n,
it is ensured that a population has interacted with
the GymProblem's environment n times. If this target
has not been reached yet, then the population is declared
too small, and gets extended with more samples,
until n amount of interactions is reached.
When given as None, popsize is the only configuration
affecting the size of a population.
popsize_max: Having `num_interactions` set as an integer
might cause the effective population size jump to
unnecesarily large numbers. To prevent this,
one can set `popsize_max` to specify an upper
bound for the effective population size.
optimizer: The optimizer to be used while following the
estimated the gradients.
Can be given as None if a momentumbased optimizer
is not required.
Otherwise, can be given as a str containing the name
of the optimizer (e.g. 'adam', 'clipup');
or as an instance of evotorch.optimizers.TorchOptimizer
or evotorch.optimizers.ClipUp.
The default is None.
Note that, for ClipUp, the default maximum speed is set
as twice the given `center_learning_rate`.
This maximum speed can be configured by passing
`{"max_speed": ...}` to `optimizer_config`.
optimizer_config: Configuration which will be passed
to the optimizer as keyword arguments.
See `evotorch.optimizers` for details about
which optimizer accepts which keyword arguments.
ranking_method: Which ranking method will be used for
fitness shaping. See the documentation of
`evotorch.ranking.rank(...)` for details.
The default is 'nes'.
Can be given as None if no such ranking is required.
center_init: The initial center solution.
Can be left as None.
stdev_min: Minimum values for the standard deviation.
Expected as a 1dimensional array to serve as a limiter
to the diagonals of the covariance matrix's square root.
stdev_max: Maximum values for the standard deviation.
Expected as a 1dimensional array to serve as a limiter
to the diagonals of the covariance matrix's square root.
stdev_max_change: Maximum change allowed for when updating
the square roort of the covariance matrix.
obj_index: Index of the objective according to which the
gradient estimations will be done.
For singleobjective problems, this can be left as None.
distributed: Whether or not the gradient computation will
be distributed. If `distributed` is given as False and
the problem is not parallelized, then everything will
be centralized (i.e. the entire computation will happen
in the main process).
If `distributed` is given as False, and the problem
is parallelized, then the population will be created
in the main process and then sent to remote workers
for parallelized evaluation, and then the remote fitnesses
will be collected by the main process again for computing
the search gradients.
If `distributed` is given as True, and the problem
is parallelized, then the search algorithm itself will
be distributed, in the sense that each remote actor will
generate its own population (such that the total population
size across all these actors becomes equal to `popsize`)
and will compute its own gradient, and then the main process
will collect these gradients, compute the averaged gradients
and update the main search distribution.
Nondistributed mode has the advantage of keeping the
population in the main process, which is good when one wishes
to do detailed monitoring during the evolutionary process,
but has the disadvantage of having to pass the solutions to
the remote actors and having to collect fitnesses, which
might result in increased interprocess communication traffic.
On the other hand, while it is not possible to monitor the
population in distributed mode, the distributed mode has the
advantage of significantly reducing the interprocess
communication traffic, since the only things communicated
with the remote actors are the search distributions (not the
solutions) and the gradients.
popsize_weighted_grad_avg: Only to be used in distributed mode.
(where being in distributed mode means `distributed` is given
as True). In distributed mode, each actor remotely samples
its own solution batches and computes its own gradients.
These gradients are then collected, and a final average
gradient is computed.
If `popsize_weighted_grad_avg` is True, then, while averaging
over the gradients, each gradient will have its own weight
that is computed according to how many solutions were sampled
by the actor that produced the gradient.
If `popsize_weighted_grad_avg` is False, then, there will not
be weighted averaging (or, each gradient will have equal
weight).
If `popsize_weighted_grad_avg` is None, then, the gradient
weights will be equal a value for `num_interactions` is given
(because `num_interactions` affects the number of solutions
according to the episode lengths, and popsizeweighting the
gradients could be misleading); and the gradient weights will
be weighted according to the subpopulation (i.e. subbatch)
sizes if `num_interactions` is left as None.
The default value for `popsize_weighted_grad_avg` is None.
When the distributed mode is disabled (i.e. when `distributed`
is False), then the argument `popsize_weighted_grad_avg` is
expected as None.
"""
if popsize is None:
popsize = int(4 + math.floor(3 * math.log(problem.solution_length)))
if center_learning_rate is None:
center_learning_rate = 1.0
def default_stdev_lr():
n = problem.solution_length
return 0.2 * (3 + math.log(n)) / math.sqrt(n)
if stdev_learning_rate is None:
stdev_learning_rate = default_stdev_lr()
else:
stdev_learning_rate = float(stdev_learning_rate)
if scale_learning_rate:
stdev_learning_rate *= default_stdev_lr()
super().__init__(
problem,
popsize=popsize,
center_learning_rate=center_learning_rate,
stdev_learning_rate=stdev_learning_rate,
stdev_init=stdev_init,
radius_init=radius_init,
popsize_max=popsize_max,
num_interactions=num_interactions,
optimizer=optimizer,
optimizer_config=optimizer_config,
ranking_method=ranking_method,
center_init=center_init,
stdev_min=stdev_min,
stdev_max=stdev_max,
stdev_max_change=stdev_max_change,
obj_index=obj_index,
distributed=distributed,
popsize_weighted_grad_avg=popsize_weighted_grad_avg,
)
XNES (GaussianSearchAlgorithm)
¶
Inspired by the implementation at: http://schaul.site44.com/code/xnes.py https://github.com/pybrain/pybrain/blob/master/pybrain/optimization/distributionbased/xnes.py
Reference
Glasmachers, Tobias, et al. Exponential natural evolution strategies. Proceedings of the 12^{th} annual conference on Genetic and evolutionary computation (GECCO 2010).
Source code in evotorch/algorithms/distributed/gaussian.py
class XNES(GaussianSearchAlgorithm):
"""
XNES: Exponential Natural Evolution Strategies
Inspired by the implementation at:
http://schaul.site44.com/code/xnes.py
https://github.com/pybrain/pybrain/blob/master/pybrain/optimization/distributionbased/xnes.py
Reference:
Glasmachers, Tobias, et al.
Exponential natural evolution strategies.
Proceedings of the 12th annual conference on Genetic and evolutionary
computation (GECCO 2010).
"""
DISTRIBUTION_TYPE = ExpGaussian
DISTRIBUTION_PARAMS = None
def __init__(
self,
problem: Problem,
*,
stdev_init: Optional[RealOrVector] = None,
radius_init: Optional[RealOrVector] = None,
popsize: Optional[int] = None,
center_learning_rate: Optional[float] = None,
stdev_learning_rate: Optional[float] = None,
scale_learning_rate: bool = True,
num_interactions: Optional[int] = None,
popsize_max: Optional[int] = None,
optimizer=None,
optimizer_config: Optional[dict] = None,
ranking_method: Optional[str] = "nes",
center_init: Optional[RealOrVector] = None,
obj_index: Optional[int] = None,
distributed: bool = False,
popsize_weighted_grad_avg: Optional[bool] = None,
):
"""
`__init__(...)`: Initialize the XNES algorithm.
Args:
problem: The problem object which is being worked on.
stdev_init: The initial standard deviation of the search
distribution, expressed as a scalar or as an array.
Determines the initial coverage area of the search
distribution.
If one wishes to configure the coverage area via the
argument `radius_init` instead, then `stdev_init` is expected
as None.
radius_init: The initial radius of the search distribution,
expressed as a scalar.
Determines the initial coverage area of the search
distribution.
Here, "radius" is defined as the norm of the search
distribution.
If one wishes to configure the coverage area via the
argument `stdev_init` instead, then `radius_init` is expected
as None.
popsize: Population size. Can be specified as an int,
or can be left as None to let the solver decide.
In the case of SNES, `popsize` can be left as None,
in which case the default `popsize` will be computed
as `4 + floor(3 * log(n))` where `n` is the length
of a solution.
center_learning_rate: Learning rate for updating the mean
of the search distribution. Default value is 1.0
stdev_learning_rate: Learning rate for updating the covariance
matrix of the search distribution.
The default value is `0.6 * (3 + log(n)) / (n * sqrt(n))`
where `n` is the length of a solution.
scale_learning_rate: For SNES, there is a default standard
deviation learning rate value which is computed as
`0.6 * (3 + log(n)) / (n * sqrt(n))` (where `n` is the solution
length).
If scale_learning_rate is True (which is the default),
then the effective learning rate for the standard deviation
becomes the provided `stdev_learning_rate` multiplied by this
default value. If `scale_learning_rate` is False, then the
effective standard deviation learning rate becomes
equal to the provided `stdev_learning_rate` value.
num_interactions: When given as an integer n,
it is ensured that a population has interacted with
the GymProblem's environment n times. If this target
has not been reached yet, then the population is declared
too small, and gets extended with more samples,
until n amount of interactions is reached.
When given as None, popsize is the only configuration
affecting the size of a population.
popsize_max: Having `num_interactions` set as an integer
might cause the effective population size jump to
unnecesarily large numbers. To prevent this,
one can set `popsize_max` to specify an upper
bound for the effective population size.
num_interactions: When given as an integer n,
it is ensured that a population has interacted with
the GymProblem's environment n times. If this target
has not been reached yet, then the population is declared
too small, and gets extended with more samples,
until n amount of interactions is reached.
When given as None, popsize is the only configuration
affecting the size of a population.
optimizer: The optimizer to be used while following the
estimated the gradients.
Can be given as None if a momentumbased optimizer
is not required.
Otherwise, can be given as a str containing the name
of the optimizer (e.g. 'adam', 'clipup');
or as an instance of evotorch.optimizers.TorchOptimizer
or evotorch.optimizers.ClipUp.
The default is None.
Note that, for ClipUp, the default maximum speed is set
as twice the given `center_learning_rate`.
This maximum speed can be configured by passing
`{"max_speed": ...}` to `optimizer_config`.
optimizer_config: Configuration which will be passed
to the optimizer as keyword arguments.
See `evotorch.optimizers` for details about
which optimizer accepts which keyword arguments.
ranking_method: Which ranking method will be used for
fitness shaping. See the documentation of
`evotorch.ranking.rank(...)` for details.
The default is 'nes'.
Can be given as None if no such ranking is required.
center_init: The initial center solution.
Can be left as None.
obj_index: Index of the objective according to which the
gradient estimations will be done.
For singleobjective problems, this can be left as None.
distributed: Whether or not the gradient computation will
be distributed. If `distributed` is given as False and
the problem is not parallelized, then everything will
be centralized (i.e. the entire computation will happen
in the main process).
If `distributed` is given as False, and the problem
is parallelized, then the population will be created
in the main process and then sent to remote workers
for parallelized evaluation, and then the remote fitnesses
will be collected by the main process again for computing
the search gradients.
If `distributed` is given as True, and the problem
is parallelized, then the search algorithm itself will
be distributed, in the sense that each remote actor will
generate its own population (such that the total population
size across all these actors becomes equal to `popsize`)
and will compute its own gradient, and then the main process
will collect these gradients, compute the averaged gradients
and update the main search distribution.
Nondistributed mode has the advantage of keeping the
population in the main process, which is good when one wishes
to do detailed monitoring during the evolutionary process,
but has the disadvantage of having to pass the solutions to
the remote actors and having to collect fitnesses, which
might result in increased interprocess communication traffic.
On the other hand, while it is not possible to monitor the
population in distributed mode, the distributed mode has the
advantage of significantly reducing the interprocess
communication traffic, since the only things communicated
with the remote actors are the search distributions (not the
solutions) and the gradients.
popsize_weighted_grad_avg: Only to be used in distributed mode.
(where being in distributed mode means `distributed` is given
as True). In distributed mode, each actor remotely samples
its own solution batches and computes its own gradients.
These gradients are then collected, and a final average
gradient is computed.
If `popsize_weighted_grad_avg` is True, then, while averaging
over the gradients, each gradient will have its own weight
that is computed according to how many solutions were sampled
by the actor that produced the gradient.
If `popsize_weighted_grad_avg` is False, then, there will not
be weighted averaging (or, each gradient will have equal
weight).
If `popsize_weighted_grad_avg` is None, then, the gradient
weights will be equal a value for `num_interactions` is given
(because `num_interactions` affects the number of solutions
according to the episode lengths, and popsizeweighting the
gradients could be misleading); and the gradient weights will
be weighted according to the subpopulation (i.e. subbatch)
sizes if `num_interactions` is left as None.
The default value for `popsize_weighted_grad_avg` is None.
When the distributed mode is disabled (i.e. when `distributed`
is False), then the argument `popsize_weighted_grad_avg` is
expected as None.
"""
if popsize is None:
popsize = int(4 + math.floor(3 * math.log(problem.solution_length)))
if center_learning_rate is None:
center_learning_rate = 1.0
def default_stdev_lr():
n = problem.solution_length
return 0.6 * (3 + math.log(n)) / (n * math.sqrt(n))
if stdev_learning_rate is None:
stdev_learning_rate = default_stdev_lr()
else:
stdev_learning_rate = float(stdev_learning_rate)
if scale_learning_rate:
stdev_learning_rate *= default_stdev_lr()
super().__init__(
problem,
popsize=popsize,
center_learning_rate=center_learning_rate,
stdev_learning_rate=stdev_learning_rate,
stdev_init=stdev_init,
radius_init=radius_init,
popsize_max=popsize_max,
num_interactions=num_interactions,
optimizer=optimizer,
optimizer_config=optimizer_config,
ranking_method=ranking_method,
center_init=center_init,
stdev_min=None,
stdev_max=None,
stdev_max_change=None,
obj_index=obj_index,
distributed=distributed,
popsize_weighted_grad_avg=popsize_weighted_grad_avg,
)
DISTRIBUTION_TYPE (Distribution)
¶
exponential Multivariate Gaussian, as used by XNES
Source code in evotorch/algorithms/distributed/gaussian.py
class ExpGaussian(Distribution):
"""exponential Multivariate Gaussian, as used by XNES"""
# Corresponding to mu and A in symbols used in xNES paper
MANDATORY_PARAMETERS = {"mu", "sigma"}
# Inverse of sigma, numerically more stable to track this independently to sigma
OPTIONAL_PARAMETERS = {"sigma_inv"}
def __init__(
self,
parameters: dict,
*,
solution_length: Optional[int] = None,
device: Optional[Device] = None,
dtype: Optional[DType] = None,
):
[mu_length] = parameters["mu"].shape
# Make sigma 2D
if len(parameters["sigma"].shape) == 1:
parameters["sigma"] = torch.diag(parameters["sigma"])
# Automatically generate sigma_inv if not provided
if "sigma_inv" not in parameters:
parameters["sigma_inv"] = torch.inverse(parameters["sigma"])
[sigma_length, _] = parameters["sigma"].shape
if solution_length is None:
solution_length = mu_length
else:
if solution_length != mu_length:
raise ValueError(
f"The argument `solution_length` does not match the length of `mu` provided in `parameters`."
f" solution_length={solution_length},"
f' parameters["mu"]={mu_length}.'
)
if mu_length != sigma_length:
raise ValueError(
f"The tensors `mu` and `sigma` provided within `parameters` have mismatching lengths."
f' parameters["mu"]={mu_length},'
f' parameters["sigma"]={sigma_length}.'
)
super().__init__(
solution_length=solution_length,
parameters=parameters,
device=device,
dtype=dtype,
)
# Make identity matrix as this is used throughout in gradient computation
self.eye = self.make_I(solution_length)
@property
def mu(self) > torch.Tensor:
"""Getter for mu
Returns:
mu (torch.Tensor): The center of the search distribution
"""
return self.parameters["mu"]
@mu.setter
def mu(self, new_mu: Iterable):
"""Setter for mu
Args:
new_mu (torch.Tensor): The new value of mu
"""
self.parameters["mu"] = torch.as_tensor(new_mu, dtype=self.dtype, device=self.device)
@property
def cov(self) > torch.Tensor:
"""The covariance matrix A^T A"""
return self.sigma.transpose(0, 1) @ self.sigma
@property
def sigma(self) > torch.Tensor:
"""Getter for sigma
Returns:
sigma (torch.Tensor): The square root of the covariance matrix
"""
return self.parameters["sigma"]
@property
def sigma_inv(self) > torch.Tensor:
"""Getter for sigma_inv
Returns:
sigma_inv (torch.Tensor): The inverse square root of the covariance matrix
"""
if "sigma_inv" in self.parameters:
return self.parameters["sigma_inv"]
else:
return torch.inverse(self.parameters["sigma"])
@property
def A(self) > torch.Tensor:
"""Alias for self.sigma, for notational consistency with paper"""
return self.sigma
@property
def A_inv(self) > torch.Tensor:
"""Alias for self.sigma_inv, for notational consistency with paper"""
return self.sigma_inv
@sigma.setter
def sigma(self, new_sigma: Iterable):
"""Setter for sigma
Args:
new_sigma (torch.Tensor): The new value of sigma, the square root of the covariance matrix
"""
self.parameters["sigma"] = torch.as_tensor(new_sigma, dtype=self.dtype, device=self.device)
def to_global_coordinates(self, local_coordinates: torch.Tensor) > torch.Tensor:
"""Map samples from local coordinate space N(0, I_d) to global coordinate space N(mu, A^T A)
This function is the inverse of to_local_coordinates
Args:
local_coordinates (torch.Tensor): The local coordinates sampled from N(0, I_d)
Returns:
global_coordinates (torch.Tensor): The global coordinates sampled from N(mu, A^T A)
"""
# Global samples are constructed as x = mu + A z where z is local coordinate
# We use transpose here to simplify the batched application of A
return self.mu.unsqueeze(0) + (self.A @ local_coordinates.T).T
def to_local_coordinates(self, global_coordinates: torch.Tensor) > torch.Tensor:
"""Map samples from global coordinate space N(mu, A^T A) to local coordinate space N(0, I_d)
This function is the inverse of to_global_coordinates
Args:
global_coordinates (torch.Tensor): The global coordinates sampled from N(mu, A^T A)
Returns:
local_coordinates (torch.Tensor): The local coordinates sampled from N(0, I_d)
"""
# Global samples are constructed as x = mu + A z where z is local coordinate
# Therefore, we can recover z according to z = A_inv (x  mu)
return (self.A_inv @ (global_coordinates  self.mu.unsqueeze(0)).T).T
def _fill(self, out: torch.Tensor, *, generator: Optional[torch.Generator] = None):
"""Fill a tensor with samples from N(mu, A^T A)
Args:
out (torch.Tensor): The tensor to fill
generator (Optional[torch.Generator]): A generator to use to generate random values
"""
# Fill with local coordinates from N(0, I_d)
self.make_gaussian(out=out, generator=generator)
# Map local coordinates to global coordinate system
out[:] = self.to_global_coordinates(out)
def _compute_gradients(self, samples: torch.Tensor, weights: torch.Tensor, ranking_used: Optional[str]) > dict:
"""Compute the gradients with respect to a given set of samples and weights
Args:
samples (torch.Tensor): Samples drawn from N(mu, A^T A), ideally using self._fill
weights (torch.Tensor): Weights e.g. fitnesses or utilities assigned to samples
ranking_used (optional[str]): The ranking method used to compute weights
Returns:
grads (dict): A dictionary containing the approximated natural gradient on d and M
"""
# Compute the local coordinates
local_coordinates = self.to_local_coordinates(samples)
# Make sure that the weights (utilities) are 0centered
# (Otherwise the formulations would have to consider a bias term)
if ranking_used not in ("centered", "normalized"):
weights = weights  torch.mean(weights)
d_grad = total(dot(weights, local_coordinates))
local_coordinates_outer = local_coordinates.unsqueeze(1) * local_coordinates.unsqueeze(2)
M_grad = torch.sum(
weights.unsqueeze(1).unsqueeze(1) * (local_coordinates_outer  self.eye.unsqueeze(0)), dim=0
)
return {
"d": d_grad,
"M": M_grad,
}
def update_parameters(
self,
gradients: dict,
*,
learning_rates: Optional[dict] = None,
optimizers: Optional[dict] = None,
) > "ExpGaussian":
d_grad = gradients["d"]
M_grad = gradients["M"]
if "d" not in learning_rates:
learning_rates["d"] = learning_rates["mu"]
if "M" not in learning_rates:
learning_rates["M"] = learning_rates["sigma"]
# Follow gradients for d, and M
update_d = self._follow_gradient("d", d_grad, learning_rates=learning_rates, optimizers=optimizers)
update_M = self._follow_gradient("M", M_grad, learning_rates=learning_rates, optimizers=optimizers)
# Fold into parameters mu, A and A inv
new_mu = self.mu + torch.mv(self.A, update_d)
new_A = self.A @ torch.matrix_exp(0.5 * update_M)
new_A_inv = torch.matrix_exp(0.5 * update_M) @ self.A_inv
# Return modified distribution
return self.modified_copy(mu=new_mu, sigma=new_A, sigma_inv=new_A_inv)
A: Tensor
property
readonly
¶Alias for self.sigma, for notational consistency with paper
A_inv: Tensor
property
readonly
¶Alias for self.sigma_inv, for notational consistency with paper
cov: Tensor
property
readonly
¶The covariance matrix A^T A
mu: Tensor
property
writable
¶Getter for mu
Returns:
Type  Description 

mu (torch.Tensor) 
The center of the search distribution 
sigma: Tensor
property
writable
¶Getter for sigma
Returns:
Type  Description 

sigma (torch.Tensor) 
The square root of the covariance matrix 
sigma_inv: Tensor
property
readonly
¶Getter for sigma_inv
Returns:
Type  Description 

sigma_inv (torch.Tensor) 
The inverse square root of the covariance matrix 
to_global_coordinates(self, local_coordinates)
¶Map samples from local coordinate space N(0, I_d) to global coordinate space N(mu, A^T A) This function is the inverse of to_local_coordinates
Parameters:
Name  Type  Description  Default 

local_coordinates 
torch.Tensor 
The local coordinates sampled from N(0, I_d) 
required 
Returns:
Type  Description 

global_coordinates (torch.Tensor) 
The global coordinates sampled from N(mu, A^T A) 
Source code in evotorch/algorithms/distributed/gaussian.py
def to_global_coordinates(self, local_coordinates: torch.Tensor) > torch.Tensor:
"""Map samples from local coordinate space N(0, I_d) to global coordinate space N(mu, A^T A)
This function is the inverse of to_local_coordinates
Args:
local_coordinates (torch.Tensor): The local coordinates sampled from N(0, I_d)
Returns:
global_coordinates (torch.Tensor): The global coordinates sampled from N(mu, A^T A)
"""
# Global samples are constructed as x = mu + A z where z is local coordinate
# We use transpose here to simplify the batched application of A
return self.mu.unsqueeze(0) + (self.A @ local_coordinates.T).T
to_local_coordinates(self, global_coordinates)
¶Map samples from global coordinate space N(mu, A^T A) to local coordinate space N(0, I_d) This function is the inverse of to_global_coordinates
Parameters:
Name  Type  Description  Default 

global_coordinates 
torch.Tensor 
The global coordinates sampled from N(mu, A^T A) 
required 
Returns:
Type  Description 

local_coordinates (torch.Tensor) 
The local coordinates sampled from N(0, I_d) 
Source code in evotorch/algorithms/distributed/gaussian.py
def to_local_coordinates(self, global_coordinates: torch.Tensor) > torch.Tensor:
"""Map samples from global coordinate space N(mu, A^T A) to local coordinate space N(0, I_d)
This function is the inverse of to_global_coordinates
Args:
global_coordinates (torch.Tensor): The global coordinates sampled from N(mu, A^T A)
Returns:
local_coordinates (torch.Tensor): The local coordinates sampled from N(0, I_d)
"""
# Global samples are constructed as x = mu + A z where z is local coordinate
# Therefore, we can recover z according to z = A_inv (x  mu)
return (self.A_inv @ (global_coordinates  self.mu.unsqueeze(0)).T).T
update_parameters(self, gradients, *, learning_rates=None, optimizers=None)
¶Do an update on the distribution by following the given gradients.
It is expected that the inheriting class has its own implementation for this method.
Parameters:
Name  Type  Description  Default 

gradients 
dict 
Gradients, as a dictionary, which will be used for computing the necessary updates. 
required 
learning_rates 
Optional[dict] 
A dictionary which contains learning rates for parameters that will be updated using a learning rate coefficient. 
None 
optimizers 
Optional[dict] 
A dictionary which contains optimizer objects for parameters that will be updated using an adaptive optimizer. 
None 
Returns:
Type  Description 

ExpGaussian 
The updated copy of the distribution. 
Source code in evotorch/algorithms/distributed/gaussian.py
def update_parameters(
self,
gradients: dict,
*,
learning_rates: Optional[dict] = None,
optimizers: Optional[dict] = None,
) > "ExpGaussian":
d_grad = gradients["d"]
M_grad = gradients["M"]
if "d" not in learning_rates:
learning_rates["d"] = learning_rates["mu"]
if "M" not in learning_rates:
learning_rates["M"] = learning_rates["sigma"]
# Follow gradients for d, and M
update_d = self._follow_gradient("d", d_grad, learning_rates=learning_rates, optimizers=optimizers)
update_M = self._follow_gradient("M", M_grad, learning_rates=learning_rates, optimizers=optimizers)
# Fold into parameters mu, A and A inv
new_mu = self.mu + torch.mv(self.A, update_d)
new_A = self.A @ torch.matrix_exp(0.5 * update_M)
new_A_inv = torch.matrix_exp(0.5 * update_M) @ self.A_inv
# Return modified distribution
return self.modified_copy(mu=new_mu, sigma=new_A, sigma_inv=new_A_inv)
__init__(self, problem, *, stdev_init=None, radius_init=None, popsize=None, center_learning_rate=None, stdev_learning_rate=None, scale_learning_rate=True, num_interactions=None, popsize_max=None, optimizer=None, optimizer_config=None, ranking_method='nes', center_init=None, obj_index=None, distributed=False, popsize_weighted_grad_avg=None)
special
¶
__init__(...)
: Initialize the XNES algorithm.
Parameters:
Name  Type  Description  Default 

problem 
Problem 
The problem object which is being worked on. 
required 
stdev_init 
Union[float, Iterable[float], torch.Tensor] 
The initial standard deviation of the search
distribution, expressed as a scalar or as an array.
Determines the initial coverage area of the search
distribution.
If one wishes to configure the coverage area via the
argument 
None 
radius_init 
Union[float, Iterable[float], torch.Tensor] 
The initial radius of the search distribution,
expressed as a scalar.
Determines the initial coverage area of the search
distribution.
Here, "radius" is defined as the norm of the search
distribution.
If one wishes to configure the coverage area via the
argument 
None 
popsize 
Optional[int] 
Population size. Can be specified as an int,
or can be left as None to let the solver decide.
In the case of SNES, 
None 
center_learning_rate 
Optional[float] 
Learning rate for updating the mean of the search distribution. Default value is 1.0 
None 
stdev_learning_rate 
Optional[float] 
Learning rate for updating the covariance
matrix of the search distribution.
The default value is 
None 
scale_learning_rate 
bool 
For SNES, there is a default standard
deviation learning rate value which is computed as

True 
num_interactions 
Optional[int] 
When given as an integer n, it is ensured that a population has interacted with the GymProblem's environment n times. If this target has not been reached yet, then the population is declared too small, and gets extended with more samples, until n amount of interactions is reached. When given as None, popsize is the only configuration affecting the size of a population. 
None 
popsize_max 
Optional[int] 
Having 
None 
num_interactions 
Optional[int] 
When given as an integer n, it is ensured that a population has interacted with the GymProblem's environment n times. If this target has not been reached yet, then the population is declared too small, and gets extended with more samples, until n amount of interactions is reached. When given as None, popsize is the only configuration affecting the size of a population. 
None 
optimizer 
The optimizer to be used while following the
estimated the gradients.
Can be given as None if a momentumbased optimizer
is not required.
Otherwise, can be given as a str containing the name
of the optimizer (e.g. 'adam', 'clipup');
or as an instance of evotorch.optimizers.TorchOptimizer
or evotorch.optimizers.ClipUp.
The default is None.
Note that, for ClipUp, the default maximum speed is set
as twice the given 
None 

optimizer_config 
Optional[dict] 
Configuration which will be passed
to the optimizer as keyword arguments.
See 
None 
ranking_method 
Optional[str] 
Which ranking method will be used for
fitness shaping. See the documentation of

'nes' 
center_init 
Union[float, Iterable[float], torch.Tensor] 
The initial center solution. Can be left as None. 
None 
obj_index 
Optional[int] 
Index of the objective according to which the gradient estimations will be done. For singleobjective problems, this can be left as None. 
None 
distributed 
bool 
Whether or not the gradient computation will
be distributed. If 
False 
popsize_weighted_grad_avg 
Optional[bool] 
Only to be used in distributed mode.
(where being in distributed mode means 
None 
Source code in evotorch/algorithms/distributed/gaussian.py
def __init__(
self,
problem: Problem,
*,
stdev_init: Optional[RealOrVector] = None,
radius_init: Optional[RealOrVector] = None,
popsize: Optional[int] = None,
center_learning_rate: Optional[float] = None,
stdev_learning_rate: Optional[float] = None,
scale_learning_rate: bool = True,
num_interactions: Optional[int] = None,
popsize_max: Optional[int] = None,
optimizer=None,
optimizer_config: Optional[dict] = None,
ranking_method: Optional[str] = "nes",
center_init: Optional[RealOrVector] = None,
obj_index: Optional[int] = None,
distributed: bool = False,
popsize_weighted_grad_avg: Optional[bool] = None,
):
"""
`__init__(...)`: Initialize the XNES algorithm.
Args:
problem: The problem object which is being worked on.
stdev_init: The initial standard deviation of the search
distribution, expressed as a scalar or as an array.
Determines the initial coverage area of the search
distribution.
If one wishes to configure the coverage area via the
argument `radius_init` instead, then `stdev_init` is expected
as None.
radius_init: The initial radius of the search distribution,
expressed as a scalar.
Determines the initial coverage area of the search
distribution.
Here, "radius" is defined as the norm of the search
distribution.
If one wishes to configure the coverage area via the
argument `stdev_init` instead, then `radius_init` is expected
as None.
popsize: Population size. Can be specified as an int,
or can be left as None to let the solver decide.
In the case of SNES, `popsize` can be left as None,
in which case the default `popsize` will be computed
as `4 + floor(3 * log(n))` where `n` is the length
of a solution.
center_learning_rate: Learning rate for updating the mean
of the search distribution. Default value is 1.0
stdev_learning_rate: Learning rate for updating the covariance
matrix of the search distribution.
The default value is `0.6 * (3 + log(n)) / (n * sqrt(n))`
where `n` is the length of a solution.
scale_learning_rate: For SNES, there is a default standard
deviation learning rate value which is computed as
`0.6 * (3 + log(n)) / (n * sqrt(n))` (where `n` is the solution
length).
If scale_learning_rate is True (which is the default),
then the effective learning rate for the standard deviation
becomes the provided `stdev_learning_rate` multiplied by this
default value. If `scale_learning_rate` is False, then the
effective standard deviation learning rate becomes
equal to the provided `stdev_learning_rate` value.
num_interactions: When given as an integer n,
it is ensured that a population has interacted with
the GymProblem's environment n times. If this target
has not been reached yet, then the population is declared
too small, and gets extended with more samples,
until n amount of interactions is reached.
When given as None, popsize is the only configuration
affecting the size of a population.
popsize_max: Having `num_interactions` set as an integer
might cause the effective population size jump to
unnecesarily large numbers. To prevent this,
one can set `popsize_max` to specify an upper
bound for the effective population size.
num_interactions: When given as an integer n,
it is ensured that a population has interacted with
the GymProblem's environment n times. If this target
has not been reached yet, then the population is declared
too small, and gets extended with more samples,
until n amount of interactions is reached.
When given as None, popsize is the only configuration
affecting the size of a population.
optimizer: The optimizer to be used while following the
estimated the gradients.
Can be given as None if a momentumbased optimizer
is not required.
Otherwise, can be given as a str containing the name
of the optimizer (e.g. 'adam', 'clipup');
or as an instance of evotorch.optimizers.TorchOptimizer
or evotorch.optimizers.ClipUp.
The default is None.
Note that, for ClipUp, the default maximum speed is set
as twice the given `center_learning_rate`.
This maximum speed can be configured by passing
`{"max_speed": ...}` to `optimizer_config`.
optimizer_config: Configuration which will be passed
to the optimizer as keyword arguments.
See `evotorch.optimizers` for details about
which optimizer accepts which keyword arguments.
ranking_method: Which ranking method will be used for
fitness shaping. See the documentation of
`evotorch.ranking.rank(...)` for details.
The default is 'nes'.
Can be given as None if no such ranking is required.
center_init: The initial center solution.
Can be left as None.
obj_index: Index of the objective according to which the
gradient estimations will be done.
For singleobjective problems, this can be left as None.
distributed: Whether or not the gradient computation will
be distributed. If `distributed` is given as False and
the problem is not parallelized, then everything will
be centralized (i.e. the entire computation will happen
in the main process).
If `distributed` is given as False, and the problem
is parallelized, then the population will be created
in the main process and then sent to remote workers
for parallelized evaluation, and then the remote fitnesses
will be collected by the main process again for computing
the search gradients.
If `distributed` is given as True, and the problem
is parallelized, then the search algorithm itself will
be distributed, in the sense that each remote actor will
generate its own population (such that the total population
size across all these actors becomes equal to `popsize`)
and will compute its own gradient, and then the main process
will collect these gradients, compute the averaged gradients
and update the main search distribution.
Nondistributed mode has the advantage of keeping the
population in the main process, which is good when one wishes
to do detailed monitoring during the evolutionary process,
but has the disadvantage of having to pass the solutions to
the remote actors and having to collect fitnesses, which
might result in increased interprocess communication traffic.
On the other hand, while it is not possible to monitor the
population in distributed mode, the distributed mode has the
advantage of significantly reducing the interprocess
communication traffic, since the only things communicated
with the remote actors are the search distributions (not the
solutions) and the gradients.
popsize_weighted_grad_avg: Only to be used in distributed mode.
(where being in distributed mode means `distributed` is given
as True). In distributed mode, each actor remotely samples
its own solution batches and computes its own gradients.
These gradients are then collected, and a final average
gradient is computed.
If `popsize_weighted_grad_avg` is True, then, while averaging
over the gradients, each gradient will have its own weight
that is computed according to how many solutions were sampled
by the actor that produced the gradient.
If `popsize_weighted_grad_avg` is False, then, there will not
be weighted averaging (or, each gradient will have equal
weight).
If `popsize_weighted_grad_avg` is None, then, the gradient
weights will be equal a value for `num_interactions` is given
(because `num_interactions` affects the number of solutions
according to the episode lengths, and popsizeweighting the
gradients could be misleading); and the gradient weights will
be weighted according to the subpopulation (i.e. subbatch)
sizes if `num_interactions` is left as None.
The default value for `popsize_weighted_grad_avg` is None.
When the distributed mode is disabled (i.e. when `distributed`
is False), then the argument `popsize_weighted_grad_avg` is
expected as None.
"""
if popsize is None:
popsize = int(4 + math.floor(3 * math.log(problem.solution_length)))
if center_learning_rate is None:
center_learning_rate = 1.0
def default_stdev_lr():
n = problem.solution_length
return 0.6 * (3 + math.log(n)) / (n * math.sqrt(n))
if stdev_learning_rate is None:
stdev_learning_rate = default_stdev_lr()
else:
stdev_learning_rate = float(stdev_learning_rate)
if scale_learning_rate:
stdev_learning_rate *= default_stdev_lr()
super().__init__(
problem,
popsize=popsize,
center_learning_rate=center_learning_rate,
stdev_learning_rate=stdev_learning_rate,
stdev_init=stdev_init,
radius_init=radius_init,
popsize_max=popsize_max,
num_interactions=num_interactions,
optimizer=optimizer,
optimizer_config=optimizer_config,
ranking_method=ranking_method,
center_init=center_init,
stdev_min=None,
stdev_max=None,
stdev_max_change=None,
obj_index=obj_index,
distributed=distributed,
popsize_weighted_grad_avg=popsize_weighted_grad_avg,
)
ga
¶
Genetic algorithm variants: GeneticAlgorithm, Cosyne.
Cosyne (SearchAlgorithm, SinglePopulationAlgorithmMixin)
¶
Implementation of the CoSyNE algorithm.
References:
F.Gomez, J.Schmidhuber, R.Miikkulainen, M.Mitchell (2008).
Accelerated Neural Evolution through Cooperatively Coevolved Synapses.
Journal of Machine Learning Research 9 (5).
Source code in evotorch/algorithms/ga.py
class Cosyne(SearchAlgorithm, SinglePopulationAlgorithmMixin):
"""
Implementation of the CoSyNE algorithm.
References:
F.Gomez, J.Schmidhuber, R.Miikkulainen, M.Mitchell (2008).
Accelerated Neural Evolution through Cooperatively Coevolved Synapses.
Journal of Machine Learning Research 9 (5).
"""
def __init__(
self,</