Problem Parallelization¶
Not every fitness function can be straightforwardly vectorised or placed on a CUDA-capable device. To provide a significant speedup in this scenario, EvoTorch supports parallelisation using Ray out of the box. This page guides you through the various use cases of Ray.
Using Multiple Actors¶
To get started using Ray, you simply need to change the num_actors argument.
import torch
from evotorch import Problem


def sphere(x: torch.Tensor) -> torch.Tensor:
    return torch.sum(x.pow(2.0))


problem = Problem(
    objective_sense="min",
    objective_func=sphere,
    solution_length=10,
    initial_bounds=(-1, 1),
    num_actors=4,
)
By setting num_actors > 1, the Problem instance will automatically spawn Ray actors the first time an evaluation is requested. Each actor is sent a sub-batch of the population, evaluates that sub-batch, and returns the associated fitness values.
Note that if the Problem instance is vectorized, either through the vectorized = True boolean flag or through a custom definition of the _evaluate_batch function, each Ray actor will continue to evaluate solutions in a vectorized manner, working on sub-batches of the SolutionBatch passed to problem.evaluate, which is automatically split by the main Problem instance.
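The splitting described above can be pictured with a small, library-free sketch. It is only an illustration: the helper names split_into_sub_batches and evaluate_in_parallel are hypothetical, a thread pool stands in for Ray actors, and the near-equal contiguous chunking is an assumption rather than EvoTorch's actual splitting strategy.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def split_into_sub_batches(population: Sequence, num_workers: int) -> List[list]:
    """Split `population` into `num_workers` contiguous chunks of near-equal size."""
    base, extra = divmod(len(population), num_workers)
    chunks, start = [], 0
    for i in range(num_workers):
        # The first `extra` chunks take one additional item each.
        size = base + (1 if i < extra else 0)
        chunks.append(list(population[start:start + size]))
        start += size
    return chunks


def sphere(x: Sequence[float]) -> float:
    """Plain-Python sphere function: sum of squares."""
    return sum(v * v for v in x)


def evaluate_in_parallel(
    population: Sequence, fitness: Callable, num_workers: int
) -> List[float]:
    """Evaluate each sub-batch on its own worker and concatenate the results."""
    sub_batches = split_into_sub_batches(population, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Stand-in for Ray actors: map() keeps the sub-batch order intact.
        results = pool.map(lambda batch: [fitness(x) for x in batch], sub_batches)
        return [f for sub in results for f in sub]


population = [[float(i), 1.0] for i in range(10)]
fitnesses = evaluate_in_parallel(population, sphere, num_workers=4)
```

The fitness values come back in the same order as the original population, which is what allows the main process to assign them to the right solutions.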
By using Ray for parallelisation by default, EvoTorch therefore supports deployment of Problem instances to large clusters, across multiple machines and CPUs. However, by default, EvoTorch will only be able to exploit the resources available on the single machine. For further guidance on setting up a Ray node to use a cluster, visit the library's official documentation and refer to our own short tutorial on the topic for tips on getting started.
Using Ray with GPUs¶
In our guide on Defining Problems, we demonstrated the use of CUDA-capable devices using the device argument. However, as Ray communicates between actors on the CPU, it is recommended that when you have num_actors > 1, you use the default value device = 'cpu' so that the main Problem instance remains on the CPU and, as a result, SolutionBatch instances created by SearchAlgorithm instances attached to the problem will also be on the CPU.
This does not, however, mean that you cannot still use CUDA-capable devices for evaluation within the individual Ray actors.
When working with a custom Problem subclass, a GPU device can be obtained (as a torch.device instance) from within the methods _evaluate(self, ...) or _evaluate_batch(self, ...) via the property self.aux_device (where aux_device stands for "auxiliary device"). If the custom problem is not parallelised through Ray, the aux_device property will return the first visible CUDA-capable device within the main execution environment. If the problem is parallelised through Ray, the aux_device property will return the first visible CUDA-capable device within the environment of the actor (therefore, for each actor, it will return the device assigned to that actor). For example, consider this simple custom problem:
import torch
from evotorch import Problem, SolutionBatch


def vectorised_sphere(x: torch.Tensor) -> torch.Tensor:
    return torch.sum(x.pow(2.0), dim=-1)


class VecSphere(Problem):
    def _evaluate_batch(self, solutions: SolutionBatch):
        # Evaluate on the auxiliary device, then store results on the batch's own device
        xs = solutions.values.to(self.aux_device)
        fs = vectorised_sphere(xs)
        solutions.set_evals(fs.to(solutions.device))
When a SolutionBatch is passed to a VecSphere instance's _evaluate_batch method, its values property is moved to the local aux_device for evaluation. Once the decision values have been evaluated (using the vectorised_sphere function in this case), their fitnesses are moved back to the original device of solutions and assigned as the fitness evaluations.
To use the GPU within Ray actors, simply use the num_gpus_per_actor keyword argument to specify how many GPUs will be allocated to each actor. For example, if you have 2 GPUs, you can do:
problem = VecSphere(
    objective_sense="min",
    solution_length=10,
    initial_bounds=(-1, 1),
    num_actors=4,
    num_gpus_per_actor=0.5,
)
so that the 4 Ray actors will each be assigned half of a GPU. Then, when a SolutionBatch to be evaluated is passed to problem, EvoTorch will split it into sub-batches and pass each sub-batch to a Ray actor, which will then evaluate it. In the case of VecSphere, this means that each sub-batch will be moved to its Ray actor's local auxiliary CUDA-capable device, evaluated, and then returned to the main problem instance.
Common Use-cases¶
We have provided a number of special values for the arguments num_actors and num_gpus_per_actor to easily support a number of common use-cases:
"I do not want Ray-based parallelization"¶
To avoid creation of Ray actors, you can set num_actors as None (or as 0, or as 1, which are equivalent).
Info
For regular optimization problems expressed by the class Problem, this is the default, because the Problem class does not make assumptions about whether or not parallelization is needed for the problem at hand. Of course, you can obtain parallelization for such a problem by manually setting num_actors to an integer, or to one of the special values listed below.
"I want to use all available CPUs"¶
To use all available CPUs, use the num_actors = 'max' argument. This will automatically detect the number of available CPUs on your machine/Ray cluster and set num_actors to that value.
Info
It is very common practice to parallelize reinforcement learning tasks across multiple CPUs to shorten the execution time required for running the necessary episodes. Therefore, the option num_actors = 'max' is the default for GymNE (the problem class which expresses tasks of solving classical CPU-bound gym environments).
"I want to assign the maximum available GPUs, split across my actors"¶
To use all available GPUs, use the num_gpus_per_actor = 'max' argument. This will automatically detect the number of GPUs on your machine/Ray cluster and set num_gpus_per_actor to the total number of GPUs divided by num_actors.
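The arithmetic behind this option is a plain division. A minimal sketch, assuming a hypothetical helper gpus_per_actor (not part of EvoTorch) and that the GPU and actor counts are already known:

```python
def gpus_per_actor(total_gpus: int, num_actors: int) -> float:
    """Share `total_gpus` evenly among `num_actors` actors (hypothetical helper)."""
    if num_actors < 1:
        raise ValueError("num_actors must be at least 1")
    return total_gpus / num_actors


# 2 GPUs shared by 4 actors: each actor is allotted half a GPU.
share = gpus_per_actor(total_gpus=2, num_actors=4)
```

With 2 GPUs and 4 actors this gives the same fraction as the explicit num_gpus_per_actor=0.5 setting shown earlier on this page.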
"I want to use all available CPUs and GPUs"¶
To use all available compute on your machine, set both num_actors = 'max' and num_gpus_per_actor = 'max'.
"I want to use as many CPUs as I have GPUs"¶
To create an actor per GPU, use the num_actors = 'num_gpus' argument. This will automatically detect the number of GPUs on your machine/Ray cluster, set num_actors to that value, and set num_gpus_per_actor = 1.
"I want one-to-one mapping between CPUs and GPUs"¶
To create multiple actors and to configure each actor to allocate one of the GPUs entirely for itself, use the num_actors = 'num_devices' argument. This is similar to num_actors = 'num_gpus', but it is not the same, because the setting num_actors = 'num_devices' takes into account both the number of CPUs and the number of GPUs. In more detail, with num_actors = 'num_devices', the following steps are taken automatically:
- The number of CPUs is counted.
- The number of GPUs is counted.
- The minimum of the number of CPUs and the number of GPUs is computed. Let us call this value n.
- n actors are created.
- Each actor is assigned a GPU.
This way, it is ensured that each actor gets an entire GPU to itself, while also ensuring that the number of actors does not exceed the number of available CPUs.
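The steps listed above can be sketched in a few lines. The function name plan_num_devices is hypothetical, the CPU and GPU counts are passed in rather than detected, and edge cases (such as having no GPU at all, or n turning out to be 1) are deliberately left out:

```python
from typing import Tuple


def plan_num_devices(num_cpus: int, num_gpus: int) -> Tuple[int, int]:
    """Return (number of actors, GPUs per actor) following the steps above."""
    n = min(num_cpus, num_gpus)  # the value called n in the text
    return n, 1  # n actors, each with one whole GPU


# 8 CPUs and 2 GPUs: two actors, one GPU each.
plan = plan_num_devices(num_cpus=8, num_gpus=2)
```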
Info
When dealing with neuro-evolution, it is usually desired to split the workload of evaluating the neural network parameters across multiple GPUs. For this reason, num_actors = 'num_devices' is the default option for NEProblem (the generic neuro-evolution problem type) and for SupervisedNE (the supervised learning problem type).
Tip
For some neuro-evolution problems that are small enough, it can be better for performance to run everything on a single GPU (both the evaluations of the networks and the evolutionary algorithm itself). In such cases, just setting device = 'cuda' (with a device index if desired, e.g., device = 'cuda:0') should be enough. The setting num_actors = 'num_devices' refrains from creating Ray actors if device is not left as None and is instead set to a GPU.
"I want to freely use GPUs within my Ray actors"¶
Ray automatically sets/modifies the CUDA_VISIBLE_DEVICES environment variable so that each Ray actor can only see its allocated GPUs. You can override this by setting num_gpus_per_actor = 'all', in which case each actor will be able to see every available CUDA-capable device.
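To check what a given process is allowed to see, the environment variable can be inspected directly. The sketch below uses a hypothetical helper, visible_gpu_ids, which only parses the comma-separated variable; it does not query CUDA or Ray, and the example values are made up for illustration.

```python
import os
from typing import List, Mapping, Optional


def visible_gpu_ids(env: Optional[Mapping[str, str]] = None) -> Optional[List[str]]:
    """Parse CUDA_VISIBLE_DEVICES; None means the variable is unset (no restriction)."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    # The variable holds a comma-separated list of device identifiers.
    return [part.strip() for part in raw.split(",") if part.strip()]


# An actor restricted to a single GPU versus one that sees four GPUs.
restricted = visible_gpu_ids({"CUDA_VISIBLE_DEVICES": "1"})
everything = visible_gpu_ids({"CUDA_VISIBLE_DEVICES": "0,1,2,3"})
```

Calling visible_gpu_ids() with no argument reads the environment of the current process, which is how it could be used from inside an evaluation method.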
Info
With num_gpus_per_actor set as 'all', since all the GPUs are visible to all actors, the aux_device property can only guess which auxiliary device is being targeted. The simple guess made by aux_device in this case is 'cuda:I', where I is the index of the actor. This guess might be wrong if the number of actors exceeds the number of GPUs. Therefore, with num_gpus_per_actor set as 'all', it is recommended that users do not rely heavily on aux_device, and instead introduce their own case-specific rules for sharing/using the GPUs. Alternatively, users who wish to rely on the aux_device property might want to consider another option from this list of common use-cases.
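The following sketch contrasts the simple guess described above with one possible case-specific rule. Both function names are hypothetical, actor indices are assumed to start at 0, and the round-robin rule is just one example of a sharing policy, not something EvoTorch provides:

```python
def guessed_device(actor_index: int) -> str:
    """The simple guess described above: 'cuda:I' for actor index I."""
    return f"cuda:{actor_index}"


def round_robin_device(actor_index: int, num_gpus: int) -> str:
    """Hypothetical case-specific rule: wrap actor indices around the GPU count."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return f"cuda:{actor_index % num_gpus}"


num_gpus = 2
# With 4 actors and 2 GPUs, the simple guess names devices that do not exist.
guesses = [guessed_device(i) for i in range(4)]
# The round-robin rule keeps every actor on an existing device.
shared = [round_robin_device(i, num_gpus) for i in range(4)]
```

Here the guesses for actors 2 and 3 point at 'cuda:2' and 'cuda:3', which a 2-GPU machine does not have, while the round-robin rule reuses 'cuda:0' and 'cuda:1'.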
A general remark is that, for all these special values for num_actors which count the number of CPUs (and GPUs), if it turns out that the number of actors to be created is only 1 (most probably because there is only one CPU provided by the Ray cluster), then no actor will be created (and therefore there won't be any GPU assignment to any actor). This is because having only 1 actor would not bring any parallelization benefit, while still bringing the performance overhead of interprocess communication.
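This rule, together with the equivalence of None, 0, and 1 noted earlier on this page, can be summarised in a short sketch. The helper effective_num_actors is hypothetical and only mirrors the behaviour described in the text:

```python
from typing import Optional


def effective_num_actors(requested: Optional[int]) -> int:
    """Return 0 (no Ray actors) when the count is None, 0, or 1; else the count."""
    if requested is None or requested <= 1:
        return 0
    return requested


# A resolved count of 1 (e.g. a single-CPU cluster) yields no actors at all.
single_cpu = effective_num_actors(1)
four_cpus = effective_num_actors(4)
```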