Graph Framework
Kernel Optimization

Notes on optimizing kernel performance.

Introduction

There is a balance between kernel run time, compile time, and kernel source code size. To discuss these issues, we explore a case study of a simple field solve in a Particle In Cell (PIC) code. The field solve computes the contribution of every particle to the fields. Fields are typically discretized onto a grid, which is much smaller than the number of particles.

Source Code

Consider the following example case.

// Note: the random state `state`, the random node `random` (from
// graph::random(state)), the timing objects `init` and `compile`, and the
// workflow manager `work` are assumed to be defined in surrounding code
// elided from this example.
template<jit::float_scalar T>
void field_solve_example() {
    const size_t batch = 1;
    const size_t num_particles = 1000000;
    auto particle_positions = graph::variable<T> (num_particles, "x");

    const size_t num_grid = 1000;
    auto grid_value = graph::variable<T> (num_grid, "g");
    auto particle_index = graph::variable<T> (num_grid, "i");
    auto grid_position = graph::variable<T> (num_grid, "gx");

    std::vector<T> buffer(num_grid, static_cast<T> (0.0));
    grid_value->set(buffer);
    particle_index->set(buffer);
//  Grid positions span [-1, 1].
    for (size_t index = 0; index < num_grid; index++) {
        buffer[index] = static_cast<T> (2*index)/static_cast<T> (num_grid - 1)
                      - static_cast<T> (1);
    }
    grid_position->set(buffer);

    auto indexed_particle = graph::index_1D(particle_positions,
                                            particle_index,
                                            static_cast<T> (1),
                                            static_cast<T> (0));
    auto next_index = particle_index + static_cast<T> (1.0);
    auto arg = indexed_particle - grid_position;
    auto next_grid_value = grid_value
                         + graph::exp(static_cast<T> (-1)*arg*arg/static_cast<T> (10));

//  Unroll the kernel call loop by running multiple iterations in a single
//  kernel call. Each unrolled iteration indexes the next particle.
    for (size_t i = 1; i < batch; i++) {
        indexed_particle = graph::index_1D(particle_positions,
                                           next_index,
                                           static_cast<T> (1),
                                           static_cast<T> (0));
        next_index = next_index + static_cast<T> (1.0);
        arg = indexed_particle - grid_position;
        next_grid_value = next_grid_value
                        + graph::exp(static_cast<T> (-1)*arg*arg/static_cast<T> (10));
    }

//  Scale the random values to the range [min, max].
    const T max = 1.0;
    const T min = -1.0;
    auto random_real = (max - min)/graph::random_scale<T> ()*random + min;

    init.print();

    work.add_preitem({
        graph::variable_cast(particle_positions)
    }, {}, {
        {random_real, graph::variable_cast(particle_positions)}
    }, graph::random_state_cast(state), "Initialize_x", num_particles);
    work.add_item({
        graph::variable_cast(particle_positions),
        graph::variable_cast(particle_index),
        graph::variable_cast(grid_value),
        graph::variable_cast(grid_position)
    }, {}, {
        {next_index, graph::variable_cast(particle_index)},
        {next_grid_value, graph::variable_cast(grid_value)}
    }, NULL, "Set_Grid", num_grid);

    work.compile();
    compile.print();
    work.pre_run();

    timing::measure_diagnostic field_solve("field_solve");
    for (size_t index = 0, ie = num_particles/batch; index < ie; index++) {
        work.run();
    }
    work.wait();
    field_solve.print();
}

In this example we build two kernels. The Initialize_x kernel sets random values between \(-1\) and \(1\) for the particle positions, representing a particle state after a particle push step. This kernel is applied over each of the \(1\times10^{6}\) particles.
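The scaling used by Initialize_x maps a uniform sample in \([0,1)\) onto \([min,max)\), mirroring the random_real expression in the source above. As a minimal plain C++ sketch (a hypothetical CPU stand-in, not the framework's generated kernel):

```cpp
#include <random>
#include <vector>

// CPU stand-in for the Initialize_x kernel: map uniform samples in [0, 1)
// to particle positions in [min, max) via x = (max - min)*u + min.
std::vector<double> initialize_positions(size_t num_particles,
                                         double min, double max,
                                         unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    std::vector<double> x(num_particles);
    for (auto &xi : x) {
        xi = (max - min)*uniform(gen) + min;
    }
    return x;
}
```

In the framework itself this scaling runs on the device using graph::random and graph::random_scale rather than a host-side generator.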

The Set_Grid kernel mimics a simple field solve. For a given particle position \(x\), the contribution to a grid point is defined as

\begin{equation}g_{next}=g+e^{-\frac{\left(x\left[i\right]-gx\right)^{2}}{10}}\end{equation}

where \(i\) is the particle index, \(g\) is the current grid value, \(gx\) is the grid position, and \(g_{next}\) is the updated value. After each kernel call, the particle index is incremented. This kernel is applied over each of the \(1\times10^{3}\) grid points.
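A single Set_Grid call can be sketched in plain C++ as follows; this is a hypothetical host-side equivalent of the generated kernel, looping over grid points for one particle index:

```cpp
#include <cmath>
#include <vector>

// One Set_Grid step: add the contribution of particle x[i] to every grid
// point j, g_next[j] = g[j] + exp(-(x[i] - gx[j])^2/10).
void set_grid_step(const std::vector<double> &x, size_t i,
                   const std::vector<double> &gx,
                   std::vector<double> &g) {
    for (size_t j = 0; j < g.size(); j++) {
        const double arg = x[i] - gx[j];
        g[j] += std::exp(-arg*arg/10.0);
    }
}
```

A particle sitting exactly on a grid point contributes \(e^{0}=1\) there, and less to more distant points.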

Kernel Performance

Wall time scaling comparing compile time and run time for the number of loop unroll iterations.

In this section we focus on the Set_Grid kernel and the batch size. When batch = 1, the kernel adds the contribution of only a single particle. To compute the full field, we would need to call this kernel \(1\times10^{6}\) times. Setting batch > 1 allows us to unroll the kernel call loop. This reduces the number of times we need to call the kernel, but comes at the expense of a larger kernel and an associated compile-time penalty. If we were to set batch = 1000000, we would fully unroll the loop, so the kernel would only need to be called once. However, the generated kernel source would then contain over \(1\times10^{6}\) lines of code. In testing, loop unrolling beyond \(1\times10^{5}\) iterations resulted in a segmentation fault.
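The trade-off can be illustrated with a plain C++ sketch (hypothetical names, not framework code): the outer loop plays the role of the kernel call loop and runs num_particles/batch times, while the inner batch loop plays the role of the unrolled kernel body. The computed field is the same for any batch size that divides the particle count; only the number of "calls" changes.

```cpp
#include <cmath>
#include <vector>

// Batched field solve sketch: each outer iteration models one kernel call
// processing `batch` consecutive particles (batch must divide x.size()).
void field_solve(const std::vector<double> &x,
                 const std::vector<double> &gx,
                 std::vector<double> &g,
                 size_t batch) {
    for (size_t i = 0; i < x.size(); i += batch) {   // one "kernel call"
        for (size_t b = 0; b < batch; b++) {         // unrolled kernel body
            for (size_t j = 0; j < g.size(); j++) {
                const double arg = x[i + b] - gx[j];
                g[j] += std::exp(-arg*arg/10.0);
            }
        }
    }
}
```

On a GPU, the benefit of unrolling comes from amortizing per-call launch overhead, which this host-side sketch cannot capture; it only shows that the result is batch-independent.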

The figure above shows the scaling of the graph setup time, the kernel compile time, and the resulting run time. A benchmark was performed on an Apple M1 Pro for both float and double precision. Note that the GPU on the M1 Pro only supports single precision. The CPU benchmark was performed on a single core.

Unrolling the loop up to \(500\) iterations shows a significant reduction in GPU kernel run time with a negligible effect on the initialization and kernel compile times. Beyond \(1000\) iterations, the improvement in kernel run time becomes negligible, and any gains in run time are lost to the kernel compile penalty. For CPU execution there is little to no noticeable performance improvement from unrolling kernel call loops.