Introduction

There is a balance kernel run time, compile time, and kernel source code size. To discuss these issues we can explore a case study for a simple field solve of a Particle In Cell (PIC) code. The field solve involves computing the contribution of a every particle to fields. Fields are typically discretized onto a grid which will have a smaller size compared to the number of particles.

Source Code

Take the example case.

template<jit::float_scalar T>
void field_solve_example() {
    timing::measure_diagnostic init("Init");
    const size_t batch = 1;
    const size_t num_particles = 1000000;
    auto particle_positions = graph::variable<T> (num_particles, "x");
 
    const size_t num_grid = 1000;
    auto grid_value = graph::variable<T> (num_grid, "g");
    auto particle_index = graph::variable<T> (num_grid, "i");
    auto grid_position = graph::variable<T> (num_grid, "gx");
 
    std::vector<T> buffer(1000, static_cast<T> (0.0));
    grid_value->set(buffer);
    particle_index->set(buffer);
 
    for (size_t index = 0; index < num_grid; index++) {
        buffer[index] = static_cast<T> (2*index)/static_cast<T> (999)
                      - static_cast<T> (1);
    }
    grid_position->set(buffer);
    
    auto indexed_particle = graph::index_1D(particle_positions,
                                            particle_index,
                                            static_cast<T> (1),
                                            static_cast<T> (0));
    auto next_index = particle_index + static_cast<T> (1.0);
    auto arg = indexed_particle - grid_position;
    auto next_grid_value = grid_value
                         + graph::exp(static_cast<T> (-1)*arg*arg/static_cast<T> (10));
 
//  Unroll kernel call loop by running multiple iterations in a single kernel
//  call.
    for (size_t i = 1; i < batch; i++) {
        indexed_particle = graph::index_1D(particle_positions,
                                           next_index,
                                           static_cast<T> (1),
                                           static_cast<T> (0));
        next_index = next_index + static_cast<T> (1.0);
        arg = indexed_particle - grid_position;
        next_grid_value = next_grid_value
                        + graph::exp(static_cast<T> (-1)*arg*arg/static_cast<T> (10));
    }
 
    auto state = graph::random_state<T> (jit::context<T>::random_state_size, 0);
    auto random = graph::random<T> (graph::random_state_cast(state));
    const T max = 1.0;
    const T min = -1.0;
    auto random_real = (max - min)/graph::random_scale<T> ()*random + min;
 
    init.print();
 
    timing::measure_diagnostic compile("compile");
 
    workflow::manager<T> work(0);
    work.add_preitem({
        graph::variable_cast(particle_positions)
    }, {}, {
        {random_real, variable_cast(particle_positions)}
    }, graph::random_state_cast(state), "Initialize_x", num_particles);
 
    work.add_item({
        graph::variable_cast(particle_positions),
        graph::variable_cast(indexed_particle),
        graph::variable_cast(grid_value),
        graph::variable_cast(grid_position)
    }, {}, {
        {next_index, graph::variable_cast(indexed_particle)},
        {next_grid_value, graph::variable_cast(grid_value)}
    }, NULL, "Set_Grid", num_grid);
 
    work.compile();
 
    compile.print();
 
    work.pre_run();
    timing::measure_diagnostic field_solve("field_solve");
    for (size_t index = 0, ie = num_particles/batch; index < ie; index++) {
        work.run();
    }
    work.wait();
    field_solve.print();
}

In this example we build two kernels. The Initialize_x kernel set random values between \(1\) and \(-1\) for particle positions to represent a particle state after a particle push step. This kernel is applied over each of the \(1\times10^{6}\) particles.

The Set_Grid kernel mimics a simple field. For a given particle position \(x\), the contribution to grid point is defined as

\begin{equation}g_{next}=g+e^{-\frac{\left(x\left[i\right]-gx\right)^{2}}{10}}\end{equation}

where \(i\) is the particle index, \(g\) is the current grid value, \(gx\) is the grid position, and \(g_{next}\) is the updated value. After each kernel call, the particle index is incremented. This kernel is applied over each of the \(1\times10^{3}\) gird points.

Kernel Performance

Wall time scaling comparing compile time and run time for the number of loop unroll iterations.

In this section we focus on the Set_Grid kernel and the batch. When batch = 1, the kernel only adds the kernel contribution for a single particle. To compute the full field, we would need to call this kernel \(1\times10^{6}\). Setting batch > 1 allows us to unroll the kernel call loop. This reduces the number of times we need to call the kernel but comes at the expense of a larger kernel and an associated compile penalty. If we were to set batch = 1000000 then we would fully unroll the loop resulting in only needing to call the kernel once. However it would come at the expense of the generated kernel source having over \(1\times10^{6}\) lines of code. In testing, loop unrolling beyond \(1\times10^{5}\) resulted in a segmentation fault.

The figure above, shows the scaling of the setup time for the graphs, the compile time for the kernel and the resulting run time. A benchmark was performed on a Apple M1 Pro for both float and double precision. Note that the GPU on the M1 Pro only support single precision. CPU benchmark was performed on a single core.

Unrolling the loop up to \(500\) iterations shows a significant reduction in the GPU kernel run time with a negligible affect on the initialization and kernel compile time. Beyond \(1000\) iterations, the improvement to kernel runtime becomes negligible. However, any gaines we made in runtime improvement is lost due to penalty to kernel compiling. For CPU execution there is little to no noticeable performance improvement when unrolling kernel call loops.

Table of Contents

Introduction

Source Code

Kernel Performance