Benchmark cases for the graph framework.
Introduction
In this section we examine the performance of codes developed using this framework.
Strong Scaling
Left: Strong scaling wall time for \(10^{6}\) rays traced in a realistic tokamak equilibrium. Right: Strong scaling speedup normalized to the wall time for a single device or core. The dashed diagonal line indicates the best possible scaling. The M2 Max has 8 fast performance cores and 4 slower energy efficiency cores, resulting in a drop-off in improvement beyond 8 cores.
To benchmark code performance, we traced \(10^{6}\) rays for \(10^{3}\) timesteps using the cold plasma dispersion relation in a realistic tokamak equilibrium. The left panel of the figure above shows the strong scaling of wall time as the number of GPU and CPU devices is increased. The right panel shows the strong scaling speedup
\begin{equation}\mathrm{Speedup}\left(n\right) = \frac{\mathrm{time}\left(1\right)}{\mathrm{time}\left(n\right)}\end{equation}
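For example, with made-up numbers rather than measured values, a trace that takes 400 s on one device and 50 s on eight devices has a speedup of \(400/50=8\), which falls on the dashed ideal-scaling diagonal.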
Benchmarking was performed on two different setups.
The first setup was a Mac Studio with an Apple M2 Max chip. The M2 Max contains a 12-core CPU, where 8 cores are faster performance cores and the remaining 4 are slower efficiency cores. The M2 Max also contains a single 38-core GPU which supports only single-precision operations. The second setup is a server with 4 Nvidia A100 GPUs. Benchmarking measures the time to trace \(10^{6}\) rays but does not include the setup and JIT times.
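The measurement loop can be sketched as follows. This is a minimal illustration, not the exact benchmark driver: the compile(), run(), and wait() calls mirror the appendix listing, and the generic workflow type is an assumption.

#include <chrono>
#include <cstddef>

//  Minimal sketch of the timing methodology. The workflow interface
//  (compile()/run()/wait()) mirrors the appendix listing and is an
//  assumption, not the framework's exact API.
template<class WORKFLOW>
double benchmark(WORKFLOW &work, const size_t steps) {
    work.compile();  //  JIT compilation happens before timing starts.

    const auto start = std::chrono::high_resolution_clock::now();
    for (size_t j = 0; j < steps; j++) {
        work.run();  //  Queue one timestep on the device.
    }
    work.wait();     //  Block until the device queue drains.
    const auto end = std::chrono::high_resolution_clock::now();

    return std::chrono::duration<double> (end - start).count();
}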
The figure above shows the advantage even a single GPU has over CPU execution. In single precision, the M2 Max GPU is almost \(100\times\) faster than a single CPU core, while a single A100 has a nearly \(800\times\) advantage. Interestingly, the M2 Max CPU shows no speed advantage for single precision over double precision.
For large problem sizes the framework is expected to scale well with the number of devices, since the problems we are applying it to are embarrassingly parallel in nature. The right panel of the figure above confirms good strong scaling as the problem is split among more devices. Because the M2 chip contains 8 fast performance cores and 4 slower energy efficiency cores, there is a noticeable knee in the scaling once more than 8 cores are used. Overall, the framework demonstrates good scaling across CPU and GPU devices.
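Embarrassingly parallel here means the rays require no communication between devices; each device simply traces its own subset. A minimal sketch of the decomposition (the split helper is illustrative, but it uses the same batch-plus-remainder pattern as the appendix listing):

#include <cstddef>
#include <vector>

//  Divide size rays among n workers. Each worker receives batch rays
//  and the first extra workers receive one additional ray, so the
//  per-worker counts differ by at most one.
std::vector<size_t> split(const size_t size, const size_t n) {
    const size_t batch = size/n;
    const size_t extra = size%n;

    std::vector<size_t> local_sizes(n);
    for (size_t i = 0; i < n; i++) {
        local_sizes[i] = batch + (extra > i ? 1 : 0);
    }
    return local_sizes;
}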
Comparison to Other Frameworks
Particle throughput for the graph framework compared to MLX and JAX.
To benchmark against other frameworks, we consider the simple case of gyro motion in a uniform magnetic field \(\vec{B}=B_{0}\hat{z}\).
\begin{equation}\frac{d\vec{v}}{dt} = \vec{v}\times\vec{B}\end{equation}
\begin{equation}\frac{d\vec{x}}{dt} = \vec{v}\end{equation}
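The benchmarks advance these equations with an Euler step of size dt, with the charge-to-mass ratio absorbed into the normalization. A scalar reference version of the update, which all three framework listings in the appendix vectorize over particles, looks like the following (the particle struct and push helper here are illustrative, not part of any framework):

#include <cstddef>

//  One Euler step of gyro motion for a single particle in a uniform
//  field B = (0, 0, b0). Scalar reference for the vectorized framework
//  versions in the appendix.
struct particle {
    double x, y, z;
    double vx, vy, vz;
};

void push(particle &p, const double dt, const double b0) {
    //  v_next = v + dt*(v x B)
    const double vx_next = p.vx + dt*(p.vy*b0 - p.vz*0.0);
    const double vy_next = p.vy + dt*(p.vz*0.0 - p.vx*b0);
    const double vz_next = p.vz + dt*(p.vx*0.0 - p.vy*0.0);

    //  x_next = x + dt*v_next
    p.x += dt*vx_next;
    p.y += dt*vy_next;
    p.z += dt*vz_next;
    p.vx = vx_next;
    p.vy = vy_next;
    p.vz = vz_next;
}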
We compared the graph framework against the MLX framework, since it supports Apple GPUs, and against JAX due to its popularity. Source code for this benchmark case is available in the appendix. The figure above shows the throughput of pushing \(10^{8}\) particles for \(10^{3}\) time steps. The graph framework consistently shows the best throughput on both CPUs and GPUs. Note that the MLX CPU throughput could be improved by splitting the problem across multiple threads.
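Assuming throughput in the figure is reported as particle pushes per second, it follows from the problem size and wall time as
\begin{equation}\mathrm{Throughput} = \frac{N_{particles}\times N_{steps}}{\mathrm{time}}\end{equation}
so, as an illustration, \(10^{8}\) particles advanced for \(10^{3}\) steps in 100 s corresponds to \(10^{9}\) particle pushes per second.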
Source code for the throughput benchmark comparison
Graph Framework
const size_t size = 100000000;
const size_t steps = 1000;

//  Launch one worker thread per device, clamped to the range [1, size].
//  num_devices is assumed to come from the framework's max_concurrency()
//  query.
std::vector<std::thread>
threads(std::max(std::min(num_devices,
                          static_cast<unsigned int> (size)),
                 static_cast<unsigned int> (1)));

//  Split the particles evenly between threads. The remainder is spread
//  one extra particle at a time over the leading threads.
const size_t batch = size/threads.size();
const size_t extra = size%threads.size();

for (size_t i = 0, ie = threads.size(); i < ie; i++) {
    threads[i] =
    std::thread([batch, extra,
                 &time_steps] (const size_t thread_number) -> void {
        const size_t local_size = batch + (extra > thread_number ? 1 : 0);

        //  Build the expression graph for one timestep. The nodes v, b,
        //  pos, and dt are constructed with vector(), variable_cast(),
        //  zero(), and one(); their definitions are elided here.
        auto lorentz = v->cross(b);
        auto v_next = v + dt*lorentz;
        auto pos_next = pos + dt*v_next;

        //  Register the kernel with the workflow manager. The input,
        //  map, and output node lists are elided in this listing.
        work.add_item({
        }, {}, {
        }, NULL, "Lorentz_kernel", local_size);
        work.compile();

        //  The JIT compilation above is excluded from the measured time.
        time_steps.start_time(thread_number);
        for (size_t j = 0; j < steps; j++) {
            work.run();
        }
        work.wait();
        time_steps.end_time(thread_number);
    }, i);
}

for (std::thread &t : threads) {
    t.join();
}
time_steps.print_max();
MLX
#include <chrono>
#include <functional>
#include <vector>

#include <mlx/mlx.h>

typedef const std::vector<mlx::core::array> &inputs;
typedef std::vector<mlx::core::array> outputs;
typedef std::function<outputs(inputs)> function;

mlx::core::set_default_device(mlx::core::Device::gpu);

//  Compile one timestep. in = {x, y, z, vx, vy, vz} and the magnetic
//  field is B = (0, 0, 1).
function push = mlx::core::compile([](inputs in) -> outputs {
    const float dt = 0.000001;
    const mlx::core::array zero = mlx::core::zeros({1});
    const mlx::core::array one = mlx::core::ones({1});

    //  v_next = v + dt*(v x B)
    const mlx::core::array vx_next = in[3] + dt*(in[4]*one - in[5]*zero);
    const mlx::core::array vy_next = in[4] + dt*(in[5]*zero - in[3]*one);
    const mlx::core::array vz_next = in[5] + dt*(in[3]*zero - in[4]*zero);

    //  x_next = x + dt*v_next
    const mlx::core::array x_next = in[0] + dt*vx_next;
    const mlx::core::array y_next = in[1] + dt*vy_next;
    const mlx::core::array z_next = in[2] + dt*vz_next;

    return {x_next, y_next, z_next, vx_next, vy_next, vz_next};
});

const int size = 100000000;
const int steps = 1000;

mlx::core::array x = mlx::core::zeros({size});
mlx::core::array y = mlx::core::zeros({size});
mlx::core::array z = mlx::core::zeros({size});
mlx::core::array vx = mlx::core::ones({size});
mlx::core::array vy = mlx::core::zeros({size});
mlx::core::array vz = mlx::core::ones({size});

outputs in = {x, y, z, vx, vy, vz};

const std::chrono::high_resolution_clock::time_point start =
    std::chrono::high_resolution_clock::now();
for (size_t i = 0; i < steps; i++) {
    in = push(in);
    //  Force each step to execute inside the timed loop.
    for (mlx::core::array &a : in) {
        a.eval();
    }
}
const std::chrono::high_resolution_clock::time_point end =
    std::chrono::high_resolution_clock::now();
const auto total_time = end - start;
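MLX builds a lazy computation graph, so each output array is evaluated explicitly inside the timed loop; without the eval calls the loop would only queue a deferred graph and the measured time would exclude most of the actual computation.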
JAX
import time

import jax

#  One Euler timestep in a uniform field B = (0, 0, 1).
def push(x, y, z, vx, vy, vz):
    dt = 0.000001

    #  v_next = v + dt*(v x B)
    vx_next = vx + dt*(vy*1 - vz*0)
    vy_next = vy + dt*(vz*0 - vx*1)
    vz_next = vz + dt*(vx*0 - vy*0)

    #  x_next = x + dt*v_next
    return x + dt*vx_next, y + dt*vy_next, z + dt*vz_next, \
           vx_next, vy_next, vz_next

push_jit = jax.jit(push)

size = 100000000
steps = 1000

x = jax.numpy.zeros((size,))
y = jax.numpy.zeros((size,))
z = jax.numpy.zeros((size,))
vx = jax.numpy.ones((size,))
vy = jax.numpy.zeros((size,))
vz = jax.numpy.ones((size,))

start = time.time()
for i in range(0, steps):
    x, y, z, vx, vy, vz = push_jit(x, y, z, vx, vy, vz)
    jax.block_until_ready([x, y, z, vx, vy, vz])
end = time.time()
print(end - start)
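Similarly, JAX dispatches kernels asynchronously, so jax.block_until_ready is called each step to ensure the measured wall time covers the device work rather than just the kernel launches.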