Avial is a compiler infrastructure built using MLIR that enables efficient execution of programs across distributed and heterogeneous computing systems (CPU, GPU, cluster). Avial introduces a novel task-centric intermediate representation (IR) where tasks are first-class citizens, capturing their parallelism, device targets, and interdependencies.
Parallel programming is notoriously difficult. Developers must reason about concurrency, memory consistency, synchronization, and performance optimization, all of which are challenging even on a single machine. The situation is further complicated by the fragmented nature of parallel programming frameworks: multicore CPUs are commonly programmed using POSIX threads or OpenMP, distributed-memory systems use MPI, and accelerators like GPUs are typically programmed using CUDA, OpenCL, or OpenACC. Each of these paradigms comes with its own abstractions and programming idioms.
In parallel programming, you need to think about how and where your code runs, not just what it does.
Unifying these paradigms into a single coherent programming or compilation model is non-trivial due to fundamental differences in their memory models, synchronization semantics, and communication mechanisms. While there have been commendable efforts to unify heterogeneous computing within a node, such as OpenCL, OpenACC, and more recently Mojo, there is a noticeable gap when it comes to extending these unifications across distributed environments. The gap persists largely because of the complexity of distributed computing: issues such as explicit data movement between nodes and network topology cannot be abstracted away as easily.
While MLIR includes dialects like `omp` for shared-memory parallelism and `gpu` for targeting accelerators such as CUDA or ROCm, there is currently no dialect that provides a unified abstraction for distributed heterogeneous computing, that is, for clusters of nodes with diverse compute units like CPUs and GPUs.
The `mpi` dialect in MLIR offers low-level building blocks that reflect traditional MPI operations (e.g., `send`, `recv`, `bcast`). However, it requires the programmer to manage rank assignments, data partitioning, topology awareness, and task scheduling manually. This is error-prone and non-trivial, especially as system complexity scales.
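For a sense of what this looks like at the IR level, here is a hedged sketch of a point-to-point exchange; the op spellings and types below only approximate the upstream `mpi` dialect and may not match it verbatim:

```mlir
// Rough sketch of a point-to-point exchange with the low-level `mpi`
// dialect; op spellings and types are approximations, not verbatim IR.
func.func @exchange(%buf: memref<128xf32>) {
  mpi.init
  %rank = mpi.comm_rank : i32
  // Peer ranks, tags, partitioning, and send/recv ordering are all the
  // programmer's responsibility, and easy to get wrong at scale.
  %tag  = arith.constant 0 : i32
  %peer = arith.constant 1 : i32
  mpi.send(%buf, %tag, %peer) : memref<128xf32>, i32, i32
  mpi.recv(%buf, %tag, %peer) : memref<128xf32>, i32, i32
  mpi.finalize
  return
}
```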
Our dialect builds on top of the `mpi` dialect but raises the level of abstraction. Users express what computation needs to be performed and whether it should run on a CPU or GPU, without worrying about the underlying distributed communication or resource allocation.
This dialect bridges the gap between device-level and cluster-level parallelism, making it an MLIR dialect that can target distributed heterogeneous systems.
In high-performance computing environments, applications often contain a mix of compute regions: some better suited to multicore CPUs, others to GPUs. Traditionally, orchestrating these different parts involves complex code: different frameworks (e.g., OpenMP, CUDA, MPI), manual device management, and tedious boilerplate. The CodeDrop approach introduces a task-oriented, declarative model that streamlines this process:
Drop your computation. Declare the target. Let the dialect take care of the rest.
Here's how it works in practice:
- Wrap your compute region inside a `TaskOp` (see the sketch after this list). The region can contain operations from any dialect, whether affine loops, linalg ops, or custom dialects.
- Attach a `targetOp` to specify where the task should execute, e.g., `cpu` or `gpu`.
- Let the dialect handle the rest:
  - Automatically schedules tasks onto the right hardware
  - Inserts the necessary MPI coordination
  - Lowers each task to the appropriate backend (e.g., LLVM, CUDA, ROCm)
  - Handles device setup and data movement
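Concretely, a dropped task might look like the sketch below. This is a hypothetical rendering: the op names (`avial.task`, `avial.yield`) and the spelling of the target attribute are assumptions for illustration, not the dialect's verbatim syntax.

```mlir
// Hypothetical Avial IR: a SAXPY loop dropped into a task and pinned to
// the GPU. Op and attribute spellings are illustrative assumptions.
func.func @saxpy(%a: f32, %x: memref<1024xf32>, %y: memref<1024xf32>) {
  avial.task target = "gpu" {
    // The task body is ordinary IR from any dialect; here, affine loops.
    affine.for %i = 0 to 1024 {
      %xi = affine.load %x[%i] : memref<1024xf32>
      %yi = affine.load %y[%i] : memref<1024xf32>
      %p  = arith.mulf %a, %xi : f32
      %s  = arith.addf %p, %yi : f32
      affine.store %s, %y[%i] : memref<1024xf32>
    }
    avial.yield
  }
  return
}
```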
Thanks to the CodeDrop approach, integrating the Avial dialect into existing compiler pipelines is both trivial and non-intrusive. The process begins by identifying performance-critical regions such as loops, compute kernels, or math-heavy operations, regardless of which dialect they're written in. These regions are then wrapped in a `TaskOp`. That's it. From there, Avial takes full control, automatically lowering tasks to the appropriate execution backends, including MPI for distributed execution, and ultimately to LLVM IR.
This approach not only simplifies integration but also scales easily across heterogeneous and distributed environments. Whether running on a single multicore CPU or across a CPU-GPU cluster with MPI, Avial ensures consistent handling of task distribution and coordination.
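To make the heterogeneous case concrete, here is one more sketch under the same naming assumptions as above: two dependent tasks with different targets, where placement, the movement of `%C` between devices, and the MPI coordination in between are left to Avial.

```mlir
// Hypothetical Avial IR: a GPU matmul followed by CPU post-processing.
// Placement, data movement, and coordination are left to Avial.
func.func @pipeline(%A: memref<512x512xf32>, %B: memref<512x512xf32>,
                    %C: memref<512x512xf32>) {
  // Dense matmul: a good fit for the GPU.
  avial.task target = "gpu" {
    linalg.matmul ins(%A, %B : memref<512x512xf32>, memref<512x512xf32>)
                  outs(%C : memref<512x512xf32>)
    avial.yield
  }
  // Elementwise ReLU over the result: runs on the multicore CPU.
  avial.task target = "cpu" {
    %zero = arith.constant 0.0 : f32
    affine.for %i = 0 to 512 {
      affine.for %j = 0 to 512 {
        %c = affine.load %C[%i, %j] : memref<512x512xf32>
        %r = arith.maximumf %c, %zero : f32
        affine.store %r, %C[%i, %j] : memref<512x512xf32>
      }
    }
    avial.yield
  }
  return
}
```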