Sharding in Reactant requires you to specify how the data is sharded across devices on a mesh. We start by specifying the mesh with [`Sharding.Mesh`](@ref), which is a collection of the devices reshaped into an N-D grid. Additionally, we can specify names for each axis of the mesh, which are then referenced when specifying how the data is sharded. For example:
1. `Sharding.Mesh(reshape(Reactant.devices()[1:4], 2, 2), (:x, :y))`: Creates a 2D grid of 4 devices arranged in a 2x2 grid. The first axis is named `:x` and the second axis is named `:y`.
2. `Sharding.Mesh(reshape(Reactant.devices()[1:4], 4, 1), (:x, :y))`: Creates a 2D grid of 4 devices arranged in a 4x1 grid. The first axis is named `:x` and the second axis is named `:y`.
Given the mesh, we will specify how the data is sharded across the devices.
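
For instance, putting these pieces together, creating a mesh and sharding an array across it might look roughly like the following sketch. Note that `Sharding.NamedSharding` and the `sharding` keyword of `Reactant.to_rarray` are assumed here for illustration and may differ from the exact Sharding API.

```julia
using Reactant

# A 2x2 mesh over the first four devices, with axes named :x and :y
# (matching the first example above).
mesh = Sharding.Mesh(reshape(Reactant.devices()[1:4], 2, 2), (:x, :y))

# Assumed API: describe how each array dimension maps onto the mesh axes.
# Here the first array dimension is split across :x and the second across :y.
sharding = Sharding.NamedSharding(mesh, (:x, :y))

# Move a host array onto the devices with this sharding applied.
x = Reactant.to_rarray(rand(Float32, 8, 8); sharding)
```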
<!-- TODO describe how arrays are the "global data arrays, even though data is itself only stored on relevant device and computation is performed only devices with the required data (effectively showing under the hood how execution occurs) -->
To keep things simple, let's implement a 1-dimensional heat equation here.
Let's consider how this can be implemented with explicit MPI communication. Each node will contain a subset of the total data. For example, if we simulate with 100 points and have 4 devices, each device will contain 25 data points. We're going to allocate some extra room at each end of the buffer to store the "halo", i.e. the data at the boundary. At each time step we first copy in the data from our neighbors into the halo via explicit MPI send and recv calls, and then compute the updated values for our slice of the data.
With sharding, things are a bit simpler. We can write the code as if we only had one device. No explicit send or recv calls are necessary, as they will be added automatically by Reactant when it deduces they are needed. In fact, Reactant will attempt to optimize the placement of the communications to minimize total runtime. While Reactant tries to do a good job (which could be faster than an initial implementation, especially for complex codebases), an expert may still be able to find a better placement of the communication.

The only difference for the sharded code again occurs during allocation. Here we explicitly specify that we want to subdivide the initial grid of 100 points amongst all devices. Analogously, if we had 4 devices to work with, each device would hold 25 elements in its local storage. From the user's standpoint, however, all arrays give access to the entire dataset.
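
Concretely, the sharded allocation for the 100-point grid could be written along the lines of the sketch below (again, `Sharding.NamedSharding` and the `sharding` keyword are assumptions for illustration; the mesh layout follows the 4x1 example from earlier). The full implementations of both approaches are shown below.

```julia
# Assumed sketch: 4 devices arranged as a 4x1 mesh, with the first axis named :x.
mesh = Sharding.Mesh(reshape(Reactant.devices()[1:4], 4, 1), (:x, :y))

# Shard the single data dimension along :x, so each of the 4 devices stores
# 25 of the 100 points locally while the array still behaves like the full dataset.
data = Reactant.to_rarray(zeros(Float64, 100); sharding=Sharding.NamedSharding(mesh, (:x,)))
```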
::: code-group
```julia [MPI Based Parallelism]
end
simulate(data, 100)
```
You can query the available devices that Reactant can access using
[`Reactant.devices`](@ref):

```@example sharding_tutorial
Reactant.devices()
```
For [multi-node execution](@ref multihost), not all devices are accessible from each process.
To query the devices accessible from the current process, use
[`Reactant.addressable_devices`](@ref).

```@example sharding_tutorial
Reactant.addressable_devices()
```
You can inspect the type of the device, as well as its properties.
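
For example, you can grab one of the devices and inspect it with ordinary Julia introspection (a minimal sketch; the exact properties shown depend on the backend in use):

```julia
dev = first(Reactant.devices())

# Concrete type of the device object.
typeof(dev)

# Printing the device shows a summary of its properties.
println(dev)
```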

One nice feature of Reactant's handling of multiple devices is that you don't need to specify how the data is transferred between them. All devices from all nodes are available for use by Reactant. Given the topology of the devices, Reactant will automatically determine the right type of communication primitive to use to send data between the relevant nodes. For example, between GPUs on the same host Reactant may use the faster `cudaMemcpy`, whereas for GPUs on different nodes Reactant will use NCCL.

The fact that you don't need to specify how the communication occurs also means that code written with Reactant can be run on a different topology (e.g. moving from a single multi-GPU host to a multi-node cluster) without modification.

## Generating Distributed Data by Concatenating Local-Worker Data