Fix sharding error (#37)

* Use cosmo as arg for the ODE function

* Update examples

* format

* notebook update

* fix tests

* add correct annotations for weights in painting and warning for cic_paint in distributed pm

* update test_against_fpm

* update distributed tests and add jacfwd jacrev and vmap tests

* format

* add Caveats to notebook readme

* final touches

* update Growth.py to allow using FastPM solver

* fix 2D painting when input is (X, Y, 2) shape

* update cic read halo size and notebooks examples

* Allow env variable control of caching in growth

* Format

* update test jax version

* update notebooks/03-MultiGPU_PM_Halo.ipynb

* update numpy install in wf

* update tolerance :)

* reorganize install in test workflow

* update tests

* add mpi4py

* update tests.yml

* update tests

* update wf

* format

* make normal_field signature consistent with jax.random.normal

* update by default normal_field dtype to match JAX

* format

* debug test workflow

* format

* debug test workflow

* updating tests

* fix accuracy

* fixed tolerance

* adding caching

* Update conftest.py

* Update tolerance and precision settings in distributed PM tests

* reverting changes to growth.py

---------

Co-authored-by: Francois Lanusse <fr.eiffel@gmail.com>
Co-authored-by: Francois Lanusse <EiffL@users.noreply.github.com>
Wassim KABALAN, 2025-06-28 23:07:31 +02:00, committed by GitHub
commit 6693e5c725 (parent cb2a7ab17f)
17 changed files with 675 additions and 298 deletions

File diff suppressed because one or more lines are too long (three files)

@@ -62,7 +62,7 @@
 "\n",
 "This cell configures a **2x4 device mesh** across 8 devices and sets up named sharding to distribute data efficiently.\n",
 "\n",
-"- **Device Mesh**: `pdims = (2, 4)` arranges devices in a 2x4 grid. `create_device_mesh(pdims)` initializes this layout across available GPUs.\n",
+"- **Device Mesh**: `pdims = (2, 4)` arranges devices in a 2x4 grid.\n",
 "- **Sharding with Mesh**: `Mesh(devices, axis_names=('x', 'y'))` assigns the mesh grid axes, which allows flexible mapping of array data across devices.\n",
 "- **PartitionSpec and NamedSharding**: `PartitionSpec` defines data partitioning across mesh axes `('x', 'y')`, and `NamedSharding(mesh, P('x', 'y'))` specifies this sharding scheme for arrays in the simulation.\n",
 "\n",
@@ -71,7 +71,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 3,
+"execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -80,11 +80,10 @@
 "from jax.sharding import Mesh, NamedSharding\n",
 "from jax.sharding import PartitionSpec as P\n",
 "\n",
-"all_gather = partial(process_allgather, tiled=False)\n",
+"all_gather = partial(process_allgather, tiled=True)\n",
 "\n",
 "pdims = (2, 4)\n",
-"devices = create_device_mesh(pdims)\n",
-"mesh = Mesh(devices, axis_names=('x', 'y'))\n",
+"mesh = jax.make_mesh(pdims, axis_names=('x', 'y'))\n",
 "sharding = NamedSharding(mesh, P('x', 'y'))"
 ]
 },
@@ -124,7 +123,7 @@
 "\n",
 "    # Evolve the simulation forward\n",
 "    ode_fn = ODETerm(\n",
-"        make_diffrax_ode(cosmo, mesh_shape, paint_absolute_pos=False))\n",
+"        make_diffrax_ode(mesh_shape, paint_absolute_pos=False, sharding=sharding, halo_size=halo_size))\n",
 "    solver = LeapfrogMidpoint()\n",
 "\n",
 "    stepsize_controller = ConstantStepSize()\n",
@@ -288,7 +287,7 @@
 "\n",
 "    # Evolve the simulation forward\n",
 "    ode_fn = ODETerm(\n",
-"        make_diffrax_ode(cosmo, mesh_shape, paint_absolute_pos=False))\n",
+"        make_diffrax_ode(mesh_shape, paint_absolute_pos=False, sharding=sharding, halo_size=halo_size))\n",
 "    solver = Dopri5()\n",
 "\n",
 "    stepsize_controller = PIDController(rtol=1e-5,atol=1e-5)\n",


@@ -17,9 +17,8 @@ import jax_cosmo as jc
 import numpy as np
 from diffrax import (ConstantStepSize, Dopri5, LeapfrogMidpoint, ODETerm,
                      PIDController, SaveAt, diffeqsolve)
-from jax.experimental.mesh_utils import create_device_mesh
 from jax.experimental.multihost_utils import process_allgather
-from jax.sharding import Mesh, NamedSharding
+from jax.sharding import NamedSharding
 from jax.sharding import PartitionSpec as P
 from jaxpm.kernels import interpolate_power_spectrum
@@ -78,7 +77,7 @@ def parse_arguments():
 def create_mesh_and_sharding(pdims):
-    devices = create_device_mesh(pdims)
-    mesh = Mesh(devices, axis_names=('x', 'y'))
+    mesh = jax.make_mesh(pdims, axis_names=('x', 'y'))
     sharding = NamedSharding(mesh, P('x', 'y'))
     return mesh, sharding
@@ -106,7 +105,10 @@ def run_simulation(omega_c, sigma8, mesh_shape, box_size, halo_size,
                        sharding=sharding)
     ode_fn = ODETerm(
-        make_diffrax_ode(cosmo, mesh_shape, paint_absolute_pos=False))
+        make_diffrax_ode(mesh_shape,
+                         paint_absolute_pos=False,
+                         sharding=sharding,
+                         halo_size=halo_size))
     # Choose solver
     solver = LeapfrogMidpoint() if solver_choice == "leapfrog" else Dopri5()


@@ -37,3 +37,50 @@ Each notebook includes installation instructions and guidelines for configuring
- **SLURM** for job scheduling on clusters (if running multi-host setups)
> **Note**: These notebooks are tested on the **Jean Zay** supercomputer and may require configuration changes for different HPC clusters.
## Caveats

### Cloud-in-Cell (CIC) Painting (Single Device)

There are two ways to perform CIC painting in JAXPM. The first is `cic_paint`, which paints absolute particle positions to the mesh. The second is `cic_paint_dx`, which paints relative particle positions to the mesh (using uniform particles). The absolute version is faster, at the cost of higher memory usage.

In order to use relative painting you need to:

- Set the `particles` argument of the `lpt` function from `jaxpm.pm` to `None`
- Set `paint_absolute_pos` to `False` in the `make_ode_fn` or `make_diffrax_ode` function from `jaxpm.pm` (it is `True` by default)

Otherwise, set `particles` to the starting particles of your choice and leave `paint_absolute_pos` as `True` (the default). Both variants are sketched below.
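A minimal sketch of both modes, assuming the `jaxpm.pm` signatures used in the updated examples above; the three-value `lpt` return and the `interpolate_power_spectrum` plumbing follow the notebooks and should be treated as assumptions, not a stable API:

```python
import jax
import jax.numpy as jnp
import jax_cosmo as jc
from jaxpm.kernels import interpolate_power_spectrum
from jaxpm.pm import linear_field, lpt, make_diffrax_ode

mesh_shape = (64, 64, 64)
box_size = (64., 64., 64.)
cosmo = jc.Planck15()

# Interpolate the linear matter power spectrum onto the field's k modes.
k = jnp.logspace(-4, 1, 128)
pk = jc.power.linear_matter_power(cosmo, k)
pk_fn = lambda x: interpolate_power_spectrum(x, k, pk)

initial_conditions = linear_field(mesh_shape, box_size, pk_fn,
                                  seed=jax.random.PRNGKey(0))

# Relative painting: no particle grid is materialized (cic_paint_dx under the hood).
dx, p, f = lpt(cosmo, initial_conditions, particles=None, a=0.1)
ode_fn = make_diffrax_ode(mesh_shape, paint_absolute_pos=False)

# Absolute painting: pass explicit starting positions and keep the defaults.
# particles = ...  # e.g. an (N, 3) array of starting positions
# dx, p, f = lpt(cosmo, initial_conditions, particles, a=0.1)
# ode_fn = make_diffrax_ode(mesh_shape)  # paint_absolute_pos=True by default
```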
### Cloud-in-Cell (CIC) Painting (Multi Device)

Both `cic_paint` and `cic_paint_dx` are available in multi-device mode.
You need to set the `sharding` and `halo_size` arguments, as explained in the notebook [03-MultiGPU_PM_Halo.ipynb](03-MultiGPU_PM_Halo.ipynb) and sketched below.
Note that `cic_paint` is less accurate than `cic_paint_dx` in multi-device mode and is therefore not recommended.
Relative painting in multi-device mode works just as in single-device mode:
set the `particles` argument of the `lpt` function from `jaxpm.pm` to `None` and set `paint_absolute_pos` to `False`.
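A short sketch of a direct multi-device painting call. It assumes `cic_paint_dx` from `jaxpm.painting` accepts the `halo_size` and `sharding` keywords described above, and uses a toy zero displacement field in place of real `lpt` output:

```python
import jax
import jax.numpy as jnp
from jax.sharding import NamedSharding
from jax.sharding import PartitionSpec as P
from jaxpm.painting import cic_paint_dx

mesh = jax.make_mesh((2, 4), axis_names=('x', 'y'))  # assumes 8 visible devices
sharding = NamedSharding(mesh, P('x', 'y'))
halo_size = 64

# Toy (X, Y, Z, 3) displacement field; in practice this is the output of
# lpt(..., particles=None, halo_size=halo_size, sharding=sharding).
dx = jax.device_put(jnp.zeros((256, 256, 256, 3)), sharding)

# halo_size and sharding must match the values used to produce dx.
density = cic_paint_dx(dx, halo_size=halo_size, sharding=sharding)
```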
### Distributed PM

To run a distributed PM, follow the examples in notebook [03](03-MultiGPU_PM_Halo.ipynb) and, for multi-host setups, notebook [05](05-MultiHost_PM.ipynb).
In short, you need to set the `sharding` and `halo_size` arguments in `lpt`, `linear_field`, the `make_ode` functions, and `pm_forces` if you use it.
Mismatched shardings will give you errors and unexpected results.
You can also use `normal_field` and `uniform_particles` from `jaxpm.pm.distributed` to create the fields and particles with a sharding. A consistent end-to-end setup is sketched below.
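A minimal sketch of threading a single `sharding` and `halo_size` through the whole pipeline; names and keywords follow the updated notebook and example script above, so treat the exact `lpt`/`linear_field` keywords as assumptions:

```python
import jax
import jax.numpy as jnp
import jax_cosmo as jc
from jax.sharding import NamedSharding
from jax.sharding import PartitionSpec as P
from jaxpm.kernels import interpolate_power_spectrum
from jaxpm.pm import linear_field, lpt, make_diffrax_ode

pdims = (2, 4)  # assumes 8 visible devices
mesh = jax.make_mesh(pdims, axis_names=('x', 'y'))
sharding = NamedSharding(mesh, P('x', 'y'))
halo_size = 64

mesh_shape = (256, 256, 256)
box_size = (512., 512., 512.)
cosmo = jc.Planck15()

k = jnp.logspace(-4, 1, 128)
pk = jc.power.linear_matter_power(cosmo, k)
pk_fn = lambda x: interpolate_power_spectrum(x, k, pk, sharding)

# The same sharding/halo_size pair goes into every distributed-aware call;
# mixing different values here produces the errors mentioned above.
field = linear_field(mesh_shape, box_size, pk_fn,
                     seed=jax.random.PRNGKey(0), sharding=sharding)
dx, p, f = lpt(cosmo, field, particles=None, a=0.1,
               halo_size=halo_size, sharding=sharding)
ode_fn = make_diffrax_ode(mesh_shape, paint_absolute_pos=False,
                          sharding=sharding, halo_size=halo_size)
```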
### Choosing the right pdims

pdims are processor dimensions, explained in more detail in the jaxDecomp paper ([here](https://github.com/DifferentiableUniverseInitiative/jaxDecomp)).

For 8 devices, three kinds of decomposition are possible:

- (1, 8)
- (2, 4) and (4, 2)
- (8, 1)

(1, X) should be the fastest; (2, X) or (X, 2) is more accurate but slightly slower; and (X, 1) gives the least accurate results for some reason, so it is not recommended. The sketch below shows how a chosen pdims becomes a mesh and sharding.