Masking
swiftsimio provides unique functionality (when compared to other
software packages that read HDF5 data) through its masking facility.
SWIFT snapshots contain cell metadata that allow us to spatially mask the
data ahead of time. swiftsimio provides a number of tools that help
with this. This functionality is provided through the swiftsimio.masks
sub-module but is available easily through the swiftsimio.mask()
top-level function.
This functionality is used heavily in swiftgalaxy.
Spatial masking
Spatial masking is approximate and allows you to only load particles within a given cuboid region. It is precise to the top-level cells that are defined within SWIFT. It will always load all of the particles that you request, but in general it will also load some particles that are slightly outside of the region of interest. This is because it works as follows:
Load the top-level cell metadata.
Find the overlap between the specified region and these cells.
Load all cells within that overlap.
As you can see, some extra particles around the edges of regions may be loaded as we always load the whole top-level cell.
Example
In this example we will use the swiftsimio.masks.SWIFTMask object
to load the the octant of the box closest to the origin.
import swiftsimio as sw
filename = "cosmo_volume_example.hdf5"
mask = sw.mask(filename)
# The full metadata object is available from within the mask
boxsize = mask.metadata.boxsize
# load_region is a 3x2 list [[xmin, xmax], [ymin, ymax], [zmin, zmax]]
load_region = [[0.0 * b, 0.5 * b] for b in boxsize]
# Constrain the mask
mask.constrain_spatial(load_region)
# Now load the snapshot with this mask
data = sw.load(filename, mask=mask)
data is now a regular swiftsimio.reader.SWIFTDataset object, but
it only ever loads particles that are (approximately) inside the
load_region region.
Importantly, this method has a tiny memory overhead, and should also have a relatively small overhead when reading the data. This allows you to use snapshots that are much larger than the available memory on your machine and process them with ease.
It is also possible to build up a region with a more complicated geometry by
making repeated calls to constrain_spatial()
and setting the optional argument union=True. This option adds any
additional cells that need to be selected for the new region to the existing
selection. For instance, to add the diagonally opposed octant to the selection
made above (and so obtain a region shaped like two cubes with a single corner
touching):
two_octant_mask = sw.mask(filename)
first_region = [[0.0 * b, 0.5 * b] for b in boxsize]
additional_region = [[0.5 * b, 1.0 * b] for b in boxsize]
two_octant_mask.constrain_spatial(first_region)
two_octant_mask.constrain_spatial(additional_region, union=True)
two_octant_dataset = sw.load(filename, mask=two_octant_mask)
Rows of the load_region can also be replaced with None to indicate that no
constraint should be imposed along that axis. For example, to select a 1 Mpc thick
“slab” perpendicular to the z-axis:
import unyt as u
slab_mask = sw.mask(filename)
scale_factor = slab_mask.metadata.scale_factor
slab_region = [
cosmo_array([1, 2], u.Mpc, comoving=True, scale_factor=scale_factor, scale_exponent=1),
None,
None,
]
slab_mask.constrain_spatial(slab_region)
slab_dataset = sw.load(filename, slab_mask)
Periodic boundaries
The mask region is aware of the periodic box boundaries. Let’s take for example a region shaped like a “slab” in the \(x-y\) plane with \(|z|<0.1L_\mathrm{box}\). One way to write this is by thinking of the \(z<0\) part as lying at the upper edge of the box:
mask = sw.mask(filename)
boxsize = mask.metadata.boxsize
mask.constrain_spatial(
[
None,
None,
[0.0 * boxsize[2], 0.1 * boxsize[2]],
]
)
mask.constrain_spatial(
[
None,
None,
[0.9 * boxsize[2], 1.0 * boxsize[2]],
]
union=True,
)
This is a bit inconvenient though since the region is actually contiguous if we
account for the periodic boundary.
constrain_spatial() allows us to select a region
straddling the periodic boundary, for example this is an equivalent selection:
mask = sw.mask(filename)
mask.constrain_spatial(
[
None,
None,
[-0.1 * boxsize[2], 0.1 * boxsize[2]],
]
)
Note that masking never result in periodic copies of particles, nor does it shift particle coordinates to match the region defined; particle coordinates always lie in the range \([0, L_\mathrm{box}]\). For example reading a region that extends beyond the box in all directions produces exactly one copy of every particle and is equivalent to providing no spatial mask:
mask = sw.mask(filename)
mask.constrain_spatial([[-0.1 * b, 1.1 * b] for b in mask.metadata.boxsize])
Remember to wrap the coordinates yourself if relevant! Alternatively, the swiftgalaxy package offers support for coordinate transformations including periodic boundaries.
Another equivalent region for the \(|z|<0.1L_\mathrm{box}\) slab can be written by setting the lower bound to a greater value than the upper bound, the code will interpret this as a request to start at the lower bound, wrap through the upper periodic boundary and continue until the (numerically lower value of) the upper bound is reached:
mask = sw.mask(filename)
mask.constrain_spatial(
[
None,
None,
[0.9 * boxsize[2], 0.1 * boxsize[2]],
]
)
The coordinates defining the region must always be in the interval \([-0.5L_\mathrm{box}, 1.5L_\mathrm{box}]\). This allows enough flexibility to define all possible regions.
Implementation details
SWIFT snapshots group particles according to the cell that they occupy so that
particles belonging to a cell are stored contiguously. The cells form a regular grid
covering the simulation domain. However, SWIFT does not guarantee that all particles
that belong to a cell are within the boundaries of a cell at the time when a snapshot
is produced (particles are moved between cells at intervals, but may drift outside of
their current cell before being re-assigned). Snapshots contain metadata defining
the “bounding box” of each cell that contains all particles assigned to it at the
time that the snapshot was written. swiftsimio uses this information when
deciding what cells to read, so you may find that the “extra” particles read in
outside of the explicitly asked for have an irregular boundary with cuboid protrusions
or indentations. This is normal: the cells read in are exactly those needed to
guarantee that all particles in the specified region of interest are captured. It is
therefore advantageous to make the region as small and tightly fit to the analysis
task as possible - in particular, trying to align it with the cell boundaries will
typically result in an I/O overhead as neighbouring cells with particles that have
drifted into the region are read in. Unless these particles are actually needed, it
is actually better for performance to avoid the cell boundaries when defining the
region.
Older SWIFT snapshots lack the metadata to know exactly how far particles have
drifted out of their cells. In v10.2.0 or newer, if swiftsimio does not
find this metadata, it will pad the region (by 0.1 times the cell length by default),
and issue a UserWarning indicating this.
Warning
In the worst case that the region consists of one cell and the padding extends to all
neighbouring cells, this can result in up to a factor of \(3^3=27\) additional
I/O overhead. Older swiftsimio versions instead risk missing particles near
the region boundary.
In the unlikely case that particles drift more than 0.1 times
the cell length away from their “home” cell and the cell bounding-box metadata is not
present, some particles can be missed when applying a spatial mask. The padding of
the region can be extended or switched off with the safe_padding parameter:
mask = sw.mask(filename)
lbox = mask.metadata.boxsize
mask.constrain_spatial(
[[0.4 * b, 0.6 * b] for b in mask.metadata.boxsize],
safe_padding=False, # padding switched off
)
mask.constrain_spatial(
[[0.4 * b, 0.6 * b] for b in mask.metadata.boxsize],
safe_padding=1.0, # pad more, by 1.0 instead of 0.1 cell lengths
)
Masking by properties
The below example shows the use of the masking tools to constrain loading only particles within a chosen density window.
import swiftsimio as sw
# This creates and sets up the masking object.
mask = sw.mask("cosmological_volume.hdf5")
# This ahead-of-time creates a spatial mask based on the cell metadata.
mask.constrain_spatial(
np.array(
[
[0.2, 0.7],
[0.0, 1.0],
[0.0, 1.0],
]
) * mask.metadata.boxsize[:, np.newaxis]
)
# Now we also constrain the density between
# 0.4 and 0.8 g/cm^3. This reads in the relevant data in the region,
# and tests it element-by-element.
density_units = mask.units.mass / mask.units.length**3
mask.constrain_mask(
"gas",
"density",
cosmo_quantity(
0.4,
u.g / u.cm ** 3,
comoving=True,
scale_factor=mask.metadata.scale_factor,
scale_exponent=-3
),
cosmo_quantity(
0.8,
u.g / u.cm ** 3,
comoving=True,
scale_factor=mask.metadata.scale_factor,
scale_exponent=-3
),
)
# Now we can grab the actual data object. This includes the mask as a parameter.
data = sw.load("cosmo_volume_example.hdf5", mask=mask)
Warning
While evaluating a full mask (like the density mask in the example above) is implemented efficiently, subsequently reading data from the masked dataset can be extremely slow. This is because it often involves reading many individual rows from the particle data arrays in the hdf5 files, and the code is not optimized for this (this could be optimized in the future). If you encounter slow reads with a full mask, consider:
using a spatial-only mask instead, then discarding unwanted particles afterwards;
using the mask feature on the visualisation functions, if the mask is wanted for visualisation (for example
project_gas());using
swiftgalaxy’s masking features, possibly without using a halo catalogue.
When the attributes of this data object are accessed, only the ones that belong to the masked region (in both density and spatial) are read. I.e. if I ask for the temperature of particles, it will recieve an array containing temperatures of particles that lie in the x-coordinate interval \([0.2, 0.7]\) and have a density between \(0.4\) and \(0.8 \mathrm{g}/\mathrm{cm}^3\).
Row Masking
For certian scenarios, in particular halo catalogues, all arrays are of the
same length (you can check this through the metadata.homogeneous_arrays
attribute). Often, you are interested in a handful of, or a single, row,
corresponding to the properties of a particular object. You can use the
methods constrain_index and constrain_indices to do this, which
return swiftsimio data objects containing arrays with only those
rows.
import swiftsimio as sw
mask = sw.mask(filename)
mask.constrain_indices([1, 99, 23421])
data = sw.load(filename, mask=mask)
Here, the length of all the arrays will be 3. A quick performance note: if you
are using many indices (over 1000), you should use mask.convert_masks_to_bool()
before using the mask to benefit from range reading of overlapping rows in a
single chunk.
Writing subset of snapshot
In some cases it may be useful to write a subset of an existing snapshot to its own hdf5 file. This could be used, for example, to extract a galaxy halo that is of interest from a snapshot so that the file is easier to work with and transport.
To do this the write_subset function is provided. It can be used, for example,
as follows
import swiftsimio as sw
import unyt
mask = sw.mask("eagle_snapshot.hdf5")
scale_factor = mask.metadata.scale_factor
mask.constrain_spatial(
[
cosmo_array([100, 200], u.kpc, comoving=True, scale_factor=scale_factor, scale_exponent=1),
None,
None,
]
)
sw.subset_writer.write_subset("test_subset.hdf5", mask)
This will write a snapshot which contains the particles from the specified snapshot whose \(x\)-coordinate is within the range [100, 200] kpc. This function uses the cell mask which encompases the specified spatial domain to successively read portions of datasets from the input file and writes them to a new snapshot.
Due to the coarse grained nature of the cell mask, particles from outside this range may also be included if they are within the same top level cells as particles that fall within the given range.