Compression Filters
Filters can be applied to compress the data in snapshots and reduce the disk footprint of the datasets. The filters provided by SWIFT are natively provided by HDF5, meaning that the library automatically and transparently applies the reverse filter when reading the data stored on disk. They can be applied in combination with, or instead of, the lossless gzip compression filter.
These compression filters are lossy, meaning that they modify the data written to disk.
Warning
The filters will reduce the accuracy of the data stored. No check is made inside SWIFT to verify that the applied filters make sense. Poor choices can lead to all the values of a given array being reduced to 0 or Inf, or to the loss of too much accuracy for the data to be useful. The onus is entirely on the user to choose wisely how they want to compress their data.
The filters are not applied when using parallel-hdf5.
The name of any filter applied is carried by each individual field in the snapshot using the meta-data attribute `Lossy compression filter`.
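For instance, the attribute can be inspected with `h5py`; the snapshot name and field path below are placeholders:

```python
import h5py

# Open a snapshot (hypothetical file name) and print the lossy filter
# recorded on one of its fields.
with h5py.File("snapshot_0000.hdf5", "r") as f:
    field = f["PartType0/Coordinates"]
    print(field.attrs["Lossy compression filter"])
```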
Warning
Starting with HDF5 version 1.14.4, filters which compress the data by more than 2x are flagged as problematic (see the HDF5 documentation). SWIFT can nevertheless write files with them by setting the appropriate file-level flags. However, some tools (such as `h5py`) may not be able to read these fields.
The available filters are listed below.
N-bit filters for long long integers
The N-bit filter takes a long long and saves only the most significant N bits.
This can be used in cases similar to the particle IDs. For instance, if they cover the range \([1, 10^{10}]\) then 64-bits is too many and a lot of disk space is wasted storing the 0s. In this case \(\left\lceil{\log_2(10^{10})}\right\rceil + 1 = 35\) bits are sufficient (The extra “+1” is for the sign bit).
SWIFT implements 6 variants of this filter:
- `Nbit32`: stores the 32 most significant bits (numbers up to \(2\times10^{9}\), compression ratio: 2x)
- `Nbit36`: stores the 36 most significant bits (numbers up to \(3.4\times10^{10}\), compression ratio: 1.78x)
- `Nbit40`: stores the 40 most significant bits (numbers up to \(5.4\times10^{11}\), compression ratio: 1.6x)
- `Nbit44`: stores the 44 most significant bits (numbers up to \(8.7\times10^{12}\), compression ratio: 1.45x)
- `Nbit48`: stores the 48 most significant bits (numbers up to \(1.4\times10^{14}\), compression ratio: 1.33x)
- `Nbit56`: stores the 56 most significant bits (numbers up to \(3.6\times10^{16}\), compression ratio: 1.14x)
Note that if the data written to disk requires more than N bits, part of the information written to the snapshot will be lost. SWIFT does not perform any verification before applying the filter.
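Since SWIFT performs no such check, one can verify beforehand that a given variant is lossless for the data at hand; a minimal sketch, assuming non-negative 64-bit IDs:

```python
import numpy as np

# Hypothetical particle IDs covering the range [1, 10^10].
ids = np.array([1, 123456789, 10**10], dtype=np.int64)

# Bits required: ceil(log2(max)) plus one bit for the sign,
# mirroring the computation above.
needed = int(ids.max()).bit_length() + 1
print(needed)  # -> 35

# Smallest Nbit variant that keeps all the information.
for n in (32, 36, 40, 44, 48, 56):
    if needed <= n:
        print(f"Nbit{n} is sufficient")  # -> Nbit36 is sufficient
        break
```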
Scaling filters for floating-point numbers
The D-scale filters can be used to round floating-point values to a fixed absolute accuracy.
They first compute the minimum of the array, which is then subtracted from all the values. The array is then multiplied by \(10^n\) and rounded to the nearest integer. These integers are stored with the minimal number of bits required to represent the values. The process is reversed when reading the data.
For an array of values [1.2345, -0.1267, 0.0897] and \(n=2\), the integers [136, 0, 22] get stored on disk (but hidden from the user).
This can be stored with 8 bits instead of the 32 bits needed to store the original values in floating-point precision, realising a gain of 4x.
When reading the values (for example via `h5py` or `swiftsimio`), that process is transparently reversed and we get [1.2333, -0.1267, 0.0933].
Using a scaling of \(n=2\) hence rounds the values to two digits after the decimal point.
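The write and read paths described above can be emulated in a few lines of numpy; this is a sketch of the algorithm, not the actual HDF5 implementation:

```python
import numpy as np

values = np.array([1.2345, -0.1267, 0.0897])
n = 2  # equivalent to the DScale2 filter

# Write path: subtract the minimum, scale by 10^n, round to integers.
minimum = values.min()
stored = np.rint((values - minimum) * 10**n).astype(np.int64)
print(stored)  # -> [136   0  22]

# Read path: reversed transparently by HDF5 when the data is loaded.
restored = stored / 10**n + minimum
print(restored)  # -> [ 1.2333 -0.1267  0.0933]
```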
SWIFT implements 6 variants of this filter:
- `DScale1`: scales by \(10^1\)
- `DScale2`: scales by \(10^2\)
- `DScale3`: scales by \(10^3\)
- `DScale4`: scales by \(10^4\)
- `DScale5`: scales by \(10^5\)
- `DScale6`: scales by \(10^6\)
An example application is to store the positions with pc accuracy in simulations that use Mpc as their base unit by using the `DScale6` filter.
The compression rate of these filters depends on the data. On an EAGLE-like simulation (100 Mpc box), compressing the positions from Mpc to pc (via `DScale6`) leads to a rate of around 2.2x.
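This rate can be roughly understood from the number of bits the filter needs; a back-of-the-envelope sketch, assuming the positions are stored as 64-bit doubles spanning the full box:

```python
import math

box_size_mpc = 100   # EAGLE-like box
scale = 10**6        # DScale6: Mpc base unit, pc accuracy

# Largest integer the filter must store and the bits it requires.
largest = box_size_mpc * scale         # 10^8
bits = math.ceil(math.log2(largest))   # 27
print(64 / bits)  # ~2.4x, close to the measured ~2.2x
```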
Modified floating-point representation filters
These filters modify the bit-representation of floating point numbers to get a different relative accuracy.
In brief, floating-point (FP) numbers are represented in memory as \((\pm 1)\times a \times 2^b\) with a certain number of bits used to store each of \(a\) (the mantissa) and \(b\) (the exponent) as well as one bit for the overall sign [1]. For example, a standard 4-byte `float` uses 23 bits for \(a\) and 8 bits for \(b\). The number of bits in the exponent mainly drives the range of values that can be represented whilst the number of bits in the mantissa drives the relative accuracy of the numbers.
Converting to the more familiar decimal notation, the number of decimal digits that are correctly represented is \(\log_{10}(2^{n(a)+1})\), with \(n(x)\) the number of bits in \(x\). The range of positive numbers that can be represented is \([2^{-2^{n(b)-1}+2}, 2^{2^{n(b)-1}}]\). For a standard `float`, this gives a relative accuracy of \(7.2\) decimal digits and a representable range of \([1.17\times 10^{-38}, 3.40\times 10^{38}]\). Numbers above the upper limit are labeled as Inf, and numbers below the lower limit default to zero.
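These two formulas can be checked against the tables below; a small helper reproducing the quoted numbers for a standard `float`:

```python
import math

def accuracy_digits(n_a):
    """Decimal digits correctly represented with n_a mantissa bits."""
    return math.log10(2 ** (n_a + 1))

def positive_range(n_b):
    """Smallest and largest positive numbers for n_b exponent bits."""
    return 2.0 ** (-(2 ** (n_b - 1)) + 2), 2.0 ** (2 ** (n_b - 1))

print(accuracy_digits(23))  # -> 7.22... digits
print(positive_range(8))    # -> (1.17...e-38, 3.40...e+38)
```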
The filters in this category change the number of bits in the mantissa and exponent. When reading the values (for example via `h5py` or `swiftsimio`), the numbers are transparently restored to regular floats but with 0s in the bits of the mantissa that were not stored on disk, hence changing the result from what was stored originally before compression.
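The effect can be emulated with a bit mask on the binary representation of a `float32`; a sketch assuming plain truncation (here keeping 9 mantissa bits, as the FMantissa9 filter does):

```python
import numpy as np

def truncate_mantissa(values, kept_bits):
    """Zero the lowest (23 - kept_bits) mantissa bits of float32 data."""
    bits = np.asarray(values, dtype=np.float32).view(np.uint32)
    mask = np.uint32(0xFFFFFFFF) << np.uint32(23 - kept_bits)
    return (bits & mask).view(np.float32)

x = np.array([3.14159265], dtype=np.float32)
print(truncate_mantissa(x, 9))  # -> [3.140625], ~3 correct digits
```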
These filters offer a fixed compression ratio and a fixed relative accuracy. The available options in SWIFT for a `float` (32-bit) output are:
| Filter name | \(n(a)\) | \(n(b)\) | Accuracy | Range | Compression ratio |
|---|---|---|---|---|---|
| No filter | 23 | 8 | 7.22 digits | \([1.17\times 10^{-38}, 3.40\times 10^{38}]\) | — |
| `FMantissa13` | 13 | 8 | 4.21 digits | \([1.17\times 10^{-38}, 3.40\times 10^{38}]\) | 1.45x |
| `FMantissa9` | 9 | 8 | 3.01 digits | \([1.17\times 10^{-38}, 3.40\times 10^{38}]\) | 1.78x |
| `BFloat16` | 7 | 8 | 2.41 digits | \([1.17\times 10^{-38}, 3.40\times 10^{38}]\) | 2x |
| `HalfFloat` | 10 | 5 | 3.31 digits | \([6.1\times 10^{-5}, 6.5\times 10^{4}]\) | 2x |
The same for a `double` (64-bit) output:
| Filter name | \(n(a)\) | \(n(b)\) | Accuracy | Range | Compression ratio |
|---|---|---|---|---|---|
| No filter | 52 | 11 | 15.9 digits | \([2.2\times 10^{-308}, 1.8\times 10^{308}]\) | — |
| `DMantissa21` | 21 | 11 | 6.62 digits | \([2.2\times 10^{-308}, 1.8\times 10^{308}]\) | 1.93x |
| `DMantissa13` | 13 | 11 | 4.21 digits | \([2.2\times 10^{-308}, 1.8\times 10^{308}]\) | 2.56x |
| `DMantissa9` | 9 | 11 | 3.01 digits | \([2.2\times 10^{-308}, 1.8\times 10^{308}]\) | 3.05x |
The accuracy given in the tables corresponds to the number of decimal digits that can be correctly stored. The "No filter" row is displayed for comparison purposes.
In the first table, the first two filters (`FMantissa13` and `FMantissa9`) are useful to keep the same range as a standard `float` but with a reduced accuracy of 3 or 4 decimal digits. The last two are the two standard reduced-precision options fitting within 16 bits: one with a much reduced relative accuracy (`BFloat16`) and one with a much reduced representable range (`HalfFloat`).
The compression filters for the `double` quantities are useful if the values one wants to store fall outside the exponent range of `float` numbers but only a lower relative precision is necessary.
An example application is to store the densities with the `FMantissa9` filter, as we rarely need more than 3 decimal digits of accuracy for this quantity.