getting started with gribscan#

Tools to scan GRIB files and create zarr-compatible indices.

Warning

This repository is still experimental. The code is not yet tested for many kinds of files. It will likely not destroy your files, as it only accesses GRIB files in read-mode, but it may skip some information or may crash. Please file an issue if you discover something is missing.

installing#

gribscan is on PyPI. You can install the latest released version using

python -m pip install gribscan

If you are interested in the latest development version, please clone the repository and install the package in development mode:

python -m pip install -e <path to your clone>

command line usage#

gribscan comes with two executables:

  • gribscan-index for building indices of GRIB files

  • gribscan-build for building a dataset from indices

building indices#

gribscan will create jsonlines-based .index-files next to the input GRIB files. The format is based on the ECMWF OpenData index format but contains a lot more entries.

You can pass in multiple GRIB files at once and specify the number of parallel processes (-n).

gribscan-index *.grb2 -n 16

Note: While gribscan uses cfgrib partially to read GRIB metadata, it does so in a rather hacky way. That way, gribscan does not have to create temporary files and is much faster than cfgrib or kerchunk.grib2, but it may not be as universal as cfgrib is. This is also the main reason for the warning above.
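
Because the index files are jsonlines (one JSON object per line), they can be inspected with the Python standard library alone. The following is a minimal sketch, not part of gribscan: the helper names `read_index` and `summarize_keys` are made up for illustration, and no particular field names are assumed, since the exact set of entries depends on the GRIB files being scanned.

```python
import json
from pathlib import Path


def read_index(path):
    """Return the entries of a jsonlines file as a list of dicts."""
    entries = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                entries.append(json.loads(line))
    return entries


def summarize_keys(entries):
    """Count how often each key occurs across all entries."""
    counts = {}
    for entry in entries:
        for key in entry:
            counts[key] = counts.get(key, 0) + 1
    return counts
```

Calling `summarize_keys(read_index("some_file.index"))` on one of your own index files (the file name here is a placeholder) gives a quick overview of which metadata keys were recorded and whether any of them are missing from some messages.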

building a dataset#

After all the index files have been created, a common dataset can be assembled based on the information in the index files. The assembled dataset will be written out as an fsspec ReferenceFileSystem compatible JSON file, which internally builds a zarr-group structure.

gribscan-build *.index -o dataset.json --prefix <path prefix to referenced grib files>

The prefix will be prepended to the paths within the dataset.json and should point to the location of the original GRIB files.
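
To check that the prefix resolves to the intended GRIB files, it can help to list the files a reference JSON points to. The sketch below assumes the version 1 layout of the fsspec reference specification: a top-level `refs` mapping whose values are either inline strings or lists of the form `[url, offset, length]`, plus an optional `templates` mapping whose entries are substituted for `{{name}}` placeholders. Whether a given `dataset.json` uses templates is an assumption to verify against your own file; the helper names and the example data are hand-written for illustration and are not actual gribscan output.

```python
import json


def load_reference(path):
    """Load a reference JSON file into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def referenced_files(reference):
    """Return the sorted list of distinct file URLs a reference dict points to."""
    templates = reference.get("templates", {})
    files = set()
    for value in reference.get("refs", {}).values():
        # Inline data is stored as a string; file references are lists
        # of the form [url] or [url, offset, length].
        if isinstance(value, list) and value:
            url = value[0]
            # Simple placeholder substitution; more elaborate template
            # syntax is deliberately not handled in this sketch.
            for name, replacement in templates.items():
                url = url.replace("{{" + name + "}}", replacement)
            files.add(url)
    return sorted(files)


# Hand-written example data (hypothetical paths and keys).
example = {
    "version": 1,
    "templates": {"u": "/data/grib/"},
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "t/0.0": ["{{u}}a.grb2", 0, 100],
        "t/1.0": ["{{u}}a.grb2", 100, 100],
        "u/0.0": ["{{u}}b.grb2", 0, 80],
    },
}
print(referenced_files(example))
```

For a real file, `referenced_files(load_reference("dataset.json"))` should list paths that exist on your system; if they do not, the `--prefix` value is the first thing to check.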

reading indexed grib via zarr#

The resulting JSON-file can be interpreted by ReferenceFileSystem and zarr as follows:

import gribscan
import xarray as xr
ds = xr.open_zarr("reference::dataset.json", consolidated=False)
ds

Note that gribscan must be imported in order to register gribscan.rawgrib as a numcodecs codec, which enables the use of GRIB messages as zarr-chunks. As opposed to gribscan-index, the codec only depends on eccodes and doesn’t use cfgrib at all.

fsspec supports URL chaining. The prefix reference:: before the path tells fsspec to load the given path and then initialize a ReferenceFileSystem with whatever it finds there. In principle, ReferenceFileSystem can also be used across HTTP, within ZIP files, or a combination thereof.

library usage#

You might be interested in using gribscan as a Python library, which enables further use cases.

building indices#

You can build an index from a single GRIB file (as explained above) using:

import gribscan
gribscan.write_index(gribfile, indexfile)
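
When indexing many files from Python, it can be convenient to skip files that already have an index. The following is a minimal sketch under stated assumptions: the naming rule in `index_path_for` (same directory, suffix replaced by `.index`) is hypothetical and may differ from what gribscan-index produces, and both helper names are made up for illustration. The writer is passed in as a callable with the `(gribfile, indexfile)` call shape shown above, so the helper itself does not depend on gribscan.

```python
from pathlib import Path


def index_path_for(gribfile):
    """Hypothetical naming rule: same location, suffix replaced by .index."""
    return Path(gribfile).with_suffix(".index")


def build_missing_indices(gribfiles, write_index):
    """Call write_index(gribfile, indexfile) for every file without an index.

    write_index is any callable taking two paths as strings; in real use
    this would be gribscan.write_index.
    """
    written = []
    for gribfile in map(Path, gribfiles):
        indexfile = index_path_for(gribfile)
        if not indexfile.exists():
            write_index(str(gribfile), str(indexfile))
            written.append(indexfile)
    return written
```

A typical call would be `build_missing_indices(Path(".").glob("*.grb2"), gribscan.write_index)`. Passing the writer as an argument also makes the loop easy to try out with a stand-in function before touching real data.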

building dataset references#

You can also assemble a dataset from the indices using:

import gribscan
magician = gribscan.Magician()
gribscan.grib_magic(indexfiles, magician, global_prefix)

The magician is a class that customizes how the dataset is assembled. You may want to define your own in order to design the resulting dataset according to your preferences. Please have a look at magician.py to see what a Magician looks like, and check out the magicians docs.