h5pack pack documentation
Note
If you're new to h5pack, please consult our Quickstart guide.
This page is only a quick reference for the different tool options.
This tool converts raw data, annotations, and a configuration file into one or more .h5 partitions for easy use in training or data analysis pipelines. Packing data in this format offers faster access and transfer by reducing file system overhead. The HDF5 format also maintains complex data hierarchies and metadata in a single container, facilitating consistent organization, cross-language accessibility, and scalability for large datasets.
Basic usage
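To pack a dataset, run:
h5pack pack --config <config-file> --dataset <dataset-name> --output <output-h5-file>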
or using aliases:
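h5pack pack -c <config-file> --dataset <dataset-name> --output <output-h5-file>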
If your config file is named h5pack.yaml (the default name), you can omit the -c/--config option:
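h5pack pack --dataset <dataset-name> --output <output-h5-file>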
Advanced settings
Create multiple partitions
You can partition your .h5 dataset across multiple files, which can improve organization and, in some cases, performance.
These partitions can be unified using a Virtual Dataset (VDS), allowing you to access all the data through a single logical file.
For example, your partition files might be named dataset.pt0.h5, dataset.pt1.h5, and so on. Using VDS, you can create a single virtual file named dataset.h5 that seamlessly integrates the datasets from all partition files: accessing dataset.h5 is equivalent to accessing the combined data from dataset.pt0.h5, dataset.pt1.h5, and the other partitions, which makes working with large datasets convenient and efficient.
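For instance, if the standard HDF5 command-line tools are installed (they are separate from h5pack), you can compare the virtual file against an individual partition:
h5ls -r dataset.h5
h5ls -r dataset.pt0.h5
The first command lists the datasets exposed by the virtual file (covering all partitions), while the second lists only those stored in the first partition file.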
Partitions can be divided by a fixed count (e.g., 4 partitions) or by the number of files per partition (e.g., 1000 files per partition).
Fixed number of partitions
To create a fixed number of partitions (4 in this example), run:
h5pack pack --config <config-file> --dataset <dataset-name> --output <output-h5-file> --partitions 4
or using aliases:
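h5pack pack -c <config-file> --dataset <dataset-name> --output <output-h5-file> --partitions 4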
Number of files per partition
To set a fixed number of files per partition (1000 in this example), run:
h5pack pack --config <config-file> --dataset <dataset-name> --output <output-h5-file> --files-per-partition 1000
or using aliases:
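h5pack pack -c <config-file> --dataset <dataset-name> --output <output-h5-file> --files-per-partition 1000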
Number of workers
To speed up the creation of your partition files, you can increase the number of workers with the -w/--workers option as follows:
h5pack pack --config <config-file> --dataset <dataset-name> --output <output-h5-file> --partitions 4 --workers 4
or using aliases:
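h5pack pack -c <config-file> --dataset <dataset-name> --output <output-h5-file> --partitions 4 -w 4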
This will spawn 4 workers, each handling a single partition concurrently.
Create virtual dataset
If you want to automatically create a virtual dataset file that aggregates all partitions of a dataset, simply add the --create-virtual flag as follows:
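h5pack pack --config <config-file> --dataset <dataset-name> --output <output-h5-file> --partitions 4 --create-virtual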
In addition to generating partition files like dataset.pt0.h5, dataset.pt1.h5, and so forth, using the --create-virtual flag will also create a virtual dataset named dataset.h5. This virtual file provides unified access to all partitioned data.
Note
If your datasets have already been created, please refer to the h5pack virtual tool for integrating them into a virtual dataset.
Help
To see all available options, run:
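h5pack pack --help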
or using aliases:
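h5pack pack -h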