Quickstart
If you haven't installed h5pack yet, please refer to our Installation guide before proceeding. This guide will help you get up and running with h5pack in minutes. To create an HDF5 dataset using h5pack, you need the following three components:
- Data: It is your raw data which may include one or multiple sets of audio files in any of the formats supported by
soundfilesuch was.wavor.flac. - Annotations: Any additional data that is related to your audio data and relevant to pack in the dataset such as split (training, validation, test), speaker id or simlilar annotations. It is provided through a
.csvfile where one column has the path of the audio file and any other column can contain an annotation related to that audio file in any of the data types supported byh5pack. - Configuration file: The configuration file is a
.yamlfile that relates your annotations and your audio data and tellh5packwhat to include and how.
Note
All the data mentioned below is in the examples/00_annotated-audio-dataset folder of h5pack repository.
In the next section we will go through each individual component of the exampled
located in examples/00_annotated-audio-dataset. This folder has the following data structure:
00_annotated-audio-dataset
├── data
│ ├── brownian-noise.flac
│ ├── pink-noise.flac
│ └── white-noise.flac
├── dataset.csv
└── h5pack.yaml
In this case, the datais in the data folder which consist of three .flacfiles, our annotatios are in the dataset.csv file and the configuration is inside the h5pack.yaml file.
Data
It corresponds to the raw data that will be included in the dataset, this is typically
one or multiple sets of audio files in formats such as .wav or .flac. The resulting
.h5 file will consolidate the data from multiple audio files into a single HDF5 file or a set of HDF5 partition files, ensuring efficient storage and access.
Annotations
Annotations can be incorporated alongside each audio file using a .csv file. Each row in this .csv should include a column with the audio file's path, and the remaining columns can contain supplementary data to be embedded as annotations.
In this case, there is an arbitrary Type parameter describing type of noise included
in each audio file as a str. The dataset.csv content looks as follow:
| File | Type |
|---|---|
| data/brownian-noise.flac | brownian |
| data/pink-noise.flac | pink |
| data/white-noise.flac | white |
This .csv can be generated in any way you want. For the most common uses cases we
recommend using sndls tool.
Configuration file
The configuration file, typically named h5pack.yaml, is a .yaml file that connects your data and annotations, instructing h5pack on how to render your dataset(s). It consists of specifications for one or more datasets. In this case h5pack.yaml looks as follows (omitting the comments):
datasets:
simple_dataset:
attrs:
author: Your name
description: Your dataset description
version: 0.1.0
data:
file: dataset.csv
fields:
audio:
column: file
parser: as_audioint16
type:
column: type
parser: as_utf8str
From this file:
dataset: Is the main mandatory key that can contain one or multiple datsets. In this case a single dataset namedsimple_datasetis included.attrs: It is an optional key that can contain arbitrarystrattributes to be rendered with your data. Each key corresponds to the attribute name and each value has to be a singlestrwith the value of that specific attribute.data: Describes the data to be included in the files by relating your.csvannotations file with your raw data. In this case:fileis the annotations filedataset.csvfieldscan have one or multiple keys. Each key corresponds to the name of the field to be included in the dataset, and contains acolumnkey with the column from the.csvto be used and aparserselecting the parser used to include that data.
h5pack supports the following parsers:
| Parser name | Resulting data type | Example .csv row value |
|---|---|---|
as_audioint16 |
Audio files stored as int16 |
/path/to/file.wav |
as_audiofloat32 |
Audio files stored as float32 |
/path/to/file.wav |
as_audiofloat64 |
Audio files stored as float64 |
/path/to/file.wav |
as_int8 |
Single int8 value |
64 |
as_int16 |
Single int16 value |
32767 |
as_float32 |
Single float32 value |
0.707 |
as_float64 |
Single float64 value |
3.146 |
as_listint8 |
List of int8 values |
[0, 127] |
as_listint16 |
List of int16 values |
[32767, 32767] |
as_listfloat32 |
List of float32 values |
[0.707, 1.414, ...] |
as_listfloat64 |
List of float64 values |
[0.505, 2.125, ...] |
as_utf8str |
Single str value |
hello_world |
In this case, there a single set of audio files from the file column and saved as int16 (as_audioint16) in the audio field, and a str from the type column
and saved as str (as_utf8str).
Rendering the dataset with h5pack pack
Now that all required files are ready, we can use h5pack pack tool to create
the actual .h5 file. Now run
This will result in the following output:
Using root folder '/path/to/h5pack/examples/00_annotated-audio-dataset'
Validating configuration file 'h5pack.yaml' ...
Configuration file validation completed
Validating input data ...
Validating data of 'audio' field ...
Validation of 'audio' field data completed
Validating data of 'type' field ...
Validation of 'type' field data completed
Input data validation completed
Generating 1 partition spec(s) ...
Partition spec(s) completed
1 partition(s) will be created
Do you want to continue? [y/n]:
Once you have executed the previous required commands, you will need to confirm by typing y and pressing Enter when prompted. Upon confirmation, two files will be generated: simple_datase.h5 and simple_dataset.sha256. The simple_datase.h5 file is your dataset, now ready for use, while the simple_dataset.sha256 file contains the checksum for simple_dataset.h5. You can use this checksum file later to verify the integrity of your dataset.
For more options available with the h5pack pack tool, you can run
This tool allows customization of several aspects, including specifying the number of partitions for the output file and determining the number of workers that should be used to efficiently render the resulting files.
Inspecting the dataset with h5pack info
With the file now generated, you can easily inspect its content using the h5pack info tool. To do so, simply run the following command:
It will output
Input file: 'simple_dataset.h5'
File attribute(s):
- author: Your name
- creation_date: 2025-10-25 11:17:37
- description: Your dataset description
- producer: h5pack 1.1.0
- version: 0.1.0
Data group 'data':
- 'audio' attribute(s):
- parser: as_audioint16
- sample_rate: 16000
- 'audio' data attribute(s):
- shape: (3, 16000)
- dtype: int16
- 'audio__filepath' data attribute(s):
- shape: (3,)
- dtype: object
- 'type' attribute(s):
- parser: as_utf8str
- 'type' data attribute(s):
- shape: (3,)
- dtype: object
This information allows you to swiftly verify the contents of your file. In this example, the file contains three audio samples, each with a duration of one second and sampled at a rate of 16 kHz, resulting in 16,000 samples per audio clip.
Corroborating integrity with h5pack checksum
To verify the integrity of your file, run the following command:
The command will output:
Verifying checksum in 'simple_dataset.sha256' ...
simple_dataset.h5 cefeb61ce741bfa4a25cf54069e73d466fa9bdc2093d461ca30f381f6606eb79 [OK]
Checksum verification completed in 0.2 millisecond(s)
Using this tool, you can consistently check for any potentially corrupted files.
Unpacking the dataset with h5pack unpack
You also have the option to convert your .h5 files back into their original constituent files
using the h5pack unpack tool. To do it, simply run
This process creates a simple_dataset folder, which includes the corresponding
data, annotations file, and configuration file.
simple_dataset
├── data
│ └── audio
│ ├── brownian-noise.flac
│ ├── pink-noise.flac
│ └── white-noise.flac
├── dataset.csv
└── h5pack.yaml
The generated files are structured so that you can promptly repack them into .h5 file(s) if desired.
Conclusion
You should now be able to manage, verify, and repack your data to ensure its integrity and flexibility for future use. For more information on additional tools, including those not covered in this Quickstart guide, visit the Documentation section.