package writer

v0.0.0-...-c75269d
Published: Feb 4, 2026 License: MIT Imports: 11 Imported by: 0

Documentation

Overview

Package writer provides HDF5 file writing infrastructure.

The Allocator manages free space allocation in HDF5 files. For v0.11.0-beta MVP, it uses a simple end-of-file allocation strategy with no freed space reuse.

See ALLOCATOR_DESIGN.md for comprehensive design documentation.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type AllocatedBlock

type AllocatedBlock struct {
	Offset uint64 // Starting address in file
	Size   uint64 // Size of allocated block in bytes
}

AllocatedBlock tracks an allocated region of the file.

Each block represents a contiguous region that has been allocated and must not be overwritten or reused (in the MVP version).

Blocks are tracked to prevent overlapping allocations and to validate allocator integrity during testing.

type Allocator

type Allocator struct {
	// contains filtered or unexported fields
}

Allocator manages space allocation in HDF5 files.

Strategy (MVP v0.11.0-beta):

  • End-of-file allocation: All allocations occur at end of file
  • No freed space reuse: Once allocated, space is never reclaimed
  • No fragmentation: Perfect sequential layout
  • Overlap prevention: All allocations tracked

Thread Safety:

  • NOT thread-safe: Use external synchronization if needed
  • Designed for single-threaded FileWriter

Performance:

  • Allocate: O(1) - constant time
  • IsAllocated: O(n) - linear scan over blocks
  • Blocks: O(n log n) - copy and sort
  • ValidateNoOverlaps: O(n log n) - sort and scan

Advanced features (deferred to v0.11.0-RC):

  • Free space reuse (best-fit, first-fit strategies)
  • Fragmentation management
  • Thread safety (optional mutex)
  • Alignment enforcement (8-byte)

See ALLOCATOR_DESIGN.md for detailed design documentation.

func NewAllocator

func NewAllocator(initialOffset uint64) *Allocator

NewAllocator creates a space allocator.

The allocator tracks all allocations and manages free space in the HDF5 file. It uses end-of-file allocation strategy (no freed space reuse in MVP).

Parameters:

  • initialOffset: Starting address for allocations (typically after superblock)
  • For superblock v2 (48 bytes): initialOffset = 48
  • For superblock v0 (variable size): initialOffset = superblock_size + driver_info_size

Returns:

  • *Allocator ready to allocate space

Example:

alloc := NewAllocator(48) // Start after superblock v2
addr, err := alloc.Allocate(1024)
if err != nil {
    return err
}

func (*Allocator) Allocate

func (a *Allocator) Allocate(size uint64) (uint64, error)

Allocate reserves a block of space at the end of the file.

The block is allocated at the current end-of-file address and tracked to prevent overlapping allocations. This is the primary method for obtaining space for HDF5 objects (datasets, groups, attributes, metadata).

Strategy:

  • Allocates at current end-of-file (sequential allocation)
  • Updates end-of-file pointer to addr + size
  • Tracks allocation in internal block list
  • No alignment enforcement (deferred to RC)
  • No size limit validation (OS will reject impossible sizes)

Parameters:

  • size: Number of bytes to allocate (must be > 0)

Returns:

  • address: File offset where block is allocated
  • error: Non-nil if allocation fails

Errors:

  • "cannot allocate zero bytes": Size must be greater than 0

Thread Safety:

  • NOT thread-safe: Do not call concurrently

Example:

addr, err := allocator.Allocate(1024) // Allocate 1KB
if err != nil {
    return err
}
// Use addr to write data to file
file.WriteAt(data, int64(addr))

func (*Allocator) Blocks

func (a *Allocator) Blocks() []AllocatedBlock

Blocks returns a copy of all allocated blocks, sorted by offset.

The returned slice is a copy, so modifications do not affect the allocator's internal state. Blocks are sorted by offset in ascending order for consistent iteration and display.

Returns:

  • []AllocatedBlock: Copy of all allocated blocks, sorted by offset

Performance:

  • Time: O(n log n) where n is number of blocks (due to sorting)
  • Space: O(n) - allocates copy of blocks

Use Cases:

  • Debugging allocation patterns
  • Testing allocator state
  • Visualizing file layout
  • Calculating total allocated space

Example:

blocks := alloc.Blocks()
for _, block := range blocks {
    fmt.Printf("Block: [%d, %d) size=%d\n",
        block.Offset, block.Offset+block.Size, block.Size)
}

// Calculate total allocated space
var total uint64
for _, block := range blocks {
    total += block.Size
}

func (*Allocator) EndOfFile

func (a *Allocator) EndOfFile() uint64

EndOfFile returns the current end-of-file address.

This is where the next allocation would occur. It represents the total file size including all allocated blocks.

Returns:

  • uint64: Current end-of-file address (next allocation address)

Performance:

  • Time: O(1) - constant time
  • Space: O(1) - no allocations

Use Cases:

  • Determine total file size
  • Verify space usage
  • Track file growth

Example:

eof := alloc.EndOfFile()
fmt.Printf("File size: %d bytes\n", eof)

func (*Allocator) IsAllocated

func (a *Allocator) IsAllocated(offset, size uint64) bool

IsAllocated checks if an address range overlaps with any allocated blocks.

This method is useful for validation and debugging to ensure no overlapping writes occur. It performs a linear scan over all allocated blocks.

Overlap Detection Logic:

  • Two ranges [a1,a2) and [b1,b2) overlap if: a1 < b2 && b1 < a2
  • Adjacent blocks (touching boundaries) do NOT overlap
  • Zero-size ranges never overlap (returns false)

Parameters:

  • offset: Starting address of range to check
  • size: Size of range to check

Returns:

  • true: Range overlaps with at least one allocated block
  • false: Range is free (or size is 0)

Performance:

  • Time: O(n) where n is number of allocated blocks
  • Space: O(1) - no allocations

Use Cases:

  • Validation before writing to file
  • Debugging overlap issues
  • Testing allocation correctness

Example:

if alloc.IsAllocated(1000, 100) {
    fmt.Println("Warning: Range [1000, 1100) already allocated!")
}

func (*Allocator) ValidateNoOverlaps

func (a *Allocator) ValidateNoOverlaps() error

ValidateNoOverlaps checks that no allocated blocks overlap.

This method is primarily for debugging and testing to ensure the allocator maintains correct state. In a correctly functioning allocator with end-of-file allocation, overlaps should NEVER occur.

Detection Logic:

  • Sorts blocks by offset
  • Checks that each block ends before the next block starts
  • Adjacent blocks (touching boundaries) are NOT considered overlapping

Returns:

  • nil: No overlaps detected (allocator state is valid)
  • error: Overlap detected (indicates allocator bug)

Performance:

  • Time: O(n log n) where n is number of blocks (due to sorting)
  • Space: O(n) - allocates sorted copy of blocks

Use Cases:

  • Debugging allocator implementation
  • Pre-release validation
  • Testing allocation correctness
  • Detecting memory corruption

Example:

if err := alloc.ValidateNoOverlaps(); err != nil {
    panic(fmt.Sprintf("BUG: Allocator corrupted: %v", err))
}

type BZIP2Filter

type BZIP2Filter struct {
	// contains filtered or unexported fields
}

BZIP2Filter implements BZIP2 compression (FilterID = 307). BZIP2 is a high-quality compression algorithm designed by Julian Seward. It provides better compression than GZIP (typically 10-15% smaller) but is slower.

BZIP2 is commonly used for scientific datasets where storage space is critical. Filter ID 307 is registered with the HDF Group.

Reference: https://sourceware.org/bzip2/
HDF5 Registration: https://github.com/HDFGroup/hdf5_plugins

func NewBZIP2Filter

func NewBZIP2Filter(blockSize int) *BZIP2Filter

NewBZIP2Filter creates a BZIP2 compression filter. blockSize specifies compression level (1-9):

  • 1 = fastest, lowest compression (100KB blocks)
  • 9 = slowest, highest compression (900KB blocks) - default

func (*BZIP2Filter) Apply

func (f *BZIP2Filter) Apply(_ []byte) ([]byte, error)

Apply compresses data using BZIP2 algorithm. Returns compressed data suitable for storage.

NOTE: Go stdlib compress/bzip2 only provides decompression. For write support, consider using github.com/dsnet/compress/bzip2 or waiting for future implementation.

func (*BZIP2Filter) Encode

func (f *BZIP2Filter) Encode() (flags uint16, cdValues []uint32)

Encode returns the filter parameters for the Pipeline message.

For BZIP2 in HDF5, the client data typically contains:

  • cd_values[0]: Block size (1-9, in 100KB units)

Reference: https://github.com/HDFGroup/hdf5_plugins/blob/master/BZIP2/src/H5Zbzip2.c

func (*BZIP2Filter) ID

func (f *BZIP2Filter) ID() FilterID

ID returns the HDF5 filter identifier for BZIP2.

func (*BZIP2Filter) Name

func (f *BZIP2Filter) Name() string

Name returns the HDF5 filter name.

func (*BZIP2Filter) Remove

func (f *BZIP2Filter) Remove(data []byte) ([]byte, error)

Remove decompresses BZIP2-compressed data. Returns the original uncompressed data.

This uses Go's stdlib compress/bzip2 for decompression.

type ChunkCoordinator

type ChunkCoordinator struct {
	// contains filtered or unexported fields
}

ChunkCoordinator handles N-dimensional dataset chunking.

This coordinator manages the mapping between:

  • Dataset dimensions and chunk dimensions
  • Linear chunk indices and N-dimensional chunk coordinates
  • Dataset data layout and chunk data extraction

Key Concepts:

  • Dataset dimensions: Total size of dataset in each dimension
  • Chunk dimensions: Size of each chunk in each dimension
  • Chunk coordinates: Scaled indices [dim0, dim1, ..., dimN] where coordinate[i] = element_index[i] / chunk_dim[i]
  • Edge chunks: Partial chunks at dataset boundaries

Example (2D dataset):

Dataset: 25x35 elements
Chunks: 10x10 elements
Result: 3x4 = 12 total chunks
  - Chunk [0,0]: 10x10 (full)
  - Chunk [0,3]: 10x5 (partial in dim 1)
  - Chunk [2,0]: 5x10 (partial in dim 0)
  - Chunk [2,3]: 5x5 (partial in both dims)

func NewChunkCoordinator

func NewChunkCoordinator(datasetDims, chunkDims []uint64) (*ChunkCoordinator, error)

NewChunkCoordinator creates a chunk coordinator.

Calculates the number of chunks needed in each dimension using ceiling division: numChunks[i] = ceil(datasetDims[i] / chunkDims[i])

Parameters:

  • datasetDims: Dataset size in each dimension
  • chunkDims: Chunk size in each dimension

Returns:

  • *ChunkCoordinator: Ready to use
  • error: Non-nil if dimensions mismatch

Example:

// 2D dataset: 100x200 elements, chunks: 10x20
coord, err := NewChunkCoordinator(
    []uint64{100, 200},
    []uint64{10, 20},
)
// Result: 10x10 = 100 total chunks

func (*ChunkCoordinator) ChunkDims

func (cc *ChunkCoordinator) ChunkDims() []uint64

ChunkDims returns chunk dimensions (read-only copy).

func (*ChunkCoordinator) DatasetDims

func (cc *ChunkCoordinator) DatasetDims() []uint64

DatasetDims returns dataset dimensions (read-only copy).

func (*ChunkCoordinator) ExtractChunkData

func (cc *ChunkCoordinator) ExtractChunkData(data []byte, coord []uint64, elemSize uint32) []byte

ExtractChunkData extracts chunk data from full dataset.

Extracts the data for a specific chunk from the full dataset buffer. The dataset is laid out in row-major order (C order), and the chunk data is extracted maintaining this layout.

Parameters:

  • data: Full dataset buffer (row-major layout)
  • coord: Chunk coordinate to extract
  • elemSize: Size of each element in bytes

Returns:

  • []byte: Extracted chunk data (contiguous buffer)

Example (2D, dataset 20x30 uint32, chunks 10x10):

chunk [0,0]: extract data[0:10, 0:10]
chunk [0,1]: extract data[0:10, 10:20]
chunk [1,0]: extract data[10:20, 0:10]

Algorithm:

For each element in chunk:
  1. Calculate position in dataset coordinates
  2. Calculate linear offset in dataset buffer
  3. Copy element to chunk buffer

func (*ChunkCoordinator) GetChunkCoordinate

func (cc *ChunkCoordinator) GetChunkCoordinate(index uint64) []uint64

GetChunkCoordinate converts linear index to N-D coordinate.

Uses row-major layout to convert a linear chunk index to its N-dimensional coordinate.

Row-major layout means:

  • Rightmost dimension varies fastest
  • Leftmost dimension varies slowest

Parameters:

  • index: Linear chunk index (0 to GetTotalChunks()-1)

Returns:

  • []uint64: N-dimensional chunk coordinate

Example (2D, 3x4 chunks):

index=0  → [0,0]
index=1  → [0,1]
index=3  → [0,3]
index=4  → [1,0]
index=11 → [2,3]

Algorithm:

coord[N-1] = index % numChunks[N-1]
coord[N-2] = (index / numChunks[N-1]) % numChunks[N-2]
...
coord[0] = index / (numChunks[1] * numChunks[2] * ... * numChunks[N-1])

func (*ChunkCoordinator) GetChunkSize

func (cc *ChunkCoordinator) GetChunkSize(coord []uint64) []uint64

GetChunkSize returns actual chunk size (may be partial).

Edge chunks at dataset boundaries may be smaller than the nominal chunk size. This method calculates the actual size of a chunk given its coordinate.

Parameters:

  • coord: Chunk coordinate [dim0, dim1, ..., dimN]

Returns:

  • []uint64: Actual chunk size in each dimension

Example (dataset 25x35, chunks 10x10):

[0,0] → [10,10] (full chunk)
[0,3] → [10,5]  (partial in dim 1)
[2,0] → [5,10]  (partial in dim 0)
[2,3] → [5,5]   (partial in both)

Algorithm:

start[i] = coord[i] * chunkDims[i]
end[i] = min(start[i] + chunkDims[i], datasetDims[i])
size[i] = end[i] - start[i]

func (*ChunkCoordinator) GetTotalChunks

func (cc *ChunkCoordinator) GetTotalChunks() uint64

GetTotalChunks returns total chunk count.

Calculates the total number of chunks by multiplying the number of chunks in each dimension.

Returns:

  • uint64: Total number of chunks in dataset

Example:

// Dataset: 100x200, chunks: 10x20
// numChunks = [10, 10]
// total = 10 * 10 = 100

func (*ChunkCoordinator) NumChunks

func (cc *ChunkCoordinator) NumChunks() []uint64

NumChunks returns number of chunks per dimension (read-only copy).

type CreateMode

type CreateMode int

CreateMode specifies the file creation/opening behavior.

const (
	// ModeTruncate creates a new file, truncating if it exists.
	// Equivalent to os.Create() behavior.
	ModeTruncate CreateMode = iota

	// ModeExclusive creates a new file, fails if it exists.
	// Equivalent to os.O_CREATE | os.O_EXCL.
	ModeExclusive

	// ModeReadWrite opens an existing file for reading and writing.
	// Used for read-modify-write operations on existing HDF5 files.
	ModeReadWrite

	// ModeReadOnly opens an existing file for reading only.
	// Used when opening files without modification intent.
	ModeReadOnly
)

type DenseAttributeWriter

type DenseAttributeWriter struct {
	// contains filtered or unexported fields
}

DenseAttributeWriter manages dense attribute storage for a single object.

Dense attributes (8+ attributes) use:

  • Fractal Heap: Storage for attribute data (name + type + space + value)
  • B-tree v2: Index for fast attribute lookup by name
  • Attribute Info Message: Metadata with heap/B-tree addresses

This writer REUSES infrastructure from dense groups:

  • structures.WritableFractalHeap (already exists!)
  • structures.WritableBTreeV2 (already exists!)

Reference: H5Adense.c - H5A__dense_create(), H5A__dense_insert().

func NewDenseAttributeWriter

func NewDenseAttributeWriter(objectAddr uint64) *DenseAttributeWriter

NewDenseAttributeWriter creates new dense attribute writer.

Parameters:

  • objectAddr: Address of object header (for reference)

Returns:

  • DenseAttributeWriter ready to use

func (*DenseAttributeWriter) AddAttribute

func (daw *DenseAttributeWriter) AddAttribute(attr *core.Attribute, sb *core.Superblock) error

AddAttribute adds an attribute to dense storage.

Process:

  1. Encode attribute (name + type + space + data)
  2. Insert into fractal heap → get heap ID
  3. Insert into B-tree v2 (name → heap ID)

Parameters:

  • attr: Attribute to add
  • sb: Superblock for encoding

Returns:

  • error: Non-nil if add fails or duplicate name

Reference: H5Adense.c - H5A__dense_insert().

func (*DenseAttributeWriter) WriteToFile

func (daw *DenseAttributeWriter) WriteToFile(fw *FileWriter, allocator *Allocator, sb *core.Superblock) (*core.AttributeInfoMessage, error)

WriteToFile writes dense attribute storage to file.

Process:

  1. Write fractal heap → get heap address
  2. Write B-tree v2 → get B-tree address
  3. Create Attribute Info Message with addresses
  4. Return Attribute Info Message (caller adds to object header)

Parameters:

  • fw: FileWriter for write operations
  • allocator: Space allocator (pointer to match existing infrastructure)
  • sb: Superblock

Returns:

  • *core.AttributeInfoMessage: Message to add to object header
  • error: Non-nil if write fails

Reference: H5Adense.c - H5A__dense_create().

type DenseGroupWriter

type DenseGroupWriter struct {
	// contains filtered or unexported fields
}

DenseGroupWriter manages dense group creation.

Dense groups (HDF5 1.8+) use:

  • Link Info Message: Metadata about link storage
  • Fractal Heap: Storage for link names and messages
  • B-tree v2: Index for fast link lookup by name

This coordinator:

  1. Creates Fractal Heap for link storage
  2. Creates B-tree v2 for link indexing
  3. Stores link names and metadata in heap
  4. Indexes links in B-tree
  5. Builds Link Info Message with addresses
  6. Constructs object header with all messages

Reference: H5Gdense.c - H5G_dense_create(), H5G_dense_insert().

func NewDenseGroupWriter

func NewDenseGroupWriter(name string) *DenseGroupWriter

NewDenseGroupWriter creates new dense group writer.

Parameters:

  • name: Group name (for error messages)

Returns:

  • DenseGroupWriter ready to accept links

Reference: H5Gdense.c - H5G_dense_create().

func (dgw *DenseGroupWriter) AddLink(name string, targetAddr uint64) error

AddLink adds hard link to dense group.

For MVP: Only hard links are supported (targetAddr points to an object header). Future: Soft links, external links.

Parameters:

  • name: Link name (UTF-8 string)
  • targetAddr: File address of target object header

Returns:

  • error if name empty, duplicate, or invalid

Reference: H5Gdense.c - H5G_dense_insert().

func (*DenseGroupWriter) WriteToFile

func (dgw *DenseGroupWriter) WriteToFile(fw *FileWriter, allocator *Allocator, sb *core.Superblock) (uint64, error)

WriteToFile writes dense group to file, returns object header address.

This method:

  1. For each link:
     a. Create link message (hard link format)
     b. Insert link message into fractal heap
     c. Insert (name, heapID) into B-tree v2
  2. Write fractal heap to file
  3. Write B-tree v2 to file
  4. Create Link Info Message with heap/B-tree addresses
  5. Create object header with Link Info + other messages
  6. Write object header to file

Parameters:

  • fw: FileWriter for write operations
  • allocator: Space allocator
  • sb: Superblock for encoding parameters

Returns:

  • uint64: File address of group's object header
  • error: Non-nil if write fails

Reference: H5Gdense.c - H5G_dense_create() + H5G_dense_insert().

type FileWriter

type FileWriter struct {
	// contains filtered or unexported fields
}

FileWriter wraps an os.File for writing HDF5 files. It provides:

  • Space allocation tracking (via Allocator)
  • Write-at-address operations
  • End-of-file tracking
  • Flush control

Thread-safety: Not thread-safe. Caller must synchronize access.

func NewFileWriter

func NewFileWriter(filename string, mode CreateMode, initialOffset uint64) (*FileWriter, error)

NewFileWriter creates a writer for a new HDF5 file. The file is opened for reading and writing.

Parameters:

  • filename: Path to file to create
  • mode: Creation mode (truncate or exclusive)
  • initialOffset: Starting address for allocations (typically superblock size)

For HDF5 files:

  • Superblock v2 is 48 bytes, so initialOffset would be 48
  • The superblock itself at offset 0 is not tracked by the allocator

Returns:

  • FileWriter ready for use
  • Error if file creation fails

func OpenFileWriter

func OpenFileWriter(filename string, mode CreateMode, initialOffset uint64) (*FileWriter, error)

OpenFileWriter opens an existing HDF5 file for read-modify-write operations. Unlike NewFileWriter which creates a new file, this opens an existing file.

Parameters:

  • filename: Path to existing HDF5 file
  • mode: Open mode (ModeReadWrite or ModeReadOnly)
  • initialOffset: Current end-of-file offset (for allocation tracking)

For existing files:

  • initialOffset should be set to the current file size
  • New allocations will occur after existing data
  • Allocator tracks next free address

Returns:

  • FileWriter ready for RMW operations
  • Error if file doesn't exist or open fails

Example:

// Open existing file for modification
fw, err := OpenFileWriter("data.h5", ModeReadWrite, existingFileSize)
if err != nil {
    return err
}
defer fw.Close()

// Now you can allocate new space and write data
addr, _ := fw.Allocate(1024)
fw.WriteAt(newData, int64(addr))

func (*FileWriter) Allocate

func (w *FileWriter) Allocate(size uint64) (uint64, error)

Allocate reserves a block of space in the file. Returns the address where the block was allocated. The space is not zeroed - caller must write data to the allocated block.

For MVP:

  • Allocation always occurs at end of file
  • No alignment requirements

Example:

addr, err := writer.Allocate(1024)
if err != nil {
    return err
}
// Now write data at addr
err = writer.WriteAt(data, addr)

func (*FileWriter) Allocator

func (w *FileWriter) Allocator() *Allocator

Allocator returns the space allocator. Useful for debugging and testing allocation patterns.

func (*FileWriter) Close

func (w *FileWriter) Close() error

Close closes the underlying file. This does NOT automatically flush - call Flush() first if needed. After Close(), the writer cannot be used.

func (*FileWriter) EndOfFile

func (w *FileWriter) EndOfFile() uint64

EndOfFile returns the current end-of-file address. This is where the next allocation would occur.

func (*FileWriter) File

func (w *FileWriter) File() *os.File

File returns the underlying *os.File. Use with caution - direct file operations may break allocation tracking. Primarily for reading operations or advanced use cases.

func (*FileWriter) Flush

func (w *FileWriter) Flush() error

Flush ensures all writes are committed to disk. This should be called before closing or when data durability is required.

func (*FileWriter) ReadAt

func (w *FileWriter) ReadAt(buf []byte, addr int64) (int, error)

ReadAt reads data at a specific address. Useful for reading back metadata immediately after writing. Implements io.ReaderAt interface for compatibility.

func (*FileWriter) Reader

func (w *FileWriter) Reader() io.ReaderAt

Reader returns an io.ReaderAt interface for reading from the file. This is the preferred method for reading operations as it returns an interface rather than a concrete type, improving testability and following Go best practices.

Use this for:

  • Reading back written data
  • Object header modifications
  • Integration tests (can be mocked)

Example:

reader := fw.Reader()
oh, err := core.ReadObjectHeader(reader, addr, sb)

func (*FileWriter) Seek

func (w *FileWriter) Seek(offset int64, whence int) (int64, error)

Seek implements io.Seeker interface for compatibility. Note: HDF5 uses absolute addressing, so seeking is rarely needed.

func (*FileWriter) WriteAt

func (w *FileWriter) WriteAt(data []byte, offset int64) (int, error)

WriteAt writes data at a specific address in the file. Implements io.WriterAt interface.

The address should typically be obtained from Allocate().

Note: This does not automatically track the write as an allocation. For metadata tracking, use Allocate() first, then WriteAt().

Example:

addr, _ := writer.Allocate(uint64(len(data)))
_, err := writer.WriteAt(data, int64(addr))

func (*FileWriter) WriteAtAddress

func (w *FileWriter) WriteAtAddress(data []byte, addr uint64) error

WriteAtAddress writes data at a specific address (convenience method with uint64 address).

func (*FileWriter) WriteAtWithAllocation

func (w *FileWriter) WriteAtWithAllocation(data []byte) (uint64, error)

WriteAtWithAllocation is a convenience method that allocates space and writes data. Returns the address where data was written.

This is equivalent to:

addr, err := writer.Allocate(uint64(len(data)))
if err != nil { return 0, err }
_, err = writer.WriteAt(data, int64(addr))
return addr, err

type Filter

type Filter interface {
	// ID returns the HDF5 filter identifier.
	ID() FilterID

	// Name returns human-readable filter name.
	Name() string

	// Apply applies filter to data (compression/checksum on write path).
	// Returns transformed data.
	Apply(data []byte) ([]byte, error)

	// Remove reverses filter (decompression/verification on read path).
	// Returns original data.
	Remove(data []byte) ([]byte, error)

	// Encode encodes filter parameters for Pipeline message.
	// Returns: flags, cd_values (client data array).
	Encode() (flags uint16, cdValues []uint32)
}

Filter interface for data transformation. Filters are applied in sequence during write (e.g., Shuffle → GZIP → Fletcher32) and reversed during read (Fletcher32 → GZIP → Shuffle).

type FilterID

type FilterID uint16

FilterID represents HDF5 standard filter identifiers.

const (
	FilterNone        FilterID = 0     // No filter
	FilterGZIP        FilterID = 1     // GZIP compression (deflate)
	FilterShuffle     FilterID = 2     // Byte shuffle
	FilterFletcher32  FilterID = 3     // Fletcher32 checksum
	FilterSZIP        FilterID = 4     // SZIP (not implemented)
	FilterNBIT        FilterID = 5     // NBIT (not implemented)
	FilterScaleOffset FilterID = 6     // Scale+offset (not implemented)
	FilterBZIP2       FilterID = 307   // BZIP2 compression
	FilterLZF         FilterID = 32000 // LZF compression (PyTables/h5py)
)

HDF5 standard filter constants.

type FilterPipeline

type FilterPipeline struct {
	// contains filtered or unexported fields
}

FilterPipeline manages a chain of filters applied to chunk data. Filters are applied in sequence on write and reversed on read.

Example pipeline for numeric data compression:

  1. Shuffle (reorder bytes for better compression)
  2. GZIP (compress data)
  3. Fletcher32 (add checksum)

On write: data → Shuffle → GZIP → Fletcher32 → stored. On read: stored → Fletcher32 → GZIP → Shuffle → data.

func NewFilterPipeline

func NewFilterPipeline() *FilterPipeline

NewFilterPipeline creates an empty filter pipeline.

func (*FilterPipeline) AddFilter

func (fp *FilterPipeline) AddFilter(f Filter)

AddFilter adds a filter to the end of the pipeline. Filters are applied in the order they are added during write operations.

func (*FilterPipeline) AddFilterAtStart

func (fp *FilterPipeline) AddFilterAtStart(f Filter)

AddFilterAtStart inserts a filter at the beginning of the pipeline. This is useful for filters that should be applied first (e.g., Shuffle before GZIP).

func (*FilterPipeline) Apply

func (fp *FilterPipeline) Apply(data []byte) ([]byte, error)

Apply applies all filters in sequence (write path). Example: Shuffle → GZIP → Fletcher32

If any filter fails, the operation stops and returns an error.

func (*FilterPipeline) Count

func (fp *FilterPipeline) Count() int

Count returns the number of filters in the pipeline.

func (*FilterPipeline) EncodePipelineMessage

func (fp *FilterPipeline) EncodePipelineMessage() ([]byte, error)

EncodePipelineMessage encodes the filter pipeline as an HDF5 Pipeline message (0x000B). This message is stored in the dataset's object header to describe which filters are applied to the data.

Returns the encoded message bytes ready to be written to the object header. Returns an error if the pipeline is empty.

func (*FilterPipeline) IsEmpty

func (fp *FilterPipeline) IsEmpty() bool

IsEmpty returns true if the pipeline has no filters.

func (*FilterPipeline) Remove

func (fp *FilterPipeline) Remove(data []byte) ([]byte, error)

Remove applies each filter's inverse in reverse order (read path). Example: Fletcher32 → GZIP → Shuffle

Filters must be removed in reverse order to correctly restore the original data.

type Fletcher32Filter

type Fletcher32Filter struct{}

Fletcher32Filter implements Fletcher32 checksum (FilterID = 3).

The Fletcher32 filter adds a 4-byte checksum to the end of data to detect corruption during storage or transmission. It uses the Fletcher32 algorithm, which is faster than CRC32 but less robust against certain error patterns.

The filter is commonly used in HDF5 to ensure data integrity, especially for compressed data where corruption could affect decompression.

On write: checksum is calculated and appended (original_data + 4 bytes). On read: checksum is verified and stripped (returns original_data).

func NewFletcher32Filter

func NewFletcher32Filter() *Fletcher32Filter

NewFletcher32Filter creates a Fletcher32 checksum filter.

func (*Fletcher32Filter) Apply

func (f *Fletcher32Filter) Apply(data []byte) ([]byte, error)

Apply calculates Fletcher32 checksum and appends it to the data.

The returned data is 4 bytes longer than the input, with the checksum stored in little-endian format at the end.

func (*Fletcher32Filter) Encode

func (f *Fletcher32Filter) Encode() (flags uint16, cdValues []uint32)

Encode returns the filter parameters for the Pipeline message.

Fletcher32 has no parameters, so this returns empty values.

func (*Fletcher32Filter) ID

func (f *Fletcher32Filter) ID() FilterID

ID returns the HDF5 filter identifier for Fletcher32.

func (*Fletcher32Filter) Name

func (f *Fletcher32Filter) Name() string

Name returns the HDF5 filter name.

func (*Fletcher32Filter) Remove

func (f *Fletcher32Filter) Remove(data []byte) ([]byte, error)

Remove verifies and strips the Fletcher32 checksum.

This method:

  1. Extracts the 4-byte checksum from the end of data
  2. Calculates the checksum of the original data
  3. Verifies they match
  4. Returns the original data without the checksum

Returns an error if the checksum doesn't match (data corruption detected).

type GZIPFilter

type GZIPFilter struct {
	// contains filtered or unexported fields
}

GZIPFilter implements GZIP compression (FilterID = 1). This filter uses the DEFLATE compression algorithm to reduce data size. In HDF5, this filter is named "deflate" following zlib terminology.

Compression levels:

1 = fastest compression, larger files
6 = balanced (default)
9 = best compression, slower

func NewGZIPFilter

func NewGZIPFilter(level int) *GZIPFilter

NewGZIPFilter creates a GZIP filter with the specified compression level.

Valid levels:

1 = Fast compression, lower ratio
6 = Default (balanced)
9 = Best compression, slower

Invalid levels are automatically adjusted to 6 (default).

func (*GZIPFilter) Apply

func (f *GZIPFilter) Apply(data []byte) ([]byte, error)

Apply compresses data using GZIP/DEFLATE algorithm. Returns compressed data suitable for storage.

The compressed data includes GZIP headers and CRC32 checksum.

func (*GZIPFilter) Encode

func (f *GZIPFilter) Encode() (flags uint16, cdValues []uint32)

Encode returns the filter parameters for the Pipeline message.

For GZIP, the client data contains a single value: the compression level. Flags are always 0 for GZIP.

func (*GZIPFilter) ID

func (f *GZIPFilter) ID() FilterID

ID returns the HDF5 filter identifier for GZIP.

func (*GZIPFilter) Name

func (f *GZIPFilter) Name() string

Name returns the HDF5 filter name. HDF5 uses "deflate" (the underlying algorithm) rather than "gzip".

func (*GZIPFilter) Remove

func (f *GZIPFilter) Remove(data []byte) ([]byte, error)

Remove decompresses GZIP-compressed data. Returns the original uncompressed data.

This method reverses the Apply operation, restoring the original data.

type LZFFilter

type LZFFilter struct {
}

LZFFilter implements LZF compression (FilterID = 32000). LZF is a very fast compression algorithm designed by Marc Lehmann. It typically achieves a 40-50% compression ratio while compressing 3-5x faster than GZIP and decompressing about 2x faster.

This filter is commonly used by PyTables and h5py for fast compression. Filter ID 32000 was registered by Francesc Alted (PyTables maintainer).

Reference: http://oldhome.schmorp.de/marc/liblzf.html
HDF5 Registration: https://portal.hdfgroup.org/display/support/Filters

func NewLZFFilter

func NewLZFFilter() *LZFFilter

NewLZFFilter creates an LZF compression filter. LZF has no configuration parameters; it uses a fixed algorithm.

func (*LZFFilter) Apply

func (f *LZFFilter) Apply(data []byte) ([]byte, error)

Apply compresses data using LZF algorithm. Returns compressed data suitable for storage.

LZF algorithm characteristics:

  • Hash-based pattern matching (LZ77 family)
  • 8KB sliding window
  • Very fast compression (near memcpy speed)
  • Typical compression ratio: 40-50%

func (*LZFFilter) Encode

func (f *LZFFilter) Encode() (flags uint16, cdValues []uint32)

Encode returns the filter parameters for the Pipeline message.

For LZF in HDF5, the client data typically contains:

  • cd_values[0]: Plugin revision number (usually 0)
  • cd_values[1]: LZF filter version (usually 0)
  • cd_values[2]: Pre-computed chunk size (0 = not pre-computed)

For this implementation, we use minimal parameters.

func (*LZFFilter) ID

func (f *LZFFilter) ID() FilterID

ID returns the HDF5 filter identifier for LZF.

func (*LZFFilter) Name

func (f *LZFFilter) Name() string

Name returns the HDF5 filter name.

func (*LZFFilter) Remove

func (f *LZFFilter) Remove(data []byte) ([]byte, error)

Remove decompresses LZF-compressed data. Returns the original uncompressed data.

This method reverses the Apply operation, restoring the original data.

type SZIPFilter

type SZIPFilter struct {
	// contains filtered or unexported fields
}

SZIPFilter implements SZIP compression (FilterID = 4). SZIP uses extended Golomb-Rice coding as defined in the CCSDS 121.0-B-3 standard. It was designed by NASA for satellite imagery compression and is widely used in scientific data compression.

SZIP is implemented in C by the libaec (Adaptive Entropy Coding) library. Patents on the SZIP algorithm expired in 2017, making the algorithm freely usable.

However, no pure Go implementation exists as of 2026. The algorithm is complex and requires significant effort to implement:

  • Adaptive entropy coding (extended Golomb-Rice)
  • Preprocessing options (NN predictor, EC option encoder)
  • Block-based compression with configurable parameters

For HDF5 files requiring SZIP, users should:

  1. Use HDF5 C library with libaec
  2. Use h5py (Python) which links to C library
  3. Re-compress files using GZIP (filter ID 1) for pure Go compatibility

Reference: https://github.com/MathisRosenhauer/libaec
CCSDS Standard: https://public.ccsds.org/Pubs/121x0b3.pdf
HDF Group: https://docs.hdfgroup.org/hdf5/latest/group___s_z_i_p.html

func NewSZIPFilter

func NewSZIPFilter(optionMask, pixelsPerBlock, bitsPerPixel, pixelsPerScan uint32) *SZIPFilter

NewSZIPFilter creates an SZIP compression filter. Parameters match the SZIP specification:

  • optionMask: Compression options (NN=32, EC=4, LSB=1, MSB=2, RAW=128)
  • pixelsPerBlock: Number of pixels per block (must be even, typically 8-32)
  • bitsPerPixel: Bits per pixel (1-32)
  • pixelsPerScan: Pixels per scanline for 2D data (0 for 1D)

Common configurations:

  • NN predictor with EC encoder: optionMask = 36 (32 + 4)
  • RAW mode (no preprocessing): optionMask = 128

func (*SZIPFilter) Apply

func (f *SZIPFilter) Apply(_ []byte) ([]byte, error)

Apply compresses data using SZIP algorithm. Returns compressed data suitable for storage.

NOTE: SZIP compression requires the libaec library (C implementation). No pure Go implementation exists as of 2026. This is a stub that returns a "not implemented" error.

For SZIP compression, consider:

  1. Using CGo with libaec
  2. Using HDF5 C library
  3. Using alternative compression (GZIP filter ID 1)

func (*SZIPFilter) Encode

func (f *SZIPFilter) Encode() (flags uint16, cdValues []uint32)

Encode returns the filter parameters for the Pipeline message.

For SZIP in HDF5, the client data contains:

  • cd_values[0]: Bits per pixel (1-32)
  • cd_values[1]: Coding method (NN=32, EC=4, LSB=1, MSB=2, RAW=128)
  • cd_values[2]: Pixels per block (even number, 8-32)
  • cd_values[3]: Pixels per scanline (0 for 1D data)

Reference: https://github.com/HDFGroup/hdf5/blob/develop/src/H5Zszip.c

func (*SZIPFilter) ID

func (f *SZIPFilter) ID() FilterID

ID returns the HDF5 filter identifier for SZIP.

func (*SZIPFilter) Name

func (f *SZIPFilter) Name() string

Name returns the HDF5 filter name.

func (*SZIPFilter) Remove

func (f *SZIPFilter) Remove(_ []byte) ([]byte, error)

Remove decompresses SZIP-compressed data. Returns the original uncompressed data.

NOTE: SZIP decompression requires the libaec library (C implementation). No pure Go implementation exists as of 2026. This is a stub that returns a "not implemented" error.

type ShuffleFilter

type ShuffleFilter struct {
	// contains filtered or unexported fields
}

ShuffleFilter implements byte shuffle (FilterID = 2).

The shuffle filter reorders bytes in the data to improve compression ratios for numeric data. It works by transposing byte order from element-by-element to byte-by-byte.

For example, with 4-byte integers [A1 A2 A3 A4][B1 B2 B3 B4][C1 C2 C3 C4]:

Original: [A1 A2 A3 A4 B1 B2 B3 B4 C1 C2 C3 C4]
Shuffled: [A1 B1 C1 A2 B2 C2 A3 B3 C3 A4 B4 C4]

This transformation groups similar bytes together (all first bytes, then all second bytes, etc.), which typically compresses much better with algorithms like GZIP.

The shuffle filter is especially effective for:

  • Integer arrays with slowly changing values
  • Floating-point arrays with similar magnitudes
  • Multi-dimensional arrays with spatial locality

Note: Shuffle should always be applied BEFORE compression filters like GZIP.

func NewShuffleFilter

func NewShuffleFilter(elementSize uint32) *ShuffleFilter

NewShuffleFilter creates a shuffle filter with the specified element size.

The element size should match the datatype size:

  • int32, float32: elementSize = 4
  • int64, float64: elementSize = 8
  • int16: elementSize = 2
  • int8: elementSize = 1

For compound or array types, use the size of the base element.

func (*ShuffleFilter) Apply

func (f *ShuffleFilter) Apply(data []byte) ([]byte, error)

Apply performs byte shuffle on the data.

The shuffle algorithm:

  1. Divide data into elements of size elementSize
  2. For each byte position in an element (0 to elementSize-1):
     a. Extract that byte from each element
     b. Write all those bytes consecutively

Example with elementSize=4, 3 elements:

Input:  [a1 a2 a3 a4][b1 b2 b3 b4][c1 c2 c3 c4]
Output: [a1 b1 c1][a2 b2 c2][a3 b3 c3][a4 b4 c4]

This groups similar bytes together, improving compression with GZIP.

func (*ShuffleFilter) Encode

func (f *ShuffleFilter) Encode() (flags uint16, cdValues []uint32)

Encode returns the filter parameters for the Pipeline message.

For shuffle, the client data contains a single value: the element size. Flags are always 0 for shuffle.

func (*ShuffleFilter) ID

func (f *ShuffleFilter) ID() FilterID

ID returns the HDF5 filter identifier for shuffle.

func (*ShuffleFilter) Name

func (f *ShuffleFilter) Name() string

Name returns the HDF5 filter name.

func (*ShuffleFilter) Remove

func (f *ShuffleFilter) Remove(data []byte) ([]byte, error)

Remove reverses the byte shuffle (unshuffle).

This operation reverses Apply, restoring the original byte order.

Example with elementSize=4, 3 elements:

Input:  [a1 b1 c1][a2 b2 c2][a3 b3 c3][a4 b4 c4]
Output: [a1 a2 a3 a4][b1 b2 b3 b4][c1 c2 c3 c4]
