---
title: "Working with HDF5-Backed Matrices in BigDataStatMeth"
author:
- name: Dolors Pelegri-Siso
  affiliation:
  - &uab Universitat Autonoma de Barcelona (UAB)
  - &isglobal ISGlobal, Centre for Research in Environmental Epidemiology
    (ISGlobal)
  - &brge Bioinformatics Research Group in Epidemiology (BRGE)
  email: dolors.pelegri@isglobal.org
- name: Juan R. Gonzalez
  affiliation:
  - *uab
  - *isglobal
  - *brge
  email: juanr.gonzalez@isglobal.org
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    number_sections: yes
    toc: yes
    fig_caption: yes
    toc_float: yes
abstract: |
  This vignette illustrates how to work with HDF5-backed matrices using
  the S3 interface provided by the BigDataStatMeth package. It covers
  matrix creation, data import, subsetting, algebraic and statistical
  operations, matrix decompositions, global options, compression,
  HDF5 file-space management, and resource handling.
vignette: |
  %\VignetteIndexEntry{Working with HDF5-Backed Matrices in BigDataStatMeth}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
bibliography: references.bib
---

```{r setup, include = FALSE}
library(knitr)
library(BiocStyle)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "",
  cache = FALSE,
  message = FALSE,
  warning = FALSE,
  width = 80,
  crop = NULL
)
```

# Overview

`BigDataStatMeth` provides statistical and linear algebra operations for
matrices stored in HDF5 files. The package is designed for workflows in
which matrices may be too large to be held entirely in memory, while
still allowing users to work with familiar R functions.

The examples in this vignette use small matrices so that they can be run
quickly and compared with standard in-memory R results. The same
interface is intended for larger HDF5-backed matrices. All files created
in the examples are written to temporary locations.

```{r load-package}
library(BigDataStatMeth)
```

# User-facing interface and global options

The recommended user-facing interface is based on `HDF5Matrix` objects
and standard R generics. Thus, HDF5-backed matrices can be manipulated
using calls such as `dim()`, `[`, `%*%`, `crossprod()`, `scale()`,
`cor()`, `svd()`, `qr()`, and `prcomp()`.

Internally, `BigDataStatMeth` uses an R6 and C++ backend. Advanced users
may interact with lower-level interfaces when needed, but this vignette
focuses on the S3 interface because it allows code to remain close to
ordinary R matrix workflows.

The following table summarizes the main user-facing functionality
available for `HDF5Matrix` objects. It is not an exhaustive list of all
exported functions; rather, it highlights the standard R-style methods
and high-level helpers used throughout this vignette. Additional
domain-specific, lower-level, and compatibility functions are documented
in their corresponding help pages.

| Category | Representative calls |
|:---|:---|
| Core object handling | `hdf5_create_matrix()`, `hdf5_matrix()`, `dim()`, `nrow()`, `ncol()`, `is_open()`, `close()` |
| HDF5 inspection and I/O | `list_datasets()`, `hdf5_import()`, `hdf5_import_multiple()`, `as.matrix()`, `as.data.frame()` |
| Subsetting and assignment | `X[i, j]`, `X[i, j] <- value` |
| Dimension names | `rownames()`, `colnames()`, `dimnames()` |
| Element-wise arithmetic | `X + Y`, `X - Y`, `X * Y`, `X / Y` |
| Matrix algebra | `%*%`, `crossprod()`, `tcrossprod()` |
| Binding | `cbind()`, `rbind()` |
| Aggregations | `colSums()`, `rowSums()`, `colMeans()`, `rowMeans()`, `colVars()`, `rowVars()`, `colSds()`, `rowSds()`, `colMins()`, `rowMins()`, `colMaxs()`, `rowMaxs()` |
| Scalar summaries | `mean()`, `var()`, `sd()` |
| Normalization and transformations | `scale()`, `sweep()` |
| Correlation | `cor()` |
| Matrix decompositions | `svd()`, `prcomp()`, `eigen()`, `pseudoinverse()` |
| Factorizations and solvers | `qr()`, `chol()`, `solve()` |
| Diagonal operations | `diag()`, `diag<-()`, `diag_op()`, `diag_scale()` |
| Split, reduce, and apply | `split_dataset()`, `split()`, `reduce()`, `apply_function()` |
| Resource management and options | `hdf5matrix_options()`, `show_hdf5matrix_options()`, `hdf5_close_all()` |
| Specialized high-level helpers | selected `bd*` functions for utilities without a direct standard R generic, such as `bdCreate_hdf5_group()`, `bdmove_hdf5_dataset()`, and `bdWrite_hdf5_dimnames()` |

Most examples in this vignette use the standard `HDF5Matrix`/S3
interface. Some high-level helper functions keep the `bd*` prefix
because they provide specialized functionality that does not map
directly to an existing base R generic. These functions remain part of
the package API and are documented in their corresponding help pages.

## Global options

Some S3 operators, such as element-wise arithmetic, follow the standard
R syntax and do not expose additional arguments in the function call.
Global options make it possible to configure block-wise processing,
parallelization, the number of threads, and compression for such
operations.

```{r options-view}
old_opts <- hdf5matrix_options()
old_opts
```

For example, the following settings enable parallel execution with two
threads, use a fixed block size, and set the compression level used for
new output datasets when not explicitly specified by a method call.

```{r options-set}
hdf5matrix_options(
  paral = TRUE,
  threads = 2L,
  block_size = 512L,
  compression = 6L
)

hdf5matrix_options()
```

When an individual method provides explicit arguments, those arguments
should be used for local control of that operation. The global options
are especially useful for generic operators where the standard R call is
kept unchanged. Additional operation-specific parameters are documented
in the corresponding help pages, for example `?svd.HDF5Matrix`,
`?prcomp.HDF5Matrix`, `?qr.HDF5Matrix`, and `?hdf5matrix_options`.

# Creating, opening, and inspecting HDF5 datasets

The function `hdf5_create_matrix()` creates an HDF5 dataset and returns
an `HDF5Matrix` object pointing to it. In the example below, a standard
R matrix is written to an HDF5 file and then manipulated through the S3
interface.

```{r create-matrix}
h5file <- tempfile(fileext = ".h5")

set.seed(123)
X <- matrix(rnorm(500 * 100), nrow = 500, ncol = 100)

X_h5 <- hdf5_create_matrix(
  filename = h5file,
  dataset = "data/X",
  data = X,
  overwrite = TRUE
)

X_h5
dim(X_h5)
nrow(X_h5)
ncol(X_h5)
```

The datasets stored in an HDF5 file can be listed with
`list_datasets()`. Existing datasets can be reopened with
`hdf5_matrix()`.

```{r open-matrix}
list_datasets(h5file)

X_h5_reopened <- hdf5_matrix(
  filename = h5file,
  path = "data/X"
)

dim(X_h5_reopened)
```

Operations that return an `HDF5Matrix` object write their results to an
HDF5 dataset. The printed object shows the file and dataset path. Some
methods, such as `crossprod()` and `tcrossprod()`, allow the output group
and dataset name to be specified explicitly. Other S3 operators, such as
element-wise arithmetic, currently use package-defined output names.

# Importing tabular data

Delimited text files can be imported directly into HDF5 with
`hdf5_import()`. The following example uses the small cholesterol data
file distributed with the package.

In the final package layout, the file should be available under
`inst/extdata/colesterol.csv`. The fallback below is included only to
allow this vignette to be tested against older source layouts in which
the same file was stored elsewhere.

```{r import-tabular-data}
# Generate reproducible example data matching the cholesterol dataset structure
set.seed(4214123)
n <- 1000L

colesterol <- data.frame(
  TCholesterol = rnorm(n, mean = 195.8, sd = 30.1),
  Age          = rnorm(n, mean = 58.4,  sd = 9.9),
  Insulin      = rnorm(n, mean = 12.5,  sd = 7.4),
  Creatinine   = rnorm(n, mean = 0.798, sd = 0.098),
  BUN          = rnorm(n, mean = 14.6,  sd = 3.4),
  LLDR         = rnorm(n, mean = 0.981, sd = 0.450),
  Triglycerides= rnorm(n, mean = 136.6, sd = 55.1),
  HDL_C        = rnorm(n, mean = 48.2,  sd = 9.2),
  LDL_C        = rnorm(n, mean = 116.4, sd = 25.5),
  Sex          = rbinom(n, 1, prob = 0.58)
)

csv_file <- tempfile(fileext = ".csv")
write.csv(colesterol, csv_file, row.names = FALSE)

h5_csv <- tempfile(fileext = ".h5")

cholesterol_h5 <- hdf5_import(
  source = csv_file,
  filename = h5_csv,
  dataset = "cholesterol/data",
  sep = ",",
  header = TRUE,
  overwrite = TRUE
)

cholesterol_h5
dim(cholesterol_h5)
cholesterol_h5[1:5, 1:min(6, ncol(cholesterol_h5))]
```

# Subsetting, assignment, dimnames, and conversion to memory

The `[` operator can be used to read subsets from an `HDF5Matrix`
object. Only the requested rows and columns are read.

```{r subsetting}
X_h5[1:5, 1:6]

sub_X <- X_h5[1:20, 1:10]
dim(sub_X)
```

The replacement form `[<-` writes values back to the HDF5 dataset.
Changes are applied directly to the data stored on disk.

```{r assignment}
X_edit <- hdf5_create_matrix(
  h5file,
  "data/X_edit",
  data = X[1:10, 1:6],
  overwrite = TRUE
)

X_edit[1, 1] <- 999
X_edit[1:3, 1:3]
```

Dimension names can also be stored and retrieved with the usual R
accessors.

```{r dimnames-example}
DN_h5 <- hdf5_create_matrix(
  h5file,
  "data/dimnames_example",
  data = matrix(seq_len(12), nrow = 4, ncol = 3),
  overwrite = TRUE
)

rownames(DN_h5) <- paste0("sample_", seq_len(nrow(DN_h5)))
colnames(DN_h5) <- paste0("feature_", seq_len(ncol(DN_h5)))

rownames(DN_h5)
colnames(DN_h5)
dimnames(DN_h5)
```

The function `as.matrix()` reads an HDF5-backed matrix into memory. It
is useful for small examples or final results, but should be used with
care for very large datasets.

```{r convert-to-memory}
X_small <- as.matrix(X_h5[1:10, 1:5])
X_small
```

# Matrix algebra

## Element-wise arithmetic

Element-wise arithmetic between `HDF5Matrix` objects can be performed
using the standard arithmetic operators. The results are stored as new
HDF5-backed matrices.

```{r arithmetic}
set.seed(1)

A <- matrix(rnorm(300 * 80), nrow = 300, ncol = 80)
B <- matrix(rnorm(300 * 80), nrow = 300, ncol = 80)
C <- matrix(rnorm(80 * 40), nrow = 80, ncol = 40)

A_h5 <- hdf5_create_matrix(
  h5file, "data/A", data = A,
  overwrite = TRUE
)

B_h5 <- hdf5_create_matrix(
  h5file, "data/B", data = B,
  overwrite = TRUE
)

C_h5 <- hdf5_create_matrix(
  h5file, "data/C", data = C,
  overwrite = TRUE
)

S_h5 <- A_h5 + B_h5
D_h5 <- A_h5 - B_h5

S_h5
dim(S_h5)

all.equal(as.matrix(S_h5), A + B)
all.equal(as.matrix(D_h5), A - B)
```

The output datasets can be inspected directly from the file. This is
useful when results need to be reopened later in the workflow.

```{r arithmetic-output-location}
list_datasets(h5file, group = "OUTPUT", recursive = TRUE)
```

## Matrix multiplication

Matrix multiplication uses the same `%*%` syntax as ordinary R matrices,
while the computation is performed block-wise on the HDF5-backed data.

```{r multiplication}
M_h5 <- A_h5 %*% C_h5

M_h5
dim(M_h5)
all.equal(as.matrix(M_h5), A %*% C)
```

## Crossproducts

Crossproducts are available through the standard R calls. These methods
also allow the output group and dataset name to be specified explicitly.

```{r crossproducts}
XtX_h5 <- crossprod(
  A_h5,
  outgroup = "RESULTS",
  outdataset = "A_crossprod"
)

XXt_h5 <- tcrossprod(
  A_h5,
  outgroup = "RESULTS",
  outdataset = "A_tcrossprod"
)

XtX_h5
list_datasets(h5file, group = "RESULTS", recursive = TRUE)

all.equal(as.matrix(XtX_h5), crossprod(A))
all.equal(as.matrix(XXt_h5), tcrossprod(A))
```

## Binding matrices

The functions `cbind()` and `rbind()` can be used to combine compatible
HDF5-backed matrices by columns or rows.

```{r bind-example}
A1_h5 <- hdf5_create_matrix(
  h5file,
  "bind/A1",
  data = A[1:50, 1:10],
  overwrite = TRUE
)

A2_h5 <- hdf5_create_matrix(
  h5file,
  "bind/A2",
  data = A[1:50, 11:20],
  overwrite = TRUE
)

Cbind_h5 <- cbind(
  A1_h5, A2_h5,
  out_group = "BIND_RESULTS",
  out_dataset = "A1_A2_cbind",
  overwrite = TRUE
)

Rbind_h5 <- rbind(
  A1_h5, A1_h5,
  out_group = "BIND_RESULTS",
  out_dataset = "A1_A1_rbind",
  overwrite = TRUE
)

dim(Cbind_h5)
dim(Rbind_h5)

all.equal(as.matrix(Cbind_h5), cbind(A[1:50, 1:10], A[1:50, 11:20]))
all.equal(as.matrix(Rbind_h5), rbind(A[1:50, 1:10], A[1:50, 1:10]))
```

# Statistical operations

Many summary operations return standard R vectors or scalars. This is
convenient when the output is small relative to the input matrix.

```{r aggregations}
all.equal(colMeans(A_h5), colMeans(A))
all.equal(rowSums(A_h5), rowSums(A))

all.equal(colVars(A_h5), apply(A, 2, var))
all.equal(rowSds(A_h5), apply(A, 1, sd))
```

The `scale()` method centers and/or scales an HDF5-backed matrix by
blocks and stores the result on disk.

```{r scaling}
A_scaled_h5 <- scale(A_h5)
A_scaled <- scale(A)

all.equal(
  as.matrix(A_scaled_h5),
  A_scaled,
  check.attributes = FALSE
)
```

Correlation matrices can be computed directly from an `HDF5Matrix`.

```{r correlation}
Cor_h5 <- cor(A_h5)

all.equal(
  as.matrix(Cor_h5),
  cor(A),
  tolerance = 1e-6
)
```

Some operations can be combined with small HDF5-backed vectors. The
following example subtracts column means from each column using
`sweep()`.

```{r sweep-example}
col_means_h5 <- hdf5_create_matrix(
  h5file,
  "stats/A_col_means",
  data = matrix(colMeans(A), nrow = 1),
  overwrite = TRUE
)

A_centered_h5 <- sweep(A_h5, MARGIN = 2, STATS = col_means_h5, FUN = "-")

all.equal(
  as.matrix(A_centered_h5),
  sweep(A, MARGIN = 2, STATS = colMeans(A), FUN = "-"),
  check.attributes = FALSE
)
```

# Matrix decompositions

`BigDataStatMeth` implements several matrix decompositions for
HDF5-backed matrices. The examples below use small matrices to validate
results against standard R computations.

## Singular value decomposition

The SVD method can center and scale the input before decomposition. The
returned singular values are stored in memory, while singular vectors are
stored as HDF5-backed matrices.

```{r svd-example}
set.seed(123)
X_svd <- matrix(rnorm(120 * 300), nrow = 120, ncol = 300)

X_svd_h5 <- hdf5_create_matrix(
  h5file,
  "decomp/X_svd",
  data = X_svd,
  overwrite = TRUE
)

svd_h5 <- svd(
  X_svd_h5,
  nu = 10,
  nv = 10,
  center = TRUE,
  scale = TRUE,
  overwrite = TRUE
)

head(svd_h5$d)
dim(svd_h5$u)
dim(svd_h5$v)
```

The first singular values can be compared with the corresponding
in-memory R calculation.

```{r svd-validation}
svd_r <- svd(scale(X_svd), nu = 10, nv = 10)

all.equal(
  svd_h5$d[1:10],
  svd_r$d[1:10],
  tolerance = 1e-6
)
```

For large matrices, `svd()` can use a block-wise strategy. The parameter
`k` controls the number of local SVDs per incremental level, whereas
`q` controls the number of incremental levels used to combine
intermediate decompositions. Larger values may reduce the size of each
local problem, but they can also increase overhead. Therefore, these
parameters should be selected according to the matrix dimensions and the
available hardware. The argument `threads` controls the number of
threads used by parallel parts of the computation.

The arguments `nu` and `nv` follow the same idea as in `base::svd()`: 
`nu` controls how many left singular vectors are returned, and `nv`
controls how many right singular vectors are returned. Requesting only
the leading components is useful when the complete set of singular
vectors is not needed.

```{r block-svd-example}
svd_blk_h5 <- svd(
  X_svd_h5,
  nu = 5,
  nv = 5,
  k = 4,
  q = 1,
  threads = 2,
  overwrite = TRUE
)

head(svd_blk_h5$d)
dim(svd_blk_h5$u)
dim(svd_blk_h5$v)
```

## Principal component analysis

The `prcomp()` method provides a PCA interface for HDF5-backed matrices.
The result includes standard PCA summaries together with HDF5-backed
matrices for quantities such as scores and loadings. The argument
`ncomponents` controls the number of principal components computed; a
value of `0` requests all available components.

```{r pca-example}
set.seed(124)

n <- 180
p <- 40
group <- rep(c("Group 1", "Group 2", "Group 3"), each = n / 3)

X_pca <- matrix(rnorm(n * p), nrow = n, ncol = p)
X_pca[group == "Group 2", 1:8] <- X_pca[group == "Group 2", 1:8] + 1.5
X_pca[group == "Group 3", 9:16] <- X_pca[group == "Group 3", 9:16] - 1.5

X_pca_h5 <- hdf5_create_matrix(
  h5file,
  "decomp/X_pca",
  data = X_pca,
  overwrite = TRUE
)

pca_h5 <- prcomp(
  X_pca_h5,
  center = TRUE,
  scale. = TRUE,
  ncomponents = 5,
  overwrite = TRUE
)

pca_h5
head(pca_h5$sdev)
```

The PCA scores are stored in the `x` component of the returned object.
When `x` is an HDF5-backed matrix, the required columns can be subsetted
and converted to memory for visualization. Here the synthetic groups are
used only to make the low-dimensional structure visible.

```{r pca-plot, fig.width = 7, fig.height = 5, fig.cap = "PCA scores computed from an HDF5-backed matrix."}
class(pca_h5$x)
dim(pca_h5$x)

pca_scores <- as.matrix(pca_h5$x[, 1:2])
pca_df <- data.frame(
  PC1 = pca_scores[, 1],
  PC2 = pca_scores[, 2],
  group = group
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  ggplot2::ggplot(pca_df, ggplot2::aes(PC1, PC2, colour = group)) +
    ggplot2::geom_point(size = 2.4, alpha = 0.85) +
    ggplot2::stat_ellipse(linewidth = 0.6, alpha = 0.7) +
    ggplot2::theme_minimal(base_size = 12) +
    ggplot2::theme(
      legend.position = "top",
      panel.grid.minor = ggplot2::element_blank()
    ) +
    ggplot2::labs(
      title = "PCA of an HDF5-backed matrix",
      subtitle = "Scores returned by prcomp.HDF5Matrix()",
      x = "PC1",
      y = "PC2",
      colour = "Synthetic group"
    )
} else {
  plot(
    pca_df$PC1,
    pca_df$PC2,
    pch = 19,
    xlab = "PC1",
    ylab = "PC2",
    main = "PCA of an HDF5-backed matrix"
  )
}
```

## QR decomposition

The QR decomposition returns HDF5-backed `Q` and `R` matrices. The
argument `thin` controls whether a reduced decomposition is returned.
With `thin = TRUE`, `Q` has only the columns needed to reconstruct the
input matrix, which is usually preferable for tall matrices because it
avoids storing unnecessary columns. With `thin = FALSE`, a full `Q`
matrix is returned.

Rather than comparing the signs of individual columns, the example
validates the factorization and the orthogonality of `Q`.

```{r qr-example}
QR_h5 <- qr(A_h5, thin = TRUE, overwrite = TRUE)

Q <- as.matrix(QR_h5$Q)
R <- as.matrix(QR_h5$R)

all.equal(Q %*% R, A, tolerance = 1e-6)
all.equal(crossprod(Q), diag(ncol(Q)), tolerance = 1e-6)
```

For tall-skinny matrices, the QR method can use the TSQR path. The
`method = "auto"` setting selects the backend according to the matrix
shape, while `method = "tsqr"` requests the tall-skinny algorithm
explicitly.

```{r tsqr-example}
X_tsqr <- matrix(rnorm(600 * 30), nrow = 600, ncol = 30)
X_tsqr_h5 <- hdf5_create_matrix(
  h5file,
  "decomp/X_tsqr",
  data = X_tsqr,
  overwrite = TRUE
)

QR_tsqr_h5 <- qr(
  X_tsqr_h5,
  thin = TRUE,
  method = "tsqr",
  threads = 2L,
  overwrite = TRUE
)

dim(QR_tsqr_h5$Q)
dim(QR_tsqr_h5$R)
```

## Cholesky decomposition and inverse

For symmetric positive-definite matrices, `chol()` and `solve()` can be
used directly on `HDF5Matrix` objects.

```{r chol-solve-example}
set.seed(321)
Z <- matrix(rnorm(200 * 50), nrow = 200, ncol = 50)
SPD <- crossprod(Z) + diag(1e-6, 50)

SPD_h5 <- hdf5_create_matrix(
  h5file,
  "decomp/SPD",
  data = SPD,
  overwrite = TRUE
)

Ch_h5 <- chol(SPD_h5, overwrite = TRUE)
Inv_h5 <- solve(SPD_h5, overwrite = TRUE)

Ch <- as.matrix(Ch_h5)
chol_error <- min(
  max(abs(crossprod(Ch) - SPD)),
  max(abs(tcrossprod(Ch) - SPD))
)
chol_error < 1e-6

all.equal(as.matrix(Inv_h5), solve(SPD), tolerance = 1e-6)
```

## Eigen decomposition and pseudoinverse

Additional decompositions are available through S3 methods. For example,
`eigen()` can be applied to symmetric HDF5-backed matrices, and
`pseudoinverse()` computes the Moore--Penrose pseudoinverse.

```{r eigen-pseudoinverse}
Eig_h5 <- eigen(
  SPD_h5,
  symmetric = TRUE,
  k = 5L,
  overwrite = TRUE
)

head(Eig_h5$values)
dim(Eig_h5$vectors)

Pinv_h5 <- pseudoinverse(
  A_h5,
  overwrite = TRUE
)

dim(Pinv_h5)
```

# Compression and HDF5 file space management

## Compression

HDF5 datasets can be stored with different compression levels. Lower
compression levels generally reduce writing time but use more disk
space. Higher compression levels may reduce file size but can increase
the time needed to write and read data. The default value in
`BigDataStatMeth` is `compression = 6`, which provides a balanced
setting for many workflows.

The following small example illustrates the trade-off between writing
time and file size. It is not intended as a formal benchmark.

```{r compression-example}
set.seed(123)
X_cmp <- round(matrix(rnorm(2500 * 250), nrow = 2500, ncol = 250), 2)

f0 <- tempfile(fileext = ".h5")
f6 <- tempfile(fileext = ".h5")

t0 <- system.time(
  hdf5_create_matrix(
    f0, "data/X",
    data = X_cmp,
    compression = 0,
    overwrite = TRUE
  )
)

t6 <- system.time(
  hdf5_create_matrix(
    f6, "data/X",
    data = X_cmp,
    compression = 6,
    overwrite = TRUE
  )
)

data.frame(
  compression = c(0, 6),
  elapsed = c(t0[["elapsed"]], t6[["elapsed"]]),
  file_size_MB = round(file.info(c(f0, f6))$size / 1024^2, 3)
)
```

Formal performance comparisons should be run on the target hardware and
with representative datasets.

## HDF5 file space management

When datasets are deleted or overwritten in an HDF5 file, the physical
file size on disk does not always decrease immediately. `BigDataStatMeth`
creates HDF5 files using a file space management strategy that allows
free space inside the file to be tracked and reused by subsequent writes.
Thus, space released by removed datasets can be reused within the same
file instead of unnecessarily increasing the file size after repeated
operations.

This behavior helps control file growth during workflows that create,
overwrite, or remove intermediate datasets. Nevertheless, reducing the
physical size of an existing HDF5 file may require repacking the file.
The HDF Group provides external command-line tools for this purpose. In
particular, the [`h5repack` user guide](https://support.hdfgroup.org/documentation/hdf5/latest/_h5_t_o_o_l__r_p__u_g.html)
describes how to rewrite an HDF5 file into a new compacted or
reconfigured file. This is an external HDF5 utility and is not executed
from this vignette.

# Managing HDF5 resources

HDF5-backed objects keep file handles open while they are in use. In
ordinary interactive workflows these handles are released when objects
are garbage-collected. Users can also close objects explicitly.

```{r close-single-object}
close(X_h5_reopened)
```

The function `hdf5_close_all()` closes the HDF5 handles currently tracked
by the package, including open file handles. After this call,
HDF5-backed objects that pointed to those handles should be reopened
before further use.

Calling `gc()` may help trigger R finalizers for objects that are no
longer referenced, which can be useful when repeatedly re-running code
during development.

```{r restore-options-and-close-all}
hdf5matrix_options(
  paral = old_opts$paral,
  block_size = old_opts$block_size,
  threads = old_opts$threads,
  compression = old_opts$compression
)

hdf5_close_all()
gc()
```

# Extending BigDataStatMeth through C++

The examples above use the standard R interface, which is the recommended
entry point for most users. However, `BigDataStatMeth` also provides a
C++ infrastructure for developers who need to implement new scalable
methods.

Internally, the package defines C++ classes for managing HDF5 files,
groups, and datasets, together with block-wise routines for linear
algebra and statistical operations. These are the same computational
building blocks used by the R/S3 interface. As a result, developers can
reuse the package infrastructure from Rcpp-based code instead of
implementing HDF5 file management, block iteration, and numerical
operations from scratch.

This makes the package useful not only as an end-user R package, but
also as a development framework for extending HDF5-backed statistical
computing in C++. Since the purpose of this vignette is to introduce the
standard R interface, the C++ API is not discussed in detail here.
Nevertheless, it is an important part of the package architecture and
provides the computational infrastructure behind the R/S3 methods shown
above.

# Session information

```{r session-info}
sessionInfo()
```