--- title: "Introduction to DICErClust" author: "Sarah Ayton and Yiye Zhang" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 2 vignette: > %\VignetteIndexEntry{Introduction to DICErClust} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} ## Chunks that call DICEr() or load torch are skipped on CRAN ## (they require the torch system library and ~160 MB of weights). ## Set NOT_CRAN=true locally, or use devtools::check() which sets it automatically. knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = identical(Sys.getenv("NOT_CRAN"), "true") ) ``` ## What is DICErClust? **DICErClust** provides an R implementation of Deep Significance Clustering (DICE), a self-supervised framework that discovers clinically meaningful patient subgroups from electronic health record (EHR) data. Unlike conventional unsupervised clustering, DICE simultaneously optimises four objectives: 1. **Reconstruction fidelity** — an LSTM autoencoder learns a compact latent representation of time-varying continuous features. 2. **Cluster cohesion** — a soft k-means classifier assigns patients to clusters in the latent space. 3. **Outcome prediction** — a logistic regression head predicts a binary clinical outcome from the cluster assignment and auxiliary demographic features. 4. **Statistical significance** — a likelihood-ratio test (LRT) penalty ensures at least one cluster pair shows a significantly different outcome rate (*p* < 0.05) at the saved checkpoint. The result is a partition into risk-stratified subgroups that are both data-driven and statistically validated. > **Reference:** Huang Y, Du C, Zhu F, *et al.* (2021). > Self-supervised deep clustering of patient subgroups for heart failure with > preserved ejection fraction. > *J Am Med Inform Assoc*, **28**, 2394–2403. > doi:10.1093/jamia/ocab203 --- ## Installation ```{r install, eval = FALSE} ## From a local source tarball: install.packages( "/path/to/DICErClust_0.1.1.tar.gz", repos = NULL, type = "source" ) ``` DICErClust depends on the **torch** package for R. If you have not installed torch before, run `torch::install_torch()` once after installing the package. --- ## Data format `DICEr()` reads training and test data from RDS files. Each file must be a **length-3 list**: | Position | Name | Type | Description | |----------|------|------|-------------| | `[[1]]` | `data_x` | numeric matrix *n* × *p* | Continuous features — LSTM encoder input | | `[[2]]` | `data_v` | numeric matrix *n* × *q* | Binary demographic covariates — outcome-head auxiliary input | | `[[3]]` | `data_y` | integer vector length *n* | Binary outcome (0/1) | > **Important:** `data_v` must use R `numeric` (float64) storage, not > `integer`. The `torch` backend infers tensor dtype from the R storage mode; > integer columns produce int64 tensors that are incompatible with the float32 > model weights. ```{r data-format} ## Build a minimal synthetic dataset ---------------------------------------- set.seed(42) n_train <- 120L; n_test <- 40L; p <- 6L; q <- 3L make_rds <- function(n, path) { saveRDS( list( matrix(runif(n * p), n, p), # data_x: continuous matrix(as.numeric(rbinom(n * q, 1, 0.5)), n, q), # data_v: binary float rbinom(n, 1, 0.3) # data_y: outcome ), path ) } data_dir <- file.path(tempdir(), "dice_intro") dir.create(data_dir, showWarnings = FALSE) make_rds(n_train, file.path(data_dir, "train.rds")) make_rds(n_test, file.path(data_dir, "test.rds")) ## Verify format d <- readRDS(file.path(data_dir, "train.rds")) cat("data_x:", nrow(d[[1]]), "×", ncol(d[[1]]), " storage:", storage.mode(d[[1]]), "\n") cat("data_v:", nrow(d[[2]]), "×", ncol(d[[2]]), " storage:", storage.mode(d[[2]]), "\n") cat("data_y: length", length(d[[3]]), " table:", paste(table(d[[3]]), collapse = "/"), "\n") ``` --- ## Quick start ```{r train, eval = FALSE} library(DICErClust) args <- list( seed = 42L, input_path = data_dir, filename_train = "train.rds", filename_test = "test.rds", n_input_fea = p, # columns in data_x n_hidden_fea = 3L, # LSTM latent dimension lstm_layer = 1L, lstm_dropout = 0.0, K_clusters = 2L, # number of clusters n_dummy_demov_fea = q, # columns in data_v cuda = FALSE, # set TRUE to use GPU lr = 1e-4, init_AE_epoch = 5L, # Stage 1 warm-up epochs iter = 20L, # Stage 2 iterations epoch_in_iter = 2L, lambda_AE = 1.0, lambda_classifier = 1.0, lambda_outcome = 1.0, lambda_p_value = 1.0 ) old_wd <- setwd(tempdir()) DICEr(args) # writes output to hn_3_K_2/part2_AE_nhidden_3/ setwd(old_wd) ``` ### Loading results ```{r load-results, eval = FALSE} part2_dir <- file.path(tempdir(), "hn_3_K_2", "part2_AE_nhidden_3") res_train <- readRDS(file.path(part2_dir, "data_train_iter.rds")) res_test <- readRDS(file.path(part2_dir, "data_test_iter.rds")) ## Cluster assignments ## Training set: use res_train$C (k-means labels, re-ordered by outcome rate) ## Test set: use res_test$pred_C (nearest-centroid assignments) table(res_test$pred_C) ``` --- ## Hyperparameters at a glance | Argument | Default | Effect | |---|---|---| | `n_hidden_fea` | — | LSTM latent dimension; controls representation capacity | | `K_clusters` | — | Number of clusters | | `init_AE_epoch` | 5 | Stage 1 warm-up length | | `iter` | 20 | Maximum Stage 2 iterations | | `epoch_in_iter` | 1 | Gradient-update epochs per iteration | | `lr` | 1e-4 | Adam learning rate | | `lambda_AE` | 1.0 | Weight on reconstruction loss | | `lambda_classifier` | 1.0 | Weight on cluster-assignment loss | | `lambda_outcome` | 1.0 | Weight on outcome BCE loss | | `lambda_p_value` | 1.0 | Weight on LRT significance penalty | All four `lambda` weights are equal at their defaults, giving each objective equal influence. The LRT significance threshold (χ²₁, α = 0.05 → 3.841) is fixed and not user-tunable. --- ## Output directory structure After a successful run, DICEr creates: ``` / └── hn__K_/ ├── part1_AE_nhidden_/ # Stage 1 autoencoder outputs │ └── part1_loss_AE.png └── part2_AE_nhidden_/ # Stage 2 best checkpoint ├── data_train_iter.rds # training set with C assignments └── data_test_iter.rds # test set with pred_C assignments ``` `data_train_iter.rds` and `data_test_iter.rds` are the data lists enriched with cluster assignment fields: - `$C` — k-means cluster labels (training set; cluster 0 = highest mortality) - `$pred_C` — nearest-centroid labels for the test set (**use this**, not `$C`, for test-set evaluation) --- ## Full worked example For a complete end-to-end analysis on the UCI Heart Failure Clinical Records dataset — including preprocessing, training, cluster evaluation (AUC = 0.823, χ² = 32.99, *p* < 0.001), and publication-quality figures — see: ```{r vignette-link, eval = FALSE} vignette("heart-failure-example", package = "DICErClust") ```