---
title: "Introduction to DICErClust"
author: "Sarah Ayton and Yiye Zhang"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{Introduction to DICErClust}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
## Chunks that call DICEr() or load torch are skipped on CRAN
## (they require the torch system library and ~160 MB of weights).
## Set NOT_CRAN=true locally, or use devtools::check() which sets it automatically.
knitr::opts_chunk$set(
  collapse = TRUE,
  comment  = "#>",
  eval     = identical(Sys.getenv("NOT_CRAN"), "true")
)
```

## What is DICErClust?

**DICErClust** provides an R implementation of Deep Significance Clustering
(DICE), a self-supervised framework that discovers clinically meaningful patient
subgroups from electronic health record (EHR) data.  Unlike conventional
unsupervised clustering, DICE simultaneously optimises four objectives:

1. **Reconstruction fidelity** — an LSTM autoencoder learns a compact latent
   representation of time-varying continuous features.
2. **Cluster cohesion** — a soft k-means classifier assigns patients to
   clusters in the latent space.
3. **Outcome prediction** — a logistic regression head predicts a binary
   clinical outcome from the cluster assignment and auxiliary demographic
   features.
4. **Statistical significance** — a likelihood-ratio test (LRT) penalty
   ensures at least one cluster pair shows a significantly different outcome
   rate (*p* < 0.05) at the saved checkpoint.

The result is a partition into risk-stratified subgroups that are both
data-driven and statistically validated.

> **Reference:** Huang Y, Du C, Zhu F, *et al.* (2021).
> Self-supervised deep clustering of patient subgroups for heart failure with
> preserved ejection fraction.
> *J Am Med Inform Assoc*, **28**, 2394–2403.
> doi:10.1093/jamia/ocab203

---

## Installation

```{r install, eval = FALSE}
## From a local source tarball:
install.packages(
  "/path/to/DICErClust_0.1.1.tar.gz",
  repos = NULL, type = "source"
)
```

DICErClust depends on the **torch** package for R.  If you have not installed
torch before, run `torch::install_torch()` once after installing the package.

---

## Data format

`DICEr()` reads training and test data from RDS files.  Each file must be a
**length-3 list**:

| Position | Name | Type | Description |
|----------|------|------|-------------|
| `[[1]]` | `data_x` | numeric matrix *n* × *p* | Continuous features — LSTM encoder input |
| `[[2]]` | `data_v` | numeric matrix *n* × *q* | Binary demographic covariates — outcome-head auxiliary input |
| `[[3]]` | `data_y` | integer vector length *n* | Binary outcome (0/1) |

> **Important:** `data_v` must use R `numeric` (float64) storage, not
> `integer`.  The `torch` backend infers tensor dtype from the R storage mode;
> integer columns produce int64 tensors that are incompatible with the float32
> model weights.

```{r data-format}
## Build a minimal synthetic dataset ----------------------------------------
set.seed(42)
n_train <- 120L; n_test <- 40L; p <- 6L; q <- 3L

make_rds <- function(n, path) {
  saveRDS(
    list(
      matrix(runif(n * p), n, p),                   # data_x: continuous
      matrix(as.numeric(rbinom(n * q, 1, 0.5)), n, q), # data_v: binary float
      rbinom(n, 1, 0.3)                             # data_y: outcome
    ),
    path
  )
}

data_dir <- file.path(tempdir(), "dice_intro")
dir.create(data_dir, showWarnings = FALSE)
make_rds(n_train, file.path(data_dir, "train.rds"))
make_rds(n_test,  file.path(data_dir, "test.rds"))

## Verify format
d <- readRDS(file.path(data_dir, "train.rds"))
cat("data_x:", nrow(d[[1]]), "×", ncol(d[[1]]), " storage:", storage.mode(d[[1]]), "\n")
cat("data_v:", nrow(d[[2]]), "×", ncol(d[[2]]), " storage:", storage.mode(d[[2]]), "\n")
cat("data_y: length", length(d[[3]]), " table:", paste(table(d[[3]]), collapse = "/"), "\n")
```

---

## Quick start

```{r train, eval = FALSE}
library(DICErClust)

args <- list(
  seed              = 42L,
  input_path        = data_dir,
  filename_train    = "train.rds",
  filename_test     = "test.rds",
  n_input_fea       = p,       # columns in data_x
  n_hidden_fea      = 3L,      # LSTM latent dimension
  lstm_layer        = 1L,
  lstm_dropout      = 0.0,
  K_clusters        = 2L,      # number of clusters
  n_dummy_demov_fea = q,       # columns in data_v
  cuda              = FALSE,   # set TRUE to use GPU
  lr                = 1e-4,
  init_AE_epoch     = 5L,      # Stage 1 warm-up epochs
  iter              = 20L,     # Stage 2 iterations
  epoch_in_iter     = 2L,
  lambda_AE         = 1.0,
  lambda_classifier = 1.0,
  lambda_outcome    = 1.0,
  lambda_p_value    = 1.0
)

old_wd <- setwd(tempdir())
DICEr(args)            # writes output to hn_3_K_2/part2_AE_nhidden_3/
setwd(old_wd)
```

### Loading results

```{r load-results, eval = FALSE}
part2_dir <- file.path(tempdir(), "hn_3_K_2", "part2_AE_nhidden_3")

res_train <- readRDS(file.path(part2_dir, "data_train_iter.rds"))
res_test  <- readRDS(file.path(part2_dir, "data_test_iter.rds"))

## Cluster assignments
## Training set: use res_train$C   (k-means labels, re-ordered by outcome rate)
## Test set:     use res_test$pred_C (nearest-centroid assignments)
table(res_test$pred_C)
```

---

## Hyperparameters at a glance

| Argument | Default | Effect |
|---|---|---|
| `n_hidden_fea` | — | LSTM latent dimension; controls representation capacity |
| `K_clusters` | — | Number of clusters |
| `init_AE_epoch` | 5 | Stage 1 warm-up length |
| `iter` | 20 | Maximum Stage 2 iterations |
| `epoch_in_iter` | 1 | Gradient-update epochs per iteration |
| `lr` | 1e-4 | Adam learning rate |
| `lambda_AE` | 1.0 | Weight on reconstruction loss |
| `lambda_classifier` | 1.0 | Weight on cluster-assignment loss |
| `lambda_outcome` | 1.0 | Weight on outcome BCE loss |
| `lambda_p_value` | 1.0 | Weight on LRT significance penalty |

All four `lambda` weights are equal at their defaults, giving each objective
equal influence.  The LRT significance threshold (χ²₁, α = 0.05 → 3.841) is
fixed and not user-tunable.

---

## Output directory structure

After a successful run, DICEr creates:

```
<working_dir>/
└── hn_<n_hidden>_K_<K>/
    ├── part1_AE_nhidden_<n>/          # Stage 1 autoencoder outputs
    │   └── part1_loss_AE.png
    └── part2_AE_nhidden_<n>/          # Stage 2 best checkpoint
        ├── data_train_iter.rds        # training set with C assignments
        └── data_test_iter.rds         # test set with pred_C assignments
```

`data_train_iter.rds` and `data_test_iter.rds` are the data lists enriched
with cluster assignment fields:

- `$C` — k-means cluster labels (training set; cluster 0 = highest mortality)
- `$pred_C` — nearest-centroid labels for the test set (**use this**, not `$C`,
  for test-set evaluation)

---

## Full worked example

For a complete end-to-end analysis on the UCI Heart Failure Clinical Records
dataset — including preprocessing, training, cluster evaluation (AUC = 0.823,
χ² = 32.99, *p* < 0.001), and publication-quality figures — see:

```{r vignette-link, eval = FALSE}
vignette("heart-failure-example", package = "DICErClust")
```