--- title: "Beta Diversity" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Beta Diversity} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Introduction Beta diversity is a measure of how different two samples are. The different metrics described in this vignette quantify that difference, referred to as the "distance" or "dissimilarity" between a pair of samples. The distance is typically `0` for identical samples and `1` for completely different samples. ## Input Data We will use the `ex_counts` feature table included with ecodive. It contains the number of observations of each bacterial genera in each sample. In the text below, you can substitute the word 'genera' for the feature of interest in your own data. ```r library(ecodive) counts <- rarefy(ex_counts) t(counts) #> Saliva Gums Nose Stool #> Streptococcus 162 309 6 1 #> Bacteroides 2 2 0 341 #> Corynebacterium 0 0 171 1 #> Haemophilus 180 34 0 1 #> Propionibacterium 1 0 82 0 #> Staphylococcus 0 0 86 1 ``` Looking at the matrix above, you can see that saliva and gums are similar, while saliva and stool are quite different. # Weighted vs Unweighted Metrics Before selecting a formula, it is important to distinguish between **weighted** and **unweighted** metrics. * **Weighted metrics** take relative abundances into account. * **Unweighted metrics** only consider presence/absence (binary data). You can consult `list_metrics()` to see which category a specific metric falls into or to list all available options programmatically. ``` r list_metrics('beta', 'id', weighted = FALSE) #> [1] "sorensen" "hamming" "jaccard" "ochiai" "unweighted_unifrac" list_metrics('beta', 'id', weighted = TRUE) #> [1] "aitchison" "bhattacharyya" "bray" #> [4] "canberra" "chebyshev" "chord" #> [7] "clark" "divergence" "euclidean" #> [10] "generalized_unifrac" "gower" "hellinger" #> [13] "horn" "jensen" "jsd" #> [16] "lorentzian" "manhattan" "matusita" #> [19] "minkowski" "morisita" "motyka" #> [22] "normalized_unifrac" "psym_chisq" "soergel" #> [25] "squared_chisq" "squared_chord" "squared_euclidean" #> [28] "topsoe" "variance_adjusted_unifrac" "wave_hedges" #> [31] "weighted_unifrac" ``` # Formulas The following tables detail the mathematical definitions for the metrics available in `ecodive`. ## Abundance-Based (Weighted) Given: * $n$ : The number of features. * $X_i$, $Y_i$ : Absolute counts for the $i$-th feature in samples $X$ and $Y$. * $X_T$, $Y_T$ : Total counts in each sample. $X_T = \sum_{i=1}^{n} X_i$ * $P_i$, $Q_i$ : Proportional abundances of $X_i$ and $Y_i$. $P_i = X_i / X_T$ * $X_L$, $Y_L$ : Mean log of abundances. $X_L = \frac{1}{n}\sum_{i=1}^{n} \ln{X_i}$ * $R_i$ : The range of the $i$-th feature across all samples (max - min). | | | | :--------------- | :--------------------------------- | | **Aitchison distance**
`aitchison()` | $\sqrt{\sum_{i=1}^{n} [(\ln{X_i} - X_L) - (\ln{Y_i} - Y_L)]^2}$ | | **Bhattacharyya distance**
`bhattacharyya()` | $-\ln{\sum_{i=1}^{n}\sqrt{P_{i}Q_{i}}}$ | | **Bray-Curtis dissimilarity**
`bray()` | $\displaystyle \frac{\sum_{i=1}^{n} |X_i - Y_i|}{\sum_{i=1}^{n} (X_i + Y_i)}$ | | **Canberra distance**
`canberra()` | $\displaystyle \sum_{i=1}^{n} \frac{|X_i - Y_i|}{X_i + Y_i}$ | | **Chebyshev distance**
`chebyshev()` | $\max(|X_i - Y_i|)$ | | **Chord distance**
`chord()` | $\displaystyle \sqrt{\sum_{i=1}^{n} \left(\frac{X_i}{\sqrt{\sum_{j=1}^{n} X_j^2}} - \frac{Y_i}{\sqrt{\sum_{j=1}^{n} Y_j^2}}\right)^2}$ | | **Clark's divergence distance**
`clark()` | $\displaystyle \sqrt{\sum_{i=1}^{n}\left(\frac{X_i - Y_i}{X_i + Y_i}\right)^{2}}$ | | **Divergence**
`divergence()` | $\displaystyle 2\sum_{i=1}^{n} \frac{(P_i - Q_i)^2}{(P_i + Q_i)^2}$ | | **Euclidean distance**
`euclidean()` | $\sqrt{\sum_{i=1}^{n} (X_i - Y_i)^2}$ | | **Gower distance**
`gower()` | $\displaystyle \frac{1}{n}\sum_{i=1}^{n}\frac{|X_i - Y_i|}{R_i}$ | | **Hellinger distance**
`hellinger()` | $\sqrt{\sum_{i=1}^{n}(\sqrt{P_i} - \sqrt{Q_i})^{2}}$ | | **Horn-Morisita dissimilarity**
`horn()` | $\displaystyle 1 - \frac{2\sum_{i=1}^{n}P_{i}Q_{i}}{\sum_{i=1}^{n}P_i^2 + \sum_{i=1}^{n}Q_i^2}$ | | **Jensen-Shannon distance**
`jensen()` | $\displaystyle \sqrt{\frac{1}{2}\left[\sum_{i=1}^{n}P_i\ln\left(\frac{2P_i}{P_i + Q_i}\right) + \sum_{i=1}^{n}Q_i\ln\left(\frac{2Q_i}{P_i + Q_i}\right)\right]}$ | | **Jensen-Shannon divergence (JSD)**
`jsd()` | $\displaystyle \frac{1}{2}\left[\sum_{i=1}^{n}P_i\ln\left(\frac{2P_i}{P_i + Q_i}\right) + \sum_{i=1}^{n}Q_i\ln\left(\frac{2Q_i}{P_i + Q_i}\right)\right]$ | | **Lorentzian distance**
`lorentzian()` | $\sum_{i=1}^{n}\ln{(1 + |X_i - Y_i|)}$ | | **Manhattan distance**
`manhattan()` | $\sum_{i=1}^{n} |X_i - Y_i|$ | | **Matusita distance**
`matusita()` | $\sqrt{\sum_{i=1}^{n}\left(\sqrt{P_i} - \sqrt{Q_i}\right)^2}$ | | **Minkowski distance**
`minkowski()` | $\sqrt[p]{\sum_{i=1}^{n} (X_i - Y_i)^p}$
Where $p$ is the geometry of the space. | | **Morisita dissimilarity**
(integers only)
`morisita()` | $\displaystyle 1 - \frac{2\sum_{i=1}^{n}X_{i}Y_{i}}{\displaystyle \left(\frac{\sum_{i=1}^{n}X_i(X_i - 1)}{X_T(X_T - 1)} + \frac{\sum_{i=1}^{n}Y_i(Y_i - 1)}{Y_T(Y_T - 1)}\right)X_{T}Y_{T}}$ | | **Motyka dissimilarity**
`motyka()` | $\displaystyle \frac{\sum_{i=1}^{n} \max(X_i, Y_i)}{\sum_{i=1}^{n} (X_i + Y_i)}$ | | **Probabilistic Symmetric $\chi^2$ distance**
`psym_chisq()` | $\displaystyle 2\sum_{i=1}^{n}\frac{(P_i - Q_i)^2}{P_i + Q_i}$ | | **Soergel distance**
`soergel()` | $\displaystyle \frac{\sum_{i=1}^{n} |X_i - Y_i|}{\sum_{i=1}^{n} \max(X_i, Y_i)}$ | | **Squared $\chi^2$ distance**
`squared_chisq()` | $\displaystyle \sum_{i=1}^{n}\frac{(P_i - Q_i)^2}{P_i + Q_i}$ | | **Squared Chord distance**
`squared_chord()` | $\sum_{i=1}^{n}\left(\sqrt{P_i} - \sqrt{Q_i}\right)^2$ | | **Squared Euclidean distance**
`squared_euclidean()` | $\sum_{i=1}^{n} (X_i - Y_i)^2$ | | **Topsoe distance**
`topsoe()` | $\displaystyle \sum_{i=1}^{n}P_i\ln\left(\frac{2P_i}{P_i + Q_i}\right) + \sum_{i=1}^{n}Q_i\ln\left(\frac{2Q_i}{P_i + Q_i}\right)$ | | **Wave Hedges distance**
`wave_hedges()` | $\displaystyle \sum_{i=1}^{n}\frac{|X_i - Y_i|}{\max(X_i, Y_i)}$ | ## Presence / Absence (Unweighted) Given: * $A$, $B$ : Number of features in each sample. * $J$ : Number of features in common. | | | | :---------------- | :-------------------------- | | **Dice-Sorensen dissimilarity**
`sorensen()` | $\displaystyle \frac{2J}{(A + B)}$ | | **Hamming distance**
`hamming()` | $\displaystyle (A + B) - 2J$ | | **Jaccard distance**
`jaccard()` | $\displaystyle 1 - \frac{J}{(A + B - J)}$ | | **Otsuka-Ochiai dissimilarity**
`ochiai()` | $\displaystyle 1 - \frac{J}{\sqrt{AB}}$ | > **Note:** `sorensen()` is equivalent to `bray(norm = 'binary')`, and `jaccard()` is equivalent to `soergel(norm = 'binary')`. ## Phylogenetic Given $n$ branches with lengths $L$ and a pair of samples' binary ($A$ and $B$) or proportional abundances ($P$ and $Q$) on each of those branches. | | | | :----------- | :----------- | | **Unweighted UniFrac**
`unweighted_unifrac()` | $\displaystyle \frac{1}{n}\sum_{i=1}^{n} L_i|A_i - B_i|$ | | **Weighted UniFrac**
`weighted_unifrac()` | $\displaystyle \sum_{i=1}^{n} L_i|P_i - Q_i|$ | | **Normalized Weighted UniFrac**
`normalized_unifrac()` | $\displaystyle \frac{\sum_{i=1}^{n} L_i|P_i - Q_i|}{\sum_{i=1}^{n} L_i(P_i + Q_i)}$ | | **Generalized UniFrac (GUniFrac)**
`generalized_unifrac()` | $\displaystyle \frac{\sum_{i=1}^{n} L_i(P_i + Q_i)^{\alpha}\left|\displaystyle \frac{P_i - Q_i}{P_i + Q_i}\right|}{\sum_{i=1}^{n} L_i(P_i + Q_i)^{\alpha}}$
Where $\alpha$ is a scalable weighting factor. | | **Variance-Adjusted Weighted UniFrac**
`variance_adjusted_unifrac()` | $\displaystyle \frac{\displaystyle \sum_{i=1}^{n} L_i\displaystyle \frac{|P_i - Q_i|}{\sqrt{(P_i + Q_i)(2 - P_i - Q_i)}} }{\displaystyle \sum_{i=1}^{n} L_i\displaystyle \frac{P_i + Q_i}{\sqrt{(P_i + Q_i)(2 - P_i - Q_i)}} }$ | See https://cmmr.github.io/ecodive/articles/unifrac.html for detailed example UniFrac calculations. # Partial Calculation The default value of `pairs=NULL` in ecodive's beta diversity functions results in the returned all-vs-all distance matrix being completely filled in. ```r bray(counts) #> Saliva Gums Nose #> Gums 0.4260870 #> Nose 0.9797101 0.9826087 #> Stool 0.9884058 0.9884058 0.9913043 ``` If you are doing a reference-vs-all comparison, you can use the `pairs` parameter to skip unwanted calculations and save some CPU time. The larger the dataset, the more noticeable the improvement will be. ```r bray(counts, pairs = 1:3) #> Saliva Gums Nose #> Gums 0.4260870 #> Nose 0.9797101 NA #> Stool 0.9884058 NA NA ``` The `pairs` argument can be: * A numeric vector, giving the positions in the result to calculate. * A logical vector, indicating whether to calculate a position in the result. * A `function(i,j)` that returns whether rows `i` and `j` should be compared. Therefore, all of the following are equivalent: ```r bray(counts, pairs = 1:3) bray(counts, pairs = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)) bray(counts, pairs = function (i, j) i == 1) ``` The ordering of `pairs` follows the pairings produced by `combn()`. ```r # Column index pairings combn(nrow(counts), 2) #> [,1] [,2] [,3] [,4] [,5] [,6] #> [1,] 1 1 1 2 2 3 #> [2,] 2 3 4 3 4 4 # Sample name pairings combn(rownames(counts), 2) #> [,1] [,2] [,3] [,4] [,5] [,6] #> [1,] "Saliva" "Saliva" "Saliva" "Gums" "Gums" "Nose" #> [2,] "Gums" "Nose" "Stool" "Nose" "Stool" "Stool" ``` So, for instance, to use gums as the reference sample: ```r my_combn <- combn(rownames(counts), 2) my_pairs <- my_combn[1,] == 'Gums' | my_combn[2,] == 'Gums' my_pairs #> [1] TRUE FALSE FALSE TRUE TRUE FALSE bray(counts, pairs = my_pairs) #> Saliva Gums Nose #> Gums 0.4260870 #> Nose NA 0.9826087 #> Stool NA 0.9884058 NA ``` # References Levy, A., Shalom, B. R., & Chalamish, M. (2024). A guide to similarity measures. *arXiv*. Cha, S.-H. (2007). Comprehensive survey on distance/similarity measures between probability density functions. *International Journal of Mathematical Models and Methods in Applied Sciences*, 1(4), 300–307.