---
title: "Bivariate analysis of continuous and/or categorical variables"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Bivariate analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(rmarkdown.html_vignette.check_title = FALSE)
```

Tidycomm includes five functions for bivariate explorative data analysis:

- `crosstab()` for both categorical independent and dependent variables
- `t_test()` for dichotomous categorical independent and continuous dependent variables
- `unianova()` for polytomous categorical independent and continuous dependent variables
- `correlate()` for both continuous independent and dependent variables
- `regress()` for both continuous or factorial (translated into dummy dichotomous versions) independent and continuous dependent variables

```{r setup, message = FALSE, warning = FALSE, include = FALSE}
library(tidycomm)
```

We will again use sample data from the [Worlds of Journalism](https://worldsofjournalism.org/) 2012-16 study for demonstration purposes: 

```{r}
WoJ
```

## Compute contingency tables and Chi-square tests

`crosstab()` outputs a contingency table for one independent (column) variable and one or more dependent (row) variables:

```{r}
WoJ %>% 
  crosstab(reach, employment)
```

Additional options include `add_total` (adds a `Total` column containing row totals if set to `TRUE`) and `percentages` (outputs column-wise percentages instead of absolute values if set to `TRUE`):

```{r}
WoJ %>% 
  crosstab(reach, employment, add_total = TRUE, percentages = TRUE)
```

Setting `chi_square = TRUE` computes a $\chi^2$ test including Cramer's $V$ and outputs the results in a console message:

```{r}
WoJ %>% 
  crosstab(reach, employment, chi_square = TRUE)
```

Finally, passing multiple row variables will treat all unique value combinations as a single variable for percentage and Chi-square computations:

```{r}
WoJ %>% 
  crosstab(reach, employment, country, percentages = TRUE)
```

You can also visualize the output from `crosstab()`:

```{r}
WoJ %>% 
  crosstab(reach, employment, percentages = TRUE) %>% 
  visualize()
```

Note that the `percentages` argument determines whether the bars are scaled to 100% and thus span the full plot width, or whether they show absolute counts:

```{r}
WoJ %>% 
  crosstab(reach, employment) %>% 
  visualize()
```

## Compute t-Tests

Use `t_test()` to quickly compute t-Tests for a group variable and one or more test variables. Output includes test statistics, descriptive statistics and Cohen's $d$ effect size estimates: 

```{r}
WoJ %>% 
  t_test(temp_contract, autonomy_selection, autonomy_emphasis)
```

Passing no test variables will compute t-Tests for all numerical variables in the data:

```{r}
WoJ %>% 
  t_test(temp_contract)
```

If passing a group variable with more than two unique levels, `t_test()` will produce a `warning` and default to the first two unique values. You can manually define the levels by setting the `levels` argument:

```{r}
WoJ %>% 
  t_test(employment, autonomy_selection, autonomy_emphasis)

WoJ %>% 
  t_test(employment, autonomy_selection, autonomy_emphasis, levels = c("Full-time", "Freelancer"))
```

Additional options include:

- `pooled_sd`: By default, the pooled standard deviation is used to compute Cohen's $d$ effect size estimates ($s = \sqrt{\frac{(n_1 - 1)s^2_1 + (n_2 - 1)s^2_2}{n_1 + n_2 - 2}}$).
Set `pooled_sd = FALSE` to use the simple standard deviation estimate instead ($s = \sqrt{\frac{s^2_1 + s^2_2}{2}}$).
- `paired`: Set `paired = TRUE` to compute a paired t-Test instead. It is advisable to specify the case-identifying variable with `case_var` when computing paired t-Tests, as this will make sure that data are properly sorted.
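For example, a sketch of combining these options with the `temp_contract` grouping from above (`pooled_sd = FALSE` switches to the simple standard deviation estimate; `paired` is omitted here because `WoJ` contains no genuinely paired measurements):

```{r}
WoJ %>% 
  t_test(temp_contract, autonomy_selection, autonomy_emphasis, pooled_sd = FALSE)
```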

Previously, a (now deprecated) `var.equal` option was also available. It has been removed because `t_test()` now tests for equal variances by default (using a Levene test) and decides automatically whether to use the pooled variance or the Welch approximation to the degrees of freedom.

`t_test()` also provides a one-sample t-Test if you provide a `mu` argument:

```{r}
WoJ %>% 
  t_test(autonomy_emphasis, mu = 3.9)
```

Of course, t-Test results can also be visualized easily:

```{r}
WoJ %>% 
  t_test(temp_contract, autonomy_selection, autonomy_emphasis) %>% 
  visualize()
```


## Compute one-way ANOVAs

`unianova()` will compute one-way ANOVAs for one group variable and one or more test variables. Output includes test statistics and $\eta^2$ effect size estimates, as well as $\omega^2$ if Welch's approximation is used to account for unequal variances.

```{r}
WoJ %>% 
  unianova(employment, autonomy_selection, autonomy_emphasis)
```

Descriptives can be added by setting `descriptives = TRUE`. If no test variables are passed, all numerical variables in the data will be used:

```{r}
WoJ %>% 
  unianova(employment, descriptives = TRUE)
```

You can also compute _Tukey's HSD_ post-hoc tests by setting `post_hoc = TRUE`. Results will be added as a `tibble` in a list column `post_hoc`.

```{r}
WoJ %>% 
  unianova(employment, autonomy_selection, autonomy_emphasis, post_hoc = TRUE)
```

These can then be unnested with `tidyr::unnest()`:

```{r}
WoJ %>% 
  unianova(employment, autonomy_selection, autonomy_emphasis, post_hoc = TRUE) %>% 
  dplyr::select(Variable, post_hoc) %>% 
  tidyr::unnest(post_hoc)
```

Visualize one-way ANOVAs the way you visualize almost everything in `tidycomm`:

```{r}
WoJ %>% 
  unianova(employment, autonomy_selection, autonomy_emphasis) %>% 
  visualize()
```


## Compute correlation tables and matrices

`correlate()` will compute correlations for all combinations of the passed variables:

```{r}
WoJ %>% 
  correlate(work_experience, autonomy_selection, autonomy_emphasis)
```

If no variables are passed, correlations for all combinations of numerical variables will be computed:

```{r}
WoJ %>% 
  correlate()
```

Specify a focus variable using the `with` parameter to correlate all other passed variables with this focus variable:

```{r}
WoJ %>% 
  correlate(autonomy_selection, autonomy_emphasis, with = work_experience)
```

Compute a partial correlation between two variables by specifying a third, control variable with the `partial` parameter:

```{r}
WoJ %>% 
  correlate(autonomy_selection, autonomy_emphasis, partial = work_experience)
```

Visualize correlations by passing the results on to the `visualize()` function:

```{r}
WoJ %>% 
  correlate(work_experience, autonomy_selection) %>% 
  visualize()
```

If you provide more than two variables, you automatically get a correlogram (the visual counterpart of a correlation matrix):

```{r}
WoJ %>% 
  correlate(work_experience, autonomy_selection, autonomy_emphasis) %>% 
  visualize()
```


By default, Pearson's product-moment correlation coefficients ($r$) will be computed. Set `method` to `"kendall"` to obtain Kendall's $\tau$ or to `"spearman"` to obtain Spearman's $\rho$ instead.
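For instance, Spearman's $\rho$ for the same variables as above:

```{r}
WoJ %>% 
  correlate(work_experience, autonomy_selection, autonomy_emphasis, method = "spearman")
```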

To obtain a correlation matrix, pass the output of `correlate()` to `to_correlation_matrix()`:

```{r}
WoJ %>% 
  correlate(work_experience, autonomy_selection, autonomy_emphasis) %>% 
  to_correlation_matrix()
```

## Compute linear regressions

`regress()` fits a linear regression of one dependent variable on a flexible number of independent variables. Independent variables can be continuous, dichotomous, or factorial (in which case each factor level is translated into a dichotomous dummy variable):

```{r}
WoJ %>% 
  regress(autonomy_selection, work_experience, trust_government)
```

The function automatically adds standardized beta coefficients to the usual linear-regression output. You can also opt in to up to three precondition checks:

```{r}
WoJ %>% 
  regress(autonomy_selection, work_experience, trust_government,
          check_independenterrors = TRUE,
          check_multicollinearity = TRUE,
          check_homoscedasticity = TRUE)
```

For linear regressions, a number of visualizations are available. The default visualizes the result(s): the dependent variable is plotted against each independent variable separately, with a linear model overlaid in each panel:

```{r}
WoJ %>% 
  regress(autonomy_selection, work_experience, trust_government) %>% 
  visualize()
```

Alternatively, you can produce plots that assist with the precondition checks, for example a correlogram among the independent variables:

```{r}
WoJ %>% 
  regress(autonomy_selection, work_experience, trust_government) %>% 
  visualize(which = "correlogram")
```

Next up, visualize a residuals-versus-fitted plot to inspect how the residuals are distributed across the fitted values:

```{r}
WoJ %>% 
  regress(autonomy_selection, work_experience, trust_government) %>% 
  visualize(which = "resfit")
```

Or use a (normal) probability-probability plot to check whether the residuals are normally distributed:

```{r}
WoJ %>% 
  regress(autonomy_selection, work_experience, trust_government) %>% 
  visualize(which = "pp")
```

The (normal) quantile-quantile plot also helps check the normality of residuals but focuses more on outliers:

```{r}
WoJ %>% 
  regress(autonomy_selection, work_experience, trust_government) %>% 
  visualize(which = "qq")
```

Next up, the scale-location (sometimes also called spread-location) plot shows whether residuals are spread equally, which helps check for homoscedasticity:

```{r}
WoJ %>% 
  regress(autonomy_selection, work_experience, trust_government) %>% 
  visualize(which = "scaloc")
```

Finally, visualize the residuals-versus-leverage plot to check for influential outliers that affect the final model more than the rest of the data:

```{r}
WoJ %>% 
  regress(autonomy_selection, work_experience, trust_government) %>% 
  visualize(which = "reslev")
```
