---
title: "Introduction to modeldiag"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to modeldiag}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(modeldiag)
```

# Overview

Statistical models rely on assumptions for valid inference. Violations of these assumptions can lead to biased estimates, incorrect standard errors, and misleading conclusions.

The `modeldiag` package provides a unified framework for diagnosing these assumptions across multiple model classes, including:

* Linear models
* Logistic regression
* Count models (Poisson)
* Survival models (Cox proportional hazards)

This vignette introduces both the **statistical intuition** behind common diagnostics and how to run them with `modeldiag`.

---

# Linear Models

Consider the classical linear regression model:

$$Y = X\beta + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2 I)$$

Valid inference depends on several assumptions about the error term $\varepsilon$ and on the design matrix $X$.

## Multicollinearity

Multicollinearity occurs when predictors are highly correlated with one another. This inflates the variance of the coefficient estimates.

The Variance Inflation Factor (VIF) is defined as:

$$\text{VIF}_j = \frac{1}{1 - R_j^2}$$

where $R_j^2$ is obtained by regressing predictor $X_j$ on all other predictors. A VIF of 1 means $X_j$ is uncorrelated with the other predictors; large values indicate unstable estimates, and a common rule of thumb treats values above 5 or 10 as a warning sign.

## Heteroscedasticity

Heteroscedasticity occurs when the error variance is not constant across observations:

$$\text{Var}(\varepsilon_i) \neq \sigma^2 \quad \text{for some } i$$

The Breusch–Pagan test evaluates whether the residual variance depends on the predictors.

## Autocorrelation

Autocorrelation arises when errors from different observations are correlated:

$$\text{Cov}(\varepsilon_i, \varepsilon_j) \neq 0 \quad \text{for some } i \neq j$$

The Durbin–Watson statistic tests for first-order autocorrelation. It ranges from 0 to 4, and values near 2 are consistent with uncorrelated errors.

## Normality of Errors

Many inferential procedures assume:

$$\varepsilon_i \sim N(0, \sigma^2)$$

The Shapiro–Wilk test, applied to the residuals, evaluates this assumption.

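
The four checks above can also be computed by hand, which makes the formulas concrete. The chunk below is a minimal sketch using only base R and the `mtcars` model from this vignette's linear example; it does not show how `modeldiag` computes these quantities internally. The object names (`fit`, `vif_manual`, `bp_stat`, `dw_stat`, `sw`) are illustrative, and the Breusch–Pagan statistic shown is the studentized (Koenker) variant, $n R^2$ from regressing the squared residuals on the predictors.

```{r base-r-checks}
# Illustrative sketch in base R; not modeldiag's internal implementation.
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
predictors <- c("wt", "hp", "disp")
e <- residuals(fit)

# VIF_j = 1 / (1 - R_j^2): regress each predictor on the remaining ones.
vif_manual <- sapply(predictors, function(v) {
  aux <- lm(reformulate(setdiff(predictors, v), response = v), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})
vif_manual

# Studentized (Koenker) Breusch-Pagan statistic: n * R^2 from the
# auxiliary regression of squared residuals on the predictors,
# compared against a chi-squared distribution with 3 degrees of freedom.
bp_aux  <- lm(I(e^2) ~ wt + hp + disp, data = mtcars)
bp_stat <- nobs(fit) * summary(bp_aux)$r.squared
bp_p    <- pchisq(bp_stat, df = length(predictors), lower.tail = FALSE)
c(statistic = bp_stat, p.value = bp_p)

# Durbin-Watson statistic: squared successive residual differences
# divided by the residual sum of squares.
dw_stat <- sum(diff(e)^2) / sum(e^2)
dw_stat

# Shapiro-Wilk test applied to the residuals.
sw <- shapiro.test(e)
sw$p.value
```

Note that the Durbin–Watson statistic is only meaningful when the rows have a natural ordering such as time; `mtcars` has none, so it appears here purely to illustrate the arithmetic.
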
## Influential Observations

Influential points disproportionately affect model estimates. Cook’s distance measures this influence:

$$D_i = \frac{(\hat{\beta} - \hat{\beta}_{(i)})^T X^T X (\hat{\beta} - \hat{\beta}_{(i)})}{p \hat{\sigma}^2}$$

where $\hat{\beta}_{(i)}$ is the coefficient estimate obtained with observation $i$ removed, $p$ is the number of estimated coefficients, and $\hat{\sigma}^2$ is the estimated error variance.

---

## Example

```{r linear-example}
model_lm <- lm(mpg ~ wt + hp + disp, data = mtcars)

diag_lm <- diagnose_model(model_lm)
summary(diag_lm)
```

---

# Logistic Regression

Logistic regression models the probability of a binary outcome through the logit link:

$$\text{logit}(P(Y=1)) = X\beta$$

## Key Diagnostics

### Linearity of the Logit

The model assumes a linear relationship between the predictors and the log-odds, where $p = P(Y=1)$:

$$\log\left(\frac{p}{1-p}\right) = X\beta$$

The Box–Tidwell test evaluates this assumption.

### Goodness of Fit

The Hosmer–Lemeshow test compares observed and expected event counts across groups of observations formed from the fitted probabilities.

### Separation

Complete or quasi-complete separation occurs when the predictors classify the outcomes perfectly or almost perfectly, leading to unstable or infinite coefficient estimates.

---

## Example

```{r logistic-example}
model_glm <- glm(am ~ wt + hp, data = mtcars, family = binomial)

diag_glm <- diagnose_model(model_glm)
summary(diag_glm)
```

---

# Poisson Regression

Poisson regression assumes:

$$Y \sim \text{Poisson}(\lambda), \quad \log(\lambda) = X\beta$$

## Overdispersion

A key assumption is that the variance equals the mean:

$$\text{Var}(Y) = \mathbb{E}(Y)$$

Overdispersion occurs when:

$$\text{Var}(Y) > \mathbb{E}(Y)$$

When overdispersion is ignored, standard errors are underestimated.

## Zero Inflation

More zeros than the Poisson model predicts may indicate a zero-inflated process.

---

## Example

```{r poisson-example}
model_pois <- glm(carb ~ wt + hp, data = mtcars, family = poisson)

diag_pois <- diagnose_model(model_pois)
summary(diag_pois)
```

---

# Survival Models

The Cox proportional hazards model assumes:

$$h(t | X) = h_0(t) \exp(X\beta)$$

## Proportional Hazards

The key assumption is that hazard ratios are constant over time.
Schoenfeld residuals are used to test the null hypothesis that each coefficient does not vary with time:

$$\frac{\partial \beta(t)}{\partial t} = 0$$

---

## Example

```{r survival-example}
library(survival)

# `lung` ships with the survival package and is lazy-loaded,
# so no explicit data() call is needed.
model_cox <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung)

diag_cox <- diagnose_model(model_cox)
summary(diag_cox)
```

---

# Visualization

Diagnostic plots help identify violations visually.

```{r plotting, fig.height=6, fig.width=6}
plot(diag_lm)
```

---

# Conclusion

The `modeldiag` package provides a unified and extensible framework for model diagnostics. The same `diagnose_model()`, `summary()`, and `plot()` calls apply to linear, logistic, Poisson, and Cox models, which simplifies the process of validating assumptions across these model classes.

# References

Breusch, T. S., & Pagan, A. R. (1979). *A Simple Test for Heteroscedasticity and Random Coefficient Variation*. Econometrica.

Cameron, A. C., & Trivedi, P. K. (2013). *Regression Analysis of Count Data*. Cambridge University Press.

Cook, R. D., & Weisberg, S. (1982). *Residuals and Influence in Regression*. Chapman & Hall.

Cox, D. R. (1972). *Regression Models and Life-Tables*. Journal of the Royal Statistical Society: Series B.

Durbin, J., & Watson, G. S. (1950, 1951). *Testing for Serial Correlation in Least Squares Regression* (Parts I and II). Biometrika.

Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). *Applied Logistic Regression*. Wiley.

Shapiro, S. S., & Wilk, M. B. (1965). *An Analysis of Variance Test for Normality (Complete Samples)*. Biometrika.