% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/screen_analyzer.R
\name{screen_analyzer}
\alias{screen_analyzer}
\title{Analyze performance between human and AI screening}
\usage{
screen_analyzer(x, human_decision = human_code, key_result = TRUE)
}
\arguments{
\item{x}{An object of either class \code{'gpt'} or \code{'chatgpt'},
or a dataset of class \code{'gpt_tbl'}, \code{'chatgpt_tbl'}, or \code{'gpt_agg_tbl'}.}

\item{human_decision}{The variable in the data that contains the human screening decision.
This variable must be numeric, containing only 1 (for included references) and 0 (for excluded references).}

\item{key_result}{Logical indicating whether only the raw agreement, recall, and specificity measures should be returned.
Default is \code{TRUE}.}
}
\value{
A \code{tibble} with screening performance measures. The \code{tibble} includes the following variables:
\tabular{lll}{
\bold{promptid} \tab \code{integer} \tab indicating the prompt ID. \cr
\bold{model} \tab \code{character}   \tab indicating the specific gpt-model used. \cr
\bold{reps}  \tab \code{integer}  \tab indicating the number of times the same question was sent to the GPT server. \cr
\bold{top_p} \tab \code{numeric}  \tab indicating the applied top_p. \cr
\bold{n_screened} \tab \code{integer} \tab indicating the number of screened references. \cr
\bold{n_missing} \tab \code{numeric} \tab indicating the number of missing responses. \cr
\bold{n_refs} \tab \code{integer} \tab indicating the total number of references expected to be screened for the given condition. \cr
\bold{human_in_gpt_ex} \tab \code{numeric}  \tab indicating the number of references included by humans and excluded by gpt. \cr
\bold{human_ex_gpt_in} \tab \code{numeric}  \tab indicating the number of references excluded by humans and included by gpt. \cr
\bold{human_in_gpt_in} \tab \code{numeric}  \tab indicating the number of references included by humans and included by gpt. \cr
\bold{human_ex_gpt_ex} \tab \code{numeric}  \tab indicating the number of references excluded by humans and excluded by gpt. \cr
\bold{accuracy} \tab \code{numeric}  \tab indicating the overall percent disagreement between human and gpt (Gartlehner et al., 2019). \cr
\bold{p_agreement} \tab \code{numeric} \tab indicating the overall percent agreement between human and gpt. \cr
\bold{precision}  \tab \code{numeric}  \tab "measures the ability to include only articles that should be included" (Syriani et al., 2023). \cr
\bold{recall}  \tab \code{numeric} \tab "measures the ability to include all articles that should be included" (Syriani et al., 2023). \cr
\bold{npv}  \tab \code{numeric}  \tab Negative predictive value (NPV) "measures the ability to exclude only articles that should be excluded" (Syriani et al., 2023). \cr
\bold{specificity}  \tab \code{numeric} \tab "measures the ability to exclude all articles that should be excluded" (Syriani et al., 2023). \cr
\bold{bacc}  \tab \code{numeric}  \tab Balanced accuracy, which "capture the accuracy of deciding both inclusion and exclusion classes" (Syriani et al., 2023). \cr
\bold{F2}  \tab \code{numeric} \tab F-measure that "consider the cost of getting false negatives twice as costly as getting false positives" (Syriani et al., 2023). \cr
\bold{mcc}  \tab \code{numeric} \tab indicating percent agreement for excluded references (Gartlehner et al., 2019). \cr
\bold{irr}  \tab \code{numeric}  \tab indicating the inter-rater reliability as described in McHugh (2012). \cr
\bold{se_irr} \tab \code{numeric} \tab indicating standard error for the inter-rater reliability. \cr
\bold{cl_irr} \tab \code{numeric} \tab indicating the lower bound of the confidence interval for the inter-rater reliability. \cr
\bold{cu_irr} \tab \code{numeric} \tab indicating the upper bound of the confidence interval for the inter-rater reliability. \cr
\bold{level_of_agreement} \tab \code{character} \tab interpretation of the inter-rater reliability as suggested by McHugh (2012). \cr
}
}
\description{
\ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#experimental}{\figure{lifecycle-experimental.svg}{options: alt='[Experimental]'}}}{\strong{[Experimental]}}\if{html}{\out{<br>}}
\if{html}{\out{<br>}}
Once both the human and the AI title and abstract screening have been completed, this function
calculates performance measures of the screening, including the overall
accuracy, specificity, and sensitivity (recall), as well as
inter-rater reliability (kappa) statistics.
}
\examples{
\dontrun{

library(future)

set_api_key()

prompt <- "Is this study about a Functional Family Therapy (FFT) intervention?"

plan(multisession)

res <- tabscreen_gpt(
  data = filges2015_dat[1:2, ],
  prompt = prompt,
  studyid = studyid,
  title = title,
  abstract = abstract
  )

plan(sequential)

res |> screen_analyzer()

}
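
# A minimal sketch, not part of screen_analyzer(): how some of the returned
# measures relate to the four cell counts, using the standard textbook
# definitions. The counts are hypothetical, n_total is a name made up for
# this illustration, and it is an assumption (not confirmed here) that
# screen_analyzer() computes the measures in exactly this way.
human_in_gpt_in <- 8   # included by both human and gpt
human_in_gpt_ex <- 2   # included by human, excluded by gpt
human_ex_gpt_in <- 5   # excluded by human, included by gpt
human_ex_gpt_ex <- 85  # excluded by both human and gpt

n_total <- human_in_gpt_in + human_in_gpt_ex + human_ex_gpt_in + human_ex_gpt_ex

# Share of references on which human and gpt made the same decision
p_agreement <- (human_in_gpt_in + human_ex_gpt_ex) / n_total

# Recall (sensitivity): share of human-included references that gpt included
recall <- human_in_gpt_in / (human_in_gpt_in + human_in_gpt_ex)

# Specificity: share of human-excluded references that gpt excluded
specificity <- human_ex_gpt_ex / (human_ex_gpt_ex + human_ex_gpt_in)

# Balanced accuracy: mean of recall and specificity
bacc <- (recall + specificity) / 2

round(
  c(p_agreement = p_agreement, recall = recall,
    specificity = specificity, bacc = bacc),
  3
)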
}
\references{
Gartlehner, G., Wagner, G., Lux, L., Affengruber, L., Dobrescu, A., Kaminski-Hartenthaler, A., & Viswanathan, M. (2019).
Assessing the accuracy of machine-assisted abstract screening with DistillerAI: a user study.
\emph{Systematic Reviews}, 8(1), 277. \doi{10.1186/s13643-019-1221-3}

McHugh, M. L. (2012).
Interrater reliability: The kappa statistic.
\emph{Biochemia Medica}, 22(3), 276-282. \url{https://pubmed.ncbi.nlm.nih.gov/23092060/}

Syriani, E., David, I., & Kumar, G. (2023).
Assessing the Ability of ChatGPT to Screen Articles for Systematic Reviews.
\emph{ArXiv Preprint ArXiv:2307.06464}.
}
