---
title: "Wikipedia Highlighter Article"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Wikipedia Highlighter Article}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

In a second example, we view previous versions of a Wikipedia article (in this case, on the [Highlighter](https://en.wikipedia.org/wiki/Highlighter)) to see which parts are consistently included. The data include the first 150 versions of the article as well as the latest 150 versions, collected around 14 March 2024. The data were gathered as follows:

```{r data, eval=FALSE}
library(rvest)
library(dplyr)
library(purrr)
library(tibble)
library(stringr)

class_of_interest <- ".mw-content-ltr" ## ids are #id-name, classes are .class-name

# Finding newest 150 versions of Wikipedia's highlighter article
editurl <- "https://en.wikipedia.org/w/index.php?title=Highlighter&action=history&offset=&limit=150"
editclass_of_interest <- ".mw-changeslist-date"

# Save the urls of the full articles
url_list1 <- editurl %>%
  rvest::read_html() %>%
  rvest::html_nodes(editclass_of_interest) %>%
  purrr::map(., list()) %>%
  tibble::tibble(node = .) %>%
  dplyr::mutate(link = purrr::map_chr(node, html_attr, "href") %>%
                  paste0("https://en.wikipedia.org", .))

# Finding oldest 150 versions of Wikipedia's highlighter article
editurl2 <- "https://en.wikipedia.org/w/index.php?title=Highlighter&action=history&dir=prev&limit=150"

# Save the urls of the full articles
url_list2 <- editurl2 %>%
  rvest::read_html() %>%
  rvest::html_nodes(editclass_of_interest) %>%
  purrr::map(., list()) %>%
  tibble::tibble(node = .) %>%
  dplyr::mutate(link = purrr::map_chr(node, html_attr, "href") %>%
                  paste0("https://en.wikipedia.org", .))

# Combine url lists (newest versions first, oldest versions last)
url_list <- rbind(url_list1, url_list2)

# Create a data frame with the text of the documents
wiki_pages <- data.frame(page_notes = rep(NA, nrow(url_list)))

for (i in seq_len(nrow(url_list))) {
  # Keep only paragraph nodes, strip bracketed reference markers,
  # and collapse the paragraphs into one string per article version
  wiki_list <- url_list$link[i] %>%
    rvest::read_html() %>%
    rvest::html_node(class_of_interest) %>%
    rvest::html_children() %>%
    purrr::map(., list()) %>%
    tibble::tibble(node = .) %>%
    dplyr::mutate(type = purrr::map_chr(node, html_name)) %>%
    dplyr::filter(type == "p") %>%
    dplyr::mutate(text = purrr::map_chr(node, html_text)) %>%
    dplyr::mutate(cleantext = stringr::str_remove_all(text, "\\[.*?\\]") %>%
                    stringr::str_trim()) %>%
    dplyr::summarise(cleantext = paste(cleantext, collapse = "\n"))

  wiki_pages$page_notes[i] <- wiki_list$cleantext[1]
}
```

The previous versions are then compared to the current version's collocations with fuzzy matching, giving the number of times each collocation occurs in the edited documents (divided by the number of times the collocation occurs in the current version to account for duplications).

```{r}
library(highlightr)

# calculate frequencies with reference to source document (first row)
merged_frequency <- collocation_frequency(highlightr::wiki_pages,
                                          text_column = "page_notes",
                                          source_row = 1,
                                          fuzzy = TRUE)

head(merged_frequency)
```

These frequencies can be mapped back to the source document, where each word is highlighted according to the average frequency of the collocations in which it appears. The results are shown below by specifying `` `r '\x60r page_highlight\x60'` `` in an R Markdown document outside of a code chunk and knitting to HTML. Note that the `labels` argument can be used to add labels to the gradient key.

```{r}
# create a ggplot object of the transcript
freq_plot <- collocation_plot(merged_frequency)

# add html tags to source document
page_highlight <- highlighted_text(freq_plot,
                                   labels = c("(fewest articles)", "(most articles)"))
```

`r page_highlight`

The highlighting shows how the Wikipedia article has changed. Yellow marks text that occurs in more versions of the article (such as the passage on yellow as the primary highlight color and the information regarding the trilighter). Darker colors mark text that is seen in fewer versions of the article (such as the introductory sentence and the reference to correction tape).

We could also use the oldest version of the Highlighter article in the dataset as the reference document to view which text has been changed:

```{r}
# calculate frequencies with reference to source document (last row)
merged_frequency2 <- collocation_frequency(highlightr::wiki_pages,
                                           text_column = "page_notes",
                                           source_row = nrow(highlightr::wiki_pages),
                                           fuzzy = TRUE)

# create a ggplot object of the transcript
freq_plot2 <- collocation_plot(merged_frequency2)

# add html tags to source document
page_highlight2 <- highlighted_text(freq_plot2,
                                    labels = c("(fewest articles)", "(most articles)"))
```

`r page_highlight2`

In this case, the beginning of the first sentence ("A highlighter is a form...") occurs relatively frequently across the versions of the article.

Wikipedia Citation: “Highlighter.” Wikipedia, 14 Mar. 2024. Wikipedia, https://en.wikipedia.org/w/index.php?title=Highlighter&oldid=1213690238.
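
As a rough illustration of the normalisation described earlier (occurrences across the other versions divided by occurrences in the reference version), the following base R sketch counts exact two-word collocations on toy strings. It is not how `collocation_frequency()` is implemented: the helper `make_collocations()`, the two-word window, the toy texts, and the use of exact rather than fuzzy matching are all assumptions made for this example.

```r
# Hypothetical helper (not part of highlightr): lowercase the text, drop
# punctuation, and build every run of n consecutive words.
make_collocations <- function(text, n = 2) {
  words <- strsplit(tolower(gsub("[^[:alnum:] ]", "", text)), " +")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Toy data (assumed): one reference text and two other "versions"
reference <- "a highlighter is a pen a highlighter is bright"
versions <- c("a highlighter is a marker", "the highlighter is a pen")

# How often each collocation occurs in the reference itself
ref_counts <- table(make_collocations(reference))

# All collocations found in the other versions
version_collocations <- unlist(lapply(versions, make_collocations))

# Occurrences across versions for each reference collocation
raw <- vapply(names(ref_counts),
              function(x) sum(version_collocations == x),
              numeric(1))

# Divide by the reference count so repeated collocations are not over-weighted
normalised <- raw / as.numeric(ref_counts)
normalised
```

Here "a highlighter" appears twice in the reference, so its count across the other versions is halved, which is the kind of duplication adjustment the vignette describes.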