diff --git a/episodes/1-intro-reproducible-examples.md b/episodes/1-intro-reproducible-examples.md index 9ee4d66b..7cce82c4 100644 --- a/episodes/1-intro-reproducible-examples.md +++ b/episodes/1-intro-reproducible-examples.md @@ -19,7 +19,7 @@ exercises: 10 :::::::::::::::::::::::::::::::::::::::::::::::: #### Welcome and introductions -- Welcome to "RRRR, I'm Stuck!" We're glad you're here! Let's first take care of a few setup steps. You should have all followed the setup instructions on the [workshop website](kaijagahm.github.io/R-help-reprexes/), and you should have both R and RStudio installed. +- Welcome to "RRRR, I'm Stuck!" We're glad you're here! Let's first take care of a few setup steps. You should have all followed the setup instructions on the [workshop website](https://carpentries-incubator.github.io/R-help-reprexes/), and you should have both R and RStudio installed. - You should also have the following packages installed: {**reprex**}, {**ratdat**}, {**dplyr**}, and {**ggplot2**}. diff --git a/episodes/3-minimal-reproducible-data.Rmd b/episodes/3-minimal-reproducible-data.Rmd index 02003a84..4818df59 100644 --- a/episodes/3-minimal-reproducible-data.Rmd +++ b/episodes/3-minimal-reproducible-data.Rmd @@ -1,7 +1,7 @@ --- title: "Minimal Reproducible Data" teaching: 90 -excercises: 6 +exercises: 6 --- ```{r setup, include=FALSE} diff --git a/episodes/4-minimal-reproducible-code-draft.Rmd b/episodes/4-minimal-reproducible-code-draft.Rmd deleted file mode 100644 index 25f6ef05..00000000 --- a/episodes/4-minimal-reproducible-code-draft.Rmd +++ /dev/null @@ -1,166 +0,0 @@ ---- -title: "Minimal Reproducible Code" -teaching: 10 -exercises: 2 ---- - -::: questions -- Which part of my code is causing an error message or an incorrect result? -- I want to make my code minimal, but where do I even start? -- How do I make non-reproducible code reproducible? -- How do I tell whether a code snippet is reproducible or not? -::: - -::: objectives -- Identify the step that is generating the error -- Implement a stepwise approach to make minimal code -- Edit a piece of code to make it reproducible -- Evaluate whether a piece of code is reproducible as is or not. If not, identify what is missing. -::: - -In the last episode, we focused in on how to make minimal datasets that would reproduce a target piece of code, but we didn't talk much about the code itself. In each example, we already had a piece of code that needed a dataset. But how do you know which part of your code is the problem and should be focused on? - -At this point, we as researchers have been exploring our kangaroo rat data for a while and making several plots. Now we know how many kangaroo rats we can expect to catch, and we can move on to our second research question: Do the rodent exclusion plots actually working at keeping the target species out of certain areas? - -If the exclusion plots work, we would expect differences in abundance in the different plot types. Specifically, we'd expect to see fewer kangaroo rats overall in any of the exclusion plots than in the control. - -We start by making a plot to examine this - -```{r, include = F} -library(readr) -library(dplyr) -library(ggplot2) -library(stringr) -library(here) # XXX COME BACK TO THIS, SEE https://github.com/carpentries-incubator/R-help-reprexes/issues/61 -rodents <- read_csv(here("scripts/data/surveys_complete_77_89.csv")) -rodents <- rodents %>% - filter(taxa == "Rodent") -krats <- rodents %>% - filter(genus == "Dipodomys") -krats <- krats %>% - mutate(date = lubridate::ymd(paste(year, month, day, sep = "-"))) -krats <- krats %>% - mutate(time_period = ifelse(year < 1988, "early", "late")) -krats <- krats %>% - filter(species != "sp.") -krats_per_day <- krats %>% - group_by(date, year, species) %>% - summarize(n = n()) %>% - group_by(species) -``` - -To start figuring this out, we decide to make a plot of counts per day, per plot type, per species, per year. - -```{r} -counts_per_day <- krats %>% - group_by(year, plot_id, plot_type, month, day, species_id) %>% - summarize(count_per_day = n()) -``` - -Then, we use that information to visualize the distribution of counts per day in the different plot types to see if there's a difference overall. - -```{r} -counts_per_day %>% - ggplot(aes(x = plot_type, y = count_per_day, fill = species_id, group = interaction(plot_type, species_id)))+ - geom_boxplot(outlier.size = 0.5)+ - theme_minimal()+ - labs(title = "Kangaroo rat captures, all years", - x = "Plot type", - y = "Individuals per day", - fill = "Species") -``` - -Interestingly, we don't see a difference in the number of k-rats captured in the different plot types! We expected to catch more k-rats in the control plots (which don't exclude rodents) than in the various rodent exclosure plots, but the rates appear to be about the same. That doesn't bode well for the effectiveness of these experimental plots! - -You're really interested in this result, and you want to talk to your coworker about it. One of your coworkers, Taylor, asks to see the code you used to make this plot so that they can investigate it on their own. You say "No problem, I'll send the code over!" - -You send the following email: - -*Hi Taylor,* -*Here's the code I used to make that plot! I hope it works.* -``` -counts_per_day %>% - ggplot(aes(x = plot_type, y = count_per_day, fill = species_id, group = interaction(plot_type, species_id)))+ - geom_boxplot(outlier.size = 0.5)+ - theme_minimal()+ - labs(title = "Kangaroo rat captures, all years", - x = "Plot type", - y = "Individuals per day", - fill = "Species") -``` - -Unfortunately, Taylor soon writes back. -*Hey Sam,* -*That code didn't run properly for me. Maybe you need to include the data?* - -Of course! The data! That's important. -::: challenge -**Exercise:** On the Etherpad or in your own notes, identify the dataset that you'll need to send to Taylor so they can run your code. What are a few different ways that you could give him the data? What are the advantages or disadvantages to each? -::: -::: solution -XXX INSERT SOLUTION -::: - -You decide that instead of sending Taylor the modified file, you're going to send him more of the code so he can reproduce the entire thing himself. You look back over the code you've written so far. It's kind of messy! -XXX TODO: Add messy comments and other tangents to this script to make it long. -XXX TOOD: How do we show a script that includes line numbers? That will be important--they all need to have the same reference point for this challenge. -``` -# Note: your code might look a little different! That's okay. -library(readr) -library(dplyr) -library(ggplot2) -library(stringr) -library(here) # XXX COME BACK TO THIS, SEE https://github.com/carpentries-incubator/R-help-reprexes/issues/61 -rodents <- read_csv(here("scripts/data/surveys_complete_77_89.csv")) -rodents <- rodents %>% - filter(taxa == "Rodent") -krats <- rodents %>% - filter(genus == "Dipodomys") -krats <- krats %>% - mutate(date = lubridate::ymd(paste(year, month, day, sep = "-"))) -krats <- krats %>% - mutate(time_period = ifelse(year < 1988, "early", "late")) -krats <- krats %>% - filter(species != "sp.") -krats_per_day <- krats %>% - group_by(date, year, species) %>% - summarize(n = n()) %>% - group_by(species) -counts_per_day <- krats %>% - group_by(year, plot_id, plot_type, month, day, species_id) %>% - summarize(count_per_day = n()) -counts_per_day %>% - ggplot(aes(x = plot_type, y = count_per_day, fill = species_id, group = interaction(plot_type, species_id)))+ - geom_boxplot(outlier.size = 0.5)+ - theme_minimal()+ - labs(title = "Kangaroo rat captures, all years", - x = "Plot type", - y = "Individuals per day", - fill = "Species") -``` - -::: challenge -**Excercise:** Which lines of code in the script above does Taylor absolutely NEED to run in order to reproduce your plot? -Bonus: What are some lines of code that Taylor doesn't NEED to run but which might provide him useful extra context? - -(For this challenge, use the line numbers in the script above, even if your script looks slightly different. For extra practice, do the same exercise with your own script! Do you find any different answers?) -::: -::: solution -XXX INSERT SOLUTION, referencing line numbers -::: - -Taylor emails you back again: -*Hi Sam,* -*Thanks, that looks like it will probably work. Do I really have to install all those packages, though? I'm a little worried about running out of space on my computer.* - -You roll your eyes internally, but he's right--probably not all those packages are totally necessary! - -::: challenge -**Excercise:** Email Taylor back and tell him which packages he needs in order to run your code. -::: -::: solution -*Hi Taylor,* -*Good point. Yeah, you don't need all those packages. Some of them were from other parts of the code that I didn't include here, and there are some that I just forgot to remove when I stopped using them! You can just install {dplyr}, {ggplot2}, {here}, and {readr} and the code should run fine!* -*Sam* -::: - diff --git a/episodes/4-minimal-reproducible-code.Rmd b/episodes/4-minimal-reproducible-code.Rmd index f8d4a851..4dbfe13e 100644 --- a/episodes/4-minimal-reproducible-code.Rmd +++ b/episodes/4-minimal-reproducible-code.Rmd @@ -5,217 +5,165 @@ exercises: 35 --- ::: questions -- Why can't I just post my whole script? -- Which parts of my code are directly relevant to my problem? -- Which parts of my code are necessary in order for the problem area to run correctly? -- I feel overwhelmed by this script--where do I even start? +- Why is it important to make a minimal code example? +- Which part of my code is causing the problem? +- Which parts of my code should I include in a minimal example? +- How can I tell whether a code snippet is reproducible or not? +- How can I make my code reproducible? ::: ::: objectives - Explain the value of a minimal code snippet. -- Simplify a script down to a minimal code example. -- Identify the problem area of the code. +- Identify the problem area of a script. - Identify supporting parts of the code that are essential to include. -- Identify pieces of code that can be removed without affecting the central functionality. +- Simplify a script down to a minimal code example. +- Evaluate whether a piece of code is reproducible as is or not. If not, identify what is missing. +- Edit a piece of code to make it reproducible - Have a road map to follow to simplify your code. +- Describe the {reprex} package and its uses ::: -You're excited by how much progress you're making in your research. You've made a lot of descriptive plots and gained some interesting insights into your data. Now you're excited to investigate whether the k-rat exclusion plots are actually working. You set to work writing a bunch of code to do this, using a combination of descriptive visualizations and linear models. - -So far, you've been saving all of your analysis in a script called "krat-analysis.R". At this point, it looks something like this: -```{r "krat-analysis.R"} -#| eval: false -#| code-fold: true -# Kangaroo rat analysis using the Portal data -# Created by: Research McResearchface -# Last updated: 2024-11-22 - -# Load packages to use in this script -library(readr) -library(dplyr) -library(ggplot2) -library(stringr) - -# Read in the data -rodents <- read_csv("scripts/data/surveys_complete_77_89.csv") +```{r} +library(tidyverse) +library(ratdat) +library(here) +source(here("scripts/krat-analysis.R")) +``` -### DATA WRANGLING #### -glimpse(rodents) # or click on the environment -str(rodents) # an alternative that does the same thing -head(rodents) # or open fully with View() or click in environment -table(rodents$taxa) +You're happy with the progress you've made in exploring your data, and you're excited to move on to your second research question: Do the k-rat exclusion plots actually exclude k-rats? -# Abundance distribution of taxa -rodents %>% - ggplot(aes(x=taxa))+ - geom_bar() - -# Examine NA values -## How do we find NAs anyway? ---- -head(is.na(rodents$taxa)) # logical--tells us when an observation is an NA (T or F) +You start by making a couple predictions. -# Not very helpful. BUT -sum(is.na(rodents$taxa)) # sum considers T = 1 and F = 0 +Prediction 1: If the k-rat plots work, then we would expect to see fewer k-rats in the k-rat exclusion plots than in the control, with the fewest in the long-term k-rat exclosure plot. -# Simplify down to just rodents -rodents <- rodents %>% - filter(taxa == "Rodent") -glimpse(rodents) +Second, if the k-rat plots work, then we would expect a lower proportion of *spectabilis* in the *spectabilis*-specific exclosure than in the control plot, regardless of the absolute number of k-rats caught. -# Just kangaroo rats because this is what we are studying -krats <- rodents %>% - filter(genus == "Dipodomys") -dim(krats) # okay, so that's a lot smaller, great. -glimpse(krats) +You start by computing summary counts of the number of k-rats per day in the control and the two k-rat exclosure plots, and then using boxplots to show the differences between daily counts in each plot type. -# Prep for time analysis -# To examine trends over time, we'll need to create a date column -krats <- krats %>% - mutate(date = lubridate::ymd(paste(year, month, day, sep = "-"))) - -# Examine differences in different time periods -krats <- krats %>% - mutate(time_period = ifelse(year < 1988, "early", "late")) - -# Check that this went through; check for NAs -table(krats$time_period, exclude = NULL) # learned how to do this earlier - -### QUESTION 1: How many k-rats over time in the past? ### -# How many kangaroo rats of each species were found at the study site in past years (so you know what to expect for a sample size this year)? - -# Numbers over time by plot type -krats %>% - ggplot(aes(x = date, fill = plot_type)) + - geom_histogram()+ - facet_wrap(~species)+ - theme_bw()+ - scale_fill_viridis_d(option = "plasma")+ - geom_vline(aes(xintercept = lubridate::ymd("1988-01-01")), col = "dodgerblue") - -# Oops we gotta get rid of the unidentified k-rats -krats <- krats %>% - filter(species != "sp.") - -# Re-do the plot above -krats %>% - ggplot(aes(x = date, fill = plot_type)) + - geom_histogram()+ - facet_wrap(~species)+ - theme_bw()+ - scale_fill_viridis_d(option = "plasma")+ - geom_vline(aes(xintercept = lubridate::ymd("1988-01-01")), col = "dodgerblue") - -# How many individuals caught per day? -krats_per_day <- krats %>% - group_by(date, year, species) %>% - summarize(n = n()) %>% - group_by(species) - -krats_per_day %>% - ggplot(aes(x = species, y = n))+ - geom_boxplot(outlier.shape = NA)+ - geom_jitter(width = 0.2, alpha = 0.2, aes(col = year))+ - theme_classic()+ - ylab("Number per day")+ - xlab("Species") - -#### QUESTION 2: Do the k-rat exclusion plots work? ##### -# Do the k-rat exclusion plots work? (i.e. Does the abundance of each species differ by plot?) -# If the k-rat plots work, then we would expect: -# A. Fewer k-rats overall in any of the exclusion plots than in the control, with the fewest in the long-term k-rat exclusion plot +```{r} +# Prediction 1: expect fewer k-rats in exclusion plots than in control plot +# Count number of k-rats per day, per plot type counts_per_day <- krats %>% - group_by(year, plot_id, plot_type, month, day, species_id) %>% - summarize(count_per_day = n()) + filter(plot_type %in% c("Control", "Long-term Krat Exclosure", "Short-term Krat Exclosure")) %>% + group_by(year, plot_id, plot_type, month, day) %>% + summarize(count_per_day = n(), .groups = "drop") +# Visualize counts_per_day %>% - ggplot(aes(x = plot_type, y = count_per_day, fill = species_id, group = interaction(plot_type, species_id)))+ + ggplot(aes(x = plot_type, y = count_per_day, fill = plot_type))+ geom_boxplot(outlier.size = 0.5)+ theme_minimal()+ labs(title = "Kangaroo rat captures, all years", x = "Plot type", - y = "Individuals per day", - fill = "Species") + y = "Individuals per day")+ + theme(legend.position = "none") +``` -# B. For Spectabilis-specific exclosure, we expect a lower proportion of spectabilis there than in the other plots. +This is a coarse analysis that doesn't compare numbers on the same day against each other, but even so, our prediction is supported. The fewest k-rats are caught per day in the Long-term Krat Exclosure plots, followed by the Short-term Krat Exclosure. The Control plots have the largest number of Krats. + +Next, you decide to test the second prediction: that there will be proportionally fewer *spectabilis* caught in the *spectabilis* exclusion plots than in the control plot. + +You restrict the data to just the control and *spectabilis* exclosure plots. Then you calculate the number and proportion of each species of k-rat caught in each plot and visualize the proportions for *spectabilis.* + +```{r} +# Focus in on the control and spectab exclosure plots control_spectab <- krats %>% filter(plot_type %in% c("Control", "Spectab exclosure")) +# Calculate proportion of spectabilis caught +head(control_spectab) prop_spectab <- control_spectab %>% group_by(year, plot_type, species_id) %>% - summarize(total_count = n(), .groups = "drop_last") %>% + summarize(total_count = n(), .groups = "drop") %>% mutate(prop = total_count/sum(total_count)) %>% filter(species_id == "DS") # keep only spectabilis +# Visualize proportions prop_spectab %>% ggplot(aes(x = year, y = prop, col = plot_type))+ geom_point()+ geom_line()+ theme_minimal()+ - labs(title = "Spectab exclosures did not reduce proportion of\nspectab captures", + labs(title = "Spectab exclosures reduced the proportion of spectab captures", y = "Spectabilis proportion", x = "Year", color = "Plot type") +``` -#### MODELING #### -counts_mod <- lm(count_per_day ~ plot_type + species_id, data = counts_per_day) -summary(counts_mod) +Interesting! In this analysis, we do see, as expected, that the *spectabilis* exclosure seems to have a consistently lower proportion of *spectabilis* than we see in the control plots. This supports our second prediction, and it also does a better job of getting at our research question because it's looking at proportions, not raw numbers. -# with interaction term: -counts_mod_interact <- lm(count_per_day ~ plot_type*species_id, data = counts_per_day) -summary(counts_mod_interact) +But speaking of proportions... looking closer at the y-axis, we can see that something looks wrong. You remember from the previous plot that *merriami* was generally more common than the other two species, but you don't remember that *spectabilis* was quite this rare! Could it really be true that only 3% of the k-rats caught were *spectabilis* at the absolute highest? -summary(counts_mod) -summary(counts_mod_interact) -``` +# XXX FIXME: +Here, I've presented this as though part of the mystery is figuring out whether there's something wrong with the data. That type of data sleuthing (is my code bad, or is my data bad??) is a really important skill, but I think it might be a bit tangential for us to focus on here. The easy way to have this stick to the LOs would be to have some ground truth that shows us for absolutely sure that <3% is wrong, which means it's the code that's bad, not the original data. -## Why is it important to simplify code? +This seems wrong. But your code is running fine, so you're not really sure where your error is or how to start fixing it. + +You're feeling overwhelmed, so you decide to ask your labmate, Jordan, for help. +> Hi Jordan, +> I ran into a problem with my code today, and I'm wondering if you can help. I made this plot about kangaroo rat abundance in different plot types, but the proportions look weirdly low. I don't know what's wrong--can you help me figure it out? I'm attaching my script to this email. +> Thanks, +> Mishka +> Attachment: krat-analysis.R -Learning how to simplify your code is one of the most important parts of making a minimal reproducible example, asking others for help, and helping yourself. +You know that Jordan is a very experienced coder. If they look through your script, surely they'll be able to figure out what's going on much more quickly than you can! ::::::::::::::::::::::::::::::::::::::::::: challenge -## Making sense of code +## Making sense of someone else's code -Reflect on a time when you opened a coding project after a long time away from it. Or maybe you had to look through and try to run someone else's code. +Imagine you are Jordan, and you've in the middle of your own analysis when you receive this email. What reaction do you have? How do you feel about helping your lab mate? How would you feel if it were a complete stranger asking you for help instead? -(If you have easy access to one of your past projects, maybe try opening it now and taking a look through it right now!) +Now, reflect on a time when you have had to look through and run someone else's code, or a time when you've opened your own coding project after a long time away from it. (If you have easy access to one of your past projects, maybe try opening it now and taking a look through it right now!) -How do situations like this make you feel? Write some reflections on the Etherpad. +How do situations like this make you feel? Turn to the person next to you and discuss for a minute (or write some reflections on the Etherpad). -This exercise should take about 5 minutes. -::::::::::::::::::::::::::::::::::::::::::: +This exercise should take about 3 minutes. +:::solution +FIXME: I think this is way too long and needs to be more targeted. We are trying to get them to realize that this is a frustrating/overwhelming/scary experience for the person in the helper role. + +I am worried that people are going to use this time to start reflecting on the substance of the actual error. Maybe that's a reason to just focus it on a general time you've dealt with someone else's code, instead of "put yourself in Jordan's shoes"? I feel like I need to demo it with people to find out, though. -Debugging is a time when it's common to have to read through long and complex code (either your own or someone else's). That means that the person doing the debugging is likely to experience some of the emotions we just talked about. +Suggestions for revising this challenge are welcome! +::: +::::::::::::::::::::::::::::::::::::::::::: + +Jordan wants to help their labmate, but they are really busy. They've looked through enough code in the past to know that it will take a long time to make sense of yours. They email you back: + +> Hi Mishka, +> I'd be happy to help, but this is a lot of code to go through without much context. Could you just show me the part that's most relevant? Maybe make a minimal example? +> Thanks, +> Jordan -The more we can reduce the negative emotions and make the experience of solving errors easy and painless, the likelier you are to find solutions to your problems (or convince others to take the time to help you). Helpers are doing us a favor--why put barriers in their way? +## Why is it important to simplify code? -Let's illustrate the importance of simplifying our code by focusing on an error in the big long analysis script we created, shown above. Let's imagine we're getting ready to show these preliminary results to our advisor, but when we re-run the whole script, we realize there's a problem. +In order to get help from Jordan, we're going to need to do some work up front to make things easier for them. In general, learning how to simplify your code is one of the most important parts of making a minimal reproducible example, asking others for help, and helping yourself. -[DESCRIPTION OF PROBLEM HERE] +Debugging is a time when it's common to have to read through long and complex code (either your own or someone else's). The more we can reduce their frustration and overwhelm and make the experience of solving errors easy and painless, the more likely that others will want to take the time to help us. Helpers are doing us a favor--why put barriers in their way? ## A road map for simplifying your code -In this episode, we're going to walk through a road map for breaking your code down to its simplest form while making sure that 1) it still runs, and 2) it reproduces the problem you care about solving. +Sometimes, it can be challenging to know where to start simplifying your code. In this episode, we're going to walk through a road map for breaking your code down to its simplest form while making sure that 1) it still runs, and 2) it reproduces the problem you care about solving. For now, we'll go through this road map step by step. At the end, we'll review the whole thing. One takeaway from this lesson is that there is a step by step process to follow, and you can refer back to it if you feel lost in the future. ### Step 0. Create a separate script -When we know there's a problem with our script, it helps to start solving it by examining smaller parts of the code in a separate script, instead of editing the original. +To begin creating a simpler piece of code to show Jordan, let's create a separate script. We can separate out small pieces of the code and craft the perfect minimal example while still keeping the original analysis intact. -:::::::::::::::::::::::::::::::::::::challenge -## A separate place for minimal code +Create and save a new, blank R script and give it a name, such as "reprex-script.R" -Create a new, blank R script and give it a name, such as "reprex-script.R" +```{r} +# This script will contain my minimal reproducible example. +# Created by Mishka on 2024-12-17 +``` +:::::::::::::::::::::::::::::::::::::callout +## Creating a new R script There are several ways to make an R script - File > New File > R Script - Click the white square with a green plus sign at the top left corner of your RStudio window - Use a keyboard shortcut: Cmd + Shift + N (on a Mac) or Ctrl + Shift + N (on Windows) - -Once you've created the script, click the Save button to name and save it. - -This exercise should take about 2 minutes. ::::::::::::::::::::::::::::::::::::::::::: ### Step 1. Identify the problem area @@ -224,76 +172,173 @@ Now that we have a script, let's zero in on what's broken. First, we should use some of the techniques we learned in the "Identify the Problem" episode and see if they help us solve our error. -[MORE CONTENT THAT CALLS BACK TO PL'S EPISODE HERE] +XXX FIXME: Peter, let's make sure this follows nicely from your lesson. +- Look for error messages or warnings --> there aren't any +- Google it --> not really informative -In this particular case, though, we weren't able to completely resolve our error. +In this case, the debugging tips we learned weren't enough to help us figure out what was wrong with our code. -[WHY? maybe because it's not an error but a case of "the plot isn't returning what we want"? Or maybe it's an extra difficult error message that we can't find an easy answer to? +This is quite common, especially when you're faced with a **semantic error** (your code runs, but it doesn't produce the correct result). -I need to figure out what error to introduce into the script in the first place... that will determine the justification to use here.] +In this case, we can start by identifying the piece of code that *showed us there was a problem*. We noticed the problem by looking at the plot, so that would be an obvious place to start. But is the plot code really the problem area? Maybe, or maybe not. We know that some weird values are showing up on the plot. That means that either there are weird values in the data that created the plot, or the values in the data looked okay and something happened when we created the plot. -(*Using the plot example for now*) +To distinguish between those possibilities, let's take a look at the plot again: -Okay, so we know that the plot doesn't look the way we want it to. Which part of the code created that plot? One way to figure this out if we're not sure is to step through the code line by line. +```{r} +# Visualize proportions +prop_spectab %>% + ggplot(aes(x = year, y = prop, col = plot_type))+ + geom_point()+ + geom_line()+ + theme_minimal()+ + labs(title = "Spectab exclosures did not reduce proportion of\nspectab captures", + y = "Spectabilis proportion", + x = "Year", + color = "Plot type") +``` -:::::::::::::::::::::::::::::::::::::callout -## Stepping through code, line by line +...and then at the data used to generate that plot +```{r filename = "Console"} +head(prop_spectab) +``` -Placing your cursor on a line of code and using the keyboard shortcut Cmd + Enter (Mac) or Ctrl + Enter (Windows) will run that line of code *and* it will automatically advance your cursor to the next line. This makes it easy to "step through" your code without having to click or highlight. -::::::::::::::::::::::::::::::::::::::::::: +Aha! The `prop` column of `prop_spectab` shows very small values, and those values correspond to the plot we created. -Yay, we found the trouble spot! Let's go ahead and copy that line of code and paste it over into the empty script we created, "reprex-script.R". +So it looks like the problem is in the data, not in the plot code itself. The plot was just what allowed us to see the problem in our data. This is often the case. Visualizations are important! -### Step 2. Give context: functions and packages +So instead of focusing on the plot, let's zoom in on the step right before the plot as our "problem area": the code that created the `prop_spectab` object. Let's copy and paste that section of code into the new script that we just created. -R code consists primarily of *variables* and *functions*. +```{r} +# This script will contain my minimal reproducible example. +# Created by Mishka on 2024-12-17 -::::::::::::::::::::::::::::::::::::::::::: challenge -## Where do functions come from? +# Calculate proportion of spectabilis caught +head(control_spectab) +prop_spectab <- control_spectab %>% + group_by(year, plot_type, species_id) %>% + summarize(total_count = n(), .groups = "drop") %>% + mutate(prop = total_count/sum(total_count)) %>% + filter(species_id == "DS") # keep only spectabilis +``` -When coding in R, we use a lot of different functions. Where do those functions come from? How can we make sure that our helpers have access to those sources? Take a moment to brainstorm. +Now the code is much smaller and easier to dig through! You decide to send this updated script along to Jordan. -This exercise should take about 3 minutes. -::::::::::::::::::::::::::::::::::::::::::: -::: solution -Functions in R typically come from packages. Some packages, such as `{base}` and `{stats}`, are loaded in R by default, so you might not have realized that they are packages too. +> Hi Jordan, +> Sorry about that! You're right, that was a lot of code to dig through. I've narrowed it down now to just focus on the part of the code that was causing the problem. The proportions of k-rats in the prop_spectab data frame are too small to make sense. Can you help me figure out why the proportions are wrong? +> Thank you, +> Mishka +> Attachment: reprex-script.R + +Jordan gets your email and opens the script. They try to run your code, but they immediately run into some problems. They write back, + +> Hey Mishka, +> This is almost there, but I can't run your code because I don't have the objects and packages it relies on. Can you elaborate on your example to make it runnable? +> Jordan -You can see a complete list of functions in `{base}` and `{stats}` by running `library(help = "base")` or `library(help = "stats")`. +### Step 2. Identify dependencies -Some functions might be user-defined. In that case, you'll need to make sure to include the function definition in your reprex. +Jordan is right, of course--you gave them a minimal example of your code, but you didn't provide the context around that code to make it runnable for someone else. You need to provide the *dependencies* for your code, including the functions it uses and the datasets it relies on. + +R code consists primarily of **functions** and **variables**. Making sure that we give our helpers access to all the functions and variables we use in our minimal code will make our examples truly reproducible. + +Let's talk about **functions** first. When we code in R, we use a lot of different functions. Functions in R typically come from packages, and you get access to them by loading the package into your environment. (Some packages, such as `{base}` and `{stats}`, are loaded in R by default, so you might not have realized that a lot of functions, such as `dim`, `colSums`, `factor`, and `length` actually come from those packages!) + +::: callout +You can see a complete list of the functions that come from the `{base}` and `{stats}` packages by running `library(help = "base")` or `library(help = "stats")`. ::: -::::::::::::::::::::::::::::::::::::::::::: callout -## Finding functions +To make sure that your helper has access to the functions necessary for your reprex, you can include a `library()` call in your reprex. -Sometimes it can be hard to figure out where a function comes from. Especially if a function comes from a package you use frequently, you might not remember where it comes from! +Let's do this for our own reprex. We can start by identifying all the functions used, and then we can figure out where each function comes from to make sure that we tell our helper to load the right packages. -You can search for a function in the help docs with `??fun` (where "fun" is the name of the function). To explicitly declare which package a function comes from, you can use a double colon `::`--for example, `dplyr::select()`. Declaring the function with a double colon also allows you to use that function even if the package is not loaded, as long as it's installed. -::::::::::::::::::::::::::::::::::::::::::: +Here's what our script looks like so far. -The quickest way to make sure others have access to the functions contained in packages is to include a `library()` call in your reprex, so they know to load the package too. +```{r} +# This script will contain my minimal reproducible example. +# Created by Mishka on 2024-12-17 -::::::::::::::::::::::::::::::::::::::::::: challenge -## Which packages are essential? +# Calculate proportion of spectabilis caught +head(control_spectab) +prop_spectab <- control_spectab %>% + group_by(year, plot_type, species_id) %>% + summarize(total_count = n(), .groups = "drop") %>% + mutate(prop = total_count/sum(total_count)) %>% + filter(species_id == "DS") # keep only spectabilis +``` -In each of the following code snippets, identify the necessary packages (or other code) to make the example reproducible. +You look through the script and list out the following functions: +`head()` +`group_by()` +`summarize()` +`n()` +`mutate()` +`sum()` +`filter()` + +:::callout +`%>%` is technically an operator, not a function... +FIXME note: `%>%` is an operator that also needs dplyr to be loaded. It might be relevant because the literal error that Jordan will get first is one about %>% if they don't have that loaded. But it's also exhausting to talk about and it might confuse people. By the time they finish loading {dplyr} for the sake of the other functions, the %>% will be included with {dplyr} and it will no longer be a problem. Hmm... +::: -- [Example (including an ambiguous function: `dplyr::select()` is a good one because it masks `plyr::select()`)] -- [Example where you have to look up which package a function comes from] -- [Example with a user-defined function that doesn't exist in any package] +You remember from your introductory R lesson that `group_by()` and `summarize()` are both functions that come from the `{dplyr}` package, so you can go ahead and add `library(dplyr)` to the top of your script. -This exercise should take about 5 minutes. -::::::::::::::::::::::::::::::::::::::::::: -::: solution -FIXME +```{r} +# This script will contain my minimal reproducible example. +# Created by Mishka on 2024-12-17 + +# Load packages needed for the analysis +library(dplyr) + +# Calculate proportion of spectabilis caught +head(control_spectab) +prop_spectab <- control_spectab %>% + group_by(year, plot_type, species_id) %>% + summarize(total_count = n(), .groups = "drop") %>% + mutate(prop = total_count/sum(total_count)) %>% + filter(species_id == "DS") # keep only spectabilis +``` + +You're pretty sure that `mutate()` and `filter()` also come from `{dplyr}`, but you're not sure. And you don't know where `n()` and `sum()` come from at all! + +Some functions might be created, or "defined", in your code itself, instead of being contained in a package. + +It's pretty common not to remember where functions come from, especially if they belong to packages you use regularly. Let's try searching for the `sum()` function in the documentation. + +In the "Help" tab of your RStudio window (which should be next to the "Packages" and "Viewer" tabs in the bottom right pane of RStudio if you have the default layout), search for "sum". + +[INCLUDE SCREENSHOT OF SUM HELP DOCS HERE] + +The top of the help file says `sum {base}`, which means that `sum()` is a function that comes from the `{base}` package. That's good for us--everyone has `{base}` loaded by default in R, so we don't need to tell our helper to load any additional packages in order to be able to access sum. + +Let's confirm that we remembered correctly about `mutate()` and `filter()` by searching for the documentation on those as well. + +Searching for "mutate" quickly shows us the help file for `mutate {dplyr}`, but searching for "filter" gives a long list of possible functions, all called `filter`. You look around a little bit and realize that the `filter` function you're using is also from the `{dplyr}` package. It might be a good idea to explicitly declare that in your code, in case your helper already has one of these other packages loaded. + +::: callout +Maybe something about functions masking other functions? Or is that too much? ::: -Looking through the problem area that we isolated, we can see that we'll need to load the following packages: FIXME -- `{package}` -- `{package}` -- `{package}` +You can explicitly declare which package a function comes from using a double colon `::`--for example, `dplyr::filter()`. (Declaring the function with a double colon also allows you to use that function even if the package is not loaded, as long as it's installed.) + +Let's add that to our script, so that it's really clear which `filter()` we're using. -Let's go ahead and add those as `library()` calls to the top of our script. +```{r} +# This script will contain my minimal reproducible example. +# Created by Mishka on 2024-12-17 + +# Load packages needed for the analysis +library(dplyr) + +# Calculate proportion of spectabilis caught +head(control_spectab) +prop_spectab <- control_spectab %>% + group_by(year, plot_type, species_id) %>% + summarize(total_count = n(), .groups = "drop") %>% + mutate(prop = total_count/sum(total_count)) %>% + dplyr::filter(species_id == "DS") # keep only spectabilis +``` + +The last function we need to find a source for is `n()`. Searching that in the Help files shows that it also comes from `{dplyr}`. Great! We don't need to consider any other packages, and the `library(dplyr)` line in our reprex script will tell our helper that the `{dplyr}` package is necessary to run our code. ::::::::::::::::::::::::::::::::::::::::::: callout ## Installing vs. loading packages @@ -303,7 +348,21 @@ But what if our helper doesn't have all of these packages installed? Won't the c Typically, we don't include `install.packages()` in our code for each of the packages that we include in the `library()` calls, because `install.packages()` is a one-time piece of code that doesn't need to be repeated every time the script is run. We assume that our helper will see `library(specialpackage)` and know that they need to go install "specialpackage" on their own. Technically, this makes that part of the code not reproducible! But it's also much more "polite". Our helper might have their own way of managing package versions, and forcing them to install a package when they run our code risks messing up our workflow. It is a common convention to stick with `library()` and let them figure it out from there. -FIXME this feels over-explained... pare it down! +::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::::: challenge +## Which packages are essential? + +In each of the following code snippets, identify the necessary packages (or other code) to make the example reproducible. + +- [Example (including an ambiguous function: `dplyr::select()` is a good one because it masks `plyr::select()`)] +- [Example where you have to look up which package a function comes from] +- [Example with a user-defined function that doesn't exist in any package] + +This exercise should take about 10 minutes. +:::solution +FIXME +::: ::::::::::::::::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::::::::::::::: callout @@ -312,15 +371,47 @@ FIXME this feels over-explained... pare it down! There is an alternative approach to installing packages [insert content/example of the if(require()) thing--but note that explaining this properly requires explaining why require() is different from library(), why it returns a logical, etc. and is kind of a rabbit hole that I don't want to go down here.] ::::::::::::::::::::::::::::::::::::::::::: -### Step 3. Give context: variables and datasets +Including `library()` calls will definitely help Jordan be able to run your code. But another reason that your code still won't work as written is that Jordan doesn't have access to the same *objects*, or *variables*, that you used in the code. + +The piece of code that we copied over to "reprex-script.R" from "krat-analysis.R" came from line 117. We had done a lot of analyses before then, starting from the raw dataset and creating and modifying new objects along the way. + +You realize that Jordan doesn't have access to `control_spectab`, the dataset that the reprex relies on. -Isolating the problem area and loading the necessary packages and functions was an important step to making our example code self-contained. But we're still not done making the code minimal and reproducible. Almost certainly, our code snippet relies on variables, such as datasets, that our helper won't have access to. +One way to fix this would be to add the code that creates `control_spectab` to the reprex, like this: -The piece of code that we copied over came from line [LINE NUMBER] of our analysis script. We had done a lot of analyses before then, including modifying datasets and creating intermediate objects/variables. +```{r} +# This script will contain my minimal reproducible example. +# Created by Mishka on 2024-12-17 -Our code snippet depends on all those previous steps, so when we isolate it in a new script, it might not be able to run anymore. More importantly, when a helper doesn't have access to the rest of our script, the code might not run for them either. +# Load packages needed for the analysis +library(dplyr) -To fix this, we need to provide some additional context around our reprex so that it runs. +# B. For Spectabilis-specific exclosure, we expect a lower proportion of spectabilis there than in the other plots. +control_spectab <- krats %>% + filter(plot_type %in% c("Control", "Spectab exclosure")) + +# Calculate proportion of spectabilis caught +head(control_spectab) +prop_spectab <- control_spectab %>% + group_by(year, plot_type, species_id) %>% + summarize(total_count = n(), .groups = "drop") %>% + mutate(prop = total_count/sum(total_count)) %>% + dplyr::filter(species_id == "DS") # keep only spectabilis +``` + +But that doesn't fix the problem either, because now Jordan doesn't have access to `krats`. Let's go back to where `krats` was created, on line 40: + +```{r} +# Just kangaroo rats because this is what we are studying +krats <- rodents %>% + filter(genus == "Dipodomys") +dim(krats) # okay, so that's a lot smaller, great. +glimpse(krats) +``` + +But there are several other places in the code where we modified `krats`, and we need to include those as well if we want our code to be truly reproducible. For example, on line 47, a date column was added, and on line 70 we removed unidentified k-rats. And even after including those, we would need to go back even farther, because the `rodents` object also isn't something that Jordan has access to! We called the raw data `rodents` after reading it in, and we also made several modifications, such as when we removed non-rodent taxa on line 35. +XXX This example would be better illustrated if we could actually show it breaking... like, if we think we've succeeded in tracing back far enough, and then it doesn't run because e.g. an essential column is missing/different. +XXX This is definitely too much detail... but I do want this section to at least partially feel like "aaaah information overload, too complicated!" because the point is to show how hard it is to trace every object backwards and convince them of the value of using minimal data. ::::::::::::::::::::::::::::::::::::::::::: challenge ## Identifying variables @@ -331,78 +422,163 @@ For each of the following code snippets, identify all the variables used - [Example where they use a built-in dataset but it contains a column that that dataset doesn't actually contain, i.e. because it's been modified previously. Might be good to use the `date` column that we put into `krats` for this] This exercise should take about 5 minutes. +::: solution +TBH maybe we don't need this exercise at all, since we want to just redirect them to minimal data anyway? +::: ::::::::::::::::::::::::::::::::::::::::::: + +This process is confusing and difficult! If you keep tracing each variable back through your script, before you know it, you end up needing to include the entire script to give context, and then your code is no longer minimal. + +We can make our lives easier if we realize that helpers don't always need the exact same variables and datasets that we were using, just reasonably good stand-ins. Let's think back to the last episode, where we talked about different ways to create minimal reproducible datasets. We can lean on those skills here to make our example reproducible and greatly reduce the amount of code that we need to include. + +:::::::::::::::::::::::::::::::::::::::::::challenge +## Incorporating minimal datasets + +What are some ways that you could use a minimal dataset to make this reprex better? What are the advantages and disadvantages of each approach? + +This exercise should take about 5 minutes. ::: solution -FIXME +Could provide the `control_spectab` file directly to Jordan, e.g. via email attachment. +Advantages: less work, keeps the context, Jordan is a coworker so they probably understand it. +Disadvantages: file might be large, relies on ability to email file, won't be able to use this method if you post the question online, file contains extra rows/columns that aren't actually necessary to this problem and might therefore be confusing. + +Could create a new dataset from scratch. +Advantages: +Disadvantages: + +Could take a minimal subset of the `control_spectab` data and use `dput` or similar to include it directly +Advantages: +Disadvantages: + +Could use a built-in dataset from R +Advantages: +Disadvantages: ::: +::::::::::::::::::::::::::::::::::::::::::: -As you might have noticed, identifying these variables isn't always straightforward. Sometimes variables depend on other variables, and before you know it, you end up needing the entire script. +You decide that the easiest way to approach this particular problem would be to use a sample dataset to reproduce the problem. -Let's work together as a group to sketch out which variables depend on which others. A helpful way to do this is to start with the variables included in our code snippet and ask, for each one, "Where did this come from?" +Let's think about how to create a sample dataset that will accurately reproduce the problem. +What columns are needed to create `prop_spectab`, and what do we know about those columns? -[Make a big dependency graph. The point is to illustrate that it gets very long and you can't always rely on this process to identify a simple way to include the needed variables.] +```{r} +head(control_spectab) +length(unique(control_spectab$year)) # 13 years +length(unique(control_spectab$species_id)) # three species +length(unique(control_spectab$plot_type)) # two plot types +``` -How can we make sure that helpers can access these objects too, without providing them the entire long script? +Looks like we use `year`, which is numeric, and `plot_type` and `species_id`, which are both character types but have discrete levels like a factor. We have data from 13 years across two plot types, and there are three species that might occur. -Theoretically, we could meticulously trace each object back and make sure to include the code to create all of its predecessors from the original data, which we would provide to our helper. But pretty soon, we might find that we're just giving the helper the original (long, complicated) script! +Let's create some vectors. I'm only going to use four years here because 13 seems a little excessive. -As with other types of writing, creating a good minimal reprex takes hard work and time. +I'm going to arbitrarily decide how many species records are present for each year, assign every other row to treatment or control, and randomly select species for each row. +```{r} +years <- 1:4 +species <- c("Species A", "Species B", "Species C") +plots <- c("Control", "Treatment") -> "I would have written a shorter letter, but I did not have the time." -> -> - Blaise Pascal, *Lettres Provinciales*, 1657 +total_records_per_year <- c(10, 12, 3, 30) # I chose these arbitrarily +total_rows <- sum(total_records_per_year) # how many total rows will we have? +# Create the fake data using `rep()` and `sample()`. +minimal_data <- data.frame(year = rep(years, times = total_records_per_year), + plot_type = rep(plots, length.out = total_rows), + species_id = sample(species, size = total_rows, replace = TRUE)) +``` -::::::::::::::::::::::::::::::::::::::::::: callout -## Computational reproducibility +We can go ahead and paste that into our reprex script. -Every object should be able to map back to either a file, a built-in dataset in R, or another intermediate step. If you found any variables where you weren't able to answer the "Where did this come from?" question, then that's a problem! Did you build a script that mistakenly relied on an object that was in your environment but was never properly defined? +```{r} +# Provide a minimal dataset +years <- 1:4 +species <- c("Species A", "Species B", "Species C") +plots <- c("Control", "Treatment") -Mapping exercises like this can be a great way to check whether entire script is reproducible. Reproducibility is important in more cases than just debugging! More and more journals are requiring full analysis code to be posted, and if that code isn't reproducible, it will severely hamper other researchers' efforts to confirm and expand on what you've done. +total_records_per_year <- c(10, 12, 3, 30) # I chose these arbitrarily +total_rows <- sum(total_records_per_year) # how many total rows will we have? -Various packages can help you keep track of your code and make it more reproducible. Check out the [`{targets}`](https://books.ropensci.org/targets/) and [`{renv}`](https://rstudio.github.io/renv/articles/renv.html) packages in particular if you're interested in learning more. -::::::::::::::::::::::::::::::::::::::::::: +# Create the fake data using `rep()` and `sample()`. +minimal_data <- data.frame(year = rep(years, times = total_records_per_year), + plot_type = rep(plots, length.out = total_rows), + species_id = sample(species, size = total_rows, replace = TRUE)) +``` -Luckily, we can make our lives easier if we realize that helpers don't always need the exact same variables and datasets, just reasonably good stand-ins. Let's think back to the last episode, where we talked about different ways to create minimal reproducible datasets. We can lean on those skills here to make our example reproducible and greatly reduce the amount of code that we need to include. +And finally, let's change our code to use `minimal_data` instead of `control_spectab`. (While we're at it, let's remove the piece of code that creates `control_spectab`--including a minimal dataset directly means that we don't need that code.) -:::::::::::::::::::::::::::::::::::::::::::challenge -## Incorporating minimal datasets +Let's calculate the proportion of Species A caught, instead of spectabilis. -Brainstorm some places in our reprex where you could use minimal reproducible data to make your problem area code snippet reproducible. +```{r} +# Calculate proportion of Species A caught +head(minimal_data) +prop_speciesA <- minimal_data %>% + group_by(year, plot_type, species_id) %>% + summarize(total_count = n(), .groups = "drop") %>% + mutate(prop = total_count/sum(total_count)) %>% + dplyr::filter(species_id == "Species A") # keep only Species A +``` -Which of the techniques from the [data episode](LINK TO DATA EPISODE) will you choose in each case, and why? +And now let's check to make sure that the code actually reproduces our problem. Remember, the problem was that the proportions of Species A caught seemed to be way too low based on what we knew about the frequency of occurrence of each k-rat species. -This exercise should take about 5 minutes. -::::::::::::::::::::::::::::::::::::::::::: -::: solution -FIXME -::: +When we created our sample data, we randomly allocated species A, B, and C to each row. Even after accounting for the stratification of the years and plot types, we would expect the proportion of Species A caught to be somewhere vaguely around 0.33 in each plot type and year. If we've reproduced our problem correctly, we would see values much lower than 0.33. -_**Using a minimal dataset simplifies not just your data but also your code, because it lets you avoid including data wrangling steps in your reprex!**_ +```{r} +head(prop_speciesA) +``` -### Step 4. Simplify +Indeed, those proportions look way too low! 3%, 1%... that's an order of magnitude off from what we expect to see here. I think we have successfully reproduced the problem using a minimal dataset. To make things extra easy for Jordan, let's add some comments in the script to point out the problem. -We're almost done! Now we have code that runs because it includes the necessary `library()` calls and makes use of minimal datasets that still allow us to showcase the problem. Our script is almost ready to send to our helpers. +```{r} +# This script will contain my minimal reproducible example. +# Created by Mishka on 2024-12-17 -But reading someone else's code can be slow! We want to make it very, very easy for our helper to see which part of the code is important to focus on. Let's see if there are any places where we can trim code down even more to eliminate distractions. +# Load packages needed for the analysis +library(dplyr) -Often, analysis code contains exploratory steps or other analyses that don't directly relate to the problem, such as calls to `head()`, `View()`, `str()`, or similar functions. (Exception: if you're using these directly to show things like dimension changes that help to illustrate the problem). +# Provide a minimal dataset +years <- 1:4 +species <- c("Species A", "Species B", "Species C") +plots <- c("Control", "Treatment") -Some other common examples are exploratory analyses, extra formatting added to plots, and [ANOTHER EXAMPLE]. +total_records_per_year <- c(10, 12, 3, 30) # I chose these arbitrarily +total_rows <- sum(total_records_per_year) # how many total rows will we have? -When cutting these things, we have to be careful not to remove anything that would cause the code to no longer reproduce our problem. In general, it's a good idea to comment out the line you think is extraneous, re-run the code, and check that the focal problem persists before removing it entirely. +# Create the fake data using `rep()` and `sample()`. +minimal_data <- data.frame(year = rep(years, times = total_records_per_year), + plot_type = rep(plots, length.out = total_rows), + species_id = sample(species, size = total_rows, replace = TRUE)) # Because I assigned the species randomly, we should expect Species A to occur *roughly* 33% of the time. -:::::::::::::::::::::::::::::::::::::::::::challenge -## Trimming down the bells and whistles +# Calculate proportion of Species A caught +head(minimal_data) +prop_speciesA <- minimal_data %>% + group_by(year, plot_type, species_id) %>% + summarize(total_count = n(), .groups = "drop") %>% + mutate(prop = total_count/sum(total_count)) %>% + dplyr::filter(species_id == "Species A") # keep only Species A -[Ex: removing various things, observing what happens, identifying whether or not we care about those things. (Need to include at least one that's tricky, like maybe it does change the actual values but it doesn't change their relationship to each other)] +head(prop_speciesA) # Species A only occurs 1-3% of the time in each plot type in each year. Why is this off by an order of magnitude? (This is the same problem I was seeing in my real data--the occurrence proportions were way, way too small.) +``` -This exercise should take about 5 minutes. -::::::::::::::::::::::::::::::::::::::::::: -::: solution -FIXME +### Step 3. Simplify + +We're almost done! Now we have code that runs because it includes the necessary `library()` calls and makes use of a minimal dataset that still allows us to showcase the problem. Our script is almost ready to send to Jordan. + +One last thing we can check is whether there are any other places where we can trim down the minimal example even more to eliminate distractions. + +Often, analysis code contains exploratory steps or other analyses that don't directly relate to the problem, such as calls to `head()`, `View()`, `str()`, or similar functions. (Exception: if you're using these directly to show things like dimension changes that help to illustrate the problem). + +Some other common examples are exploratory analyses, or extra formatting added to plots that doesn't change their interpretation. + +:::callout +When cutting extra lines of code, we have to be careful not to remove anything that would cause the code to no longer reproduce our problem. In general, it's a good idea to comment out the line you think is extraneous, re-run the code, and check that the focal problem persists before removing it entirely. ::: +In our case, the code looks pretty minimal. We do have a call to `head()` at the end, but that's being used to clearly demonstrate the problem, so it should be left in. Trimming down the minimal dataset any further, for example by removing `plot_type` or `year`, would involve changing the analysis code and possibly not reproducing the problem anymore. + +--- +XXX @XOR @PL: I wish I had picked an example that had some extraneous parts to take out, such as a ggplot call with really long code. Would love to include that. Since I didn't, I think I can take out this entire section of the roadmap, or add it as a bonus exercise that gives a completely different example with a lot of extraneous things to remove. Another option would be to add some fluff to the original code so that we can remove that fluff now. Thoughts/preferences? +--- + Great work! We've created a minimal reproducible example. In the next episode, we'll learn about `{reprex}`, a package that can help us double-check that our example is reproducible by running it in a clean environment. (As an added bonus, `{reprex}` will format our example nicely so it's easy to post to places like Slack, GitHub, and StackOverflow.) More on that soon. For now, let's review the road map that we just practiced. @@ -414,13 +590,11 @@ More on that soon. For now, let's review the road map that we just practiced. ### Step 1. Identify the problem area - Which part of the code is causing the problem? Move it over to the reprex script so we can focus on it. -### Step 2. Give context: functions and packages +### Step 2. Identify dependencies - Make sure that helpers have access to all the functions they'll need to run your code snippet. + - Make sure helpers can access the variables they'll need to run the code, or reasonable stand-ins. -### Step 3. Give context: variables and datasets - - Make sure that helpers have access to all the variables they'll need to run your code snippet, or reasonable stand-ins. - -### Step 4. Simplify +### Step 3. Simplify - Remove any extra code that isn't absolutely necessary to demonstrate your problem. :::::::::::::::::::::::::::::::::::::::::::challenge @@ -434,3 +608,46 @@ Let's take a moment to reflect on this process. This exercise should take about 5 minutes. ::::::::::::::::::::::::::::::::::::::::::: + + +# XXX maybe this story should end with Mishka solving their own problem. + + +### run an ANOVA or something +XXX Note: There's more to the analysis, and I only managed to walk through one example of an error. Is this example juicy enough? Should we just discard the rest of the analysis? +XXX Note: I actually think the analysis here is wrong and might need to be re-worked--would need to do pairwise comparisons. + +#### MODELING #### +counts_mod <- lm(count_per_day ~ plot_type + species_id, data = counts_per_day) +summary(counts_mod) + +# with interaction term: +counts_mod_interact <- lm(count_per_day ~ plot_type*species_id, data = counts_per_day) +summary(counts_mod_interact) + +summary(counts_mod) +summary(counts_mod_interact) + + + + + + + + +:::::::::::::::::::::::::::::::::::::callout +## Stepping through code, line by line + +Placing your cursor on a line of code and using the keyboard shortcut Cmd + Enter (Mac) or Ctrl + Enter (Windows) will run that line of code *and* it will automatically advance your cursor to the next line. This makes it easy to "step through" your code without having to click or highlight. +::::::::::::::::::::::::::::::::::::::::::: + + +::::::::::::::::::::::::::::::::::::::::::: callout +## Computational reproducibility + +Every object should be able to map back to either a file, a built-in dataset in R, or another intermediate step. If you found any variables where you weren't able to answer the "Where did this come from?" question, then that's a problem! Did you build a script that mistakenly relied on an object that was in your environment but was never properly defined? + +Mapping exercises like this can be a great way to check whether entire script is reproducible. Reproducibility is important in more cases than just debugging! More and more journals are requiring full analysis code to be posted, and if that code isn't reproducible, it will severely hamper other researchers' efforts to confirm and expand on what you've done. + +Various packages can help you keep track of your code and make it more reproducible. Check out the [`{targets}`](https://books.ropensci.org/targets/) and [`{renv}`](https://rstudio.github.io/renv/articles/renv.html) packages in particular if you're interested in learning more. +::::::::::::::::::::::::::::::::::::::::::: diff --git a/episodes/5-asking-your-question.md b/episodes/5-asking-your-question.md index 5a19e2d3..0eec131e 100644 --- a/episodes/5-asking-your-question.md +++ b/episodes/5-asking-your-question.md @@ -5,25 +5,459 @@ exercises: 2 --- :::::::::::::::::::::::::::::::::::::: questions - - How can I make sure my minimal reproducible example will actually run correctly for someone else? - How can I easily share a reproducible example with a mentor or helper, or online? - How do I ask a good question? - :::::::::::::::::::::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::::::::: objectives - -- Use the reprex package to support making reproducible examples. +- Use the reprex package to test whether an example is reproducible. - Use the reprex package to format reprexes for posting online. - Understand the benefits and drawbacks of different places to get help. - Have a road map to follow when posting a question to make sure it's a good question. - +- Understand what the {reprex} package does and doesn't do. :::::::::::::::::::::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::::::::: keypoints - - The {reprex} package makes it easy to format and share your reproducible examples. - The {reprex} package helps you test whether your reprex is reproducible, and also helps you prepare the reprex to share with others. - Following a certain set of steps will make your questions clearer and likelier to get answered. :::::::::::::::::::::::::::::::::::::::::::::::: + +Congratulations on making it this far. Creating a good reproducible example is a lot of work! In this episode, we'll talk about how to finish up your reprex, make sure it's ready to send to your helper or post online, and make absolutely double sure that it actually runs. + +There are three principles to remember when you think about sharing your reprex with other people: + +1. Reproducibility +2. Formatting +3. Context + +## 1. Reproducibility + +You might be thinking, *Haven't we already talked a lot about reproducibility?* We have! We discussed variables and packages, minimal datasets, and making sure that the problem is meaningfully reproduced by the data that you choose. But there are some reasons that a code snippet that appears reproducible in your own R session might not actually be runnable by someone else. + +Some possible reasons: + +- You forgot to account for the origins of some functions and/or variables. We went through our code methodically, but what if we missed something? It would be nice to confirm that the code is as self-contained as we thought it was. + +- Your code accidentally relies on objects in your R environment that won't exist for other people. For example, imagine you defined a function `my_awesome_custom_function()` in a project-specific `functions.R` script, and your code calls that function. + +![A function called `"my_awesome_custom_function"` is lurking in my R environment. I must have defined it a while ago and forgotten! Code that includes this function will not run for someone else unless the function definition is also included in the reprex.](fig/custom_function.png) + +``` +my_awesome_custom_function("rain") +# [1] "rain is awesome!" +``` + +I might conclude that this code is reproducible--after all, it works when I run it! But unless I include the function defintion in the reprex itself, nobody who doesn't already have that custom function defined in their R environment will be able to run the code. + +A corrected reprex would look like this: + +``` +my_awesome_custom_function <- function(x){print(paste0(x, " is awesome!"))} +my_awesome_custom_function("rain") +#[1] "rain is awesome!" +``` + +Wouldn't it be nice if we had a way to check whether any traps like this are lurking in our draft reprex? + +Luckily, there's a package that can help you test out your reprexes in a clean, isolated environment to make sure they're actually reproducible. Introducing `{reprex}`! + +The most important function in the `{reprex}` package is called `reprex()`. Here's how to use it. + +First, install and load the `{reprex}` package. + +``` +#install.packages("reprex") +library(reprex) +``` + +Second, write some code. This is your reproducible example. + +``` +(y <- 1:4) +mean(y) +``` + +Third, highlight that code and copy it to your clipboard (e.g. `Cmd + C` on Mac, or `Ctrl + C` on Windows). + +Finally, type `reprex()` into your console. + +``` +# (with the target code snippet copied to your clipboard already...) +# In the console: +reprex() +``` + +`{reprex}` will grab the code that you copied to your clipboard and _run that code in an isolated environment_. It will return a nicely formatted reproducible example that includes your code and and any results, plots, warnings, or errors generated. + +The generated output will be on your clipboard by default, so all you have to do is paste it into GitHub, StackOverflow, Slack, or another venue. + +::: callout +This workflow is unusual and takes some getting used to! Instead of copying your code *into* the function, you simply copy it to the clipboard (a mysterious, invisible place to most of us) and then let the blank, empty `reprex()` function go over to the clipboard by itself and find it. + +And then the completed, rendered reprex replaces the original code on the clipboard and all you need to do is paste, not copy and paste. + +Try to practice this workflow a few times with very simple reprexes so you can get a sense for the steps. +::: + +Let's practice this one more time. Here's some very simple code: + +``` +library(ggplot2) +library(dplyr) +mpg %>% + ggplot(aes(x = factor(cyl), y = displ))+ + geom_boxplot() +``` + +Let's highlight the code snippet, copy it to the clipboard, and then run `reprex()` in the console. + +``` +# In the console: +reprex() +``` + +The result, which was automatically placed onto my clipboard and which I pasted here, looks like this: + +``` r +library(ggplot2) +library(dplyr) +#> +#> Attaching package: 'dplyr' +#> The following objects are masked from 'package:stats': +#> +#> filter, lag +#> The following objects are masked from 'package:base': +#> +#> intersect, setdiff, setequal, union +mpg %>% + ggplot(aes(x = factor(cyl), y = displ))+ + geom_boxplot() +``` + +![](https://i.imgur.com/bOA0BAS.png) + +Created on 2024-12-29 with [reprex v2.1.1](https://reprex.tidyverse.org) + +Nice and neat! It even includes the plot produced, so I don't have to take screenshots and figure out how to attach them to an email or something. + +The formatting is great, but `{reprex}` really shines when you treat it as a helpful collaborator in your process of building a reproducible example (including all dependencies, providing minimal data, etc.) + +Let's see what happens if we forget to include `library(ggplot2)` in our small reprex above. + +``` +library(dplyr) +mpg %>% + ggplot(aes(x = factor(cyl), y = displ))+ + geom_boxplot() +``` + +As before, let's copy that code to the clipboard, run `reprex()` in the console, and paste the result here. + +``` +# In the console: +reprex() +``` + +``` r +library(dplyr) +#> +#> Attaching package: 'dplyr' +#> The following objects are masked from 'package:stats': +#> +#> filter, lag +#> The following objects are masked from 'package:base': +#> +#> intersect, setdiff, setequal, union +mpg %>% + ggplot(aes(x = factor(cyl), y = displ))+ + geom_boxplot() +#> Error in ggplot(., aes(x = factor(cyl), y = displ)): could not find function "ggplot" +``` + +Created on 2024-12-29 with [reprex v2.1.1](https://reprex.tidyverse.org) + +Now we get an error message indicating that R cannot find the function `ggplot`! That's because we forgot to load the `ggplot2` package in the reprex. + +This happened even though we had `ggplot2` already loaded in our own current RStudio session. `{reprex}` deliberately "plays dumb", running the code in a clean, isolated R session that's different from the R session we've been working in. This keeps us honest and makes sure we don't forget any important packages or function definitions. + +Let's return to our previous example with the custom function. + +``` +my_awesome_custom_function("rain") +``` + +``` +# In the console: +reprex() +``` + +``` r +my_awesome_custom_function("rain") +#> Error in my_awesome_custom_function("rain"): could not find function "my_awesome_custom_function" +``` + +Created on 2024-12-29 with [reprex v2.1.1](https://reprex.tidyverse.org) + +By contrast, if we include the function definition: + +``` +my_awesome_custom_function <- function(x){print(paste0(x, " is awesome!"))} +my_awesome_custom_function("rain") +``` + +``` +# In the console: +reprex() +``` + +``` r +my_awesome_custom_function <- function(x){print(paste0(x, " is awesome!"))} +my_awesome_custom_function("rain") +#> [1] "rain is awesome!" +``` + +Created on 2024-12-29 with [reprex v2.1.1](https://reprex.tidyverse.org) + +## Other features of `{reprex}` + +### Session Info +Another nice thing about `{reprex}` is that you can choose to include information about your R session, in case your error has something to do with your R settings rather than the code itself. You can do that using the `session_info` argument to `reprex()`. + +For example: + +``` +library(ggplot2) +library(dplyr) +mpg %>% + ggplot(aes(x = factor(cyl), y = displ))+ + geom_boxplot() +``` + +``` +# In the console: +reprex(session_info = TRUE) +``` + +``` r +library(ggplot2) +library(dplyr) +#> +#> Attaching package: 'dplyr' +#> The following objects are masked from 'package:stats': +#> +#> filter, lag +#> The following objects are masked from 'package:base': +#> +#> intersect, setdiff, setequal, union +mpg %>% + ggplot(aes(x = factor(cyl), y = displ))+ + geom_boxplot() +``` + +![](https://i.imgur.com/kHBf9Zr.png) + +Created on 2024-12-29 with [reprex v2.1.1](https://reprex.tidyverse.org) + +
+ +Session info + + +``` r +sessioninfo::session_info() +#> ─ Session info ─────────────────────────────────────────────────────────────── +#> setting value +#> version R version 4.4.2 (2024-10-31) +#> os macOS Monterey 12.7.6 +#> system aarch64, darwin20 +#> ui X11 +#> language (EN) +#> collate en_US.UTF-8 +#> ctype en_US.UTF-8 +#> tz America/New_York +#> date 2024-12-29 +#> pandoc 3.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown) +#> +#> ─ Packages ─────────────────────────────────────────────────────────────────── +#> package * version date (UTC) lib source +#> cli 3.6.3 2024-06-21 [1] CRAN (R 4.4.0) +#> colorspace 2.1-1 2024-07-26 [1] CRAN (R 4.4.0) +#> curl 6.0.1 2024-11-20 [1] https://jeroen.r-universe.dev (R 4.4.2) +#> digest 0.6.37 2024-08-19 [1] CRAN (R 4.4.1) +#> dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.4.0) +#> evaluate 1.0.1 2024-10-10 [1] CRAN (R 4.4.1) +#> fansi 1.0.6 2023-12-08 [1] CRAN (R 4.4.0) +#> farver 2.1.2 2024-05-13 [1] CRAN (R 4.4.0) +#> fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.0) +#> fs 1.6.5 2024-10-30 [1] CRAN (R 4.4.1) +#> generics 0.1.3 2022-07-05 [1] CRAN (R 4.4.0) +#> ggplot2 * 3.5.1 2024-04-23 [1] CRAN (R 4.4.0) +#> glue 1.8.0 2024-09-30 [1] CRAN (R 4.4.1) +#> gtable 0.3.6 2024-10-25 [1] CRAN (R 4.4.1) +#> htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0) +#> knitr 1.49 2024-11-08 [1] CRAN (R 4.4.1) +#> labeling 0.4.3 2023-08-29 [1] CRAN (R 4.4.0) +#> lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.4.0) +#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.4.0) +#> munsell 0.5.1 2024-04-01 [1] CRAN (R 4.4.0) +#> pillar 1.9.0 2023-03-22 [1] CRAN (R 4.4.0) +#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.4.0) +#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.4.0) +#> reprex 2.1.1 2024-07-06 [1] CRAN (R 4.4.0) +#> rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.0) +#> rmarkdown 2.29 2024-11-04 [1] CRAN (R 4.4.1) +#> rstudioapi 0.17.1 2024-10-22 [1] CRAN (R 4.4.1) +#> scales 1.3.0 2023-11-28 [1] CRAN (R 4.4.0) +#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.0) +#> tibble 3.2.1 2023-03-20 [1] CRAN (R 4.4.0) +#> tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.4.0) +#> utf8 1.2.4 2023-10-22 [1] CRAN (R 4.4.0) +#> vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.4.0) +#> withr 3.0.2 2024-10-28 [1] CRAN (R 4.4.1) +#> xfun 0.49 2024-10-31 [1] CRAN (R 4.4.1) +#> xml2 1.3.6 2023-12-04 [1] CRAN (R 4.4.0) +#> yaml 2.3.10 2024-07-26 [1] CRAN (R 4.4.0) +#> +#> [1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library +#> +#> ────────────────────────────────────────────────────────────────────────────── +``` + +
+ +### Formatting + +The output of `reprex()` is markdown, which can easily be copied and pasted into many sites/apps. However, different places have slightly different formatting conventions for markdown. `{reprex}` lets you customize the output of your reprex according to where you're planning to post it. + +The default, `venue = "gh"`, gives you "[GitHub-Flavored Markdown](https://github.github.com/gfm/)". Another format you might want is "r", which gives you a runnable R script, with commented output interleaved. + +Check out the formatting options in the help file with `?reprex`, and try out a few depending on the destination of your reprex! + +::: callout +## `{reprex}` can't do everything for you! + +People often mention `{reprex}` as a useful tool for creating reproducible examples, but it can't do the work of crafting the example for you! The package doesn't locate the problem, pare down the code, create a minimal dataset, or automatically include package dependencies. + +A better way to think of `{reprex}` is as a tool to check your work as you go through the process of creating a reproducible example, and to help you polish up the result. +::: + +## Testing it out + +Now that we've met our new reprex-making collaborator, let's use it to test out the reproducible example we created in the previous episode. + +Here's the code we wrote: + +``` +# This script will contain my minimal reproducible example. +# Created by Mishka on 2024-12-17 + +# Load packages needed for the analysis +library(dplyr) + +# Provide a minimal dataset +years <- 1:4 +species <- c("Species A", "Species B", "Species C") +plots <- c("Control", "Treatment") + +total_records_per_year <- c(10, 12, 3, 30) # I chose these arbitrarily +total_rows <- sum(total_records_per_year) # how many total rows will we have? + +# Create the fake data using `rep()` and `sample()`. +minimal_data <- data.frame(year = rep(years, times = total_records_per_year), + plot_type = rep(plots, length.out = total_rows), + species_id = sample(species, size = total_rows, replace = TRUE)) # Because I assigned the species randomly, we should expect Species A to occur *roughly* 33% of the time. + +# Calculate proportion of Species A caught +head(minimal_data) +prop_speciesA <- minimal_data %>% + group_by(year, plot_type, species_id) %>% + summarize(total_count = n(), .groups = "drop") %>% + mutate(prop = total_count/sum(total_count)) %>% + dplyr::filter(species_id == "Species A") # keep only Species A + +head(prop_speciesA) # Species A only occurs 1-3% of the time in each plot type in each year. Why is this off by an order of magnitude? (This is the same problem I was seeing in my real data--the occurrence proportions were way, way too small.) +``` + +Time to find out if our example is actually reproducible! Let's copy it to the clipboard and run `reprex()`. Since we want to give Jordan a runnable R script, we can use `venue = "r"`. + +``` +# In the console: +reprex(venue = "r") +``` + +It worked! + +``` +# This script will contain my minimal reproducible example. +# Created by Mishka on 2024-12-17 + +# Load packages needed for the analysis +library(dplyr) +#> +#> Attaching package: 'dplyr' +#> The following objects are masked from 'package:stats': +#> +#> filter, lag +#> The following objects are masked from 'package:base': +#> +#> intersect, setdiff, setequal, union + +# Provide a minimal dataset +years <- 1:4 +species <- c("Species A", "Species B", "Species C") +plots <- c("Control", "Treatment") + +total_records_per_year <- c(10, 12, 3, 30) # I chose these arbitrarily +total_rows <- sum(total_records_per_year) # how many total rows will we have? + +# Create the fake data using `rep()` and `sample()`. +minimal_data <- data.frame(year = rep(years, times = total_records_per_year), + plot_type = rep(plots, length.out = total_rows), + species_id = sample(species, size = total_rows, replace = TRUE)) # Because I assigned the species randomly, we should expect Species A to occur *roughly* 33% of the time. + +# Calculate proportion of Species A caught +head(minimal_data) +#> year plot_type species_id +#> 1 1 Control Species A +#> 2 1 Treatment Species A +#> 3 1 Control Species A +#> 4 1 Treatment Species B +#> 5 1 Control Species C +#> 6 1 Treatment Species A +prop_speciesA <- minimal_data %>% + group_by(year, plot_type, species_id) %>% + summarize(total_count = n(), .groups = "drop") %>% + mutate(prop = total_count/sum(total_count)) %>% + dplyr::filter(species_id == "Species A") # keep only Species A + +head(prop_speciesA) # Species A only occurs 1-3% of the time in each plot type in each year. Why is this off by an order of magnitude? (This is the same problem I was seeing in my real data--the occurrence proportions were way, way too small.) +#> # A tibble: 6 × 5 +#> year plot_type species_id total_count prop +#> +#> 1 1 Control Species A 2 0.0364 +#> 2 1 Treatment Species A 2 0.0364 +#> 3 2 Control Species A 1 0.0182 +#> 4 2 Treatment Species A 1 0.0182 +#> 5 3 Control Species A 2 0.0364 +#> 6 4 Control Species A 4 0.0727 +``` + +Now we have a beautifully-formatted reprex that includes runnable code and all the context needed to reproduce the problem. + +It's time to send another email to Jordan to finally get help with our problem. It's important to include a brief description of the problem, including what we were hoping to achieve and what the problem seems to be. + +> Hi Jordan, +> I took some time to learn how to write reproducible examples. I've simplified my problem down to a "reprex" and formatted it with the `{reprex}` package (attached). I'm using a simplified dataset here, but hopefully it illustrates the problem. +> Basically, my data has three species, and I'm trying to calculate the proportion of each species that was caught in each plot type in each of several years. In the example, I would expect each species to come up approximately one third of the time, but instead I'm seeing really tiny proportions, like 1-3%. My real data shows a similar pattern. +> I think something must be wrong with how I'm calculating the proportions, but I can't quite figure it out. Can you shed some light on what might be going on? +> Thanks so much! +> Mishka +> Attachment: reprex-script.R + +XXX how should this end? Some possibilities: +- no solution! The point wasn't to find the solution, it was to create a reprex. +- Jordan writes back and solves the problem +- Mishka figures out what was wrong in the process of drafting the email +- Something else? diff --git a/episodes/fig/custom_function.png b/episodes/fig/custom_function.png new file mode 100644 index 00000000..0fc38a5e Binary files /dev/null and b/episodes/fig/custom_function.png differ diff --git a/learners/setup.md b/learners/setup.md index c8cf32d9..a9c937d7 100644 --- a/learners/setup.md +++ b/learners/setup.md @@ -122,5 +122,5 @@ We will download the data directly from R during the lessons. However, if you ar The data files for the lesson can be downloaded manually: - - [cleaned data](../episodes/data/cleaned/surveys_complete_77_89.csv) and - - [zip file of raw data](../episodes/data/new_data.zip). + - [cleaned data](../episodes/data/surveys_complete_77_89.csv) and + - [zip file of raw data](). # FIXME diff --git a/renv/profiles/lesson-requirements/renv.lock b/renv/profiles/lesson-requirements/renv.lock index a6eb81ad..3e6e5445 100644 --- a/renv/profiles/lesson-requirements/renv.lock +++ b/renv/profiles/lesson-requirements/renv.lock @@ -17,6 +17,17 @@ ] }, "Packages": { + "DBI": { + "Package": "DBI", + "Version": "1.2.3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "methods" + ], + "Hash": "065ae649b05f1ff66bb0c793107508f5" + }, "MASS": { "Package": "MASS", "Version": "7.3-61", @@ -93,6 +104,26 @@ ], "Hash": "4ac8e423216b8b70cb9653d1b3f71eb9" }, + "askpass": { + "Package": "askpass", + "Version": "1.2.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "sys" + ], + "Hash": "c39f4155b3ceb1a9a2799d700fbd4b6a" + }, + "backports": { + "Package": "backports", + "Version": "1.5.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R" + ], + "Hash": "e1e1b9d75c37401117b636b7ae50827a" + }, "base64enc": { "Package": "base64enc", "Version": "0.1-3", @@ -127,6 +158,18 @@ ], "Hash": "e84984bf5f12a18628d9a02322128dfd" }, + "blob": { + "Package": "blob", + "Version": "1.2.4", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "methods", + "rlang", + "vctrs" + ], + "Hash": "40415719b5a479b87949f3aa0aee737c" + }, "boot": { "Package": "boot", "Version": "1.3-31", @@ -139,6 +182,26 @@ ], "Hash": "de2a4646c18661d6a0a08ec67f40b7ed" }, + "broom": { + "Package": "broom", + "Version": "1.0.7", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "backports", + "dplyr", + "generics", + "glue", + "lifecycle", + "purrr", + "rlang", + "stringr", + "tibble", + "tidyr" + ], + "Hash": "8fcc818f3b9887aebaf206f141437cc9" + }, "bslib": { "Package": "bslib", "Version": "0.8.0", @@ -185,6 +248,18 @@ ], "Hash": "d7e13f49c19103ece9e58ad2d83a7354" }, + "cellranger": { + "Package": "cellranger", + "Version": "1.1.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "rematch", + "tibble" + ], + "Hash": "f61dbaec772ccd2e17705c1e872e9e7c" + }, "cli": { "Package": "cli", "Version": "3.6.3", @@ -220,6 +295,19 @@ ], "Hash": "d954cb1c57e8d8b756165d7ba18aa55a" }, + "conflicted": { + "Package": "conflicted", + "Version": "1.2.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cli", + "memoise", + "rlang" + ], + "Hash": "bb097fccb22d156624fd07cd2894ddb6" + }, "cpp11": { "Package": "cpp11", "Version": "0.5.0", @@ -242,6 +330,57 @@ ], "Hash": "859d96e65ef198fd43e82b9628d593ef" }, + "curl": { + "Package": "curl", + "Version": "6.0.1", + "Source": "Repository", + "Repository": "https://jeroen.r-universe.dev", + "RemoteUrl": "https://github.com/jeroen/curl", + "RemoteSha": "c7b5ad150ec4e0701c203df19e3920ca90d1a54c", + "Requirements": [ + "R" + ], + "Hash": "e8ba62486230951fcd2b881c5be23f96" + }, + "data.table": { + "Package": "data.table", + "Version": "1.16.2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "methods" + ], + "Hash": "2e00b378fc3be69c865120d9f313039a" + }, + "dbplyr": { + "Package": "dbplyr", + "Version": "2.5.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "DBI", + "R", + "R6", + "blob", + "cli", + "dplyr", + "glue", + "lifecycle", + "magrittr", + "methods", + "pillar", + "purrr", + "rlang", + "tibble", + "tidyr", + "tidyselect", + "utils", + "vctrs", + "withr" + ], + "Hash": "39b2e002522bfd258039ee4e889e0fd1" + }, "digest": { "Package": "digest", "Version": "0.6.37", @@ -276,6 +415,25 @@ ], "Hash": "fedd9d00c2944ff00a0e2696ccf048ec" }, + "dtplyr": { + "Package": "dtplyr", + "Version": "1.3.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cli", + "data.table", + "dplyr", + "glue", + "lifecycle", + "rlang", + "tibble", + "tidyselect", + "vctrs" + ], + "Hash": "54ed3ea01b11e81a86544faaecfef8e2" + }, "evaluate": { "Package": "evaluate", "Version": "1.0.1", @@ -324,6 +482,22 @@ ], "Hash": "bd1297f9b5b1fc1372d19e2c4cd82215" }, + "forcats": { + "Package": "forcats", + "Version": "1.0.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cli", + "glue", + "lifecycle", + "magrittr", + "rlang", + "tibble" + ], + "Hash": "1a0a9a3d5083d0d573c4214576f1e690" + }, "fs": { "Package": "fs", "Version": "1.6.5", @@ -335,6 +509,28 @@ ], "Hash": "7f48af39fa27711ea5fbd183b399920d" }, + "gargle": { + "Package": "gargle", + "Version": "1.5.2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cli", + "fs", + "glue", + "httr", + "jsonlite", + "lifecycle", + "openssl", + "rappdirs", + "rlang", + "stats", + "utils", + "withr" + ], + "Hash": "fc0b272e5847c58cd5da9b20eedbd026" + }, "generics": { "Package": "generics", "Version": "0.1.3", @@ -382,6 +578,59 @@ ], "Hash": "5899f1eaa825580172bb56c08266f37c" }, + "googledrive": { + "Package": "googledrive", + "Version": "2.1.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cli", + "gargle", + "glue", + "httr", + "jsonlite", + "lifecycle", + "magrittr", + "pillar", + "purrr", + "rlang", + "tibble", + "utils", + "uuid", + "vctrs", + "withr" + ], + "Hash": "e99641edef03e2a5e87f0a0b1fcc97f4" + }, + "googlesheets4": { + "Package": "googlesheets4", + "Version": "1.1.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cellranger", + "cli", + "curl", + "gargle", + "glue", + "googledrive", + "httr", + "ids", + "lifecycle", + "magrittr", + "methods", + "purrr", + "rematch2", + "rlang", + "tibble", + "utils", + "vctrs", + "withr" + ], + "Hash": "d6db1667059d027da730decdc214b959" + }, "gtable": { "Package": "gtable", "Version": "0.3.6", @@ -398,6 +647,27 @@ ], "Hash": "de949855009e2d4d0e52a844e30617ae" }, + "haven": { + "Package": "haven", + "Version": "2.5.4", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cli", + "cpp11", + "forcats", + "hms", + "lifecycle", + "methods", + "readr", + "rlang", + "tibble", + "tidyselect", + "vctrs" + ], + "Hash": "9171f898db9d9c4c1b2c745adc2c1ef1" + }, "here": { "Package": "here", "Version": "1.0.1", @@ -449,6 +719,32 @@ ], "Hash": "81d371a9cc60640e74e4ab6ac46dcedc" }, + "httr": { + "Package": "httr", + "Version": "1.4.7", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "R6", + "curl", + "jsonlite", + "mime", + "openssl" + ], + "Hash": "ac107251d9d9fd72f0ca8049988f1d7f" + }, + "ids": { + "Package": "ids", + "Version": "1.0.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "openssl", + "uuid" + ], + "Hash": "99df65cfef20e525ed38c3d2577f7190" + }, "isoband": { "Package": "isoband", "Version": "0.2.7", @@ -632,6 +928,24 @@ ], "Hash": "785ef8e22389d4a7634c6c944f2dc07d" }, + "modelr": { + "Package": "modelr", + "Version": "0.1.11", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "broom", + "magrittr", + "purrr", + "rlang", + "tibble", + "tidyr", + "tidyselect", + "vctrs" + ], + "Hash": "4f50122dc256b1b6996a4703fecea821" + }, "munsell": { "Package": "munsell", "Version": "0.5.1", @@ -664,6 +978,16 @@ "Repository": "CRAN", "Hash": "27550641889a3abf3aec4d91186311ec" }, + "openssl": { + "Package": "openssl", + "Version": "2.2.2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "askpass" + ], + "Hash": "d413e0fef796c9401a4419485f709ca1" + }, "pillar": { "Package": "pillar", "Version": "1.9.0", @@ -739,6 +1063,32 @@ ], "Hash": "b4404b1de13758dea1c0484ad0d48563" }, + "purrr": { + "Package": "purrr", + "Version": "1.0.2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cli", + "lifecycle", + "magrittr", + "rlang", + "vctrs" + ], + "Hash": "1cba04a4e9414bdefc9dcaa99649a8dc" + }, + "ragg": { + "Package": "ragg", + "Version": "1.3.3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "systemfonts", + "textshaping" + ], + "Hash": "0595fe5e47357111f29ad19101c7d271" + }, "rappdirs": { "Package": "rappdirs", "Version": "0.3.3", @@ -782,6 +1132,38 @@ ], "Hash": "9de96463d2117f6ac49980577939dfb3" }, + "readxl": { + "Package": "readxl", + "Version": "1.4.3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cellranger", + "cpp11", + "progress", + "tibble", + "utils" + ], + "Hash": "8cf9c239b96df1bbb133b74aef77ad0a" + }, + "rematch": { + "Package": "rematch", + "Version": "2.0.0", + "Source": "Repository", + "Repository": "CRAN", + "Hash": "cbff1b666c6fa6d21202f07e2318d4f1" + }, + "rematch2": { + "Package": "rematch2", + "Version": "2.1.2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "tibble" + ], + "Hash": "76c9e04c712a05848ae7a23d2f170a40" + }, "renv": { "Package": "renv", "Version": "1.0.11", @@ -865,6 +1247,25 @@ "Repository": "CRAN", "Hash": "5f90cd73946d706cfe26024294236113" }, + "rvest": { + "Package": "rvest", + "Version": "1.0.4", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cli", + "glue", + "httr", + "lifecycle", + "magrittr", + "rlang", + "selectr", + "tibble", + "xml2" + ], + "Hash": "0bcf0c6f274e90ea314b812a6d19a519" + }, "sass": { "Package": "sass", "Version": "0.4.9", @@ -899,6 +1300,19 @@ ], "Hash": "c19df082ba346b0ffa6f833e92de34d1" }, + "selectr": { + "Package": "selectr", + "Version": "0.4-2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "R6", + "methods", + "stringr" + ], + "Hash": "3838071b66e0c566d55cc26bd6e27bf4" + }, "stringi": { "Package": "stringi", "Version": "1.8.4", @@ -929,6 +1343,38 @@ ], "Hash": "960e2ae9e09656611e0b8214ad543207" }, + "sys": { + "Package": "sys", + "Version": "3.4.3", + "Source": "Repository", + "Repository": "CRAN", + "Hash": "de342ebfebdbf40477d0758d05426646" + }, + "systemfonts": { + "Package": "systemfonts", + "Version": "1.1.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cpp11", + "lifecycle" + ], + "Hash": "213b6b8ed5afbf934843e6c3b090d418" + }, + "textshaping": { + "Package": "textshaping", + "Version": "0.4.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cpp11", + "lifecycle", + "systemfonts" + ], + "Hash": "5142f8bc78ed3d819d26461b641627ce" + }, "tibble": { "Package": "tibble", "Version": "3.2.1", @@ -948,6 +1394,29 @@ ], "Hash": "a84e2cc86d07289b3b6f5069df7a004c" }, + "tidyr": { + "Package": "tidyr", + "Version": "1.3.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cli", + "cpp11", + "dplyr", + "glue", + "lifecycle", + "magrittr", + "purrr", + "rlang", + "stringr", + "tibble", + "tidyselect", + "utils", + "vctrs" + ], + "Hash": "915fb7ce036c22a6a33b5a8adb712eb1" + }, "tidyselect": { "Package": "tidyselect", "Version": "1.2.1", @@ -964,6 +1433,46 @@ ], "Hash": "829f27b9c4919c16b593794a6344d6c0" }, + "tidyverse": { + "Package": "tidyverse", + "Version": "2.0.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "broom", + "cli", + "conflicted", + "dbplyr", + "dplyr", + "dtplyr", + "forcats", + "ggplot2", + "googledrive", + "googlesheets4", + "haven", + "hms", + "httr", + "jsonlite", + "lubridate", + "magrittr", + "modelr", + "pillar", + "purrr", + "ragg", + "readr", + "readxl", + "reprex", + "rlang", + "rstudioapi", + "rvest", + "stringr", + "tibble", + "tidyr", + "xml2" + ], + "Hash": "c328568cd14ea89a83bd4ca7f54ae07e" + }, "timechange": { "Package": "timechange", "Version": "0.3.0", @@ -1006,6 +1515,16 @@ ], "Hash": "62b65c52671e6665f803ff02954446e9" }, + "uuid": { + "Package": "uuid", + "Version": "1.2-1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R" + ], + "Hash": "34e965e62a41fcafb1ca60e9b142085b" + }, "vctrs": { "Package": "vctrs", "Version": "0.6.5", @@ -1081,6 +1600,19 @@ ], "Hash": "8687398773806cfff9401a2feca96298" }, + "xml2": { + "Package": "xml2", + "Version": "1.3.6", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cli", + "methods", + "rlang" + ], + "Hash": "1d0336142f4cd25d8d23cd3ba7a8fb61" + }, "yaml": { "Package": "yaml", "Version": "2.3.10", diff --git a/scripts/krat-analysis.R b/scripts/krat-analysis.R new file mode 100644 index 00000000..d0b47a23 --- /dev/null +++ b/scripts/krat-analysis.R @@ -0,0 +1,143 @@ +# Kangaroo rat analysis using the Portal data +# Created by: Mishka +# Last updated: 2024-11-22 + +# Load packages to use in this script +library(readr) +library(dplyr) +library(ggplot2) +library(stringr) +library(here) + +# Read in the data +rodents <- read_csv(here("scripts/data/surveys_complete_77_89.csv")) + +### DATA WRANGLING #### +glimpse(rodents) # or click on the environment +str(rodents) # an alternative that does the same thing +head(rodents) # or open fully with View() or click in environment + +table(rodents$taxa) + +# Abundance distribution of taxa +rodents %>% + ggplot(aes(x=taxa))+ + geom_bar() + +# Examine NA values +## How do we find NAs anyway? ---- +head(is.na(rodents$taxa)) # logical--tells us when an observation is an NA (T or F) + +# Not very helpful. BUT +sum(is.na(rodents$taxa)) # sum considers T = 1 and F = 0 + +# Simplify down to just rodents +rodents <- rodents %>% + filter(taxa == "Rodent") +glimpse(rodents) + +# Just kangaroo rats because this is what we are studying +krats <- rodents %>% + filter(genus == "Dipodomys") +dim(krats) # okay, so that's a lot smaller, great. +glimpse(krats) + +# Prep for time analysis +# To examine trends over time, we'll need to create a date column +krats <- krats %>% + mutate(date = lubridate::ymd(paste(year, month, day, sep = "-"))) + +# Examine differences in different time periods +krats <- krats %>% + mutate(time_period = ifelse(year < 1988, "early", "late")) + +# Check that this went through; check for NAs +table(krats$time_period, exclude = NULL) # learned how to do this earlier + +### QUESTION 1: How many k-rats over time in the past? ### +# How many kangaroo rats of each species were found at the study site in past years (so you know what to expect for a sample size this year)? + +# Numbers over time by plot type +krats %>% + ggplot(aes(x = date, fill = plot_type)) + + geom_histogram()+ + facet_wrap(~species)+ + theme_bw()+ + scale_fill_viridis_d(option = "plasma")+ + geom_vline(aes(xintercept = lubridate::ymd("1988-01-01")), col = "dodgerblue") + +# Oops we gotta get rid of the unidentified k-rats +krats <- krats %>% + filter(species != "sp.") + +# Re-do the plot above +krats %>% + ggplot(aes(x = date, fill = plot_type)) + + geom_histogram()+ + facet_wrap(~species)+ + theme_bw()+ + scale_fill_viridis_d(option = "plasma")+ + geom_vline(aes(xintercept = lubridate::ymd("1988-01-01")), col = "dodgerblue") + +# How many individuals caught per day? +krats_per_day <- krats %>% + group_by(date, year, species) %>% + summarize(n = n()) %>% + group_by(species) + +krats_per_day %>% + ggplot(aes(x = species, y = n))+ + geom_boxplot(outlier.shape = NA)+ + geom_jitter(width = 0.2, alpha = 0.2, aes(col = year))+ + theme_classic()+ + ylab("Number per day")+ + xlab("Species") + +#### QUESTION 2: Do the k-rat exclusion plots work? ##### +# Do the k-rat exclusion plots work? (i.e. Does the abundance of each species differ by plot?) +# If the k-rat plots work, then we would expect: +# A. Fewer k-rats overall in any of the exclusion plots than in the control, with the fewest in the long-term k-rat exclusion plot +counts_per_day <- krats %>% + group_by(year, plot_id, plot_type, month, day, species_id) %>% + summarize(count_per_day = n()) + +counts_per_day %>% + ggplot(aes(x = plot_type, y = count_per_day, fill = species_id, group = interaction(plot_type, species_id)))+ + geom_boxplot(outlier.size = 0.5)+ + theme_minimal()+ + labs(title = "Kangaroo rat captures, all years", + x = "Plot type", + y = "Individuals per day", + fill = "Species") + +# B. For Spectabilis-specific exclosure, we expect a lower proportion of spectabilis there than in the other plots. +control_spectab <- krats %>% + filter(plot_type %in% c("Control", "Spectab exclosure")) + +head(control_spectab) +prop_spectab <- control_spectab %>% + group_by(year, plot_type, species_id) %>% + summarize(total_count = n(), .groups = "drop_last") %>% + mutate(prop = total_count/sum(total_count)) %>% + filter(species_id == "DS") # keep only spectabilis + +prop_spectab %>% + ggplot(aes(x = year, y = prop, col = plot_type))+ + geom_point()+ + geom_line()+ + theme_minimal()+ + labs(title = "Spectab exclosures did not reduce proportion of\nspectab captures", + y = "Spectabilis proportion", + x = "Year", + color = "Plot type") + +#### MODELING #### +counts_mod <- lm(count_per_day ~ plot_type + species_id, data = counts_per_day) +summary(counts_mod) + +# with interaction term: +counts_mod_interact <- lm(count_per_day ~ plot_type*species_id, data = counts_per_day) +summary(counts_mod_interact) + +summary(counts_mod) +summary(counts_mod_interact) diff --git a/scripts/reprex-script.R b/scripts/reprex-script.R new file mode 100644 index 00000000..4f94b71b --- /dev/null +++ b/scripts/reprex-script.R @@ -0,0 +1 @@ +# This is a script for the Minimal Reproducible Code lesson