diff --git a/06-farm.qmd b/06-farm.qmd
index 961db384..d6f99de3 100644
--- a/06-farm.qmd
+++ b/06-farm.qmd
@@ -2,7 +2,7 @@
engine: knitr
---
-# Farm data {#sec-farm-data}
+# Measurement, censuses, and sampling {#sec-farm-data}
::: {.callout-note}
Chapman and Hall/CRC published this book in July 2023. You can purchase that [here](https://www.routledge.com/Telling-Stories-with-Data-With-Applications-in-R/Alexander/p/book/9781032134772). This online version has some updates to what was printed.
diff --git a/07-gather.qmd b/07-gather.qmd
index c48c73fb..58cd287c 100644
--- a/07-gather.qmd
+++ b/07-gather.qmd
@@ -2,7 +2,7 @@
engine: knitr
---
-# Gather data {#sec-gather-data}
+# APIs, scraping, and parsing {#sec-gather-data}
::: {.callout-note}
Chapman and Hall/CRC published this book in July 2023. You can purchase that [here](https://www.routledge.com/Telling-Stories-with-Data-With-Applications-in-R/Alexander/p/book/9781032134772). This online version has some updates to what was printed.
@@ -1496,7 +1496,7 @@ In general the result is not too bad. OCR is a useful tool but is not perfect an
1. *(Plan)* Consider the following scenario: *A group of five undergraduates---Matt, Ash, Jacki, Rol, and Mike---each read some number of pages from a book each day for 100 days. Two of the undergraduates are a couple and so their number of pages is positively correlated, however all the others are independent.* Please sketch what a dataset could look like, and then sketch a graph that you could build to show all observations.
2. *(Simulate)* Please further consider the scenario described and simulate the situation (note the relationship between some variables). Then write five tests based on the simulated data.
-3. *(Acquire)* Please obtain some actual data about snowfall and add a script updating the simulated tests to these actual data.
+3. *(Acquire)* Please obtain some actual data, similar to the scenario, and add a script updating the simulated tests to these actual data.
4. *(Explore)* Build a graph and table using the real data.
5. *(Communicate)* Please write some text to accompany the graph and table. Separate the code appropriately into `R` files and a Quarto doc. Submit a link to a high-quality GitHub repo.
diff --git a/08-hunt.qmd b/08-hunt.qmd
index c7dbf2e9..dc3989b0 100644
--- a/08-hunt.qmd
+++ b/08-hunt.qmd
@@ -2,7 +2,7 @@
engine: knitr
---
-# Hunt data {#sec-hunt-data}
+# Experiments and surveys {#sec-hunt-data}
::: {.callout-note}
Chapman and Hall/CRC published this book in July 2023. You can purchase that [here](https://www.routledge.com/Telling-Stories-with-Data-With-Applications-in-R/Alexander/p/book/9781032134772). This online version has some updates to what was printed.
@@ -53,7 +53,7 @@ library(tidyverse)
## Introduction
-This chapter is about obtaining data with experiments. This is a situation in which we can explicitly control and vary what we are interested in. The advantage of this is that identifying and estimating an effect should be clear. There is a treatment group that is subject to what we are interested in, and a control group that is not. These are randomly split before treatment. And so, if they end up different, then it must be because of the treatment. Unfortunately, life is rarely so smooth. Arguing about how similar the treatment and control groups were tends to carry on indefinitely. And before we can estimate an effect, we need to be able to measure whatever it is that we are interested in, which is often surprisingly difficult.
+This chapter is about obtaining data with experiments and surveys. An experiment is a situation in which we can explicitly control and vary what we are interested in. The advantage of this is that identifying and estimating an effect should be clear. There is a treatment group that is subject to what we are interested in, and a control group that is not. These are randomly split before treatment. And so, if they end up different, then it must be because of the treatment. Unfortunately, life is rarely so smooth. Arguing about how similar the treatment and control groups were tends to carry on indefinitely. And before we can estimate an effect, we need to be able to measure whatever it is that we are interested in, which is often surprisingly difficult.
By way of motivation, consider the situation of someone who moved to San Francisco in 2014---as soon as they moved the Giants won the World Series and the Golden State Warriors began a historic streak of World Championships. They then moved to Chicago, and immediately the Cubs won the World Series for the first time in 100 years. They then moved to Massachusetts, and the Patriots won the Super Bowl again, and again, and again. And finally, they moved to Toronto, where the Raptors immediately won the World Championship. Should a city pay them to move, or could municipal funds be better spent elsewhere?
@@ -889,63 +889,63 @@ question_or_not |>
### Practice {.unnumbered}
-1. *(Plan)* Consider the following scenario: *A political candidate is interested in how two polling values change over the course of an election campaign: approval rating and vote-share. The two are measured as percentages, and are somewhat correlated. There tends to be large changes when there is a debate between candidates.* Please sketch what that dataset could look like and then sketch a graph that you could build to show all observations.
-2. *(Simulate)* Please further consider the scenario described and simulate the situation. Please include five tests based on the simulated data. Submit a link to a GitHub Gist that contains your code.
-3. *(Acquire)* Please identify and document a possible source of such a dataset.
-4. *(Explore)* Please use `ggplot2` to build the graph that you sketched using the simulated data. Submit a link to a GitHub Gist that contains your code.
-5. *(Communicate)* Please write two paragraphs about what you did.
+1. *(Plan)* Consider the following scenario: *A political candidate is interested in how two polling values change over the course of an election campaign: approval rating and vote-share. The two are measured as percentages, and are somewhat correlated. There tend to be large changes when there is a debate between candidates.* Please sketch what a dataset could look like, and then sketch a graph that you could build to show all observations.
+2. *(Simulate)* Please simulate the situation, including the relationship, and then write tests for the simulated dataset.
+3. *(Acquire)* Please obtain some actual data, similar to the scenario, and add a script updating the simulated tests to these actual data.
+4. *(Explore)* Build graphs and tables using the real data.
+5. *(Communicate)* Write a short paper using Quarto and submit a link to a high-quality GitHub repo.
### Quiz {.unnumbered}
1. Which of the following best describes the fundamental problem of causal inference (pick one)?
- a. We cannot observe both the treatment and control outcomes for the same individual simultaneously.
- b. Randomization cannot eliminate all biases in an experiment.
- c. It is impossible to establish external validity in any experiment.
- d. Surveys cannot accurately measure individual preferences.
+ a. Randomization cannot eliminate all biases in an experiment.
+ b. Surveys cannot accurately measure individual preferences.
+ c. We cannot observe both the treatment and control outcomes for the same individual simultaneously.
+ d. It is impossible to establish external validity in any experiment.
2. In the Neyman-Rubin potential outcomes framework, what is the primary goal when conducting an experiment (pick one)?
- a. To maximize the sample size for greater statistical power.
- b. To estimate the causal effect by comparing treatment and control groups.
- c. To ensure all participants receive the treatment at some point.
- d. To focus on external validity over internal validity.
-3. Based on @gertler2016impact, what does the basic impact evaluation formula $\Delta = (Y_i|t=1) - (Y_i|t=0)$ represent (pick one)?
- a. The total cost of a program.
- b. The difference in outcomes between treatment and comparison groups.
- c. The average change in a participant's salary.
- d. The effect of external market forces on outcomes.
+ a. To estimate the causal effect by comparing treatment and control groups.
+ b. To focus on external validity over internal validity.
+ c. To maximize the sample size for greater statistical power.
+ d. To ensure all participants receive the treatment at some point.
+3. From @gertler2016impact, what does the basic impact evaluation formula $\Delta = (Y_i|t=1) - (Y_i|t=0)$ represent (pick one)?
+ a. The difference in outcomes between treatment and comparison groups.
+ b. The average change in a participant's salary.
+ c. The effect of external market forces on outcomes.
+ d. The total cost of a program.
4. Why is randomization important in experimental design (pick one)?
a. It ensures the sample is representative of the population.
b. It eliminates the need for a control group.
- c. It helps create treatment and control groups that are similar except for the treatment.
- d. It guarantees external validity.
-5. Based on @gertler2016impact, what is a common problem when trying to measure the counterfactual (pick one)?
- a. It is impossible to observe both treatment and non-treatment outcomes for the same individual.
+ c. It guarantees external validity.
+ d. It helps create treatment and control groups that are similar except for the treatment.
+5. From @gertler2016impact, what is a common problem when trying to measure the counterfactual (pick one)?
+ a. Only randomized trials can provide the counterfactual.
b. Data for control groups are always inaccurate.
- c. Only randomized trials can provide the counterfactual.
+ c. It is impossible to observe both treatment and non-treatment outcomes for the same individual.
d. Programs typically do not have sufficient participants.
-6. Based on @gertler2016impact, selection bias occurs when (pick one):
- a. The program is implemented at a national scale.
- b. Program evaluation lacks financial support.
- c. Data collection is incomplete.
- d. Participants are not randomly assigned.
+6. From @gertler2016impact, when does selection bias happen (pick one)?
+ a. Program evaluation lacks financial support.
+ b. The program is implemented at a national scale.
+ c. Participants are not randomly assigned.
+ d. Data collection is incomplete.
7. What is external validity (pick one)?
- a. Findings from an experiment hold in that setting.
- b. Findings from an experiment that has been repeated many times.
- c. Findings from an experiment hold outside that setting.
- d. Findings from an experiment for which code and data are available.
+ a. Findings from an experiment that has been repeated many times.
+ b. Findings from an experiment hold in that setting.
+ c. Findings from an experiment for which code and data are available.
+ d. Findings from an experiment hold outside that setting.
8. What is internal validity (pick one)?
- a. Findings from an experiment hold in that setting.
- b. Findings from an experiment hold outside that setting.
- c. Findings from an experiment that has been repeated many times.
- d. Findings from an experiment for which code and data are available.
-9. Based on @gertler2016impact, what does internal validity refer to in an impact evaluation (pick one)?
- a. The ability to generalize findings to other populations.
- b. The efficiency of program management.
- c. The long-term sustainability of a program.
- d. The accuracy of measuring the causal effect of a program.
-10. Based on @gertler2016impact, what does external validity refer to in an impact evaluation (pick one)?
- a. The ability to generalize the results to the eligible population.
- b. The effectiveness of a randomized control trial.
- c. The administrative costs of a program.
+ a. Findings from an experiment for which code and data are available.
+ b. Findings from an experiment that has been repeated many times.
+ c. Findings from an experiment hold in that setting.
+ d. Findings from an experiment hold outside that setting.
+9. From @gertler2016impact, what does internal validity refer to in an impact evaluation (pick one)?
+ a. The accuracy of measuring the causal effect of a program.
+ b. The ability to generalize findings to other populations.
+ c. The efficiency of program management.
+ d. The long-term sustainability of a program.
+10. From @gertler2016impact, what does external validity refer to in an impact evaluation (pick one)?
+ a. The administrative costs of a program.
+ b. The ability to generalize the results to the eligible population.
+ c. The effectiveness of a randomized control trial.
d. The extent to which outcomes reflect policy changes.
11. Please write some code for the following dataset that would randomly assign people into one of two groups.
@@ -963,134 +963,134 @@ netflix_data <-
)
```
-12. Based on @gertler2016impact, a valid comparison group must have all of the following characteristics EXCEPT (pick one):
- a. Be affected directly or indirectly by the program.
- b. The same average characteristics as the treatment group.
- c. Have outcomes that would change the same way as the treatment group.
+12. From @gertler2016impact, which of the following is NOT a characteristic that a valid comparison group must have (pick one)?
+ a. The same average characteristics as the treatment group.
+ b. Have outcomes that would change the same way as the treatment group.
+ c. Be affected directly or indirectly by the program.
d. React to the program in a similar way if given the program.
-13. Based on @gertler2016impact, before-and-after comparisons are considered counterfeit estimates because (pick one):
- a. They focus on unimportant metrics.
- b. They involve random assignment.
+13. From @gertler2016impact, why are before-and-after comparisons considered counterfeit estimates (pick one)?
+ a. They involve random assignment.
+ b. They focus on unimportant metrics.
c. They require large data samples.
d. They assume outcomes do not change over time.
-14. Based on @gertler2016impact, which scenario could ethically allow the use of randomized assignment as a program allocation tool (pick one)?
+14. From @gertler2016impact, which scenario could ethically allow the use of randomized assignment as a program allocation tool (pick one)?
a. All participants are enrolled based on income levels.
- b. A program has more eligible participants than available spaces.
- c. Every eligible participant can be accommodated by the program.
- d. The program only serves one specific group.
+ b. Every eligible participant can be accommodated by the program.
+ c. The program only serves one specific group.
+ d. A program has more eligible participants than available spaces.
15. The Tuskegee Syphilis Study is an example of a violation of which ethical principle (pick one)?
- a. Obtaining informed consent from participants.
+ a. Maintaining confidentiality of participant data.
b. Ensuring statistical power in experimental design.
- c. Maintaining confidentiality of participant data.
+ c. Obtaining informed consent from participants.
d. Providing monetary compensation to participants.
16. What does "equipoise" refer to in the context of clinical trials (pick one)?
- a. The balance between treatment efficacy and side effects.
- b. The ethical requirement of genuine uncertainty about the treatment's effectiveness.
- c. The statistical equilibrium achieved when sample sizes are equal.
- d. The state where all participants have equal access to the treatment.
+ a. The statistical equilibrium achieved when sample sizes are equal.
+ b. The state where all participants have equal access to the treatment.
+ c. The balance between treatment efficacy and side effects.
+ d. The ethical requirement of genuine uncertainty about the treatment's effectiveness.
17. @ware1989investigating [p. 299] mentions "randomized-consent" and continues that it was "attractive in this setting because a standard approach to informed consent would require that parents of infants near death be approached to give informed consent for an invasive surgical procedure that would then, in some instances, not be administered. Those familiar with the agonizing experience of having a child in a neonatal intensive care unit can appreciate that the process of obtaining informed consent would be both frightening and stressful to parents." To what extent do you agree with this position, especially given, as @ware1989investigating [p. 305], mentions "the need to withhold information about the study from parents of infants receiving Conventional Medical Therapy (CMT)"?
18. Which of the following is a key concern when designing survey questions (pick one)?
- a. Using technical jargon to appear more credible.
- b. Leading respondents toward a desired answer.
+ a. Asking multiple questions at once to save time.
+ b. Using technical jargon to appear more credible.
c. Ensuring questions are relevant and easily understood by respondents.
- d. Asking multiple questions at once to save time.
+ d. Leading respondents toward a desired answer.
19. In the context of experiments, what is a "confounder" (pick one)?
- a. A variable that is intentionally manipulated by the researcher.
- b. A variable that is not controlled for and may affect the outcome.
- c. A participant who does not follow the experimental protocol.
+ a. A participant who does not follow the experimental protocol.
+ b. A variable that is intentionally manipulated by the researcher.
+ c. A variable that is not controlled for and may affect the outcome.
d. An error in data collection leading to invalid results.
20. The Oregon Health Insurance Experiment primarily aimed to assess the impact of what (pick one)?
- a. Introducing a new private health insurance plan.
- b. Randomly providing Medicaid to low-income adults to study health outcomes.
- c. Comparing different medical treatments for chronic illnesses.
- d. Evaluating the cost-effectiveness of health interventions.
+ a. Randomly providing Medicaid to low-income adults to study health outcomes.
+ b. Introducing a new private health insurance plan.
+ c. Evaluating the cost-effectiveness of health interventions.
+ d. Comparing different medical treatments for chronic illnesses.
21. In survey design, what is the purpose of a pilot study (pick one)?
- a. To collect preliminary data for publication.
+ a. To ensure all respondents understand the study's hypotheses.
b. To test and refine the survey instrument before full deployment.
c. To increase the sample size for better statistical power.
- d. To ensure all respondents understand the study's hypotheses.
+ d. To collect preliminary data for publication.
22. Why might an A/A test be conducted in the context of A/B testing (pick one)?
- a. To compare two entirely different treatments.
+ a. To test the effectiveness of the control condition.
b. To ensure the randomization process is properly creating comparable groups.
- c. To save resources by not implementing a new treatment.
- d. To test the effectiveness of the control condition.
+ c. To compare two entirely different treatments.
+ d. To save resources by not implementing a new treatment.
23. What ethical concern is particularly relevant to A/B testing in industry settings (pick one)?
- a. Difficulty in measuring long-term effects.
- b. The high cost of conducting experiments.
+ a. The high cost of conducting experiments.
+ b. Difficulty in measuring long-term effects.
c. Lack of informed consent from users being experimented on.
d. Ensuring statistical significance in large datasets.
24. Pretend that you work as a junior analyst for a large consulting firm. Further, pretend that your consulting firm has taken a contract to put together a facial recognition model for a government border security department. Write at least three paragraphs, with examples and references, discussing your thoughts, with regard to ethics, on this matter.
25. What does the term "average treatment effect" (ATE) refer to (pick one)?
- a. The difference in outcomes between the treatment and control groups across the entire sample.
- b. The effect of treatment on a single individual.
- c. The average outcome observed in the control group.
+ a. The effect of treatment on a single individual.
+ b. The average outcome observed in the control group.
+ c. The difference in outcomes between the treatment and control groups across the entire sample.
d. The total sum of all treatment effects observed.
26. In the context of experiments, what does "blinding" refer to (pick one)?
- a. Keeping the sample size hidden from participants.
+ a. Using complex statistical methods to analyze data.
b. Ensuring participants do not know whether they are receiving the treatment or control.
- c. Randomly assigning treatments without recording the assignments.
- d. Using complex statistical methods to analyze data.
+ c. Keeping the sample size hidden from participants.
+ d. Randomly assigning treatments without recording the assignments.
27. What is a primary reason for doing simulation before analyzing real data in experiments (pick one)?
a. Simulation is more accurate than real data analyses.
- b. Simulation helps understand expected outcomes and potential errors in analysis.
- c. Simulation requires less computational power.
- d. Simulation eliminates the need for actual data collection.
+ b. Simulation requires less computational power.
+ c. Simulation eliminates the need for actual data collection.
+ d. Simulation helps understand expected outcomes and potential errors in analysis.
28. Which statement best captures the concept of "selection bias" (pick one)?
- a. Participants drop out of a study at random.
- b. The sample accurately represents the target population.
- c. The method of selecting participants causes the sample to be unrepresentative.
- d. All variables are controlled except for the treatment variable.
+ a. The sample accurately represents the target population.
+ b. All variables are controlled except for the treatment variable.
+ c. Participants drop out of a study at random.
+ d. The method of selecting participants causes the sample to be unrepresentative.
29. Please redo the Upworthy analysis, but for "!" instead of "?". What is the difference in clicks (pick one)?
a. -8.3
b. -7.2
- c. -4.5
- d. -5.6
+ c. -5.6
+ d. -4.5
30. As described by @pewletterman, which sampling methodology was used to increase the likelihood of including respondents from smaller religious groups without introducing bias (pick one)?
- a. Quota sampling.
- b. Snowball sampling.
- c. Composite measure of size.
- d. Random digit dialing.
+ a. Snowball sampling.
+ b. Quota sampling.
+ c. Random digit dialing.
+ d. Composite measure of size.
31. As described by @pewletterman, how did the researchers ensure that their survey was ethically conducted (pick one)?
- a. They provided financial incentives for participation.
- b. They obtained approval from an institutional research review board (IRB) within India.
- c. They only surveyed individuals who volunteered.
- d. They anonymized data by not collecting any demographic information.
-32. Based on @Stantcheva2023, what is coverage error in survey sampling (pick one)?
- a. The difference between the target population and the sample frame.
- b. The difference between the planned sample and the actual respondents.
- c. Errors due to respondent's inattentiveness.
- d. Bias from oversampling minorities.
-33. Based on @Stantcheva2023, what is the moderacy response bias (pick one)?
- a. The tendency to choose extreme values on a scale.
+ a. They obtained approval from an institutional research review board (IRB) within India.
+ b. They only surveyed individuals who volunteered.
+ c. They anonymized data by not collecting any demographic information.
+ d. They provided financial incentives for participation.
+32. From @Stantcheva2023, what is coverage error in survey sampling (pick one)?
+ a. Errors due to respondent's inattentiveness.
+ b. The difference between the target population and the sample frame.
+ c. Bias from oversampling minorities.
+ d. The difference between the planned sample and the actual respondents.
+33. From @Stantcheva2023, what is the moderacy response bias (pick one)?
+ a. The tendency to choose middle options regardless of question content.
b. Bias introduced by question order.
- c. The tendency to choose middle options regardless of question content.
+ c. The tendency to choose extreme values on a scale.
d. The tendency to agree with the surveyor's expected answer.
-34. Based on @Stantcheva2023, which of the following is a way to minimize social desirability bias in online surveys (pick one)?
- a. Making the respondent's identity public.
- b. Offering high monetary rewards.
- c. Providing reassurances about the confidentiality of responses.
+34. From @Stantcheva2023, which of the following is a way to minimize social desirability bias in online surveys (pick one)?
+ a. Offering high monetary rewards.
+ b. Providing reassurances about the confidentiality of responses.
+ c. Making the respondent's identity public.
d. Keeping survey questions long and complex.
-35. Based on @Stantcheva2023, what does response order bias refer to (pick one)?
- a. Respondents systematically choosing extreme values.
- b. Respondents skipping sensitive questions.
+35. From @Stantcheva2023, what does response order bias refer to (pick one)?
+ a. Respondents skipping sensitive questions.
+ b. Respondents systematically choosing extreme values.
c. Respondents failing to understand the question.
- d. Respondents choosing answers based on the order they are presented.
-36. Based on @Stantcheva2023, while managing a survey, you should do everything APART from (pick one)?
- a. Soft-launch the survey.
+ d. Respondents choosing answers based on their order.
+36. From @Stantcheva2023, while managing a survey, which of the following should you NOT do (pick one)?
+ a. Check the data.
b. Monitor the survey.
c. Test statistical hypotheses.
- d. Check the data.
-37. A common approach to minimizing question order effect is to randomize the order of questions. To what extent do you think this is effective?
-8. Based on @Stantcheva2023, what is a good practice for recruiting respondents in online surveys (pick one)?
- a. Revealing the survey's topic in the invitation email.
- b. Emphasizing the length of the survey to increase engagement.
- c. Providing minimal information about the survey's purpose initially.
- d. Offering the highest possible monetary incentives.
-38. Based on @Stantcheva2023, what does attrition in surveys refer to (pick one)?
- a. The rate at which respondents drop out before completing the survey.
- b. The total number of people who received the invitation.
- c. The accuracy of the data collected.
- d. The differences between respondents and nonrespondents.
+ d. Soft-launch the survey.
+37. A common approach to minimizing question order effect is to randomize the order of questions. To what extent do you think this is effective?
+38. From @Stantcheva2023, what is a good practice for recruiting respondents in online surveys (pick one)?
+ a. Offering the highest possible monetary incentives.
+ b. Providing minimal information about the survey's purpose initially.
+ c. Revealing the survey's topic in the invitation email.
+ d. Emphasizing the length of the survey to increase engagement.
+39. From @Stantcheva2023, what does attrition in surveys refer to (pick one)?
+ a. The total number of people who received the invitation.
+ b. The accuracy of the data collected.
+ c. The differences between respondents and nonrespondents.
+ d. The rate at which respondents drop out before completing the survey.
### Activity {.unnumbered}
diff --git a/09-clean_and_prepare.rmarkdown b/09-clean_and_prepare.rmarkdown
deleted file mode 100644
index 7e79b3b1..00000000
--- a/09-clean_and_prepare.rmarkdown
+++ /dev/null
@@ -1,1913 +0,0 @@
----
-engine: knitr
----
-
-
-# Clean and prepare {#sec-clean-and-prepare}
-
-**Prerequisites**
-
-- Read *Data Feminism*, [@datafeminism2020]
- - Focus on Chapter 5 "Unicorns, Janitors, Ninjas, Wizards, and Rock Stars", which discusses the importance of considering different sources of data about the same process.
-- Read *R for Data Science*, [@r4ds]
- - Focus on Chapter 6 "Data tidying", which provides an overview of tidy data and some strategies to obtain it.
-- Read *An introduction to data cleaning with R*, [@de2013introduction]
- - Focus on Chapter 2 "From raw data to technically correct data", which provides detailed information about reading data into R and various classes.
-- Read *What The Washington Post Elections Engineering team had to learn about election data* [@washingtonpostelections]
- - Details several practical issues about real-world datasets.
-- Read *Column Names as Contracts*, [@columnnamesascontracts]
- - Introduces the benefits of having a limited vocabulary for naming variables.
-- Read *Combining Statistical, Physical, and Historical Evidence to Improve Historical Sea-Surface Temperature Records*, [@Chan2021Combining]
- - Details the difficulty of creating a dataset of temperatures from observations taken by different ships at different times.
-
-**Key concepts and skills**
-
-- Cleaning and preparing a dataset is difficult work that involves making many decisions. Planning an endpoint and simulating the dataset that we would like to end up with are key elements of cleaning and preparing data.
-- It can help to work in an iterative way, beginning with a small sample of the dataset. Write code to fix some aspect, and then iterate and generalize to additional tranches.
-- During that process we should also develop a series of tests and checks that the dataset should pass. This should focus on key features that we would expect of the dataset.
-- We should be especially concerned about the class of variables, having clear names, and that the unique values of each variable are as expected given all this.
-
-**Software and packages**
-
-- Base R [@citeR]
-- `janitor` [@janitor]
-- `knitr` [@citeknitr]
-- `lubridate` [@GrolemundWickham2011]
-- `modelsummary` [@citemodelsummary]
-- `opendatatoronto` [@citeSharla]
-- `pdftools` [@pdftools]
-- `pointblank` [@pointblank]
-- `readxl` [@readxl]
-- `scales` [@scales]
-- `stringi` [@stringi]
-- `testthat` [@testthat]
-- `tidyverse` [@tidyverse]
-- `validate` [@validate]
-
-
-```{r}
-#| message: false
-#| warning: false
-
-library(janitor)
-library(knitr)
-library(lubridate)
-library(modelsummary)
-library(opendatatoronto)
-library(pdftools)
-library(pointblank)
-library(readxl)
-library(scales)
-library(stringi)
-library(testthat)
-library(tidyverse)
-library(validate)
-```
-
-
-
-
-## Introduction
-
-> "Well, Lyndon, you may be right and they may be every bit as intelligent as you say," said Rayburn, "but I'd feel a whole lot better about them if just one of them had run for sheriff once."
->
-> Sam Rayburn reacting to Lyndon Johnson's enthusiasm about John Kennedy's incoming cabinet, as quoted in *The Best and the Brightest* [@halberstam, p. 41].
-
-In this chapter we put in place more formal approaches for data cleaning and preparation\index{data!cleaning}. These are centered around:
-
-1. validity\index{validity};
-2. internal consistency\index{consistency!internal}; and
-3. external consistency\index{consistency!external}.
-
-Your model does not care whether you validated your data, but you should. Validity\index{validity} means that the values in the dataset are not obviously wrong. For instance, with few exceptions, currencies should not have letters in them, names should not have numbers, and velocities should not be faster than the speed of light. Internal consistency\index{consistency!internal} means the dataset does not contradict itself. For instance, that might mean that constituent columns add to the total column. External consistency\index{consistency!external} means that the dataset does not, in general, contradict outside sources, and is deliberate when it does. For instance, if our dataset purports to be about the population of cities, then we would expect that they are the same as, to a rough approximation, say, those available from relevant censuses on Wikipedia.
-
-SpaceX, the United States rocket company, uses cycles of ten or 50 Hertz (equivalent to 0.1 and 0.02 seconds, respectively) to control their rockets. Each cycle, the inputs from sensors, such as temperature and pressure, are read, processed, and used to make a decision, such as whether to adjust some setting [@martinpopper]. We recommend a similar iterative approach of small adjustments during data cleaning and preparation\index{data!cleaning}. Rather than trying to make everything perfect from the start, just get started, and iterate through a process of small, continuous improvements.
-
-To a large extent, the role of data cleaning and preparation\index{data!cleaning} is so great that the only people that understand a dataset are those that have cleaned it. Yet, the paradox of data cleaning is that often those that do the cleaning and preparation are those that have the least trust in the resulting dataset. At some point in every data science workflow, those doing the modeling should do some data cleaning. Even though few want to do it [@Sambasivan2021], it can be as influential as modeling. To clean and prepare data is to make many decisions, some of which may have important effects on our results. For instance, @labelsiswrongs find the test sets of some popular datasets in computer science contain, on average, labels that are wrong in around three per cent of cases.\index{computer science} @Banes2022 re-visit the Sumatran orang-utan *(Pongo abelii)* reference genome and find that nine of the ten samples had some issue. And @eveninaccountingwhat find a substantial difference between the as-filed and standardized versions of a company's accounting data, especially for complex financial situations. Like Sam Rayburn wishing that Kennedy's cabinet, despite their intelligence, had experience in the nitty-gritty, a data scientist needs to immerse themselves in the messy reality of their dataset.
-
-The reproducibility crisis\index{reproducibility!crisis}, which was identified early in psychology [@anniesfind] but has since extended to many other disciplines in the physical and social sciences, brought to light issues such as p-value "hacking"\index{p-hacking}, researcher degrees of freedom, file-drawer issues, and even data and results fabrication [@gelman2013garden]. Steps are now being put in place to address these. But there has been relatively little focus on the data gathering, cleaning, and preparation aspects of applied statistics, despite evidence that decisions made during these steps greatly affect statistical results [@huntington2021influence]. In this chapter we focus on these issues.
-
-While the statistical practices that underpin data science are themselves correct and robust when applied to simulated datasets, data science is not typically conducted with data that follow the assumptions underlying the models that are commonly fit. For instance, data scientists are interested in "messy, unfiltered, and possibly unclean data---tainted by heteroskedasticity, complex dependence and missingness patterns---that until recently were avoided in polite conversations between more traditional statisticians" [@craiu2019hiring]. Big data\index{data!big data} does not resolve this issue and may even exacerbate it. For instance, population inference based on larger amounts of poor-quality data, without adjusting for data issues, will just lead to more confidently wrong conclusions [@meng2018statistical]. The problems that are found in much of applied statistics research are not necessarily associated with researcher quality, or their biases [@silberzahn2018many]. Instead, they are a result of the context within which data science is conducted. This chapter provides an approach and tools to explicitly think about this work.
-
-@gelman2020most, writing about the most important statistical ideas of the past 50 years, say that each of them enabled new ways of thinking about data analysis. These ideas brought into the tent of statistics, approaches that "had been considered more a matter of taste or philosophy".\index{statistics} The focus on data cleaning and preparation\index{data!cleaning} in this chapter is analogous, insofar as it represents a codification, or bringing inside the tent, of aspects that are typically, incorrectly, considered those of taste rather than core statistical concerns.
-
-The workflow\index{workflow} for data cleaning and preparation\index{data!cleaning} that we advocate is:
-
-1. Save the original, unedited data.
-2. Begin with an end in mind by sketching and simulating.
-3. Write tests and documentation.
-4. Execute the plan on a small sample.
-5. Iterate the plan.
-6. Generalize the execution.
-7. Update tests and documentation.
-
-We will need a variety of skills to be effective, but this is the very stuff of data science. The approach needed is some combination of dogged and sensible. Perfect is very much the enemy of good enough when it comes to data cleaning. And to be specific, it is better to have 90 per cent of the data cleaned and prepared, and to start exploring that, before deciding whether it is worth the effort to clean and prepare the remaining 10 per cent. Because that remainder will likely take an awful lot of time and effort.
-
-All data, regardless of whether they were obtained from farming, gathering, or hunting, will have issues. We need approaches that can deal with a variety of concerns, and more importantly, understand how they might affect our modeling [@van2005data]. To clean data is to analyze data. This is because the process forces us to make choices about what we value in our results [@thatrandyauperson].
-
-## Workflow
-
-### Save the original, unedited data
-
-The first step is to save the original, unedited data\index{data!unedited} into a separate, local folder. The original, unedited data establishes the foundation for reproducibility [@wilsongoodenough]. If we obtained our data from a third party, such as a government website, then we have no control over whether they will continue to host that data, update it, or change the address at which it is available. Saving a local copy also reduces the burden that we impose on their servers.
-
-Having locally saved the original, unedited data, we must maintain a copy of it in that state, and not modify it. As we begin to clean and prepare it, we instead make these changes to a copy of the dataset. Maintaining the original, unedited dataset, and using scripts to create the dataset that we are interested in analyzing, ensures that our entire workflow is reproducible. It may be that the changes that we decide to make today are not ones that we would make tomorrow, having learnt more about the dataset. We need to ensure that we have that data in the original, unedited state in case we need to return to it [@Borer2009].
-
-We may not always be allowed to share that original, unedited data, but we can almost always create something similar. For instance, if we are using a restricted-use computer, then it may be that the best we can do is create a simulated version of the original, unedited data that conveys the main features, and include detailed access instructions in a README file.
-
-
-### Plan
-
-Planning the endpoint forces us to begin with an end in mind and is important for a variety of reasons. As with scraping data, introduced in @sec-gather-data, it helps us to be proactive about scope-creep. But with data cleaning it additionally forces us to really think about what we want the final dataset to look like.
-
-The first step is to sketch the dataset that we are interested in. The key features of the sketch will be aspects such as the names of the columns, their class, and the possible range of values. For instance, we might be interested in the populations of US states. Our sketch might look like @fig-sketchdataplan.
-
-![Planned dataset of US states and their populations](figures/state_population_sketch.png){#fig-sketchdataplan width=40% fig-align="center"}
-
-In this case, the sketch forces us to decide that we want full names rather than abbreviations for the state names, and the population to be measured in millions. The process of sketching this endpoint has forced us to make decisions early on and be clear about our desired endpoint.
-
-We then implement that using code to simulate data.\index{simulation!US state population} Again, this process forces us to think about what reasonable values look like in our dataset because we must decide which functions to use. We need to think carefully about the unique values of each variable. For instance, if the variable is meant to be "gender" then unique values such as "male", "female", "other", and "unknown" may be expected, but a number such as "1,000" would likely be wrong. It also forces us to be explicit about names because we must assign the output of those functions to a variable. For instance, we could simulate some population data for the US states.
-
-
-```{r}
-#| message: false
-#| warning: false
-
-set.seed(853)
-
-simulated_population <-
- tibble(
- state = state.name,
- population = runif(n = 50, min = 0, max = 50) |>
- round(digits = 2)
- )
-
-simulated_population
-```
-
-
-
-::: {.content-visible when-format="pdf"}
-Our purpose, during data cleaning and preparation\index{data!cleaning}, is to then bring our original, unedited data close to that plan. Ideally, we would plan so that the desired endpoint of our dataset is "tidy data"\index{tidy data}. This is introduced in the ["R essentials" Online Appendix](https://tellingstorieswithdata.com/20-r_essentials.html), but briefly, it means that [@r4ds; @wickham2014tidy, p. 4]:
-
-1. each variable is in its own column;
-2. each observation is in its own row; and
-3. each value is in its own cell.
-
-Begin thinking about validity and internal consistency\index{consistency!internal} at this stage. What are some of the features that these data should have? Note these as you go through the process of simulating the dataset because we will draw on them to write tests.
-:::
-
-::: {.content-visible unless-format="pdf"}
-Our purpose, during data cleaning and preparation\index{data!cleaning}, is to then bring our original, unedited data close to that plan. Ideally, we would plan so that the desired endpoint of our dataset is "tidy data"\index{tidy data}. This is introduced in [Online Appendix -@sec-r-essentials], but briefly, it means that [@r4ds; @wickham2014tidy, p. 4]:
-
-1. each variable is in its own column;
-2. each observation is in its own row; and
-3. each value is in its own cell.
-
-Begin thinking about validity and internal consistency\index{consistency!internal} at this stage. What are some of the features that these data should have? Note these as you go through the process of simulating the dataset because we will draw on them to write tests.
-:::
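-
-One way to note these features is to write them down as checks against the simulated data. The following is a minimal sketch that repeats the simulation above so that it can be run on its own. The particular conditions, such as 50 unique states and populations between 0 and 50 million, are our assumptions about what the final dataset should satisfy, not a definitive list.
-
-```{r}
-library(tidyverse)
-
-set.seed(853)
-
-simulated_population <-
-  tibble(
-    state = state.name,
-    population = runif(n = 50, min = 0, max = 50) |>
-      round(digits = 2)
-  )
-
-# Features that we expect the cleaned dataset to share with the simulation
-stopifnot(
-  # Classes of the two columns
-  is.character(simulated_population$state),
-  is.numeric(simulated_population$population),
-  # One row per state, and no state repeated
-  nrow(simulated_population) == 50,
-  !any(duplicated(simulated_population$state)),
-  # Population, in millions, within the range that we simulated
-  all(simulated_population$population >= 0),
-  all(simulated_population$population <= 50)
-)
-```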
-
-### Start small
-
-Having thoroughly planned we can turn to the original, unedited data that we are dealing with. Usually we want to manipulate the original, unedited data into a rectangular dataset as quickly as possible. This allows us to use familiar functions from the `tidyverse`. For instance, let us assume that we are starting with a `.txt` file.
-
-The first step is to look for regularities in the dataset. We want to end up with tabular data, which means that we need some type of delimiter to distinguish different columns. Ideally this might be features such as a comma, a semicolon, a tab, a double space, or a line break. In the following case we could take advantage of the comma.
-
-```
-Alabama, 5
-Alaska, 0.7
-Arizona, 7
-Arkansas, 3
-California, 40
-```
-
-In more challenging cases there may be some regular feature of the dataset that we can take advantage of. Sometimes various text is repeated, as in the following case.\index{text!cleaning}\index{data cleaning!text}
-
-```
-State is Alabama and population is 5 million.
-State is Alaska and population is 0.7 million.
-State is Arizona and population is 7 million.
-State is Arkansas and population is 3 million.
-State is California and population is 40 million.
-```
-
-In this case, although we do not have a traditional delimiter, we can use the regularity of "State is", "and population is ", and "million" to get what we need. A more difficult case is when we do not have line breaks. This final case is illustrative of that.
-
-```
-Alabama 5 Alaska 0.7 Arizona 7 Arkansas 3 California 40
-```
-
-One way to approach this is to take advantage of the different classes and values that we are looking for. For instance, we know that we are after US states, so there are only 50 possible options (setting D.C. to one side for the time being), and we could use these as a delimiter. We could also use the fact that population is a number, and so separate based on a space followed by a number.
-
-We will now convert this final case into tidy data.
-
-
-```{r}
-unedited_data <-
- c("Alabama 5 Alaska 0.7 Arizona 7 Arkansas 3 California 40")
-
-tidy_data <-
- tibble(raw = unedited_data) |>
- separate(
- col = raw,
- into = letters[1:5],
- sep = "(?<=[[:digit:]]) " # A space preceded by a digit
- ) |>
- pivot_longer(
- cols = letters[1:5],
- names_to = "drop_me",
- values_to = "separate_me"
- ) |>
- separate(
- col = separate_me,
- into = c("state", "population"),
- sep = " (?=[[:digit:]])" # A space followed by a number
- ) |>
- select(-drop_me)
-
-tidy_data
-```
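-
-The earlier case, in which the same text is repeated on every line, can be handled in a similar way. The following is a sketch of one possible approach, not the only one: it removes the repeated text at the start and end of each line, and then uses the repeated text in the middle as the delimiter. The object names `repeated_text` and `tidy_from_text` are ours. We leave `population` as a character at this stage.
-
-```{r}
-library(tidyverse)
-
-repeated_text <-
-  c(
-    "State is Alabama and population is 5 million.",
-    "State is Alaska and population is 0.7 million.",
-    "State is Arizona and population is 7 million.",
-    "State is Arkansas and population is 3 million.",
-    "State is California and population is 40 million."
-  )
-
-tidy_from_text <-
-  tibble(raw = repeated_text) |>
-  mutate(
-    # Remove the text that is repeated at the start and end of each line
-    raw = str_remove(raw, "^State is "),
-    raw = str_remove(raw, " million\\.$")
-  ) |>
-  separate(
-    col = raw,
-    into = c("state", "population"),
-    sep = " and population is " # The repeated text acts as the delimiter
-  )
-
-tidy_from_text
-```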
-
-
-
-### Write tests and documentation
-
-::: {.content-visible when-format="pdf"}
-Having established a rectangular dataset, albeit a messy one, we should begin to look at the classes that we have. We do not necessarily want to fix the classes at this point, because that can result in lost data. But we look at the class to see what it is, compare it to our simulated dataset, and note the columns where it is different to see what changes need to be made. Background on `class()` is available in the ["R essentials" Online Appendix](https://tellingstorieswithdata.com/20-r_essentials.html).
-:::
-
-::: {.content-visible unless-format="pdf"}
-Having established a rectangular dataset, albeit a messy one, we should begin to look at the classes that we have. We do not necessarily want to fix the classes at this point, because that can result in lost data. But we look at the class to see what it is, compare it to our simulated dataset, and note the columns where it is different to see what changes need to be made. Background on `class()` is available in [Online Appendix -@sec-r-essentials].
-:::
-
-Before changing the class and before going on to more bespoke issues, we should deal with some common issues including:
-
-- Commas and other punctuation, such as denomination signs ($, €, £, etc.), in variables that should be numeric.
-- Inconsistent formatting of dates, such as "December" and "Dec" and "12" all in the one variable.
-- Unexpected character encoding, especially in Unicode, which may not display consistently.^[By way of background, character encoding is needed for computers, which are based on strings of 0s and 1s, to be able to consider symbols such as alphabets. One source of particularly annoying data cleaning issues is different character encoding. This is especially common when dealing with foreign languages and odd characters. In general, we use an encoding called UTF-8. The encoding of a character vector can be found using `Encoding()`.]
-
-Typically, we want to fix anything immediately obvious. For instance, we should remove commas that have been used to group digits in currencies. However, the situation will often feel overwhelming. What we need to do is to look at the unique values in each variable, and then triage what we will fix. We make the triage decision based on what is likely to have the largest impact. That usually means creating counts of the observations, sorting them in descending order, and then dealing with them in this order.
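-
-As a minimal sketch of fixing something immediately obvious, consider a variable that should be numeric but contains dollar signs and digit-grouping commas. The values and the names `messy_amounts`, `amount_cleaned`, and `needs_attention` are made up for illustration. We remove the punctuation but keep the variable as a character, and flag anything that would still not convert cleanly to a number.
-
-```{r}
-library(tidyverse)
-
-# Made-up values for illustration
-messy_amounts <-
-  tibble(amount = c("$1,000", "2,500", "$30", "1,000,000", "n/a"))
-
-cleaned_amounts <-
-  messy_amounts |>
-  mutate(
-    # Remove dollar signs and the commas used to group digits
-    amount_cleaned = str_remove_all(amount, "[,$]"),
-    # Flag values that still contain something other than a digit or "."
-    needs_attention = str_detect(amount_cleaned, "[^[:digit:].]")
-  )
-
-# Count what remains to decide what to deal with next
-cleaned_amounts |>
-  count(needs_attention, sort = TRUE)
-```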
-
-When the tests of membership are passed---which we initially establish based on simulation and experience---then we can change the class, and run all the tests again. We have adapted this idea from the software development approach of unit testing. Tests are crucial because they enable us to understand whether software (or in this case data) is fit for our purpose [@researchsoftware]. Tests, especially in data science, are not static things that we just write once and then forget. Instead they should update and evolve as needed.
-
-:::{.callout-note}
-## Oh, you think we have good data on that!
-
-The simplification of reality can be especially seen in sports records, which necessarily must choose what to record. Sports records are fit for some purposes and not for others. For instance, chess is played on an 8 x 8 board of alternating black and white squares. The squares are denoted by a unique combination of both a letter (A-H) and a number (1-8). Most pieces have a unique abbreviation, for instance knights are N and bishops are B. Each game is independently recorded using this "algebraic notation" by each player. These records allow us to recreate the moves of the game. The 2021 Chess World Championship was contested by Magnus Carlsen and Ian Nepomniachtchi. One game in that match was particularly noteworthy for a variety of reasons, including it being the longest world championship game, but one is the uncharacteristic mistakes that both Carlsen and Nepomniachtchi made. For instance, at Move 33 Carlsen did not exploit an opportunity; and at Move 36 a different move would have provided Nepomniachtchi with a promising endgame [@PeterDoggers]. One reason for these mistakes may have been that both players at that point in the game had very little time remaining---they had to decide on their moves very quickly. But there is no sense of that in the representation provided by the game sheet because it does not record time remaining. The record is fit for purpose as a "correct" representation of what happened in the game; but not necessarily why it happened.
-:::
-
-Let us run through an example with a collection of strings, some of which are slightly wrong. This type of output is typical of OCR, introduced in @sec-gather-data, which often gets most of the way there, but not quite.
-
-
-```{r}
-messy_string <- paste(
- c("Patricia, Ptricia, PatricIa, Patric1a, PatricIa"),
- c("PatrIcia, Patricia, Patricia, Patricia , 8atricia"),
- sep = ", "
-)
-```
-
-
-As before, we first get this into a rectangular dataset.
-
-
-```{r}
-messy_dataset <-
- tibble(names = messy_string) |>
- separate_rows(names, sep = ", ")
-
-messy_dataset
-```
-
-
-We now need to decide which of these errors we are going to fix. To help us decide which are most important, we create a count.
-
-
-```{r}
-messy_dataset |>
- count(names, sort = TRUE)
-```
-
-
-The most common unique observation is the correct one. The next one---"PatricIa"---looks like the "i" has been incorrectly capitalized. This is true for "PatrIcia" as well. We can fix the capitalization issues with `str_to_title()`, which converts the first letter of each word in a string to uppercase and the rest to lowercase, and then redo the count.
-
-::: {.content-visible when-format="pdf"}
-Background on strings is available in the ["R essentials" Online Appendix](https://tellingstorieswithdata.com/20-r_essentials.html).
-:::
-
-::: {.content-visible unless-format="pdf"}
-Background on strings is available in [Online Appendix -@sec-r-essentials].
-:::
-
-
-```{r}
-messy_dataset_fix_I_8 <-
- messy_dataset |>
- mutate(
- names = str_to_title(names)
- )
-
-messy_dataset_fix_I_8 |>
- count(names, sort = TRUE)
-```
-
-
-Already this is much better, with 60 per cent of the values correct, compared with the earlier 30 per cent. There are two more clear errors---"8atricia" and "Ptricia"---with the first distinguished by an "8" instead of a "P", and the second missing an "a". We can fix these issues with `str_replace_all()`.
-
-
-```{r}
-messy_dataset_fix_a_n <-
- messy_dataset_fix_I_8 |>
- mutate(
- names = str_replace_all(names, "8atricia", "Patricia"),
- names = str_replace_all(names, "Ptricia", "Patricia")
- )
-
-messy_dataset_fix_a_n |>
- count(names, sort = TRUE)
-```
-
-
-We have achieved an 80 per cent outcome with not too much effort. The final two issues are more subtle. The first has occurred because the "i" has been incorrectly coded as a "1". In some fonts this will show up, but in others it will be more difficult to see. This is a common issue, especially with OCR, and something to be aware of. The second occurs because of a trailing space. Trailing and leading spaces are another common issue and we can address them with `str_trim()`. After we fix these two remaining issues we have all entries corrected.
-
-
-```{r}
-cleaned_data <-
- messy_dataset_fix_a_n |>
- mutate(
- names = str_replace_all(names, "Patric1a", "Patricia"),
- names = str_trim(names, side = c("right"))
- )
-
-cleaned_data |>
- count(names, sort = TRUE)
-```
-
-
-We have been doing the tests in our head in this example. We know that we are hoping for "Patricia". But we can start to document this test as well. One way is to look to see if values other than "Patricia" exist in the dataset.
-
-
-```{r}
-check_me <-
- cleaned_data |>
- filter(names != "Patricia")
-
-if (nrow(check_me) > 0) {
- print("Still have values that are not Patricia!")
-}
-```
-
-
-We can make things a little more imposing by stopping our code execution if the condition is not met with `stopifnot()`. To use this function we define a condition that we would like met. We could implement this type of check throughout our code. For instance, we could check that there is a certain number of observations in the dataset, or that a certain variable has various properties, such as being an integer or a factor.
-
-
-```{r}
-stopifnot(nrow(check_me) == 0)
-```
-
-
-We can use `stopifnot()` to ensure that our script is working as expected as it runs.
-
-Another way to write tests for our dataset is to use `testthat`. Although developed for testing packages, we can use the functionality to test our datasets. For instance, we can use `expect_length()` to check the length of a dataset and `expect_equal()` to check the content.
-
-
-```{r}
-#| message: false
-#| warning: false
-
-# Is the dataset of length one? (The length of a tibble is its number of columns)
-expect_length(check_me, 1)
-# Are the observations characters?
-expect_equal(class(cleaned_data$names), "character")
-# Is every unique observation "Patricia"?
-expect_equal(unique(cleaned_data$names), "Patricia")
-```
-
-
-If the tests pass then nothing happens, but if the tests fail then the script will stop.
-
-What do we test? It is a difficult problem, and we detail a range of more-specific tests in the next section. But broadly we test what we have, against what we expect. The engineers working on the software for the Apollo\index{Apollo} program in the 1960s initially considered writing tests to be "busy work" [@digitalapollo, p. 170]. But they eventually came to realize that NASA would not have faith that software could be used to send men to the moon unless it was accompanied by a comprehensive suite of tests. And it is the same for data science.\index{data science!need for tests}
-
-Start with tests for validity\index{validity}\index{testing!validity}. These will typically check the class of the variables, their unique values, and the number of observations. For instance, if we were using a recent dataset then columns that are years could be tested to ensure that all elements have four digits and start with a "2". @peterbaumgartnertesting describes this as tests on the schema.
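-
-A sketch of what such validity tests could look like follows. The dataset `recent_data` and its values are made up for illustration, and the conditions are assumptions about what a recent dataset of this kind should satisfy.
-
-```{r}
-library(tidyverse)
-
-# Made-up dataset for illustration
-recent_data <-
-  tibble(
-    year = c(2019L, 2020L, 2021L, 2022L),
-    births = c(105L, 98L, 101L, 110L)
-  )
-
-# Class of the variables
-stopifnot(
-  is.integer(recent_data$year),
-  is.integer(recent_data$births)
-)
-
-# Number of observations, and no year repeated
-stopifnot(
-  nrow(recent_data) == 4,
-  !any(duplicated(recent_data$year))
-)
-
-# Every year has exactly four digits and starts with "2"
-stopifnot(
-  all(str_detect(as.character(recent_data$year), "^2[[:digit:]]{3}$"))
-)
-```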
-
-After that, turn to checks of internal consistency\index{consistency!internal}. For instance, if there are variables of different numeric responses, then check that the sum of those equals a total variable, or if it does not then this difference is explainable. Finally, turn to tests for external consistency\index{consistency!external}. Here we want to use outside information to inform our tests. For instance, if we had a variable of the neonatal mortality rate (NMR) for Germany (this concept was introduced in @sec-fire-hose), then we could look at the estimates from the World Health Organization (WHO), and ensure our NMR variable aligns. Experienced analysts do this all in their head. The issue is that it does not scale, can be inconsistent, and overloads on reputation. We return to this issue in @sec-its-just-a-linear-model in the context of modeling.
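-
-The following sketch shows one way these two types of checks could be written down. All values are made up: `city_populations` stands in for our dataset, in thousands, and `outside_source` stands in for figures gathered from an outside source. The tolerance of five per cent is an assumption that would need to be justified for an actual dataset.
-
-```{r}
-library(tidyverse)
-
-# Made-up dataset, in thousands, with two constituent columns and a total
-city_populations <-
-  tibble(
-    city = c("City A", "City B"),
-    children = c(500, 1100),
-    adults = c(2300, 4200),
-    total = c(2800, 5300)
-  )
-
-# Made-up figures, in thousands, standing in for an outside source
-outside_source <-
-  tibble(
-    city = c("City A", "City B"),
-    reference_total = c(2750, 5400)
-  )
-
-# Internal consistency: the constituent columns add to the total column
-stopifnot(
-  all(
-    city_populations$children + city_populations$adults ==
-      city_populations$total
-  )
-)
-
-# External consistency: totals are within five per cent of the outside source
-comparison <-
-  city_populations |>
-  left_join(outside_source, by = "city") |>
-  mutate(
-    relative_difference = abs(total - reference_total) / reference_total
-  )
-
-stopifnot(all(comparison$relative_difference < 0.05))
-```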
-
-We write tests throughout our code, rather than right at the end. In particular, using `stopifnot()` statements on intermediate steps ensures that the dataset is being cleaned in a way that we expect. For instance, when merging two datasets we could check:
-
-1) The variable names in the datasets are unique, apart from the column/s to be used as the key/s.
-2) The number of observations of each type is being carried through appropriately.
-3) The dimensions of the dataset are not being unexpectedly changed.
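-
-These three checks could be sketched as follows when merging with `left_join()`. The two datasets are made up for illustration, and `survey_responses` is a hypothetical variable.
-
-```{r}
-library(tidyverse)
-
-# Made-up datasets that share the key "state"
-state_populations <-
-  tibble(
-    state = c("Alabama", "Alaska", "Arizona"),
-    population = c(5, 0.7, 7)
-  )
-
-state_responses <-
-  tibble(
-    state = c("Alabama", "Alaska", "Arizona"),
-    survey_responses = c(120, 15, 160)
-  )
-
-# 1) The only variable name in common is the key
-stopifnot(
-  identical(
-    intersect(names(state_populations), names(state_responses)),
-    "state"
-  )
-)
-
-merged_states <-
-  left_join(state_populations, state_responses, by = "state")
-
-# 2) Each state appears as many times after the merge as before it
-stopifnot(
-  identical(sort(merged_states$state), sort(state_populations$state))
-)
-
-# 3) The number of rows is unchanged and only one column was gained
-stopifnot(
-  nrow(merged_states) == nrow(state_populations),
-  ncol(merged_states) == ncol(state_populations) + 1,
-  !any(is.na(merged_states$survey_responses))
-)
-```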
-
-
-### Iterate, generalize, and update
-
-We could now iterate the plan. In this most recent case, we started with ten entries. There is no reason that we could not increase this to 100 or even 1,000. We may need to generalize the cleaning procedures and tests. But eventually we would start to bring the dataset into some sort of order.
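-
-One way to generalize is to gather the individual fixes into a function and then run the same test on a larger sample. The following is only a sketch: `fix_patricia_names()` is a hypothetical helper, the larger sample is made up by repeating the errors we have already seen, and replacing every "1" with "i" is an assumption that would not be appropriate for every dataset.
-
-```{r}
-library(tidyverse)
-
-# Hypothetical helper that gathers the fixes from the earlier example
-fix_patricia_names <- function(names) {
-  names |>
-    str_trim() |> # Remove leading and trailing spaces
-    str_replace_all("1", "i") |> # A "1" that should be an "i"
-    str_replace("^8", "P") |> # An "8" that should be a "P"
-    str_to_title() |> # Fix capitalization
-    str_replace("^Ptricia$", "Patricia") # A missing "a"
-}
-
-# Made-up larger sample of 600 entries
-larger_sample <-
-  rep(
-    c("Patricia", "PatricIa", "Patric1a", "Patricia ", "8atricia", "Ptricia"),
-    times = 100
-  )
-
-larger_sample_cleaned <- fix_patricia_names(larger_sample)
-
-# The same test as before, now applied to the larger sample
-stopifnot(all(larger_sample_cleaned == "Patricia"))
-```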
-
-## Checking and testing
-
-Robert Caro, the biographer of Lyndon Johnson introduced in @sec-on-writing, spent years tracking down everyone connected to the 36th President of the United States. Caro and his wife Ina went so far as to live in Texas Hill Country for three years so that they could better understand where Johnson was from. When Caro heard that Johnson, as a senator, would run to the Senate from where he stayed in D.C., he ran that route multiple times himself to try to understand why Johnson was running. Caro eventually understood it only when he ran the route as the sun was rising, just as Johnson had done; it turns out that the sun hits the Senate Rotunda in a particularly inspiring way [@caroonworking, p. 156]. This background work enabled him to uncover aspects that no one else knew. For instance, Johnson almost surely stole his first election win [@caroonworking, p. 116]. We need to understand our data to this same extent. We want to metaphorically turn every page.
-
-The idea of negative space is well established in design. It refers to that which surrounds the subject. Sometimes negative space is used as an effect. For instance the logo of FedEx, an American logistics company, has negative space between the E and X that creates an arrow. In a similar way, we want to be cognizant of the data that we have, and the data that we do not have [@citemyboy]. We are worried that the data that we do not have somehow has meaning, potentially even to the extent of changing our conclusions. When we are cleaning data, we are looking for anomalies. We are interested in values that are in the dataset that should not be, but also the opposite situation---values that should be in the dataset but are not. There are three tools that we use to identify these situations: graphs, counts, and tests.
-
-We also use these tools to ensure that we are not changing correct observations to incorrect. Especially when our cleaning and preparation requires many steps, it may be that fixes at one stage are undone later. We use graphs, counts, and especially tests, to prevent this. The importance of these grows exponentially with the size of the dataset. Small and medium datasets are more amenable to manual inspection and other aspects that rely on the analyst, while larger datasets especially require more efficient strategies [@hand2018statistical].
-
-### Graphs
-
-Graphs are an invaluable tool when cleaning data, because they show each observation in the dataset, potentially in relation to the other observations.\index{data!graphs} They are useful for identifying when a value does not belong. For instance, if a value is expected to be numerical, but is still a character then it will not plot, and a warning will be displayed. Graphs will be especially useful for numerical data, but are still useful for text and categorical data. Let us pretend that we have a situation where we are interested in a person's age, for some youth survey. We have the following data:
-
-
-```{r}
-youth_survey_data <-
- tibble(ages = c(
- 15.9, 14.9, 16.6, 15.8, 16.7, 17.9, 12.6, 11.5, 16.2, 19.5, 150
- ))
-```
-
-```{r}
-#| eval: true
-#| echo: false
-
-youth_survey_data_fixed <-
- youth_survey_data |>
- mutate(ages = if_else(ages == 150, 15.0, ages))
-```
-
-```{r}
-#| label: fig-youth-survey
-#| fig-cap: "The ages in the simulated youth survey dataset identify a data issue"
-#| fig-subcap: ["Before cleaning", "After cleaning"]
-#| layout-ncol: 2
-
-youth_survey_data |>
- ggplot(aes(x = ages)) +
- geom_histogram(binwidth = 1) +
- theme_minimal() +
- labs(
- x = "Age of respondent",
- y = "Number of respondents"
- )
-
-youth_survey_data_fixed |>
- ggplot(aes(x = ages)) +
- geom_histogram(binwidth = 1) +
- theme_minimal() +
- labs(
- x = "Age of respondent",
- y = "Number of respondents"
- )
-```
-
-
-@fig-youth-survey-1 shows an unexpected value of 150. The most likely explanation is that the data were incorrectly entered, missing the decimal place, and should be 15.0. We could fix that, document it, and then redo the graph, which would show that everything seemed more valid (@fig-youth-survey-2).
-
-### Counts
-
-We want to focus on getting most of the data right, so we are interested in the counts of unique values.\index{data!counts} Hopefully most of the data are concentrated in the most common counts. But it can also be useful to invert it and see what is especially uncommon. The extent to which we want to deal with these depends on what we need. Ultimately, each time we fix one we are getting very few additional observations, potentially even just one. Counts are especially useful with text or categorical data but can be helpful with numerical data as well.
-
-Let us see an example of text data, each of which is meant to be "Australia".\index{text!cleaning}
-
-
-```{r}
-australian_names_data <-
- tibble(
- country = c(
- "Australie", "Austrelia", "Australie", "Australie", "Aeustralia",
- "Austraia", "Australia", "Australia", "Australia", "Australia"
- )
- )
-
-australian_names_data |>
- count(country, sort = TRUE)
-```
-
-
-The use of this count identifies where we should spend our time: changing "Australie" to "Australia" would almost double the amount of usable data.
-
-Turning briefly to numeric data, @Preece1981 recommends plotting counts of the final digit of each observation in a variable. For instance, if the observations of the variable were "41.2", "80.3", "20.7", "1.2", "46.5", "96.2", "32.7", "44.3", "5.1", and "49.0", then we note that 0, 1, and 5 all occur once, 3 and 7 occur twice, and 2 occurs three times. We might expect that there should be a uniform distribution of these final digits. But that is surprisingly often not the case, and the ways in which it differs can be informative. For instance, it may be that data were rounded, or recorded by different collectors.
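-
-A minimal sketch of counting final digits for those ten observations follows. We keep the observations as characters, which is an assumption about how they were recorded, because converting "49.0" to a number would drop the final zero.
-
-```{r}
-library(tidyverse)
-
-observations <-
-  c(
-    "41.2", "80.3", "20.7", "1.2", "46.5",
-    "96.2", "32.7", "44.3", "5.1", "49.0"
-  )
-
-final_digit_counts <-
-  tibble(value = observations) |>
-  # Take the last character of each observation
-  mutate(final_digit = str_sub(value, start = -1)) |>
-  count(final_digit, sort = TRUE)
-
-final_digit_counts
-```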
-
-For instance, later in this chapter we will gather, clean, and prepare some data from the 2019 Kenyan census. We pre-emptively use that dataset here and look at the count of the final digits of the ages. That is, say, from age 35 we take "5", from age 74, we take "4". @tbl-countofages shows the expected age-heaping that occurs because some respondents reply to questions about age with a value to the closest 5 or 10. If we had an age variable without that pattern then we might expect it had been constructed from a different type of question.
-
-
-```{r}
-#| eval: true
-#| echo: false
-#| message: false
-#| warning: false
-#| label: tbl-countofages
-#| tbl-cap: "Excess of 0 and 5 digits in counts of the final digits of single-year ages in Nairobi from the 2019 Kenyan census"
-
-arrow::read_parquet(
- file = "outputs/data/cleaned_nairobi_2019_census.parquet") |>
- filter(!age %in% c("Total", "NotStated", "100+")) |>
- filter(age_type != "age-group") |>
- mutate(age = str_squish(age)) |>
- mutate(age = as.integer(age)) |>
- filter(age >= 20 & age < 100) |>
- filter(gender == "total") |>
- mutate(final_digit = str_sub(age, start= -1)) |>
- summarise(sum = sum(number),
- .by = final_digit) |>
- kable(
- col.names = c("Final digit of age", "Number of times"),
- booktabs = TRUE,
- linesep = "",
- format.args = list(big.mark = ",")
- )
-```
-
-
-
-### Tests
-
-As we said in @sec-reproducible-workflows, if you write code, then you are a programmer, but there is a difference between someone coding for fun, and, say, writing the code that runs the James Webb Telescope. Following @weinbergpsychology [p. 122], we can distinguish between amateurs and professionals based on the existence of subsequent users. When you first start out coding, you typically write code that only you will use. For instance, you may write some code for a class paper. After you get a grade, then in most cases, the code will not be run again. In contrast, a professional writes code for, and often with, other people.
-
-Much academic research these days relies on code. If that research is to contribute to lasting knowledge, then the code that underpins it is being written for others and must work for others well after the researcher has moved to other projects. A professional places appropriate care on tasks that ensure code can be considered by others. A large part of that is tests.
-
-@jplcodingstandards [p. 14] claim that analyses after the fact "often find at least one defect per one hundred lines of code written". There is no reason to believe that code without tests is free of defects, just that they are not known. As such, we should strive to include tests in our code when possible.\index{data!tests} There is some infrastructure for testing data science code. For instance, in Python there is the Test-Driven Data Analysis library of @tdda, but more is needed.
-
-Some things are so important that we require that the cleaned dataset have them. These are conditions that we should check. They would typically come from experience, expert knowledge, or the planning and simulation stages. For instance, there should be no negative numbers in an age variable, and few ages above 110. For these we could specifically require that the condition is met. Another example is when doing cross-country analysis, where a list of country names that we know should be in our dataset would be useful. Our tests would then check whether there were:
-
-1) values in our dataset that were not in that list; and
-2) countries that we expected to be in our dataset that were not.
-
-To have a concrete example, let us consider if we were doing some analysis about the five largest counties in Kenya\index{Kenya}. From looking it up, we find these are: "Nairobi", "Kiambu", "Nakuru", "Kakamega", and "Bungoma". We can create that variable.
-
-
-```{r}
-correct_kenya_counties <-
- c(
- "Nairobi", "Kiambu", "Nakuru", "Kakamega", "Bungoma"
- )
-```
-
-
-Then pretend we have the following dataset, which contains errors.
-
-
-```{r}
-top_five_kenya <-
- tibble(county = c(
- "Nairobi", "Nairob1", "Nakuru", "Kakamega", "Nakuru",
- "Kiambu", "Kiambru", "Kabamega", "Bun8oma", "Bungoma"
- ))
-
-top_five_kenya |>
- count(county, sort = TRUE)
-```
-
-
-Based on the count we know that we must fix some of them. There are two with numbers in the names.
-
-
-```{r}
-top_five_kenya_fixed_1_8 <-
- top_five_kenya |>
- mutate(
- county = str_replace_all(county, "Nairob1", "Nairobi"),
- county = str_replace_all(county, "Bun8oma", "Bungoma")
- )
-
-top_five_kenya_fixed_1_8 |>
- count(county, sort = TRUE)
-```
-
-
-At this point we can compare this with our known correct variable. We check both ways, i.e. is there anything in the correct variable not in our dataset, and is there anything in the dataset not in our correct variable. We use our check conditions to decide whether we are finished.
-
-
-```{r}
-# Compare the unique cleaned values against the expected counties
-cleaned_counties <- unique(top_five_kenya_fixed_1_8$county)
-
-# Is everything in the dataset an expected county?
-if (all(cleaned_counties %in% correct_kenya_counties)) {
-  "The cleaned counties match the expected counties"
-} else {
-  "Not all of the counties have been cleaned completely"
-}
-# Is every expected county present in the dataset?
-if (all(correct_kenya_counties %in% cleaned_counties)) {
-  "The expected counties are in the cleaned counties"
-} else {
-  "Not all the expected counties are in the cleaned counties"
-}
-```
-
-
-It is clear that we still have cleaning to do because not all the counties match what we were expecting.
-
-#### Aspects to test
-
-We will talk about explicit tests for class and dates, given their outsized importance, and how common it is for them to go wrong. But other aspects to explicitly consider testing include:\index{data!tests}
-
-- Variables of monetary values should be tested for reasonable bounds given the situation. In some cases negative values will not be possible. Sometimes an upper bound can be identified. Monetary variables should be numeric. They should not have commas or other separators. They should not contain symbols such as currency signs or semicolons.
-- Variables of population values should likely not be negative. Populations of cities should likely be somewhere between 100,000 and 50,000,000. They again should be numeric, and contain only numbers, no symbols.
-- Names should be character variables. They likely do not contain numbers. They may contain some limited set of symbols, and this would be context specific.
-- The number of observations is surprisingly easy to inadvertently change. While it is fine for this to happen deliberately, when it happens accidentally it can create substantial problems. The number of observations should be tested at the start of any data cleaning process against the data simulation and this expectation updated as necessary. It should be tested throughout the data cleaning process, but especially before and after any joins.
-
-More generally, work with experts and draw on prior knowledge to work out some reasonable features for the variables of interest and then implement these. For instance, consider how @scamswillnotsaveus was able to quickly identify an error in a claim about user numbers by roughly comparing it with how many institutions in the US receive federal financial aid.
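-
-A few of the aspects listed above can be sketched with `stopifnot()`. The dataset here is invented purely for illustration, including its column names and values, and the bounds would need to be adapted to the situation.
-
-```{r}
-# Hypothetical dataset, invented only to illustrate the checks
-city_data <-
-  data.frame(
-    city = c("Alpha", "Beta", "Gamma"),
-    population = c(250000, 1200000, 4000000),
-    budget = c(300.5, 1500, 1200.25),
-    stringsAsFactors = FALSE
-  )
-
-# The number of rows we expect, say from the simulation stage
-expected_rows <- 3
-
-stopifnot(
-  # Names are character and contain no digits
-  is.character(city_data$city),
-  !any(grepl("[0-9]", city_data$city)),
-  # Populations are numeric and within plausible bounds for a city
-  is.numeric(city_data$population),
-  all(city_data$population >= 100000),
-  all(city_data$population <= 50000000),
-  # Monetary values are numeric and not negative
-  is.numeric(city_data$budget),
-  all(city_data$budget >= 0),
-  # The number of observations matches what we expect
-  nrow(city_data) == expected_rows
-)
-```
-
-If every condition holds, `stopifnot()` returns silently. Otherwise it stops at the first condition that fails and names it.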
-
-We can use `validate` to set up a series of tests. For instance, here we will simulate some data with clear issues.
-
-
-```{r}
-#| warning: false
-#| message: false
-
-set.seed(853)
-
-dataset_with_issues <-
- tibble(
- age = c(
- runif(n = 9, min = 0, max = 100) |> round(),
- 1000
- ),
- gender = c(
- sample(
- x = c("female", "male", "other", "prefer not to disclose"),
- size = 9,
- replace = TRUE,
- prob = c(0.4, 0.4, 0.1, 0.1)
- ),
- "tasmania"
- ),
- income = rexp(n = 10, rate = 0.10) |> round() |> as.character()
- )
-
-dataset_with_issues
-```
-
-
-In this case, there is an impossible age, one observation in the gender variable that should not be there, and finally, income is a character variable instead of a numeric. We use `validator()` to establish rules we expect the data to satisfy and `confront()` to determine whether it does.
-
-
-```{r}
-#| warning: false
-#| message: false
-
-rules <- validator(
- is.numeric(age),
- is.character(gender),
- is.numeric(income),
- age < 120,
- gender %in% c("female", "male", "other", "prefer not to disclose")
-)
-
-out <-
- confront(dataset_with_issues, rules)
-
-summary(out)
-```
-
-
-In this case, we can see that there are issues with the final three rules that we established. More generally, @datavalidationbook provides many example tests that can be used.
-
-As mentioned in @sec-farm-data, gender is something that we need to be especially careful about. We will typically have a small number of responses that are neither "male" nor "female". The correct way to deal with the situation depends on context. But if responses other than "male" or "female" are going to be removed from the dataset and ignored, because there are too few of them, showing respect for the respondent might mean including a brief discussion of how they were similar to, or different from, the rest of the dataset. Plots and a more extensive discussion could then be included in an appendix.
-
-
-#### Class
-
-::: {.content-visible when-format="pdf"}
-It is sometimes said that Americans are obsessed with money, while the English are obsessed with class. In the case of data cleaning and preparation we need to be English.\index{data!class} Class is critical and worthy of special attention. We introduce class in the ["R essentials" Online Appendix](https://tellingstorieswithdata.com/20-r_essentials.html) and here we focus on "numeric", "character", and "factor". Explicit checks of the class of variables are essential. Accidentally assigning the wrong class to a variable can have a large effect on subsequent analysis. It is important to:
-
-- check whether some value should be a number or a factor; and
-- check that values are numbers not characters.
-:::
-
-::: {.content-visible unless-format="pdf"}
-It is sometimes said that Americans are obsessed with money, while the English are obsessed with class. In the case of data cleaning and preparation we need to be English. Class is critical and worthy of special attention. We introduce class in [Online Appendix -@sec-r-essentials] and here we focus on "numeric", "character", and "factor". Explicit checks of the class of variables are essential. Accidentally assigning the wrong class to a variable can have a large effect on subsequent analysis. It is important to:
-
-- check whether some value should be a number or a factor; and
-- check that values are numbers not characters.
-:::
-
-
-To understand why it is important to be clear about whether a value is a number or a factor, consider the following situation:\index{data!class}\index{simulation!check class}
-
-
-```{r}
-simulated_class_data <-
- tibble(
- response = c(1, 1, 0, 1, 0, 1, 1, 0, 0),
- group = c(1, 2, 1, 1, 2, 3, 1, 2, 3)
- ) |>
- mutate(
- group_as_integer = as.integer(group),
- group_as_factor = as.factor(group),
- )
-```
-
-
-We use logistic regression, which we cover in more detail in @sec-its-just-a-linear-model, and first include "group" as an integer, then we include it as a factor. @tbl-effect-of-class shows how different the results are and highlights the importance of getting the class of variables used in regression right. In the former, where group is an integer, we impose a consistent relationship between the different levels of the observations, whereas in the latter, where it is a factor, we enable more freedom.
-
-
-```{r}
-#| label: tbl-effect-of-class
-#| tbl-cap: "Examining the effect of class on regression results"
-
-models <- list(
- "Group as integer" = glm(
- response ~ group_as_integer,
- data = simulated_class_data,
- family = "binomial"
- ),
- "Group as factor" = glm(
- response ~ group_as_factor,
- data = simulated_class_data,
- family = "binomial"
- )
-)
-modelsummary(models)
-```
-
-
-Class is so important and subtle, and can have such a pernicious effect on analysis, that analysis with a suite of tests that check class is easier to believe.\index{data!class} Establishing this suite is especially valuable just before modeling, but it is worthwhile setting this up as part of data cleaning and preparation. One reason that Jane Street, the US proprietary trading firm, uses a particular programming language, OCaml, is that its type system makes it more reliable with regard to class [@somers2015].\index{Jane Street} When code matters, class is of vital concern.
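-
-A suite of class tests can be short. The following is a sketch, assuming a hypothetical analysis dataset whose column names and values are invented for illustration.
-
-```{r}
-# Hypothetical analysis dataset, invented for illustration
-analysis_data <-
-  data.frame(
-    response = c(1L, 0L, 1L),
-    group = factor(c("1", "2", "3")),
-    income = c(52.1, 38.4, 61.0),
-    country = c("Kenya", "Canada", "Australia"),
-    stringsAsFactors = FALSE
-  )
-
-# One check per variable, stating the class we intend it to have
-stopifnot(
-  is.integer(analysis_data$response),
-  is.factor(analysis_data$group),
-  is.numeric(analysis_data$income),
-  is.character(analysis_data$country)
-)
-```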
-
-There are many open questions around the effect and implications of type in computer science more generally but there has been some work. For instance, @Gao2017 find that the use of a static type system would have caught around 15 per cent of errors in production JavaScript systems. Languages have been developed, such as TypeScript, where the primary difference, in this case from JavaScript, is that they are statically typed. @Turcotte2020 examine some of the considerations for adding a type system in R. They develop a prototype that goes some way to addressing the technical issues, but acknowledge that large-scale implementation would be challenging for many reasons including the need for users to change.
-
-To this point in this book when we have used `read_csv()`, and other functions for importing data, we have allowed the function to guess the class of the variables. Moving forward we will be more deliberate and instead specify it ourselves using "col_types". For instance, instead of:
-
-::: {.content-visible when-format="pdf"}
-
-```{r}
-#| eval: false
-#| echo: true
-
-raw_igme_data <-
- read_csv(
- file =
- paste0("https://childmortality.org/wp-content",
- "/uploads/2021/09/UNIGME-2021.csv"),
- show_col_types = FALSE
- )
-```
-
-
-We recommend using:
-
-
-```{r}
-#| eval: false
-#| echo: true
-
-raw_igme_data <-
- read_csv(
- file =
- paste0("https://childmortality.org/wp-content",
- "/uploads/2021/09/UNIGME-2021.csv"),
- col_select = c(`Geographic area`, TIME_PERIOD, OBS_VALUE),
- col_types = cols(
- `Geographic area` = col_character(),
- TIME_PERIOD = col_character(),
- OBS_VALUE = col_double(),
- )
- )
-```
-
-:::
-
-::: {.content-visible unless-format="pdf"}
-
-```{r}
-#| eval: false
-#| echo: true
-
-raw_igme_data <-
- read_csv(
- file = "https://childmortality.org/wp-content/uploads/2021/09/UNIGME-2021.csv",
- show_col_types = FALSE
- )
-```
-
-
-We recommend using:
-
-
-```{r}
-#| eval: false
-#| echo: true
-
-raw_igme_data <-
- read_csv(
- file = "https://childmortality.org/wp-content/uploads/2021/09/UNIGME-2021.csv",
- col_select = c(`Geographic area`, TIME_PERIOD, OBS_VALUE),
- col_types = cols(
- `Geographic area` = col_character(),
- TIME_PERIOD = col_character(),
- OBS_VALUE = col_double(),
- )
- )
-```
-
-:::
-
-This is typically an iterative process of initially reading in the dataset, getting a quick sense of it, and then reading it in properly with only the necessary columns and classes specified. While this will require a little extra work from us, it is important that we are clear about class.
-
-#### Dates
-
-A shibboleth for whether someone has worked with dates is their reaction when you tell them you are going to be working with dates.\index{data!cleaning} If they share a horror story, then they have likely worked with dates before!
-
-Extensive checking of dates is important. Ideally, we would like dates to be in the following format: YYYY-MM-DD. There are differences of opinion as to what is an appropriate date format in the broader world. Reasonable people can differ on whether 1 July 2022 or July 1, 2022 is better, but YYYY-MM-DD is the international standard and we should use that in our date variables where possible.
-
-A few tests that could be useful include:
-
-- If a column is days of the week, then test that the only components are Monday, Tuesday, $\dots$, Sunday. Further, test that all seven days are present. Similarly, for month.
-- Test that the number of days is appropriate for each month, for instance, check that September has 30 days, etc.
-- Check whether the dates are in order in the dataset. This need not necessarily be the case, but often when it is not, there are issues worth exploring.
-- Check that the years are complete and appropriate to the analysis period.
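-
-Some of these tests can be sketched in base R. The dates and day names below are invented for illustration, and one date is deliberately impossible. With an explicit `format`, `as.Date()` returns `NA` for a date that does not exist, which gives us a simple way to find such entries.
-
-```{r}
-# Hypothetical dates, invented for illustration; "2022-09-31" cannot exist
-example_dates <- c("2022-09-29", "2022-09-30", "2022-09-31", "2022-10-01")
-
-# Impossible dates become NA when parsed with an explicit format
-parsed_dates <- as.Date(example_dates, format = "%Y-%m-%d")
-invalid_dates <- example_dates[is.na(parsed_dates)]
-invalid_dates
-
-# Among the valid dates, check the ordering and the year
-valid_dates <- parsed_dates[!is.na(parsed_dates)]
-dates_in_order <- all(diff(as.numeric(valid_dates)) >= 0)
-years_as_expected <- all(format(valid_dates, "%Y") == "2022")
-
-# Hypothetical day-of-week column, checked in both directions
-observed_days <- c(
-  "Monday", "Tuesday", "Wednesday", "Thursday",
-  "Friday", "Saturday", "Sunday", "Monday"
-)
-expected_days <- c(
-  "Monday", "Tuesday", "Wednesday", "Thursday",
-  "Friday", "Saturday", "Sunday"
-)
-only_valid_days <- all(observed_days %in% expected_days)
-all_days_present <- all(expected_days %in% observed_days)
-```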
-
-In @sec-fire-hose we introduced a dataset of shelter usage in Toronto in 2021 using `opendatatoronto`.\index{Canada!Toronto shelter usage} Here we examine that same dataset, but for 2017, to illustrate some issues with dates. We first need to download the data.^[If this does not work, then the City of Toronto government may have moved the datasets. Instead use: `earlier_toronto_shelters <- read_csv("https://www.tellingstorieswithdata.com/inputs/data/earlier_toronto_shelters.csv")`.]
-
-
-```{r}
-#| eval: false
-#| echo: true
-
-toronto_shelters_2017 <-
- search_packages("Daily Shelter Occupancy") |>
- list_package_resources() |>
- filter(name == "Daily shelter occupancy 2017.csv") |>
- group_split(name) |>
- map_dfr(get_resource, .id = "file")
-
-write_csv(
- x = toronto_shelters_2017,
- file = "toronto_shelters_2017.csv"
-)
-```
-
-```{r}
-#| eval: false
-#| echo: false
-#| warning: false
-
-write_csv(
- x = toronto_shelters_2017,
- file = here::here("inputs/data/toronto_shelters_2017.csv")
-)
-```
-
-```{r}
-#| eval: true
-#| echo: false
-#| warning: false
-
-toronto_shelters_2017 <-
- read_csv(
- here::here("inputs/data/toronto_shelters_2017.csv"),
- show_col_types = FALSE
- )
-```
-
-
-We need to make the names easier to type and only keep relevant columns.
-
-
-```{r}
-#| warning: false
-#| message: false
-
-toronto_shelters_2017 <-
- toronto_shelters_2017 |>
- clean_names() |>
- select(occupancy_date, sector, occupancy, capacity)
-```
-
-
-The main issue with this dataset will be the dates. We will find that the dates appear to be mostly year-month-day, but certain observations may be year-day-month. We use `ymd()` from `lubridate` to parse the date in that order.
-
-
-```{r}
-#| warning: false
-#| message: false
-
-toronto_shelters_2017 <-
- toronto_shelters_2017 |>
- mutate(
- # remove times
- occupancy_date =
- str_remove(
- occupancy_date,
- "T[:digit:]{2}:[:digit:]{2}:[:digit:]{2}"
- )) |>
- mutate(generated_date = ymd(occupancy_date, quiet = TRUE))
-
-toronto_shelters_2017
-```
-
-
-The plot of the distribution of what purports to be the day component makes it clear that there are concerns (@fig-homeless-daycount-1). In particular we are concerned that the distribution of the days is not roughly uniform.
-
-
-```{r}
-#| label: fig-homeless-daycount
-#| fig-cap: "Examining the date in more detail"
-#| fig-subcap: ["Counts, by third component of occupancy date", "Comparison of row number with date"]
-#| layout-ncol: 2
-
-toronto_shelters_2017 |>
- separate(
- generated_date,
- into = c("one", "two", "three"),
- sep = "-",
- remove = FALSE
- ) |>
- count(three) |>
- ggplot(aes(x = three, y = n)) +
- geom_point() +
- theme_minimal() +
- labs(x = "Third component of occupancy date",
- y = "Number")
-
-toronto_shelters_2017 |>
- mutate(row_number = seq_len(nrow(toronto_shelters_2017))) |>
- ggplot(aes(x = row_number, y = generated_date)) +
- geom_point(alpha = 0.3) +
- theme_minimal() +
- labs(
- x = "Row number",
- y = "Date"
- )
-```
-
-
-As mentioned, one graph that is especially useful when cleaning a dataset shows the order in which the observations appear in the dataset. For instance, we would generally expect that there would be a rough ordering in terms of date. To examine whether this is the case, we can graph the date variable in the order it appears in the dataset (@fig-homeless-daycount-2).
-
-While this is just a quick graph it illustrates the point---there are a lot in order, but not all. If they were in order, then we would expect them to be along the diagonal. It is odd that the data are not in order, especially as there appears to be something systematic initially. We can summarize the data to get a count of occupancy by day.
-
-
-```{r}
-# Idea from Lisa Lendway
-toronto_shelters_by_day <-
- toronto_shelters_2017 |>
- drop_na(occupancy, capacity) |>
- summarise(
- occupancy = sum(occupancy),
- capacity = sum(capacity),
- usage = occupancy / capacity,
- .by = generated_date
- )
-```
-
-
-We are interested in the availability of shelter spots in Toronto for each day (@fig-plotoccupancy).
-
-
-```{r}
-#| label: fig-plotoccupancy
-#| fig-cap: "Occupancy per day in Toronto shelters"
-
-toronto_shelters_by_day |>
- ggplot(aes(x = day(generated_date), y = occupancy)) +
- geom_point(alpha = 0.3) +
- scale_y_continuous(limits = c(0, NA)) +
- labs(
- color = "Type",
- x = "Day",
- y = "Occupancy (number)"
- ) +
- facet_wrap(
- vars(month(generated_date, label = TRUE)),
- scales = "free_x"
- ) +
- theme_minimal() +
- scale_color_brewer(palette = "Set1")
-```
-
-
-There seems to be an issue with the first 12 days of the month. We noted that when we look at the data it is a bit odd that they are not in order. From @fig-homeless-daycount-2 it looks like there is some systematic issue that affects many observations. In general, it seems that in the date variable the first 12 days are the wrong way around, i.e. we think it is year-month-day, but it is actually year-day-month. But there are exceptions. As a first pass, we can flip those first 12 days of each month and see if that helps. It will be fairly blunt, but hopefully gets us somewhere.
-
-
-```{r}
-# Code by Monica Alexander
-padded_1_to_12 <- sprintf("%02d", 1:12)
-
-list_of_dates_to_flip <-
- paste(2017, padded_1_to_12,
- rep(padded_1_to_12, each = 12), sep = "-")
-
-toronto_shelters_2017_flip <-
- toronto_shelters_2017 |>
- mutate(
- year = year(generated_date),
- month = month(generated_date),
- day = day(generated_date),
- generated_date = as.character(generated_date),
- changed_date = if_else(
- generated_date %in% list_of_dates_to_flip,
- paste(year, day, month, sep = "-"),
- paste(year, month, day, sep = "-"),
- ),
- changed_date = ymd(changed_date)
- ) |>
- select(-year, -month, -day)
-```
-
-
-Now let us take a look (@fig-sheltersdatebyrowadj).
-
-
-```{r}
-#| label: fig-sheltersdatebyrowadj
-#| fig-cap: "Adjusted dates, occupancy in Toronto shelters"
-#| fig-subcap: ["Date of each row in order after adjustment", "Toronto shelters daily occupancy after adjustment"]
-#| layout-ncol: 2
-
-toronto_shelters_2017_flip |>
- mutate(counter = seq_len(nrow(toronto_shelters_2017_flip))) |>
- ggplot(aes(x = counter, y = changed_date)) +
- geom_point(alpha = 0.3) +
- labs(x = "Row in the dataset",
- y = "Date of that row") +
- theme_minimal()
-
-toronto_shelters_2017_flip |>
- drop_na(occupancy, capacity) |>
- summarise(occupancy = sum(occupancy),
- .by = changed_date) |>
- ggplot(aes(x = day(changed_date), y = occupancy)) +
- geom_point(alpha = 0.3) +
- scale_y_continuous(limits = c(0, NA)) +
- labs(color = "Type",
- x = "Changed day",
- y = "Occupancy (number)") +
- facet_wrap(vars(month(changed_date, label = TRUE)),
- scales = "free_x") +
- theme_minimal()
-```
-
-
-It has not fixed all the issues. For instance, notice there are now no entries below the diagonal (@fig-sheltersdatebyrowadj-1). But we can see that it has almost entirely taken care of the systematic differences (@fig-sheltersdatebyrowadj-2). This is where we will leave this example.
-
-## Simulated example: running times
-
-To provide a specific example, which we will return to in @sec-its-just-a-linear-model, consider the time it takes someone to run five kilometers (which is a little over three miles), compared with the time it takes them to run a marathon (@fig-fivekmvsmarathon-1).
-
-Here we consider "simulate" and "acquire", focused on testing. In the simulation we specify a relationship of 8.4, as that is roughly the ratio between a five-kilometer run and the 42.2 kilometer distance of a marathon (a little over 26 miles).\index{simulation!running times}\index{distribution!Normal}
-
-
-```{r}
-#| eval: true
-#| include: true
-#| label: fig-simulatemarathondata
-#| message: false
-#| warning: false
-
-set.seed(853)
-
-num_observations <- 200
-expected_relationship <- 8.4
-fast_time <- 15
-good_time <- 30
-
-sim_run_data <-
- tibble(
- five_km_time =
- runif(n = num_observations, min = fast_time, max = good_time),
- noise = rnorm(n = num_observations, mean = 0, sd = 20),
- marathon_time = five_km_time * expected_relationship + noise
- ) |>
- mutate(
- five_km_time = round(x = five_km_time, digits = 1),
- marathon_time = round(x = marathon_time, digits = 1)
- ) |>
- select(-noise)
-
-sim_run_data
-```
-
-
-We can use our simulation to put in place various tests that we would want the actual data to satisfy. For instance, we want the class of the five kilometer and marathon run times to be numeric. And we want 200 observations.
-
-
-```{r}
-stopifnot(
- class(sim_run_data$marathon_time) == "numeric",
- class(sim_run_data$five_km_time) == "numeric",
- nrow(sim_run_data) == 200
-)
-```
-
-
-We know that any value that is less than 15 minutes or more than 30 minutes for the five-kilometer run time is likely something that needs to be followed up on.
-
-
-```{r}
-stopifnot(
- min(sim_run_data$five_km_time) >= 15,
- max(sim_run_data$five_km_time) <= 30
-)
-```
-
-
-Based on this maximum and the simulated relationship of 8.4, we would be surprised if we found any marathon times that were substantially over $30\times8.4=252$ minutes, after we allow for a little bit of drift, say 300 minutes. (To be clear, there is nothing wrong with taking longer than this to run a marathon, but it is just unlikely based on our simulation parameters.) And we would be surprised if the world record marathon time, 121 minutes as at the start of 2023, were improved by anything more than a minute or two, say, anything faster than 118 minutes. (It will turn out that our simulated data do not satisfy this and result in an implausibly fast 88 minute marathon time, which suggests a need to improve the simulation.)
-
-
-```{r}
-#| eval: false
-#| include: true
-
-stopifnot(
- min(sim_run_data$marathon_time) >= 118,
- max(sim_run_data$marathon_time) <= 300
-)
-```
-
-
-We can then take these tests to real data. Actual survey data on the relationship between five kilometer and marathon run times are available from @Vickers2016. After downloading the data, which @Vickers2016 make available as an "Additional file", we can focus on the variables of interest and only individuals with both a five-kilometer time and a marathon time.
-
-
-```{r}
-#| eval: false
-#| include: true
-
-vickers_data <-
- read_excel("13102_2016_52_MOESM2_ESM.xlsx") |>
- select(k5_ti, mf_ti) |>
- drop_na()
-
-vickers_data
-```
-
-```{r}
-#| eval: true
-#| echo: false
-
-vickers_data <-
- read_excel("inputs/data/13102_2016_52_MOESM2_ESM.xlsx") |>
- select(k5_ti, mf_ti) |>
- drop_na()
-
-vickers_data
-```
-
-
-The first thing that we notice is that our data are in seconds, whereas we were expecting them to be in minutes. This is fine. Our simulation and tests can update, or we can adjust our data. Our simulation and tests retain their value even when the data turn out to be slightly different, which they inevitably will.
-
-In this case, we will divide by sixty, and round, to shift our data into minutes.
-
-
-```{r}
-vickers_data <-
- vickers_data |>
- mutate(five_km_time = round(k5_ti / 60, 1),
- marathon_time = round(mf_ti / 60, 1)
- ) |>
- select(five_km_time, marathon_time)
-
-vickers_data
-```
-
-```{r}
-#| eval: false
-#| include: true
-
-stopifnot(
- class(vickers_data$marathon_time) == "numeric",
- class(vickers_data$five_km_time) == "numeric",
- min(vickers_data$five_km_time) >= 15,
- max(vickers_data$five_km_time) <= 30,
- min(vickers_data$marathon_time) >= 118,
- max(vickers_data$marathon_time) <= 300
-)
-```
-
-
-In this case, our tests, which were written for the simulated data, identify that we have five-kilometer run times that are faster than 15 minutes and slower than 30 minutes. They also identify marathon times that are longer than 300 minutes. If we were actually using these data for analysis, then our next step would be to plot the data, taking care to examine each of these points that our tests identified, and then either adjust the tests or the dataset.
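-
-Because `stopifnot()` stops at the first failed condition, it can help to pull out the offending rows directly. The following sketch uses invented run times, in minutes, rather than the survey data, and applies the same bounds as the tests.
-
-```{r}
-library(dplyr)
-
-# Hypothetical run times in minutes, invented for illustration
-example_run_data <-
-  tibble(
-    five_km_time = c(14.2, 22.5, 31.0, 25.3),
-    marathon_time = c(125.0, 200.4, 320.1, 210.9)
-  )
-
-# Keep only the rows that violate at least one of the bounds
-rows_to_follow_up <-
-  example_run_data |>
-  filter(
-    five_km_time < 15 | five_km_time > 30 |
-      marathon_time < 118 | marathon_time > 300
-  )
-
-rows_to_follow_up
-```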
-
-## Names
-
-> An improved scanning software we developed identified gene name errors in 30.9% (3,436/11,117) of articles with supplementary Excel gene lists; a figure significantly higher than previously estimated. This is due to gene names being converted not just to dates and floating-point numbers, but also to internal date format (five-digit numbers).
->
-> @omggenes
-
-Names matter.\index{data!names} The land on which much of this book was written is today named Canada, but for a long time was known as Turtle Island. Similarly, there is a big rock in the center of Australia. For a long time, it was called Uluru, then it was known as Ayers Rock. Today it has a dual name that combines both. And in parts of the US South, including signage surrounding the South Carolina State House, the US Civil War is referred to as the War of Northern Aggression. In these examples, the name that is used conveys information, not only about the user, but about the circumstances. Even the British Royal Family recognizes the power of names. In 1917 they changed from the House of Saxe-Coburg and Gotha to the House of Windsor. It was felt that the former was too Germanic given World War I. Names matter in everyday life. And they matter in our code, too.
-
-When coding, names are critical and worthy of special attention because [@hermans2021programmer]:\index{data!names}
-
-1) they help document our code as they are contained in the code;
-2) they make up a large proportion of any script;
-3) they are referred to a lot by others; and
-4) they help the reader understand what is happening in the code.
-
-In addition to respecting the nature of the data, names need to satisfy two additional considerations:
-
-1) they need to be machine readable, and
-2) they need to be human readable.
-
-### Machine-readable
-
-Ensuring machine-readable names can be an easier standard to meet. It usually means avoiding spaces and special characters. A space can be replaced with an underscore. For instance, we prefer "my_data" to "my data". Avoiding spaces enables tab-completion which makes us more efficient. It also helps with reproducibility because spaces are considered differently by different operating systems.
-
-Usually, special characters should be removed because they can be inconsistent between different computers and languages. This is especially the case with slash, backslash, asterisk, and both single and double quotation marks. Try to avoid using those in names.
-
-Names should also be unique within a dataset, and unique within a collection of datasets unless that particular variable is being deliberately used as a key to join different datasets. This usually means that the domain is critical for effective names, and when working as part of a team this all gets much more difficult [@hermans2017peter]. Names need to not only be unique, but notably different when there is a potential for confusion. For instance, for many years, the language PHP had both `mysql_escape_string` and `mysql_real_escape_string` [@somers2015]. It is easy to see how programmers may have accidentally written one when they meant the other.
-
-An especially useful function to use to get closer to machine-readable names is `clean_names()` from `janitor`. This deals with those issues mentioned above as well as a few others.\index{data!names}
-
-
-```{r}
-some_bad_names <-
- tibble(
- "Second Name has spaces" = c(1),
- "weird#symbol" = c(1),
- "InCoNsIsTaNtCaPs" = c(1)
- )
-
-bad_names_made_better <-
- some_bad_names |>
- clean_names()
-
-some_bad_names
-
-bad_names_made_better
-```
-
-
-### Human-readable
-
-> Programs must be written for people to read, and only incidentally for machines to execute
->
-> @abelson1996structure
-
-In the same way that we emphasized in @sec-on-writing that we write papers for the reader, here we emphasize that we write code for the reader. Human-readable names require an additional layer, and extensive consideration. Following @jsfcodingstandards [p. 25], we should avoid names that differ only by the use of the letter "O" instead of the number "0" or the letter "D", and similarly the letter "S" instead of the number "5".
-
-We should consider other cultures and how they may interpret some of the names that we use. We also need to consider different experience levels that subsequent users of the dataset may have. This is both in terms of experience with data science and experience with similar datasets. For instance, a variable called "flag" is often used to signal that an observation needs to be followed up with or treated carefully in some way. An experienced analyst will know this, but a beginner will not. Try to use meaningful names wherever possible [@lin2020ten]. It has been found that shorter names may take longer to comprehend [@shorternamestakelonger], and so it is often useful to avoid uncommon abbreviations where possible.
-
-@jennybryanonnames recommends that file names, in particular, should consider the default ordering that a file manager will impose.\index{data!names} This might mean adding prefixes such as "00-", "01-", etc to filenames (which might involve left-padding with zeros depending on the number of files). Critically it means using ISO 8601 for dates. That was introduced earlier and means that 2 December 2022 would be written "2022-12-02". The reason for using such file names is to provide information to other people about the order of the files.
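-
-As a minimal sketch of why left-padding and ISO 8601 matter, we can compare how a few file names and dates sort. The file names and dates here are made up for illustration, and `method = "radix"` is used only so that the ordering does not depend on the locale.
-
-```{r}
-# Hypothetical script numbers, with and without left-padding
-script_numbers <- c(1L, 2L, 10L)
-unpadded <- paste0(script_numbers, "-clean.R")
-padded <- paste0(sprintf("%02d", script_numbers), "-clean.R")
-
-# Without padding, "10-clean.R" sorts ahead of "2-clean.R"
-sort(unpadded, method = "radix")
-sort(padded, method = "radix")
-
-# Hypothetical dates, written here in chronological order
-iso_dates <- c("2022-02-11", "2022-12-02", "2023-01-15")
-day_first_dates <- c("11-02-2022", "02-12-2022", "15-01-2023")
-
-# ISO 8601 strings sort chronologically, but day-first strings do not
-sort(iso_dates, method = "radix")
-sort(day_first_dates, method = "radix")
-```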
-
-One interesting feature of R is that in certain cases partial matching on names is possible. For instance:
-
-
-```{r}
-partial_matching_example <-
- data.frame(
- my_first_name = c(1, 2),
- another_name = c("wow", "great")
- )
-
-partial_matching_example$my_first_name
-partial_matching_example$my
-```
-
-
-This behavior is not possible within the `tidyverse` (for instance, if `data.frame` were replaced with `tibble` in the above code). Partial matching should rarely be used. It makes it more difficult to understand code after a break, and for others to come to it fresh.
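-
-One way to guard against accidental partial matching, sketched here in base R, is to use `[[`, which matches names exactly by default, or to ask R to warn whenever `$` relies on a partial match. The data frame repeats the one used above.
-
-```{r}
-partial_matching_example <-
-  data.frame(
-    my_first_name = c(1, 2),
-    another_name = c("wow", "great")
-  )
-
-# `[[` uses exact matching by default, so the partial name returns NULL
-partial_matching_example[["my"]]
-
-# The full name still works as expected
-partial_matching_example[["my_first_name"]]
-
-# Ask R to warn whenever `$` relies on a partial match
-options(warnPartialMatchDollar = TRUE)
-```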
-
-Variable names should have a consistent structure.\index{data!names} For instance, imposing the naming pattern `verb_noun`, as in `read_csv()`, then having one function that was `noun_verb`, perhaps `csv_read()`, would be inconsistent. That inconsistency imposes a significant cost because it makes it more difficult to remember the name of the function.
-
-R, Python, and many of the other languages that are commonly used for data science are dynamically typed, as opposed to statically typed.\index{data!class}\index{data science!language features} This means that class can be defined independently of declaring a variable. One interesting area of data science research is moving partially toward static typing and understanding what that might mean. For instance, Python [enabled](https://peps.python.org/pep-0484/) type hints in 2014 [@boykistypehints]. While not required, this goes some way to being more explicit about types.
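-
-A small sketch of what dynamic typing means in practice: the same name can hold an integer and then a character, and nothing stops us unless we add a check ourselves. The object names are made up for illustration.
-
-```{r}
-some_value <- 1L
-class_before <- class(some_value)
-
-# Nothing prevents the same name being reused for a different class
-some_value <- "one"
-class_after <- class(some_value)
-
-c(class_before, class_after)
-
-# An explicit check is one way to make the expected class clear
-is.integer(some_value)
-```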
-
-@columnnamesascontracts advises using variable names as contracts. We do this by establishing a controlled vocabulary for them. In this way, we would define a set of words that we can use in names. In the controlled vocabulary of @columnnamesascontracts a variable could start with an abbreviation for its class, then something specific to what it pertains to, and then various details.\index{data!names}
-
-For instance, we could consider column names of "age" and "sex". Following @columnnamesascontracts we may change these to be more informative of the class and other information. This issue is not settled, and there is not yet best practice. For instance, there are arguments against this in terms of readability.
-
-
-```{r}
-some_names <-
- tibble(
- age = as.integer(c(1, 3, 35, 36)),
- sex = factor(c("male", "male", "female", "male"))
- )
-
-riederer_names <-
- some_names |>
- rename(
- integer_age_respondent = age,
- factor_sex_respondent = sex
- )
-
-some_names
-
-riederer_names
-```
-
-
-Even just trying to be a little more explicit and consistent about names throughout a project typically brings substantial benefits when we come to revisit the project later. Would a rose by any other name smell as sweet? Of course. But we call it a rose---or even better *Rosa rubiginosa*---because that helps others know what we are talking about, compared with, say, "red_thing", "five_petaled_smell_nice", "flower", or "r_1". It is clearer, and helps others efficiently understand.
-
-
-## 1996 Tanzanian DHS
-
-We will now go through the first of two examples. The Demographic and Health Surveys (DHS)\index{Demographic and Health Surveys} play an important role in gathering data in areas where we may not have other datasets. Here we will clean and prepare a DHS table about household populations in Tanzania\index{Tanzania!population in 1996} in 1996.\index{text!cleaning} As a reminder, the workflow that we advocate in this book is:
-
-$$
-\mbox{Plan}\rightarrow\mbox{Simulate}\rightarrow\mbox{Acquire}\rightarrow\mbox{Explore}\rightarrow\mbox{Share}
-$$
-
-We are interested in the distribution of age-groups, gender, and urban/rural. A quick sketch might look like @fig-tanzaniasketch.
-
-![Quick sketch of a dataset that we might be interested in](figures/09-tanzania_sketch.png){#fig-tanzaniasketch width=70% fig-align="center"}
-
-We can then simulate a dataset.\index{simulation!Tanzania DHS}\index{distribution!Normal}
-
-
-```{r}
-#| warning: false
-#| message: false
-
-set.seed(853)
-
-age_group <- tibble(starter = 0:19) |>
- mutate(lower = starter * 5, upper = starter * 5 + 4) |>
- unite(string_sequence, lower, upper, sep = "-") |>
- pull(string_sequence)
-
-mean_value <- 10
-
-simulated_tanzania_dataset <-
- tibble(
- age_group = age_group,
- urban_male = round(rnorm(length(age_group), mean_value)),
- urban_female = round(rnorm(length(age_group), mean_value)),
- rural_male = round(rnorm(length(age_group), mean_value)),
- rural_female = round(rnorm(length(age_group), mean_value)),
- total_male = round(rnorm(length(age_group), mean_value)),
- total_female = round(rnorm(length(age_group), mean_value))
- ) |>
- mutate(
- urban_total = urban_male + urban_female,
- rural_total = rural_male + rural_female,
- total_total = total_male + total_female
- )
-
-simulated_tanzania_dataset
-```
-
-
-Based on this simulation we are interested to test:\index{testing}
-
-a) Whether there are only numbers.
-b) Whether the sum of urban and rural match the total column.
-c) Whether the sum of the age-groups match the total.
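-
-These three tests could be sketched as follows. The toy table, its column names, and its counts are made up for illustration, and were chosen so that the parts add up, which the simulated dataset above does not guarantee.
-
-```{r}
-library(dplyr)
-library(tibble)
-
-# Hypothetical counts where urban + rural = total in every row
-toy_counts <-
-  tibble(
-    age_group = c("0-4", "5-9", "Total"),
-    urban = c(10, 12, 22),
-    rural = c(8, 9, 17),
-    total = c(18, 21, 39)
-  )
-
-age_rows <- toy_counts |> filter(age_group != "Total")
-total_row <- toy_counts |> filter(age_group == "Total")
-
-# a) Every column other than age_group contains only numbers
-only_numbers <- all(sapply(select(toy_counts, -age_group), is.numeric))
-
-# b) Urban and rural sum to the total column
-parts_match_total <- all(toy_counts$urban + toy_counts$rural == toy_counts$total)
-
-# c) The age-groups sum to the "Total" row
-age_groups_match_total <- sum(age_rows$total) == total_row$total
-
-c(only_numbers, parts_match_total, age_groups_match_total)
-```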
-
-We begin by downloading the data.^[Or use: https://www.tellingstorieswithdata.com/inputs/pdfs/1996_Tanzania_DHS.pdf.]
-
-
-```{r}
-#| eval: false
-#| include: true
-
-download.file(
- url = "https://dhsprogram.com/pubs/pdf/FR83/FR83.pdf",
- destfile = "1996_Tanzania_DHS.pdf",
- mode = "wb"
-)
-```
-
-```{r}
-#| eval: false
-#| include: false
-
-# INTERNAL
-download.file(
- url = "https://dhsprogram.com/pubs/pdf/FR83/FR83.pdf",
- destfile = "inputs/pdfs/1996_Tanzania_DHS.pdf",
- mode = "wb"
-)
-```
-
-
-When we have a PDF and want to read the content into R, then `pdf_text()` from `pdftools` is useful, as introduced in @sec-gather-data. It works well for many recently produced PDFs because the content is text which it can extract. But if the PDF is an image, then `pdf_text()` will not work. Instead, the PDF will first need to go through OCR, which was also introduced in @sec-gather-data.
-
-
-```{r}
-#| eval: false
-#| include: true
-
-tanzania_dhs <-
- pdf_text(
- pdf = "1996_Tanzania_DHS.pdf"
- )
-```
-
-```{r}
-#| eval: true
-#| include: false
-
-# INTERNAL
-tanzania_dhs <- pdf_text("inputs/pdfs/1996_Tanzania_DHS.pdf")
-```
-
-
-In this case we are interested in Table 2.1, which is on the 33rd page of the PDF (@fig-tanzanian-dhs).
-
-![The page of interest in the 1996 Tanzanian DHS](figures/tanzanian_dhs.png){#fig-tanzanian-dhs width=95% fig-align="center"}
-
-We use `stri_split_lines()` from `stringi` to focus on that particular page.
-
-
-```{r}
-# From Bob Rudis: https://stackoverflow.com/a/47793617
-tanzania_dhs_page_33 <- stri_split_lines(tanzania_dhs[[33]])[[1]]
-```
-
-
-We first want to remove all the written content and focus on the table. We then want to convert that into a tibble so that we can use our familiar `tidyverse` approaches.
-
-
-```{r}
-#| warning: false
-#| message: false
-
-tanzania_dhs_page_33_only_data <- tanzania_dhs_page_33[31:55]
-
-tanzania_dhs_raw <- tibble(all = tanzania_dhs_page_33_only_data)
-
-tanzania_dhs_raw
-```
-
-
-All the columns have been collapsed into one, so we need to separate them. We will do this based on the existence of a space, which means we first need to change "Age group" to "Age-group" because we do not want that separated.
-
-
-```{r}
-# Separate columns
-tanzania_dhs_separated <-
- tanzania_dhs_raw |>
- mutate(all = str_squish(all)) |>
- mutate(all = str_replace(all, "Age group", "Age-group")) |>
- separate(
- col = all,
- into = c(
- "age_group",
- "male_urban", "female_urban", "total_urban",
- "male_rural", "female_rural", "total_rural",
- "male_total", "female_total", "total_total"
- ),
- sep = " ",
- remove = TRUE,
- fill = "right",
- extra = "drop"
- )
-
-tanzania_dhs_separated
-```
-
-
-Now we need to clean the rows and columns. One helpful "negative space" approach to work out what we need to remove, is to look at what is left if we temporarily remove everything that we know we want. Whatever is left is then a candidate for being removed. In this case we know that we want the columns to contain numbers, so we remove numeric digits from all columns to see what might stand in our way of converting from string to numeric.
-
-
-```{r}
-tanzania_dhs_separated |>
- mutate(across(everything(), ~ str_remove_all(., "[:digit:]"))) |>
- distinct()
-```
-
-
-In this case we can see that some decimal points have been incorrectly extracted as commas and semicolons. Also, some tildes and blank lines need to be removed. After that we can impose the correct class.
-
-
-```{r}
-tanzania_dhs_cleaned <-
- tanzania_dhs_separated |>
- slice(6:22, 24, 25) |>
- mutate(across(everything(), ~ str_replace_all(., "[,;]", "."))) |>
- mutate(
- age_group = str_replace(age_group, "20-~", "20-24"),
- age_group = str_replace(age_group, "40-~", "40-44"),
- male_rural = str_replace(male_rural, "14.775", "14775")
- ) |>
- mutate(across(starts_with(c(
- "male", "female", "total"
- )),
- as.numeric))
-
-tanzania_dhs_cleaned
-```
-
-
-Finally, we may wish to check that the sum of the constituent parts equals the total.
-
-
-```{r}
-tanzania_dhs_cleaned |>
- filter(!age_group %in% c("Total", "Number")) |>
- summarise(sum = sum(total_total))
-```
-
-In this case we can see that it is a few tenths of a percentage point off.
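-
-Because published percentages are rounded, an exact comparison with 100 would be too strict. A minimal sketch of a tolerance-based check follows, where the percentages and the tolerance of half a percentage point are made up for illustration.
-
-```{r}
-# Hypothetical rounded percentages that should sum to about 100
-rounded_percentages <- c(33.4, 33.4, 33.3)
-tolerance <- 0.5
-
-# TRUE when the sum is within the tolerance of 100
-sum_is_close_to_100 <- abs(sum(rounded_percentages) - 100) <= tolerance
-
-sum_is_close_to_100
-```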
-
-
-## 2019 Kenyan census
-
-As a final example, let us consider a more extensive situation and gather, clean, and prepare some data from the 2019 Kenyan\index{Kenya!Nairobi} census. We will focus on creating a dataset of single-year counts, by gender, for Nairobi.\index{text!cleaning}\index{Kenya!gender}\index{gender!2019 Kenyan census}\index{Kenya!2019 census}
-
-The distribution of population by age, sex, and administrative unit from the 2019 Kenyan census can be downloaded [here](https://www.knbs.or.ke/?wpdmpro=2019-kenya-population-and-housing-census-volume-iii-distribution-of-population-by-age-sex-and-administrative-units). While this format as a PDF makes it easy to look up a particular result, it is not overly useful if we want to model the data. In order to be able to do that, we need to convert this PDF into a tidy dataset that can be analyzed.
-
-### Gather and clean
-
-We first need to download and read in the PDF of the 2019 Kenyan census.^[If the Kenyan government link breaks then replace their URL with: https://www.tellingstorieswithdata.com/inputs/pdfs/2019_Kenya_census.pdf.]
-
-
-```{r}
-#| eval: false
-#| include: true
-
-census_url <-
- paste0(
- "https://www.knbs.or.ke/download/2019-kenya-population-and-",
- "housing-census-volume-iii-distribution-of-population-by-age-",
- "sex-and-administrative-units/?wpdmdl=5729&refresh=",
- "620561f1ce3ad1644519921"
- )
-
-download.file(
- url = census_url,
- destfile = "2019_Kenya_census.pdf",
- mode = "wb"
-)
-```
-
-
-We can use `pdf_text()` from `pdftools` again here.
-
-
-```{r}
-#| eval: false
-#| include: true
-
-kenya_census <-
- pdf_text(
- pdf = "2019_Kenya_census.pdf"
- )
-```
-
-```{r}
-#| eval: true
-#| include: false
-
-# INTERNAL
-
-kenya_census <- pdf_text("inputs/pdfs/2019_Kenya_census.pdf")
-```
-
-
-In this example we will focus on the page of the PDF about Nairobi (@fig-examplekenyancensuspage).
-
-![Page from the 2019 Kenyan census about Nairobi](figures/09-Kenya_Nairobi.png){#fig-examplekenyancensuspage width=95% fig-align="center"}
-
-#### Make rectangular
-
-The first challenge is to get the dataset into a format that we can more easily manipulate. We will extract the relevant parts of the page. In this case, data about Nairobi is on page 410.
-
-
-```{r}
-# Focus on the page of interest
-just_nairobi <- stri_split_lines(kenya_census[[410]])[[1]]
-
-# Remove blank lines
-just_nairobi <- just_nairobi[just_nairobi != ""]
-
-# Remove titles, headings and other content at the top of the page
-just_nairobi <- just_nairobi[5:length(just_nairobi)]
-
-# Remove page numbers and other content at the bottom of the page
-just_nairobi <- just_nairobi[1:62]
-
-# Convert into a tibble
-demography_data <- tibble(all = just_nairobi)
-```
-
-
-At this point the data are in a tibble. This allows us to use our familiar `dplyr` verbs. In particular we want to separate the columns.
-
-
-```{r}
-demography_data <-
- demography_data |>
- mutate(all = str_squish(all)) |>
- mutate(all = str_replace(all, "10 -14", "10-14")) |>
- mutate(all = str_replace(all, "Not Stated", "NotStated")) |>
- # Deal with the two column set-up
- separate(
- col = all,
- into = c(
- "age", "male", "female", "total",
- "age_2", "male_2", "female_2", "total_2"
- ),
- sep = " ",
- remove = TRUE,
- fill = "right",
- extra = "drop"
- )
-```
-
-
-The two sets of columns are side by side at the moment. We instead need to append the second set to the bottom of the first.
-
-
-```{r}
-demography_data_long <-
- rbind(
- demography_data |> select(age, male, female, total),
- demography_data |>
- select(age_2, male_2, female_2, total_2) |>
- rename(
- age = age_2,
- male = male_2,
- female = female_2,
- total = total_2
- )
- )
-```
-
-```{r}
-# There is one row of NAs, so remove it
-demography_data_long <-
- demography_data_long |>
- remove_empty(which = c("rows"))
-
-demography_data_long
-```
-
-
-Having got it into a rectangular format, we now need to clean the dataset to make it useful.
-
-#### Validity
-
-Attaining validity\index{validity} requires a number of steps. The first step is to make the numbers into actual numbers, rather than characters. Before we can convert the type, we need to remove anything that is not a number, otherwise that cell will be converted into an NA. We first identify any values that are not numbers so that we can remove them, and `distinct()` is especially useful for this.
-
-
-```{r}
-demography_data_long |>
- select(male, female, total) |>
- mutate(across(everything(), ~ str_remove_all(., "[:digit:]"))) |>
- distinct()
-```
-
-
-We need to remove commas. While we could use `janitor` here, it is worthwhile to at least first look at what is going on because sometimes there is odd stuff that `janitor` (and other packages) will not deal with in a way that we want. Nonetheless, having identified everything that needs to be removed, we can do the actual removal and convert our character column of numbers to integers.
-
-
-```{r}
-demography_data_long <-
- demography_data_long |>
- mutate(across(c(male, female, total), ~ str_remove_all(., ","))) |>
- mutate(across(c(male, female, total), ~ as.integer(.)))
-
-demography_data_long
-```
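-
-The reason for removing the commas first can be seen in a small base R sketch, using a made-up value. Converting a string that still contains a comma results in `NA`, whereas the same string without the comma converts as expected.
-
-```{r}
-count_as_text <- "1,234"
-
-# The comma means the conversion fails and gives NA (with a warning)
-with_comma <- suppressWarnings(as.integer(count_as_text))
-
-# Removing the comma first gives the integer that we want
-without_comma <- as.integer(gsub(",", "", count_as_text, fixed = TRUE))
-
-c(with_comma, without_comma)
-```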
-
-
-#### Internal consistency
-
-\index{consistency!internal}
-
-The census has done some of the work of putting together age-groups for us, but we want to make it easy to just focus on the counts by single-year age. As such we will add a flag as to the type of age it is: an age-group, such as "ages 0 to 5", or a single age, such as "1".
-
-
-```{r}
-demography_data_long <-
- demography_data_long |>
- mutate(
- age_type = if_else(str_detect(age, "-"),
- "age-group",
- "single-year"),
- age_type = if_else(str_detect(age, "Total"),
- "age-group",
- age_type)
- )
-```
-
-
-At the moment, age is a character variable. We have a decision to make here. We do not want it to be a character variable (because it will not graph properly), but we do not want it to be numeric, because there is `Total` and `100+` in there. For now, we will just make it into a factor, and at least that will be able to be nicely graphed.
-
-
-```{r}
-demography_data_long <-
- demography_data_long |>
- mutate(
- age = as_factor(age)
- )
-```
-
-
-### Check and test
-
-Having gathered and cleaned the data, we would like to run a few checks. Given the format of the data, we can check that "total" is the sum of "male" and "female", which are the only two gender categories available.
-
-
-```{r}
-#| echo: true
-demography_data_long |>
- mutate(
- check_sum = male + female,
- totals_match = if_else(total == check_sum, 1, 0)
- ) |>
- filter(totals_match == 0)
-```
-
-
-Finally, we want to check that the single-age counts sum to the age-groups.
-
-
-```{r}
-#| echo: true
-demography_data_long |>
- mutate(age_groups = if_else(age_type == "age-group",
- age,
- NA_character_)) |>
- fill(age_groups, .direction = "up") |>
- mutate(
- group_sum = sum(total),
- group_sum = group_sum / 2,
- difference = total - group_sum,
- .by = c(age_groups)
- ) |>
- filter(age_type == "age-group" & age_groups != "Total") |>
- head()
-```
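-
-The division by two in the check above is needed because, after `fill()`, each group contains both its single-year rows and the age-group row itself, so the sum counts everyone twice. A toy sketch with made-up counts, in which each age-group row follows its single-year rows, shows the logic.
-
-```{r}
-library(dplyr)
-library(tibble)
-library(tidyr)
-
-# Hypothetical counts: two single years followed by their age-group row
-toy_counts <-
-  tibble(
-    age = c("0", "1", "0-1", "2", "3", "2-3"),
-    total = c(5L, 7L, 12L, 4L, 6L, 10L)
-  ) |>
-  mutate(
-    age_type = if_else(grepl("-", age, fixed = TRUE),
-                       "age-group",
-                       "single-year")
-  )
-
-toy_check <-
-  toy_counts |>
-  mutate(age_groups = if_else(age_type == "age-group", age, NA_character_)) |>
-  # Label each single-year row with the age-group row that follows it
-  fill(age_groups, .direction = "up") |>
-  mutate(
-    # Single years plus the group row count everyone twice
-    group_sum = sum(total) / 2,
-    difference = total - group_sum,
-    .by = age_groups
-  ) |>
-  filter(age_type == "age-group")
-
-toy_check
-```
-
-A difference of zero in every row means the single-year counts agree with the published age-group counts.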
-
-
-### Tidy-up
-
-Now that we are reasonably confident that everything is looking good, we can convert it to tidy format. This will make it easier to work with.
-
-
-```{r}
-demography_data_tidy <-
- demography_data_long |>
- rename_with(~paste0(., "_total"), male:total) |>
- pivot_longer(cols = contains("_total"),
- names_to = "type",
- values_to = "number") |>
- separate(
- col = type,
- into = c("gender", "part_of_area"),
- sep = "_"
- ) |>
- select(age, age_type, gender, number)
-```
-
-```{r}
-#| include: false
-#| eval: false
-
-arrow::write_parquet(
- x = demography_data_tidy,
- sink = "outputs/data/cleaned_nairobi_2019_census.parquet")
-```
-
-
-The original purpose of cleaning this dataset was to make a table that is used by @alexander2021bayesian. We will return to this dataset, but just to bring this all together, we may like to make a graph of single-year counts, by gender, for Nairobi (@fig-monicasnairobigraph).
-
-
-```{r}
-#| label: fig-monicasnairobigraph
-#| echo: true
-#| fig-cap: "Distribution of age and gender in Nairobi in 2019, based on Kenyan census"
-#| fig-height: 6
-
-demography_data_tidy |>
- filter(age_type == "single-year") |>
- select(age, gender, number) |>
- filter(gender != "total") |>
- ggplot(aes(x = age, y = number, fill = gender)) +
- geom_col(aes(x = age, y = number, fill = gender),
- position = "dodge") +
- scale_y_continuous(labels = comma) +
- scale_x_discrete(breaks = c(seq(from = 0, to = 99, by = 5), "100+")) +
- theme_classic() +
- scale_fill_brewer(palette = "Set1") +
- labs(
- y = "Number",
- x = "Age",
- fill = "Gender",
- caption = "Data source: 2019 Kenya Census"
- ) +
- theme(legend.position = "bottom") +
- coord_flip()
-```
-
-
-A variety of features are clear from @fig-monicasnairobigraph, including age-heaping, a slight difference in the ratio of male-female birth, and a substantial difference between ages 15 and 25.
-
-Finally, we may wish to use more informative names. For instance, in the Kenyan data example earlier we have column names that include "age", "gender", and "number". If we were to use our column names as contracts, then these could be: "fctr_group_age", "chr_group_gender", and "int_group_count".
-
-
-```{r}
-column_names_as_contracts <-
- demography_data_tidy |>
- filter(age_type == "single-year") |>
- select(age, gender, number) |>
- rename(
- "fctr_group_age" = "age",
- "chr_group_gender" = "gender",
- "int_group_count" = "number"
- )
-```
-
-
-::: {.content-visible when-format="pdf"}
-We can then use `pointblank` to set up tests for us (@fig-pointblankvalidation).
-
-
-```{r}
-#| echo: true
-#| eval: false
-#| message: false
-#| warning: false
-
-agent <-
- create_agent(tbl = column_names_as_contracts) |>
- col_is_character(columns = vars(chr_group_gender)) |>
- col_is_factor(columns = vars(fctr_group_age)) |>
- col_is_integer(columns = vars(int_group_count)) |>
- col_vals_in_set(
- columns = chr_group_gender,
- set = c("male", "female", "total")
- ) |>
- interrogate()
-
-agent
-```
-
-
-![Example of Pointblank Validation](figures/pointblank_screenshot.png){#fig-pointblankvalidation width=70% fig-align="center"}
-:::
-
-::: {.content-visible unless-format="pdf"}
-We can then use `pointblank` to set up tests for us.
-
-
-```{r}
-#| echo: true
-#| eval: true
-#| message: false
-#| warning: false
-
-agent <-
- create_agent(tbl = column_names_as_contracts) |>
- col_is_character(columns = vars(chr_group_gender)) |>
- col_is_factor(columns = vars(fctr_group_age)) |>
- col_is_integer(columns = vars(int_group_count)) |>
- col_vals_in_set(
- columns = chr_group_gender,
- set = c("male", "female", "total")
- ) |>
- interrogate()
-
-agent
-```
-
-:::
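-
-A lighter-weight version of the same idea, sketched here in base R without `pointblank`, is to check that the class of each column agrees with the prefix in its name. The toy data and the mapping from prefix to class are assumptions made for illustration.
-
-```{r}
-# Hypothetical data using the column names from above
-toy_contract_data <-
-  data.frame(
-    fctr_group_age = factor(c("0", "1")),
-    chr_group_gender = c("male", "female"),
-    int_group_count = c(10L, 12L),
-    stringsAsFactors = FALSE
-  )
-
-# Assumed mapping from name prefix to the class that it promises
-expected_classes <- c(fctr = "factor", chr = "character", int = "integer")
-
-# Take everything before the first underscore as the prefix
-prefixes <- sub("_.*$", "", names(toy_contract_data))
-
-actual_classes <-
-  vapply(toy_contract_data, function(column) class(column)[1], character(1))
-
-classes_match <- unname(expected_classes[prefixes]) == unname(actual_classes)
-
-all(classes_match)
-```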
-
-
-
-## Exercises
-
-### Scales {.unnumbered}
-
-1. *(Plan)* Consider the following scenario: *You manage a shop with two employees and are interested in modeling their efficiency. The shop opens at 9am and closes at 5pm. The efficiency of the employees is mildly correlated and defined by the number of customers that they serve each hour. Be clear about whether you assume a negative or positive correlation.* Please sketch what that dataset could look like and then sketch a graph that you could build to show all observations.
-2. *(Simulate)* Please further consider the scenario described and simulate the situation. Please include five tests based on the simulated data. Submit a link to a GitHub Gist that contains your code.
-3. *(Acquire)* Please describe a possible source of such a dataset.
-4. *(Explore)* Please use `ggplot2` to build the graph that you sketched using the simulated data from step 1. Submit a link to a GitHub Gist that contains your code.
-5. *(Communicate)* Please write two paragraphs about what you did.
-
-### Questions {.unnumbered}
-
-1. If we had a character variable "some_words" with one observation `"You know what"` within a dataset called `sayings`, then which of the following would split it into its constituent words (pick one)?
- a. `separate(data = sayings, col = some_words, into = c("one", "two", "three"), sep = " ")`
- b. `split(data = sayings, col = some_words, into = c("one", "two", "three"), sep = " ")`
- c. `divide(data = sayings, col = some_words, into = c("one", "two", "three"), sep = " ")`
- d. `part(data = sayings, col = some_words, into = c("one", "two", "three"), sep = " ")`
- e. `unattach(data = sayings, col = some_words, into = c("one", "two", "three"), sep = " ")`
-2. Is the following an example of tidy data?
-
-```{r}
-#| echo: true
-#| eval: false
-tibble(
- name = c("Ian", "Patricia", "Ville", "Karen"),
- age_group = c("18-29", "30-44", "45-60", "60+"),
-)
-```
-
- a. Yes
- b. No
-3. Which function would change "lemons" into "lemonade"?
- a. `str_replace(string = "lemons", pattern = "lemons", replacement = "lemonade")`
- b. `chr_replace(string = "lemons", pattern = "lemons", replacement = "lemonade")`
- c. `str_change(string = "lemons", pattern = "lemons", replacement = "lemonade")`
- d. `chr_change(string = "lemons", pattern = "lemons", replacement = "lemonade")`
-4. When dealing with ages, what are some desirable classes for the variable (select all that apply)?
- a. integer
- b. matrix
- c. numeric
-5. Please consider the following cities in Germany: "Berlin", "Hamburg", "Munich", "Cologne", "Frankfurt", and "Rostock". Use `testthat` to define three tests that could apply if we had a dataset with a variable "german_cities" that claimed to contain these, and only these, cities. Submit a link to a GitHub Gist.
-6. Which is the most acceptable format for dates in data science?
- a. YYYY-DD-MM
- b. YYYY-MM-DD
- c. DD-MM-YYYY
- d. MM-MM-YYYY
-7. Which of the following likely does not belong? `c(15.9, 14.9, 16.6, 15.8, 16.7, 17.9, I2.6, 11.5, 16.2, 19.5, 15.0)`
-8. With regard to "AV Rule 48" from @jsfcodingstandards [p. 25] which of the following are not allowed to differ identifiers (select all that apply)?
- a. Only a mixture of case
- b. The presence/absence of the underscore character
- c. The interchange of the letter "O" with the number "0" or the letter "D"
- d. The interchange of the letter "I" with the number "1" or the letter "l"
-9. With regard to @Preece1981 please discuss two ways in which final digits can be informative. Write at least a paragraph about each and include examples.
-
-### Tutorial {.unnumbered}
-
-With regard to @Jordan2019Artificial, @datafeminism2020 [Chapter 6], @thatrandyauperson, and other relevant work, to what extent do you think we should let the data speak for themselves? Please write at least two pages.
-
-Use Quarto, and include an appropriate title, author, date, link to a GitHub repo, and citations to produce a draft. After this, please pair with another student and exchange your written work. Update it based on their feedback, and be sure to acknowledge them by name in your paper. Submit a PDF.
-
-
diff --git a/docs/00-errata.html b/docs/00-errata.html
index ac1bed26..766ed7eb 100644
--- a/docs/00-errata.html
+++ b/docs/00-errata.html
@@ -238,19 +238,19 @@
diff --git a/docs/01-introduction.html b/docs/01-introduction.html
index 7bce206b..4ad5c7e6 100644
--- a/docs/01-introduction.html
+++ b/docs/01-introduction.html
@@ -233,19 +233,19 @@
diff --git a/docs/02-drinking_from_a_fire_hose.html b/docs/02-drinking_from_a_fire_hose.html
index 8ac44ea9..499cd105 100644
--- a/docs/02-drinking_from_a_fire_hose.html
+++ b/docs/02-drinking_from_a_fire_hose.html
@@ -267,19 +267,19 @@
@@ -834,18 +834,18 @@ simulated_data
# A tibble: 151 × 2
- Division Party
- <int> <chr>
- 1 1 Labor
- 2 2 Green
- 3 3 Liberal
- 4 4 Other
- 5 5 Other
- 6 6 Liberal
- 7 7 Other
- 8 8 Green
- 9 9 Green
-10 10 Green
+ Division Party
+ <int> <chr>
+ 1 1 Labor
+ 2 2 Liberal
+ 3 3 Green
+ 4 4 National
+ 5 5 Other
+ 6 6 Labor
+ 7 7 Other
+ 8 8 Liberal
+ 9 9 Green
+10 10 National
# ℹ 141 more rows
diff --git a/docs/03-workflow.html b/docs/03-workflow.html
index dda8fe79..8682e779 100644
--- a/docs/03-workflow.html
+++ b/docs/03-workflow.html
@@ -267,19 +267,19 @@
@@ -1531,7 +1531,7 @@
-
First bit of code: 0.001 sec elapsed
+
First bit of code: 0 sec elapsed
tic ("Second bit of code" )
Sys.sleep (3 )
@@ -1541,7 +1541,7 @@
-
Second bit of code: 3.007 sec elapsed
+
Second bit of code: 3.008 sec elapsed
And so we know that there is something slowing down the code. (In this artificial case it is Sys.sleep()
causing a delay of three seconds.)
diff --git a/docs/04-writing_research.html b/docs/04-writing_research.html
index 22381c48..6e2d24d2 100644
--- a/docs/04-writing_research.html
+++ b/docs/04-writing_research.html
@@ -267,19 +267,19 @@
diff --git a/docs/05-static_communication.html b/docs/05-static_communication.html
index 910cc9b8..d066d8ab 100644
--- a/docs/05-static_communication.html
+++ b/docs/05-static_communication.html
@@ -267,19 +267,19 @@
@@ -4857,8 +4857,8 @@ Paper
-
diff --git a/docs/06-farm.html b/docs/06-farm.html
index bc5f205a..f44f64de 100644
--- a/docs/06-farm.html
+++ b/docs/06-farm.html
@@ -7,7 +7,7 @@
-Telling Stories with Data - 6 Farm data
+Telling Stories with Data - 6 Measurement, censuses, and sampling
@@ -1211,23 +1211,23 @@
-
-
@@ -1860,23 +1860,23 @@
-
-
@@ -2512,23 +2512,23 @@
-
-
@@ -3159,23 +3159,23 @@
-
-
@@ -3816,23 +3816,23 @@
-
-
@@ -4465,23 +4465,23 @@
-
-
@@ -5110,23 +5110,23 @@
-
-
diff --git a/docs/24-interaction.html b/docs/24-interaction.html
index 6c2b0546..f3d88bcc 100644
--- a/docs/24-interaction.html
+++ b/docs/24-interaction.html
@@ -261,19 +261,19 @@
@@ -714,8 +714,8 @@
Figure F.2: Interactive map of US bases
@@ -767,8 +767,8 @@
Figure F.3: Interactive map of US bases with colored circles to indicate spend
@@ -801,8 +801,8 @@
Figure F.4: Interactive map of US bases using Mapdeck
diff --git a/docs/25-datasheet.html b/docs/25-datasheet.html
index 0ae2acf9..7bf59d87 100644
--- a/docs/25-datasheet.html
+++ b/docs/25-datasheet.html
@@ -204,19 +204,19 @@
diff --git a/docs/26-sql.html b/docs/26-sql.html
index b527620a..7df97ab5 100644
--- a/docs/26-sql.html
+++ b/docs/26-sql.html
@@ -238,19 +238,19 @@
diff --git a/docs/28-deploy.html b/docs/28-deploy.html
index 128a8219..0941e71b 100644
--- a/docs/28-deploy.html
+++ b/docs/28-deploy.html
@@ -238,19 +238,19 @@
diff --git a/docs/29-activities.html b/docs/29-activities.html
index 7fad5e2a..4d4a1ec2 100644
--- a/docs/29-activities.html
+++ b/docs/29-activities.html
@@ -238,19 +238,19 @@
diff --git a/docs/98-cocktails.html b/docs/98-cocktails.html
index 75c6ebfd..4e376995 100644
--- a/docs/98-cocktails.html
+++ b/docs/98-cocktails.html
@@ -213,19 +213,19 @@
diff --git a/docs/99-references.html b/docs/99-references.html
index fea2bc6d..8e6100c8 100644
--- a/docs/99-references.html
+++ b/docs/99-references.html
@@ -202,19 +202,19 @@
diff --git a/docs/index.html b/docs/index.html
index 2fa52bfe..7cdfe329 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -234,19 +234,19 @@
@@ -554,7 +554,7 @@ Structure
This book is structured around six parts: I) Foundations, II) Communication, III) Acquisition, IV) Preparation, V) Modeling, and VI) Applications.
Part I—Foundations—begins with 1 Telling stories with data which provides an overview of what I am trying to achieve with this book and why you should read it. 2 Drinking from a fire hose goes through three worked examples. The intention of these is that you can experience the full workflow recommended in this book without worrying too much about the specifics of what is happening. That workflow is: plan, simulate, acquire, model, and communicate. It is normal to not initially follow everything in this chapter, but you should go through it, typing out and executing the code yourself. If you only have time to read one chapter of this book, then I recommend that one. 3 Reproducible workflows introduces some key tools for reproducibility used in the workflow that I advocate. These are aspects like Quarto, R Projects, Git and GitHub, and using R in practice.
Part II—Communication—considers written and static communication. 4 Writing research details the features that quantitative writing should have and how to write a crisp, quantitative research paper. Static communication in 5 Static communication introduces features like graphs, tables, and maps.
-Part III—Acquisition—focuses on turning our world into data. 6 Farm data begins with measurement, and then steps through essential concepts from sampling that govern our approach to data. It then considers datasets that are explicitly provided for us to use as data, for instance censuses and other government statistics. These are typically clean, well-documented, pre-packaged datasets. 7 Gather data covers aspects like using Application Programming Interfaces (APIs), scraping data, getting data from PDFs, and Optical Character Recognition (OCR). The idea is that data are available, but not necessarily designed to be datasets, and that we must go and get them. Finally, 8 Hunt data covers aspects where more is expected of us. For instance, we may need to conduct an experiment, run an A/B test, or do some surveys.
+Part III—Acquisition—focuses on turning our world into data. 6 Measurement, censuses, and sampling begins with measurement, and then steps through essential concepts from sampling that govern our approach to data. It then considers datasets that are explicitly provided for us to use as data, for instance censuses and other government statistics. These are typically clean, well-documented, pre-packaged datasets. 7 APIs, scraping, and parsing covers aspects like using Application Programming Interfaces (APIs), scraping data, getting data from PDFs, and Optical Character Recognition (OCR). The idea is that data are available, but not necessarily designed to be datasets, and that we must go and get them. Finally, 8 Experiments and surveys covers aspects where more is expected of us. For instance, we may need to conduct an experiment, run an A/B test, or do some surveys.
Part IV—Preparation—covers how to respectfully transform the original, unedited data into something that can be explored and shared. 9 Clean and prepare begins by detailing some principles to follow when approaching the task of cleaning and preparing data, and then goes through specific steps to take and checks to implement. 10 Store and share focuses on methods of storing and retrieving those datasets, including the use of R data packages and parquet. It then continues onto considerations and steps to take when wanting to disseminate datasets as broadly as possible, while at the same time respecting those whose data they are based on.
Part V—Modeling—begins with exploratory data analysis in 11 Exploratory data analysis . This is the critical process of coming to understand a dataset, but not something that typically finds itself into the final product. The process is an end in itself. In 12 Linear models the use of linear models to explore data is introduced. And 13 Generalized linear models considers generalized linear models, including logistic, Poisson, and negative binomial regression. It also introduces multilevel modeling.
Part VI—Applications—provides three applications of modeling. 15 Causality from observational data focuses on making causal claims from observational data and covers approaches such as difference-in-differences, regression discontinuity, and instrumental variables. 16 Multilevel regression with post-stratification introduces multilevel regression with post-stratification, which is where we use a statistical model to adjust a sample for known biases. 17 Text as data is focused on text-as-data.
diff --git a/docs/search.json b/docs/search.json
index 3be06b07..c4ac61f5 100644
--- a/docs/search.json
+++ b/docs/search.json
@@ -24,7 +24,7 @@
"href": "index.html#structure-and-content",
"title": "Telling Stories with Data",
"section": "Structure and content",
- "text": "Structure and content\nThis book is structured around six parts: I) Foundations, II) Communication, III) Acquisition, IV) Preparation, V) Modeling, and VI) Applications.\nPart I—Foundations—begins with 1 Telling stories with data which provides an overview of what I am trying to achieve with this book and why you should read it. 2 Drinking from a fire hose goes through three worked examples. The intention of these is that you can experience the full workflow recommended in this book without worrying too much about the specifics of what is happening. That workflow is: plan, simulate, acquire, model, and communicate. It is normal to not initially follow everything in this chapter, but you should go through it, typing out and executing the code yourself. If you only have time to read one chapter of this book, then I recommend that one. 3 Reproducible workflows introduces some key tools for reproducibility used in the workflow that I advocate. These are aspects like Quarto, R Projects, Git and GitHub, and using R in practice.\nPart II—Communication—considers written and static communication. 4 Writing research details the features that quantitative writing should have and how to write a crisp, quantitative research paper. Static communication in 5 Static communication introduces features like graphs, tables, and maps.\nPart III—Acquisition—focuses on turning our world into data. 6 Farm data begins with measurement, and then steps through essential concepts from sampling that govern our approach to data. It then considers datasets that are explicitly provided for us to use as data, for instance censuses and other government statistics. These are typically clean, well-documented, pre-packaged datasets. 7 Gather data covers aspects like using Application Programming Interfaces (APIs), scraping data, getting data from PDFs, and Optical Character Recognition (OCR). 
The idea is that data are available, but not necessarily designed to be datasets, and that we must go and get them. Finally, 8 Hunt data covers aspects where more is expected of us. For instance, we may need to conduct an experiment, run an A/B test, or do some surveys.\nPart IV—Preparation—covers how to respectfully transform the original, unedited data into something that can be explored and shared. 9 Clean and prepare begins by detailing some principles to follow when approaching the task of cleaning and preparing data, and then goes through specific steps to take and checks to implement. 10 Store and share focuses on methods of storing and retrieving those datasets, including the use of R data packages and parquet. It then continues onto considerations and steps to take when wanting to disseminate datasets as broadly as possible, while at the same time respecting those whose data they are based on.\nPart V—Modeling—begins with exploratory data analysis in 11 Exploratory data analysis. This is the critical process of coming to understand a dataset, but not something that typically finds itself into the final product. The process is an end in itself. In 12 Linear models the use of linear models to explore data is introduced. And 13 Generalized linear models considers generalized linear models, including logistic, Poisson, and negative binomial regression. It also introduces multilevel modeling.\nPart VI—Applications—provides three applications of modeling. 15 Causality from observational data focuses on making causal claims from observational data and covers approaches such as difference-in-differences, regression discontinuity, and instrumental variables. 16 Multilevel regression with post-stratification introduces multilevel regression with post-stratification, which is where we use a statistical model to adjust a sample for known biases. 
17 Text as data is focused on text-as-data.\n18 Concluding remarks offers some concluding remarks, details some outstanding issues, and suggests some next steps.\nOnline appendices offer critical aspects that are either a little too unwieldy for the size constraints of the page, or likely to need more frequent updating than is reasonable for a printed book. Online Appendix A — R essentials goes through some essential tasks in R, which is the statistical programming language used in this book. It can be a reference chapter and some students find themselves returning to it as they go through the rest of the book. Online Appendix C — Datasets provides a list of datasets that may be useful for assessment. The core of this book is centered around Quarto, however its predecessor, R Markdown, has not yet been sunsetted and there is a lot of material available for it. As such, Online Appendix D — R Markdown contains R Markdown equivalents of the Quarto-specific aspects in 3 Reproducible workflows. A set of papers is included in Online Appendix E — Papers. If you write these, you will be conducting original research on a topic that is of interest to you. Although open-ended research may be new to you, the extent to which you are able to: develop your own questions, use quantitative methods to explore them, and communicate your findings, is the measure of the success of this book. Online Appendix F — Interaction covers aspects such as websites, web applications, and maps that can be interacted with. Online Appendix G — Datasheets provides an example of a datasheet. Online Appendix H — SQL essentials gives a brief overview of SQL essentials. 14 Prediction provides a discussion of prediction-focused modeling. Online Appendix I — Production considers how to make model estimates and forecasts more widely available. Finally, Online Appendix J — Class activities provides ideas for using this book in class.",
+ "text": "Structure and content\nThis book is structured around six parts: I) Foundations, II) Communication, III) Acquisition, IV) Preparation, V) Modeling, and VI) Applications.\nPart I—Foundations—begins with 1 Telling stories with data which provides an overview of what I am trying to achieve with this book and why you should read it. 2 Drinking from a fire hose goes through three worked examples. The intention of these is that you can experience the full workflow recommended in this book without worrying too much about the specifics of what is happening. That workflow is: plan, simulate, acquire, model, and communicate. It is normal to not initially follow everything in this chapter, but you should go through it, typing out and executing the code yourself. If you only have time to read one chapter of this book, then I recommend that one. 3 Reproducible workflows introduces some key tools for reproducibility used in the workflow that I advocate. These are aspects like Quarto, R Projects, Git and GitHub, and using R in practice.\nPart II—Communication—considers written and static communication. 4 Writing research details the features that quantitative writing should have and how to write a crisp, quantitative research paper. Static communication in 5 Static communication introduces features like graphs, tables, and maps.\nPart III—Acquisition—focuses on turning our world into data. 6 Measurement, censuses, and sampling begins with measurement, and then steps through essential concepts from sampling that govern our approach to data. It then considers datasets that are explicitly provided for us to use as data, for instance censuses and other government statistics. These are typically clean, well-documented, pre-packaged datasets. 7 APIs, scraping, and parsing covers aspects like using Application Programming Interfaces (APIs), scraping data, getting data from PDFs, and Optical Character Recognition (OCR). 
The idea is that data are available, but not necessarily designed to be datasets, and that we must go and get them. Finally, 8 Experiments and surveys covers aspects where more is expected of us. For instance, we may need to conduct an experiment, run an A/B test, or do some surveys.\nPart IV—Preparation—covers how to respectfully transform the original, unedited data into something that can be explored and shared. 9 Clean and prepare begins by detailing some principles to follow when approaching the task of cleaning and preparing data, and then goes through specific steps to take and checks to implement. 10 Store and share focuses on methods of storing and retrieving those datasets, including the use of R data packages and parquet. It then continues onto considerations and steps to take when wanting to disseminate datasets as broadly as possible, while at the same time respecting those whose data they are based on.\nPart V—Modeling—begins with exploratory data analysis in 11 Exploratory data analysis. This is the critical process of coming to understand a dataset, but not something that typically finds itself into the final product. The process is an end in itself. In 12 Linear models the use of linear models to explore data is introduced. And 13 Generalized linear models considers generalized linear models, including logistic, Poisson, and negative binomial regression. It also introduces multilevel modeling.\nPart VI—Applications—provides three applications of modeling. 15 Causality from observational data focuses on making causal claims from observational data and covers approaches such as difference-in-differences, regression discontinuity, and instrumental variables. 16 Multilevel regression with post-stratification introduces multilevel regression with post-stratification, which is where we use a statistical model to adjust a sample for known biases. 
17 Text as data is focused on text-as-data.\n18 Concluding remarks offers some concluding remarks, details some outstanding issues, and suggests some next steps.\nOnline appendices offer critical aspects that are either a little too unwieldy for the size constraints of the page, or likely to need more frequent updating than is reasonable for a printed book. Online Appendix A — R essentials goes through some essential tasks in R, which is the statistical programming language used in this book. It can be a reference chapter and some students find themselves returning to it as they go through the rest of the book. Online Appendix C — Datasets provides a list of datasets that may be useful for assessment. The core of this book is centered around Quarto, however its predecessor, R Markdown, has not yet been sunsetted and there is a lot of material available for it. As such, Online Appendix D — R Markdown contains R Markdown equivalents of the Quarto-specific aspects in 3 Reproducible workflows. A set of papers is included in Online Appendix E — Papers. If you write these, you will be conducting original research on a topic that is of interest to you. Although open-ended research may be new to you, the extent to which you are able to: develop your own questions, use quantitative methods to explore them, and communicate your findings, is the measure of the success of this book. Online Appendix F — Interaction covers aspects such as websites, web applications, and maps that can be interacted with. Online Appendix G — Datasheets provides an example of a datasheet. Online Appendix H — SQL essentials gives a brief overview of SQL essentials. 14 Prediction provides a discussion of prediction-focused modeling. Online Appendix I — Production considers how to make model estimates and forecasts more widely available. Finally, Online Appendix J — Class activities provides ideas for using this book in class.",
"crumbs": [
"Preface"
]
@@ -203,7 +203,7 @@
"href": "02-drinking_from_a_fire_hose.html#australian-elections",
"title": "2 Drinking from a fire hose",
"section": "2.2 Australian elections",
- "text": "2.2 Australian elections\nAustralia is a parliamentary democracy with 151 seats in the House of Representatives, which is the lower house and that from which government is formed. There are two major parties—“Liberal” and “Labor”—two minor parties—“Nationals” and “Greens”—and many smaller parties and independents. In this example we will create a graph of the number of seats that each party won in the 2022 Federal Election.\n\n2.2.1 Plan\nFor this example, we need to plan two aspects. The first is what the dataset that we need will look like, and the second is what the final graph will look like.\nThe basic requirement for the dataset is that it has the name of the seat (sometimes called a “division” in Australia) and the party of the person elected. A quick sketch of the dataset that we would need is Figure 2.2 (a).\n\n\n\n\n\n\n\n\n\n\n\n(a) Quick sketch of a dataset that could be useful for analyzing Australian elections\n\n\n\n\n\n\n\n\n\n\n\n(b) Quick sketch of a possible graph of the number of seats won by each party\n\n\n\n\n\n\n\nFigure 2.2: Sketches of a potential dataset and graph related to an Australian election\n\n\n\nWe also need to plan the graph that we are interested in. Given we want to display the number of seats that each party won, a quick sketch of what we might aim for is Figure 2.2 (b).\n\n\n2.2.2 Simulate\nWe now simulate some data, to bring some specificity to our sketches.\nTo get started, within Posit Cloud, make a new Quarto document: “File” \\(\\rightarrow\\) “New File” \\(\\rightarrow\\) “Quarto document\\(\\dots\\)”. Give it a title, such as “Exploring the 2022 Australian Election”, add your name as author, and unclick “Use visual markdown editor” (Figure 2.3 (a)). 
Leave the other options as their default, and then click “Create”.\n\n\n\n\n\n\n\n\n\n\n\n(a) Creating a new Quarto document\n\n\n\n\n\n\n\n\n\n\n\n(b) Installing rmarkdown if necessary\n\n\n\n\n\n\n\n\n\n\n\n\n\n(c) After initial setup and with a preamble\n\n\n\n\n\n\n\n\n\n\n\n(d) Highlighting the green arrow to run the chunk\n\n\n\n\n\n\n\n\n\n\n\n\n\n(e) Highlighting the cross to remove the messages\n\n\n\n\n\n\n\n\n\n\n\n(f) Highlighting the render button\n\n\n\n\n\n\n\nFigure 2.3: Getting started with a Quarto document\n\n\n\nYou may get a notification along the lines of “Package rmarkdown required\\(\\dots\\).” (Figure 2.3 (b)). If that happens, click “Install”. For this example, we will put everything into this one Quarto document. You should save it as “australian_elections.qmd”: “File” \\(\\rightarrow\\) “Save As\\(\\dots\\)”.\nRemove almost all the default content, and then beneath the heading material create a new R code chunk: “Code” \\(\\rightarrow\\) “Insert Chunk”. Then add preamble documentation that explains:\n\nthe purpose of the document;\nthe author and contact details;\nwhen the file was written or last updated; and\nprerequisites that the file relies on.\n\n\n#### Preamble ####\n# Purpose: Read in data from the 2022 Australian Election and make\n# a graph of the number of seats each party won.\n# Author: Rohan Alexander\n# Email: rohan.alexander@utoronto.ca\n# Date: 1 January 2023\n# Prerequisites: Know where to get Australian elections data.\n\nIn R, lines that start with “#” are comments. This means that they are not run as code by R, but are instead designed to be read by humans. Each line of this preamble should start with a “#”. Also make it clear that this is the preamble section by surrounding that with “####”. The result should look like Figure 2.3 (c).\nAfter this we need to setup the workspace. This involves installing and loading any packages that will be needed. 
A package only needs to be installed once for each computer, but needs to be loaded each time it is to be used. In this case we are going to use the tidyverse and janitor packages. They will need to be installed because this is the first time they are being used, and then each will need to be loaded.\n\n\n\n\n\n\nShoulders of giants\n\n\n\nHadley Wickham is Chief Scientist at RStudio. After earning a PhD in Statistics from Iowa State University in 2008 he was appointed as an assistant professor at Rice University, and became Chief Scientist at RStudio, now Posit, in 2013. He developed the tidyverse collection of packages, and has published many books including R for Data Science (Wickham, Çetinkaya-Rundel, and Grolemund [2016] 2023) and Advanced R (Wickham 2019). He was awarded the COPSS Presidents’ Award in 2019.\n\n\nAn example of installing the packages follows. Run this code by clicking the small green arrow associated with the R code chunk (Figure 2.3 (d)).\n\n#### Workspace setup ####\ninstall.packages(\"tidyverse\")\ninstall.packages(\"janitor\")\n\nNow that the packages are installed, they need to be loaded. As that package installation step only needs to be done once per computer, that code can be commented out so that it is not accidentally run, or even just removed. Additionally, we can remove the message that printed when we installed the packages (Figure 2.3 (e)).\n\n#### Workspace setup ####\n# install.packages(\"tidyverse\")\n# install.packages(\"janitor\")\n\nlibrary(tidyverse)\nlibrary(janitor)\n\nWe can render the entire document by clicking “Render” (Figure 2.3 (f)). When you do this, you may be asked to install some packages. If that happens, then you should agree to this. This will result in a HTML document.\nFor an introduction to the packages that were just installed, each package contains a help file that provides information about them and their functions. 
It can be accessed by prepending a question mark to the package name and then running that code in the console. For instance ?tidyverse.\nTo simulate our data, we need to create a dataset with two variables: “Division” and “Party”, and some values for each. In the case of “Division” reasonable values would be a name of one of the 151 Australian divisions. In the case of “Party” reasonable values would be one of the following five: “Liberal”, “Labor”, “National”, “Green”, or “Other”. Again, this code can be run by clicking the small green arrow associated with the R code chunk.\n\nsimulated_data <-\n tibble(\n # Use 1 through to 151 to represent each division\n \"Division\" = 1:151,\n # Randomly pick an option, with replacement, 151 times\n \"Party\" = sample(\n x = c(\"Liberal\", \"Labor\", \"National\", \"Green\", \"Other\"),\n size = 151,\n replace = TRUE\n )\n )\n\nsimulated_data\n\n# A tibble: 151 × 2\n Division Party \n <int> <chr> \n 1 1 Labor \n 2 2 Green \n 3 3 Liberal\n 4 4 Other \n 5 5 Other \n 6 6 Liberal\n 7 7 Other \n 8 8 Green \n 9 9 Green \n10 10 Green \n# ℹ 141 more rows\n\n\nAt a certain point, your code will not run and you will want to ask others for help. Do not take a screenshot of a small snippet of the code and expect that someone will be able to help based on that. They, almost surely, cannot. Instead, you need to provide them with your whole script in a way that they can run. We will explain what GitHub is more completely in Chapter 3, but for now, if you need help, then you should naively create a GitHub Gist which will enable you to share your code in a way that is more helpful than taking a screenshot. The first step is to create a free account on GitHub (Figure 2.4 (a)). Thinking about an appropriate username is important because this will become part of your professional profile. It would make sense to have a username that is professional, independent of any course, and ideally related to your real name. 
Then look for a “+” in the top right, and select “New gist” (Figure 2.4 (b)).\n\n\n\n\n\n\n\n\n\n\n\n(a) GitHub sign-up screen\n\n\n\n\n\n\n\n\n\n\n\n(b) New GitHub Gist\n\n\n\n\n\n\n\n\n\n\n\n\n\n(c) Create a public GitHub Gist to share code\n\n\n\n\n\n\n\nFigure 2.4: Creating a Gist to share code when asking for help\n\n\n\nFrom here you should add all the code to that Gist, not just the final bit that is giving an error. And give it a meaningful filename that includes “.R” at the end, for instance, “australian_elections.R”. In Figure 2.4 (c) it will turn out that we have incorrect capitalization, library(Tidyverse) instead of library(tidyverse).\nClick “Create public gist”. We can then share the URL to this Gist with whoever we are asking to help, explain what the problem is, and what we are trying to achieve. It will be easier for them to help, because all the code is available.\n\n\n2.2.3 Acquire\nNow we want to get the actual data. The data we need is from the Australian Electoral Commission (AEC), which is the non-partisan agency that organizes Australian federal elections. We can pass a page of their website to read_csv() from readr. We do not need to explicitly load readr because it is part of the tidyverse. The <- or “assignment operator” allocates the output of read_csv() to an object called “raw_elections_data”.\n\n#### Read in the data ####\nraw_elections_data <-\n read_csv(\n file = \n \"https://results.aec.gov.au/27966/website/Downloads/HouseMembersElectedDownload-27966.csv\",\n show_col_types = FALSE,\n skip = 1\n )\n\n# We have read the data from the AEC website. 
We may like to save\n# it in case something happens or they move it.\nwrite_csv(\n x = raw_elections_data,\n file = \"australian_voting.csv\"\n)\n\nWe can take a quick look at the dataset using head() which will show the first six rows, and tail() which will show the last six rows.\n\nhead(raw_elections_data)\n\n# A tibble: 6 × 8\n DivisionID DivisionNm StateAb CandidateID GivenNm Surname PartyNm PartyAb\n <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> \n1 179 Adelaide SA 36973 Steve GEORGANAS Austral… ALP \n2 197 Aston VIC 36704 Alan TUDGE Liberal LP \n3 198 Ballarat VIC 36409 Catherine KING Austral… ALP \n4 103 Banks NSW 37018 David COLEMAN Liberal LP \n5 180 Barker SA 37083 Tony PASIN Liberal LP \n6 104 Barton NSW 36820 Linda BURNEY Austral… ALP \n\ntail(raw_elections_data)\n\n# A tibble: 6 × 8\n DivisionID DivisionNm StateAb CandidateID GivenNm Surname PartyNm PartyAb\n <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> \n1 152 Wentworth NSW 37451 Allegra SPENDER Indepen… IND \n2 153 Werriwa NSW 36810 Anne Maree STANLEY Austral… ALP \n3 150 Whitlam NSW 36811 Stephen JONES Austral… ALP \n4 178 Wide Bay QLD 37506 Llew O'BRIEN Liberal… LNP \n5 234 Wills VIC 36452 Peter KHALIL Austral… ALP \n6 316 Wright QLD 37500 Scott BUCHHOLZ Liberal… LNP \n\n\nWe need to clean the data so that we can use it. We are trying to make it similar to the dataset that we thought we wanted in the planning stage. While it is fine to move away from the plan, this needs to be a deliberate, reasoned decision. After reading in the dataset that we saved, the first thing that we will do is adjust the names of the variables. 
We will do this using clean_names() from janitor.\n\n#### Basic cleaning ####\nraw_elections_data <-\n read_csv(\n file = \"australian_voting.csv\",\n show_col_types = FALSE\n )\n\n\n# Make the names easier to type\ncleaned_elections_data <-\n clean_names(raw_elections_data)\n\n# Have a look at the first six rows\nhead(cleaned_elections_data)\n\n# A tibble: 6 × 8\n division_id division_nm state_ab candidate_id given_nm surname party_nm \n <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> \n1 179 Adelaide SA 36973 Steve GEORGANAS Australian …\n2 197 Aston VIC 36704 Alan TUDGE Liberal \n3 198 Ballarat VIC 36409 Catherine KING Australian …\n4 103 Banks NSW 37018 David COLEMAN Liberal \n5 180 Barker SA 37083 Tony PASIN Liberal \n6 104 Barton NSW 36820 Linda BURNEY Australian …\n# ℹ 1 more variable: party_ab <chr>\n\n\nThe names are faster to type because RStudio will auto-complete them. To do this, we begin typing the name of a variable and then use the “tab” key to complete it.\nThere are many variables in the dataset, and we are primarily interested in two: “division_nm” and “party_nm”. We can choose certain variables of interest with select() from dplyr which we loaded as part of the tidyverse. The “pipe operator”, |>, pushes the output of one line to be the first input of the function on the next line.\n\ncleaned_elections_data <-\n cleaned_elections_data |>\n select(\n division_nm,\n party_nm\n )\n\nhead(cleaned_elections_data)\n\n# A tibble: 6 × 2\n division_nm party_nm \n <chr> <chr> \n1 Adelaide Australian Labor Party\n2 Aston Liberal \n3 Ballarat Australian Labor Party\n4 Banks Liberal \n5 Barker Liberal \n6 Barton Australian Labor Party\n\n\nSome of the variable names are still not obvious because they are abbreviated. We can look at the names of the columns in this dataset with names(). 
And we can change the names using rename() from dplyr.\n\nnames(cleaned_elections_data)\n\n[1] \"division_nm\" \"party_nm\" \n\n\n\ncleaned_elections_data <-\n cleaned_elections_data |>\n rename(\n division = division_nm,\n elected_party = party_nm\n )\n\nhead(cleaned_elections_data)\n\n# A tibble: 6 × 2\n division elected_party \n <chr> <chr> \n1 Adelaide Australian Labor Party\n2 Aston Liberal \n3 Ballarat Australian Labor Party\n4 Banks Liberal \n5 Barker Liberal \n6 Barton Australian Labor Party\n\n\nWe could now look at the unique values in the “elected_party” column using unique().\n\ncleaned_elections_data$elected_party |>\n unique()\n\n[1] \"Australian Labor Party\" \n[2] \"Liberal\" \n[3] \"Liberal National Party of Queensland\"\n[4] \"The Greens\" \n[5] \"The Nationals\" \n[6] \"Independent\" \n[7] \"Katter's Australian Party (KAP)\" \n[8] \"Centre Alliance\" \n\n\nAs there is more detail in this than we wanted, we may want to simplify the party names to match what we simulated, using case_match() from dplyr.\n\ncleaned_elections_data <-\n cleaned_elections_data |>\n mutate(\n elected_party =\n case_match(\n elected_party,\n \"Australian Labor Party\" ~ \"Labor\",\n \"Liberal National Party of Queensland\" ~ \"Liberal\",\n \"Liberal\" ~ \"Liberal\",\n \"The Nationals\" ~ \"Nationals\",\n \"The Greens\" ~ \"Greens\",\n \"Independent\" ~ \"Other\",\n \"Katter's Australian Party (KAP)\" ~ \"Other\",\n \"Centre Alliance\" ~ \"Other\"\n )\n )\n\nhead(cleaned_elections_data)\n\n# A tibble: 6 × 2\n division elected_party\n <chr> <chr> \n1 Adelaide Labor \n2 Aston Liberal \n3 Ballarat Labor \n4 Banks Liberal \n5 Barker Liberal \n6 Barton Labor \n\n\nOur data now matches our plan (Figure 2.2 (a)). For every electoral division we have the party of the person that won it.\nHaving now nicely cleaned the dataset, we should save it, so that we can start with that cleaned dataset in the next stage. 
We should make sure to save it under a new file name so we are not replacing the raw data, and so that it is easy to identify the cleaned dataset later.\n\nwrite_csv(\n x = cleaned_elections_data,\n file = \"cleaned_elections_data.csv\"\n)\n\n\n\n2.2.4 Explore\nWe may like to explore the dataset that we created. One way to better understand a dataset is to make a graph. In particular, here we would like to build the graph that we planned in Figure 2.2 (b).\nFirst, we read in the dataset that we just created.\n\n#### Read in the data ####\ncleaned_elections_data <-\n read_csv(\n file = \"cleaned_elections_data.csv\",\n show_col_types = FALSE\n )\n\nWe can get a quick count of how many seats each party won using count() from dplyr.\n\ncleaned_elections_data |>\n count(elected_party)\n\n# A tibble: 5 × 2\n elected_party n\n <chr> <int>\n1 Greens 4\n2 Labor 77\n3 Liberal 48\n4 Nationals 10\n5 Other 12\n\n\nTo build the graph that we are interested in, we use ggplot2 which is part of the tidyverse. The key aspect of this package is that we build graphs by adding layers using “+”, which we call the “add operator”. In particular we will create a bar chart using geom_bar() from ggplot2 (Figure 2.5 (a)).\ncleaned_elections_data |>\n ggplot(aes(x = elected_party)) + # aes abbreviates \"aesthetics\" \n geom_bar()\n\ncleaned_elections_data |>\n ggplot(aes(x = elected_party)) +\n geom_bar() +\n theme_minimal() + # Make the theme neater\n labs(x = \"Party\", y = \"Number of seats\") # Make labels more meaningful\n\n\n\n\n\n\n\n\n\n\n\n(a) Default options\n\n\n\n\n\n\n\n\n\n\n\n(b) Improved theme and labels\n\n\n\n\n\n\n\nFigure 2.5: Number of seats won, by political party, at the 2022 Australian Federal Election\n\n\n\nFigure 2.5 (a) accomplishes what we set out to do. But we can make it look a bit nicer by modifying the default options and improving the labels (Figure 2.5 (b)).\n\n\n2.2.5 Share\nTo this point we have downloaded some data, cleaned it, and made a graph. 
We would typically need to communicate what we have done at some length. In this case, we can write a few paragraphs about what we did, why we did it, and what we found to conclude our workflow. An example follows.\n\nAustralia is a parliamentary democracy with 151 seats in the House of Representatives, which is the house from which government is formed. There are two major parties—“Liberal” and “Labor”—two minor parties—“Nationals” and “Greens”—and many smaller parties. The 2022 Federal Election occurred on 21 May, and around 15 million votes were cast. We were interested in the number of seats that were won by each party.\nWe downloaded the results, on a seat-specific basis, from the Australian Electoral Commission website. We cleaned and tidied the dataset using the statistical programming language R (R Core Team 2023) including the tidyverse (Wickham et al. 2019) and janitor (Firke 2023). We then created a graph of the number of seats that each political party won (Figure 2.5).\nWe found that the Labor Party won 77 seats, followed by the Liberal Party with 48 seats. The minor parties won the following number of seats: the Nationals won 10 seats and the Greens won 4 seats. Finally, there were 10 Independents elected as well as candidates from smaller parties.\nThe distribution of seats is skewed toward the two major parties which could reflect relatively stable preferences on the part of Australian voters, or possibly inertia due to the benefits of already being a major party such a national network or funding. A better understanding of the reasons for this distribution are of interest in future work. While the dataset consists of everyone who voted, it worth noting that in Australia some are systematically excluded from voting, and it is much more difficult for some to vote than others.\n\nOne aspect to be especially concerned with is making sure that this communication is focused on the needs of the audience and telling a story. 
Data journalism provides some excellent examples of how analysis needs to be tailored to the audience, for instance, Cardoso (2020) and Bronner (2020).",
+ "text": "2.2 Australian elections\nAustralia is a parliamentary democracy with 151 seats in the House of Representatives, which is the lower house and that from which government is formed. There are two major parties—“Liberal” and “Labor”—two minor parties—“Nationals” and “Greens”—and many smaller parties and independents. In this example we will create a graph of the number of seats that each party won in the 2022 Federal Election.\n\n2.2.1 Plan\nFor this example, we need to plan two aspects. The first is what the dataset that we need will look like, and the second is what the final graph will look like.\nThe basic requirement for the dataset is that it has the name of the seat (sometimes called a “division” in Australia) and the party of the person elected. A quick sketch of the dataset that we would need is Figure 2.2 (a).\n\n\n\n\n\n\n\n\n\n\n\n(a) Quick sketch of a dataset that could be useful for analyzing Australian elections\n\n\n\n\n\n\n\n\n\n\n\n(b) Quick sketch of a possible graph of the number of seats won by each party\n\n\n\n\n\n\n\nFigure 2.2: Sketches of a potential dataset and graph related to an Australian election\n\n\n\nWe also need to plan the graph that we are interested in. Given we want to display the number of seats that each party won, a quick sketch of what we might aim for is Figure 2.2 (b).\n\n\n2.2.2 Simulate\nWe now simulate some data, to bring some specificity to our sketches.\nTo get started, within Posit Cloud, make a new Quarto document: “File” \\(\\rightarrow\\) “New File” \\(\\rightarrow\\) “Quarto document\\(\\dots\\)”. Give it a title, such as “Exploring the 2022 Australian Election”, add your name as author, and unclick “Use visual markdown editor” (Figure 2.3 (a)). 
Leave the other options as their default, and then click “Create”.\n\n\n\n\n\n\n\n\n\n\n\n(a) Creating a new Quarto document\n\n\n\n\n\n\n\n\n\n\n\n(b) Installing rmarkdown if necessary\n\n\n\n\n\n\n\n\n\n\n\n\n\n(c) After initial setup and with a preamble\n\n\n\n\n\n\n\n\n\n\n\n(d) Highlighting the green arrow to run the chunk\n\n\n\n\n\n\n\n\n\n\n\n\n\n(e) Highlighting the cross to remove the messages\n\n\n\n\n\n\n\n\n\n\n\n(f) Highlighting the render button\n\n\n\n\n\n\n\nFigure 2.3: Getting started with a Quarto document\n\n\n\nYou may get a notification along the lines of “Package rmarkdown required\\(\\dots\\).” (Figure 2.3 (b)). If that happens, click “Install”. For this example, we will put everything into this one Quarto document. You should save it as “australian_elections.qmd”: “File” \\(\\rightarrow\\) “Save As\\(\\dots\\)”.\nRemove almost all the default content, and then beneath the heading material create a new R code chunk: “Code” \\(\\rightarrow\\) “Insert Chunk”. Then add preamble documentation that explains:\n\nthe purpose of the document;\nthe author and contact details;\nwhen the file was written or last updated; and\nprerequisites that the file relies on.\n\n\n#### Preamble ####\n# Purpose: Read in data from the 2022 Australian Election and make\n# a graph of the number of seats each party won.\n# Author: Rohan Alexander\n# Email: rohan.alexander@utoronto.ca\n# Date: 1 January 2023\n# Prerequisites: Know where to get Australian elections data.\n\nIn R, lines that start with “#” are comments. This means that they are not run as code by R, but are instead designed to be read by humans. Each line of this preamble should start with a “#”. Also make it clear that this is the preamble section by surrounding that with “####”. The result should look like Figure 2.3 (c).\nAfter this we need to setup the workspace. This involves installing and loading any packages that will be needed. 
A package only needs to be installed once for each computer, but needs to be loaded each time it is to be used. In this case we are going to use the tidyverse and janitor packages. They will need to be installed because this is the first time they are being used, and then each will need to be loaded.\n\n\n\n\n\n\nShoulders of giants\n\n\n\nHadley Wickham is Chief Scientist at RStudio. After earning a PhD in Statistics from Iowa State University in 2008 he was appointed as an assistant professor at Rice University, and became Chief Scientist at RStudio, now Posit, in 2013. He developed the tidyverse collection of packages, and has published many books including R for Data Science (Wickham, Çetinkaya-Rundel, and Grolemund [2016] 2023) and Advanced R (Wickham 2019). He was awarded the COPSS Presidents’ Award in 2019.\n\n\nAn example of installing the packages follows. Run this code by clicking the small green arrow associated with the R code chunk (Figure 2.3 (d)).\n\n#### Workspace setup ####\ninstall.packages(\"tidyverse\")\ninstall.packages(\"janitor\")\n\nNow that the packages are installed, they need to be loaded. As that package installation step only needs to be done once per computer, that code can be commented out so that it is not accidentally run, or even just removed. Additionally, we can remove the message that printed when we installed the packages (Figure 2.3 (e)).\n\n#### Workspace setup ####\n# install.packages(\"tidyverse\")\n# install.packages(\"janitor\")\n\nlibrary(tidyverse)\nlibrary(janitor)\n\nWe can render the entire document by clicking “Render” (Figure 2.3 (f)). When you do this, you may be asked to install some packages. If that happens, then you should agree to this. This will result in a HTML document.\nFor an introduction to the packages that were just installed, each package contains a help file that provides information about them and their functions. 
It can be accessed by prepending a question mark to the package name and then running that code in the console. For instance ?tidyverse.\nTo simulate our data, we need to create a dataset with two variables: “Division” and “Party”, and some values for each. In the case of “Division” reasonable values would be a name of one of the 151 Australian divisions. In the case of “Party” reasonable values would be one of the following five: “Liberal”, “Labor”, “National”, “Green”, or “Other”. Again, this code can be run by clicking the small green arrow associated with the R code chunk.\n\nsimulated_data <-\n tibble(\n # Use 1 through to 151 to represent each division\n \"Division\" = 1:151,\n # Randomly pick an option, with replacement, 151 times\n \"Party\" = sample(\n x = c(\"Liberal\", \"Labor\", \"National\", \"Green\", \"Other\"),\n size = 151,\n replace = TRUE\n )\n )\n\nsimulated_data\n\n# A tibble: 151 × 2\n Division Party \n <int> <chr> \n 1 1 Labor \n 2 2 Liberal \n 3 3 Green \n 4 4 National\n 5 5 Other \n 6 6 Labor \n 7 7 Other \n 8 8 Liberal \n 9 9 Green \n10 10 National\n# ℹ 141 more rows\n\n\nAt a certain point, your code will not run and you will want to ask others for help. Do not take a screenshot of a small snippet of the code and expect that someone will be able to help based on that. They, almost surely, cannot. Instead, you need to provide them with your whole script in a way that they can run. We will explain what GitHub is more completely in Chapter 3, but for now, if you need help, then you should naively create a GitHub Gist which will enable you to share your code in a way that is more helpful than taking a screenshot. The first step is to create a free account on GitHub (Figure 2.4 (a)). Thinking about an appropriate username is important because this will become part of your professional profile. It would make sense to have a username that is professional, independent of any course, and ideally related to your real name. 
Then look for a “+” in the top right, and select “New gist” (Figure 2.4 (b)).\n\n\n\n\n\n\n\n\n\n\n\n(a) GitHub sign-up screen\n\n\n\n\n\n\n\n\n\n\n\n(b) New GitHub Gist\n\n\n\n\n\n\n\n\n\n\n\n\n\n(c) Create a public GitHub Gist to share code\n\n\n\n\n\n\n\nFigure 2.4: Creating a Gist to share code when asking for help\n\n\n\nFrom here you should add all the code to that Gist, not just the final bit that is giving an error. And give it a meaningful filename that includes “.R” at the end, for instance, “australian_elections.R”. In Figure 2.4 (c) it will turn out that we have incorrect capitalization, library(Tidyverse) instead of library(tidyverse).\nClick “Create public gist”. We can then share the URL to this Gist with whoever we are asking to help, explain what the problem is, and what we are trying to achieve. It will be easier for them to help, because all the code is available.\n\n\n2.2.3 Acquire\nNow we want to get the actual data. The data we need is from the Australian Electoral Commission (AEC), which is the non-partisan agency that organizes Australian federal elections. We can pass a page of their website to read_csv() from readr. We do not need to explicitly load readr because it is part of the tidyverse. The <- or “assignment operator” allocates the output of read_csv() to an object called “raw_elections_data”.\n\n#### Read in the data ####\nraw_elections_data <-\n read_csv(\n file = \n \"https://results.aec.gov.au/27966/website/Downloads/HouseMembersElectedDownload-27966.csv\",\n show_col_types = FALSE,\n skip = 1\n )\n\n# We have read the data from the AEC website. 
We may like to save\n# it in case something happens or they move it.\nwrite_csv(\n x = raw_elections_data,\n file = \"australian_voting.csv\"\n)\n\nWe can take a quick look at the dataset using head() which will show the first six rows, and tail() which will show the last six rows.\n\nhead(raw_elections_data)\n\n# A tibble: 6 × 8\n DivisionID DivisionNm StateAb CandidateID GivenNm Surname PartyNm PartyAb\n <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> \n1 179 Adelaide SA 36973 Steve GEORGANAS Austral… ALP \n2 197 Aston VIC 36704 Alan TUDGE Liberal LP \n3 198 Ballarat VIC 36409 Catherine KING Austral… ALP \n4 103 Banks NSW 37018 David COLEMAN Liberal LP \n5 180 Barker SA 37083 Tony PASIN Liberal LP \n6 104 Barton NSW 36820 Linda BURNEY Austral… ALP \n\ntail(raw_elections_data)\n\n# A tibble: 6 × 8\n DivisionID DivisionNm StateAb CandidateID GivenNm Surname PartyNm PartyAb\n <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> \n1 152 Wentworth NSW 37451 Allegra SPENDER Indepen… IND \n2 153 Werriwa NSW 36810 Anne Maree STANLEY Austral… ALP \n3 150 Whitlam NSW 36811 Stephen JONES Austral… ALP \n4 178 Wide Bay QLD 37506 Llew O'BRIEN Liberal… LNP \n5 234 Wills VIC 36452 Peter KHALIL Austral… ALP \n6 316 Wright QLD 37500 Scott BUCHHOLZ Liberal… LNP \n\n\nWe need to clean the data so that we can use it. We are trying to make it similar to the dataset that we thought we wanted in the planning stage. While it is fine to move away from the plan, this needs to be a deliberate, reasoned decision. After reading in the dataset that we saved, the first thing that we will do is adjust the names of the variables. 
We will do this using clean_names() from janitor.\n\n#### Basic cleaning ####\nraw_elections_data <-\n read_csv(\n file = \"australian_voting.csv\",\n show_col_types = FALSE\n )\n\n\n# Make the names easier to type\ncleaned_elections_data <-\n clean_names(raw_elections_data)\n\n# Have a look at the first six rows\nhead(cleaned_elections_data)\n\n# A tibble: 6 × 8\n division_id division_nm state_ab candidate_id given_nm surname party_nm \n <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> \n1 179 Adelaide SA 36973 Steve GEORGANAS Australian …\n2 197 Aston VIC 36704 Alan TUDGE Liberal \n3 198 Ballarat VIC 36409 Catherine KING Australian …\n4 103 Banks NSW 37018 David COLEMAN Liberal \n5 180 Barker SA 37083 Tony PASIN Liberal \n6 104 Barton NSW 36820 Linda BURNEY Australian …\n# ℹ 1 more variable: party_ab <chr>\n\n\nThe names are faster to type because RStudio will auto-complete them. To do this, we begin typing the name of a variable and then use the “tab” key to complete it.\nThere are many variables in the dataset, and we are primarily interested in two: “division_nm” and “party_nm”. We can choose certain variables of interest with select() from dplyr which we loaded as part of the tidyverse. The “pipe operator”, |>, pushes the output of one line to be the first input of the function on the next line.\n\ncleaned_elections_data <-\n cleaned_elections_data |>\n select(\n division_nm,\n party_nm\n )\n\nhead(cleaned_elections_data)\n\n# A tibble: 6 × 2\n division_nm party_nm \n <chr> <chr> \n1 Adelaide Australian Labor Party\n2 Aston Liberal \n3 Ballarat Australian Labor Party\n4 Banks Liberal \n5 Barker Liberal \n6 Barton Australian Labor Party\n\n\nSome of the variable names are still not obvious because they are abbreviated. We can look at the names of the columns in this dataset with names(). 
And we can change the names using rename() from dplyr.\n\nnames(cleaned_elections_data)\n\n[1] \"division_nm\" \"party_nm\" \n\n\n\ncleaned_elections_data <-\n cleaned_elections_data |>\n rename(\n division = division_nm,\n elected_party = party_nm\n )\n\nhead(cleaned_elections_data)\n\n# A tibble: 6 × 2\n division elected_party \n <chr> <chr> \n1 Adelaide Australian Labor Party\n2 Aston Liberal \n3 Ballarat Australian Labor Party\n4 Banks Liberal \n5 Barker Liberal \n6 Barton Australian Labor Party\n\n\nWe could now look at the unique values in the “elected_party” column using unique().\n\ncleaned_elections_data$elected_party |>\n unique()\n\n[1] \"Australian Labor Party\" \n[2] \"Liberal\" \n[3] \"Liberal National Party of Queensland\"\n[4] \"The Greens\" \n[5] \"The Nationals\" \n[6] \"Independent\" \n[7] \"Katter's Australian Party (KAP)\" \n[8] \"Centre Alliance\" \n\n\nAs there is more detail in this than we wanted, we may want to simplify the party names to match what we simulated, using case_match() from dplyr.\n\ncleaned_elections_data <-\n cleaned_elections_data |>\n mutate(\n elected_party =\n case_match(\n elected_party,\n \"Australian Labor Party\" ~ \"Labor\",\n \"Liberal National Party of Queensland\" ~ \"Liberal\",\n \"Liberal\" ~ \"Liberal\",\n \"The Nationals\" ~ \"Nationals\",\n \"The Greens\" ~ \"Greens\",\n \"Independent\" ~ \"Other\",\n \"Katter's Australian Party (KAP)\" ~ \"Other\",\n \"Centre Alliance\" ~ \"Other\"\n )\n )\n\nhead(cleaned_elections_data)\n\n# A tibble: 6 × 2\n division elected_party\n <chr> <chr> \n1 Adelaide Labor \n2 Aston Liberal \n3 Ballarat Labor \n4 Banks Liberal \n5 Barker Liberal \n6 Barton Labor \n\n\nOur data now matches our plan (Figure 2.2 (a)). For every electoral division we have the party of the person that won it.\nHaving now nicely cleaned the dataset, we should save it, so that we can start with that cleaned dataset in the next stage. 
We should make sure to save it under a new file name so we are not replacing the raw data, and so that it is easy to identify the cleaned dataset later.\n\nwrite_csv(\n x = cleaned_elections_data,\n file = \"cleaned_elections_data.csv\"\n)\n\n\n\n2.2.4 Explore\nWe may like to explore the dataset that we created. One way to better understand a dataset is to make a graph. In particular, here we would like to build the graph that we planned in Figure 2.2 (b).\nFirst, we read in the dataset that we just created.\n\n#### Read in the data ####\ncleaned_elections_data <-\n read_csv(\n file = \"cleaned_elections_data.csv\",\n show_col_types = FALSE\n )\n\nWe can get a quick count of how many seats each party won using count() from dplyr.\n\ncleaned_elections_data |>\n count(elected_party)\n\n# A tibble: 5 × 2\n elected_party n\n <chr> <int>\n1 Greens 4\n2 Labor 77\n3 Liberal 48\n4 Nationals 10\n5 Other 12\n\n\nTo build the graph that we are interested in, we use ggplot2 which is part of the tidyverse. The key aspect of this package is that we build graphs by adding layers using “+”, which we call the “add operator”. In particular we will create a bar chart using geom_bar() from ggplot2 (Figure 2.5 (a)).\ncleaned_elections_data |>\n ggplot(aes(x = elected_party)) + # aes abbreviates \"aesthetics\" \n geom_bar()\n\ncleaned_elections_data |>\n ggplot(aes(x = elected_party)) +\n geom_bar() +\n theme_minimal() + # Make the theme neater\n labs(x = \"Party\", y = \"Number of seats\") # Make labels more meaningful\n\n\n\n\n\n\n\n\n\n\n\n(a) Default options\n\n\n\n\n\n\n\n\n\n\n\n(b) Improved theme and labels\n\n\n\n\n\n\n\nFigure 2.5: Number of seats won, by political party, at the 2022 Australian Federal Election\n\n\n\nFigure 2.5 (a) accomplishes what we set out to do. But we can make it look a bit nicer by modifying the default options and improving the labels (Figure 2.5 (b)).\n\n\n2.2.5 Share\nTo this point we have downloaded some data, cleaned it, and made a graph. 
We would typically need to communicate what we have done at some length. In this case, we can write a few paragraphs about what we did, why we did it, and what we found to conclude our workflow. An example follows.\n\nAustralia is a parliamentary democracy with 151 seats in the House of Representatives, which is the house from which government is formed. There are two major parties (“Liberal” and “Labor”), two minor parties (“Nationals” and “Greens”), and many smaller parties. The 2022 Federal Election occurred on 21 May, and around 15 million votes were cast. We were interested in the number of seats that were won by each party.\nWe downloaded the results, on a seat-specific basis, from the Australian Electoral Commission website. We cleaned and tidied the dataset using the statistical programming language R (R Core Team 2023) including the tidyverse (Wickham et al. 2019) and janitor (Firke 2023). We then created a graph of the number of seats that each political party won (Figure 2.5).\nWe found that the Labor Party won 77 seats, followed by the Liberal Party with 48 seats. The minor parties won the following number of seats: the Nationals won 10 seats and the Greens won 4 seats. Finally, there were 10 Independents elected as well as candidates from smaller parties.\nThe distribution of seats is skewed toward the two major parties, which could reflect relatively stable preferences on the part of Australian voters, or possibly inertia due to the benefits of already being a major party, such as a national network or funding. A better understanding of the reasons for this distribution is of interest in future work. While the dataset consists of everyone who voted, it is worth noting that in Australia some are systematically excluded from voting, and it is much more difficult for some to vote than others.\n\nOne aspect to be especially concerned with is making sure that this communication is focused on the needs of the audience and telling a story. 
Data journalism provides some excellent examples of how analysis needs to be tailored to the audience, for instance, Cardoso (2020) and Bronner (2020).",
"crumbs": [
"Foundations",
"2 Drinking from a fire hose "
@@ -324,7 +324,7 @@
"href": "03-workflow.html#efficiency",
"title": "3 Reproducible workflows",
"section": "3.6 Efficiency",
- "text": "3.6 Efficiency\nGenerally in this book we are, and will continue to be, concerned with just getting something done. Not necessarily getting it done in the best or most efficient way, because to a large extent, being worried about that is a waste of time. For the most part one is better off just pushing things into the cloud, letting them run for a reasonable time, and using that time to worry about other aspects of the pipeline. But that eventually becomes unfeasible. At a certain point, and this differs depending on context, efficiency becomes important. Eventually ugly or slow code, and dogmatic insistence on a particular way of doing things, have an effect. And it is at that point that one needs to be open to new approaches to ensure efficiency. There is rarely a most common area for obvious performance gains. Instead, it is important to develop the ability to measure, evaluate, and think.\nOne of the best ways to improve the efficiency of our code is preparing it in such a way that we can bring in a second pair of eyes. To make the most of their time, it is important that our code easy to read. So we start with “code linting” and “styling”. This does not speed up our code, per se, but instead makes it more efficient when another person comes to it, or we revisit it. This enables formal code review and refactoring, which is where we rewrite code to make it better, while not changing what it does (it does the same thing, but in a different way). We then turn to measurement of run time, and introduce parallel processing, where we allow our computer to run code for multiple processes at the same time\n\n3.6.1 Sharing a code environment\nWe have discussed at length the need to share code, and we have put forward an approach to this using GitHub. And in Chapter 10, we will discuss sharing data. But, there is another requirement to enable other people to run our code. 
In Chapter 2 we discussed how R itself, as well as R packages update from time to time, as new functionality is developed, errors fixed, and other general improvements made. Online Appendix A describes how one advantage of the tidyverse is that it can update faster than base R, because it is more specific. But this could mean that even if we were to share all the code and data that we use, it is possible that the software versions that have become available would cause errors.\nThe solution to this is to detail the environment that was used. There are a large number of ways to do this, and they can add complexity. We just focus on documenting the version of R and R packages that were used, and making it easier for others to install that exact version. Essentially we are just isolating the set-up that we used because that will help with reproducibility (Perkel2023?). In R we can use renv to do this.\nOnce renv is installed and loaded, we use init() to get the infrastructure set-up that we will need. We are going to create a file that will record the packages and versions used. We then use snapshot() to actually document what we are using. This creates a “lockfile” that records the information.\nIf we want to see which packages we are using in the R Project, then we can use dependencies(). Doing this for the example folder indicates that the following packages are used: rmarkdown, bookdown, knitr, rmarkdown, bookdown, knitr, palmerpenguins, tidyverse, renv, haven, readr, and tidyverse.\nWe could open the lockfile file—“renv.lock”—to see the exact versions if we wanted. The lockfile also documents all the other packages that were installed and where they were downloaded from. Someone coming to this project from outside could then use restore() which would install the exact version of the packages that we used.\n\n\n3.6.2 Code linting and styling\nBeing fast is valuable but it is mostly about being able to iterate fast, not necessarily having code that runs fast. 
Backus (1981, 26) describes how even in 1954 a programmer cost at least as much as a computer, and these days additional computational power is usually much cheaper than a programmer. Performant code is important, but it is also important to use other people’s time efficiently. Code is rarely only written once. Instead we typically have to come back to it, even if to just fix mistakes, and this means that code must be able to be read by humans (Matsumoto 2007, 478). If this is not done then there will be an efficiency cost.\nLinting and styling is the process of checking code, mostly for stylistic issues, and re-arranging code to make it easier to read. (There is another aspect of linting, which is dealing with programming errors, such as forgetting a closing bracket, but here we focus on stylistic issues.) Often the best efficiency gain comes from making it easier for others to read our code, even if this is just ourselves returning to the code after a break. Jane Street, a US proprietary trading firm, places a very strong focus on ensuring their code is readable, as a core part of risk mitigation (Minsky 2011). While we may not all have billions of dollars under the potentially mercurial management of code, we all would likely prefer that our code does not produce errors.\nWe use lint() from lintr to lint our code. For instance, consider the following R code (saved as “linting_example.R”).\n\nSIMULATED_DATA <-\n tibble(\n division = c(1:150, 151),\n party = sample(\n x = c(\"Liberal\"),\n size = 151,\n replace = T\n )\n )\n\n\nlint(filename = \"linting_example.R\")\n\nThe result is that the file “linting_example.R” is opened and the issues that lint() found are printed in “Markers” (Figure 3.6). 
It is then up to you to deal with the issues.\n\n\n\n\n\n\nFigure 3.6: Linting results from example R code\n\n\n\nMaking the recommended changes results in code that is more readable, and consistent with best practice, as defined by Wickham (2021).\n\nsimulated_data <-\n tibble(\n division = c(1:150, 151),\n party = sample(\n x = c(\"Liberal\"),\n size = 151,\n replace = TRUE\n )\n )\n\nAt first it may seem that some aspects that the linter is identifying, like trailing whitespace and only using double quotes are small and inconsequential. But they distract from being able to fix bigger issues. Further, if we are not able to get small things right, then how could anyone trust that we could get the big things right? Therefore, it is important to have dealt with all the small aspects that a linter identifies.\nIn addition to lintr we also use styler. This will automatically adjust style issues, in contrast to the linter, which gave a list of issues to look at. To run this we use style_file().\n\nstyle_file(path = \"linting_example.R\")\n\nThis will automatically make changes, such as spacing and indentation. As such this should be done regularly, rather than only once at the end of a project, so as to be able to review the changes and make sure no errors have been introduced.\n\n\n3.6.3 Code review\nHaving dealt with all of these aspects of style, we can turn to code review. This is the process of having another person go through and critique the code. Many professional writers have editors, and code review is the closest that we come to that in data science. Code review is a critical part of writing code, and Irving et al. (2021, 465) describe it as “the most effective way to find bugs”. It is especially helpful, although quite daunting, when learning to code because getting feedback is a great way to improve.\nGo out of your way to be polite and collegial when reviewing another person’s code. 
Small aspects to do with style, things like spacing and separation, should have been taken care of by a linter and styler, but if not, then make a general recommendation about that. Most of your time as a code reviewer in data science should be spent on aspects such as:\n\nIs there an informative README and how could it be improved?\nAre the file names and variable names consistent, informative, and meaningful?\nDo the comments allow you to understand why something is being done?\nAre the tests both appropriate and sufficient? Are there edge cases or corner solutions that are not considered? Similarly, are there unnecessary tests that could be removed?\nAre there magic numbers that could be changed to variables and explained?\nIs there duplicated code that could be changed?\nAre there any outstanding warnings that should be addressed?\nAre there any especially large functions or pipes that could be separated into smaller ones?\nIs the structure of the project appropriate?\nCan we change any of the code to data (Irving et al. 2021, 462)?\n\nFor instance, consider some code that looked for the names of prime ministers and presidents. When we first wrote this code we likely added the relevant names directly into the code. But as part of code review, we might instead recommend that this be changed. We might recommend creating a small dataset of relevant names, and then re-writing the code to have it look up that dataset.\nCode review ensures that the code can be understood by at least one other person. This is a critical part of building knowledge about the world. At Google, code review is not primarily about finding defects, although that may happen, but is instead about ensuring readability and maintainability as well as education (Sadowski et al. 2018). 
This is also the case at Jane Street where they use code review to catch bugs, share institutional knowledge, assist with training, and oblige staff to write code that can be read (Minsky 2015).\nFinally, code review does not have to, and should not, be an onerous days-consuming process of reading all the code. The best code review is a quick review of just one file, focused on suggesting changes to just a handful of lines. Indeed, it may be better to have a review done by a small team of people rather than one individual. Do not review too much code at any one time. At most a few hundred lines, which should take around an hour, because any more than that has been found to be associated with reduced efficacy (Cohen, Teleki, and Brown 2006, 79).\n\n\n3.6.4 Code refactoring\nTo refactor code means to rewrite it so that the new code achieves the same outcome as the old code, but the new code does it better. For instance, Chawla (2020) discuss how the code underpinning an important UK Covid model was initially written by epidemiologists, and months later clarified and cleaned up by a team from the Royal Society, Microsoft, and GitHub. This was valuable because it provided more confidence in the model, even though both versions produced the same outputs, given the same inputs.\nWe typically refer to code refactoring in relation to code that someone else wrote. (Although it may be that we actually wrote the code, and it was just that it was some time ago.) When we start to refactor code, we want to make sure that the rewritten code achieves the same outcomes as the original code. This means that we need a suite of appropriate tests written that we can depend on. If these do not exist, then we may need to create them.\nWe rewrite code to make it easier for others to understand, which in turn allows more confidence in our conclusions. But before we can do that, we need to understand what the existing code is doing. 
One way to get started is to go through the code and add extensive comments. These comments are different to normal comments. They are our active process of trying to understand what is each code chunk trying to do and how could this be improved.\nRefactoring code is an opportunity to ensure that it satisfies best practice. Trisovic et al. (2022) details some core recommendations based on examining 9,000 R scripts including:\n\nRemove setwd() and any absolute paths, and ensure that only relative paths, in relation to the “.Rproj” file, are used.\nEnsure there is a clear order of execution. We have recommended using numbers in filenames to achieve this initially, but eventually more sophisticated approaches, such as targets (Landau 2021), could be used instead.\nEnsure that code can run on a different computer.\n\nFor instance, consider the following code:\n\nsetwd(\"/Users/rohanalexander/Documents/telling_stories\")\n\nlibrary(tidyverse)\n\nd = read_csv(\"cars.csv\")\n\nmtcars =\n mtcars |> \n mutate(K_P_L = mpg / 2.352)\n\nlibrary(datasauRus)\n\ndatasaurus_dozen\n\nWe could change that, starting by creating an R Project which enables us to remove setwd(), grouping all the library() calls at the top, using “<-” instead of “=”, and being consistent with variable names:\n\nlibrary(tidyverse)\nlibrary(datasauRus)\n\ncars_data <- read_csv(\"cars.csv\")\n\nmpg_to_kpl_conversion_factor <- 2.352\n\nmtcars <-\n mtcars |> \n mutate(kpl = mpg / mpg_to_kpl_conversion_factor)\n\n\n\n3.6.5 Parallel processing\nSometimes code is slow because the computer needs to do the same thing many times. We may be able to take advantage of this and enable these jobs to be done at the same time using parallel processing. This will be especially useful starting from Chapter 12 for modeling.\nAfter installing and loading tictoc we can use tic() and toc() to time various aspects of our code. 
This is useful with parallel processing, but also more generally, to help us find out where the largest delays are.\n\ntic(\"First bit of code\")\nprint(\"Fast code\")\n\n[1] \"Fast code\"\n\ntoc()\n\nFirst bit of code: 0.001 sec elapsed\n\ntic(\"Second bit of code\")\nSys.sleep(3)\nprint(\"Slow code\")\n\n[1] \"Slow code\"\n\ntoc()\n\nSecond bit of code: 3.007 sec elapsed\n\n\nAnd so we know that there is something slowing down the code. (In this artificial case it is Sys.sleep() causing a delay of three seconds.)\nWe could use parallel which is part of base R to run functions in parallel. We could also use future which brings additional features. After installing and loading future we use plan() to specify whether we want to run things sequentially (“sequential”) or in parallel (“multisession”). We then wrap what we want this applied to within future().\nTo see this in action we will create a dataset and then implement a function on a row-wise basis.\n\nsimulated_data <-\n tibble(\n random_draws = runif(n = 1000000, min = 0, max = 1000) |> round(),\n more_random_draws = runif(n = 1000000, min = 0, max = 1000) |> round()\n )\n\nplan(sequential)\n\ntic()\nsimulated_data <-\n simulated_data |>\n rowwise() |>\n mutate(which_is_smaller =\n min(c(random_draws,\n more_random_draws)))\ntoc()\n\nplan(multisession)\n\ntic()\nsimulated_data <-\n future(simulated_data |>\n rowwise() |>\n mutate(which_is_smaller =\n min(c(\n random_draws,\n more_random_draws\n ))))\ntoc()\n\nThe sequential approach takes about 5 seconds, while the multisession approach takes about 0.3 seconds.",
+ "text": "3.6 Efficiency\nGenerally in this book we are, and will continue to be, concerned with just getting something done. Not necessarily getting it done in the best or most efficient way, because to a large extent, being worried about that is a waste of time. For the most part one is better off just pushing things into the cloud, letting them run for a reasonable time, and using that time to worry about other aspects of the pipeline. But that eventually becomes unfeasible. At a certain point, and this differs depending on context, efficiency becomes important. Eventually ugly or slow code, and dogmatic insistence on a particular way of doing things, have an effect. And it is at that point that one needs to be open to new approaches to ensure efficiency. There is rarely a most common area for obvious performance gains. Instead, it is important to develop the ability to measure, evaluate, and think.\nOne of the best ways to improve the efficiency of our code is preparing it in such a way that we can bring in a second pair of eyes. To make the most of their time, it is important that our code is easy to read. So we start with “code linting” and “styling”. This does not speed up our code, per se, but instead makes it more efficient when another person comes to it, or we revisit it. This enables formal code review and refactoring, which is where we rewrite code to make it better, while not changing what it does (it does the same thing, but in a different way). We then turn to measurement of run time, and introduce parallel processing, where we allow our computer to run code for multiple processes at the same time.\n\n3.6.1 Sharing a code environment\nWe have discussed at length the need to share code, and we have put forward an approach to this using GitHub. And in Chapter 10, we will discuss sharing data. But there is another requirement to enable other people to run our code. 
In Chapter 2 we discussed how R itself, as well as R packages update from time to time, as new functionality is developed, errors fixed, and other general improvements made. Online Appendix A describes how one advantage of the tidyverse is that it can update faster than base R, because it is more specific. But this could mean that even if we were to share all the code and data that we use, it is possible that the software versions that have become available would cause errors.\nThe solution to this is to detail the environment that was used. There are a large number of ways to do this, and they can add complexity. We just focus on documenting the version of R and R packages that were used, and making it easier for others to install that exact version. Essentially we are just isolating the set-up that we used because that will help with reproducibility (Perkel2023?). In R we can use renv to do this.\nOnce renv is installed and loaded, we use init() to get the infrastructure set-up that we will need. We are going to create a file that will record the packages and versions used. We then use snapshot() to actually document what we are using. This creates a “lockfile” that records the information.\nIf we want to see which packages we are using in the R Project, then we can use dependencies(). Doing this for the example folder indicates that the following packages are used: rmarkdown, bookdown, knitr, palmerpenguins, tidyverse, renv, haven, and readr.\nWe could open the lockfile—“renv.lock”—to see the exact versions if we wanted. The lockfile also documents all the other packages that were installed and where they were downloaded from. Someone coming to this project from outside could then use restore() which would install the exact version of the packages that we used.\n\n\n3.6.2 Code linting and styling\nBeing fast is valuable but it is mostly about being able to iterate fast, not necessarily having code that runs fast. 
Backus (1981, 26) describes how even in 1954 a programmer cost at least as much as a computer, and these days additional computational power is usually much cheaper than a programmer. Performant code is important, but it is also important to use other people’s time efficiently. Code is rarely only written once. Instead we typically have to come back to it, even if to just fix mistakes, and this means that code must be able to be read by humans (Matsumoto 2007, 478). If this is not done then there will be an efficiency cost.\nLinting and styling is the process of checking code, mostly for stylistic issues, and re-arranging code to make it easier to read. (There is another aspect of linting, which is dealing with programming errors, such as forgetting a closing bracket, but here we focus on stylistic issues.) Often the best efficiency gain comes from making it easier for others to read our code, even if this is just ourselves returning to the code after a break. Jane Street, a US proprietary trading firm, places a very strong focus on ensuring their code is readable, as a core part of risk mitigation (Minsky 2011). While we may not all have billions of dollars under the potentially mercurial management of code, we all would likely prefer that our code does not produce errors.\nWe use lint() from lintr to lint our code. For instance, consider the following R code (saved as “linting_example.R”).\n\nSIMULATED_DATA <-\n tibble(\n division = c(1:150, 151),\n party = sample(\n x = c(\"Liberal\"),\n size = 151,\n replace = T\n )\n )\n\n\nlint(filename = \"linting_example.R\")\n\nThe result is that the file “linting_example.R” is opened and the issues that lint() found are printed in “Markers” (Figure 3.6). 
It is then up to you to deal with the issues.\n\n\n\n\n\n\nFigure 3.6: Linting results from example R code\n\n\n\nMaking the recommended changes results in code that is more readable, and consistent with best practice, as defined by Wickham (2021).\n\nsimulated_data <-\n tibble(\n division = c(1:150, 151),\n party = sample(\n x = c(\"Liberal\"),\n size = 151,\n replace = TRUE\n )\n )\n\nAt first it may seem that some aspects that the linter is identifying, like trailing whitespace and only using double quotes are small and inconsequential. But they distract from being able to fix bigger issues. Further, if we are not able to get small things right, then how could anyone trust that we could get the big things right? Therefore, it is important to have dealt with all the small aspects that a linter identifies.\nIn addition to lintr we also use styler. This will automatically adjust style issues, in contrast to the linter, which gave a list of issues to look at. To run this we use style_file().\n\nstyle_file(path = \"linting_example.R\")\n\nThis will automatically make changes, such as spacing and indentation. As such this should be done regularly, rather than only once at the end of a project, so as to be able to review the changes and make sure no errors have been introduced.\n\n\n3.6.3 Code review\nHaving dealt with all of these aspects of style, we can turn to code review. This is the process of having another person go through and critique the code. Many professional writers have editors, and code review is the closest that we come to that in data science. Code review is a critical part of writing code, and Irving et al. (2021, 465) describe it as “the most effective way to find bugs”. It is especially helpful, although quite daunting, when learning to code because getting feedback is a great way to improve.\nGo out of your way to be polite and collegial when reviewing another person’s code. 
Small aspects to do with style, things like spacing and separation, should have been taken care of by a linter and styler, but if not, then make a general recommendation about that. Most of your time as a code reviewer in data science should be spent on aspects such as:\n\nIs there an informative README and how could it be improved?\nAre the file names and variable names consistent, informative, and meaningful?\nDo the comments allow you to understand why something is being done?\nAre the tests both appropriate and sufficient? Are there edge cases or corner solutions that are not considered? Similarly, are there unnecessary tests that could be removed?\nAre there magic numbers that could be changed to variables and explained?\nIs there duplicated code that could be changed?\nAre there any outstanding warnings that should be addressed?\nAre there any especially large functions or pipes that could be separated into smaller ones?\nIs the structure of the project appropriate?\nCan we change any of the code to data (Irving et al. 2021, 462)?\n\nFor instance, consider some code that looked for the names of prime ministers and presidents. When we first wrote this code we likely added the relevant names directly into the code. But as part of code review, we might instead recommend that this be changed. We might recommend creating a small dataset of relevant names, and then re-writing the code to have it look up that dataset.\nCode review ensures that the code can be understood by at least one other person. This is a critical part of building knowledge about the world. At Google, code review is not primarily about finding defects, although that may happen, but is instead about ensuring readability and maintainability as well as education (Sadowski et al. 2018). 
This is also the case at Jane Street where they use code review to catch bugs, share institutional knowledge, assist with training, and oblige staff to write code that can be read (Minsky 2015).\nFinally, code review does not have to, and should not, be an onerous days-consuming process of reading all the code. The best code review is a quick review of just one file, focused on suggesting changes to just a handful of lines. Indeed, it may be better to have a review done by a small team of people rather than one individual. Do not review too much code at any one time: at most a few hundred lines, which should take around an hour, because any more than that has been found to be associated with reduced efficacy (Cohen, Teleki, and Brown 2006, 79).\n\n\n3.6.4 Code refactoring\nTo refactor code means to rewrite it so that the new code achieves the same outcome as the old code, but the new code does it better. For instance, Chawla (2020) discusses how the code underpinning an important UK Covid model was initially written by epidemiologists, and months later clarified and cleaned up by a team from the Royal Society, Microsoft, and GitHub. This was valuable because it provided more confidence in the model, even though both versions produced the same outputs, given the same inputs.\nWe typically refer to code refactoring in relation to code that someone else wrote. (Although it may be that we actually wrote the code, and it was just that it was some time ago.) When we start to refactor code, we want to make sure that the rewritten code achieves the same outcomes as the original code. This means that we need a suite of appropriate tests written that we can depend on. If these do not exist, then we may need to create them.\nWe rewrite code to make it easier for others to understand, which in turn allows more confidence in our conclusions. But before we can do that, we need to understand what the existing code is doing. 
One way to get started is to go through the code and add extensive comments. These comments are different to normal comments. They are our active process of trying to understand what each code chunk is trying to do and how it could be improved.\nRefactoring code is an opportunity to ensure that it satisfies best practice. Trisovic et al. (2022) detail some core recommendations based on examining 9,000 R scripts including:\n\nRemove setwd() and any absolute paths, and ensure that only relative paths, in relation to the “.Rproj” file, are used.\nEnsure there is a clear order of execution. We have recommended using numbers in filenames to achieve this initially, but eventually more sophisticated approaches, such as targets (Landau 2021), could be used instead.\nEnsure that code can run on a different computer.\n\nFor instance, consider the following code:\n\nsetwd(\"/Users/rohanalexander/Documents/telling_stories\")\n\nlibrary(tidyverse)\n\nd = read_csv(\"cars.csv\")\n\nmtcars =\n mtcars |> \n mutate(K_P_L = mpg / 2.352)\n\nlibrary(datasauRus)\n\ndatasaurus_dozen\n\nWe could change that, starting by creating an R Project which enables us to remove setwd(), grouping all the library() calls at the top, using “<-” instead of “=”, and being consistent with variable names:\n\nlibrary(tidyverse)\nlibrary(datasauRus)\n\ncars_data <- read_csv(\"cars.csv\")\n\nmpg_to_kpl_conversion_factor <- 2.352\n\nmtcars <-\n mtcars |> \n mutate(kpl = mpg / mpg_to_kpl_conversion_factor)\n\n\n\n3.6.5 Parallel processing\nSometimes code is slow because the computer needs to do the same thing many times. We may be able to take advantage of this and enable these jobs to be done at the same time using parallel processing. This will be especially useful starting from Chapter 12 for modeling.\nAfter installing and loading tictoc we can use tic() and toc() to time various aspects of our code. 
This is useful with parallel processing, but also more generally, to help us find out where the largest delays are.\n\ntic(\"First bit of code\")\nprint(\"Fast code\")\n\n[1] \"Fast code\"\n\ntoc()\n\nFirst bit of code: 0 sec elapsed\n\ntic(\"Second bit of code\")\nSys.sleep(3)\nprint(\"Slow code\")\n\n[1] \"Slow code\"\n\ntoc()\n\nSecond bit of code: 3.008 sec elapsed\n\n\nAnd so we know that there is something slowing down the code. (In this artificial case it is Sys.sleep() causing a delay of three seconds.)\nWe could use parallel which is part of base R to run functions in parallel. We could also use future which brings additional features. After installing and loading future we use plan() to specify whether we want to run things sequentially (“sequential”) or in parallel (“multisession”). We then wrap what we want this applied to within future().\nTo see this in action we will create a dataset and then implement a function on a row-wise basis.\n\nsimulated_data <-\n tibble(\n random_draws = runif(n = 1000000, min = 0, max = 1000) |> round(),\n more_random_draws = runif(n = 1000000, min = 0, max = 1000) |> round()\n )\n\nplan(sequential)\n\ntic()\nsimulated_data <-\n simulated_data |>\n rowwise() |>\n mutate(which_is_smaller =\n min(c(random_draws,\n more_random_draws)))\ntoc()\n\nplan(multisession)\n\ntic()\nsimulated_data <-\n future(simulated_data |>\n rowwise() |>\n mutate(which_is_smaller =\n min(c(\n random_draws,\n more_random_draws\n ))))\ntoc()\n\nThe sequential approach takes about 5 seconds, while the multisession approach takes about 0.3 seconds.",
"crumbs": [
"Foundations",
"3 Reproducible workflows "
@@ -509,210 +509,210 @@
{
"objectID": "06-farm.html",
"href": "06-farm.html",
- "title": "6 Farm data",
+ "title": "6 Measurement, censuses, and sampling",
"section": "",
"text": "6.1 Introduction\nAs we think of our world, and telling stories about it, one of the most difficult aspects is to reduce the beautiful complexity of it into a dataset that we can use. We need to know what we give up when we do this. And be deliberate and thoughtful as we proceed. Some datasets are so large that one specific data point does not matter—it could be swapped for another without any effect (Crawford 2021, 94). But this is not always reasonable: how different would your life be if you had a different mother?\nWe are often interested in understanding the implications of some dataset; making forecasts based on it or using that dataset to make claims about broader phenomena. Regardless of how we turn our world into data, we will usually only ever have a sample of the data that we need. Statistics provides formal approaches that we use to keep these issues front of mind and understand the implications. But it does not provide definitive guidance about broader issues, such as considering who profits from the data that were collected, and whose power it reflects.\nIn this chapter we first discuss measurement, and some of the concerns that it brings. We then turn to censuses, in which we typically try to obtain data about an entire population. We also discuss other government official statistics, and long-standing surveys. We describe datasets of this type as “farmed data”. Farmed datasets are typically well put together, thoroughly documented, and the work of collecting, preparing, and cleaning these datasets is mostly done for us. They are also, usually, conducted on a known release cycle. For instance, many countries release unemployment and inflation datasets monthly, GDP quarterly, and a census every five to ten years.\nWe then introduce statistical notions around sampling to provide a foundation that we will continually return to. 
Over the past one hundred years or so, statisticians have developed considerable sophistication for thinking about samples, and dealt with many controversies (Brewer 2013). In this chapter we consider probability and non-probability sampling and introduce certain key terminology and concepts.\nThis chapter is about data that are made available for us. Data are not neutral. For instance, archivists are now careful to consider archives not only as a source of fact, but also as part of the production of fact which occurred within a particular context especially constructed by the state (Stoler 2002). Thinking clearly about who is included in the dataset, and who is systematically excluded, is critical. Understanding, capturing, classifying, and naming data is an exercise in building a world and reflects power (Crawford 2021, 121), be that social, historical, financial, or legal.\nFor instance, we can consider the role of sex and gender in survey research. Sex is based on biological attributes and is assigned at birth, while gender is socially constructed and has both biological and cultural aspects (Lips 2020, 7). We may be interested in the relationship between gender, rather than sex, and some outcome. But the move toward a nuanced concept of gender in official statistics has only happened recently. Surveys that insist on a binary gender variable that is the same as sex, will not reflect those respondents who do not identify as such. Kennedy et al. (2022) provide a variety of aspects to consider when deciding what to do with gender responses, including: ethics, accuracy, practicality, and flexibility. But there is no universal best solution. Ensuring respect for the survey respondent should be the highest priority (Kennedy et al. 2022, 16).\nWhy do we even need classifications and groupings if it causes such concerns? 
Scott (1998) positions much of this as an outcome of the state, for its own purposes, wanting to make society legible and considers this a defining feature of modern states. For instance, Scott (1998) sees the use of surnames as arising because of the state’s desire for legible lists to use for taxation, property ownership, conscription, and censuses. The state’s desire for legibility also required imposing consistency on measurement. The modern form of metrology, which is “the study of how measurements are made, and how data are compared” (Plant and Hanisch 2020), began in the French Revolution when various measurements were standardized. This later further developed as part of Napoleonic state building (Scott 1998, 30). Prévost and Beaud (2015, 154) describe the essence of the change as one where knowledge went from being “singular, local, idiosyncratic\\(\\dots\\) and often couched in literary form” to generalized, standardized, and numeric. That all said, it would be difficult to collect data without categorizable, measurable scales. A further concern is reification, where we forget that these measures must be constructed.\nAll datasets have shortcomings. In this chapter we develop comfort with “farmed data”. We use that term to refer to a dataset that has been developed specifically for the purpose of being used as data.\nEven though these farmed datasets are put together for us to use, and can generally be easily obtained, it is nonetheless especially important for us to develop a thorough understanding of their construction. James Mill, the nineteenth century Scottish author, famously wrote The History of British India without having set foot in the country. He claimed:\nIt may seem remarkable that he was considered an expert and his views had influence. 
Yet today, many will, say, use inflation statistics without ever having tried to track a few prices, use the responses from political surveys without themselves ever having asked a respondent a question, or use ImageNet without the experience of hand-labeling some of the images. We should always throw ourselves into the details of the data.",
"crumbs": [
"Acquisition",
- "6 Farm data "
+ "6 Measurement, censuses, and sampling "
]
},
{
"objectID": "06-farm.html#introduction",
"href": "06-farm.html#introduction",
- "title": "6 Farm data",
+ "title": "6 Measurement, censuses, and sampling",
"section": "",
"text": "Whatever is worth seeing or hearing in India, can be expressed in writing. As soon as every thing of importance is expressed in writing, a man who is duly qualified may attain more knowledge of India, in one year, in his closet in England, than he could obtain during the course of the longest life, by the use of his eyes and his ears in India.\nMill (1817, xv)",
"crumbs": [
"Acquisition",
- "6 Farm data "
+ "6 Measurement, censuses, and sampling "
]
},
{
"objectID": "06-farm.html#measurement",
"href": "06-farm.html#measurement",
- "title": "6 Farm data",
+ "title": "6 Measurement, censuses, and sampling",
"section": "6.2 Measurement",
"text": "6.2 Measurement\nMeasurement is an old concern. Even Aristotle distinguished between quantities and qualities (Tal 2020). Measurement, and especially, the comparison of measurements, underpins all quantitative analysis. But deciding what to measure, and how to do it, is challenging.\nMeasurement is trickier than it seems. For instance, in music, David Peterson, Professor of Political Science, Iowa State University, make it clear how difficult it is to define a one-hit wonder. A surprising number of artists that may immediately come to mind, turn out to have at least one or two other songs that did reasonably well in terms of making it onto charts (Molanphy 2012). Should an analysis of all one-term governments include those that did not make it through a full term? How about those that only lasted a month, or even a week? How do we even begin to measure the extent of government transfers when so much of these are in-kind benefits (Garfinkel, Rainwater, and Smeeding 2006)? How can we measure how well represented a person is in a democracy despite that being the fundamental concern (Achen 1978)? And why should the standard definition used by the World Health Organization (WHO) of pregnancy-related and maternal deaths only include those that occur within 42 days of delivery, termination, or abortion when this has a substantial effect on the estimate (Gazeley et al. 2022)?\nPhilosophy brings more nuance and depth to their definitions of measurement (Tal 2020), but the International Organization Of Legal Metrology (2007, 44) define measurement as the “process of experimentally obtaining one or more quantity values that can reasonably be attributed to a quantity”, where a quantity is a “number and reference together”. 
It implies “comparison of quantities, including counting of entities”, and “presupposes a description of the quantity commensurate with the intended use of a measurement result, a measurement procedure, and a calibrated measuring system\\(\\dots\\)”. This definition of measurement makes clear that we have a variety of concerns including instrumentation and units, and that we are interested in measurements that are valid and reliable.\nInstrumentation refers to what we use to conduct the measurement. Thorough consideration of instrumentation is important because it determines what we can measure. For instance, Morange (2016, 63) describes how the invention of microscopes in the sixteenth century led to the observation of capillaries by Marcello Malpighi in 1661, cells by Robert Hooke in 1665, and bacteria by Antonie van Leeuwenhoek in 1677 (Lane 2015). And consider the measurement of time. Again we see the interaction between instrumentation and measurement. With a sundial it was difficult to be much more specific about elapsed time than an hour or so. But the gradual development of more accurate instruments of timekeeping would eventually enable some sports to differentiate competitors to the thousandth of the second, and through GPS, allow navigation that is accurate to within meters.\n\n\n\n\n\n\nOh, you think we have good data on that!\n\n\n\nKnowing the time is a critical measurement. For instance, Formula 1 times laps to the thousandth of a second. And Michael Phelps, an American swimmer, won a gold medal at the Beijing Olympics by only one-hundredth of a second. Timing allows us to distinguish between outcomes even when the event does not happen concurrently. For instance, think back to the discussion of swimmers by Chambliss (1989) and how this would be impossible without knowing how long each event took each swimmer. Timing is also critical in finance where we need market participants to agree on whether an asset is available for sale. 
But the question “What time is it?” can be difficult to answer. The time, according to some individual, can be set to different sources, and so will differ depending on who you ask. Since the 1970s the definitive answer has been to use atomic time. A cesium second is defined by “9192 631 770 cycles of the hyperfine transition frequency in the ground state of cesium 133” (Levine, Tavella, and Milton 2022, 4). But the problem of clock synchronization—how to have all the non-atomic clocks match atomic time and each other—remains. Hopper (2022) provides an overview of how the Network Time Protocol (NTP) of Mills (1991) enables clock synchronization and some of the difficulties that exist for computer networks to discover atomic time. Another measure of time is astronomical time, which is based on the rotation of the Earth. But because the Earth spins inconsistently, among other issues, adjustments have been made to ensure atomic and astronomical time match. This has resulted in the inclusion of positive leap seconds, and the possibility of a negative leap second, which have created problems (Levine, Tavella, and Milton 2022). As a result, at some point in the future astronomical and atomic time will be allowed to diverge (Gibney 2022; Mitchell 2022b).\n\n\nA common instrument of measurement is a survey, and we discuss these further in Chapter 8. Another commonly-used instrument is sensors. For instance, climate scientists may be interested in temperature, humidity, or pressure. Much analysis of animal movement, such as Leos-Barajas et al. (2016), uses accelerometers. Sensors placed on satellites may be particularly concerned with images, and such data are available from the Landsat Program. Physicists are very concerned with measurement, and can be constrained not only by their instrumentation, but also storage capacity. 
For instance, the ATLAS detector at CERN is focused on the collision of particles, but not all of the measurements can be saved because that would result in 80TB per second (Colombo et al. 2016)! And in the case of A/B testing, which we discuss in Chapter 8, extensive use is made of cookies, beacons, system settings, and behavioral patterns. Another aspect of instrumentation is delivery. For instance, if using surveys, then should they be mailed or online? Should they be filled out by the respondent or by an enumerator?\nThe definition of measurement, provided by metrology, makes it clear that the second fundamental concern is a reference, which we refer to as units. The choice of units is related to both the research question of interest and available instrumentation. For instance, in the Tutorial in Chapter 1 we were concerned with measuring the growth of plants. This would not be well served by using kilometers or miles as a unit. If we were using a ruler, then we may be able to measure millimeters, but with calipers, we might be able to consider tens of micrometers.\n\n6.2.1 Properties of measurements\nValid measurements are those where the quantity that we are measuring is related to the estimand and research question of interest. It speaks to appropriateness. Recall, from Chapter 4, that an estimand is the actual effect, such as the (unknowable) actual effect of smoking on life expectancy. It can be useful to think about estimands as what is actually of interest. This means that we need to ensure that we are measuring relevant aspects of an individual. For instance, the number of cigarettes that they smoked, and the number of years they lived, rather than, say, their opinion about smoking.\nFor some units, such as a meter or a second, there is a clear definition. And when that definition evolves it is widely agreed on (Mitchell 2022a). But for other aspects that we may wish to measure it is less clear and so the validity of the measurement becomes critical. 
At one point in the fourteenth century attempts were made to measure grace and virtue (Crosby 1997, 14)! More recently, we try to measure intelligence or even the quality of a university. That is not to say there are not people with more or less grace, virtue, and intelligence than others, and there are certainly better and worse universities. But the measurement of these is difficult.\nThe U.S. News and World Report tries to quantify university quality based on aspects such as class size, number of faculty with a PhD, and number of full-time faculty. But an issue with such constructed measures, especially in social settings, is that it changes the incentives of those being measured. For instance, Columbia University increased from 18th in 1988 to 2nd in 2022. But Michael Thaddeus, Professor of Mathematics, Columbia University, showed how there was a difference, in Columbia’s favor, between what Columbia reported to U.S. News and World Report and what was available through other sources (Hartocollis 2022).\nSuch concerns are of special importance in psychology because there is no clear measure of many fundamental concepts. Fried, Flake, and Robinaugh (2022) review the measurement of depression and find many concerns including a lack of validity and reliability. This is not to say that we should not try to measure such things, but we should ensure transparency about measurement decisions. For instance, Flake and Fried (2020) recommend answering various clarifying questions whenever measurements have to be constructed. These include questioning the underlying construct of interest, the decision process that led to the measure, what alternatives were considered, the quantification process, and the scale. These questions are especially important when the measure is being constructed for a particular purpose, rather than being adopted from elsewhere. 
This is because of the concern that the measure will be constructed in a way that provides a pre-ordained outcome.\nReliability draws on the part of the definition of measurement that reads “process of experimentally obtaining\\(\\dots\\)”. It implies some degree of consistency and means that multiple measurements of one particular aspect, at one particular time, should be essentially the same. If two enumerators count the number of shops on a street, then we would hope that their counts are the same. And if they were different then we would hope we could understand the reason for the difference. For instance, perhaps one enumerator misunderstood the instructions and incorrectly counted only shops that were open. To consider another example, demographers are often concerned with the migration of people between countries, and economists are often concerned with international trade. It is concerning the number of times that the in-migration or imports data of Country A from Country B do not match the out-migration or exports data of Country B to Country A.\n\n\n\n\n\n\nOh, you think we have good data on that!\n\n\n\nIt is common for the pilot of a plane to announce the altitude to their passengers. But the notion and measurement of altitude is deceptively complicated, and underscores the fact that measurement occurs within a broader context (Vanhoenacker 2015). For instance, if we are interested in how many meters there are between the plane and the ground, then should we measure the difference between the ground and where the pilot is sitting, which would be most relevant for the announcement, or to the bottom of the wheels, which would be most relevant for landing? What happens if we go over a mountain? Even if the plane has not descended, such a measure—the number of meters between the plane and the ground—would claim a reduction in altitude and make it hard to vertically separate multiple planes. We may be interested in a comparison to sea level. 
But sea level changes because of the tide, and is different at different locations. As such, a common measure of altitude is flight level, which is determined by the amount of air pressure. And because air pressure is affected by weather, season, and location, the one flight level may be associated with very different numbers of meters to the ground over the course of a flight. The measures of altitude used by planes serve their purpose of enabling relatively safe air travel.\n\n\n\n\n6.2.2 Measurement error\nMeasurement error is the difference between the value we observe and the actual value. Sometimes it is possible to verify certain responses. If the difference is consistent between the responses that we can verify and those that we cannot, then we are able to estimate the extent of overall measurement error. For instance, Sakshaug, Yan, and Tourangeau (2010) considered a survey of university alumni and compared replies about a respondent’s grades with their university record. They find that the mode of the survey—telephone interview conducted by a human, telephone interview conducted by a computer, or an internet survey—affected the extent of the measurement error.\nSuch error can be particularly pervasive when an enumerator fills out the survey form on behalf of the respondent. This is especially of concern around race. For instance, Davis (1997, 177) describes how Black people in the United States may limit the extent to which they describe their political and racial belief to white interviewers.\nAnother example is censored data, which is when we have some partial knowledge of the actual value. Right-censored data is when we know that the actual value is above some observed value, but we do not know by how much. For instance, immediately following the Chernobyl disaster in 1986, the only available instruments to measure radiation had a certain maximum limit. 
While the radiation was measured as being at that (maximum) level, the implication was that the actual value was much higher.\nRight-censored data are often seen in medical studies. For instance, say some experiment is conducted, and then patients are followed for ten years. At the end of that ten-year period all we know is whether a patient lived at least ten years, not the exact length of their life. Left-censored data is the opposite situation. For instance, consider a thermometer that only went down to freezing. Even when the actual temperature was less, the thermometer would still register that as freezing.\nA slight variation of censored data is winsorizing data. This occurs when we observe the actual value, but we change it to a less extreme one. For instance, if we were considering age then we may change the age of anyone older than 100 to be 100. We may do this if we are worried that values that were too large would have too significant of an effect.\nTruncated data is a slightly different situation in which we do not even record those values. For instance, consider a situation in which we were interested in the relationship between a child’s age and height. Our first question might be “what is your age?” and if it turns out the respondent is an adult, then we would not continue to ask height. Truncated data are especially closely related to selection bias. For instance, consider a student who drops a course—their opinion is not measured on course evaluations.\nTo illustrate the difference between these concepts, consider a situation in which the actual distribution of newborn baby weight has a normal distribution, centered around 3.5kg. Imagine there is some defect with the scale, such that any value less than or equal to 2.75kg is assigned 2.75kg. And imagine there is some rule such that any baby expected to weigh more than 4.25kg is transferred to a different hospital to be born. These three scenarios are illustrated in Figure 6.1. 
We may also be interested in considering the mean weight, which highlights the bias (Table 6.1).\n\nset.seed(853)\n\nnewborn_weight <-\n tibble(\n weight = rep(\n x = rnorm(n = 1000, mean = 3.5, sd = 0.5), \n times = 3),\n measurement = rep(\n x = c(\"Actual\", \"Censored\", \"Truncated\"),\n each = 1000)\n )\n\nnewborn_weight <-\n newborn_weight |>\n mutate(\n weight = case_when(\n weight <= 2.75 & measurement == \"Censored\" ~ 2.75,\n weight >= 4.25 & measurement == \"Truncated\" ~ NA_real_,\n TRUE ~ weight\n )\n )\n\nnewborn_weight |>\n ggplot(aes(x = weight)) +\n geom_histogram(bins = 50) +\n facet_wrap(vars(measurement)) +\n theme_minimal()\n\n\n\n\n\n\n\nFigure 6.1: Comparison of actual weights with censored and truncated weights\n\n\n\n\n\n\nnewborn_weight |>\n summarise(mean = mean(weight, na.rm = TRUE),\n .by = measurement) |>\n kable(\n col.names = c(\"Measurement\", \"Mean\"),\n digits = 3\n )\n\n\n\nTable 6.1: Comparing the means of the different scenarios identifies the bias\n\n\n\n\n\n\nMeasurement\nMean\n\n\n\n\nActual\n3.521\n\n\nCensored\n3.530\n\n\nTruncated\n3.455\n\n\n\n\n\n\n\n\n\n\n6.2.3 Missing data\nRegardless of how good our data acquisition process is, there will be missing data. That is, observations that we know we do not have. But a variable must be measured, or at least thought about and considered, in order to be missing. With insufficient consideration, there is the danger of missing data that we do not even know are missing because the variables were never considered. They are missing in a “dog that did not bark” sense. This is why it is so important to think about the situation, sketch and simulate, and work with subject-matter experts.\nNon-response could be considered a variant of measurement error whereby we observe a null, even though there should be an actual value. But it is usually considered in its own right. 
And there are different extents of non-response: from refusing to even respond to the survey, through to just missing one question. Non-response is a key issue, especially with non-probability samples, because there is usually good reason to consider that people who do not respond are systematically different to those who do. And this serves to limit the extent to which the survey can be used to speak to more than just survey respondents. Gelman et al. (2016) go so far as to say that many of the changes in public opinion that are reported in the lead-up to an election are not people changing their mind, but differential non-response. That is, individuals choosing whether to respond to a survey at all depending on the circumstances, not just choosing which survey response to give. The use of pre-notification and reminders may help address non-response in some circumstances (Koitsalu et al. 2018; Frandell et al. 2021).\nData might be missing because a respondent did not want to respond to one particular question, a particular collection of related questions, or the entire survey, although these are not mutually exclusive nor collectively exhaustive (Newman 2014). In an ideal situation data are Missing Completely At Random (MCAR). This rarely occurs, but if it does, then inference should still be reflective of the broader population. It is more likely that data are Missing At Random (MAR) or Missing Not At Random (MNAR). The extent to which we must worry about that differs. For instance, if we are interested in the effect of gender on political support, then it may be that men are less likely to respond to surveys, but this is not related to who they will support. If that differential response is only due to being a man, and not related to political support, then we may be able to continue, provided we include gender in the regression, or are able to post-stratify based on gender. 
That said, the likelihood of this independence holding is low, and it is more likely, as in Gelman et al. (2016), that there is a relationship between responding to the survey and political support. In that more likely case, we may have a more significant issue. One approach would be to consider additional explanatory variables. It is tempting to drop incomplete cases, but this may further bias the sample, and requires justification and the support of simulation. Data imputation could be considered, but again may bias the sample. Ideally we could rethink, and improve, the data collection process.\nWe return to missing data in Chapter 11.",
"crumbs": [
"Acquisition",
- "6 Farm data "
+ "6 Measurement, censuses, and sampling "
]
},
{
"objectID": "06-farm.html#censuses-and-other-government-data",
"href": "06-farm.html#censuses-and-other-government-data",
- "title": "6 Farm data",
+ "title": "6 Measurement, censuses, and sampling",
"section": "6.3 Censuses and other government data",
"text": "6.3 Censuses and other government data\nThere are a variety of sources of data that have been produced for the purposes of being used as datasets. One thinks here especially of censuses of population. Whitby (2020, 30–31) describes how the earliest censuses for which we have written record are from China’s Yellow River Valley. One motivation for censuses was taxation, and Jones (1953) describes census records from the late third or early fourth century A.D. which enabled a new system of taxation. Detailed records, such as censuses, have also been abused. For instance, Luebke and Milton (1994, 25) set out how the Nazis used censuses and police registration datasets to “locate groups eventually slated for deportation and death”. And Bowen (2022, 17) describes how the United States Census Bureau provided information that contributed to the internment of Japanese Americans. President Clinton apologized for this in the 1990s.\nAnother source of data deliberately put together to be a dataset is official statistics, such as surveys of economic conditions including unemployment, inflation, and GDP. Interestingly, Rockoff (2019) describes how these economic statistics were not actually developed by the federal government, even though governments typically eventually took over that role. Censuses and other government-run surveys have the power, and financial resources, of the state behind them, which enables them to be thorough in a way that other datasets cannot be. For instance, the 2020 United States Census is estimated to have cost US$15.6 billion (Hawes 2020). But this similarly brings a specific perspective. Census data, like all data, are not unimpeachable. Common errors include under- and over-enumeration, as well as misreporting (Steckel 1991). There are various measures and approaches used to assess quality (Statistics Canada 2023).\n\n\n\n\n\n\nOh, you think we have good data on that!\n\n\n\nCensuses of population are critical, but not unimpeachable. 
Anderson and Fienberg (1999) describe how the history of the census in the United States is one of undercount, and that even George Washington complained about this in the 1790s. The extent of the undercount was estimated due to the Selective Service registration system used for conscription in World War II. Those records were compared with census records, and it was found that there were about half a million more men recorded for conscription purposes than in the census. This was race-specific, with an average undercount of around 3 per cent, but an undercount of Black men of draft age of around 13 per cent (Anderson and Fienberg 1999, 29). This became a political issue in the 1960s, and race and ethnicity related questions were of special concern in the 1990s. Nobles (2002, 47) discusses how counting by race first requires that race exists, but that this may be biologically difficult to establish. Despite how fundamental race is to the United States census it is not something that is “fixed” and “objective” but instead has influences from class, social, legal, structural, and political aspects (Nobles 2002, 48).\n\n\n\n\n\n\n\n\nShoulders of giants\n\n\n\nMargo Anderson is Distinguished Professor, History and Urban Studies, at University of Wisconsin–Milwaukee. After earning a PhD in History from Rutgers University in 1978, she joined the University of Wisconsin–Milwaukee, and was promoted to professor in 1987. In addition to Anderson and Fienberg (1999), another important book that she wrote is Anderson ([1988] 2015). She was appointed a Fellow of the American Statistical Association in 1998.\n\n\nAnother similarly large and established source of data are from long-running large surveys. These are conducted on a regular basis, and while not usually directly conducted by the government, they are usually funded, one way or another, by the government. 
For instance, here we often think of electoral surveys, such as the Canadian Election Study, which has run in association with every federal election since 1965, and similarly the British Election Study which has been associated with every general election since 1964.\nMore recently there has been a large push toward open data in government. The underlying principle—that the government should make available the data that it has—is undeniable. But the term has become a little contentious because of how it has occurred in practice. Governments only provide data that they want to provide. We may even sometimes see manipulation of data to suit a government’s narrative (Kalgin 2014; Zhang et al. 2019; Berdine, Geloso, and Powell 2018). One way to get data that the government has, but does not necessarily want to provide, is to submit a Freedom of Information (FOI) request (Walby and Luscombe 2019). For instance, Cardoso (2020) use data from FOI to find evidence of systematic racism in the Canadian prison system.\nWhile farmed datasets have always been useful, they were developed for a time when much analysis was conducted without the use of programming languages. Many R packages have been developed to make it easier to get these datasets into R. Here we cover a few that are especially useful.\n\n6.3.1 Canada\nThe first census in Canada was conducted in 1666. This was also the first modern census where every individual was recorded by name, although it does not include Aboriginal peoples (Godfrey 1918, 179). There were 3,215 inhabitants that were counted, and the census asked about age, sex, marital status, and occupation (Statistics Canada 2023). In association with Canadian Confederation, in 1867 a decennial census was required so that political representatives could be allocated for the new Parliament. Regular censuses have occurred since then.\nWe can explore some data on languages spoken in Canada from the 2016 Census using canlang. 
This package is not on CRAN, but can be installed from GitHub with: install.packages(\"devtools\") then devtools::install_github(\"ttimbers/canlang\").\nAfter loading canlang we can use the can_lang dataset. This provides the number of Canadians who use each of 214 languages.\n\ncan_lang\n\n# A tibble: 214 × 6\n category language mother_tongue most_at_home most_at_work lang_known\n <chr> <chr> <dbl> <dbl> <dbl> <dbl>\n 1 Aboriginal langu… Aborigi… 590 235 30 665\n 2 Non-Official & N… Afrikaa… 10260 4785 85 23415\n 3 Non-Official & N… Afro-As… 1150 445 10 2775\n 4 Non-Official & N… Akan (T… 13460 5985 25 22150\n 5 Non-Official & N… Albanian 26895 13135 345 31930\n 6 Aboriginal langu… Algonqu… 45 10 0 120\n 7 Aboriginal langu… Algonqu… 1260 370 40 2480\n 8 Non-Official & N… America… 2685 3020 1145 21930\n 9 Non-Official & N… Amharic 22465 12785 200 33670\n10 Non-Official & N… Arabic 419890 223535 5585 629055\n# ℹ 204 more rows\n\n\nWe can quickly see the top-ten most common languages to have as a mother tongue.\n\ncan_lang |>\n slice_max(mother_tongue, n = 10) |>\n select(language, mother_tongue)\n\n# A tibble: 10 × 2\n language mother_tongue\n <chr> <dbl>\n 1 English 19460850\n 2 French 7166700\n 3 Mandarin 592040\n 4 Cantonese 565270\n 5 Punjabi (Panjabi) 501680\n 6 Spanish 458850\n 7 Tagalog (Pilipino, Filipino) 431385\n 8 Arabic 419890\n 9 German 384040\n10 Italian 375635\n\n\nWe could combine two datasets: region_lang and region_data, to see if the five most common languages differ between the largest region, Toronto, and the smallest, Belleville.\n\nregion_lang |>\n left_join(region_data, by = \"region\") |>\n slice_max(c(population)) |>\n slice_max(mother_tongue, n = 5) |>\n select(region, language, mother_tongue, population) |>\n mutate(prop = mother_tongue / population)\n\n# A tibble: 5 × 5\n region language mother_tongue population prop\n <chr> <chr> <dbl> <dbl> <dbl>\n1 Toronto English 3061820 5928040 0.516 \n2 Toronto Cantonese 247710 5928040 0.0418\n3 
Toronto Mandarin 227085 5928040 0.0383\n4 Toronto Punjabi (Panjabi) 171225 5928040 0.0289\n5 Toronto Italian 151415 5928040 0.0255\n\nregion_lang |>\n left_join(region_data, by = \"region\") |>\n slice_min(c(population)) |>\n slice_max(mother_tongue, n = 5) |>\n select(region, language, mother_tongue, population) |>\n mutate(prop = mother_tongue / population)\n\n# A tibble: 5 × 5\n region language mother_tongue population prop\n <chr> <chr> <dbl> <dbl> <dbl>\n1 Belleville English 93655 103472 0.905 \n2 Belleville French 2675 103472 0.0259 \n3 Belleville German 635 103472 0.00614\n4 Belleville Dutch 600 103472 0.00580\n5 Belleville Spanish 350 103472 0.00338\n\n\nWe can see a considerable difference between the proportions, with a little over 50 per cent of those in Toronto having English as their mother tongue, compared with around 90 per cent of those in Belleville.\nIn general, data from Canadian censuses are not as easily available through the relevant government agency as in other countries, although the Integrated Public Use Microdata Series (IPUMS), which we discuss later, provides access to some. Statistics Canada, which is the government agency that is responsible for the census and other official statistics, freely provides an “Individuals File” from the 2016 census as a Public Use Microdata File (PUMF), but only in response to request. And while it is a 2.7 per cent sample from the 2016 census, this PUMF provides limited detail.\nAnother way to access data from the Canadian census is to use cancensus. It requires an API key, which can be requested by creating an account and then going to “edit profile”. The package has a helper function that makes it easier to add the API key to an “.Renviron” file, which we will explain in more detail in Chapter 7.\nAfter installing and loading cancensus we can use get_census() to get census data. We need to specify a census of interest, and a variety of other arguments. 
For instance, we could get data from the 2016 census about Ontario, which is the largest Canadian province by population.\n\nset_api_key(\"ADD_YOUR_API_KEY_HERE\", install = TRUE)\n\nontario_population <-\n get_census(\n dataset = \"CA16\",\n level = \"Regions\",\n vectors = \"v_CA16_1\",\n regions = list(PR = c(\"35\"))\n )\n\nontario_population\n\n\n\n# A tibble: 1 × 9\n GeoUID Type `Region Name` `Area (sq km)` Population Dwellings Households\n <chr> <fct> <fct> <dbl> <dbl> <dbl> <dbl>\n1 35 PR Ontario 986722. 13448494 5598391 5169174\n# ℹ 2 more variables: C_UID <chr>, `v_CA16_1: Age Stats` <dbl>\n\n\nData for censuses since 1996 are available, and list_census_datasets() provides the metadata that we need to provide to get_census() to access these. Data are available based on a variety of regions, and list_census_regions() provides the metadata that we need. Finally, list_census_vectors() provides the metadata about the variables that are available.\n\n\n6.3.2 United States\n\n6.3.2.1 Census\nThe requirement for a census is included in the United States Constitution, although births and deaths were legally required to be registered in what became Massachusetts as early as 1639 (Gutman 1958). After installing and loading it we can use tidycensus to get started with access to United States census data. As with cancensus, we first need to obtain an API key from the Census Bureau API and store it locally using a helper function.\nHaving set that up, we can use get_decennial() to obtain data on variables of interest. 
As an example, we could gather data about the average household size in 2010 overall, and by owner or renter, for certain states (Figure 6.2).\n\ncensus_api_key(\"ADD_YOUR_API_KEY_HERE\")\n\nus_ave_household_size_2010 <-\n get_decennial(\n geography = \"state\",\n variables = c(\"H012001\", \"H012002\", \"H012003\"),\n year = 2010\n )\n\nus_ave_household_size_2010 |>\n filter(NAME %in% c(\"District of Columbia\", \"Utah\", \"Massachusetts\")) |>\n ggplot(aes(y = NAME, x = value, color = variable)) +\n geom_point() +\n theme_minimal() +\n labs(\n x = \"Average household size\", y = \"State\", color = \"Household type\"\n ) +\n scale_color_brewer(\n palette = \"Set1\", labels = c(\"Total\", \"Owner occupied\", \"Renter occupied\")\n )\n\n\n\n\n\n\n\n\n\nFigure 6.2: Comparing average household size in DC, Utah, and Massachusetts, by household type\n\n\n\n\n\nWalker (2022) provides further detail about analyzing United States census data with R.\n\n\n6.3.2.2 American Community Survey\nThe United States is in the enviable situation where there is usually a better approach than using the census and there is a better way than having to use government statistical agency websites. IPUMS provides access to a wide range of datasets, including international census microdata. In the specific case of the United States, the American Community Survey (ACS) is a survey whose content is comparable to the questions asked on many censuses, but it is available on an annual basis, compared with a census which could be quite out-of-date by the time the data are available. It ends up with millions of responses each year. Although the ACS is smaller than a census, the advantage is that it is available on a more timely basis. We access the ACS through IPUMS.\n\n\n\n\n\n\nShoulders of giants\n\n\n\nSteven Ruggles is Regents Professor of History and Population Studies at the University of Minnesota and is in charge of IPUMS. 
After earning a PhD in historical demography from the University of Pennsylvania in 1984, he was appointed as an assistant professor at the University of Minnesota, and promoted to full professor in 1995. The initial IPUMS data release was in 1993 (Sobek and Ruggles 1999). Since then it has grown and now includes social and economic data from many countries. Ruggles was awarded a MacArthur Foundation Fellowship in 2022.\n\n\nGo to IPUMS, then “IPUMS USA”, and click “Get Data”. We are interested in a sample, so go to “SELECT SAMPLE”. Un-select “Default sample from each year” and instead select “2019 ACS” and then “SUBMIT SAMPLE SELECTIONS” (Figure 6.3 (a)).\n\n\n\n\n\n\n\n\n\n\n\n(a) Selecting a sample from IPUMS USA and specifying interest in the 2019 ACS\n\n\n\n\n\n\n\n\n\n\n\n(b) Specifying that we are interested in the state\n\n\n\n\n\n\n\n\n\n\n\n\n\n(c) Adding STATEICP to the cart\n\n\n\n\n\n\n\n\n\n\n\n(d) Beginning the checkout process\n\n\n\n\n\n\n\n\n\n\n\n\n\n(e) Specifying that we are interested in .dta files\n\n\n\n\n\n\n\n\n\n\n\n(f) Reducing the sample size from three million responses to half a million\n\n\n\n\n\n\n\nFigure 6.3: An overview of the steps involved in getting data from IPUMS\n\n\n\nWe might be interested in data based on state. We would begin by looking at “HOUSEHOLD” variables and selecting “GEOGRAPHIC” (Figure 6.3 (b)). We add “STATEICP” to our “cart” by clicking the plus, which will then turn into a tick (Figure 6.3 (c)). We might then be interested in data on a “PERSON” basis, for instance, “DEMOGRAPHIC” variables such as “AGE”, which we should add to our cart. We also want “SEX” and “EDUC” (both are in “PERSON”).\nWhen we are done, we can click “VIEW CART”, and then click “CREATE DATA EXTRACT” (Figure 6.3 (d)). 
At this point there are two aspects that we likely want to change:\n\nChange the “DATA FORMAT” from “.dat” to “.dta” (Figure 6.3 (e)).\nCustomize the sample size as we likely do not need three million responses, and could just change it to, say, 500,000 (Figure 6.3 (f)).\n\nBriefly check the dimensions of the request. It should not be much more than around 40MB. If it is then check whether there are variables accidentally selected that are not needed or further reduce the number of observations.\nFinally, we want to include a descriptive name for the extract, for instance, “2023-05-15: State, age, sex, education”, which specifies the date we made the extract and what is in the extract. After that we can click “SUBMIT EXTRACT”.\nWe will be asked to log in or create an account, and after doing that will be able to submit the request. IPUMS will email when the extract is available, after which we can download it and read it into R in the usual way. We assume the dataset has been saved locally as “usa_00015.dta” (your dataset may have a slightly different filename).\nIt is critical that we cite this dataset when we use it. For instance we can use the following BibTeX entry for Ruggles et al. (2021).\n@misc{ipumsusa,\n author = {Ruggles, Steven and Flood, Sarah and Foster, Sophia and Goeken, Ronald and Pacas, Jose and Schouweiler, Megan and Sobek, Matthew},\n year = 2021,\n title = {IPUMS USA: Version 11.0},\n publisher = {Minneapolis, MN: IPUMS},\n doi = {10.18128/d010.v11.0},\n url = {https://usa.ipums.org},\n language = {en},\n}\nWe will briefly tidy and prepare this dataset because we will use it in Chapter 16. 
Our code is based on Mitrovski, Yang, and Wankiewicz (2020).\n\nipums_extract <- read_dta(\"usa_00015.dta\")\n\nipums_extract <- \n ipums_extract |>\n select(stateicp, sex, age, educd) |>\n to_factor()\n\nipums_extract\n\n\n\n# A tibble: 500,221 × 4\n stateicp sex age educd \n * <fct> <fct> <fct> <fct> \n 1 alabama male 77 grade 9 \n 2 alabama male 62 1 or more years of college credit, no degree\n 3 alabama male 25 ged or alternative credential \n 4 alabama female 20 1 or more years of college credit, no degree\n 5 alabama male 37 1 or more years of college credit, no degree\n 6 alabama female 19 regular high school diploma \n 7 alabama female 67 regular high school diploma \n 8 alabama female 20 1 or more years of college credit, no degree\n 9 alabama male 66 grade 8 \n10 alabama male 58 regular high school diploma \n# ℹ 500,211 more rows\n\n\n\ncleaned_ipums <-\n ipums_extract |>\n mutate(age = as.numeric(age)) |>\n filter(age >= 18) |>\n rename(gender = sex) |>\n mutate(\n age_group = case_when(\n age <= 29 ~ \"18-29\",\n age <= 44 ~ \"30-44\",\n age <= 59 ~ \"45-59\",\n age >= 60 ~ \"60+\",\n TRUE ~ \"Trouble\"\n ),\n education_level = case_when(\n educd %in% c(\n \"nursery school, preschool\", \"kindergarten\", \"grade 1\",\n \"grade 2\", \"grade 3\", \"grade 4\", \"grade 5\", \"grade 6\",\n \"grade 7\", \"grade 8\", \"grade 9\", \"grade 10\", \"grade 11\",\n \"12th grade, no diploma\", \"regular high school diploma\",\n \"ged or alternative credential\", \"no schooling completed\"\n ) ~ \"High school or less\",\n educd %in% c(\n \"some college, but less than 1 year\",\n \"1 or more years of college credit, no degree\"\n ) ~ \"Some post sec\",\n educd %in% c(\"associate's degree, type not specified\",\n \"bachelor's degree\") ~ \"Post sec +\",\n educd %in% c(\n \"master's degree\",\n \"professional degree beyond a bachelor's degree\",\n \"doctoral degree\"\n ) ~ \"Grad degree\",\n TRUE ~ \"Trouble\"\n )\n ) |>\n select(gender, age_group, education_level, 
stateicp) |>\n mutate(across(c(\n gender, stateicp, education_level, age_group),\n as_factor)) |>\n mutate(age_group =\n factor(age_group, levels = c(\"18-29\", \"30-44\", \"45-59\", \"60+\")))\n\ncleaned_ipums\n\n# A tibble: 407,354 × 4\n gender age_group education_level stateicp\n <fct> <fct> <fct> <fct> \n 1 male 60+ High school or less alabama \n 2 male 60+ Some post sec alabama \n 3 male 18-29 High school or less alabama \n 4 female 18-29 Some post sec alabama \n 5 male 30-44 Some post sec alabama \n 6 female 18-29 High school or less alabama \n 7 female 60+ High school or less alabama \n 8 female 18-29 Some post sec alabama \n 9 male 60+ High school or less alabama \n10 male 45-59 High school or less alabama \n# ℹ 407,344 more rows\n\n\nWe will draw on this dataset in Chapter 16, so we will save it.\n\nwrite_csv(x = cleaned_ipums,\n file = \"cleaned_ipums.csv\")\n\nWe can also have a look at some of the variables (Figure 6.4).\n\ncleaned_ipums |>\n ggplot(mapping = aes(x = age_group, fill = gender)) +\n geom_bar(position = \"dodge2\") +\n theme_minimal() +\n labs(\n x = \"Age-group of respondent\",\n y = \"Number of respondents\",\n fill = \"Gender\"\n ) +\n facet_wrap(vars(education_level)) +\n guides(x = guide_axis(angle = 90)) +\n theme(legend.position = \"bottom\") +\n scale_fill_brewer(palette = \"Set1\")\n\n\n\n\n\n\n\nFigure 6.4: Examining counts of the IPUMS ACS sample, by age, gender and education\n\n\n\n\n\nFull count data (that is, the entire census) are available through IPUMS for United States censuses conducted between 1850 and 1940, with the exception of 1890. Most of the 1890 census records were destroyed due to a fire in 1921. One per cent samples are available for all censuses through to 1990. ACS data are available from 2000.",
"crumbs": [
"Acquisition",
- "6 Farm data "
+ "6 Measurement, censuses, and sampling "
]
},
{
"objectID": "06-farm.html#sampling-essentials",
"href": "06-farm.html#sampling-essentials",
- "title": "6 Farm data",
+ "title": "6 Measurement, censuses, and sampling",
"section": "6.4 Sampling essentials",
"text": "6.4 Sampling essentials\nStatistics is at the heart of telling stories with data because it is almost never possible to get all the data that we would like. Statisticians have spent considerable time and effort thinking about the properties that various samples of data will have and how they enable us to speak to implications for the broader population.\nLet us say that we have some data. For instance, a particular toddler goes to sleep at 6:00pm every night. We might be interested to know whether that bedtime is common among all toddlers, or if we have an unusual toddler. If we only had one toddler then our ability to use their bedtime to speak about all toddlers would be limited.\nOne approach would be to talk to friends who also have toddlers. And then talk to friends-of-friends. How many friends, and friends-of-friends, do we have to ask before we can begin to feel comfortable that we can speak about some underlying truth of toddler bedtime?\nWu and Thompson (2020, 3) describe statistics as “the science of how to collect and analyze data and draw statements and conclusions about unknown populations.” Here “population” is used in a statistical sense and refers to some infinite group that we can never know exactly, but that we can use the probability distributions of random variables to describe the characteristics of. We discuss probability distributions in more detail in Chapter 12. Fisher ([1925] 1928, 41) goes further and says:\n\n[t]he idea of an infinite population distributed in a frequency distribution in respect of one or more characters is fundamental to all statistical work. 
From a limited experience,\\(\\dots\\) we may obtain some idea of the infinite hypothetical population from which our sample is drawn, and so of the probable nature of future samples to which our conclusions are to be applied.\n\nAnother way to say this is that statistics involves getting some data and trying to say something sensible based on it even though we can never have all of the data.\nThree pieces of critical terminology are:\n\n“Target population”: The collection of all items about which we would like to speak.\n“Sampling frame”: A list of all the items from the target population that we could get data about.\n“Sample”: The items from the sampling frame that we get data about.\n\nA target population is a finite set of labelled items, of size \\(N\\). For instance, in theory we could add a label to all the books in the world: “Book 1”, “Book 2”, “Book 3”, \\(\\dots\\), “Book \\(N\\)”. There is a difference between the use of the term population here, and that of everyday usage. For instance, one sometimes hears those who work with census data say that they do not need to worry about sampling because they have the whole population of the country. This is a conflation of the terms, as what they have is the sample gathered by the census of the population of a country. While the goal of a census is to get every unit—and if this was achieved then sampling error would be less of an issue—there would still be many other issues. Even if a census was done perfectly and we got data about every unit in the target population, there are still issues, for instance due to measurement error, and it being a sample at a particular time. Groves and Lyberg (2010) provide a discussion of the evolution of total survey error.\nIn the same way that we saw how difficult it can be to define what to measure, it can be difficult to define a target population. For instance, say we have been asked to find out about the consumption habits of university students. 
How can we define that target population? If someone is a student, but also works full time, then are they in the population? What about mature-aged students, who might have different responsibilities? Some aspects that we might be interested in are formally defined to an extent that is not always commonly realized. For instance, whether an area is classified as urban or rural is often formally defined by a country’s statistical agency. But other aspects are less clear. Gelman, Hill, and Vehtari (2020, 24) discuss the difficulty of how we might classify someone as a “smoker”. If a 15-year-old has had 100 cigarettes over their lifetime, then we need to treat them differently than if they have had none. But if a 90-year-old has had 100 cigarettes over their lifetime, then are they likely different to a 90-year-old who has had none? At what age, and number of cigarettes, do these answers change?\nConsider if we want to speak to the titles of all the books ever written. Our target population is all books ever written. But it is almost impossible for us to imagine that we could get information about the title of a book that was written in the nineteenth century, but that the author locked in their desk and never told anyone about. One sampling frame could be all books in the Library of Congress Online Catalog, another could be the 25 million books that were digitized by Google (Somers 2017). Our sample may be the tens of thousands of books that are available through Project Gutenberg, which we will use in later chapters.\nTo consider another example, consider wanting to speak of the attitudes of all Brazilians who live in Germany. The target population is all Brazilians who live in Germany. One possible source of information would be Facebook and so in that case, the sampling frame might be all Brazilians who live in Germany who have Facebook. And then our sample might be all Brazilians who live in Germany who have Facebook who we can gather data about. 
The target population and the sampling frame will be different because not all Brazilians who live in Germany will have Facebook. And the sampling frame will be different to the sample because we will likely not be able to gather data about all Brazilians who live in Germany and have Facebook.\n\n6.4.1 Sampling in Dublin and Reading\nTo be clearer, we consider two examples: a 1798 count of the number of inhabitants of Dublin, Ireland (Whitelaw 1805), and a 1912 count of working-class households in Reading, England (Bowley 1913).\n\n6.4.1.1 Survey of Dublin in 1798\nIn 1798 the Reverend James Whitelaw conducted a survey of Dublin, Ireland, to count its population. Whitelaw (1805) describes how population estimates at the time varied considerably. For instance, the estimated size of London at the time ranged from 128,570 to 300,000 people. Whitelaw expected that the Lord Mayor of Dublin could compel the person in charge of each house to affix a list of the inhabitants of that house to the door, and then Whitelaw could simply use this.\nInstead, he found that the lists were “frequently illegible, and generally short of the actual number by a third, or even one-half”. And so instead he recruited assistants, and they went door-to-door making their own counts. The resulting estimates are particularly informative (Figure 6.5). The total population of Dublin in 1798 was estimated at 182,370.\n\n\n\n\n\n\nFigure 6.5: Extract of the results that Whitelaw found in 1798\n\n\n\nOne aspect worth noticing is that Whitelaw includes information about class. It is difficult to know how that was determined, but it played a large role in the data collection. Whitelaw describes how the houses of “the middle and upper classes always contained some individual who was competent to the task [of making a list]”. But that “among the lower class, which forms the great mass of the population of this city, the case was very different”. 
It is difficult to see how Whitelaw could have known that without going into the houses of both upper and lower classes. But it is also difficult to imagine Whitelaw going into the houses of the upper class and counting their number. It may be that different approaches were needed.\nWhitelaw attempted to construct a full sample of the inhabitants of Dublin without using much in the way of statistical machinery to guide his choices. We will now consider a second example, conducted in 1912, where they were able to start to use sampling approaches that we still use today.\n\n\n6.4.1.2 Survey of working-class households in Reading in 1912\nA little over one hundred years after Whitelaw (1805), Bowley (1913) was interested in counting the number of working-class households in Reading, England. Bowley selected the sample using the following procedure (Bowley 1913, 672):\n\nOne building in ten was marked throughout the local directory in alphabetical order of streets, making about 1,950 in all. Of those about 300 were marked as shops, factories, institutions and non-residential buildings, and about 300 were found to be indexed among Principal Residents, and were so marked. The remaining 1,350 were working-class houses\\(\\dots\\) [I]t was decided to take only one house in 20, rejecting the incomplete information as to the intermediate tenths. The visitors were instructed never to substitute another house for that marked, however difficult it proved to get information, or whatever the type of house.\n\nBowley (1913) says that they were able to obtain information about 622 working-class households. 
For instance, they were able to estimate how much rent was paid each week (Figure 6.6).\n\n\n\n\n\n\nFigure 6.6: Extract of the results that Bowley found about rent paid by the working-class in Reading, England\n\n\n\nThen, having judged from the census that there were about 18,000 households in Reading, Bowley (1913) applied a multiplier of 21 to the sample, resulting in estimates for Reading overall. The key aspect that ensures the resulting estimates are reasonable is that the sampling was done in a random way. This is why Bowley (1913) was so insistent that the visitors go to the actual house that was selected, and not substitute another for it.\n\n\n\n6.4.2 Probabilistic sampling\nHaving identified a target population and a sampling frame, we need to distinguish between probability and non-probability sampling:\n\n“Probability sampling”: Every unit in the sampling frame has some known chance of being sampled and the specific sample is obtained randomly based on these chances. The chance of being sampled does not necessarily need to be the same for each unit.\n“Non-probability sampling”: Units from the sampling frame are sampled based on convenience, quotas, judgement, or other non-random processes.\n\nOften the difference between probability and non-probability sampling is one of degree. For instance, we usually cannot forcibly obtain data, and so there is almost always an aspect of volunteering on the part of a respondent. Even when there are penalties for not providing data, such as the case for completing a census form in many countries, it is difficult for even a government to force people to fill it out completely or truthfully. Famously, in the 2001 New Zealand census more than one per cent of the population listed their religion as “Jedi” (Taylor 2015). The most important aspect to be clear about with probability sampling is the role of uncertainty. This allows us to make claims about the population, based on our sample, with known amounts of error. 
The trade-off is that probability sampling is often expensive and difficult.\nWe will consider four types of probability sampling:\n\nsimple random;\nsystematic;\nstratified; and\ncluster.\n\nTo add some more specificity to our discussion, in a way that is also used by Lohr ([1999] 2022, 27), it may help to consider the numbers one to 100 as our target population. With simple random sampling, every unit has the same chance of being included. In this case let us say it is 20 per cent. That means we would expect to have around 20 units in our sample, or around one in five compared with our target population.\n\nset.seed(853)\n\nillustrative_sampling <- tibble(\n unit = 1:100,\n simple_random_sampling =\n sample(x = c(\"In\", \"Out\"), \n size = 100, \n replace = TRUE, \n prob = c(0.2, 0.8))\n )\n\nillustrative_sampling |>\n count(simple_random_sampling)\n\n# A tibble: 2 × 2\n simple_random_sampling n\n <chr> <int>\n1 In 14\n2 Out 86\n\n\nWith systematic sampling, as was used by Bowley (1913), we proceed by selecting some value, and we then sample every fifth unit to obtain a 20 per cent sample. To begin, we randomly pick a starting point from units one to five, say three. And so sampling every fifth unit would mean looking at the third, the eighth, the thirteenth, and so on.\n\nset.seed(853)\n\nstarting_point <- sample(x = c(1:5), size = 1)\n\nillustrative_sampling <-\n illustrative_sampling |>\n mutate(\n systematic_sampling =\n if_else(unit %in% seq.int(from = starting_point, to = 100, by = 5), \n \"In\", \n \"Out\"\n )\n )\n\nillustrative_sampling |>\n count(systematic_sampling)\n\n# A tibble: 2 × 2\n systematic_sampling n\n <chr> <int>\n1 In 20\n2 Out 80\n\n\nWhen we consider our population, it will typically have some grouping. This may be as straightforward as a country having states, provinces, counties, or statistical districts; a university having faculties and departments; and humans having age-groups. 
A stratified structure is one in which we can divide the population into mutually exclusive, and collectively exhaustive, sub-populations called “strata”.\nWe use stratification to help with the efficiency of sampling or with the balance of the survey. For instance, the population of the United States is around 335 million, with around 40 million people in California and around half a million people in Wyoming. Even a survey of 10,000 responses would only expect to have 15 responses from Wyoming, which could make inference about Wyoming difficult. We could use stratification to ensure there are, say, 200 responses from each state. We could use random sampling within each state to select the person about whom data will be gathered.\nIn our case, we will stratify our illustration by considering that our strata are the tens, that is, one to ten is one stratum, 11 to 20 is another, and so on. We will use simple random sampling within these strata to select two units from each.\n\nset.seed(853)\n\npicked_in_strata <-\n illustrative_sampling |>\n mutate(strata = (unit - 1) %/% 10) |>\n slice_sample(n = 2, by = strata) |>\n pull(unit)\n\nillustrative_sampling <-\n illustrative_sampling |>\n mutate(stratified_sampling = \n if_else(unit %in% picked_in_strata, \"In\", \"Out\"))\n\nillustrative_sampling |>\n count(stratified_sampling)\n\n# A tibble: 2 × 2\n stratified_sampling n\n <chr> <int>\n1 In 20\n2 Out 80\n\n\nAnd finally, we can also take advantage of some clusters that may exist in our dataset. Like strata, clusters are collectively exhaustive and mutually exclusive. Our examples from earlier of states, departments, and age-groups remain valid as clusters. However, it is our intention toward these groups that is different. Specifically, with cluster sampling, we do not intend to collect data from every cluster, whereas with stratified sampling we do. 
With stratified sampling we look at every stratum and conduct simple random sampling within each stratum to select the sample. With cluster sampling we select clusters of interest. We can then either sample every unit in those selected clusters or use simple random sampling, within the selected clusters, to select units. That all said, this difference can become less clear in practice, especially after the fact. Rose et al. (2006) gather mortality data for North Darfur, Sudan, in 2005. They find that cluster and systematic sampling provide similar results, and they point out that systematic sampling requires less training of the survey teams. In general, cluster sampling can be cheaper because of the focus on geographically close locations.\nIn our case, we will cluster our illustration again based on the tens. We will use simple random sampling to select two clusters for which we will use the entire cluster.\n\nset.seed(853)\n\npicked_clusters <-\n sample(x = c(0:9), size = 2)\n\nillustrative_sampling <-\n illustrative_sampling |>\n mutate(\n cluster = (unit - 1) %/% 10,\n cluster_sampling = if_else(cluster %in% picked_clusters, \"In\", \"Out\")\n ) |>\n select(-cluster)\n\nillustrative_sampling |>\n count(cluster_sampling)\n\n# A tibble: 2 × 2\n cluster_sampling n\n <chr> <int>\n1 In 20\n2 Out 80\n\n\nAt this point we can illustrate the differences between our approaches (Figure 6.7). We could also consider it visually, by pretending that we randomly sample using the different methods from different parts of the world (Figure 6.8). 
 \n\nnew_labels <- c(\n simple_random_sampling = \"Simple random sampling\",\n systematic_sampling = \"Systematic sampling\",\n stratified_sampling = \"Stratified sampling\",\n cluster_sampling = \"Cluster sampling\"\n)\n\nillustrative_sampling_long <-\n illustrative_sampling |>\n pivot_longer(\n cols = names(new_labels), names_to = \"sampling_method\",\n values_to = \"in_sample\"\n ) |>\n mutate(sampling_method = \n factor(sampling_method, levels = names(new_labels)))\n\nillustrative_sampling_long |>\n filter(in_sample == \"In\") |>\n ggplot(aes(x = unit, y = in_sample)) +\n geom_point() +\n facet_wrap(vars(sampling_method), dir = \"v\", ncol = 1, \n labeller = labeller(sampling_method = new_labels)\n ) +\n theme_minimal() +\n labs(x = \"Unit\", y = \"Is included in sample\") +\n theme(axis.text.y = element_blank())\n\nFigure 6.7: Illustrative example of simple random sampling, systematic sampling, stratified sampling, and cluster sampling over the numbers from 1 to 100\n\n(a) The world\n(b) Systematic sampling\n(c) Stratified sampling\n(d) Cluster sampling\n\nFigure 6.8: Illustrative example of simple random sampling, systematic sampling, stratified sampling, and cluster sampling across different parts of the world\n\nFigure 6.7 and Figure 6.8 illustrate the trade-offs between the different methods, and the ways in which they will be differently appropriate. For instance, we see that systematic sampling provides a useful picture of the world in Figure 6.8, but if we were interested only in, say, land, we would still be left with many samples that were not informative. Stratified sampling and cluster sampling enable us to focus on aspects of interest, but at the cost of a more holistic picture.\nA good way to appreciate the differences between these approaches is to consider them in practice. Au (2022) provides a number of examples. 
One in particular is in the context of counting raptors where Fuller and Mosher (1987) compares simple random sampling, stratified sampling, systematic sampling and cluster sampling, as well as additional considerations.\n\n6.4.2.1 Inference for probability samples\nHaving established our sample, we typically want to use it to make claims about the population. Neyman (1934, 561) goes further and says that “\\(\\dots\\)the problem of the representative method is par excellence the problem of statistical estimation. We are interested in characteristics of a certain population, say \\(\\pi\\), which it is either impossible or at least very difficult to study in detail, and we try to estimate these characteristics basing our judgment on the sample.”\nIn particular, we would typically be interested to estimate a population mean and variance. We introduced the idea of estimators, estimands, and estimates in Chapter 4. We can construct an estimator to estimate the population mean and variance. For instance, if we were using simple random sampling with a sample of size \\(n\\), then the sample mean and variance (which we return to in Chapter 12) could be constructed to produce estimates of the population mean and variance:\n\\[\n\\begin{aligned}\n\\hat{\\mu} &= \\frac{1}{n} \\times \\sum_{i = 1}^{n}x_i\\\\\n\\hat{\\sigma}^2 &= \\frac{1}{n-1} \\times \\sum_{i = 1}^{n}\\left(x_i - \\hat{\\mu}\\right)^2\n\\end{aligned}\n\\]\nWe can use the approaches that we have used so far to simulate various types of survey designs. There are also packages that can help, including DeclareDesign (Blair et al. 2019) and survey (Lumley 2020).\nScaling up estimates can be used when we are interested in using a count from our sample to imply some total count for the target population. 
We saw this in Bowley (1913) where the ratio of the number of households in the sample, compared with the number of households known from the census, was 21 and this information was used to scale up the sample.\nTo consider an example, perhaps we were interested in the sum of the numbers from one to 100. Returning to our example illustrating different ways to sample from these numbers, we know that our samples are of size 20, and so need to be scaled up five times (Table 6.2).\n\n\n\n\nTable 6.2: Sum of the numbers in each sample, and implied sum of population\n\n\n\n\n\n\nSampling method\nSum of sample\nImplied population sum\n\n\n\n\nSystematic sampling\n970\n4,850\n\n\nStratified sampling\n979\n4,895\n\n\nCluster sampling\n910\n4,550\n\n\nSimple random sampling\n840\n4,200\n\n\n\n\n\n\n\n\nThe actual sum of the population is 5,050. While the specifics are unique to this sample, our estimates of the population sum, based on the scaling, are revealing. The closest is stratified sampling, closely followed by systematic sampling. Cluster sampling is about 10 per cent off, while simple random sampling is a little further away. To get close to the true sum, it is important that our sampling method gets as many of the higher values as possible. And so stratified and systematic sampling, both of which ensured that we had outcomes from the larger numbers, did particularly well. The performance of cluster and simple random sampling would depend on the particular clusters, and units, selected. In this case, stratified and systematic sampling ensured that our estimate of the sum of the population would not be too far away from the actual population sum. Here, we might think of implications for the construction and evaluation of measures, such as GDP and other constructions that are summed, and the effect on the total of the different strata based on their size.\nThis approach has a long history. 
For instance, Stigler (1986, 163) describes how by 1826 Adolphe Quetelet, the nineteenth-century astronomer, had become involved in the statistical bureau, which was planning for a census. Quetelet argued that births and deaths were well known, but migration was not. He proposed an approach based on counts in specific geographies, which could then be scaled up to the whole country. The criticism of the plan focused on the difficulty of selecting appropriate geographies, which we saw also in our example of cluster sampling. The criticism was reasonable, and even today, some 200 years later, something that we should keep front of mind (Stigler 1986):\n\nHe [Quetelet] was acutely aware of the infinite number of factors that could affect the quantities he wished to measure, and he lacked the information that could tell him which were indeed important. He\\(\\dots\\) was reluctant to group together as homogenous, data that he had reason to believe was not\\(\\dots\\) To be aware of a myriad of potentially important factors, without knowing which are truly important and how their effect may be felt, is often to fear the worst\\(\\dots\\) He [Quetelet] could not bring himself to treat large regions as homogeneous, [and so] he could not think of a single rate as applying to a large area.\n\nWe are able to do this scaling up when we know the population total, but if we do not know that, or we have concerns around the precision of that approach, then we may use a ratio estimator.\nRatio estimators were used in 1802 by Pierre-Simon Laplace to estimate the total population of France, based on the ratio of the number of registered births, which was known throughout the country, to the number of inhabitants, which was only known for certain communes. 
He calculated this ratio for the three communes, and then scaled it, based on knowing the number of births across the whole country to produce an estimate of the population of France (Lohr [1999] 2022).\n\n\n\n\n\n\n\n\nA ratio estimator of some population parameter is the ratio of two means. For instance, imagine that we knew the total number of hours that a toddler slept for a 30-day period, and we want to know how many hours the parents slept over that same period. We may have some information on the number of hours that a toddler sleeps overnight, \\(x\\), and the number of hours their parents sleep overnight, \\(y\\), over a 30-day period.\n\nset.seed(853)\n\nsleep <-\n tibble(\n toddler_sleep = sample(x = c(2:14), size = 30, replace = TRUE),\n difference = sample(x = c(0:3), size = 30, replace = TRUE),\n parent_sleep = toddler_sleep - difference\n ) |>\n select(toddler_sleep, parent_sleep, difference)\n\nsleep\n\n# A tibble: 30 × 3\n toddler_sleep parent_sleep difference\n <int> <int> <int>\n 1 10 9 1\n 2 11 11 0\n 3 14 12 2\n 4 2 0 2\n 5 6 5 1\n 6 14 12 2\n 7 3 3 0\n 8 5 3 2\n 9 4 1 3\n10 4 3 1\n# ℹ 20 more rows\n\n\nThe average of each is:\n\nsleep |>\n summarise(\n toddler_sleep_average = mean(toddler_sleep),\n parent_sleep_average = mean(parent_sleep)\n ) |>\n kable(\n col.names = c(\"Toddler sleep average\", \"Parent sleep average\"),\n format.args = list(big.mark = \",\"),\n digits = 2\n )\n\n\n\n\nToddler sleep average\nParent sleep average\n\n\n\n\n6.17\n4.9\n\n\n\n\n\nThe ratio of the average sleep that a parent gets to the average sleep of their toddler is:\n\\[\\hat{B} = \\frac{\\bar{y}}{\\bar{x}} = \\frac{4.9}{6.17} \\approx 0.8.\\]\nGiven the toddler slept 185 hours over that 30-day period, our estimate of the number of hours that the parents slept is \\(185 \\times 0.8 = 148\\). This turns out to be almost exactly right, as the sum is 147. 
In this example, the estimate was not needed because we were able to sum the data, but say some other set of parents only recorded the number of hours that their toddler slept, not how long they slept, then we could use this to estimate how much they had slept.\nOne variant of the ratio estimator that is commonly used is capture and recapture, which is one of the crown jewels of data gathering. It is commonly used in ecology where we know we can never gather data about all animals. Instead, a sample is captured, marked, and released. The researchers return after some time to capture another sample. Assuming enough time passes that the initially captured animals had time to integrate back into the population, but not so much time has passed that there are insurmountable concerns around births, deaths, and migration, then we can use these values to estimate a population size. The key is what proportion in this second sample have been recaptured. This proportion can be used to estimate the size of the whole population. Interestingly, in the 1990s there was substantial debate about whether to use a capture-recapture model to adjust the 1990 US census due to concerns about methodology. The combination of Breiman (1994) and Gleick (1990) provides an overview of the concerns at the time, those of censuses more generally, and helpful background on capture and recapture methods. More recently we have seen capture and recapture combined with web scraping, which we consider in Chapter 7, for the construction of survey frames (Hyman, Sartore, and Young 2021).\n\n\n\n6.4.3 Non-probability samples\nWhile acknowledging that it is a spectrum, much of statistics was developed based on probability sampling. But a considerable amount of modern sampling is done using non-probability sampling. One approach is to use social media and other advertisements to recruit a panel of respondents, possibly in exchange for compensation. 
This panel is then the group that is sent various surveys as necessary. But think for a moment about the implications of this. For instance, what type of people are likely to respond to such an advertisement? Is the richest person in the world likely to respond? Are especially young or especially old people likely to respond? In some cases, it is possible to do a census. Governments typically do one every five to ten years. But there is a reason that it is generally governments that do them—they are expensive, time-consuming, and surprisingly, they are sometimes not as accurate as we may hope because of how general they need to be.\nNon-probability samples have an important role to play because they are typically cheaper and quicker to obtain than probability samples. Beaumont (2020) describes a variety of factors in favor of non-probability samples including declining response rates to probability samples, and increased demand for real-time statistics. Further, as we have discussed, the difference between probability and non-probability samples is sometimes one of degree, rather than dichotomy. Non-probability samples are legitimate and appropriate for some tasks provided one is clear about the trade-offs and ensures transparency (Baker et al. 2013). Low response rates mean that true probability samples are rare, and so grappling with the implications of non-probability sampling is important.\nConvenience sampling involves gathering data from a sample that is easy to access. For instance, one often asks one’s friends and family to fill out a survey as a way of testing it before wide-scale distribution. If we were to analyze such a sample, then we would likely be using convenience sampling.\nThe main concern with convenience sampling is whether it is able to speak to the broader population. There are also tricky ethical considerations, and typically a lack of anonymity which may further bias the results. 
On the other hand, it can be useful to cheaply get a quick sense of a situation.\nQuota sampling occurs when we have strata, but we do not use random sampling within those strata to select the unit. For instance, we might again stratify the United States by state, but then, instead of ensuring that everyone in Wyoming had a chance to be chosen for that stratum, just pick people at Jackson Hole. There are some advantages to this approach, especially in terms of speed and cost, but the resulting sample may be biased in various ways. That is not to say such samples are without merit. For instance, the Bank of Canada runs a non-probability survey focused on the method of payment for goods and services. They use quota sampling, and various adjustment methods. This use of non-probability sampling enables them to deliberately focus on hard-to-reach aspects of the population (H. Chen, Felt, and Henry 2018).\nAs the saying goes, birds of a feather flock together. And we can take advantage of that in our sampling. Although Handcock and Gile (2011) describe various uses before this, and it is notoriously difficult to define attribution in multidisciplinary work, snowball sampling is nicely defined by Goodman (1961). Following Goodman (1961), to conduct snowball sampling, we first draw a random sample from the sampling frame. Each of these is asked to name \\(k\\) others also in the sample population, but not in that initial draw, and these form the “first stage”. Each individual in the first stage is then similarly asked to name \\(k\\) others who are also in the sample population, but again not in the random draw or the first stage, and these form the “second stage”. 
We need to have specified the number of stages, \\(s\\), and also \\(k\\) ahead of time.\nRespondent-driven sampling was developed by Heckathorn (1997) to focus on hidden populations, which are those where:\n\nthere is no sampling frame; and\nbeing known to be in the sampling population could have a negative effect.\n\nFor instance, we could imagine various countries in which it would be difficult to sample from, say, the gay population or those who have had abortions. Respondent-driven sampling differs from snowball sampling in two ways:\n\nIn addition to compensation for their own response, as is the case with snowball sampling, respondent-driven sampling typically also involves compensation for recruiting others.\nRespondents are not asked to provide information about others to the investigator, but instead recruit them into the study. Selection into the sample occurs not from a sampling frame, but instead from the networks of those already in the sample (Salganik and Heckathorn 2004).",
"crumbs": [
"Acquisition",
- "6 Farm data "
+ "6 Measurement, censuses, and sampling "
]
},
{
"objectID": "06-farm.html#exercises",
"href": "06-farm.html#exercises",
- "title": "6 Farm data",
+ "title": "6 Measurement, censuses, and sampling",
"section": "6.5 Exercises",
"text": "6.5 Exercises\n\nPractice\n\n(Plan) Consider the following scenario: Every day for a year two people—Mark and Lauren—record the amount of snow that fell that day in the two different states they are from. Please sketch what a dataset could look like, and then sketch a graph that you could build to show all observations.\n(Simulate) Please further consider the scenario described and simulate the situation with every variable independent of each other. Then write five tests based on the simulated data.\n(Acquire) Please obtain some actual data about snowfall and add a script updating the simulated tests to these actual data.\n(Explore) Build a graph and table using the real data.\n(Communicate) Please write some text to accompany the graph and table. Separate the code appropriately into R files and a Quarto doc. Submit a link to a high-quality GitHub repo.\n\n\n\nQuiz\n\nWhat is a primary challenge when converting some phenomena in the world into a dataset (pick one)?\n\nThe high cost of data storage solutions.\nThe overabundance of unbiased data.\nThe lack of available data collection tools.\nDeciding what to measure and how to measure it appropriately.\n\nWith reference to Daston (2000), please discuss whether GDP and counts of population are invented or discovered?\nAccording to metrology, which of the following best defines measurement (pick one)?\n\nThe estimation of unknown variables using predictive models.\nThe calculation of statistical significance in data analysis.\nThe act of assigning numbers to objects arbitrarily.\nThe process of experimentally obtaining quantity values that can be attributed to a property of a phenomenon, body, or substance.\n\nIn at least two paragraphs, and using your own words, please define measurement error and provide an example from your own experience.\nWith reference to Gargiulo (2022), please discuss challenges of measurement in the real world.\nWhat does the “validity” of a measurement refer to (pick one)?\n\nThe 
statistical significance of the measurement results.\nThe degree to which a measurement accurately reflects the concept it is intended to measure.\nThe speed at which data can be collected.\nThe precision with which a measurement can be replicated.\n\nHow do Kennedy et al. (2022) define ethics (pick one)?\n\nRespecting the perspectives and dignity of individual survey respondents.\nGenerating estimates of the general population and for subpopulations of interest.\nUsing more complicated procedures only when they serve some useful function.\n\nWhich of the following best describes “measurement error” (pick one)?\n\nThe difference between the observed value and the true value of what is being measured.\nAn error that occurs only when data is not normally distributed.\nA type of error that can be eliminated with better instruments.\nA deliberate alteration of data to mislead analysis.\n\nWhat is censored data (pick one)?\n\nData that have been corrupted and is unreadable.\nData that have been intentionally omitted due to privacy concerns.\nData where the value of an observation is only partially known.\nData collected from unauthorized sources.\n\nHow do truncated data differ from censored data (pick one)?\n\nTruncated data deal with overestimation, censored data with underestimation.\nIn truncated data, certain values are omitted from the dataset, whereas in censored data, the values are partially known but incomplete.\nTruncated data are less accurate than censored data.\n\nWhat is missing completely at random (MCAR) (pick one)?\n\nMissing data that can be easily predicted using statistical models.\nMissing data where the likelihood of being missing is related to the unobserved data.\nMissing data where the likelihood of being missing is related to the observed data.\nMissing data where the missingness is entirely random and unrelated to any data, observed or unobserved.\n\nWhy are censuses considered crucial datasets (pick one)?\n\nThey are conducted infrequently 
and thus have novel information.\nThey aim to collect data on every unit in a population, providing comprehensive datasets designed for analysis.\nThey are private datasets that offer exclusive insights.\nThey focus exclusively on agricultural data, which is vital for economic planning.\n\nBased on Statistics Canada (2023), why is the quality of census data evaluated (pick one)?\n\nTo reduce the cost of future censuses.\nTo limit data dissemination.\nTo improve data privacy.\nTo ensure census data are reliable and meet user needs.\n\nBased on Statistics Canada (2023), what is the primary source of sampling error in the Census of Population (pick one)?\n\nNon-response from individuals.\nThe use of a sample.\nData capture errors during processing.\nMisclassification of dwellings.\n\nBased on Statistics Canada (2023), which error occurs when people or dwellings are omitted or counted more than once (pick one)?\n\nNon-response error.\nCoverage error.\nSampling error.\nProcessing error.\n\nBased on Statistics Canada (2023), which type of error is related to misunderstanding or misreporting by respondents or enumerators (pick one)?\n\nSampling error.\nResponse error.\nProcessing error.\nCoverage error.\n\nBased on Statistics Canada (2023), what is the purpose of the Dwelling Classification Survey (DCS) (pick one)?\n\nTo improve the long-form questionnaire.\nTo classify new housing developments.\nTo collect data on household income.\nTo study classification errors in dwellings.\n\nBased on Statistics Canada (2023), what is the Census Undercoverage Study (CUS) designed to estimate (pick one)?\n\nThe number of people omitted from the census.\nThe variance of sampling errors.\nImputation rates for missing data.\nNon-response rates for long-form questionnaires.\n\nBased on Statistics Canada (2023), what does the Census Overcoverage Study (COS) identify (pick one)?\n\nDwellings that were misclassified.\nCases where individuals were counted more than once.\nPeople who were 
missed by the census.\nInvalid data entries during processing.\n\nBased on Statistics Canada (2023), how is the Total Non-Response (TNR) rate for the census defined (pick one)?\n\nThe number of imputed values in the census data.\nThe percentage of incorrect responses in the census.\nThe percentage of questionnaires with partial responses.\nThe proportion of dwellings where the questionnaire does not meet the minimum content.\n\nWith reference to W. Chen et al. (2019) and Martı́nez (2022), to what extent do you think we can trust government statistics? Please write at least a page and compare at least two governments in your answer.\nThe 2021 census in Canada asked, firstly, “What was this person’s sex at birth? Sex refers to sex assigned at birth. Male/Female”, and then “What is this person’s gender? Refers to current gender which may be different from sex assigned at birth and may be different from what is indicated on legal documents. Male/Female/Or please specify this person’s gender (space for a typed or handwritten answer)”. With reference to Statistics Canada (2020), please discuss the extent to which you think this is an appropriate way for the census to have proceeded. You are welcome to discuss the case of a different country if you are more familiar with that.\nPlease use IPUMS to access the 2020 ACS. Making use of the codebook, how many respondents were there in California (STATEICP) that had a Doctoral degree as their highest educational attainment (EDUC) (pick one)?\n\n2,007\n732\n5,765\n4,684\n\nPlease use IPUMS to access the 1940 1% sample. 
Making use of the codebook, how many respondents were there in California (STATEICP) with 5+ years of college as their highest educational attainment (EDUC) (pick one)?\n\n532\n1,056\n904\n1,789\n\nWith reference to Dean (2022), please discuss the difference between probability and non-probability sampling.\nWhat is a target population (pick one)?\n\nThe entire group about which the researcher wants to draw conclusions.\nThe list of all units in the target population from which a sample can be drawn.\nA subset of the population that is easily accessible for sampling.\nThe list of individuals who have agreed to participate in the study.\n\nWhat is a sampling frame (pick one)?\n\nThe method used to collect data from respondents.\nThe list of all units in the target population from which a sample can be drawn.\nThe timeframe during which data collection occurs.\nThe entire group about which the researcher wants to draw conclusions.\n\nWhat is the main difference between “probability” and “non-probability” sampling (pick one)?\n\nProbability sampling does not require a sampling frame.\nProbability sampling is always more cost-effective than non-probability sampling.\nIn probability sampling, every unit has a known chance of being selected, whereas in non-probability sampling, selection is not based on probabilities.\nNon-probability sampling methods are more accurate than probability sampling methods.\n\nWith reference to Beaumont (2020), do you think that probability surveys will disappear, and why or why not (please write a paragraph or two)?\nWhich sampling method involves selecting units such that every observation in the sampling frame has an equal chance of being chosen (pick one)?\n\nSystematic sampling.\nStratified sampling.\nSimple random sampling.\nCluster sampling.\n\nIn which sampling method is the first unit selected at random, and subsequent units are selected at regular intervals (pick one)?\n\nSystematic sampling.\nConvenience sampling.\nStratified 
sampling.\nCluster sampling.\n\nWhat characterizes stratified sampling (pick one)?\n\nThe population is divided into subgroups, and a random sample is taken from each subgroup.\nParticipants recruit other participants through their networks.\nEntire clusters are randomly selected, and all units within them are sampled.\nBy randomly selecting entire groups or clusters, then sampling all or some units within them.\n\nHow are units selected in cluster sampling (pick one)?\n\nThe population is divided into subgroups, and a random sample is taken from each subgroup.\nBy randomly selecting entire groups or clusters, then sampling all or some units within them.\nBy selecting every nth unit from a list.\nBy choosing units based on specific quotas.\n\nPlease name some reasons why you may wish to use cluster sampling (select all that apply)?\n\nBalance in responses in terms of sub-populations.\nAdministrative convenience.\nEfficiency in terms of money.\n\nPlease consider the integers [1:100]. If I were interested in implementing a sampling approach, based on a sample of only 10, to estimate the median, which approach would I choose (pick one)?\n\nSimple random sampling.\nSystematic sampling.\nCluster sampling.\nStratified sampling.\n\nWrite R code that considers the numbers 1 to 100, and estimates the mean, based on a cluster sample of 20 numbers. Re-run this code one hundred times, noting the estimate of the mean each time, and then plot the histogram. What do you notice about the graph? 
Add a paragraph of explanation and discussion.\nFrom Bowley (1913), how was the sample for the study of Reading’s households selected (pick one)?\n\nBy random selection of streets.\nBy selecting households based on income.\nBy marking one building in ten from the local directory.\nBy interviewing every fifth household.\n\nWhat is the sampling approach used by Bowley (1913) (pick one)?\n\nCluster sampling.\nSimple random sampling.\nStratified sampling.\nSystematic sampling.\n\nFrom Bowley (1913), which method was employed to provide estimates for the whole of Reading, based on the sample data (pick one)?\n\nUsing proportional census data.\nCalculating median income levels.\nApplying a multiplier.\n\nFrom Bowley (1913), what was the method of collecting rent and earnings data from the working-class households in Reading (pick one)?\n\nBy interviewing landlords.\nBy inspecting census records.\nBy collecting data from tax records.\nThrough volunteer interviews with households.\n\nPlease discuss the following statement from Bowley (1913, 673): “It may appear to persons who are not familiar with processes of sampling that a proportion of 1 in 21 is too small for any conclusion, and that in any case not more than a vague probability can be obtained. 
… [but] the precision of a sample depends not on its proportion to the whole, but on its own magnitude, if the conditions of random sampling are secured, as it is believed they have been in this inquiry.”\nWhich of the following best describes convenience sampling (pick one)?\n\nSampling that ensures every subgroup is represented proportionally.\nSelecting participants based on random numbers.\nChoosing participants who are easiest to access.\nUsing algorithms to select a sample.\n\nWhat is snowball sampling commonly used for (pick one)?\n\nStudying well-defined populations with comprehensive sampling frames.\nResearching hidden or hard-to-reach populations by having existing participants recruit future participants.\nEnsuring equal representation across different demographic groups.\nEstimating the total population size using capture-recapture methods.\n\nWhat is respondent-driven sampling (pick one)?\n\nA sampling technique that uses automated systems to select respondents.\nA form of non-probability sampling where respondents refer other respondents, often used for hidden populations, and includes incentives for recruitment.\nA type of random sampling where respondents are selected based on a probability mechanism.\nA sampling method that relies on respondents volunteering without any recruitment efforts.\n\nFrom Neyman (1934), what is the primary goal of stratified sampling (pick one)?\n\nTo reduce the need for randomization.\nTo ensure all strata of the population are represented.\nTo increase bias in sample selection.\n\nWhat was the main focus of Neyman (1934) (pick one)?\n\nIntroduction of the simple random sample.\nDevelopment of quota sampling techniques.\nElimination of all biases in sample selection.\nThe distinction between stratified sampling and non-probability sampling.\n\nFrom Neyman (1934), what is one advantage of stratified sampling over simple random sampling (pick one)?\n\nIt eliminates the need for a sampling frame.\nIt requires fewer 
samples.\nIt is cheaper.\nIt may increase the precision of estimates.\n\nFrom Neyman (1934, sec. V), which approach allows a consistent estimate of the average collective character of a population, whatever the properties of the population (pick one)?\n\nRandom sampling.\nPurposive sampling.\n\nPretend that we have conducted a survey of everyone in Canada, where we asked for age, sex, and gender. Your friend claims that there is no need to worry about uncertainty “because we have the whole population”. Is your friend right or wrong, and why?\nWith reference to Meng (2018), please discuss the claim: “When you have one million responses, you do not need to worry about randomization”.\nImagine you take a job at a bank and they already have a dataset for you to use. What are some questions that you should explore when deciding whether that data will be useful to you?\n\n\n\nActivity\nThe purpose of this activity is to develop comfort with dealing with larger datasets, understand ratio estimators, and sampling. Please use IPUMS to access the 2022 ACS. Making use of the codebook, how many respondents were there in each state (STATEICP) that had a doctoral degree as their highest educational attainment (EDUC)? (Hint: Make this a column in a tibble.)\nIf there were 391,171 respondents in California (STATEICP) across all levels of education, then can you please use the ratio estimators approach of Laplace to estimate the total number of respondents in each state i.e. take the ratio that you worked out for California and apply it to the rest of the states. (Hint: You can now work out the ratio between the number of respondents with doctoral degrees in a state and number of respondents in a state and then apply that ratio to your column of the number of respondents with a doctoral degree in each state.) Compare it to the actual number of respondents in each state.\nWrite a short (2ish pages + appendices + references) paper using Quarto. 
Submit a link to a GitHub repo (one repo per group) that meets the general expectations of the course. Do not forget to cite the data (but do not upload the raw data to GitHub). Your paper should cover at least:\n\nInstructions on how to obtain the data (in the appendix).\nA brief overview of the ratio estimators approach.\nYour estimates and the actual number of respondents.\nSome explanation of why you think they are different i.e. the strengths and weaknesses of using ratio estimators.\nA discussion of the strengths and weaknesses of the sampling approaches used by the ACS. (Hint: One fun thing could be to find the actual population of one state, then given the number of respondents in that and every state, use the ratio estimators approach to estimate the population of every state and then compare it with the actual populations.)\n\nNote:\n\nIf you have issues opening the zipped file, open a terminal, navigate to the folder, then use: gunzip usa_00004.csv.gz (change the filename to whatever yours is).\nTo focus on doctoral degrees you need EDUCD rather than EDUC (but you can select that one once you have downloaded the data).\nDo not upload the raw data to GitHub - it is too big and also IPUMS asks that you do not.\n\n\n\n\n\nAchen, Christopher. 1978. “Measuring Representation.” American Journal of Political Science 22 (3): 475–510. https://doi.org/10.2307/2110458.\n\n\nAnderson, Margo. (1988) 2015. The American Census: A Social History. 2nd ed. Yale University Press.\n\n\nAnderson, Margo, and Stephen Fienberg. 1999. Who Counts?: The Politics of Census-Taking in Contemporary America. Russell Sage Foundation. http://www.jstor.org/stable/10.7758/9781610440059.\n\n\nAu, Randy. 2022. “Celebrating Everyone Counting Things,” February. https://counting.substack.com/p/celebrating-everyone-counting-things.\n\n\nBaker, Reg, Michael Brick, Nancy Bates, Mike Battaglia, Mick Couper, Jill Dever, Krista Gile, and Roger Tourangeau. 2013. 
“Summary Report of the AAPOR Task Force on Non-Probability Sampling.” Journal of Survey Statistics and Methodology 1 (2): 90–143. https://doi.org/10.1093/jssam/smt008.\n\n\nBeaumont, Jean-Francois. 2020. “Are Probability Surveys Bound to Disappear for the Production of Official Statistics?” Survey Methodology 46 (1): 1–29.\n\n\nBecker, Richard, Allan Wilks, Ray Brownrigg, Thomas Minka, and Alex Deckmyn. 2022. maps: Draw Geographical Maps. https://CRAN.R-project.org/package=maps.\n\n\nBerdine, Gilbert, Vincent Geloso, and Benjamin Powell. 2018. “Cuban Infant Mortality and Longevity: Health Care or Repression?” Health Policy and Planning 33 (6): 755–57. https://doi.org/10.1093/heapol/czy033.\n\n\nBlair, Graeme, Jasper Cooper, Alexander Coppock, and Macartan Humphreys. 2019. “Declaring and Diagnosing Research Designs.” American Political Science Review 113 (3): 838–59. https://doi.org/10.1017/S0003055419000194.\n\n\nBowen, Claire McKay. 2022. Protecting Your Privacy in a Data-Driven World. 1st ed. Chapman; Hall/CRC. https://doi.org/10.1201/9781003122043.\n\n\nBowley, Arthur Lyon. 1913. “Working-Class Households in Reading.” Journal of the Royal Statistical Society 76 (7): 672–701. https://doi.org/10.2307/2339708.\n\n\nBreiman, Leo. 1994. “The 1991 Census Adjustment: Undercount or Bad Data?” Statistical Science 9 (4). https://doi.org/10.1214/ss/1177010259.\n\n\nBrewer, Ken. 2013. “Three Controversies in the History of Survey Sampling.” Survey Methodology 39 (2): 249–63.\n\n\nCardoso, Tom. 2020. “Bias behind bars: A Globe investigation finds a prison system stacked against Black and Indigenous inmates.” The Globe and Mail, October. https://www.theglobeandmail.com/canada/article-investigation-racial-bias-in-canadian-prison-risk-assessments/.\n\n\nChambliss, Daniel. 1989. “The Mundanity of Excellence: An Ethnographic Report on Stratification and Olympic Swimmers.” Sociological Theory 7 (1): 70–86. 
https://doi.org/10.2307/202063.\n\n\nChen, Heng, Marie-Hélène Felt, and Christopher Henry. 2018. “2017 Methods-of-Payment Survey: Sample Calibration and Variance Estimation.” Bank of Canada. https://doi.org/10.34989/tr-114.\n\n\nChen, Wei, Xilu Chen, Chang-Tai Hsieh, and Zheng Song. 2019. “A Forensic Examination of China’s National Accounts.” Brookings Papers on Economic Activity, 77–127. https://www.jstor.org/stable/26798817.\n\n\nColombo, Tommaso, Holger Fröning, Pedro Javier Garcı̀a, and Wainer Vandelli. 2016. “Optimizing the Data-Collection Time of a Large-Scale Data-Acquisition System Through a Simulation Framework.” The Journal of Supercomputing 72 (12): 4546–72. https://doi.org/10.1007/s11227-016-1764-1.\n\n\nCrawford, Kate. 2021. Atlas of AI. 1st ed. New Haven: Yale University Press.\n\n\nCrosby, Alfred. 1997. The Measure of Reality: Quantification in Western Europe, 1250-1600. Cambridge: Cambridge University Press.\n\n\nDaston, Lorraine. 2000. “Why Statistics Tend Not Only to Describe the World but to Change It.” London Review of Books 22 (8). https://www.lrb.co.uk/the-paper/v22/n08/lorraine-daston/why-statistics-tend-not-only-to-describe-the-world-but-to-change-it.\n\n\nDavis, Darren. 1997. “Nonrandom Measurement Error and Race of Interviewer Effects Among African Americans.” The Public Opinion Quarterly 61 (1): 183–207. https://doi.org/10.1086/297792.\n\n\nDean, Natalie. 2022. “Tracking COVID-19 Infections: Time for Change.” Nature 602 (7896): 185. https://doi.org/10.1038/d41586-022-00336-8.\n\n\nFisher, Ronald. (1925) 1928. Statistical Methods for Research Workers. 2nd ed. London: Oliver; Boyd.\n\n\nFlake, Jessica, and Eiko Fried. 2020. “Measurement Schmeasurement: Questionable Measurement Practices and How to Avoid Them.” Advances in Methods and Practices in Psychological Science 3 (4): 456–65. https://doi.org/10.1177/2515245920952393.\n\n\nFrandell, Ashlee, Mary Feeney, Timothy Johnson, Eric Welch, Lesley Michalegko, and Heyjie Jung. 2021. 
“The Effects of Electronic Alert Letters for Internet Surveys of Academic Scientists.” Scientometrics 126 (8): 7167–81. https://doi.org/10.1007/s11192-021-04029-3.\n\n\nFried, Eiko, Jessica Flake, and Donald Robinaugh. 2022. “Revisiting the Theoretical and Methodological Foundations of Depression Measurement.” Nature Reviews Psychology 1 (6): 358–68. https://doi.org/10.1038/s44159-022-00050-2.\n\n\nFuller, Mark, and James Mosher. 1987. “Raptor Survey Techniques.” In Raptor Management Techniques Manual, edited by Beth Pendleton, Brian Millsap, Keith Cline, and David Bird, 37–65. National Wildlife Federation. https://www.sandiegocounty.gov/content/dam/sdc/pds/ceqa/JVR/AdminRecord/IncorporatedByReference/Appendices/Appendix-D---Biological-Resources-Report/Fuller%20and%20Mosher%201987.pdf.\n\n\nGarfinkel, Irwin, Lee Rainwater, and Timothy Smeeding. 2006. “A Re-Examination of Welfare States and Inequality in Rich Nations: How in-Kind Transfers and Indirect Taxes Change the Story.” Journal of Policy Analysis and Management 25 (4): 897–919. https://doi.org/10.1002/pam.20213.\n\n\nGargiulo, Maria. 2022. “Statistical Biases, Measurement Challenges, and Recommendations for Studying Patterns of Femicide in Conflict.” Peace Review 34 (2): 163–76. https://doi.org/10.1080/10402659.2022.2049002.\n\n\nGazeley, Ursula, Georges Reniers, Hallie Eilerts-Spinelli, Julio Romero Prieto, Momodou Jasseh, Sammy Khagayi, and Veronique Filippi. 2022. “Women’s Risk of Death Beyond 42 Days Post Partum: A Pooled Analysis of Longitudinal Health and Demographic Surveillance System Data in Sub-Saharan Africa.” The Lancet Global Health 10 (11): e1582–89. https://doi.org/10.1016/s2214-109x(22)00339-4.\n\n\nGelman, Andrew, Sharad Goel, Douglas Rivers, and David Rothschild. 2016. “The Mythical Swing Voter.” Quarterly Journal of Political Science 11 (1): 103–30. https://doi.org/10.1561/100.00015031.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and Other Stories. 
Cambridge University Press. https://avehtari.github.io/ROS-Examples/.\n\n\nGibney, Elizabeth. 2022. “The leap second’s time is up: world votes to stop pausing clocks.” Nature 612 (7938): 18–18. https://doi.org/10.1038/d41586-022-03783-5.\n\n\nGleick, James. 1990. “The Census: Why We Can’t Count.” The New York Times, July. https://www.nytimes.com/1990/07/15/magazine/the-census-why-we-can-t-count.html.\n\n\nGodfrey, Ernest. 1918. “History and Development of Statistics in Canada.” In The History of Statistics–Their Development and Progress in Many Countries. New York: Macmillan, edited by John Koren, 179–98. Macmillan Company of New York.\n\n\nGoodman, Leo. 1961. “Snowball Sampling.” The Annals of Mathematical Statistics 32 (1): 148–70. https://doi.org/10.1214/aoms/1177705148.\n\n\nGroves, Robert, and Lars Lyberg. 2010. “Total Survey Error: Past, Present, and Future.” Public Opinion Quarterly 74 (5): 849–79. https://doi.org/10.1093/poq/nfq065.\n\n\nGutman, Robert. 1958. “Birth and Death Registration in Massachusetts: II. The Inauguration of a Modern System, 1800-1849.” The Milbank Memorial Fund Quarterly 36 (4): 373–402.\n\n\nHandcock, Mark, and Krista Gile. 2011. “Comment: On the Concept of Snowball Sampling.” Sociological Methodology 41 (1): 367–71. https://doi.org/10.1111/j.1467-9531.2011.01243.x.\n\n\nHartocollis, Anemona. 2022. “U.S. News Ranked Columbia No. 2, but a Math Professor Has His Doubts.” The New York Times, March. https://www.nytimes.com/2022/03/17/us/columbia-university-rank.html.\n\n\nHawes, Michael. 2020. “Implementing Differential Privacy: Seven Lessons From the 2020 United States Census.” Harvard Data Science Review 2 (2). https://doi.org/10.1162/99608f92.353c6f99.\n\n\nHeckathorn, Douglas. 1997. “Respondent-Driven Sampling: A New Approach to the Study of Hidden Populations.” Social Problems 44 (2): 174–99. https://doi.org/10.2307/3096941.\n\n\nHopper, Nate. 2022. “The Thorny Problem of Keeping the Internet’s Time.” The New Yorker, September. 
https://www.newyorker.com/tech/annals-of-technology/the-thorny-problem-of-keeping-the-internets-time.\n\n\nHyman, Michael, Luca Sartore, and Linda J Young. 2021. “Capture-Recapture Estimation of Characteristics of U.S. Local Food Farms Using a Web-Scraped List Frame.” Journal of Survey Statistics and Methodology 10 (4): 979–1004. https://doi.org/10.1093/jssam/smab008.\n\n\nInternational Organization Of Legal Metrology. 2007. International Vocabulary of Metrology – Basic and General Concepts and Associated Terms. 3rd ed. https://www.oiml.org/en/files/pdf%5Fv/v002-200-e07.pdf.\n\n\nJones, Arnold. 1953. “Census Records of the Later Roman Empire.” The Journal of Roman Studies 43: 49–64. https://doi.org/10.2307/297781.\n\n\nKalgin, Alexander. 2014. “Implementation of Performance Management in Regional Government in Russia: Evidence of Data Manipulation.” Public Management Review 18 (1): 110–38. https://doi.org/10.1080/14719037.2014.965271.\n\n\nKennedy, Lauren, Katharine Khanna, Daniel Simpson, Andrew Gelman, Yajun Jia, and Julien Teitler. 2022. “He, She, They: Using Sex and Gender in Survey Adjustment.” https://arxiv.org/abs/2009.14401.\n\n\nKoitsalu, Marie, Martin Eklund, Jan Adolfsson, Henrik Grönberg, and Yvonne Brandberg. 2018. “Effects of Pre-Notification, Invitation Length, Questionnaire Length and Reminder on Participation Rate: A Quasi-Randomised Controlled Trial.” BMC Medical Research Methodology 18 (3): 1–5. https://doi.org/10.1186/s12874-017-0467-5.\n\n\nLane, Nick. 2015. “The Unseen World: Reflections on Leeuwenhoek (1677) ‘Concerning Little Animals’.” Philosophical Transactions of the Royal Society B: Biological Sciences 370 (1666): 20140344. https://doi.org/10.1098/rstb.2014.0344.\n\n\nLeos-Barajas, Vianey, Theoni Photopoulou, Roland Langrock, Toby Patterson, Yuuki Watanabe, Megan Murgatroyd, and Yannis Papastamatiou. 2016. “Analysis of Animal Accelerometer Data Using Hidden Markov Models.” Methods in Ecology and Evolution 8 (2): 161–73. 
https://doi.org/10.1111/2041-210x.12657.\n\n\nLevine, Judah, Patrizia Tavella, and Martin Milton. 2022. “Towards a Consensus on a Continuous Coordinated Universal Time.” Metrologia 60 (1): 014001. https://doi.org/10.1088/1681-7575/ac9da5.\n\n\nLips, Hilary. 2020. Sex and Gender: An Introduction. 7th ed. Illinois: Waveland Press.\n\n\nLohr, Sharon. (1999) 2022. Sampling: Design and Analysis. 3rd ed. Chapman; Hall/CRC.\n\n\nLuebke, David Martin, and Sybil Milton. 1994. “Locating the Victim: An Overview of Census-Taking, Tabulation Technology, and Persecution in Nazi Germany.” IEEE Annals of the History of Computing 16 (3): 25–39. https://doi.org/10.1109/MAHC.1994.298418.\n\n\nLumley, Thomas. 2020. “survey: analysis of complex survey samples.” https://cran.r-project.org/web/packages/survey/index.html.\n\n\nMartı́nez, Luis. 2022. “How Much Should We Trust the Dictator’s GDP Growth Estimates?” Journal of Political Economy 130 (10): 2731–69. https://doi.org/10.1086/720458.\n\n\nMeng, Xiao-Li. 2018. “Statistical Paradises and Paradoxes in Big Data (i): Law of Large Populations, Big Data Paradox, and the 2016 US Presidential Election.” The Annals of Applied Statistics 12 (2): 685–726. https://doi.org/10.1214/18-AOAS1161SF.\n\n\nMill, James. 1817. The History of British India. 1st ed. https://books.google.ca/books?id=Orw_AAAAcAAJ.\n\n\nMills, David L. 1991. “Internet Time Synchronization: The Network Time Protocol.” IEEE Transactions on Communications 39 (10): 1482–93.\n\n\nMitchell, Alanna. 2022a. “Get Ready for the New, Improved Second.” The New York Times, April. https://www.nytimes.com/2022/04/25/science/time-second-measurement.html.\n\n\n———. 2022b. “Time Has Run Out for the Leap Second.” The New York Times, November. https://www.nytimes.com/2022/11/14/science/time-leap-second.html.\n\n\nMitrovski, Alen, Xiaoyan Yang, and Matthew Wankiewicz. 2020. 
“Joe Biden Projected to Win Popular Vote in 2020 US Election.” https://github.com/matthewwankiewicz/US_election_forecast.\n\n\nMolanphy, Chris. 2012. “100 & Single: Three Rules to Define the Term ‘One-Hit Wonder’ in 2012.” The Village Voice, September. https://www.villagevoice.com/2012/09/10/100-single-three-rules-to-define-the-term-one-hit-wonder-in-2012/.\n\n\nMorange, Michel. 2016. A History of Biology. New Jersey: Princeton University Press.\n\n\nNewman, Daniel. 2014. “Missing Data: Five Practical Guidelines.” Organizational Research Methods 17 (4): 372–411. https://doi.org/10.1177/1094428114548590.\n\n\nNeyman, Jerzy. 1934. “On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection.” Journal of the Royal Statistical Society 97 (4): 558–625. https://doi.org/10.2307/2342192.\n\n\nNobles, Melissa. 2002. “Racial Categorization and Censuses.” In Census and Identity: The Politics of Race, Ethnicity, and Language in National Censuses, edited by David Kertzer and Dominique Arel, 43–70. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511606045.003.\n\n\nPlant, Anne, and Robert Hanisch. 2020. “Reproducibility in Science: A Metrology Perspective.” Harvard Data Science Review 2 (4). https://doi.org/10.1162/99608f92.eb6ddee4.\n\n\nPrévost, Jean-Guy, and Jean-Pierre Beaud. 2015. Statistics, Public Debate and the State, 1800–1945: A Social, Political and Intellectual History of Numbers. Routledge.\n\n\nR Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.\n\n\nRegister, Yim. 2020. “Introduction to Sampling and Randomization.” YouTube, November. https://youtu.be/U272FFxG8LE.\n\n\nRockoff, Hugh. 2019. “On the Controversies Behind the Origins of the Federal Economic Statistics.” Journal of Economic Perspectives 33 (1): 147–64. 
https://doi.org/10.1257/jep.33.1.147.\n\n\nRose, Angela, Rebecca Grais, Denis Coulombier, and Helga Ritter. 2006. “A Comparison of Cluster and Systematic Sampling Methods for Measuring Crude Mortality.” Bulletin of the World Health Organization 84: 290–96. https://doi.org/10.2471/blt.05.029181.\n\n\nRuggles, Steven, Sarah Flood, Sophia Foster, Ronald Goeken, Jose Pacas, Megan Schouweiler, and Matthew Sobek. 2021. “IPUMS USA: Version 11.0.” Minneapolis, MN: IPUMS. https://doi.org/10.18128/d010.v11.0.\n\n\nSakshaug, Joseph, Ting Yan, and Roger Tourangeau. 2010. “Nonresponse Error, Measurement Error, and Mode of Data Collection: Tradeoffs in a Multi-Mode Survey of Sensitive and Non-Sensitive Items.” Public Opinion Quarterly 74 (5): 907–33. https://doi.org/10.1093/poq/nfq057.\n\n\nSalganik, Matthew, and Douglas Heckathorn. 2004. “Sampling and Estimation in Hidden Populations Using Respondent-Driven Sampling.” Sociological Methodology 34 (1): 193–240. https://doi.org/10.1111/j.0081-1750.2004.00152.x.\n\n\nScott, James. 1998. Seeing Like a State. Yale University Press.\n\n\nSobek, Matthew, and Steven Ruggles. 1999. “The IPUMS Project: An Update.” Historical Methods: A Journal of Quantitative and Interdisciplinary History 32 (3): 102–10. https://doi.org/10.1080/01615449909598930.\n\n\nSomers, James. 2017. “Torching the Modern-Day Library of Alexandria.” The Atlantic, April. https://www.theatlantic.com/technology/archive/2017/04/the-tragedy-of-google-books/523320/.\n\n\nStatistics Canada. 2020. “Sex at Birth and Gender: Technical Report on Changes for the 2021 Census.” Statistics Canada. https://www12.statcan.gc.ca/census-recensement/2021/ref/98-20-0002/982000022020002-eng.pdf.\n\n\n———. 2023. “Guide to the Census of Population, 2021.” Statistics Canada. https://www12.statcan.gc.ca/census-recensement/2021/ref/98-304/98-304-x2021001-eng.pdf.\n\n\nSteckel, Richard. 1991. 
“The Quality of Census Data for Historical Inquiry: A Research Agenda.” Social Science History 15 (4): 579–99. https://doi.org/10.2307/1171470.\n\n\nStigler, Stephen. 1986. The History of Statistics. Massachusetts: Belknap Harvard.\n\n\nStoler, Ann Laura. 2002. “Colonial Archives and the Arts of Governance.” Archival Science 2 (March): 87–109. https://doi.org/10.1007/bf02435632.\n\n\nTal, Eran. 2020. “Measurement in Science.” In The Stanford Encyclopedia of Philosophy, edited by Edward Zalta, Fall 2020. https://plato.stanford.edu/archives/fall2020/entries/measurement-science/; Metaphysics Research Lab, Stanford University.\n\n\nTaylor, Adam. 2015. “New Zealand Says No to Jedis.” The Washington Post, September. https://www.washingtonpost.com/news/worldviews/wp/2015/09/29/new-zealand-says-no-to-jedis/.\n\n\nTimbers, Tiffany. 2020. canlang: Canadian Census language data. https://ttimbers.github.io/canlang/.\n\n\nVanhoenacker, Mark. 2015. Skyfaring: A Journey with a Pilot. 1st ed. Alfred A. Knopf.\n\n\nvon Bergmann, Jens, Dmitry Shkolnik, and Aaron Jacobs. 2021. cancensus: R package to access, retrieve, and work with Canadian Census data and geography. https://mountainmath.github.io/cancensus/.\n\n\nWalby, Kevin, and Alex Luscombe. 2019. Freedom of Information and Social Science Research Design. Routledge.\n\n\nWalker, Kyle. 2022. Analyzing US Census Data. Chapman; Hall/CRC. https://walker-data.com/census-r/index.html.\n\n\nWalker, Kyle, and Matt Herman. 2022. tidycensus: Load US Census Boundary and Attribute Data as “tidyverse” and “sf”-Ready Data Frames. https://CRAN.R-project.org/package=tidycensus.\n\n\nWhitby, Andrew. 2020. The Sum of the People. New York: Basic Books.\n\n\nWhitelaw, James. 1805. An Essay on the Population of Dublin. Being the Result of an Actual Survey Taken in 1798, with Great Care and Precision, and Arranged in a Manner Entirely New. 
Graisberry; Campbell.\n\n\nWickham, Hadley, Mara Averick, Jenny Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.\n\n\nWu, Changbao, and Mary Thompson. 2020. Sampling Theory and Practice. Springer.\n\n\nXie, Yihui. 2023. knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.\n\n\nZhang, Ping, XunPeng Shi, YongPing Sun, Jingbo Cui, and Shuai Shao. 2019. “Have China’s provinces achieved their targets of energy intensity reduction? Reassessment based on nighttime lighting data.” Energy Policy 128 (May): 276–83. https://doi.org/10.1016/j.enpol.2019.01.014.",
"crumbs": [
"Acquisition",
- "6 Farm data "
+ "6 Measurement, censuses, and sampling "
]
},
{
"objectID": "06-farm.html#footnotes",
"href": "06-farm.html#footnotes",
- "title": "6 Farm data",
+ "title": "6 Measurement, censuses, and sampling",
"section": "",
"text": "We can obtain this using a trick attributed to Carl Friedrich Gauss, the eighteenth-century mathematician, who noticed that the sum of one to any number can be quickly obtained by halving that number and then multiplying the result by one plus the number. In this case, we have \\(50 \\times 101\\). Alternatively we can use R: sum(1:100).↩︎",
"crumbs": [
"Acquisition",
- "6 Farm data "
+ "6 Measurement, censuses, and sampling "
]
},
{
"objectID": "07-gather.html",
"href": "07-gather.html",
- "title": "7 Gather data",
+ "title": "7 APIs, scraping, and parsing",
"section": "",
"text": "7.1 Introduction\nIn this chapter we consider data that we must gather ourselves. This means that although the observations exist, we must parse, pull, clean, and prepare them to get the dataset that we will consider. In contrast to farmed data, discussed in Chapter 6, often these observations are not being made available for the purpose of analysis. This means that we need to be especially concerned with documentation, inclusion and exclusion decisions, missing data, and ethical behavior.\nAs an example of such a dataset, consider Cummins (2022) who create a dataset using individual-level probate records from England between 1892 and 1992. They find that about one-third of the inheritance of “elites” is concealed. Similarly, Taflaga and Kerby (2019) construct a systematic dataset of job responsibilities based on Australian Ministerial telephone directories. They find substantial differences by gender. Neither wills nor telephone directories were created for the purpose of being included in a dataset. But with a respectful approach they enable insights that we could not get by other means. We term this “data gathering”—the data exist but we need to get them.\nDecisions need to be made at the start of a project about the values we want the project to have. For instance, Saulnier et al. (2022) value transparency, reproducibility, fairness, being self-critical, and giving credit. How might that affect the project? Valuing “giving credit” might mean being especially zealous about attribution and licensing. In the case of gathered data we should give special thought to this as the original, unedited data may not be ours.\nThe results of a data science workflow cannot be better than their underlying data (Bailey 2008). Even the most-sophisticated statistical analysis will struggle to adjust for poorly-gathered data. This means when working in a team, data gathering should be overseen and at least partially conducted by senior members of the team. 
And when working by yourself, try to give special consideration and care to this stage.\nIn this chapter we go through a variety of approaches for gathering data. We begin with the use of APIs and semi-structured data, such as JSON and XML. Using an API is typically a situation in which the data provider has specified the conditions under which they are comfortable providing access. An API allows us to write code to gather data. This is valuable because it can be efficient and scale well. Developing comfort with gathering data through APIs enables access to exciting datasets. For instance, Wong (2020) use the Facebook Political Ad API to gather 218,100 of the Trump 2020 campaign ads to better understand the campaign.\nWe then turn to web scraping, which we may want to use when there are data available on a website. As these data have typically not been put together for the purposes of being a dataset, it is especially important to have deliberate and definite values for the project. Scraping is a critical part of data gathering because there are many data sources where the priorities of the data provider mean they have not implemented an API. For instance, considerable use of web scraping was critical for creating COVID-19 dashboards in the early days of the pandemic (Eisenstein 2022).\nFinally, we consider gathering data from PDFs. This enables the construction of interesting datasets, especially those contained in government reports and old books. Indeed, while freedom of information legislation exists in many countries and requires the government to make data available, requests all too often result in spreadsheets being shared as PDFs, even when they were a CSV to begin with.\nGathering data can require more of us than using farmed data, but it allows us to explore datasets and answer questions that we could not otherwise. Some of the most exciting work in the world uses gathered data, but it is especially important that we approach it with respect.",
"crumbs": [
"Acquisition",
- "7 Gather data "
+ "7 APIs, scraping, and parsing "
]
},
{
"objectID": "07-gather.html#apis",
"href": "07-gather.html#apis",
- "title": "7 Gather data",
+ "title": "7 APIs, scraping, and parsing",
"section": "7.2 APIs",
"text": "7.2 APIs\nIn everyday language, and for our purposes, an Application Programming Interface (API) is a situation in which someone has set up specific files on their computer such that we can follow their instructions to get them. For instance, when we use a gif on Slack, one way it could work in the background is that Slack asks Giphy’s server for the appropriate gif, Giphy’s server gives that gif to Slack, and then Slack inserts it into the chat. The way in which Slack and Giphy interact is determined by Giphy’s API. More strictly, an API is an application that runs on a server that we access using the HTTP protocol.\nHere we focus on using APIs for gathering data. In that context an API is a website that is set up for another computer to be able to access it, rather than a person. For instance, we could go to Google Maps. And we could then scroll and click and drag to center the map on, say, Canberra, Australia. Or we could paste this link into the browser. By pasting that link, rather than navigating, we have mimicked how we will use an API: provide a URL and be given something back. In this case the result should be a map like Figure 7.1.\n\n\n\n\n\n\nFigure 7.1: Example of an API response from Google Maps, as of 12 February 2023\n\n\n\nThe advantage of using an API is that the data provider usually specifies the data that they are willing to provide, and the terms under which they will provide it. These terms may include aspects such as rate limits (i.e. how often we can ask for data), and what we can do with the data, for instance, we might not be allowed to use it for commercial purposes, or to republish it. As the API is being provided specifically for us to use it, it is less likely to be subject to unexpected changes or legal issues. Because of this it is clear that when an API is available, we should try to use it rather than web scraping.\nWe will now go through a few case studies of using APIs. 
In the first we deal directly with an API using httr. And then we access data from Spotify using spotifyr.\n\n7.2.1 arXiv, NASA, and Dataverse\nAfter installing and loading httr we use GET() to obtain data from an API directly. This will try to get some specific data and the main argument is “url”. This is similar to the Google Maps example in Figure 7.1 where the specific information that we were interested in was a map.\n\n7.2.1.1 arXiv\nIn this case study we will use an API provided by arXiv. arXiv is an online repository for academic papers before they go through peer review. These papers are typically referred to as “pre-prints”. We use GET() to ask arXiv to obtain some information about a pre-print by providing a URL.\n\narxiv <- GET(\"http://export.arxiv.org/api/query?id_list=2310.01402\")\n\nstatus_code(arxiv)\n\nWe can use status_code() to check our response. For instance, 200 means a success, while 400 means we received an error from the server. Assuming we received something back from the server, we can use content() to display it. In this case we have received XML formatted data. XML is a markup language where entries are identified by tags, which can be nested within other tags. After installing and loading xml2 we can read XML using read_xml(). XML is a semi-formatted structure, and it can be useful to start by having a look at it using html_structure().\n\ncontent(arxiv) |>\n read_xml() |>\n html_structure()\n\nWe might like to create a dataset based on extracting various aspects of this XML tree. 
For instance, we might look at “entry”, which is the eighth item, and in particular obtain the “title” and the “URL”, which are the fourth and ninth items, respectively, within “entry”.\n\ndata_from_arxiv <-\n tibble(\n title = content(arxiv) |>\n read_xml() |>\n xml_child(search = 8) |>\n xml_child(search = 4) |>\n xml_text(),\n link = content(arxiv) |>\n read_xml() |>\n xml_child(search = 8) |>\n xml_child(search = 9) |>\n xml_attr(\"href\")\n )\ndata_from_arxiv\n\n\n\n7.2.1.2 NASA Astronomy Picture of the Day\nTo consider another example, each day, NASA provides the Astronomy Picture of the Day (APOD) through its APOD API. We can use GET() to obtain the URL for the photo on particular dates and then display it.\n\nNASA_APOD_20190719 <-\n GET(\"https://api.nasa.gov/planetary/apod?api_key=DEMO_KEY&date=2019-07-19\")\n\nExamining the returned data using content(), we can see that we are provided with various fields, such as date, title, explanation, and a URL.\n\n# APOD July 19, 2019\ncontent(NASA_APOD_20190719)$date\n\n[1] \"2019-07-19\"\n\ncontent(NASA_APOD_20190719)$title\n\n[1] \"Tranquility Base Panorama\"\n\ncontent(NASA_APOD_20190719)$explanation\n\n[1] \"On July 20, 1969 the Apollo 11 lunar module Eagle safely touched down on the Moon. It landed near the southwestern corner of the Moon's Mare Tranquillitatis at a landing site dubbed Tranquility Base. This panoramic view of Tranquility Base was constructed from the historic photos taken from the lunar surface. On the far left astronaut Neil Armstrong casts a long shadow with Sun is at his back and the Eagle resting about 60 meters away ( AS11-40-5961). He stands near the rim of 30 meter-diameter Little West crater seen here to the right ( AS11-40-5954). 
Also visible in the foreground is the top of the camera intended for taking stereo close-ups of the lunar surface.\"\n\ncontent(NASA_APOD_20190719)$url\n\n[1] \"https://apod.nasa.gov/apod/image/1907/apollo11TranquilitybasePan600h.jpg\"\n\n\nWe can provide that URL to include_graphics() from knitr to display it (Figure 7.2).\n\n\n\n\n\n\n\n\n\n(a) Tranquility Base Panorama (Image Credit: Neil Armstrong, Apollo 11, NASA)\n\n\n\n\n\nFigure 7.2: Images obtained from the NASA APOD API\n\n\n\n\n\n7.2.1.3 Dataverse\nFinally, another common API response in semi-structured form is JSON. JSON is a human-readable way to store data that can be parsed by machines. In contrast to, say, a CSV, where we are used to rows and columns, JSON uses key-value pairs.\n\n{\n \"firstName\": \"Rohan\",\n \"lastName\": \"Alexander\",\n \"age\": 36,\n \"favFoods\": {\n \"first\": \"Pizza\",\n \"second\": \"Bagels\",\n \"third\": null\n }\n}\n\n\nWe can parse JSON with jsonlite. To consider a specific example, we use a “Dataverse” which is a web application that makes it easier to share datasets. We can use an API to query a demonstration dataverse. For instance, we might be interested in datasets related to politics.\n\npolitics_datasets <-\n fromJSON(\"https://demo.dataverse.org/api/search?q=politics\")\n\npolitics_datasets\n\n$status\n[1] \"OK\"\n\n$data\n$data$q\n[1] \"politics\"\n\n$data$total_count\n[1] 4\n\n$data$start\n[1] 0\n\n$data$spelling_alternatives\nnamed list()\n\n$data$items\n name type\n1 CAP - United Kingdom dataverse\n2 China Archive Dataverse dataverse\n3 CAP - Australia dataverse\n4 České panelové šetření domácností 1. 
vlna - hlavní soubor dataset\n url\n1 https://demo.dataverse.org/dataverse/CAP_UK\n2 https://demo.dataverse.org/dataverse/china-archive\n3 https://demo.dataverse.org/dataverse/CAP_Australia\n4 https://doi.org/10.70122/FK2/V5KAMC\n image_url identifier\n1 https://demo.dataverse.org/api/access/dvCardImage/2058461 CAP_UK\n2 <NA> china-archive\n3 https://demo.dataverse.org/api/access/dvCardImage/2058462 CAP_Australia\n4 <NA> <NA>\n description\n1 The UK Policy Agendas Project seeks to develop systematic measures of the policy agenda of British government and politics over time. It applies the policy content coding system of the original Policy Agendas Project in the United States, founded by Frank Baumgartner and Bryan Jones, with the aim of creating a consistent record of the issues that are attended to at different points in time, across many of the main venues of British public policy and politics – namely in parliament, the media and public opinion. The reliability of these measures of policy attention are ensured through adherence to clearly defined coding rules and standards, which give us confidence that changes in the priorities of government can be tracked consistently over time and in different arenas of politics. Location: University of Edinburgh; University of Southampton Downloadable Data Series: 12 Time Span: 1910-2015 Total Observations: 125,539\n2 Introduction The China Archive is a data archive dedicated to support of scholarly, empirical research by anthropologists, economists, historians, political scientists, sociologists, and others in the fields of business, agriculture, and engineering. The goal of the Archive is to enable case research on Chinese domestic matters and China–U.S. relations, as well as to facilitate the inclusion of China in more broadly comparative studies. 
To that end, the Archive’s mission includes: acquiring and maintaining extant data sets and data sources on an ongoing basis, facilitating production of quantitative data from textual information when such is desirable and feasible, making all of the Archive’s data available on a user-friendly basis to scholars at and visiting Texas A&M University, and establishing web-based links to searchable electronic sources of information on China. As long-term goals, The China Archive is especially dedicated to: locating and acquiring difficult-to-obtain Chinese data such as public opinion data, making as many of the holdings as possible available online or via other computer media, and providing capability for converting textual data to numerical data suitable for quantitative analysis. In keeping with these goals, the Archive includes data sets collected by individuals and research centers/institutes as well as by government agencies. The Archive was planned by a faculty committee in 2002–2003, and is now a project of the Texas A&M University Libraries. A faculty committee continues to function as an advisory committee to the Libraries on matters related to the upkeep and expansion of The China Archive. The faculty committee is, in turn, advised by an External Advisory Panel, composed of China scholars from other academic institutions in the United States. 
Faculty Planning Committee Bob Harmel, Archive Planning Committee Chair; Political Science; Director, Program in the Cross-National Study of Politics Stephen Atkins, Sterling Evans Library Ben Crouch, Associate Dean, College of Liberal Arts Qi Li, Department of Economics Xinsheng Liu, Institute for Science, Technology and Public Policy, Bush School of Government and Public Service; School of Government, Peking University Rick Nader, Director, Institute for Pacific Asia Dudley Poston, Professor of Sociology and Abell Professor of Liberal Arts Raghavan (Srini) Srinivasan, Director, Spatial Sciences Laboratory, Texas Agriculture Experiment Station Di Wang, Assistant Professor of History Ben Wu, Associate Professor, Rangeland Ecology and Management Wei Zhao, Professor, Computer Science; Associate Vice President for Research Dianzhi (Dan) Sui, Department of Geography\n3 The Australian Policy Agendas Project collects and organizes data on Australian legislation, executive speeches, opposition questions, public opinion, media coverage, and High Court decisions. Some details are listed below. Data is forthcoming. Decisions of the High Court of Australia This dataset contains information on every case decided by the High Court of Australia between the years 1970 and 2015. Cases serve as the unit of analysis. Each case was coded in terms of its policy content and several other variables controlling for the nature of the case and the nature of the court. In coding for policy content, we utilized the Comparative Agendas Project’s topics coding scheme, where each case was assigned both a major topic and a sub topic depending on its policy content. A full description of these categories and their corresponding codes may be found in the codebook. Sydney Morning Herald - Front Page Articles This dataset contains information on each article published on the Sydney Morning Herald's front page for each day from 1990 through 2015. Front page articles serve as the unit of analysis. 
Each article was coded in terms of its policy content and other variables of interest controlling for location, political context, and key actors. In coding for policy content, we utilized the Comparative Agendas Project’s major topics coding scheme, where each article was assigned a major topic code. A full description of the policy content categories and their corresponding codes may be found in the major topics codebook. Dr. Keith Dowding (ANU), Dr. Aaron Martin (Melbourne), and Dr. Rhonda Evans (UT-Austin) lead the Australian Policy Agendas Project. Dr. Dowding and Dr. Martin coded legislation, executive speeches, opposition questions, public opinion, and media data, and Dr. Evans collected data on decisions of the High Court of Australia as well as additional media data. Data is forthcoming. Principal Investigator: Dr. Keith Dowding, Dr. Aaron Martin, Dr. Rhonda Evans Location: Australian National University, University of Melbourne, The University of Texas at Austin Downloadable Data Series: 1 Time Span: 1970-2015 Total Observations: 2,548 Sponsoring Institutions Dr. Dowding and Dr. Martin’s research was funded by the Australian Research Council Discovery Award DP 110102622. Dr. Evans’ research is funded by the Edward A. Clark Center for Australian and New Zealand Studies at The University of Texas at Austin.\n4 České panelové šetření domácností (CHPS) je výběrové šetření, v němž je v letech 2015-2018 opakovaně dotazován náhodně vybraný vzorek domácností žijících na území České republiky. V první vlně sběru dat, která proběhla od července do prosince 2015, bylo dotázáno 5 159 domácností. Do konce roku 2018 budou realizovány další 3 vlny sběru dat s roční periodicitou. 
Pro komunikaci s veřejností je pro výzkum využíván název Proměny české společnosti http://www.promenyceskespolecnosti.cz Cílem výzkumu je zmapovat životní podmínky českých domácností z dlouhodobé perspektivy, charakterizovat proces změny v životě domácností a jednotlivců a vztáhnout proces sociální změny ke vztahům a dění v domácnostech. CHPS je mezioborový výzkum využívající přístupy sociologie, ekonomie a politologie. Zahrnuje pět hlavních tematických okruhů - rodinný život, užívání času, zdraví, vzdělání a trh práce, sociální stratifikace, bydlení, politická participace a občanská společnost. Výzkum organizuje Sociologický ústav Akademie věd ČR, v.v.i., CERGE-EI (společné pracoviště Centra pro ekonomický výzkum a doktorské studium Univerzity Karlovy a Národohospodářského ústavu Akademie věd ČR) a Fakulta sociálních studií Masarykovy univerzity. Výzkumný tým složený ze členů těchto institucí vytvořil metodický plán výzkumu a dotazové instrumenty, vybral ve veřejné soutěži realizátory terénních prací, s realizátory terénních prací spolupracuje na kontrole a čištění dat, zpracovává konečné datové soubory a zajišťuje jejich uložení v Českém sociálně vědním datovém archivu Sociologického ústavu AV ČR, v.v.i. Sběr dat realizují agentury MEDIAN, s.r.o., a STEM/MARK, a.s., které patří k nejvýznamnějším agenturám pro výzkum trhu a veřejného mínění v České republice. Jsou zakládající členové organizace SIMAR (Sdružení agentur pro výzkum trhu a veřejného mínění) a členové ESOMAR (European Society for Opinion and Marketing Research). Mimo sběr dat provedli realizátoři výběr vzorku, spolupracují na kontrole a čištění dat a poskytují konzultace k metodice výzkumu. 
Výzkum finančně podpořila Grantová agentura ČR z grantu GB14-36154G (Dynamika změny v české společnosti).\n published_at publicationStatuses affiliation\n1 2023-06-06T17:18:53Z Published Comparative Agendas Project\n2 2016-12-09T20:13:22Z Published Texas A & M University\n3 2023-06-06T17:18:42Z Published Comparative Agendas Project\n4 2024-10-10T09:18:42Z Published <NA>\n parentDataverseName global_id\n1 Comparative Agendas Project (CAP) <NA>\n2 Demo Dataverse <NA>\n3 Comparative Agendas Project (CAP) <NA>\n4 <NA> doi:10.70122/FK2/V5KAMC\n publisher\n1 <NA>\n2 <NA>\n3 <NA>\n4 Czech Social Science Data Archive - TEST DATAVERSE\n citationHtml\n1 <NA>\n2 <NA>\n3 <NA>\n4 Lux, Martin; Hamplová, Dana; Linek, Lukáš; Chaloupková, Klímová, Jana; Sunega, Petr; Mysíková, Martina; Kudrnáč, Aleš; Krejčí, Jindřich; Röschová, Michaela; Hrubá , Lucie, 2024, \"České panelové šetření domácností 1. vlna - hlavní soubor\", <a href=\"https://doi.org/10.70122/FK2/V5KAMC\" target=\"_blank\">https://doi.org/10.70122/FK2/V5KAMC</a>, Demo Dataverse, V1\n identifier_of_dataverse name_of_dataverse\n1 <NA> <NA>\n2 <NA> <NA>\n3 <NA> <NA>\n4 CSDA_TEST Czech Social Science Data Archive - TEST DATAVERSE\n citation\n1 <NA>\n2 <NA>\n3 <NA>\n4 Lux, Martin; Hamplová, Dana; Linek, Lukáš; Chaloupková, Klímová, Jana; Sunega, Petr; Mysíková, Martina; Kudrnáč, Aleš; Krejčí, Jindřich; Röschová, Michaela; Hrubá , Lucie, 2024, \"České panelové šetření domácností 1. 
vlna - hlavní soubor\", https://doi.org/10.70122/FK2/V5KAMC, Demo Dataverse, V1\n storageIdentifier\n1 <NA>\n2 <NA>\n3 <NA>\n4 s3://10.70122/FK2/V5KAMC\n keywords\n1 NULL\n2 NULL\n3 NULL\n4 péče o dítě, rodiny, příjem, postoje, genderové role, volný čas - činnosti, zdraví, zdraví (osobní), kariéra, pracovní příležitosti, cíle vzdělávání, práce a zaměstnanost, pracovní podmínky, sociální nerovnost, sociální struktura, sociální mobilita, příjmy a bohatství, ekonomické podmínky, kulturní hodnoty, bydlení, volební preference, volební chování, sociální postoje, politické postoje\n subjects fileCount versionId versionState majorVersion minorVersion\n1 NULL NA NA <NA> NA NA\n2 NULL NA NA <NA> NA NA\n3 NULL NA NA <NA> NA NA\n4 Social Sciences 0 267076 RELEASED 1 0\n createdAt updatedAt\n1 <NA> <NA>\n2 <NA> <NA>\n3 <NA> <NA>\n4 2024-10-10T09:07:57Z 2024-10-10T13:00:06Z\n contacts\n1 NULL\n2 NULL\n3 NULL\n4 Röschová, Michaela, Sociologický ústav AV ČR, v.v.i.\n publications\n1 NULL\n2 NULL\n3 NULL\n4 Kudrnáč, A., Eger. A. M., Hjerm, M. 2023. Scapegoating Immigrants in Times of Personal and Collective Crises: Results from a Czech Panel Study. International Migration Review 0(0)1-20, Kudrnáčová M, Kudrnáč A (2023) Better sleep, better life? testing the role of sleep on quality of life. PLoS ONE 18(3): e0282085. https://doi.org/10.1371/journal.pone.0282085, Sládek, M., Klusáček, J., Hamplová, D., & Sumová, A. 2023. Population-representative study reveals cardiovascular and metabolic disease biomarkers associated with misaligned sleep schedules. Sleep, https://doi.org/10.1093/sleep/zsad037, Raudenská, P., D. Hamplová. 2022. The Effect of Parents' Education and Income on Children's School Performance: the Mediating Role of the Family Environment and Children's Characteristics, and Gender Differences. Polish Sociological Review 218: 247-271, https://doi.org/10.26412/psr218.06, Kudrnáčová, M., D. Hamplová. 2022. Social Jetlag in the Context of Work and Family. 
Sociológia / Slovak Sociological Review: 54(4): 295-324. 0049-1225. DOI: https://doi.org/10.31577/sociologia.2022.54.4.11, Kudrnáč, A., J. Klusáček. 2022. Dočasný nárůst důvěry ve vládu a dodržování protiepidemických opatření na počátku koronavirové krize. Sociologický časopis / Czech Sociological Review 58(2):119-150, https://doi.org/10.13060/csr.2022.016., Hamplová, D., Raudenská, P. 2021. Gender Differences in the Link between Family Scholarly Culture and Parental Educational Aspirations. Sociológia - Slovak Sociological Review 53 (5): 435-462, https://doi.org/10.31577/sociologia.2021.53.5.16, Raudenská, P., K. Bašná. 2021. Individual’s cultural capital: intergenerational transmission, partner effect, or individual merit? Poetics, https://doi.org/10.1016/j.poetic.2021.101575, Kudrnáč, A. 2021. A study of the effects of obesity and poor health on the relationship between distance to the polling station and the probability to vote. Party Politics 27(3):540-551, https://doi.org/10.1177/1354068819867414, Sládek, M., Kudrnáčová Röschová, M., Adámková, V., Hamplová, D., & Sumová, A. 2020. Chronotype assessment via a large scale socio-demographic survey favours yearlong Standard time over Daylight Saving Time in central Europe. Scientific reports, 10(1), 1419, https://doi.org/10.1038/s41598-020-58413-9, Nývlt, Ondřej. 2018. Socio-demografické determinanty studijních výsledků a začátku pracovní kariéry v České republice. Demografie 60 (2):111-123, Lux, M., P. Sunega, L. Kážmér. 2018. Intergenerational financial transfers and indirect reciprocity: determinants of the reproduction of homeownership in the postsocialist Czech Republic. Housing Studies, https://doi.org/10.1080/02673037.2018.1541441, Slepičková, Lenka, Petr Fučík. 2018. Využívání předškolního vzdělávání v České republice: Komu chybí místa ve školkách? Sociologický časopis / Czech Sociological Review 54 (1): 35-62, https://doi.org/10.13060/00380288.2018.54.1.395, Hrubá, L. 2017. 
Sociální determinanty vysokých vzdělanostních očekávání rodičů. Sociológia - Slovak Sociological Review 49 (5): 463-481, https://journals.sagepub.com/doi/10.1177/01979183231177971, https://doi.org/10.1371/journal.pone.0282085, https://doi.org/10.1093/sleep/zsad037, https://doi.org/10.26412/psr218.06, https://doi.org/10.31577/sociologia.2022.54.4.11, https://doi.org/10.13060/csr.2022.016, https://www.sav.sk/index.php?lang=sk&doc=journal-list&part=article_response_page&journal_article_no=26724, https://www.sciencedirect.com/science/article/pii/S0304422X21000590?via%3Dihub, https://journals.sagepub.com/doi/10.1177/1354068819867414, https://doi.org/10.1038/s41598-020-58413-9, https://www.czechdemography.cz/aktuality/demografie-c-2-2018/, https://doi.org/10.1080/02673037.2018.1541441, https://doi.org/10.13060/00380288.2018.54.1.395, https://www.ceeol.com/search/article-detail?id=583362\n producers\n1 NULL\n2 NULL\n3 NULL\n4 Sociologický ústav, Centrum pro ekonomický výzkum a doktorské studium – Národohospodářský ústav, Fakulta sociálních věd\n geographicCoverage\n1 NULL\n2 NULL\n3 NULL\n4 Czech Republic\n authors\n1 NULL\n2 NULL\n3 NULL\n4 Lux, Martin, Hamplová, Dana, Linek, Lukáš, Chaloupková, Klímová, Jana, Sunega, Petr, Mysíková, Martina, Kudrnáč, Aleš, Krejčí, Jindřich, Röschová, Michaela, Hrubá , Lucie\n\n$data$count_in_response\n[1] 4\n\n\nWe could look at the dataset using View(politics_datasets), which would allow us to expand the tree based on what we are interested in. 
We can even get the code that we need to focus on different aspects by hovering on the item and then clicking the icon with the green arrow (Figure 7.3).\n\n\n\n\n\n\nFigure 7.3: Example of hovering over a JSON element, “items”, where the icon with a green arrow can be clicked on to get the code that would focus on that element\n\n\n\nThis tells us how to obtain the dataset of interest.\n\nas_tibble(politics_datasets[[\"data\"]][[\"items\"]])\n\n\n\n7.2.2 Spotify\nSometimes there is an R package built around an API that allows us to interact with it in ways that are similar to what we have seen before. For instance, spotifyr is a wrapper around the Spotify API. When using APIs, even when they are wrapped in an R package, in this case spotifyr, it is important to read the terms under which access is provided.\nTo access the Spotify API, we need a Spotify Developer Account. This is free but will require logging in with a Spotify account and then accepting the Developer Terms (Figure 7.4).\n\n\n\n\n\n\nFigure 7.4: Spotify Developer Account Terms agreement page\n\n\n\nContinuing with the registration process, in our case, we “do not know” what we are building and so Spotify requires us to use a non-commercial agreement, which is fine. To use the Spotify API we need a “Client ID” and a “Client Secret”. These are things that we want to keep to ourselves because otherwise anyone with the details could use our developer account as though they were us. One way to keep these details secret with minimum hassle is to keep them in our “System Environment”. In this way, when we push to GitHub they should not be included. To do this we will load and use usethis to modify our System Environment. 
In particular, there is a file called “.Renviron” which we will open and then add our “Client ID” and “Client Secret”.\n\nedit_r_environ()\n\nWhen we run edit_r_environ(), a “.Renviron” file will open and we can add our “Spotify Client ID” and “Client Secret”. Use the same names, because spotifyr will look in our environment for keys with those specific names. Being careful to use single quotes is important here even though we normally use double quotes in this book.\n\nSPOTIFY_CLIENT_ID = 'PUT_YOUR_CLIENT_ID_HERE'\nSPOTIFY_CLIENT_SECRET = 'PUT_YOUR_SECRET_HERE'\n\nSave the “.Renviron” file, and then restart R: “Session” \\(\\rightarrow\\) “Restart R”. We can now use our “Spotify Client ID” and “Client Secret” as needed. And functions that require those details as arguments will work without them being explicitly specified again.\nTo try this out we install and load spotifyr. We will get and save some information about Radiohead, the English rock band, using get_artist_audio_features(). One of the required arguments is authorization, but as that is set, by default, to look at the “.Renviron” file, we do not need to specify it here.\n\nradiohead <- get_artist_audio_features(\"radiohead\")\nsaveRDS(radiohead, \"radiohead.rds\")\n\n\nradiohead <- readRDS(\"radiohead.rds\")\n\nThere is a variety of information available based on songs. We might be interested to see whether their songs are getting longer over time (Figure 7.5). 
Following the guidance in Chapter 5 this is a nice opportunity to additionally use a boxplot to communicate summary statistics by album at the same time.\n\nradiohead <- as_tibble(radiohead)\n\nradiohead |>\n mutate(album_release_date = ymd(album_release_date)) |>\n ggplot(aes(\n x = album_release_date,\n y = duration_ms,\n group = album_release_date\n )) +\n geom_boxplot() +\n geom_jitter(alpha = 0.5, width = 0.3, height = 0) +\n theme_minimal() +\n labs(\n x = \"Album release date\",\n y = \"Duration of song (ms)\"\n )\n\n\n\n\n\n\n\nFigure 7.5: Length of each Radiohead song, over time, as gathered from Spotify\n\n\n\n\n\nOne interesting variable provided by Spotify about each song is “valence”. The Spotify documentation describes this as a measure between zero and one that signals “the musical positiveness” of the track with higher values being more positive. We might be interested to compare valence over time between a few artists, for instance, Radiohead, the American rock band The National, and the American singer Taylor Swift.\nFirst, we need to gather the data.\n\ntaylor_swift <- get_artist_audio_features(\"taylor swift\")\nthe_national <- get_artist_audio_features(\"the national\")\n\nsaveRDS(taylor_swift, \"taylor_swift.rds\")\nsaveRDS(the_national, \"the_national.rds\")\n\nThen we can bring them together and make the graph (Figure 7.6). 
This appears to show that while Taylor Swift and Radiohead have largely maintained their level of valence over time, The National has decreased theirs.\n\nrbind(taylor_swift, the_national, radiohead) |>\n select(artist_name, album_release_date, valence) |>\n mutate(album_release_date = ymd(album_release_date)) |>\n ggplot(aes(x = album_release_date, y = valence, color = artist_name)) +\n geom_point(alpha = 0.3) +\n geom_smooth() +\n theme_minimal() +\n facet_wrap(facets = vars(artist_name), dir = \"v\") +\n labs(\n x = \"Album release date\",\n y = \"Valence\",\n color = \"Artist\"\n ) +\n scale_color_brewer(palette = \"Set1\") +\n theme(legend.position = \"bottom\")\n\n\n\n\n\n\n\nFigure 7.6: Comparing valence, over time, for Radiohead, Taylor Swift, and The National\n\n\n\n\n\nHow amazing that we live in a world where all that information is available with very little effort or cost! And having gathered the data, there is much that could be done. For instance, Pavlik (2019) uses an expanded dataset to classify musical genres and The Economist (2022) looks at how language is associated with music streaming on Spotify. Our ability to gather such data enables us to answer questions that had to be considered experimentally in the past. For instance, Salganik, Dodds, and Watts (2006) had to use experimental data to analyze the social aspect of what makes a hit song, rather than the observational data we can now access.\nThat said, it is worth thinking about what valence is purporting to measure. Little information is available in the Spotify documentation about how it was created. It is doubtful that one number can completely represent how positive a song is. And what about the songs from these artists that are not on Spotify, or even publicly released? This is a nice example of how measurement and sampling pervade all aspects of telling stories with data.",
"crumbs": [
"Acquisition",
- "7 Gather data "
+ "7 APIs, scraping, and parsing "
]
},
{
"objectID": "07-gather.html#web-scraping",
"href": "07-gather.html#web-scraping",
- "title": "7 Gather data",
+ "title": "7 APIs, scraping, and parsing",
"section": "7.3 Web scraping",
"text": "7.3 Web scraping\n\n7.3.1 Principles\nWeb scraping is a way to get data from websites. Rather than going to a website using a browser and then saving a copy of it, we write code that does it for us. This opens considerable data to us, but on the other hand, it is not typically data that are being made available for these purposes. This means that it is especially important to be respectful. While generally not illegal, the specifics about the legality of web scraping depend on jurisdictions and what we are doing, and so it is also important to be mindful. Even if our use is not commercially competitive, of particular concern is the conflict between the need for our work to be reproducible with the need to respect terms of service that may disallow data republishing (Luscombe, Dick, and Walby 2021).\nPrivacy often trumps reproducibility. There is also a considerable difference between data being publicly available on a website and being scraped, cleaned, and prepared into a dataset which is then publicly released. For instance, Kirkegaard and Bjerrekær (2016) scraped publicly available OKCupid profiles and then made the resulting dataset easily available (Hackett 2016). Zimmer (2018) details some of the important considerations that were overlooked including “minimizing harm”, “informed consent”, and ensuring those in the dataset maintain “privacy and confidentiality”. While it is correct to say that OKCupid made data public, they did so in a certain context, and when their data was scraped that context was changed.\n\n\n\n\n\n\nOh, you think we have good data on that!\n\n\n\nPolice violence is particularly concerning because of the need for trust between the police and society. Without good data it is difficult to hold police departments accountable, or know whether there is an issue, but getting data is difficult (Thomson-DeVeaux, Bronner, and Sharma 2021). 
The fundamental problem is that there is no way to easily simplify an encounter that results in violence into a dataset. Two popular datasets draw on web scraping:\n\n“Mapping Police Violence”; and\n“Fatal Force Database”.\n\nBor et al. (2018) use “Mapping Police Violence” to examine police killings of Black Americans, especially when unarmed, and find a substantial effect on the mental health of Black Americans. Responses to the paper, such as Nix and Lozada (2020), are particularly concerned with the coding of the dataset, and after re-coding draw different conclusions. An example of a coding difference is the unanswerable question, because it depends on context and usage, of whether to code an individual who was killed with a toy firearm as “armed” or “unarmed”. We may want a separate category, but some simplification is necessary for the construction of a quantitative dataset. The Washington Post writes many articles using the “Fatal Force Database” (The Washington Post 2023). Jenkins et al. (2022) describe their methodology and the challenges of standardization. Comer and Ingram (2022) compare the datasets and find similarities, but document ways in which the datasets are different.\n\n\nWeb scraping is an invaluable source of data. But the resulting datasets are typically a by-product of someone trying to achieve another aim. And web scraping imposes a cost on the website host, and so we should reduce this to the extent possible. For instance, a retailer may have a website with their products and their prices. That has not been created deliberately as a source of data, but we can scrape it to create a dataset. The following principles may be useful to guide web scraping.\n\nAvoid it. Try to use an API wherever possible.\nAbide by their desires. Some websites have a “robots.txt” file that contains information about what they are comfortable with scrapers doing. 
In general, if it exists, a “robots.txt” file can be accessed by appending “robots.txt” to the base URL. For instance, the “robots.txt” file for https://www.google.com can be accessed at https://www.google.com/robots.txt. Note if there are folders listed against “Disallow:”. These are the folders that the website would not like to be scraped. And also note any instances of “Crawl-delay:”. This is the number of seconds the website would like you to wait between visits.\nReduce the impact.\n\nSlow down the scraper. For instance, rather than having it visit the website every second, slow it down using Sys.sleep(). If you only need a few hundred files, then why not just have it visit the website a few times a minute, running in the background overnight?\nConsider the timing of when you run the scraper. For instance, if you are scraping a retailer then maybe set the script to run from 10pm through to the morning, when fewer customers are likely using the site. Similarly, if it is a government website and they have a regular monthly release, then it might be polite to avoid that day.\n\nTake only what is needed. For instance, you do not need to scrape the entirety of Wikipedia if all you need is the names of the ten largest cities in Croatia. This reduces the impact on the website, and allows us to more easily justify our actions.\nOnly scrape once. This means you should save everything as you go so that you do not have to re-collect data when the scraper inevitably fails at some point. For instance, you will typically spend considerable time getting a scraper working on one page, but typically the page structure will change at some point and the scraper will need to be updated. Once you have the data, you should save that original, unedited data separately to the modified data. 
If you need data over time then you will need to go back, but this is different than needlessly re-scraping a page.\nDo not republish the pages that were scraped (this contrasts with datasets that you create from them).\nTake ownership and ask permission if possible. At a minimum all scripts should have contact details in them. Depending on the circumstances, it may be worthwhile asking for permission before you scrape.\n\n\n\n7.3.2 HTML/CSS essentials\nWeb scraping is possible by taking advantage of the underlying structure of a webpage. We use patterns in the HTML/CSS to get the data that we want. To look at the underlying HTML/CSS we can either:\n\nopen a browser, right-click, and choose something like “Inspect”; or\nsave the website and then open it with a text editor rather than a browser.\n\nHTML is a markup language based on matching tags, while CSS controls how those tags are styled. If we want text to be bold, then we would use something like:\n<b>My bold text</b>\nSimilarly, if we want a list, then we start and end the list as well as indicating each item.\n<ul>\n <li>Learn webscraping</li>\n <li>Do data science</li>\n <li>Profit</li>\n</ul>\nWhen scraping we will search for these tags.\nTo get started, we can pretend that we obtained some HTML from a website, and that we want to get the name from it. We can see that the name is in bold, so we want to focus on that feature and extract it.\n\nwebsite_extract <- \"<p>Hi, I’m <b>Rohan</b> Alexander.</p>\"\n\nrvest is part of the tidyverse so it does not have to be installed, but it is not part of the core, so it does need to be loaded. After that, use read_html() to read in the data.\n\nrohans_data <- read_html(website_extract)\n\nrohans_data\n\n{html_document}\n<html>\n[1] <body><p>Hi, I’m <b>Rohan</b> Alexander.</p></body>\n\n\nThe term used by rvest for a tag is “node”, so we focus on bold nodes. By default html_elements() returns the tags as well. 
We extract the text with html_text().\n\nrohans_data |>\n html_elements(\"b\")\n\n{xml_nodeset (1)}\n[1] <b>Rohan</b>\n\nrohans_data |>\n html_elements(\"b\") |>\n html_text()\n\n[1] \"Rohan\"\n\n\nWeb scraping is an exciting source of data, and we will now go through some examples. But in contrast to these examples, information is not usually all on one page. Web scraping quickly becomes a difficult art form that requires practice. For instance, we distinguish between an index scrape and a contents scrape. The former is scraping to build the list of URLs that have the content you want, while the latter is to get the content from those URLs. An example is provided by Luscombe, Duncan, and Walby (2022). If you end up doing much web scraping, then polite (Perepolkin 2022) may be helpful to better optimize your workflow. GitHub Actions can also be used to allow for larger and slower scrapes over time.\n\n\n7.3.3 Book information\nIn this case study we will scrape a list of books available here. We will then clean the data and look at the distribution of the first letters of author surnames. It is slightly more complicated than the example above, but the underlying workflow is the same: download the website, look for the nodes of interest, extract the information, and clean it.\nWe use rvest to download a website, and to then navigate the HTML to find the aspects that we are interested in. And we use the tidyverse to clean the dataset. We first need to go to the website and then save a local copy.\n\nbooks_data <- read_html(\"https://rohansbooks.com\")\n\nwrite_html(books_data, \"raw_data.html\")\n\nWe need to navigate the HTML to get the aspects that we want. 
And then try to get the data into a tibble as quickly as possible because this will allow us to more easily use dplyr verbs and other functions from the tidyverse.\nSee Online Appendix A if this is unfamiliar to you.\n\nbooks_data <- read_html(\"raw_data.html\")\n\n\nbooks_data\n\n{html_document}\n<html>\n[1] <head>\\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8 ...\n[2] <body>\\n <h1>Books</h1>\\n\\n <p>\\n This is a list of books that ...\n\n\nTo get the data into a tibble we first need to use HTML tags to identify the data that we are interested in. If we look at the website then we know we need to focus on list items (Figure 7.7 (a)). And we can look at the source, focusing particularly on looking for a list (Figure 7.7 (b)).\n\n\n\n\n\n\n\n\n\n\n\n(a) Books website as displayed\n\n\n\n\n\n\n\n\n\n\n\n(b) HTML for the top of the books website and the list of books\n\n\n\n\n\n\n\nFigure 7.7: Screen captures from the books website as at 16 June 2022\n\n\n\nThe tag for a list item is “li”, so we can use that to focus on the list.\n\ntext_data <-\n books_data |>\n html_elements(\"li\") |>\n html_text()\n\nall_books <-\n tibble(books = text_data)\n\nhead(all_books)\n\n# A tibble: 6 × 1\n books \n <chr> \n1 \"\\n Agassi, Andre, 2009, Open\\n \" \n2 \"\\n Cramer, Richard Ben, 1992, What It Takes: The Way to the White Hou…\n3 \"\\n DeWitt, Helen, 2000, The Last Samurai\\n \" \n4 \"\\n Gelman, Andrew and Jennifer Hill, 2007, Data Analysis Using Regres…\n5 \"\\n Halberstam, David, 1972, The Best and the Brightest\\n \" \n6 \"\\n Ignatieff, Michael, 2013, Fire and Ashes: Success and Failure in P…\n\n\nWe now need to clean the data. First we want to separate the title and the author using separate() and then clean up the author and title columns. 
We can take advantage of the fact that the year is present and separate based on that.\n\nall_books <-\n all_books |>\n mutate(books = str_squish(books)) |>\n separate(books, into = c(\"author\", \"title\"), sep = \"\\\\, [[:digit:]]{4}\\\\, \")\n\nhead(all_books)\n\n# A tibble: 6 × 2\n author title \n <chr> <chr> \n1 Agassi, Andre Open \n2 Cramer, Richard Ben What It Takes: The Way to the White House \n3 DeWitt, Helen The Last Samurai \n4 Gelman, Andrew and Jennifer Hill Data Analysis Using Regression and Multileve…\n5 Halberstam, David The Best and the Brightest \n6 Ignatieff, Michael Fire and Ashes: Success and Failure in Polit…\n\n\nFinally, we could make, say, a table of the distribution of the first letter of the names (Table 7.1).\n\nall_books |>\n mutate(\n first_letter = str_sub(author, 1, 1)\n ) |>\n count(.by = first_letter) |>\n kable(\n col.names = c(\"First letter\", \"Number of times\")\n )\n\n\n\nTable 7.1: Distribution of first letter of author names in a collection of books\n\n\n\n\n\n\nFirst letter\nNumber of times\n\n\n\n\nA\n1\n\n\nC\n1\n\n\nD\n1\n\n\nG\n1\n\n\nH\n1\n\n\nI\n1\n\n\nL\n1\n\n\nM\n1\n\n\nP\n3\n\n\nR\n1\n\n\nV\n2\n\n\nW\n4\n\n\nY\n1\n\n\n\n\n\n\n\n\n\n\n7.3.4 Prime Ministers of the United Kingdom\nIn this case study we are interested in how long prime ministers of the United Kingdom lived, based on the year they were born. We will scrape data from Wikipedia using rvest, clean it, and then make a graph. From time to time a website will change. This makes many scrapes largely bespoke, even if we can borrow some code from earlier projects. It is normal to feel frustrated at times. It helps to begin with an end in mind.\nTo that end, we can start by generating some simulated data. Ideally, we want a table that has a row for each prime minister, a column for their name, and a column each for the birth and death years. If they are still alive, then that death year can be empty. 
We know that birth and death years should be somewhere between 1700 and 1990, and that death year should be larger than birth year. Finally, we also know that the years should be integers, and the names should be characters. We want something that looks roughly like this:\n\nset.seed(853)\n\nsimulated_dataset <-\n tibble(\n prime_minister = babynames |>\n filter(prop > 0.01) |>\n distinct(name) |>\n unlist() |>\n sample(size = 10, replace = FALSE),\n birth_year = sample(1700:1990, size = 10, replace = TRUE),\n years_lived = sample(50:100, size = 10, replace = TRUE),\n death_year = birth_year + years_lived\n ) |>\n select(prime_minister, birth_year, death_year, years_lived) |>\n arrange(birth_year)\n\nsimulated_dataset\n\n# A tibble: 10 × 4\n prime_minister birth_year death_year years_lived\n <chr> <int> <int> <int>\n 1 Kevin 1813 1908 95\n 2 Karen 1832 1896 64\n 3 Robert 1839 1899 60\n 4 Bertha 1846 1915 69\n 5 Jennifer 1867 1943 76\n 6 Arthur 1892 1984 92\n 7 Donna 1907 2006 99\n 8 Emma 1957 2031 74\n 9 Ryan 1959 2053 94\n10 Tyler 1990 2062 72\n\n\nOne of the advantages of generating a simulated dataset is that if we are working in groups then one person can start making the graph, using the simulated dataset, while the other person gathers the data. In terms of a graph, we are aiming for something like Figure 7.8.\n\n\n\n\n\n\nFigure 7.8: Sketch of planned graph showing how long United Kingdom prime ministers lived\n\n\n\nWe are starting with a question that is of interest, which is how long each prime minister of the United Kingdom lived. As such, we need to identify a source of data. While there are plenty of data sources that have the births and deaths of each prime minister, we want one that we can trust, and as we are going to be scraping, we want one that has some structure to it. The Wikipedia page about prime ministers of the United Kingdom fits both these criteria. 
As it is a popular page the information is likely to be correct, and the data are available in a table.\nWe load rvest and then download the page using read_html(). Saving it locally provides us with a copy that we need for reproducibility in case the website changes, and means that we do not have to keep visiting the website. But it is not ours, and so this is typically not something that should be publicly redistributed.\n\nraw_data <-\n read_html(\n \"https://en.wikipedia.org/wiki/List_of_prime_ministers_of_the_United_Kingdom\"\n )\nwrite_html(raw_data, \"pms.html\")\n\nAs with the earlier case study, we are looking for patterns in the HTML that we can use to help us get closer to the data that we want. This is an iterative process and involves trial and error. Even simple examples will take time.\nOne tool that may help is the SelectorGadget. This allows us to pick and choose the elements that we want, and then gives us the input for html_element() (Figure 7.9). By default, SelectorGadget uses CSS selectors. 
These are not the only way to specify the location of the information you want, and using an alternative, such as XPath, can be a useful option to consider.\n\n\n\n\n\n\nFigure 7.9: Using the Selector Gadget to identify the tag, as at 12 February 2023\n\n\n\n\nraw_data <- read_html(\"pms.html\")\n\n\nparse_data_selector_gadget <-\n raw_data |>\n html_element(\".wikitable\") |>\n html_table()\n\nhead(parse_data_selector_gadget)\n\n# A tibble: 6 × 11\n Portrait Portrait Prime ministerOffice(L…¹ `Term of office` `Term of office`\n <chr> <chr> <chr> <chr> <chr> \n1 Portrait \"Portrait\" Prime ministerOffice(Li… start end \n2 \"\" Robert Walpole[27]MP fo… 3 April1721 11 February1742 \n3 \"\" Robert Walpole[27]MP fo… 3 April1721 11 February1742 \n4 \"\" Robert Walpole[27]MP fo… 3 April1721 11 February1742 \n5 \"\" Robert Walpole[27]MP fo… 3 April1721 11 February1742 \n6 \"\" Spencer Compton[28]1st … 16 February1742 2 July1743 \n# ℹ abbreviated name: ¹`Prime ministerOffice(Lifespan)`\n# ℹ 6 more variables: `Term of office` <chr>, `Mandate[a]` <chr>,\n# `Ministerial offices held as prime minister` <chr>, Party <chr>,\n# Government <chr>, MonarchReign <chr>\n\n\nIn this case there are many columns that we do not need, and some duplicated rows.\n\nparsed_data <-\n parse_data_selector_gadget |> \n clean_names() |> \n rename(raw_text = prime_minister_office_lifespan) |> \n select(raw_text) |> \n filter(raw_text != \"Prime ministerOffice(Lifespan)\") |> \n distinct() \n\nhead(parsed_data)\n\n# A tibble: 6 × 1\n raw_text \n <chr> \n1 Robert Walpole[27]MP for King's Lynn(1676–1745) \n2 Spencer Compton[28]1st Earl of Wilmington(1673–1743) \n3 Henry Pelham[29]MP for Sussex(1694–1754) \n4 Thomas Pelham-Holles[30]1st Duke of Newcastle(1693–1768)\n5 William Cavendish[31]4th Duke of Devonshire(1720–1764) \n6 Thomas Pelham-Holles[32]1st Duke of Newcastle(1693–1768)\n\n\nNow that we have the parsed data, we need to clean it to match what we wanted. 
We want a names column, as well as columns for birth year and death year. We use separate() to take advantage of the fact that it looks like the names and dates are distinguished by brackets. The argument in str_extract() is a regular expression. It looks for four digits in a row, followed by a dash, followed by four more digits in a row. We use a slightly different regular expression for those prime ministers who are still alive.\n\ninitial_clean <-\n parsed_data |>\n separate(\n raw_text, into = c(\"name\", \"not_name\"), sep = \"\\\\[\", extra = \"merge\",\n ) |> \n mutate(date = str_extract(not_name, \"[[:digit:]]{4}–[[:digit:]]{4}\"),\n born = str_extract(not_name, \"born[[:space:]][[:digit:]]{4}\")\n ) |>\n select(name, date, born)\n \nhead(initial_clean)\n\n# A tibble: 6 × 3\n name date born \n <chr> <chr> <chr>\n1 Robert Walpole 1676–1745 <NA> \n2 Spencer Compton 1673–1743 <NA> \n3 Henry Pelham 1694–1754 <NA> \n4 Thomas Pelham-Holles 1693–1768 <NA> \n5 William Cavendish 1720–1764 <NA> \n6 Thomas Pelham-Holles 1693–1768 <NA> \n\n\nFinally, we need to clean up the columns.\n\ncleaned_data <-\n initial_clean |>\n separate(date, into = c(\"birth\", \"died\"), \n sep = \"–\") |> # PMs who have died have their birth and death years \n # separated by a hyphen, but we need to be careful with the hyphen as it seems \n # to be a slightly odd type of hyphen and we need to copy/paste it.\n mutate(\n born = str_remove_all(born, \"born[[:space:]]\"),\n birth = if_else(!is.na(born), born, birth)\n ) |> # Alive PMs have slightly different format\n select(-born) |>\n rename(born = birth) |> \n mutate(across(c(born, died), as.integer)) |> \n mutate(Age_at_Death = died - born) |> \n distinct() # Some of the PMs had two goes at it.\n\nhead(cleaned_data)\n\n# A tibble: 6 × 4\n name born died Age_at_Death\n <chr> <int> <int> <int>\n1 Robert Walpole 1676 1745 69\n2 Spencer Compton 1673 1743 70\n3 Henry Pelham 1694 1754 60\n4 Thomas Pelham-Holles 1693 1768 75\n5 William Cavendish 
1720 1764 44\n6 John Stuart 1713 1792 79\n\n\nOur dataset looks similar to the one that we said we wanted at the start (Table 7.2).\n\ncleaned_data |>\n head() |>\n kable(\n col.names = c(\"Prime Minister\", \"Birth year\", \"Death year\", \"Age at death\")\n )\n\n\n\nTable 7.2: UK Prime Ministers, by how old they were when they died\n\n\n\n\n\n\nPrime Minister\nBirth year\nDeath year\nAge at death\n\n\n\n\nRobert Walpole\n1676\n1745\n69\n\n\nSpencer Compton\n1673\n1743\n70\n\n\nHenry Pelham\n1694\n1754\n60\n\n\nThomas Pelham-Holles\n1693\n1768\n75\n\n\nWilliam Cavendish\n1720\n1764\n44\n\n\nJohn Stuart\n1713\n1792\n79\n\n\n\n\n\n\n\n\nAt this point we would like to make a graph that illustrates how long each prime minister lived (Figure 7.10). If they are still alive then we would like them to run to the end, but we would like to color them differently.\n\ncleaned_data |>\n mutate(\n still_alive = if_else(is.na(died), \"Yes\", \"No\"),\n died = if_else(is.na(died), as.integer(2023), died)\n ) |>\n mutate(name = as_factor(name)) |>\n ggplot(\n aes(x = born, xend = died, y = name, yend = name, color = still_alive)\n ) +\n geom_segment() +\n labs(\n x = \"Year of birth\", y = \"Prime minister\", color = \"PM is currently alive\"\n ) +\n theme_minimal() +\n scale_color_brewer(palette = \"Set1\") +\n theme(legend.position = \"bottom\")\n\n\n\n\n\n\n\nFigure 7.10: How long each prime minister of the United Kingdom lived\n\n\n\n\n\n\n\n7.3.5 Iteration\nConsidering text as data is exciting and allows us to explore many different research questions. We will draw on it in Chapter 17. Many guides assume that we already have a nicely formatted text dataset, but that is rarely actually the case. In this case study we will download files from a few different pages. While we have already seen two examples of web scraping, those were focused on just one page, whereas we often need many. Here we will focus on this iteration. 
We will use download.file() to do the download, and use purrr to apply this function across multiple sites. You do not need to install or load that package because it is part of the core tidyverse so it is loaded when you load the tidyverse.\nThe Reserve Bank of Australia (RBA) is Australia’s central bank. It has responsibility for setting the cash rate, which is the interest rate used for loans between banks. This interest rate is an especially important one and has a large impact on the other interest rates in the economy. Four times a year (February, May, August, and November) the RBA publishes a statement on monetary policy, and these are available as PDFs. In this example we will download two statements published in 2023.\nFirst we set up a tibble that has the information that we need. We will take advantage of commonalities in the structure of the URLs. We need to specify both a URL and a local file name for each statement.\n\nfirst_bit <- \"https://www.rba.gov.au/publications/smp/2023/\"\nlast_bit <- \"/pdf/overview.pdf\"\n\nstatements_of_interest <-\n tibble(\n address =\n c(\n paste0(first_bit, \"feb\", last_bit),\n paste0(first_bit, \"may\", last_bit)\n ),\n local_save_name = c(\"2023-02.pdf\", \"2023-05.pdf\")\n )\n\n\nstatements_of_interest\n\n# A tibble: 2 × 2\n address local_save_name\n <chr> <chr> \n1 https://www.rba.gov.au/publications/smp/2023/feb/pdf/overview… 2023-02.pdf \n2 https://www.rba.gov.au/publications/smp/2023/may/pdf/overview… 2023-05.pdf \n\n\nWe want to apply the function download.file() to these two statements. 
To do this we write a function that will download the file, let us know that it was downloaded, wait a polite amount of time, and then go get the next file.\n\nvisit_download_and_wait <-\n function(address_to_visit,\n where_to_save_it_locally) {\n download.file(url = address_to_visit,\n destfile = where_to_save_it_locally)\n \n print(paste(\"Done with\", address_to_visit, \"at\", Sys.time()))\n \n Sys.sleep(sample(5:10, 1))\n }\n\nWe now apply that function to our tibble of URLs and save names using the function walk2().\n\nwalk2(\n statements_of_interest$address,\n statements_of_interest$local_save_name,\n ~ visit_download_and_wait(.x, .y)\n)\n\nThe result is that we have downloaded these PDFs and saved them to our computer. An alternative to writing these functions ourselves would be to use heapsofpapers (Alexander and Mahfouz 2021). This includes various helpful options for downloading lists of files, especially PDF, CSV, and txt files. For instance, Collins and Alexander (2022) use this to obtain thousands of PDFs and estimate the extent to which COVID-19 research was reproducible. In the next section we will build on this to discuss getting information from PDFs.",
"crumbs": [
"Acquisition",
- "7 Gather data "
+ "7 APIs, scraping, and parsing "
]
},
{
"objectID": "07-gather.html#pdfs",
"href": "07-gather.html#pdfs",
- "title": "7 Gather data",
+ "title": "7 APIs, scraping, and parsing",
"section": "7.4 PDFs",
"text": "7.4 PDFs\nPDF files were developed in the 1990s by the technology company Adobe. They are useful for documents because they are meant to display in a consistent way independent of the environment that created them or the environment in which they are being viewed. A PDF viewed on an iPhone should look the same as on an Android phone, as on a Linux desktop. One feature of PDFs is that they can include a variety of objects, for instance, text, photos, figures, etc. However, this variety can limit the capacity of PDFs to be used directly as data. The data first needs to be extracted from the PDF.\nIt is often possible to copy and paste the data from the PDF. This is more likely when the PDF only contains text or regular tables. In particular, if the PDF has been created by an application such as Microsoft Word, or another document- or form-creation system, then often the text data can be extracted in this way because they are actually stored as text within the PDF. We begin with that case. But it is not as easy if the text has been stored as an image which is then part of the PDF. This may be the case for PDFs produced through scans or photos of physical documents, and some older document preparation software. We go through that case later.\nIn contrast to an API, a PDF is usually produced for human rather than computer consumption. The nice thing about PDFs is that they are static and constant. And it is great that data are available. But the trade-off is that:\n\nIt is not overly useful to do larger-scale data.\nWe do not know how the PDF was put together so we do not know whether we can trust it.\nWe cannot manipulate the data to get results that we are interested in.\n\nThere are two important aspects to keep in mind when extracting data from a PDF:\n\nBegin with an end in mind. Plan and sketch what we want from a final dataset/graph/paper to limit time wastage.\nStart simple, then iterate. 
The quickest way to make something that needs to be complicated is often to first build a simple version and then add to it. Start with just trying to get one page of the PDF working or even just one line. Then iterate from there.\n\nWe will go through several examples and then go through a case study where we will gather data on United States Total Fertility Rate, by state.\n\n7.4.1 Jane Eyre\nFigure 7.11 is a PDF that consists of just the first sentence from Charlotte Brontë’s novel Jane Eyre taken from Project Gutenberg (Brontë 1847). You can get it here. If we assume that it was saved as “first_example.pdf”, then after installing and loading pdftools we can use pdf_text() to get the text from this one-page PDF into R.\n\n\n\n\n\n\nFigure 7.11: First sentence of Jane Eyre\n\n\n\n\nfirst_example <- pdf_text(\"first_example.pdf\")\n\nfirst_example\n\nclass(first_example)\n\n\n\n[1] \"There was no possibility of taking a walk that day.\\n\"\n\n\n[1] \"character\"\n\n\nWe can see that the PDF has been correctly read in, as a character vector.\nWe will now try a slightly more complicated example that consists of the first few paragraphs of Jane Eyre (Figure 7.12). Now we have the chapter heading as well.\n\n\n\n\n\n\nFigure 7.12: First few paragraphs of Jane Eyre\n\n\n\nWe use the same function as before.\n\nsecond_example <- pdf_text(\"second_example.pdf\")\nclass(second_example)\nsecond_example\n\n\n\n[1] \"character\"\n\n\n[1] \"CHAPTER I\\nThere was no possibility of taking a walk that day. We had been wandering, indeed, in the\\nleafless shrubbery an hour in the morning; but since dinner (Mrs. 
Reed, when there was no\\ncompany, dined early) the cold winter wind had brought with it clouds so sombre, and a rain so\\npenetrating, that further out-door exercise was now out of the question.\\n\\nI was glad of it: I never liked long walks, especially on chilly afternoons: dreadful to me was the\\ncoming home in the raw twilight, with nipped fingers and toes, and a heart saddened by the\\nchidings of Bessie, the nurse, and humbled by the consciousness of my physical inferiority to\\nEliza, John, and Georgiana Reed.\\n\\nThe said Eliza, John, and Georgiana were now clustered round their mama in the drawing-room:\\nshe lay reclined on a sofa by the fireside, and with her darlings about her (for the time neither\\nquarrelling nor crying) looked perfectly happy. Me, she had dispensed from joining the group;\\nsaying, “She regretted to be under the necessity of keeping me at a distance; but that until she\\nheard from Bessie, and could discover by her own observation, that I was endeavouring in good\\nearnest to acquire a more sociable and childlike disposition, a more attractive and sprightly\\nmanner—something lighter, franker, more natural, as it were—she really must exclude me from\\nprivileges intended only for contented, happy, little children.”\\n\\n“What does Bessie say I have done?” I asked.\\n\\n“Jane, I don’t like cavillers or questioners; besides, there is something truly forbidding in a child\\ntaking up her elders in that manner. Be seated somewhere; and until you can speak pleasantly,\\nremain silent.”\\n\\nA breakfast-room adjoined the drawing-room, I slipped in there. It contained a bookcase: I soon\\npossessed myself of a volume, taking care that it should be one stored with pictures. 
I mounted\\ninto the window-seat: gathering up my feet, I sat cross-legged, like a Turk; and, having drawn the\\nred moreen curtain nearly close, I was shrined in double retirement.\\n\\nFolds of scarlet drapery shut in my view to the right hand; to the left were the clear panes of\\nglass, protecting, but not separating me from the drear November day. At intervals, while\\nturning over the leaves of my book, I studied the aspect of that winter afternoon. Afar, it offered\\na pale blank of mist and cloud; near a scene of wet lawn and storm-beat shrub, with ceaseless\\nrain sweeping away wildly before a long and lamentable blast.\\n\"\n\n\nAgain, we have a character vector. The end of each line is signaled by “\\n”, but other than that it looks pretty good. Finally, we consider the first two pages.\n\nthird_example <- pdf_text(\"third_example.pdf\")\nclass(third_example)\nthird_example\n\n\n\n[1] \"character\"\n\n\n[1] \"CHAPTER I\\nThere was no possibility of taking a walk that day. We had been wandering, indeed, in the\\nleafless shrubbery an hour in the morning; but since dinner (Mrs. Reed, when there was no\\ncompany, dined early) the cold winter wind had brought with it clouds so sombre, and a rain so\\npenetrating, that further out-door exercise was now out of the question.\\n\\nI was glad of it: I never liked long walks, especially on chilly afternoons: dreadful to me was the\\ncoming home in the raw twilight, with nipped fingers and toes, and a heart saddened by the\\nchidings of Bessie, the nurse, and humbled by the consciousness of my physical inferiority to\\nEliza, John, and Georgiana Reed.\\n\\nThe said Eliza, John, and Georgiana were now clustered round their mama in the drawing-room:\\nshe lay reclined on a sofa by the fireside, and with her darlings about her (for the time neither\\nquarrelling nor crying) looked perfectly happy. 
Me, she had dispensed from joining the group;\\nsaying, “She regretted to be under the necessity of keeping me at a distance; but that until she\\nheard from Bessie, and could discover by her own observation, that I was endeavouring in good\\nearnest to acquire a more sociable and childlike disposition, a more attractive and sprightly\\nmanner—something lighter, franker, more natural, as it were—she really must exclude me from\\nprivileges intended only for contented, happy, little children.”\\n\\n“What does Bessie say I have done?” I asked.\\n\\n“Jane, I don’t like cavillers or questioners; besides, there is something truly forbidding in a child\\ntaking up her elders in that manner. Be seated somewhere; and until you can speak pleasantly,\\nremain silent.”\\n\\nA breakfast-room adjoined the drawing-room, I slipped in there. It contained a bookcase: I soon\\npossessed myself of a volume, taking care that it should be one stored with pictures. I mounted\\ninto the window-seat: gathering up my feet, I sat cross-legged, like a Turk; and, having drawn the\\nred moreen curtain nearly close, I was shrined in double retirement.\\n\\nFolds of scarlet drapery shut in my view to the right hand; to the left were the clear panes of\\nglass, protecting, but not separating me from the drear November day. At intervals, while\\nturning over the leaves of my book, I studied the aspect of that winter afternoon. Afar, it offered\\na pale blank of mist and cloud; near a scene of wet lawn and storm-beat shrub, with ceaseless\\nrain sweeping away wildly before a long and lamentable blast.\\n\\nI returned to my book—Bewick’s History of British Birds: the letterpress thereof I cared little\\nfor, generally speaking; and yet there were certain introductory pages that, child as I was, I could\\nnot pass quite as a blank. 
They were those which treat of the haunts of sea-fowl; of “the solitary\\nrocks and promontories” by them only inhabited; of the coast of Norway, studded with isles from\\nits southern extremity, the Lindeness, or Naze, to the North Cape—\\n\\n“Where the Northern Ocean, in vast whirls,\\nBoils round the naked, melancholy isles\\n\"\n[2] \"Of farthest Thule; and the Atlantic surge\\nPours in among the stormy Hebrides.”\\n\\nNor could I pass unnoticed the suggestion of the bleak shores of Lapland, Siberia, Spitzbergen,\\nNova Zembla, Iceland, Greenland, with “the vast sweep of the Arctic Zone, and those forlorn\\nregions of dreary space,—that reservoir of frost and snow, where firm fields of ice, the\\naccumulation of centuries of winters, glazed in Alpine heights above heights, surround the pole,\\nand concentre the multiplied rigours of extreme cold.” Of these death-white realms I formed an\\nidea of my own: shadowy, like all the half-comprehended notions that float dim through\\nchildren’s brains, but strangely impressive. 
The words in these introductory pages connected\\nthemselves with the succeeding vignettes, and gave significance to the rock standing up alone in\\na sea of billow and spray; to the broken boat stranded on a desolate coast; to the cold and ghastly\\nmoon glancing through bars of cloud at a wreck just sinking.\\n\\nI cannot tell what sentiment haunted the quite solitary churchyard, with its inscribed headstone;\\nits gate, its two trees, its low horizon, girdled by a broken wall, and its newly-risen crescent,\\nattesting the hour of eventide.\\n\\nThe two ships becalmed on a torpid sea, I believed to be marine phantoms.\\n\\nThe fiend pinning down the thief’s pack behind him, I passed over quickly: it was an object of\\nterror.\\n\\nSo was the black horned thing seated aloof on a rock, surveying a distant crowd surrounding a\\ngallows.\\n\\nEach picture told a story; mysterious often to my undeveloped understanding and imperfect\\nfeelings, yet ever profoundly interesting: as interesting as the tales Bessie sometimes narrated on\\nwinter evenings, when she chanced to be in good humour; and when, having brought her ironing-\\ntable to the nursery hearth, she allowed us to sit about it, and while she got up Mrs. Reed’s lace\\nfrills, and crimped her nightcap borders, fed our eager attention with passages of love and\\nadventure taken from old fairy tales and other ballads; or (as at a later period I discovered) from\\nthe pages of Pamela, and Henry, Earl of Moreland.\\n\\nWith Bewick on my knee, I was then happy: happy at least in my way. I feared nothing but\\ninterruption, and that came too soon. The breakfast-room door opened.\\n\\n“Boh! Madam Mope!” cried the voice of John Reed; then he paused: he found the room\\napparently empty.\\n\\n“Where the dickens is she!” he continued. “Lizzy! Georgy! 
(calling to his sisters) Joan is not\\nhere: tell mama she is run out into the rain—bad animal!”\\n\\n“It is well I drew the curtain,” thought I; and I wished fervently he might not discover my hiding-\\nplace: nor would John Reed have found it out himself; he was not quick either of vision or\\nconception; but Eliza just put her head in at the door, and said at once—\\n\" \n\n\nNotice that the first page is the first element of the character vector, and the second page is the second element. As we are most familiar with rectangular data, we will try to get it into that format as quickly as possible. And then we can use functions from the tidyverse to deal with it.\nFirst we want to convert the character vector into a tibble. At this point we may like to add page numbers as well.\n\njane_eyre <- tibble(\n raw_text = third_example,\n page_number = c(1:2)\n)\n\nWe then want to separate the lines so that each line is an observation. We can do that by looking for “\\n” remembering that we need to escape the backslash as it is a special character.\n\njane_eyre <-\n separate_rows(jane_eyre, raw_text, sep = \"\\\\n\", convert = FALSE)\n\njane_eyre\n\n# A tibble: 93 × 2\n raw_text page_number\n <chr> <int>\n 1 \"CHAPTER I\" 1\n 2 \"There was no possibility of taking a walk that day. 
We had been… 1\n 3 \"leafless shrubbery an hour in the morning; but since dinner (Mr… 1\n 4 \"company, dined early) the cold winter wind had brought with it … 1\n 5 \"penetrating, that further out-door exercise was now out of the … 1\n 6 \"\" 1\n 7 \"I was glad of it: I never liked long walks, especially on chill… 1\n 8 \"coming home in the raw twilight, with nipped fingers and toes, … 1\n 9 \"chidings of Bessie, the nurse, and humbled by the consciousness… 1\n10 \"Eliza, John, and Georgiana Reed.\" 1\n# ℹ 83 more rows\n\n\n\n\n7.4.2 Total Fertility Rate in the United States\nThe United States Department of Health and Human Services Vital Statistics Report provides information about the Total Fertility Rate (TFR) for each state. The average number of births per woman if women experience the current age-specific fertility rates throughout their reproductive years. The data are available in PDFs. We can use the approaches above to get the data into a dataset.\nThe table that we are interested in is on page 40 of a PDF that is available here or here. The column of interest is labelled: “Total fertility rate” (Figure 7.13).\n\n\n\n\n\n\nFigure 7.13: Example Vital Statistics Report, from 2000\n\n\n\nThe first step when getting data out of a PDF is to sketch out what we eventually want. A PDF typically contains a considerable amount of information, and so we should be clear about what is needed. This helps keep you focused, and prevents scope creep, but it is also helpful when thinking about data checks. We literally write down on paper what we have in mind. In this case, what is needed is a table with a column for state, year, and total fertility rate (TFR) (Figure 7.14).\n\n\n\n\n\n\nFigure 7.14: Planned dataset of TFR for each US state\n\n\n\nWe are interested in a particular column in a particular table for this PDF. Unfortunately, there is nothing magical about what is coming. 
This first step requires finding the PDF online, working out the link for each, and searching for the page and column name that is of interest. We have built a CSV with the details that we need and can read that in.\n\nsummary_tfr_dataset <- read_csv(\n paste0(\"https://raw.githubusercontent.com/RohanAlexander/\",\n \"telling_stories/main/inputs/tfr_tables_info.csv\")\n )\n\n\n\n\n\nTable 7.3: Year and associated data for TFR tables\n\n\n\n\n\n\n\n\n\n\n\n\n\nYear\nPage\nTable\nColumn\nURL\n\n\n\n\n2000\n40\n10\nTotal fertility rate\nhttps://www.cdc.gov/nchs/data/nvsr/nvsr50/nvsr50_05.pdf\n\n\n\n\n\n\n\n\nWe first download and save the PDF using download.file().\n\ndownload.file(\n url = summary_tfr_dataset$url[1],\n destfile = \"year_2000.pdf\"\n)\n\nWe then read the PDF in as a character vector using pdf_text() from pdftools. And then convert it to a tibble, so that we can use familiar verbs on it.\n\ndhs_2000 <- pdf_text(\"year_2000.pdf\")\n\n\ndhs_2000_tibble <- tibble(raw_data = dhs_2000)\n\nhead(dhs_2000_tibble)\n\n# A tibble: 6 × 1\n raw_data \n <chr> \n1 \"Volume 50, Number 5 …\n2 \"2 National Vital Statistics Report, Vol. 50, No. 5, February 12, 2002\\n\\n\\…\n3 \" …\n4 \"4 National Vital Statistics Report, Vol. 50, No. 5, February 12, 2002\\n\\n\\…\n5 \" …\n6 \"6 National Vital Statistics Report, Vol. 50, No. 5, February 12, 2002\\n\\n …\n\n\nGrab the page that is of interest (remembering that each page is an element of the character vector, hence a row in the tibble).\n\ndhs_2000_relevant_page <-\n dhs_2000_tibble |>\n slice(summary_tfr_dataset$page[1])\n\nhead(dhs_2000_relevant_page)\n\n# A tibble: 1 × 1\n raw_data \n <chr> \n1 \"40 National Vital Statistics Report, Vol. 50, No. 
5, Revised May 15, 20022\\n…\n\n\nWe want to separate the rows and use separate_rows() from tidyr, which is part of the core tidyverse.\n\ndhs_2000_separate_rows <-\n dhs_2000_relevant_page |>\n separate_rows(raw_data, sep = \"\\\\n\", convert = FALSE)\n\nhead(dhs_2000_separate_rows)\n\n# A tibble: 6 × 1\n raw_data \n <chr> \n1 \"40 National Vital Statistics Report, Vol. 50, No. 5, Revised May 15, 20022\" \n2 \"\" \n3 \"Table 10. Number of births, birth rates, fertility rates, total fertility ra…\n4 \"United States, each State and territory, 2000\" \n5 \"[By place of residence. Birth rates are live births per 1,000 estimated popu…\n6 \"estimated in each area; total fertility rates are sums of birth rates for 5-…\n\n\nWe are searching for patterns that we can use. Let us look at the first ten lines of content (ignoring aspects such as headings and page numbers at the top of the page).\n\ndhs_2000_separate_rows[13:22, ] |>\n mutate(raw_data = str_remove(raw_data, \"\\\\.{40}\"))\n\n# A tibble: 10 × 1\n raw_data \n <chr> \n 1 \" State …\n 2 \" …\n 3 \" …\n 4 \"\" \n 5 \"\" \n 6 \"United States 1 .............. 4,058,814 14.7 67.5 2,1…\n 7 \"\" \n 8 \"Alabama ....................... 63,299 14.4 65.0 2,0…\n 9 \"Alaska ........................... 9,974 16.0 74.6 2,4…\n10 \"Arizona ......................... 85,273 17.5 84.4 2,6…\n\n\nAnd now at just one line.\n\ndhs_2000_separate_rows[20, ] |>\n mutate(raw_data = str_remove(raw_data, \"\\\\.{40}\"))\n\n# A tibble: 1 × 1\n raw_data \n <chr> \n1 Alabama ....................... 63,299 14.4 65.0 2,021…\n\n\nIt does not get much better than this:\n\nWe have dots separating the states from the data.\nWe have a space between each of the columns.\n\nWe can now separate this into columns. 
First, we want to match on when there are at least two dots (remembering that the dot is a special character and so needs to be escaped).\n\ndhs_2000_separate_columns <-\n dhs_2000_separate_rows |>\n separate(\n col = raw_data,\n into = c(\"state\", \"data\"),\n sep = \"\\\\.{2,}\",\n remove = FALSE,\n fill = \"right\"\n )\n\ndhs_2000_separate_columns[18:28, ] |>\n select(state, data)\n\n# A tibble: 11 × 2\n state data \n <chr> <chr> \n 1 \"United States 1 \" \" 4,058,814 14.7 67.5 2,130.0 …\n 2 \"\" <NA> \n 3 \"Alabama \" \" 63,299 14.4 65.0 2,021.0 …\n 4 \"Alaska \" \" 9,974 16.0 74.6 2,437.0 …\n 5 \"Arizona \" \" 85,273 17.5 84.4 2,652.5 …\n 6 \"Arkansas \" \" 37,783 14.7 69.1 2,140.0 …\n 7 \"California \" \" 531,959 15.8 70.7 2,186.0 …\n 8 \"Colorado \" \" 65,438 15.8 73.1 2,356.5 …\n 9 \"Connecticut \" \" 43,026 13.0 61.2 1,931.5 …\n10 \"Delaware \" \" 11,051 14.5 63.5 2,014.0 …\n11 \"District of Columbia \" \" 7,666 14.8 63.0 1,975.…\n\n\nWe then separate the data based on spaces. 
There is an inconsistent number of spaces, so we first squish any example of more than one space into just one with str_squish() from stringr.\n\ndhs_2000_separate_data <-\n dhs_2000_separate_columns |>\n mutate(data = str_squish(data)) |>\n separate(\n col = data,\n into = c(\n \"number_of_births\",\n \"birth_rate\",\n \"fertility_rate\",\n \"TFR\",\n \"teen_births_all\",\n \"teen_births_15_17\",\n \"teen_births_18_19\"\n ),\n sep = \"\\\\s\",\n remove = FALSE\n )\n\ndhs_2000_separate_data[18:28, ] |>\n select(-raw_data, -data)\n\n# A tibble: 11 × 8\n state number_of_births birth_rate fertility_rate TFR teen_births_all\n <chr> <chr> <chr> <chr> <chr> <chr> \n 1 \"United Sta… 4,058,814 14.7 67.5 2,13… 48.5 \n 2 \"\" <NA> <NA> <NA> <NA> <NA> \n 3 \"Alabama \" 63,299 14.4 65.0 2,02… 62.9 \n 4 \"Alaska \" 9,974 16.0 74.6 2,43… 42.4 \n 5 \"Arizona \" 85,273 17.5 84.4 2,65… 69.1 \n 6 \"Arkansas \" 37,783 14.7 69.1 2,14… 68.5 \n 7 \"California… 531,959 15.8 70.7 2,18… 48.5 \n 8 \"Colorado \" 65,438 15.8 73.1 2,35… 49.2 \n 9 \"Connecticu… 43,026 13.0 61.2 1,93… 31.9 \n10 \"Delaware \" 11,051 14.5 63.5 2,01… 51.6 \n11 \"District o… 7,666 14.8 63.0 1,97… 80.7 \n# ℹ 2 more variables: teen_births_15_17 <chr>, teen_births_18_19 <chr>\n\n\nThis is all looking fairly great. The only thing left is to clean up.\n\ndhs_2000_cleaned <-\n dhs_2000_separate_data |>\n select(state, TFR) |>\n slice(18:74) |>\n drop_na() |> \n mutate(\n TFR = str_remove_all(TFR, \",\"),\n TFR = as.numeric(TFR),\n state = str_trim(state),\n state = if_else(state == \"United States 1\", \"Total\", state)\n )\n\nAnd run some checks, for instance that we have all the states.\n\nall(state.name %in% dhs_2000_cleaned$state)\n\n[1] TRUE\n\n\nAnd we are done (Table 7.4). We can see that there is quite a wide distribution of TFR by US state (Figure 7.15). 
Utah has the highest and Vermont the lowest.\n\ndhs_2000_cleaned |>\n slice(1:10) |>\n kable(\n col.names = c(\"State\", \"TFR\"),\n digits = 0,\n format.args = list(big.mark = \",\")\n )\n\n\n\nTable 7.4: First ten rows of a dataset of TFR by United States state, 2000-2019\n\n\n\n\n\n\nState\nTFR\n\n\n\n\nTotal\n2,130\n\n\nAlabama\n2,021\n\n\nAlaska\n2,437\n\n\nArizona\n2,652\n\n\nArkansas\n2,140\n\n\nCalifornia\n2,186\n\n\nColorado\n2,356\n\n\nConnecticut\n1,932\n\n\nDelaware\n2,014\n\n\nDistrict of Columbia\n1,976\n\n\n\n\n\n\n\n\n\ndhs_2000_cleaned |> \n filter(state != \"Total\") |> \n ggplot(aes(x = TFR, y = fct_reorder(state, TFR))) +\n geom_point() +\n theme_classic() +\n labs(y = \"State\", x = \"Total Fertility Rate\")\n\n\n\n\n\n\n\nFigure 7.15: Distribution of TFR by US state in 2000\n\n\n\n\n\nHealy (2022) provides another example of using this approach in a different context.\n\n\n7.4.3 Optical Character Recognition\nAll of the above is predicated on having a PDF that is already “digitized”. But what if it is made of images, such as the result of a scan. Such PDFs often contain unstructured data, meaning that the data are not tagged nor organized in a regular way. Optical Character Recognition (OCR) is a process that transforms an image of text into actual text. Although there may not be much difference to a human reading a PDF before and after OCR, the PDF becomes machine-readable which allows us to use scripts (Cheriet et al. 2007). OCR has been used to parse images of characters since the 1950s, initially using manual approaches. While manual approaches remain the gold standard, for reasons of cost effectiveness, this has been largely replaced with statistical models.\nIn this example we use tesseract to OCR a document. This is a R wrapper around the Tesseract open-source OCR engine. Tesseract was initially developed at HP in the 1980s, and is now mostly developed by Google. 
After we install and load tesseract we can use ocr().\nLet us see an example with a scan from the first page of Jane Eyre (Figure 7.16).\n\n\n\n\n\n\nFigure 7.16: Scan of first page of Jane Eyre\n\n\n\n\ntext <- ocr(\n here(\"jane_scan.png\"),\n engine = tesseract(\"eng\")\n)\ncat(text)\n\n\n\n1 THERE was no possibility of taking a walk that day. We had\nbeen wandering, indeed, in the leafless shrubbery an hour in\nthe morning; but since dinner (Mrs Reed, when there was no com-\npany, dined early) the cold winter wind had brought with it clouds\nso sombre, and a rain so penetrating, that further out-door exercise\n\nwas now out of the question.\n\nI was glad of it: I never liked long walks, especially on chilly\nafternoons: dreadful to me was the coming home in the raw twi-\nlight, with nipped fingers and toes, and a heart saddened by the\nchidings of Bessie, the nurse, and humbled by the consciousness of\nmy physical inferiority to Eliza, John, and Georgiana Reed.\n\nThe said Eliza, John, and Georgiana were now clustered round\ntheir mama in the drawing-room: she lay reclined on a sofa by the\nfireside, and with her darlings about her (for the time neither quar-\nrelling nor crying) looked perfectly happy. Me, she had dispensed\nfrom joining the group; saying, ‘She regretted to be under the\nnecessity of keeping me at a distance; but that until she heard from\nBessie, and could discover by her own observation that I was\nendeavouring in good earnest to acquire a more sociable and\nchild-like disposition, a more attractive and sprightly manner—\nsomething lighter, franker, more natural as it were—she really\nmust exclude me from privileges intended only for contented,\nhappy, littie children.’\n\n‘What does Bessie say I have done?’ I asked.\n\n‘Jane, I don’t like cavillers or questioners: besides, there is\nsomething truly forbidding in a child taking up her elders in that\nmanner. Be seated somewhere; and until you can speak pleasantly,\nremain silent.’\n\n. 
a TV\n\ni; STA AEE LT JEUNE TIS Sis\na) | | | a) ee\ni Ni 4 | | | ae ST | | a eg\n\nce A FEM yi | eS ee\nPe TT (SB ag ie pe\nis \\ ie mu) i i es SS\nveal | Dy eT |\npa || i er itl |\n\naes : Oty ZR UIE OR HMR Sa ote ariel\nSEEN ed — =\n15\n\n\nIn general the result is not too bad. OCR is a useful tool but is not perfect and the resulting data may require extra attention in terms of cleaning. For instance, in the OCR results of Figure 7.16 we see irregularities that would need to be fixed. Various options, such as focusing on the particular data of interest and increasing the contrast can help. Other popular OCR engines include Amazon Textract, Google Vision API, and ABBYY.",
"crumbs": [
"Acquisition",
- "7 Gather data "
+ "7 APIs, scraping, and parsing "
]
},
{
"objectID": "07-gather.html#exercises",
"href": "07-gather.html#exercises",
- "title": "7 Gather data",
+ "title": "7 APIs, scraping, and parsing",
"section": "7.5 Exercises",
- "text": "7.5 Exercises\n\nPractice\n\n(Plan) Consider the following scenario: A group of five undergraduates—Matt, Ash, Jacki, Rol, and Mike—each read some number of pages from a book each day for 100 days. Two of the undergraduates are a couple and so their number of pages is positively correlated, however all the others are independent. Please sketch what a dataset could look like, and then sketch a graph that you could build to show all observations.\n(Simulate) Please further consider the scenario described and simulate the situation (note the relationship between some variables). Then write five tests based on the simulated data.\n(Acquire) Please obtain some actual data about snowfall and add a script updating the simulated tests to these actual data.\n(Explore) Build a graph and table using the real data.\n(Communicate) Please write some text to accompany the graph and table. Separate the code appropriately into R files and a Quarto doc. Submit a link to a high-quality GitHub repo.\n\n\n\nQuiz\n\nWhat is an API, in the context of data gathering (pick one)?\n\nA standardized set of functions to process data locally.\nA markup language for structuring data.\nA protocol for web browsers to render HTML content.\nAn interface provided by a server that allows someone else to request data using code.\n\nWhen using APIs for data gathering, which of the following can be used for authentication (pick one)?\n\nProviding an API key or token in the request.\nUsing cookies stored in the browser.\nDisabling SSL verification.\nModifying the hosts file on the client machine.\n\nConsider the following code, which uses gh to access the GitHub API. 
When was the repo for heapsofpapers created (pick one)?\n\n2021-02-23\n2021-03-06\n2021-05-25\n2021-04-27\n\n\n\n# Based on Tyler Bradley and Monica Alexander\nrepos <- gh(\"/users/RohanAlexander/repos\", per_page = 100)\nrepo_info <- tibble(\n name = map_chr(repos, \"name\"),\n created = map_chr(repos, \"created_at\"),\n full_name = map_chr(repos, \"full_name\"),\n)\n\n\nPlease consider the UN’s Data API and the introductory note on how to use it by Schmertmann (2022). Argentina’s location code is 32. Modify the following code to determine what Argentina’s single-year fertility rate was for 20-year-olds in 1995 (pick one)?\n\n147.679\n172.988\n204.124\n128.665\n\n\n\nmy_indicator <- 68\nmy_location <- 50\nmy_startyr <- 1996\nmy_endyr <- 1999\n\nurl <- paste0(\n \"https://population.un.org/dataportalapi/api/v1\",\n \"/data/indicators/\", my_indicator, \"/locations/\",\n my_location, \"/start/\", my_startyr, \"/end/\",\n my_endyr, \"/?format=csv\"\n)\n\nun_data <- read_delim(file = url, delim = \"|\", skip = 1)\n\nun_data |>\n filter(AgeLabel == 25 & TimeLabel == 1996) |>\n select(Value)\n\n\nWhat is the main argument to GET() from httr (pick one)?\n\n“url”\n“website”\n“domain”\n“location”\n\nIn web scraping, what is the purpose of respecting the robots.txt file (pick one)?\n\nTo ensure the scraped data is accurate.\nTo avoid violating the website’s terms of service by following the site’s crawling guidelines.\nTo speed up the scraping process.\nTo obtain authentication credentials.\n\nWhat features of a website do we typically take advantage of when we parse the code (pick one)?\n\nHTML/CSS mark-up.\nCookies.\nFacebook beacons.\nCode comments.\n\nWhat are some principles to follow when scraping (select all that apply)?\n\nAvoid it if possible\nFollow the site’s guidance\nSlow down\nUse a scalpel not an axe.\n\nWhich of the following is NOT a recommended principle when performing web scraping (pick one)?\n\nAbide by the website’s terms of service.\nReduce the impact 
on the website’s server by slowing down requests.\nScrape all data regardless of necessity.\nAvoid republishing scraped pages.\n\nWhich of the following, used as part of a regular expression, would match a full stop (hint: see the “strings” cheat sheet) (pick one)?\n\n“.”\n“\\.”\n“\\\\\\.”\n\nWhat are three checks that we might like to use for demographic data, such as the number of births in a country in a particular year?\nWhich of these are functions from the purrr package (select all that apply)?\n\nmap()\nwalk()\nrun()\nsafely()\n\nWhat is the HTML tag for an item in a list (pick one)?\n\nli\nbody\nb\nem\n\nWhich function should we use if we have the text “rohan_alexander” in a column called “names” and want to split it into first name and surname based on the underscore (pick one)?\n\nspacing()\nslice()\nseparate()\ntext_to_columns()\n\nWhat is Optical Character Recognition (OCR) (pick one)?\n\nA process of converting handwritten notes into typed text.\nA method for translating images of text into machine-readable text.\nA technique for parsing structured data from APIs.\nA way to optimize code for faster execution.\n\nWhich function in R can be used to pause execution for a specified amount of time, useful for respecting rate limits during web scraping (pick one)?\n\nsleep()\npause()\nsys.sleep()\nwait()\n\nWhich of the following is a challenge when extracting data from PDFs (pick one)?\n\nPDFs cannot be read by any programming language.\nPDFs are designed for consistent human reading, not for data extraction.\nPDFs always contain unstructured data that cannot be processed.\nPDFs are encrypted and cannot be accessed without a password.\n\nWhen performing OCR on a scanned document, what is a common issue that might affect the accuracy of text recognition (pick one)?\n\nThe file size of the image.\nThe programming language used.\nThe quality and resolution of the scanned image.\nThe number of pages in the document.\n\nFrom Cirone and Spirling (2021), which of 
the following is NOT a common threat to inference when working with historical data (pick one)?\n\nSelection bias.\nConfirmation bias.\nTime decay.\nOver-representation of marginalized groups.\n\nFrom Cirone and Spirling (2021), what is the “drunkard’s search” problem in historical political economy (and more generally) (pick one)?\n\nSelecting data that is easiest to access without considering representativeness.\nSearching for data only from elite sources.\nOver-relying on digital archives for research.\nMisinterpreting historical texts due to modern biases.\n\nFrom Cirone and Spirling (2021), what role do DAGs have (pick one)?\n\nThey improve the accuracy of OCR for historical data.\nThey generate machine-readable text from historical sources.\nThey help researchers visualize and address causal relationships.\nThey serve as metadata to organize historical archives.\n\nFrom Johnson (2021), what was the focus of early prison data collection by the U.S. Census Bureau (pick one)?\n\nDocumenting health conditions.\nInvestigating racial differences in sentencing.\nRecording socioeconomic background and employment.\nCounting the number of incarcerated people and their demographics.\n\nFrom Johnson (2021), how does community-sourced prison data differ from state-sourced prison data (pick one)?\n\nCommunity data is collected by government officials.\nCommunity data emphasizes lived experiences and prison conditions.\nState data is less reliable than community data.\nState data is more reliable than community data.\n\nFrom Johnson (2021), which of the following is a limitation of state-sourced data (pick one)?\n\nState-sourced data is less reliable than academic studies.\nIt under-represents the prison population.\nIt may reproduce the biases and assumptions of earlier data collections.\nIt focuses on nonviolent offenders only.\n\nFrom Johnson (2021), what question should be asked when looking at prison data collection (pick one)?\n\n“Who established the data 
infrastructure and why?”.\n“How do the economic factors affect prison management?”.\n“Is the data being used to create public policy?”.\n\n\n\n\nActivity\nPlease redo the web scraping example, but for one of: Australia, Canada, India, or New Zealand.\nPlan, gather, and clean the data, and then use it to create a similar table to the one created above. Write a few paragraphs about your findings. Then write a few paragraphs about the data source, what you gathered, and how you went about it. What took longer than you expected? When did it become fun? What would you do differently next time you do this? Your submission should be at least two pages, but likely more.\nUse Quarto, and include an appropriate title, author, date, link to a GitHub repo, and citations. Submit a PDF.\n\n\n\n\nAlexander, Rohan, and A Mahfouz. 2021. heapsofpapers: Easily Download Heaps of PDF and CSV Files. https://CRAN.R-project.org/package=heapsofpapers.\n\n\nBailey, Rosemary. 2008. Design of Comparative Experiments. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511611483.\n\n\nBor, Jacob, Atheendar Venkataramani, David Williams, and Alexander Tsai. 2018. “Police Killings and Their Spillover Effects on the Mental Health of Black Americans: A Population-Based, Quasi-Experimental Study.” The Lancet 392 (10144): 302–10. https://doi.org/10.1016/s0140-6736(18)31130-9.\n\n\nBrontë, Charlotte. 1847. Jane Eyre. https://www.gutenberg.org/files/1260/1260-h/1260-h.htm.\n\n\nBryan, Jenny, and Hadley Wickham. 2021. gh: GitHub API. https://CRAN.R-project.org/package=gh.\n\n\nCheriet, Mohamed, Nawwaf Kharma, Cheng-Lin Liu, and Ching Suen. 2007. Character Recognition Systems: A Guide for Students and Practitioner. Wiley.\n\n\nCirone, Alexandra, and Arthur Spirling. 2021. “Turning History into Data: Data Collection, Measurement, and Inference in HPE.” Journal of Historical Political Economy 1 (1): 127–54. https://doi.org/10.1561/115.00000005.\n\n\nCollins, Annie, and Rohan Alexander. 
2022. “Reproducibility of COVID-19 Pre-Prints.” Scientometrics 127: 4655–73. https://doi.org/10.1007/s11192-022-04418-2.\n\n\nComer, Benjamin P., and Jason R. Ingram. 2022. “Comparing Fatal Encounters, Mapping Police Violence, and Washington Post Fatal Police Shooting Data from 2015-2019: A Research Note.” Criminal Justice Review, January, 073401682110710. https://doi.org/10.1177/07340168211071014.\n\n\nCrawford, Kate. 2021. Atlas of AI. 1st ed. New Haven: Yale University Press.\n\n\nCummins, Neil. 2022. “The Hidden Wealth of English Dynasties, 1892–2016.” The Economic History Review 75 (3): 667–702. https://doi.org/10.1111/ehr.13120.\n\n\nEisenstein, Michael. 2022. “Need Web Data? Here’s How to Harvest Them.” Nature 607: 200–201. https://doi.org/10.1038/d41586-022-01830-9.\n\n\nFirke, Sam. 2023. janitor: Simple Tools for Examining and Cleaning Dirty Data. https://CRAN.R-project.org/package=janitor.\n\n\nGrolemund, Garrett, and Hadley Wickham. 2011. “Dates and Times Made Easy with lubridate.” Journal of Statistical Software 40 (3): 1–25. https://doi.org/10.18637/jss.v040.i03.\n\n\nHackett, Robert. 2016. “Researchers Caused an Uproar By Publishing Data From 70,000 OkCupid Users.” Fortune, May. https://fortune.com/2016/05/18/okcupid-data-research/.\n\n\nHealy, Kieran. 2022. “Unhappy in Its Own Way,” July. https://kieranhealy.org/blog/archives/2022/07/22/unhappy-in-its-own-way/.\n\n\nJenkins, Jennifer, Steven Rich, Andrew Ba Tran, Paige Moody, Julie Tate, and Ted Mellnik. 2022. “How the Washington Post Examines Police Shootings in the United States.” https://www.washingtonpost.com/investigations/2022/12/05/washington-post-fatal-police-shootings-methodology/.\n\n\nJohnson, Kaneesha. 2021. “Two Regimes of Prison Data Collection.” Harvard Data Science Review 3 (3). https://doi.org/10.1162/99608f92.72825001.\n\n\nKirkegaard, Emil, and Julius Bjerrekær. 2016. “The OKCupid Dataset: A Very Large Public Dataset of Dating Site Users.” Open Differential Psychology, 1–10. 
https://doi.org/10.26775/ODP.2016.11.03.\n\n\nLuscombe, Alex, Kevin Dick, and Kevin Walby. 2021. “Algorithmic Thinking in the Public Interest: Navigating Technical, Legal, and Ethical Hurdles to Web Scraping in the Social Sciences.” Quality & Quantity 56 (3): 1–22. https://doi.org/10.1007/s11135-021-01164-0.\n\n\nLuscombe, Alex, Jamie Duncan, and Kevin Walby. 2022. “Jumpstarting the Justice Disciplines: A Computational-Qualitative Approach to Collecting and Analyzing Text and Image Data in Criminology and Criminal Justice Studies.” Journal of Criminal Justice Education 33 (2): 151–71. https://doi.org/10.1080/10511253.2022.2027477.\n\n\nMüller, Kirill. 2020. here: A Simpler Way to Find Your Files. https://CRAN.R-project.org/package=here.\n\n\nNix, Justin, and M. James Lozada. 2020. “Police Killings of Unarmed Black Americans: A Reassessment of Community Mental Health Spillover Effects,” January. https://doi.org/10.31235/osf.io/ajz2q.\n\n\nOoms, Jeroen. 2014. “The jsonlite Package: A Practical and Consistent Mapping Between JSON Data and R Objects.” arXiv:1403.2805 [Stat.CO]. https://arxiv.org/abs/1403.2805.\n\n\n———. 2022a. pdftools: Text Extraction, Rendering and Converting of PDF Documents. https://CRAN.R-project.org/package=pdftools.\n\n\n———. 2022b. tesseract: Open Source OCR Engine. https://CRAN.R-project.org/package=tesseract.\n\n\nPavlik, Kaylin. 2019. “Understanding + Classifying Genres Using Spotify Audio Features.” https://www.kaylinpavlik.com/classifying-songs-genres/.\n\n\nPerepolkin, Dmytro. 2022. polite: Be Nice on the Web. https://CRAN.R-project.org/package=polite.\n\n\nR Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.\n\n\nSalganik, Matthew, Peter Sheridan Dodds, and Duncan Watts. 2006. “Experimental Study of Inequality and Unpredictability in an Artificial Cultural Market.” Science 311 (5762): 854–56. 
https://doi.org/10.1126/science.1121066.\n\n\nSaulnier, Lucile, Siddharth Karamcheti, Hugo Laurençon, Léo Tronchon, Thomas Wang, Victor Sanh, Amanpreet Singh, et al. 2022. “Putting Ethical Principles at the Core of the Research Lifecycle.” https://huggingface.co/blog/ethical-charter-multimodal.\n\n\nSchmertmann, Carl. 2022. “UN API Test,” July. https://bonecave.schmert.net/un-api-example.html.\n\n\nTaflaga, Marija, and Matthew Kerby. 2019. “Who Does What Work in a Ministerial Office: Politically Appointed Staff and the Descriptive Representation of Women in Australian Political Offices, 19792010.” Political Studies 68 (2): 463–85. https://doi.org/10.1177/0032321719853459.\n\n\nThe Economist. 2022. “What Spotify Data Show about the Decline of English,” January. https://www.economist.com/interactives/graphic-detail/2022/01/29/what-spotify-data-show-about-the-decline-of-english.\n\n\nThe Washington Post. 2023. “Fatal Force Database.” https://github.com/washingtonpost/data-police-shootings.\n\n\nThompson, Charlie, Daniel Antal, Josiah Parry, Donal Phipps, and Tom Wolff. 2022. spotifyr: R Wrapper for the “Spotify” Web API. https://CRAN.R-project.org/package=spotifyr.\n\n\nThomson-DeVeaux, Amelia, Laura Bronner, and Damini Sharma. 2021. “Cities Spend Millions On Police Misconduct Every Year. Here’s Why It’s So Difficult to Hold Departments Accountable.” FiveThirtyEight, February. https://fivethirtyeight.com/features/police-misconduct-costs-cities-millions-every-year-but-thats-where-the-accountability-ends/.\n\n\nWickham, Hadley. 2021. babynames: US Baby Names 1880-2017. https://CRAN.R-project.org/package=babynames.\n\n\n———. 2022. rvest: Easily Harvest (Scrape) Web Pages. https://CRAN.R-project.org/package=rvest.\n\n\n———. 2023. httr: Tools for Working with URLs and HTTP. https://CRAN.R-project.org/package=httr.\n\n\nWickham, Hadley, Mara Averick, Jenny Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. 
“Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.\n\n\nWickham, Hadley, Jennifer Bryan, and Malcolm Barrett. 2022. usethis: Automate Package and Project Setup. https://CRAN.R-project.org/package=usethis.\n\n\nWickham, Hadley, and Lionel Henry. 2022. purrr: Functional Programming Tools. https://CRAN.R-project.org/package=purrr.\n\n\nWickham, Hadley, Jim Hester, and Jeroen Ooms. 2021. xml2: Parse XML. https://CRAN.R-project.org/package=xml2.\n\n\nWong, Julia Carrie. 2020. “One Year Inside Trump’s Monumental Facebook Campaign.” The Guardian, January. https://www.theguardian.com/us-news/2020/jan/28/donald-trump-facebook-ad-campaign-2020-election.\n\n\nXie, Yihui. 2023. knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.\n\n\nZimmer, Michael. 2018. “Addressing Conceptual Gaps in Big Data Research Ethics: An Application of Contextual Integrity.” Social Media + Society 4 (2): 1–11. https://doi.org/10.1177/2056305118768300.",
+ "text": "7.5 Exercises\n\nPractice\n\n(Plan) Consider the following scenario: A group of five undergraduates—Matt, Ash, Jacki, Rol, and Mike—each read some number of pages from a book each day for 100 days. Two of the undergraduates are a couple and so their number of pages is positively correlated, however all the others are independent. Please sketch what a dataset could look like, and then sketch a graph that you could build to show all observations.\n(Simulate) Please further consider the scenario described and simulate the situation (note the relationship between some variables). Then write five tests based on the simulated data.\n(Acquire) Please obtain some actual data, similar to the scenario, and add a script updating the simulated tests to these actual data.\n(Explore) Build a graph and table using the real data.\n(Communicate) Please write some text to accompany the graph and table. Separate the code appropriately into R files and a Quarto doc. Submit a link to a high-quality GitHub repo.\n\n\n\nQuiz\n\nWhat is an API, in the context of data gathering (pick one)?\n\nA standardized set of functions to process data locally.\nA markup language for structuring data.\nA protocol for web browsers to render HTML content.\nAn interface provided by a server that allows someone else to request data using code.\n\nWhen using APIs for data gathering, which of the following can be used for authentication (pick one)?\n\nProviding an API key or token in the request.\nUsing cookies stored in the browser.\nDisabling SSL verification.\nModifying the hosts file on the client machine.\n\nConsider the following code, which uses gh to access the GitHub API. 
When was the repo for heapsofpapers created (pick one)?\n\n2021-02-23\n2021-03-06\n2021-05-25\n2021-04-27\n\n\n\n# Based on Tyler Bradley and Monica Alexander\nrepos <- gh(\"/users/RohanAlexander/repos\", per_page = 100)\nrepo_info <- tibble(\n name = map_chr(repos, \"name\"),\n created = map_chr(repos, \"created_at\"),\n full_name = map_chr(repos, \"full_name\"),\n)\n\n\nPlease consider the UN’s Data API and the introductory note on how to use it by Schmertmann (2022). Argentina’s location code is 32. Modify the following code to determine what Argentina’s single-year fertility rate was for 20-year-olds in 1995 (pick one)?\n\n147.679\n172.988\n204.124\n128.665\n\n\n\nmy_indicator <- 68\nmy_location <- 50\nmy_startyr <- 1996\nmy_endyr <- 1999\n\nurl <- paste0(\n \"https://population.un.org/dataportalapi/api/v1\",\n \"/data/indicators/\", my_indicator, \"/locations/\",\n my_location, \"/start/\", my_startyr, \"/end/\",\n my_endyr, \"/?format=csv\"\n)\n\nun_data <- read_delim(file = url, delim = \"|\", skip = 1)\n\nun_data |>\n filter(AgeLabel == 25 & TimeLabel == 1996) |>\n select(Value)\n\n\nWhat is the main argument to GET() from httr (pick one)?\n\n“url”\n“website”\n“domain”\n“location”\n\nIn web scraping, what is the purpose of respecting the robots.txt file (pick one)?\n\nTo ensure the scraped data is accurate.\nTo avoid violating the website’s terms of service by following the site’s crawling guidelines.\nTo speed up the scraping process.\nTo obtain authentication credentials.\n\nWhat features of a website do we typically take advantage of when we parse the code (pick one)?\n\nHTML/CSS mark-up.\nCookies.\nFacebook beacons.\nCode comments.\n\nWhat are some principles to follow when scraping (select all that apply)?\n\nAvoid it if possible\nFollow the site’s guidance\nSlow down\nUse a scalpel not an axe.\n\nWhich of the following is NOT a recommended principle when performing web scraping (pick one)?\n\nAbide by the website’s terms of service.\nReduce the impact 
on the website’s server by slowing down requests.\nScrape all data regardless of necessity.\nAvoid republishing scraped pages.\n\nWhich of the following, used as part of a regular expression, would match a full stop (hint: see the “strings” cheat sheet) (pick one)?\n\n“.”\n“\\.”\n“\\\\\\.”\n\nWhat are three checks that we might like to use for demographic data, such as the number of births in a country in a particular year?\nWhich of these are functions from the purrr package (select all that apply)?\n\nmap()\nwalk()\nrun()\nsafely()\n\nWhat is the HTML tag for an item in a list (pick one)?\n\nli\nbody\nb\nem\n\nWhich function should we use if we have the text “rohan_alexander” in a column called “names” and want to split it into first name and surname based on the underscore (pick one)?\n\nspacing()\nslice()\nseparate()\ntext_to_columns()\n\nWhat is Optical Character Recognition (OCR) (pick one)?\n\nA process of converting handwritten notes into typed text.\nA method for translating images of text into machine-readable text.\nA technique for parsing structured data from APIs.\nA way to optimize code for faster execution.\n\nWhich function in R can be used to pause execution for a specified amount of time, useful for respecting rate limits during web scraping (pick one)?\n\nsleep()\npause()\nSys.sleep()\nwait()\n\nWhich of the following is a challenge when extracting data from PDFs (pick one)?\n\nPDFs cannot be read by any programming language.\nPDFs are designed for consistent human reading, not for data extraction.\nPDFs always contain unstructured data that cannot be processed.\nPDFs are encrypted and cannot be accessed without a password.\n\nWhen performing OCR on a scanned document, what is a common issue that might affect the accuracy of text recognition (pick one)?\n\nThe file size of the image.\nThe programming language used.\nThe quality and resolution of the scanned image.\nThe number of pages in the document.\n\nFrom Cirone and Spirling (2021), which of 
the following is NOT a common threat to inference when working with historical data (pick one)?\n\nSelection bias.\nConfirmation bias.\nTime decay.\nOver-representation of marginalized groups.\n\nFrom Cirone and Spirling (2021), what is the “drunkard’s search” problem in historical political economy (and more generally) (pick one)?\n\nSelecting data that is easiest to access without considering representativeness.\nSearching for data only from elite sources.\nOver-relying on digital archives for research.\nMisinterpreting historical texts due to modern biases.\n\nFrom Cirone and Spirling (2021), what role do DAGs have (pick one)?\n\nThey improve the accuracy of OCR for historical data.\nThey generate machine-readable text from historical sources.\nThey help researchers visualize and address causal relationships.\nThey serve as metadata to organize historical archives.\n\nFrom Johnson (2021), what was the focus of early prison data collection by the U.S. Census Bureau (pick one)?\n\nDocumenting health conditions.\nInvestigating racial differences in sentencing.\nRecording socioeconomic background and employment.\nCounting the number of incarcerated people and their demographics.\n\nFrom Johnson (2021), how does community-sourced prison data differ from state-sourced prison data (pick one)?\n\nCommunity data is collected by government officials.\nCommunity data emphasizes lived experiences and prison conditions.\nState data is less reliable than community data.\nState data is more reliable than community data.\n\nFrom Johnson (2021), which of the following is a limitation of state-sourced data (pick one)?\n\nState-sourced data is less reliable than academic studies.\nIt under-represents the prison population.\nIt may reproduce the biases and assumptions of earlier data collections.\nIt focuses on nonviolent offenders only.\n\nFrom Johnson (2021), what question should be asked when looking at prison data collection (pick one)?\n\n“Who established the data 
infrastructure and why?”.\n“How do the economic factors affect prison management?”.\n“Is the data being used to create public policy?”.\n\n\n\n\nActivity\nPlease redo the web scraping example, but for one of: Australia, Canada, India, or New Zealand.\nPlan, gather, and clean the data, and then use it to create a similar table to the one created above. Write a few paragraphs about your findings. Then write a few paragraphs about the data source, what you gathered, and how you went about it. What took longer than you expected? When did it become fun? What would you do differently next time you do this? Your submission should be at least two pages, but likely more.\nUse Quarto, and include an appropriate title, author, date, link to a GitHub repo, and citations. Submit a PDF.\n\n\n\n\nAlexander, Rohan, and A Mahfouz. 2021. heapsofpapers: Easily Download Heaps of PDF and CSV Files. https://CRAN.R-project.org/package=heapsofpapers.\n\n\nBailey, Rosemary. 2008. Design of Comparative Experiments. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511611483.\n\n\nBor, Jacob, Atheendar Venkataramani, David Williams, and Alexander Tsai. 2018. “Police Killings and Their Spillover Effects on the Mental Health of Black Americans: A Population-Based, Quasi-Experimental Study.” The Lancet 392 (10144): 302–10. https://doi.org/10.1016/s0140-6736(18)31130-9.\n\n\nBrontë, Charlotte. 1847. Jane Eyre. https://www.gutenberg.org/files/1260/1260-h/1260-h.htm.\n\n\nBryan, Jenny, and Hadley Wickham. 2021. gh: GitHub API. https://CRAN.R-project.org/package=gh.\n\n\nCheriet, Mohamed, Nawwaf Kharma, Cheng-Lin Liu, and Ching Suen. 2007. Character Recognition Systems: A Guide for Students and Practitioner. Wiley.\n\n\nCirone, Alexandra, and Arthur Spirling. 2021. “Turning History into Data: Data Collection, Measurement, and Inference in HPE.” Journal of Historical Political Economy 1 (1): 127–54. https://doi.org/10.1561/115.00000005.\n\n\nCollins, Annie, and Rohan Alexander. 
2022. “Reproducibility of COVID-19 Pre-Prints.” Scientometrics 127: 4655–73. https://doi.org/10.1007/s11192-022-04418-2.\n\n\nComer, Benjamin P., and Jason R. Ingram. 2022. “Comparing Fatal Encounters, Mapping Police Violence, and Washington Post Fatal Police Shooting Data from 2015-2019: A Research Note.” Criminal Justice Review, January, 073401682110710. https://doi.org/10.1177/07340168211071014.\n\n\nCrawford, Kate. 2021. Atlas of AI. 1st ed. New Haven: Yale University Press.\n\n\nCummins, Neil. 2022. “The Hidden Wealth of English Dynasties, 1892–2016.” The Economic History Review 75 (3): 667–702. https://doi.org/10.1111/ehr.13120.\n\n\nEisenstein, Michael. 2022. “Need Web Data? Here’s How to Harvest Them.” Nature 607: 200–201. https://doi.org/10.1038/d41586-022-01830-9.\n\n\nFirke, Sam. 2023. janitor: Simple Tools for Examining and Cleaning Dirty Data. https://CRAN.R-project.org/package=janitor.\n\n\nGrolemund, Garrett, and Hadley Wickham. 2011. “Dates and Times Made Easy with lubridate.” Journal of Statistical Software 40 (3): 1–25. https://doi.org/10.18637/jss.v040.i03.\n\n\nHackett, Robert. 2016. “Researchers Caused an Uproar By Publishing Data From 70,000 OkCupid Users.” Fortune, May. https://fortune.com/2016/05/18/okcupid-data-research/.\n\n\nHealy, Kieran. 2022. “Unhappy in Its Own Way,” July. https://kieranhealy.org/blog/archives/2022/07/22/unhappy-in-its-own-way/.\n\n\nJenkins, Jennifer, Steven Rich, Andrew Ba Tran, Paige Moody, Julie Tate, and Ted Mellnik. 2022. “How the Washington Post Examines Police Shootings in the United States.” https://www.washingtonpost.com/investigations/2022/12/05/washington-post-fatal-police-shootings-methodology/.\n\n\nJohnson, Kaneesha. 2021. “Two Regimes of Prison Data Collection.” Harvard Data Science Review 3 (3). https://doi.org/10.1162/99608f92.72825001.\n\n\nKirkegaard, Emil, and Julius Bjerrekær. 2016. “The OKCupid Dataset: A Very Large Public Dataset of Dating Site Users.” Open Differential Psychology, 1–10. 
https://doi.org/10.26775/ODP.2016.11.03.\n\n\nLuscombe, Alex, Kevin Dick, and Kevin Walby. 2021. “Algorithmic Thinking in the Public Interest: Navigating Technical, Legal, and Ethical Hurdles to Web Scraping in the Social Sciences.” Quality & Quantity 56 (3): 1–22. https://doi.org/10.1007/s11135-021-01164-0.\n\n\nLuscombe, Alex, Jamie Duncan, and Kevin Walby. 2022. “Jumpstarting the Justice Disciplines: A Computational-Qualitative Approach to Collecting and Analyzing Text and Image Data in Criminology and Criminal Justice Studies.” Journal of Criminal Justice Education 33 (2): 151–71. https://doi.org/10.1080/10511253.2022.2027477.\n\n\nMüller, Kirill. 2020. here: A Simpler Way to Find Your Files. https://CRAN.R-project.org/package=here.\n\n\nNix, Justin, and M. James Lozada. 2020. “Police Killings of Unarmed Black Americans: A Reassessment of Community Mental Health Spillover Effects,” January. https://doi.org/10.31235/osf.io/ajz2q.\n\n\nOoms, Jeroen. 2014. “The jsonlite Package: A Practical and Consistent Mapping Between JSON Data and R Objects.” arXiv:1403.2805 [Stat.CO]. https://arxiv.org/abs/1403.2805.\n\n\n———. 2022a. pdftools: Text Extraction, Rendering and Converting of PDF Documents. https://CRAN.R-project.org/package=pdftools.\n\n\n———. 2022b. tesseract: Open Source OCR Engine. https://CRAN.R-project.org/package=tesseract.\n\n\nPavlik, Kaylin. 2019. “Understanding + Classifying Genres Using Spotify Audio Features.” https://www.kaylinpavlik.com/classifying-songs-genres/.\n\n\nPerepolkin, Dmytro. 2022. polite: Be Nice on the Web. https://CRAN.R-project.org/package=polite.\n\n\nR Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.\n\n\nSalganik, Matthew, Peter Sheridan Dodds, and Duncan Watts. 2006. “Experimental Study of Inequality and Unpredictability in an Artificial Cultural Market.” Science 311 (5762): 854–56. 
https://doi.org/10.1126/science.1121066.\n\n\nSaulnier, Lucile, Siddharth Karamcheti, Hugo Laurençon, Léo Tronchon, Thomas Wang, Victor Sanh, Amanpreet Singh, et al. 2022. “Putting Ethical Principles at the Core of the Research Lifecycle.” https://huggingface.co/blog/ethical-charter-multimodal.\n\n\nSchmertmann, Carl. 2022. “UN API Test,” July. https://bonecave.schmert.net/un-api-example.html.\n\n\nTaflaga, Marija, and Matthew Kerby. 2019. “Who Does What Work in a Ministerial Office: Politically Appointed Staff and the Descriptive Representation of Women in Australian Political Offices, 1979–2010.” Political Studies 68 (2): 463–85. https://doi.org/10.1177/0032321719853459.\n\n\nThe Economist. 2022. “What Spotify Data Show about the Decline of English,” January. https://www.economist.com/interactives/graphic-detail/2022/01/29/what-spotify-data-show-about-the-decline-of-english.\n\n\nThe Washington Post. 2023. “Fatal Force Database.” https://github.com/washingtonpost/data-police-shootings.\n\n\nThompson, Charlie, Daniel Antal, Josiah Parry, Donal Phipps, and Tom Wolff. 2022. spotifyr: R Wrapper for the “Spotify” Web API. https://CRAN.R-project.org/package=spotifyr.\n\n\nThomson-DeVeaux, Amelia, Laura Bronner, and Damini Sharma. 2021. “Cities Spend Millions On Police Misconduct Every Year. Here’s Why It’s So Difficult to Hold Departments Accountable.” FiveThirtyEight, February. https://fivethirtyeight.com/features/police-misconduct-costs-cities-millions-every-year-but-thats-where-the-accountability-ends/.\n\n\nWickham, Hadley. 2021. babynames: US Baby Names 1880-2017. https://CRAN.R-project.org/package=babynames.\n\n\n———. 2022. rvest: Easily Harvest (Scrape) Web Pages. https://CRAN.R-project.org/package=rvest.\n\n\n———. 2023. httr: Tools for Working with URLs and HTTP. https://CRAN.R-project.org/package=httr.\n\n\nWickham, Hadley, Mara Averick, Jenny Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. 
“Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.\n\n\nWickham, Hadley, Jennifer Bryan, and Malcolm Barrett. 2022. usethis: Automate Package and Project Setup. https://CRAN.R-project.org/package=usethis.\n\n\nWickham, Hadley, and Lionel Henry. 2022. purrr: Functional Programming Tools. https://CRAN.R-project.org/package=purrr.\n\n\nWickham, Hadley, Jim Hester, and Jeroen Ooms. 2021. xml2: Parse XML. https://CRAN.R-project.org/package=xml2.\n\n\nWong, Julia Carrie. 2020. “One Year Inside Trump’s Monumental Facebook Campaign.” The Guardian, January. https://www.theguardian.com/us-news/2020/jan/28/donald-trump-facebook-ad-campaign-2020-election.\n\n\nXie, Yihui. 2023. knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.\n\n\nZimmer, Michael. 2018. “Addressing Conceptual Gaps in Big Data Research Ethics: An Application of Contextual Integrity.” Social Media + Society 4 (2): 1–11. https://doi.org/10.1177/2056305118768300.",
"crumbs": [
"Acquisition",
- "7 Gather data "
+ "7 APIs, scraping, and parsing "
]
},
{
"objectID": "08-hunt.html",
"href": "08-hunt.html",
- "title": "8 Hunt data",
+ "title": "8 Experiments and surveys",
"section": "",
- "text": "8.1 Introduction\nThis chapter is about obtaining data with experiments. This is a situation in which we can explicitly control and vary what we are interested in. The advantage of this is that identifying and estimating an effect should be clear. There is a treatment group that is subject to what we are interested in, and a control group that is not. These are randomly split before treatment. And so, if they end up different, then it must be because of the treatment. Unfortunately, life is rarely so smooth. Arguing about how similar the treatment and control groups were tends to carry on indefinitely. And before we can estimate an effect, we need to be able to measure whatever it is that we are interested in, which is often surprisingly difficult.\nBy way of motivation, consider the situation of someone who moved to San Francisco in 2014—as soon as they moved the Giants won the World Series and the Golden State Warriors began a historic streak of World Championships. They then moved to Chicago, and immediately the Cubs won the World Series for the first time in 100 years. They then moved to Massachusetts, and the Patriots won the Super Bowl again, and again, and again. And finally, they moved to Toronto, where the Raptors immediately won the World Championship. Should a city pay them to move, or could municipal funds be better spent elsewhere?\nOne way to get at the answer would be to run an experiment. Make a list of the North American cities with major sports teams. Then roll some dice, send them to live there for a year, and measure the outcomes of the sports teams. With enough lifetimes, we could work it out. This would take a long time because we cannot both live in a city and not live in a city. This is the fundamental problem of causal inference: a person cannot be both treated and untreated. 
Experiments and randomized controlled trials are circumstances in which we try to randomly allocate some treatment, to have a belief that everything else was the same (or at least ignorable). We use the Neyman-Rubin potential outcomes framework to formalize the situation (Holland 1986).\nA treatment, \\(t\\), will often be a binary variable, that is either 0 or 1. It is 0 if the person, \\(i\\), is not treated, which is to say they are in the control group, and 1 if they are treated. We will typically have some outcome, \\(Y_i\\), of interest for that person which could be binary, categorical, multinomial, ordinal, continuous, or possibly even some other type of variable. For instance, it could be vote choice, in which case we could measure whether the person is: “Conservative” or “Not Conservative”; which party they support, say: “Conservative”, “Liberal”, “Democratic”, “Green”; or maybe a probability of supporting some particular leader.\nThe effect of a treatment is then causal if \\((Y_i|t=0) \\neq (Y_i|t=1)\\). That is to say, the outcome for person \\(i\\), given they were not treated, is different to their outcome given they were treated. If we could both treat and control the one individual at the one time, then we would know that it was only the treatment that had caused any change in outcome. There could be no other factor to explain it. But the fundamental problem of causal inference remains: we cannot both treat and control the one individual at the one time. So, when we want to know the effect of the treatment, we need to compare it with a counterfactual. The counterfactual, introduced in Chapter 4, is what would have happened if the treated individual were not treated. As it turns out, this means one way to think of causal inference is as a missing data problem, where we are missing the counterfactual.\nWe cannot compare treatment and control in one individual. So we instead compare the average of two groups—those treated and those not. 
We are looking to estimate the counterfactual at a group level because of the impossibility of doing it at an individual level. Making this trade-off allows us to move forward but comes at the cost of certainty. We must instead rely on randomization, probabilities, and expectations.\nWe usually consider a default of there being no effect and we look for evidence that would cause us to change our mind. As we are interested in what is happening in groups, we turn to expectations and notions of probability to express ourselves. Hence, we will make claims that apply on average. Maybe wearing fun socks really does make you have a lucky day, but on average, across the group, it is probably not the case. It is worth pointing out that we do not just have to be interested in the average effect. We may consider the median, or variance, or whatever. Nonetheless, if we were interested in the average effect, then one way to proceed would be to:\nThis is an estimator, introduced in Chapter 4, which is a way of putting together a guess of something of interest. The estimand is the thing of interest, in this case the average effect, and the estimate is whatever our guess turns out to be. 
We can simulate data to illustrate the situation.\nset.seed(853)\n\ntreat_control <-\n tibble(\n group = sample(x = c(\"Treatment\", \"Control\"), size = 100, replace = TRUE),\n binary_effect = sample(x = c(0, 1), size = 100, replace = TRUE)\n )\n\ntreat_control\n\n# A tibble: 100 × 2\n group binary_effect\n <chr> <dbl>\n 1 Treatment 0\n 2 Control 1\n 3 Control 1\n 4 Treatment 1\n 5 Treatment 1\n 6 Treatment 0\n 7 Treatment 1\n 8 Treatment 1\n 9 Control 0\n10 Control 0\n# ℹ 90 more rows\ntreat_control |>\n summarise(\n treat_result = sum(binary_effect) / length(binary_effect),\n .by = group\n )\n\n# A tibble: 2 × 2\n group treat_result\n <chr> <dbl>\n1 Treatment 0.552\n2 Control 0.333\nIn this case, we draw either 0 or 1, 100 times, for each the treatment and control group, and then the estimate of the average effect of being treated is 0.22.\nMore broadly, to tell causal stories we need to bring together theory and a detailed knowledge of what we are interested in (Cunningham 2021, 4). In Chapter 7 we discussed gathering data that we observed about the world. In this chapter we are going to be more active about turning the world into the data that we need. As the researcher, we will decide what to measure and how, and we will need to define what we are interested in. We will be active participants in the data-generating process. That is, if we want to use this data, then as researchers we must go out and hunt it.\nIn this chapter we cover experiments, especially constructing treatment and control groups, and appropriately considering their results. We go through implementing a survey. We discuss some aspects of ethical behavior in experiments through reference to the Tuskegee Syphilis Study and the Extracorporeal Membrane Oxygenation (ECMO) experiment and go through various case studies. 
Finally, we then turn to A/B testing, which is extensively used in industry, and consider a case study based on Upworthy data.\nRonald Fisher, the twentieth century statistician, and Francis Galton, the nineteenth century statistician, are the intellectual grandfathers of much of the work that we cover in this chapter. In some cases it is directly their work, in other cases it is work that built on their contributions. Both men believed in eugenics, amongst other things that are generally reprehensible. In the same way that art history acknowledges, say, Caravaggio as a murderer, while also considering his work and influence, so too must statistics and data science more generally concern themselves with this past, at the same time as we try to build a better future.",
+ "text": "8.1 Introduction\nThis chapter is about obtaining data with experiments and surveys. An experiment is a situation in which we can explicitly control and vary what we are interested in. The advantage of this is that identifying and estimating an effect should be clear. There is a treatment group that is subject to what we are interested in, and a control group that is not. These are randomly split before treatment. And so, if they end up different, then it must be because of the treatment. Unfortunately, life is rarely so smooth. Arguing about how similar the treatment and control groups were tends to carry on indefinitely. And before we can estimate an effect, we need to be able to measure whatever it is that we are interested in, which is often surprisingly difficult.\nBy way of motivation, consider the situation of someone who moved to San Francisco in 2014—as soon as they moved the Giants won the World Series and the Golden State Warriors began a historic streak of World Championships. They then moved to Chicago, and immediately the Cubs won the World Series for the first time in 100 years. They then moved to Massachusetts, and the Patriots won the Super Bowl again, and again, and again. And finally, they moved to Toronto, where the Raptors immediately won the World Championship. Should a city pay them to move, or could municipal funds be better spent elsewhere?\nOne way to get at the answer would be to run an experiment. Make a list of the North American cities with major sports teams. Then roll some dice, send them to live there for a year, and measure the outcomes of the sports teams. With enough lifetimes, we could work it out. This would take a long time because we cannot both live in a city and not live in a city. This is the fundamental problem of causal inference: a person cannot be both treated and untreated. 
Experiments and randomized controlled trials are circumstances in which we try to randomly allocate some treatment, to have a belief that everything else was the same (or at least ignorable). We use the Neyman-Rubin potential outcomes framework to formalize the situation (Holland 1986).\nA treatment, \\(t\\), will often be a binary variable, that is either 0 or 1. It is 0 if the person, \\(i\\), is not treated, which is to say they are in the control group, and 1 if they are treated. We will typically have some outcome, \\(Y_i\\), of interest for that person which could be binary, categorical, multinomial, ordinal, continuous, or possibly even some other type of variable. For instance, it could be vote choice, in which case we could measure whether the person is: “Conservative” or “Not Conservative”; which party they support, say: “Conservative”, “Liberal”, “Democratic”, “Green”; or maybe a probability of supporting some particular leader.\nThe effect of a treatment is then causal if \\((Y_i|t=0) \\neq (Y_i|t=1)\\). That is to say, the outcome for person \\(i\\), given they were not treated, is different to their outcome given they were treated. If we could both treat and control the one individual at the one time, then we would know that it was only the treatment that had caused any change in outcome. There could be no other factor to explain it. But the fundamental problem of causal inference remains: we cannot both treat and control the one individual at the one time. So, when we want to know the effect of the treatment, we need to compare it with a counterfactual. The counterfactual, introduced in Chapter 4, is what would have happened if the treated individual were not treated. As it turns out, this means one way to think of causal inference is as a missing data problem, where we are missing the counterfactual.\nWe cannot compare treatment and control in one individual. So we instead compare the average of two groups—those treated and those not. 
We are looking to estimate the counterfactual at a group level because of the impossibility of doing it at an individual level. Making this trade-off allows us to move forward but comes at the cost of certainty. We must instead rely on randomization, probabilities, and expectations.\nWe usually consider a default of there being no effect and we look for evidence that would cause us to change our mind. As we are interested in what is happening in groups, we turn to expectations and notions of probability to express ourselves. Hence, we will make claims that apply on average. Maybe wearing fun socks really does make you have a lucky day, but on average, across the group, it is probably not the case. It is worth pointing out that we do not just have to be interested in the average effect. We may consider the median, or variance, or whatever. Nonetheless, if we were interested in the average effect, then one way to proceed would be to:\nThis is an estimator, introduced in Chapter 4, which is a way of putting together a guess of something of interest. The estimand is the thing of interest, in this case the average effect, and the estimate is whatever our guess turns out to be. 
We can simulate data to illustrate the situation.\nset.seed(853)\n\ntreat_control <-\n tibble(\n group = sample(x = c(\"Treatment\", \"Control\"), size = 100, replace = TRUE),\n binary_effect = sample(x = c(0, 1), size = 100, replace = TRUE)\n )\n\ntreat_control\n\n# A tibble: 100 × 2\n group binary_effect\n <chr> <dbl>\n 1 Treatment 0\n 2 Control 1\n 3 Control 1\n 4 Treatment 1\n 5 Treatment 1\n 6 Treatment 0\n 7 Treatment 1\n 8 Treatment 1\n 9 Control 0\n10 Control 0\n# ℹ 90 more rows\ntreat_control |>\n summarise(\n treat_result = sum(binary_effect) / length(binary_effect),\n .by = group\n )\n\n# A tibble: 2 × 2\n group treat_result\n <chr> <dbl>\n1 Treatment 0.552\n2 Control 0.333\nIn this case, we draw either 0 or 1, 100 times, randomly assigning each draw to the treatment or control group, and then the estimate of the average effect of being treated is 0.22.\nMore broadly, to tell causal stories we need to bring together theory and a detailed knowledge of what we are interested in (Cunningham 2021, 4). In Chapter 7 we discussed gathering data that we observed about the world. In this chapter we are going to be more active about turning the world into the data that we need. As the researcher, we will decide what to measure and how, and we will need to define what we are interested in. We will be active participants in the data-generating process. That is, if we want to use this data, then as researchers we must go out and hunt it.\nIn this chapter we cover experiments, especially constructing treatment and control groups, and appropriately considering their results. We go through implementing a survey. We discuss some aspects of ethical behavior in experiments through reference to the Tuskegee Syphilis Study and the Extracorporeal Membrane Oxygenation (ECMO) experiment and go through various case studies. 
Finally, we then turn to A/B testing, which is extensively used in industry, and consider a case study based on Upworthy data.\nRonald Fisher, the twentieth century statistician, and Francis Galton, the nineteenth century statistician, are the intellectual grandfathers of much of the work that we cover in this chapter. In some cases it is directly their work, in other cases it is work that built on their contributions. Both men believed in eugenics, amongst other things that are generally reprehensible. In the same way that art history acknowledges, say, Caravaggio as a murderer, while also considering his work and influence, so too must statistics and data science more generally concern themselves with this past, at the same time as we try to build a better future.",
"crumbs": [
"Acquisition",
- "8 Hunt data "
+ "8 Experiments and surveys "
]
},
{
"objectID": "08-hunt.html#introduction",
"href": "08-hunt.html#introduction",
- "title": "8 Hunt data",
+ "title": "8 Experiments and surveys",
"section": "",
"text": "divide the dataset in two—treated and not treated—and have a binary effect variable—lucky day or not;\nsum the variable, then divide it by the length of the variable; and\ncompare this value between the two groups.",
"crumbs": [
"Acquisition",
- "8 Hunt data "
+ "8 Experiments and surveys "
]
},
{
"objectID": "08-hunt.html#field-experiments-and-randomized-controlled-trials",
"href": "08-hunt.html#field-experiments-and-randomized-controlled-trials",
- "title": "8 Hunt data",
+ "title": "8 Experiments and surveys",
"section": "8.2 Field experiments and randomized controlled trials",
"text": "8.2 Field experiments and randomized controlled trials\n\n8.2.1 Randomization\nCorrelation can be enough in some settings (Hill 1965), but to be able to make forecasts when things change, and circumstances are slightly different, we should try to understand causation. Economics went through a credibility revolution in the 2000s (Angrist and Pischke 2010). Economists realized previous work was not as reliable as it could be. There was increased concern with research design and use of experiments. This also happened in other social sciences, such as political science at a similar time (Druckman and Green 2021).\nThe key is the counterfactual: what would have happened in the absence of the treatment. Ideally, we could keep everything else constant, randomly divide the world into two groups, and treat one and not the other. Then we could be confident that any difference between the two groups was due to that treatment. The reason for this is that if we have some population and we randomly select two groups from it, then those two groups (provided they are both big enough) should have the same characteristics as the population. Randomized controlled trials (RCTs) and A/B testing attempt to get us as close to this “gold standard” as we can hope.\nWhen we, and others such as Athey and Imbens (2017b), use such positive language to refer to these approaches, we do not mean to imply that they are perfect. Just that they can be better than most of the other options. For instance, in Chapter 15 we will consider causality from observational data, and while this is sometimes all that we can do, the circumstances in which it is possible to evaluate both makes it clear that approaches based on observational data are usually second-best (Gordon et al. 2019; Gordon, Moakler, and Zettelmeyer 2022). 
RCTs and A/B testing also bring other benefits, such as the chance to design a study that focuses on a particular question and tries to uncover the mechanism by which the effect occurs (Alsan and Finkelstein 2021). But they are not perfect, and the embrace of RCTs has not been unanimous (Deaton 2010).\nOne bedrock of experimental practice is that it be blinded, that is, a participant does not know whether they are in the treatment or control group. A failure to blind, especially with subjective outcomes, is grounds for the dismissal of an entire experiment in some disciplines (Edwards 2017). Ideally experiments should be double-blind, that is, even the researcher does not know. Stolberg (2006) discusses an early example of a randomized double-blind trial in 1835 to evaluate the effect of homeopathic drugs where neither the participants nor the organizers knew who was in which group. This is rarely the case for RCTs and A/B testing. Again, this is not to say they are not useful—after all in 1847 Semmelweis identified the benefit of having an intern wash their hands before delivering babies without a blinded study (Morange 2016, 121). Another major concern is with the extent to which the result found in the RCT generalizes to outside of that setting. There are typically few RCTs conducted over a long time, although it is possible this is changing and Bouguen et al. (2019) provide some RCTs that could be followed up on to assess long-term effects. Finally, the focus on causality has not been without cost in social sciences. Some argue that a causality-focused approach centers attention on the types of questions that it can answer at the expense of other types of questions.\n\n\n8.2.2 Simulated example: cats or dogs\nWe hope to be able to establish treatment and control groups that are the same, but for the treatment. This means creating the control group is critical because when we do that, we establish the counterfactual. 
We might be worried about, say, underlying trends, which is one issue with a before-and-after comparison, or selection bias, which could occur when we allow self-selection into the treatment group. Either of these issues could result in biased estimates. We use randomization to go some way to addressing these.\nTo get started, we simulate a population, and then randomly sample from it. We will set it up so that half the population likes blue, and the other half likes white. And further, if someone likes blue then they almost surely prefer dogs, but if they like white then they almost surely prefer cats. Simulation is a critical part of the workflow advocated in this book. This is because we know what the outcomes should be from the analysis of simulated data. Whereas if we go straight to analyzing real data, then we do not know if unexpected outcomes are due to our own analysis errors, or actual results. Another good reason it is useful to take this approach of simulation is that when you are working in teams the analysis can get started before the data collection and cleaning is completed. 
The simulation will also help the collection and cleaning team think about tests they should run on their data.\n\nset.seed(853)\n\nnum_people <- 5000\n\npopulation <- tibble(\n person = 1:num_people,\n favorite_color = sample(c(\"Blue\", \"White\"), size = num_people, replace = TRUE),\n prefers_dogs = if_else(favorite_color == \"Blue\", \n rbinom(num_people, 1, 0.9), \n rbinom(num_people, 1, 0.1))\n )\n\npopulation |>\n count(favorite_color, prefers_dogs)\n\n# A tibble: 4 × 3\n favorite_color prefers_dogs n\n <chr> <int> <int>\n1 Blue 0 256\n2 Blue 1 2291\n3 White 0 2239\n4 White 1 214\n\n\nBuilding on the terminology and concepts introduced in Chapter 6, we now construct a sampling frame that contains about 80 per cent of the target population.\n\nset.seed(853)\n\nframe <-\n population |>\n mutate(in_frame = rbinom(n = num_people, 1, prob = 0.8)) |> \n filter(in_frame == 1)\n\nframe |>\n count(favorite_color, prefers_dogs)\n\n# A tibble: 4 × 3\n favorite_color prefers_dogs n\n <chr> <int> <int>\n1 Blue 0 201\n2 Blue 1 1822\n3 White 0 1803\n4 White 1 177\n\n\nFor now, we will set aside dog or cat preferences and focus on creating treatment and control groups with favorite color only.\n\nset.seed(853)\n\nsample <-\n frame |>\n select(-prefers_dogs) |>\n mutate(\n group = \n sample(x = c(\"Treatment\", \"Control\"), size = nrow(frame), replace = TRUE\n ))\n\nWhen we look at the mean for the two groups, we can see that the proportions that prefer blue or white are very similar to what we specified (Table 8.1).\n\nsample |>\n count(group, favorite_color) |>\n mutate(prop = n / sum(n),\n .by = group) |>\n kable(\n col.names = c(\"Group\", \"Prefers\", \"Number\", \"Proportion\"),\n digits = 2,\n format.args = list(big.mark = \",\")\n )\n\n\n\nTable 8.1: Proportion of the groups that prefer blue or 
white\n\n\n\n\n\n\nGroup\nPrefers\nNumber\nProportion\n\n\n\n\nControl\nBlue\n987\n0.50\n\n\nControl\nWhite\n997\n0.50\n\n\nTreatment\nBlue\n1,036\n0.51\n\n\nTreatment\nWhite\n983\n0.49\n\n\n\n\n\n\n\n\nWe randomized with favorite color only. But we should also find that we took dog or cat preferences along at the same time and will have a “representative” share of people who prefer dogs to cats. We can look at our dataset (Table 8.2).\n\nsample |>\n left_join(\n frame |> select(person, prefers_dogs),\n by = \"person\"\n ) |>\n count(group, prefers_dogs) |>\n mutate(prop = n / sum(n),\n .by = group) |>\n kable(\n col.names = c(\n \"Group\",\n \"Prefers dogs to cats\",\n \"Number\",\n \"Proportion\"\n ),\n digits = 2,\n format.args = list(big.mark = \",\")\n )\n\n\n\nTable 8.2: Proportion of the treatment and control group that prefer dogs or cats\n\n\n\n\n\n\nGroup\nPrefers dogs to cats\nNumber\nProportion\n\n\n\n\nControl\n0\n1,002\n0.51\n\n\nControl\n1\n982\n0.49\n\n\nTreatment\n0\n1,002\n0.50\n\n\nTreatment\n1\n1,017\n0.50\n\n\n\n\n\n\n\n\nIt is exciting to have a representative share on “unobservables”. (In this case, we do “observe” them—to illustrate the point—but we did not select on them). We get this because the variables were correlated. But it will break down in several ways that we will discuss. It also assumes large enough groups. For instance, if we considered specific dog breeds, instead of dogs as an entity, we may not find ourselves in this situation. To check that the two groups are the same, we look to see if we can identify a difference between the two groups based on observables, theory, experience, and expert opinion. In this case we looked at the mean, but we could look at other aspects as well.\nThis would traditionally bring us to Analysis of Variance (ANOVA). ANOVA was introduced around 100 years ago by Fisher while he was working on statistical problems in agriculture. (Stolley (1991) provides additional background on Fisher.) 
This is less unexpected than it may seem because historically agricultural research was closely tied to statistical innovation. Often statistical methods were designed to answer agricultural questions such as “does fertilizer work?” and were only later adapted to clinical trials (Yoshioka 1998). It was relatively easy to divide a field into “treated” and “non-treated”, and the magnitude of any effect was likely to be large. While appropriate for that context, often these same statistical approaches are still taught today in introductory material, even when they are being applied in different circumstances to those they were designed for. It almost always pays to take a step back and think about what is being done and whether it is appropriate to the circumstances. We mention ANOVA here because of its importance historically. There is nothing wrong with it in the right setting. But the number of modern use-cases where it is the best option tends to be small. It might be better to build the model that underpins ANOVA ourselves, which we cover in Chapter 12.\n\n\n8.2.3 Treatment and control\nIf the treatment and control groups are the same in all ways and remain that way, but for the treatment, then we have internal validity, which is to say that our control will work as a counterfactual and our results can speak to a difference between the groups in that study. Internal validity means that our estimates of the effect of the treatment speak to the treatment and not some other aspect. It means that we can use our results to make claims about what happened in the experiment.\nIf the group to which we applied our randomization were representative of the broader population, and the experimental set-up was like outside conditions, then we further could have external validity. That would mean that the difference that we find does not just apply in our own experiment, but also in the broader population. 
External validity means that we can use our experiment to make claims about what would happen outside the experiment. It is randomization that has allowed that to happen. In practice we would not just rely on one experiment but would instead consider that a contribution to a broader evidence-collection effort (Duflo 2020, 1955).\n\n\n\n\n\n\nShoulders of giants\n\n\n\nDr Esther Duflo is Abdul Latif Jameel Professor of Poverty Alleviation and Development Economics at MIT. After earning a PhD in Economics from MIT in 1999, she remained at MIT as an assistant professor, being promoted to full professor in 2003. One area of her research is economic development where she uses randomized controlled trials to understand how to address poverty. One of her most important books is Poor Economics (Banerjee and Duflo 2011). One of her most important papers is Banerjee et al. (2015) which uses randomization to examine the effect of microfinance. She was awarded the Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel in 2019.\n\n\nBut this means we need randomization twice. Firstly, into the group that was subject to the experiment, and then secondly, between treatment and control. How do we think about this randomization, and to what extent does it matter?\nWe are interested in the effect of being treated. It may be that we charge different prices, which would be a continuous treatment variable, or that we compare different colors on a website, which would be a discrete treatment variable. Either way, we need to make sure that the groups are otherwise the same. How can we be convinced of this? One way is to ignore the treatment variable and to examine all other variables, looking for whether we can detect a difference between the groups based on any other variables. 
For instance, if we are conducting an experiment on a website, then are the groups roughly similar in terms of, say:\n\nMicrosoft and Apple users?\nSafari, Chrome, and Firefox users?\nMobile and desktop users?\nUsers from certain locations?\n\nFurther, are the groups representative of the broader population? These are all threats to the validity of our claims. For instance, the Nationscape survey which we consider later in this chapter was concerned about the number of Firefox users who completed the survey. In the end they exclude a subset of those respondents (Vavreck and Tausanovitch 2021, 5).\nWhen done properly, that is if the treatment is truly independent, then we can estimate the average treatment effect (ATE). In a binary treatment variable setting this is:\n\\[\\mbox{ATE} = \\mathbb{E}[Y|t=1] - \\mathbb{E}[Y|t=0].\\]\nThat is, the difference between the treated group, \\(t = 1\\), and the control group, \\(t = 0\\), when measured by the expected value of the outcome, \\(Y\\). The ATE becomes the difference between two conditional expectations.\nTo illustrate this concept, we simulate some data that shows an average difference of one between the treatment and control groups.\n\nset.seed(853)\n\nate_example <- \n tibble(person = 1:1000,\n treated = sample(c(\"Yes\", \"No\"), size = 1000, replace = TRUE)) |>\n mutate(outcome = case_when(\n treated == \"No\" ~ rnorm(n(), mean = 5, sd = 1),\n treated == \"Yes\" ~ rnorm(n(), mean = 6, sd = 1),\n ))\n\nWe can see the difference, which we simulated to be one, between the two groups in Figure 8.1. 
And we can compute the average between the groups and then the difference to see also that we roughly get back the result that we put in (Table 8.3).\n\nate_example |>\n ggplot(aes(x = outcome, fill = treated)) +\n geom_histogram(position = \"dodge2\", binwidth = 0.2) +\n theme_minimal() +\n labs(x = \"Outcome\", \n y = \"Number of people\", \n fill = \"Person was treated\") +\n scale_fill_brewer(palette = \"Set1\") +\n theme(legend.position = \"bottom\")\n\n\n\n\n\n\n\nFigure 8.1: Simulated data showing a difference between the treatment and control group\n\n\n\n\n\n\nate_example |>\n summarise(mean = mean(outcome),\n .by = treated) |> \n kable(\n col.names = c(\n \"Was treated?\",\n \"Average effect\"\n ),\n digits = 2\n )\n\n\n\nTable 8.3: Average difference between the treatment and control groups for data simulated to have an average difference of one\n\n\n\n\n\n\nWas treated?\nAverage effect\n\n\n\n\nYes\n6.06\n\n\nNo\n5.03\n\n\n\n\n\n\n\n\nUnfortunately, there is often a difference between simulated data and reality. For instance, an experiment cannot run for too long otherwise people may be treated many times or become inured to the treatment; but it cannot be too short otherwise we cannot measure longer-term outcomes. We cannot have a “representative” sample across every facet of a population, but if not, then the treatment and control may be different. Practical difficulties may make it difficult to follow up with certain groups and so we end up with a biased collection. Some questions to explore when working with real experimental data include:\n\nHow are the participants being selected into the frame for consideration?\nHow are they being selected for treatment? We would hope this is being done randomly, but this term is applied to a variety of situations. Additionally, early “success” can lead to pressure to treat everyone, especially in medical settings.\nHow is treatment being assessed?\nTo what extent is random allocation ethical and fair? 
Some argue that shortages mean it is reasonable to randomly allocate, but that may depend on how linear the benefits are. It may also be difficult to establish definitions, and the power imbalance between those making these decisions and those being treated should be considered.\n\nBias and other issues are not the end of the world. But we need to think about them carefully. Selection bias, introduced in Chapter 4, can be adjusted for, but only if it is recognized. For instance, how would the results of a survey about the difficulty of a university course differ if only students who completed the course were surveyed, and not those who dropped out? We should always work to try to make our dataset as representative as possible when we are creating it, but it may be possible to use a model to adjust for some of the bias after the fact. For instance, if there were a variable that was correlated with, say, attrition, then it could be added to the model either by itself, or as an interaction. Similarly, if there was correlation between the individuals. For instance, if there was some “hidden variable” that we did not know about that meant some individuals were correlated, then we could use wider standard errors. This needs to be done carefully and we discuss this further in Chapter 15. That said, if such issues can be anticipated, then it may be better to change the experiment. For instance, perhaps it would be possible to stratify by that variable.\n\n\n8.2.4 Fisher’s tea party\nThe British are funny when it comes to tea. There is substantial, persistent, debate in Britain about how to make the perfect “cuppa” with everyone from George Orwell to John Lennon weighing in. Some say to add the milk first. Others, to add it last. YouGov, a polling company, found that most respondents put milk in last (Smith 2018). 
But one might wonder whether the order matters at all.\nFisher introduced an experiment designed to see if a person can distinguish between a cup of tea where the milk was added first, or last. We begin by preparing eight cups of tea: four with milk added first and the other four with milk added last. We then randomize the order of all eight cups. We tell the taster, whom we will call “Ian”, about the experimental set-up: there are eight cups of tea, four of each type, he will be given cups of tea in a random order, and his task is to group them into two groups.\nOne of the nice aspects of this experiment is that we can do it ourselves. There are a few things to be careful of in practice. These include:\n\nthat the quantities of milk and tea are consistent,\nthe groups are marked in some way that the taster cannot see, and\nthe order is randomized.\n\nAnother nice aspect of this experiment is that we can calculate the chance that Ian is able to randomly get the groupings correct. To decide if his groupings were likely to have occurred at random, we need to calculate the probability this could happen. First, we count the number of successes out of the four that were chosen. There are: \\({8 \\choose 4} = \\frac{8!}{4!(8-4)!}=70\\) possible outcomes (Fisher [1935] 1949, 14). This notation means there are eight items in the set, and we are choosing four of them, and is used when the order of choice does not matter.\nWe are asking Ian to group the cups, not to identify which is which, and so there are two ways for him to be perfectly correct. He could either correctly identify all the ones that were milk-first (one outcome out of 70) or correctly identify all the ones that were tea-first (one outcome out of 70). This means the probability of this event is: \\(\\frac{2}{70}\\), or about three per cent.\nAs Fisher ([1935] 1949, 15) makes clear, this now becomes a judgement call. 
We need to consider the weight of evidence that we require before we accept the groupings did not occur by chance and that Ian was aware of what he was doing. We need to decide what evidence it takes for us to be convinced. If there is no possible evidence that would dissuade us from the view that we held coming into the experiment, say, that there is no difference between milk-first and tea-first, then what is the point of doing an experiment? We expect that if Ian got it completely right, then the reasonable person would accept that he was able to tell the difference.\nWhat if he is almost perfect? By chance, there are 16 ways for a person to be “off-by-one”. Because Ian must place four cups in each group, being off-by-one means that he swaps exactly one pair: he thinks one cup was milk-first when it was tea-first, and there are \\({4 \\choose 1} = 4\\) ways this could happen, and he thinks one cup was tea-first when it was milk-first, for which there are again \\({4 \\choose 1} = 4\\) ways. Any choice of the first can be combined with any choice of the second, so there are \\(4 \\times 4 = 16\\) such outcomes and the probability is \\(\\frac{4\\times 4}{70}\\), or about 23 per cent. Given there is an almost 23 per cent chance of being off-by-one just by randomly grouping the teacups, this outcome probably would not convince us that Ian could tell the difference between tea-first and milk-first.\nWhat we are looking for, in order to claim something is experimentally demonstrable, is that we have come to know the features of an experiment where such a result is reliably found (Fisher [1935] 1949, 16). We need a weight of evidence rather than just one experiment. We are looking to thoroughly interrogate our data and our experiments, and to think precisely about the analysis methods we are using. Rather than searching for meaning in constellations of stars, we want to make it as easy as possible for others to reproduce our work. 
It is in that way that our conclusions stand a better chance of holding up in the long term.\n\n\n8.2.5 Ethical foundations\nThe weight of evidence in medical settings can be measured in lost lives. One reason ethical practice in medical experiments developed is to prevent the unnecessary loss of life. We now detail two cases where human life may have been unnecessarily lost that helped establish foundations of ethical practice. We consider the need to obtain informed consent by discussing the Tuskegee Syphilis Study. And the need to ensure that an experiment is necessary by discussing the ECMO experiments.\n\n8.2.5.1 Tuskegee Syphilis Study\nFollowing Brandt (1978) and Alsan and Wanamaker (2018), the Tuskegee Syphilis Study is an infamous medical trial that began in 1932. As part of this experiment, 400 Black Americans with syphilis were not given appropriate treatment, nor even told they had syphilis, well after a standard treatment for syphilis was established and widely available. A control group, without syphilis, were also given non-effective drugs. These financially poor Black Americans in the United States South were offered minimal compensation and not told they were part of an experiment. Further, extensive work was undertaken to ensure the men would not receive treatment from anywhere, including writing to local doctors and the local health department. Even after some of the men were drafted and told to immediately get treatment, the draft board complied with a request to have the men excluded from treatment. By the time the study was stopped in 1972, more than half of the men were deceased and many of the deaths were from syphilis-related causes.\nThe effect of the Tuskegee Syphilis Study was felt not just by the men in the study, but more broadly. 
Alsan and Wanamaker (2018) found that it is associated with a decrease in life expectancy at age 45 of up to 1.5 years for Black men located around central Alabama because of medical mistrust and decreased interactions with physicians. In response the United States established requirements for Institutional Review Boards and President Clinton made a formal apology in 1997. Brandt (1978, 27) says:\n\nIn retrospect the Tuskegee Study revealed more about the pathology of racism than the pathology of syphilis; more about the nature of scientific inquiry than the nature of the disease process\\(\\dots\\) [T]he notion that science is a value-free discipline must be rejected. The need for greater vigilance in assessing the specific ways in which social values and attitudes affect professional behavior is clearly indicated.\n\nHeller (2022) provides further background on the Tuskegee Syphilis Study.\n\n\n\n\n\n\nShoulders of giants\n\n\n\nDr Marcella Alsan is a Professor of Public Policy at Harvard University. She has an MD from Loyola University and earned a PhD in Economics from Harvard University in 2012. She was appointed as an assistant professor at Stanford, being promoted to full professor in 2019 when she returned to Harvard. One area of her research is health inequality, and one particularly important paper is Alsan and Wanamaker (2018), which we discussed above. She was awarded a MacArthur Foundation Fellowship in 2021.\n\n\n\n\n8.2.5.2 Extracorporeal membrane oxygenation (ECMO)\nTurning to the evaluation of extracorporeal membrane oxygenation (ECMO), Ware (1989) describes how they viewed ECMO as a possible treatment for persistent pulmonary hypertension in newborn children. They enrolled 19 patients and used conventional medical therapy on ten of them, and ECMO on nine of them. It was found that six of the ten in the control group survived while all in the treatment group survived. 
Ware (1989) used randomized consent whereby only the parents of infants randomly selected to be treated with ECMO were asked to consent.\nWe are concerned with “equipoise”, by which we refer to a situation in which there is genuine uncertainty about whether the treatment is more effective than conventional procedures. In medical settings even if there is initial equipoise it could be undermined if the treatment is found to be effective early in the study. Ware (1989) describes how after the results of these first 19 patients, randomization stopped and only ECMO was used. The recruiters and those treating the patients were initially not told that randomization had stopped. It was decided that this complete allocation to ECMO would continue “until either the 28th survivor or the 4th death was observed”. After 19 of 20 additional patients survived the trial was terminated. The experiment was effectively divided into two phases: in the first there was randomized use of ECMO, and in the second only ECMO was used.\nOne approach in these settings is a “randomized play-the-winner” rule following Wei and Durham (1978). Treatment is still randomized, but the probability shifts with each successful treatment to make treatment more likely, and there is some stopping rule. Berry (1989) argues that far from the need for a more sophisticated stopping rule, there was no need for this study of ECMO because equipoise never existed. Berry (1989) re-visits the literature mentioned by Ware (1989) and finds extensive evidence that ECMO was already known to be effective. Berry (1989) points out that there is almost never complete consensus and so one could almost always argue, inappropriately, for the existence of equipoise even in the face of a substantial weight of evidence. 
Berry (1989) further criticizes Ware (1989) for the use of randomized consent because of the potential that there may have been different outcomes for the infants subject to conventional medical therapy had their parents known there were other options.\nThe Tuskegee Syphilis Study and ECMO experiments may seem quite far from our present circumstances. While it may be illegal to do this exact research these days, it does not mean that unethical research does not still happen. For instance, we see it in machine learning applications in health and other areas; while we are not meant to explicitly discriminate and we are meant to get consent, it does not mean that we cannot implicitly discriminate without any type of consumer buy-in. For instance, Obermeyer et al. (2019) describes how many health care systems in the United States use algorithms to score the severity of how sick a patient is. They show that for the same score, Black patients are sicker, and that if Black patients were scored in the same way as White patients, then they would receive considerably more care. They find that the discrimination occurs because the algorithm is based on health care costs, rather than sickness. But because access to healthcare is unequally distributed between Black and White patients, the algorithm, however inadvertently, perpetuates racial bias.",
"crumbs": [
"Acquisition",
- "8 Hunt data "
+ "8 Experiments and surveys "
]
},
{
"objectID": "08-hunt.html#surveys",
"href": "08-hunt.html#surveys",
- "title": "8 Hunt data",
+ "title": "8 Experiments and surveys",
"section": "8.3 Surveys",
"text": "8.3 Surveys\nHaving decided what to measure, one common way to get values is to use a survey. This is especially challenging, and there is an entire field—survey research—focused on it. Edelman, Vittert, and Meng (2021) make it clear that there are no new problems here, and the challenges that we face today are closely related to those that were faced in the past. There are many ways to implement surveys, and this decision matters. For some time, the only option was face-to-face surveys, where an enumerator conducted the survey in-person with the respondent. Eventually surveys began to be conducted over the telephone, again by an enumerator. One issue in both these settings was a considerable interviewer effect (Elliott et al. 2022). The internet brought a third era of survey research, characterized by low participation rates (Groves 2011). Surveys are a popular and invaluable way to get data. Face-to-face and telephone surveys are still used and have an important role to play, but many surveys are now internet-based.\nThere are many dedicated survey platforms, such as Survey Monkey and Qualtrics, that are largely internet-based. One especially common approach, because it is free, is to use Google Forms. In general, the focus of those platforms is enabling the user to construct and send a survey form. They typically expect the user already has contact details for some sampling frame.\nOther platforms, such as Amazon Mechanical Turk, mentioned in Chapter 3, and Prolific, focus on providing respondents. When using platforms like those we should try to understand who those respondents are and how they might differ from the population of interest (Levay, Freese, and Druckman 2016; Enns and Rothschild 2022).\nThe survey form needs to be considered within the context of the broader research and with special concern for the respondent. Try to conduct a test of the survey before releasing it. 
Light, Singer, and Willett (1990, 213), in the context of studies to evaluate higher education, say that there is no occasion in which a pilot study will not bring improvements, and that they are almost always worth it. In the case of surveys, we go further. If you do not have the time, or budget, to test a survey then it might be better to re-consider whether the survey should be done.\nTry to test the wording of a survey (Tourangeau, Rips, and Rasinski 2000, 23). When designing the survey, we need to have survey questions that are conversational and flow from one to the next, grouped within topics (Elson 2018). But we should also consider the cognitive load that we place on the respondent, and vary the difficulty of the questions.\nWhen designing a survey, the critical task is to keep the respondent front-of-mind (Dillman, Smyth, and Christian [1978] 2014, 94). Drawing on Swain (1985), all questions need to be relevant and able to be answered by the respondent. The wording of the questions should be based on what the respondent would be comfortable with. The decision between different question types turns on minimizing both error and the burden that we impose on the respondent. In general, if there are a small number of clear options then multiple-choice questions are appropriate. In that case, the responses should usually be mutually exclusive and collectively exhaustive. If they are not mutually exclusive, then this needs to be signaled in the text of the question. It is also important that units are specified, and that standard concepts are used, to the extent possible.\nOpen text boxes may be appropriate if there are many potential answers. This will increase both the time the respondent spends completing the survey and the time it will take to analyze the answers. Only ask one question at a time and try to ask questions in a neutral way that does not lead to one particular response. 
Testing the survey helps avoid ambiguous or double-barreled questions, which could confuse respondents. The subject matter of the survey will also affect the appropriate choice of question type. For instance, potentially “threatening” topics may be better considered with open-ended questions (Blair et al. 1977).\nAll surveys need to have an introduction that specifies a title for the survey, who is conducting it, their contact details, and the purpose. It should also include a statement about the confidentiality protections that are in place, and any ethics review board clearances that were obtained.\nWhen doing surveys, it is critical to ask the right person. For instance, Lichand and Wolf (2022) consider child labor. The extent of child labor is typically based on surveys of parents. When children were surveyed a considerable under-reporting by parents was found.\nOne aspect of particular concern is questions about sexual orientation and gender identity. While this is an evolving area, The White House (2023) provides recommendations for best practice, such as considering how the data will be used, and ensuring sufficient sample size. With regard to asking about sexual orientation they recommend the following question:\n\n“Which of the following best represents how you think of yourself?”\n\n“Gay or lesbian”\n“Straight, that is not gay or lesbian”\n“Bisexual”\n“I use a different term [free-text]”\n“I don’t know”\n\n\nAnd with regard to gender, they recommend a multi-question approach:\n\n“What sex were you assigned at birth, on your original birth certificate?”\n\n“Female”\n“Male”\n\n“How do you currently describe yourself (mark all that apply)?”\n\n“Female”\n“Male”\n“Transgender”\n“I use a different term [free-text]”\n\n\nAgain, this is an evolving area and best practice is likely to change.\nFinally, returning to the reason for doing surveys in the first place, while doing all this, it is important to also keep what we are interested in measuring in mind. 
Check that the survey questions relate to the estimand.\n\n8.3.1 Democracy Fund Voter Study Group\nAs an example of survey data, we will consider the Democracy Fund Voter Study Group Nationscape dataset (Tausanovitch and Vavreck 2021). This is a large series of surveys conducted between July 2019 and January 2021. It is weighted on a number of variables including: gender, major census regions, race, Hispanic ethnicity, household income, education, and age. Holliday et al. (2021) describe it as a convenience sample, which was introduced in Chapter 6, based on demographics. In this case, Holliday et al. (2021) detail how the sample was provided by Lucid, who operate an online platform for survey respondents, based on certain demographic quotas. Holliday et al. (2021) found that results are similar to government and commercial surveys.\nTo get the dataset, go to the Democracy Fund Voter Study Group website, then look for “Nationscape” and request access to the data. This could take a day or two. After getting access, focus on the “.dta” files. Nationscape conducted many surveys in the lead-up to the 2020 United States election, so there are many files. The filename is the reference date, where “ns20200625” refers to 25 June 2020. That is the file that we use here, but many of them are similar. We download and save it as “ns20200625.dta”.\nAs introduced in Online Appendix A, we can import “.dta” files after installing haven and labelled. 
The code that we use to import and prepare the survey dataset is based on that of Mitrovski, Yang, and Wankiewicz (2020).\n\nraw_nationscape_data <-\n read_dta(\"ns20200625.dta\")\n\n\n# The Stata format separates labels so reunite those\nraw_nationscape_data <-\n to_factor(raw_nationscape_data)\n\n# Just keep relevant variables\nnationscape_data <-\n raw_nationscape_data |>\n select(vote_2020, gender, education, state, age)\n\n\nnationscape_data\n\n# A tibble: 6,479 × 5\n vote_2020 gender education state age\n * <fct> <fct> <fct> <chr> <dbl>\n 1 Donald Trump Female Associate Degree WI 49\n 2 I am not sure/don't know Female College Degree (such as B.A., B.… VA 39\n 3 Donald Trump Female College Degree (such as B.A., B.… VA 46\n 4 Donald Trump Female High school graduate TX 75\n 5 Donald Trump Female High school graduate WA 52\n 6 I would not vote Female Other post high school vocationa… OH 44\n 7 Joe Biden Female Completed some college, but no d… MA 21\n 8 Joe Biden Female Completed some college, but no d… TX 38\n 9 Donald Trump Female Completed some college, but no d… CA 69\n10 Donald Trump Female College Degree (such as B.A., B.… NC 59\n# ℹ 6,469 more rows\n\n\nAt this point we want to clean up a few issues. 
For instance, for simplicity, remove anyone not voting for Trump or Biden.\n\nnationscape_data <-\n nationscape_data |>\n filter(vote_2020 %in% c(\"Joe Biden\", \"Donald Trump\")) |>\n mutate(vote_biden = if_else(vote_2020 == \"Joe Biden\", 1, 0)) |>\n select(-vote_2020)\n\nWe then want to create some variables of interest.\n\nnationscape_data <-\n nationscape_data |>\n mutate(\n age_group = case_when(\n age <= 29 ~ \"18-29\",\n age <= 44 ~ \"30-44\",\n age <= 59 ~ \"45-59\",\n age >= 60 ~ \"60+\",\n TRUE ~ \"Trouble\"\n ),\n gender = case_when(\n gender == \"Female\" ~ \"female\",\n gender == \"Male\" ~ \"male\",\n TRUE ~ \"Trouble\"\n ),\n education_level = case_when(\n education %in% c(\n \"3rd Grade or less\",\n \"Middle School - Grades 4 - 8\",\n \"Completed some high school\",\n \"High school graduate\"\n ) ~ \"High school or less\",\n education %in% c(\n \"Other post high school vocational training\",\n \"Completed some college, but no degree\"\n ) ~ \"Some post sec\",\n education %in% c(\n \"Associate Degree\",\n \"College Degree (such as B.A., B.S.)\",\n \"Completed some graduate, but no degree\"\n ) ~ \"Post sec +\",\n education %in% c(\"Masters degree\",\n \"Doctorate degree\") ~ \"Grad degree\",\n TRUE ~ \"Trouble\"\n )\n ) |>\n select(-education,-age)\n\nWe will draw on this dataset in Chapter 16, so we will save it.\n\nwrite_csv(x = nationscape_data,\n file = \"nationscape_data.csv\")\n\nWe can also have a look at some of the variables (Figure 8.2).\n\nnationscape_data |>\n mutate(supports = if_else(vote_biden == 1, \"Biden\", \"Trump\")) |> \n mutate(supports = factor(supports, levels = c(\"Trump\", \"Biden\"))) |> \n ggplot(mapping = aes(x = age_group, fill = supports)) +\n geom_bar(position = \"dodge2\") +\n theme_minimal() +\n labs(\n x = \"Age-group of respondent\",\n y = \"Number of respondents\",\n fill = \"Voted for\"\n ) +\n facet_wrap(vars(gender)) +\n guides(x = guide_axis(angle = 90)) +\n theme(legend.position = \"bottom\") +\n 
scale_fill_brewer(palette = \"Set1\")\n\n\n\n\n\n\n\nFigure 8.2: Examining some of the variables from the Nationscape survey dataset",
"crumbs": [
"Acquisition",
- "8 Hunt data "
+ "8 Experiments and surveys "
]
},
{
"objectID": "08-hunt.html#rct-examples",
"href": "08-hunt.html#rct-examples",
- "title": "8 Hunt data",
+ "title": "8 Experiments and surveys",
"section": "8.4 RCT examples",
"text": "8.4 RCT examples\n\n8.4.1 The Oregon Health Insurance Experiment\nIn the United States, unlike many developed countries, basic health insurance is not necessarily available to all residents, even those on low incomes. The Oregon Health Insurance Experiment involved low-income adults in Oregon, a state in the northwest of the United States, from 2008 to 2010 (Finkelstein et al. 2012).\n\n\n\n\n\n\nShoulders of giants\n\n\n\nDr Amy Finkelstein is John & Jennie S. Macdonald Professor of Economics at MIT. After earning a PhD in Economics from MIT in 2001, she was a Junior Fellow at the Harvard Society of Fellows, before returning to MIT as an assistant professor in 2005, being promoted to full professor in 2008. One area of her research is health economics where she uses randomized controlled trials to understand insurance. She was one of the lead researchers on Finkelstein et al. (2012) which examined the Oregon Health Insurance Experiment. She was awarded the John Bates Clark Medal in 2012 and a MacArthur Foundation Fellowship in 2018.\n\n\nOregon funded 10,000 places in the state-run Medicaid program, which provides health insurance for people with low incomes. A lottery was used to allocate these places, and this was judged fair because it was expected, correctly as it turned out, that demand for places would exceed the supply. In the end, 89,824 individuals signed up.\nThe draws were conducted over a six-month period and 35,169 individuals were selected (the household of those who won the draw were given the opportunity) but only 30 per cent of them turned out to be eligible and completed the paperwork. The insurance lasted indefinitely. 
This random allocation of insurance allowed the researchers to understand the effect of health insurance.\nThe reason that this random allocation is important is that it is not usually possible to compare those with and without insurance because the people who sign up for health insurance differ from those who do not. That decision is “confounded” with other variables and results in selection bias.\nAs the opportunity to apply for health insurance was randomly allocated, the researchers were able to evaluate the health and earnings of those who received health insurance and compare them to those who did not. To do this they used administrative data, such as hospital discharge data, matched credit reports, and, uncommonly, mortality records. These data are limited in extent and so they also conducted a survey.\nThe specifics of the model are not important here, and we will have more to say in Chapter 12, but they estimate the model:\n\\[\ny_{ihj} = \\beta_0 + \\beta_1\\mbox{Lottery} + X_{ih}\\beta_2 + V_{ih}\\beta_3 + \\epsilon_{ihj}\n\\tag{8.1}\\]\nEquation 8.1 explains outcome \\(j\\) (such as health) for an individual \\(i\\) in household \\(h\\) as a function of an indicator variable for whether household \\(h\\) was selected by the lottery. It is the \\(\\beta_1\\) coefficient that is of particular interest. That is the estimate of the mean difference between the treatment and control groups. \\(X_{ih}\\) is a set of variables that are correlated with the probability of being treated. Including them adjusts for that correlation to a certain extent. An example is the number of individuals in a household. And finally, \\(V_{ih}\\) is a set of variables that are not correlated with the lottery, such as demographics and previous hospital discharges.\nLike earlier studies such as Brook et al. (1984), Finkelstein et al. 
(2012) found that the treatment group used more health care including both primary and preventive care as well as hospitalizations but had lower out-of-pocket medical expenditures. More generally, the treatment group reported better physical and mental health.\n\n\n8.4.2 Civic Honesty Around The Globe\nTrust is not something that we think regularly about, but it is fundamental to most interactions, both economic and personal. For instance, many people get paid after they do some work—they are trusting their employer will make good, and vice versa. If you get paid in advance, then they are trusting you. In a strictly naive, one-shot, world without transaction costs, this does not make sense. If you get paid in advance, the incentive is for you to take the money and run in the last pay period before you quit, and through backward induction everything falls apart. We do not live in such a world. For one thing there are transaction costs, for another, generally, we have repeated interactions, and finally, the world usually ends up being fairly small.\nUnderstanding the extent of honesty in different countries may help us to explain economic development and other aspects of interest such as tax compliance, but it is hard to measure. We cannot ask people how honest they are—the liars would lie, resulting in a lemons problem (Akerlof 1970). This is a situation of adverse selection, where the liars know they are liars, but others do not. To get around this Cohn et al. (2019a) conduct an experiment in 355 cities across 40 countries where they “turned in” a wallet that was either empty or contained the local equivalent of US$13.45. They were interested in whether the “recipient” attempted to return the wallet. They found that generally wallets with money were more likely to be returned (Cohn et al. 2019a, 1).\nIn total Cohn et al. (2019a) “turn in” 17,303 wallets to various institutions including banks, museums, hotels, and police stations. 
The importance of such institutions to an economy is well accepted (Acemoglu, Johnson, and Robinson 2001) and they are common across most countries. Importantly, for the experiment, they usually have a reception area where the wallet could be turned in (Cohn et al. 2019a, 1).\nIn the experiment a research assistant turned in the wallet to an employee at the reception area, using a set form of words. The research assistant had to note various features of the setting, such as the gender, age-group, and busyness of the “recipient”. The wallets were transparent and contained a key, a grocery list, and a business card with a name and email address. The outcome of interest was whether an email was sent to the unique email address on the business card in the wallet. The grocery list was included to signal that the owner of the wallet was a local. The key was included as something that was only useful to the owner of the wallet, and never the recipient, in contrast to the cash, to adjust for altruism. The language and currency were adapted to local conditions.\nThe primary treatment in the experiment was whether the wallet contained money or not. The key outcome was whether an attempt was made to return the wallet. It was found that the median response time was 26 minutes, and that if an email was sent then it usually happened within a day (Cohn et al. 2019b, 10).\nUsing the data that were made available for the paper (Cohn 2019) we can see that considerable differences were found between countries (Figure 8.3). In almost all countries wallets with money were more likely to be returned than wallets without. The experiments were conducted across 40 countries, which were chosen because they had enough cities with populations of at least 100,000, and because the research assistants could safely visit and withdraw cash. 
Within those countries, the cities were chosen starting with the largest ones and there were usually 400 observations in each country (Cohn et al. 2019b, 5). Cohn et al. (2019a) further conducted the experiment with the equivalent of US$94.15 in three countries—Poland, the UK, and the US—and found that reporting rates further increased.\n\n\n\n\n\n\n\n\nFigure 8.3: Comparison of the proportion of wallets handed in, by country, depending on whether they contained money\n\n\n\n\n\nIn addition to the experiments, Cohn et al. (2019a) conducted surveys that allowed them to understand some reasons for their findings. During the survey, participants were given one of the scenarios and then asked to answer questions. The use of surveys also allowed them to be specific about the respondents. The survey involved 2,525 respondents (829 in the UK, 809 in Poland, and 887 in the US) (Cohn et al. 2019b, 36). Participants were chosen using attention checks and demographic quotas based on age, gender, and residence, and they received US$4.00 for their participation (Cohn et al. 2019b, 36). The survey did not find that larger rewards were expected for turning in a wallet with more money. But it did find that failure to turn in a wallet with more money caused the respondent to feel more like they had stolen money.",
"crumbs": [
"Acquisition",
- "8 Hunt data "
+ "8 Experiments and surveys "
]
},
{
"objectID": "08-hunt.html#ab-testing",
"href": "08-hunt.html#ab-testing",
- "title": "8 Hunt data",
+ "title": "8 Experiments and surveys",
"section": "8.5 A/B testing",
"text": "8.5 A/B testing\nThe past two decades have probably seen the most experiments ever run, likely by several orders of magnitude. This is because of the extensive use of A/B testing at tech firms (Kohavi et al. 2012). For a long time decisions such as what font to use were based on the Highest Paid Person’s Opinion (HIPPO) (Christian 2012). These days, many large tech companies have extensive infrastructure for experiments. They term them A/B tests because of the comparison of two groups: one that gets treatment A and the other that either gets treatment B or does not see any change (Salganik 2018, 185). We could additionally consider more than two options at which point we typically use the terminology of “arms” of the experiment.\nThe proliferation of experiments in the private sector has brought with it a host of ethical concerns. Some private companies do not have ethical review boards, and there are different ethical concerns in the private sector compared with academia. For instance, many A/B tests are designed, explicitly, to make a consumer more likely to spend money. While society may not generally have a concern with that in the case of an online grocery retailer, society may have a problem in the case of an online gambling website. More extensive legislation and the development of private-sector ethical best practice are both likely as the extent of experimentation in the private sector becomes better known.\nEvery time you are online you are probably subject to tens, hundreds, or potentially thousands, of different A/B tests. While, at their heart, they are just experiments that use sensors to measure data that need to be analyzed, they have many special features that are interesting in their own light. For instance, Kohavi, Tang, and Xu (2020, 3) discuss the example of Microsoft’s search engine Bing. They used A/B testing to examine how to display advertisements. Based on these tests they ended up lengthening the title on the advertisement. 
They found this caused revenue to increase by 12 per cent, or around $100 million annually, without any significant measured trade-off.\nIn this book we use the term A/B test to refer to an experiment that is implemented primarily through a technology stack, that concerns something primarily of the internet, such as a change to a website, and that is measured with sensors rather than a survey. While at their heart they are just experiments, A/B tests have a range of specific concerns. Bosch and Revilla (2022) detail some of these from a statistical perspective. There is something different about doing tens of thousands of small experiments all the time, compared with the typical RCT set-up of conducting one experiment over the course of months.\nRCTs are often, though not exclusively, done in academia or by government agencies, but much of A/B testing occurs in industry. This means that if you are in industry and want to introduce A/B testing to your firm there can be aspects such as culture and relationship building that become important. It can be difficult to convince a manager to run an experiment. Indeed, sometimes it can be easier to experiment by not delivering, or delaying, a change that has already been decided on, which creates a control group rather than a treatment group (Salganik 2018, 188). Sometimes the most difficult aspect of A/B testing is not the analysis but the politics. This is not unique to A/B testing and, for instance, looking at the history of biology, we see that even aspects such as germ theory were not resolved by experiment, but instead by ideology and social standing (Morange 2016, 124).\nFollowing Kohavi, Tang, and Xu (2020, 153), when conducting A/B testing, as with all experiments, we need to be concerned with delivery. In the case of a traditional experiment, it is usually clear how it is being delivered. For instance, we may have the person come to a doctor’s clinic and then inject them with either a drug or a placebo. 
But in the case of A/B testing, it is less obvious. For instance, should we make a change to a website, or to an app? This decision affects our ability to both conduct the experiment and to gather data from it. (Urban, Sreenivasan, and Kannan (2016) provide an overview of A/B testing at Netflix, assuming an app is installed on a PlayStation 4.)\nIt is relatively easy and normal to update a website all the time. This means that small changes can be easily implemented if the A/B test is delivered that way. But in the case of an app, conducting an A/B test becomes a bigger deal. For instance, the release may need to go through an app store, and so would need to be part of a regular release cycle. There is also a selection concern: some users will not update the app and it is possible they are different to those that do regularly update the app.\nThe delivery decision also affects our ability to gather data from the A/B test. A website change is less of a big deal because we get data from a website whenever a user interacts with it. But in the case of an app, the user may use the app offline or with limited data upload which can add complications.\nWe need to plan! For instance, results are unlikely to be available the day after a change to an app, but they could be available the day after a change to a website. Further, we may need to consider our results in the context of different devices and platforms, potentially using, say, regression which will be covered in Chapter 12.\nThe second aspect of concern, as introduced in Chapter 6, is instrumentation. When we conduct a traditional experiment we might, for instance, ask respondents to fill out a survey. But this is usually not done with A/B testing. Instead we usually use various sensors (Kohavi, Tang, and Xu 2020, 162). One approach is to use cookies but different types of users will clear these at different rates. 
Another approach is to force the user to download a tiny image from a server, so that we know when they have completed some action. For instance, this is commonly used to track whether a user has opened an email. But again different types of users will block these at different rates.\nThe third aspect of concern is what we are randomizing over (Kohavi, Tang, and Xu 2020, 166). In the case of traditional experiments, this is often a person, or sometimes various groups of people. But in the case of A/B testing it can be less clear. For instance, are we randomizing over the page, the session, or the user?\nTo think about this, let us consider color. For instance, say we are interested in whether we should change our logo from red to blue on the homepage. If we are randomizing at the page level, then if the user goes to some other page of our website, and then back to the homepage, the logo could change colors. If we are randomizing at the session level, then it could be blue while they use the website this time, but if they close it and come back, then it could be red. Finally, if we are randomizing at a user level then possibly it would always be red for one user, but always blue for another.\nThe extent to which this matters depends on a trade-off between consistency and importance. For instance, if we are A/B testing product prices then consistency is likely an important feature. But if we are A/B testing background colors then consistency might not be as important. On the other hand, if we are A/B testing the position of a log-in button then it might be important that we not move that around too much for any one user, but between users it might matter less.\nIn A/B testing, as in traditional experiments, we are concerned that our treatment and control groups are the same, but for the treatment. In the case of traditional experiments, we satisfy ourselves of this by conducting analysis based on the data that we have after the experiment is conducted. 
That is usually all we can do because it would be weird to treat or control both groups. But in the case of A/B testing, the pace of experimentation allows us to randomly create the treatment and control groups, and then check, before we subject the treatment group to the treatment, that the groups are the same. For instance, if we were to show each group the same website, then we would expect the same outcomes across the two groups. If we found different outcomes then we would know that we may have a randomization issue (Taddy 2019, 129). This is termed an A/A test and was mentioned in Chapter 4.\nWe usually run A/B tests not because we desperately care about the specific outcome, but because that feeds into some other measure that we care about. For instance, do we care whether the website is quite-dark-blue or slightly-dark-blue? Probably not. We probably actually care about the company share price. But what if the A/B test outcome of what is the best blue comes at a cost to the share price?\nTo illustrate this, pretend that we work at a food delivery app, and we are concerned with driver retention. Say we do some A/B tests and find that drivers are always more likely to be retained when they can deliver food to the customer faster. Our hypothetical finding is that faster is better, for driver retention, always. But one way to achieve faster deliveries is for the driver to not put the food into a hot box that would maintain the food’s temperature. Something like that might save 30 seconds, which is significant on a ten-minute delivery. Unfortunately, although we would decide to encourage that based on A/B tests designed to optimize driver-retention, such a decision would likely make the customer experience worse. If customers receive cold food that is meant to be hot, then they may stop using the app, which would be bad for the business. Chen et al. 
(2022) describe how they found a similar situation at Facebook in terms of notifications—although reducing the number of notifications reduced user engagement in the short-term, over the long-term it increased both user satisfaction and app usage.\nThis trade-off could become known during the hypothetical driver experiment if we were to look at customer complaints. It is possible that on a small team the A/B test analyst would be exposed to those tickets, but on a larger team they may not be. Ensuring that A/B tests are not resulting in false optimization is especially important. This is not something that we typically have to worry about in normal experiments. As another example of this Aprameya (2020) describes testing a feature of Duolingo, a language-learning application, which served an ad for Duolingo Plus when a regular Duolingo user was offline. The feature was found to be positive for Duolingo’s revenue, but negative for customer learning habits. Presumably enough customer negativity would eventually have resulted in the feature having a negative effect on revenue. Related to this, we want to think carefully about the nature of the result that we expect. For instance, in the shades of blues example, we are unlikely to find substantial surprises, and so it might be sufficient to try a small range of blues. But what if we considered a wider variety of colors?\n\n\n\n\n\n\nShoulders of giants\n\n\n\nDr Susan Athey is the Economics of Technology Professor at Stanford University. After earning a PhD in Economics from Stanford in 1995, she joined MIT as an assistant professor, returning to Stanford in 2001, where she was promoted to full professor in 2004. One area of her research is applied economics, and one particularly important paper is Abadie et al. (2017), which considers when standard errors need to be clustered. Another is Athey and Imbens (2017a), which considers how to analyze randomized experiments. 
In addition to her academic appointments, she has worked at Microsoft and other technology firms and been extensively involved in running experiments in this context. She was awarded the John Bates Clark Medal in 2007.\n\n\n\n8.5.1 Upworthy\nThe trouble with much of A/B testing is that it is done by private firms and so we typically do not have access to their datasets. But Matias et al. (2021) provide access to a dataset of A/B tests from Upworthy, a media website that used A/B testing to optimize their content. Fitts (2014) provides more background information about Upworthy. And the datasets of A/B tests are available here.\nWe can look at what the dataset looks like and get a sense for it by looking at the names and an extract.\n\nupworthy <- read_csv(\"https://osf.io/vy8mj/download\")\n\n\nupworthy |>\n names()\n\n [1] \"...1\" \"created_at\" \"updated_at\" \n [4] \"clickability_test_id\" \"excerpt\" \"headline\" \n [7] \"lede\" \"slug\" \"eyecatcher_id\" \n[10] \"impressions\" \"clicks\" \"significance\" \n[13] \"first_place\" \"winner\" \"share_text\" \n[16] \"square\" \"test_week\" \n\nupworthy |>\n head()\n\n# A tibble: 6 × 17\n ...1 created_at updated_at clickability_test_id excerpt\n <dbl> <dttm> <dttm> <chr> <chr> \n1 11 2014-11-20 11:33:26 2016-04-02 16:25:54 546dd17e26714c82cc00001c Things…\n2 12 2014-11-20 15:00:01 2016-04-02 16:25:54 546e01d626714c6c4400004e Things…\n3 13 2014-11-20 11:33:51 2016-04-02 16:25:54 546dd17e26714c82cc00001c Things…\n4 14 2014-11-20 11:34:12 2016-04-02 16:25:54 546dd17e26714c82cc00001c Things…\n5 15 2014-11-20 11:34:33 2016-04-02 16:25:54 546dd17e26714c82cc00001c Things…\n6 16 2014-11-20 11:34:48 2016-04-02 16:25:54 546dd17e26714c82cc00001c Things…\n# ℹ 12 more variables: headline <chr>, lede <chr>, slug <chr>,\n# eyecatcher_id <chr>, impressions <dbl>, clicks <dbl>, significance <dbl>,\n# first_place <lgl>, winner <lgl>, share_text <chr>, square <chr>,\n# test_week <dbl>\n\n\nIt is also useful to look at the 
documentation for the dataset. This describes the structure of the dataset, which is that there are packages within tests. A package is a collection of headlines and images that were shown randomly to different visitors to the website, as part of a test. A test can include many packages. Each row in the dataset is a package and the test that it is part of is specified by the “clickability_test_id” column.\nThere are many variables. We will focus on:\n\n“created_at”;\n“clickability_test_id”, so that we can create comparison groups;\n“headline”;\n“impressions”, which is the number of people that saw the package; and\n“clicks” which is the number of clicks on that package.\n\nWithin each batch of tests, we are interested in the effect of the varied headlines on impressions and clicks.\n\nupworthy_restricted <-\n upworthy |>\n select(\n created_at, clickability_test_id, headline, impressions, clicks\n )\n\n\nhead(upworthy_restricted)\n\n# A tibble: 6 × 5\n created_at clickability_test_id headline impressions clicks\n <dttm> <chr> <chr> <dbl> <dbl>\n1 2014-11-20 11:33:26 546dd17e26714c82cc00001c Let’s See … H… 3118 8\n2 2014-11-20 15:00:01 546e01d626714c6c4400004e People Sent T… 4587 130\n3 2014-11-20 11:33:51 546dd17e26714c82cc00001c $3 Million Is… 3017 19\n4 2014-11-20 11:34:12 546dd17e26714c82cc00001c The Fact That… 2974 26\n5 2014-11-20 11:34:33 546dd17e26714c82cc00001c Reason #351 T… 3050 10\n6 2014-11-20 11:34:48 546dd17e26714c82cc00001c I Was Already… 3061 20\n\n\nWe will focus on the text contained in headlines, and look at whether headlines that asked a question got more clicks than those that did not. We want to remove the effect of different images and so will focus on those tests that have the same image. To identify whether a headline asks a question, we search for a question mark. 
Although there are more complicated constructions that we could use, this is enough to get started.\n\nupworthy_restricted <-\n upworthy_restricted |>\n mutate(\n asks_question =\n str_detect(string = headline, pattern = \"\\\\?\")\n )\n\nupworthy_restricted |>\n count(asks_question)\n\n# A tibble: 2 × 2\n asks_question n\n <lgl> <int>\n1 FALSE 89559\n2 TRUE 15992\n\n\nFor every test, and for every picture, we want to know whether asking a question affected the number of clicks.\n\nquestion_or_not <-\n upworthy_restricted |>\n summarise(\n ave_clicks = mean(clicks),\n .by = c(clickability_test_id, asks_question)\n ) \n\nquestion_or_not |>\n pivot_wider(names_from = asks_question,\n values_from = ave_clicks,\n names_prefix = \"ave_clicks_\") |>\n drop_na(ave_clicks_FALSE, ave_clicks_TRUE) |>\n mutate(difference_in_clicks = ave_clicks_TRUE - ave_clicks_FALSE) |> \n summarise(average_difference = mean(difference_in_clicks))\n\n# A tibble: 1 × 1\n average_difference\n <dbl>\n1 -4.16\n\n\nWe could also consider a cross-tab (Table 8.4).\n\nquestion_or_not |> \n summarise(mean = mean(ave_clicks),\n .by = asks_question) |> \n kable(\n col.names = c(\"Asks a question?\", \"Mean clicks\"),\n digits = 0\n )\n\n\n\nTable 8.4: Difference between the average number of clicks\n\n\n\n\n\n\nAsks a question?\nMean clicks\n\n\n\n\nTRUE\n45\n\n\nFALSE\n57\n\n\n\n\n\n\n\n\nWe find that in general, having a question in the headline may slightly decrease the number of clicks on a headline, although if there is an effect it does not appear to be very large (Figure 8.4).\n\n\n\n\n\n\n\n\nFigure 8.4: Comparison of the average number of clicks when a headline contains a question mark or not",
"crumbs": [
"Acquisition",
- "8 Hunt data "
+ "8 Experiments and surveys "
]
},
{
"objectID": "08-hunt.html#exercises",
"href": "08-hunt.html#exercises",
- "title": "8 Hunt data",
+ "title": "8 Experiments and surveys",
"section": "8.6 Exercises",
- "text": "8.6 Exercises\n\nPractice\n\n(Plan) Consider the following scenario: A political candidate is interested in how two polling values change over the course of an election campaign: approval rating and vote-share. The two are measured as percentages, and are somewhat correlated. There tends to be large changes when there is a debate between candidates. Please sketch what that dataset could look like and then sketch a graph that you could build to show all observations.\n(Simulate) Please further consider the scenario described and simulate the situation. Please include five tests based on the simulated data. Submit a link to a GitHub Gist that contains your code.\n(Acquire) Please identify and document a possible source of such a dataset.\n(Explore) Please use ggplot2 to build the graph that you sketched using the simulated data. Submit a link to a GitHub Gist that contains your code.\n(Communicate) Please write two paragraphs about what you did.\n\n\n\nQuiz\n\nWhich of the following best describes the fundamental problem of causal inference (pick one)?\n\nWe cannot observe both the treatment and control outcomes for the same individual simultaneously.\nRandomization cannot eliminate all biases in an experiment.\nIt is impossible to establish external validity in any experiment.\nSurveys cannot accurately measure individual preferences.\n\nIn the Neyman-Rubin potential outcomes framework, what is the primary goal when conducting an experiment (pick one)?\n\nTo maximize the sample size for greater statistical power.\nTo estimate the causal effect by comparing treatment and control groups.\nTo ensure all participants receive the treatment at some point.\nTo focus on external validity over internal validity.\n\nBased on Gertler et al. 
(2016), what does the basic impact evaluation formula \\(\\Delta = (Y_i|t=1) - (Y_i|t=0)\\) represent (pick one)?\n\nThe total cost of a program.\nThe difference in outcomes between treatment and comparison groups.\nThe average change in a participant’s salary.\nThe effect of external market forces on outcomes.\n\nWhy is randomization important in experimental design (pick one)?\n\nIt ensures the sample is representative of the population.\nIt eliminates the need for a control group.\nIt helps create treatment and control groups that are similar except for the treatment.\nIt guarantees external validity.\n\nBased on Gertler et al. (2016), what is a common problem when trying to measure the counterfactual (pick one)?\n\nIt is impossible to observe both treatment and non-treatment outcomes for the same individual.\nData for control groups are always inaccurate.\nOnly randomized trials can provide the counterfactual.\nPrograms typically do not have sufficient participants.\n\nBased on Gertler et al. (2016), selection bias occurs when (pick one):\n\nThe program is implemented at a national scale.\nProgram evaluation lacks financial support.\nData collection is incomplete.\nParticipants are not randomly assigned.\n\nWhat is external validity (pick one)?\n\nFindings from an experiment hold in that setting.\nFindings from an experiment that has been repeated many times.\nFindings from an experiment hold outside that setting.\nFindings from an experiment for which code and data are available.\n\nWhat is internal validity (pick one)?\n\nFindings from an experiment hold in that setting.\nFindings from an experiment hold outside that setting.\nFindings from an experiment that has been repeated many times.\nFindings from an experiment for which code and data are available.\n\nBased on Gertler et al. 
(2016), what does internal validity refer to in an impact evaluation (pick one)?\n\nThe ability to generalize findings to other populations.\nThe efficiency of program management.\nThe long-term sustainability of a program.\nThe accuracy of measuring the causal effect of a program.\n\nBased on Gertler et al. (2016), what does external validity refer to in an impact evaluation (pick one)?\n\nThe ability to generalize the results to the eligible population.\nThe effectiveness of a randomized control trial.\nThe administrative costs of a program.\nThe extent to which outcomes reflect policy changes.\n\nPlease write some code for the following dataset that would randomly assign people into one of two groups.\n\n\nnetflix_data <-\n tibble(\n person = c(\"Ian\", \"Ian\", \"Roger\", \"Roger\",\n \"Roger\", \"Patricia\", \"Patricia\", \"Helen\"\n ),\n tv_show = c(\n \"Broadchurch\", \"Duty-Shame\", \"Broadchurch\", \"Duty-Shame\",\n \"Shetland\", \"Broadchurch\", \"Shetland\", \"Duty-Shame\"\n ),\n hours = c(6.8, 8.0, 0.8, 9.2, 3.2, 4.0, 0.2, 10.2)\n )\n\n\nBased on Gertler et al. (2016), a valid comparison group must have all of the following characteristics EXCEPT (pick one):\n\nBe affected directly or indirectly by the program.\nThe same average characteristics as the treatment group.\nHave outcomes that would change the same way as the treatment group.\nReact to the program in a similar way if given the program.\n\nBased on Gertler et al. (2016), before-and-after comparisons are considered counterfeit estimates because (pick one):\n\nThey focus on unimportant metrics.\nThey involve random assignment.\nThey require large data samples.\nThey assume outcomes do not change over time.\n\nBased on Gertler et al. 
(2016), which scenario could ethically allow the use of randomized assignment as a program allocation tool (pick one)?\n\nAll participants are enrolled based on income levels.\nA program has more eligible participants than available spaces.\nEvery eligible participant can be accommodated by the program.\nThe program only serves one specific group.\n\nThe Tuskegee Syphilis Study is an example of a violation of which ethical principle (pick one)?\n\nObtaining informed consent from participants.\nEnsuring statistical power in experimental design.\nMaintaining confidentiality of participant data.\nProviding monetary compensation to participants.\n\nWhat does “equipoise” refer to in the context of clinical trials (pick one)?\n\nThe balance between treatment efficacy and side effects.\nThe ethical requirement of genuine uncertainty about the treatment’s effectiveness.\nThe statistical equilibrium achieved when sample sizes are equal.\nThe state where all participants have equal access to the treatment.\n\nWare (1989, 299) mentions “randomized-consent” and continues that it was “attractive in this setting because a standard approach to informed consent would require that parents of infants near death be approached to give informed consent for an invasive surgical procedure that would then, in some instances, not be administered. 
Those familiar with the agonizing experience of having a child in a neonatal intensive care unit can appreciate that the process of obtaining informed consent would be both frightening and stressful to parents.” To what extent do you agree with this position, especially given, as Ware (1989, 305), mentions “the need to withhold information about the study from parents of infants receiving Conventional Medical Therapy (CMT)”?\nWhich of the following is a key concern when designing survey questions (pick one)?\n\nUsing technical jargon to appear more credible.\nLeading respondents toward a desired answer.\nEnsuring questions are relevant and easily understood by respondents.\nAsking multiple questions at once to save time.\n\nIn the context of experiments, what is a “confounder” (pick one)?\n\nA variable that is intentionally manipulated by the researcher.\nA variable that is not controlled for and may affect the outcome.\nA participant who does not follow the experimental protocol.\nAn error in data collection leading to invalid results.\n\nThe Oregon Health Insurance Experiment primarily aimed to assess the impact of what (pick one)?\n\nIntroducing a new private health insurance plan.\nRandomly providing Medicaid to low-income adults to study health outcomes.\nComparing different medical treatments for chronic illnesses.\nEvaluating the cost-effectiveness of health interventions.\n\nIn survey design, what is the purpose of a pilot study (pick one)?\n\nTo collect preliminary data for publication.\nTo test and refine the survey instrument before full deployment.\nTo increase the sample size for better statistical power.\nTo ensure all respondents understand the study’s hypotheses.\n\nWhy might an A/A test be conducted in the context of A/B testing (pick one)?\n\nTo compare two entirely different treatments.\nTo ensure the randomization process is properly creating comparable groups.\nTo save resources by not implementing a new treatment.\nTo test the effectiveness of 
the control condition.\n\nWhat ethical concern is particularly relevant to A/B testing in industry settings (pick one)?\n\nDifficulty in measuring long-term effects.\nThe high cost of conducting experiments.\nLack of informed consent from users being experimented on.\nEnsuring statistical significance in large datasets.\n\nPretend that you work as a junior analyst for a large consulting firm. Further, pretend that your consulting firm has taken a contract to put together a facial recognition model for a government border security department. Write at least three paragraphs, with examples and references, discussing your thoughts, with regard to ethics, on this matter.\nWhat does the term “average treatment effect” (ATE) refer to (pick one)?\n\nThe difference in outcomes between the treatment and control groups across the entire sample.\nThe effect of treatment on a single individual.\nThe average outcome observed in the control group.\nThe total sum of all treatment effects observed.\n\nIn the context of experiments, what does “blinding” refer to (pick one)?\n\nKeeping the sample size hidden from participants.\nEnsuring participants do not know whether they are receiving the treatment or control.\nRandomly assigning treatments without recording the assignments.\nUsing complex statistical methods to analyze data.\n\nWhat is a primary reason for doing simulation before analyzing real data in experiments (pick one)?\n\nSimulation is more accurate than real data analyses.\nSimulation helps understand expected outcomes and potential errors in analysis.\nSimulation requires less computational power.\nSimulation eliminates the need for actual data collection.\n\nWhich statement best captures the concept of “selection bias” (pick one)?\n\nParticipants drop out of a study at random.\nThe sample accurately represents the target population.\nThe method of selecting participants causes the sample to be unrepresentative.\nAll variables are controlled except for the treatment 
variable.\n\nPlease redo the Upworthy analysis, but for “!” instead of “?”. What is the difference in clicks (pick one)?\n\n-8.3\n-7.2\n-4.5\n-5.6\n\nAs described by Letterman (2021), which sampling methodology was used to increase the likelihood of including respondents from smaller religious groups without introducing bias (pick one)?\n\nQuota sampling.\nSnowball sampling.\nComposite measure of size.\nRandom digit dialing.\n\nAs described by Letterman (2021), how did the researchers ensure that their survey was ethically conducted (pick one)?\n\nThey provided financial incentives for participation.\nThey obtained approval from an institutional research review board (IRB) within India.\nThey only surveyed individuals who volunteered.\nThey anonymized data by not collecting any demographic information.\n\nBased on Stantcheva (2023), what is coverage error in survey sampling (pick one)?\n\nThe difference between the target population and the sample frame.\nThe difference between the planned sample and the actual respondents.\nErrors due to respondent’s inattentiveness.\nBias from oversampling minorities.\n\nBased on Stantcheva (2023), what is the moderacy response bias (pick one)?\n\nThe tendency to choose extreme values on a scale.\nBias introduced by question order.\nThe tendency to choose middle options regardless of question content.\nThe tendency to agree with the surveyor’s expected answer.\n\nBased on Stantcheva (2023), which of the following is a way to minimize social desirability bias in online surveys (pick one)?\n\nMaking the respondent’s identity public.\nOffering high monetary rewards.\nProviding reassurances about the confidentiality of responses.\nKeeping survey questions long and complex.\n\nBased on Stantcheva (2023), what does response order bias refer to (pick one)?\n\nRespondents systematically choosing extreme values.\nRespondents skipping sensitive questions.\nRespondents failing to understand the question.\nRespondents choosing answers based 
on the order they are presented.\n\nBased on Stantcheva (2023), while managing a survey, you should do everything APART from (pick one)?\n\nSoft-launch the survey.\nMonitor the survey.\nTest statistical hypotheses.\nCheck the data.\n\nA common approach to minimizing question order effect is to randomize the order of questions. To what extent do you think this is effective?\nBased on Stantcheva (2023), what is a good practice for recruiting respondents in online surveys (pick one)?\n\nRevealing the survey’s topic in the invitation email.\nEmphasizing the length of the survey to increase engagement.\nProviding minimal information about the survey’s purpose initially.\nOffering the highest possible monetary incentives.\n\nBased on Stantcheva (2023), what does attrition in surveys refer to (pick one)?\n\nThe rate at which respondents drop out before completing the survey.\nThe total number of people who received the invitation.\nThe accuracy of the data collected.\nThe differences between respondents and nonrespondents.\n\n\n\n\nActivity\nPlease consider the Special Virtual Issue on Nonresponse Rates and Nonresponse Adjustments of the Journal of Survey Statistics and Methodology. Focus on one aspect of the editorial, and with reference to relevant literature, please discuss it in at least two pages. Use Quarto, and include an appropriate title, author, date, link to a GitHub repo, and citations. Submit a PDF.\n\n\nPaper\nAt about this point the Howrah Paper from Online Appendix E would be appropriate.\n\n\n\n\nAbadie, Alberto, Susan Athey, Guido Imbens, and Jeffrey Wooldridge. 2017. “When Should You Adjust Standard Errors for Clustering?” Working Paper 24003. Working Paper Series. National Bureau of Economic Research. https://doi.org/10.3386/w24003.\n\n\nAcemoglu, Daron, Simon Johnson, and James Robinson. 2001. “The Colonial Origins of Comparative Development: An Empirical Investigation.” American Economic Review 91 (5): 1369–1401. 
https://doi.org/10.1257/aer.91.5.1369.\n\n\nAkerlof, George. 1970. “The Market for ‘Lemons’: Quality Uncertainty and the Market Mechanism.” The Quarterly Journal of Economics 84 (3): 488–500. https://doi.org/10.2307/1879431.\n\n\nAlsan, Marcella, and Amy Finkelstein. 2021. “Beyond Causality: Additional Benefits of Randomized Controlled Trials for Improving Health Care Delivery.” The Milbank Quarterly 99 (4): 864–81. https://doi.org/10.1111/1468-0009.12521.\n\n\nAlsan, Marcella, and Marianne Wanamaker. 2018. “Tuskegee and the Health of Black Men.” The Quarterly Journal of Economics 133 (1): 407–55. https://doi.org/10.1093/qje/qjx029.\n\n\nAngrist, Joshua, and Jörn-Steffen Pischke. 2010. “The Credibility Revolution in Empirical Economics: How Better Research Design Is Taking the Con Out of Econometrics.” Journal of Economic Perspectives 24 (2): 3–30. https://doi.org/10.1257/jep.24.2.3.\n\n\nAprameya, Lavanya. 2020. “Improving Duolingo, One Experiment at a Time.” Duolingo Blog, January. https://blog.duolingo.com/improving-duolingo-one-experiment-at-a-time/.\n\n\nAthey, Susan, and Guido Imbens. 2017a. “The Econometrics of Randomized Experiments.” In Handbook of Field Experiments, 73–140. Elsevier. https://doi.org/10.1016/bs.hefe.2016.10.003.\n\n\n———. 2017b. “The State of Applied Econometrics: Causality and Policy Evaluation.” Journal of Economic Perspectives 31 (2): 3–32. https://doi.org/10.1257/jep.31.2.3.\n\n\nBanerjee, Abhijit, and Esther Duflo. 2011. Poor Economics: A Radical Rethinking of the Way to Fight Global Poverty. New York: PublicAffairs.\n\n\nBanerjee, Abhijit, Esther Duflo, Rachel Glennerster, and Cynthia Kinnan. 2015. “The Miracle of Microfinance? Evidence from a Randomized Evaluation.” American Economic Journal: Applied Economics 7 (1): 22–53. https://doi.org/10.1257/app.20130533.\n\n\nBerry, Donald. 1989. “Comment: Ethics and ECMO.” Statistical Science 4 (4): 306–10. 
https://www.jstor.org/stable/2245830.\n\n\nBlair, Ed, Seymour Sudman, Norman M Bradburn, and Carol Stocking. 1977. “How to Ask Questions about Drinking and Sex: Response Effects in Measuring Consumer Behavior.” Journal of Marketing Research 14 (3): 316–21. https://doi.org/10.2307/3150769.\n\n\nBosch, Oriol, and Melanie Revilla. 2022. “When survey science met web tracking: Presenting an error framework for metered data.” Journal of the Royal Statistical Society: Series A (Statistics in Society), November, 1–29. https://doi.org/10.1111/rssa.12956.\n\n\nBouguen, Adrien, Yue Huang, Michael Kremer, and Edward Miguel. 2019. “Using Randomized Controlled Trials to Estimate Long-Run Impacts in Development Economics.” Annual Review of Economics 11 (1): 523–61. https://doi.org/10.1146/annurev-economics-080218-030333.\n\n\nBrandt, Allan. 1978. “Racism and Research: The Case of the Tuskegee Syphilis Study.” Hastings Center Report, 21–29. https://doi.org/10.2307/3561468.\n\n\nBrook, Robert, John Ware, William Rogers, Emmett Keeler, Allyson Ross Davies, Cathy Sherbourne, George Goldberg, Kathleen Lohr, Patricia Camp, and Joseph Newhouse. 1984. “The Effect of Coinsurance on the Health of Adults: Results from the RAND Health Insurance Experiment.” https://www.rand.org/pubs/reports/R3055.html.\n\n\nChen, Weijun, Yan Qi, Yuwen Zhang, Christina Brown, Akos Lada, and Harivardan Jayaraman. 2022. “Notifications: Why Less Is More,” December. https://medium.com/@AnalyticsAtMeta/notifications-why-less-is-more-how-facebook-has-been-increasing-both-user-satisfaction-and-app-9463f7325e7d.\n\n\nChristian, Brian. 2012. “The A/B Test: Inside the Technology That’s Changing the Rules of Business.” Wired, April. https://www.wired.com/2012/04/ff-abtesting/.\n\n\nCohn, Alain. 2019. “Data and code for: Civic Honesty Around the Globe.” Harvard Dataverse. https://doi.org/10.7910/dvn/ykbodn.\n\n\nCohn, Alain, Michel André Maréchal, David Tannenbaum, and Christian Lukas Zünd. 2019a. 
“Civic Honesty Around the Globe.” Science 365 (6448): 70–73. https://doi.org/10.1126/science.aau8712.\n\n\n———. 2019b. “Supplementary Materials for: Civic Honesty Around the Globe.” Science 365 (6448): 70–73.\n\n\nCunningham, Scott. 2021. Causal Inference: The Mixtape. 1st ed. New Haven: Yale Press. https://mixtape.scunning.com.\n\n\nDeaton, Angus. 2010. “Instruments, Randomization, and Learning about Development.” Journal of Economic Literature 48 (2): 424–55. https://doi.org/10.1257/jel.48.2.424.\n\n\nDillman, Don, Jolene Smyth, and Leah Christian. (1978) 2014. Internet, Phone, Mail, and Mixed-Mode Surveys: The Tailored Design Method. 4th ed. Wiley.\n\n\nDruckman, James, and Donald Green. 2021. “A New Era of Experimental Political Science.” In Advances in Experimental Political Science, 1–16. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781108777919.002.\n\n\nDuflo, Esther. 2020. “Field Experiments and the Practice of Policy.” American Economic Review 110 (7): 1952–73. https://doi.org/10.1257/aer.110.7.1952.\n\n\nEdelman, Murray, Liberty Vittert, and Xiao-Li Meng. 2021. “An Interview with Murray Edelman on the History of the Exit Poll.” Harvard Data Science Review 3 (1). https://doi.org/10.1162/99608f92.3a25cd24.\n\n\nEdwards, Jonathan. 2017. “PACE team response shows a disregard for the principles of science.” Journal of Health Psychology 22 (9): 1155–58. https://doi.org/10.1177/1359105317700886.\n\n\nElliott, Michael, Brady West, Xinyu Zhang, and Stephanie Coffey. 2022. “The Anchoring Method: Estimation of Interviewer Effects in the Absence of Interpenetrated Sample Assignment.” Survey Methodology 48 (1): 25–48. http://www.statcan.gc.ca/pub/12-001-x/2022001/article/00005-eng.htm.\n\n\nElson, Malte. 2018. “Question Wording and Item Formulation.” https://doi.org/10.31234/osf.io/e4ktc.\n\n\nEnns, Peter, and Jake Rothschild. 2022. “Do You Know Where Your Survey Data Come From?” May. 
https://medium.com/3streams/surveys-3ec95995dde2.\n\n\nFinkelstein, Amy, Sarah Taubman, Bill Wright, Mira Bernstein, Jonathan Gruber, Joseph Newhouse, Heidi Allen, Katherine Baicker, and Oregon Health Study Group. 2012. “The Oregon Health Insurance Experiment: Evidence from the First Year.” The Quarterly Journal of Economics 127 (3): 1057–1106. https://doi.org/10.1093/qje/qjs020.\n\n\nFisher, Ronald. (1935) 1949. The Design of Experiments. 5th ed. London: Oliver; Boyd.\n\n\nFitts, Alexis Sobel. 2014. “The King of Content: How Upworthy Aims to Alter the Web, and Could End up Altering the World.” Columbia Journalism Review 53: 34–38. https://archives.cjr.org/feature/the%5Fking%5Fof%5Fcontent.php.\n\n\nFry, Hannah. 2020. “Big Tech Is Testing You.” The New Yorker, February, 61–65. https://www.newyorker.com/magazine/2020/03/02/big-tech-is-testing-you.\n\n\nGertler, Paul, Sebastian Martinez, Patrick Premand, Laura Rawlings, and Christel Vermeersch. 2016. Impact Evaluation in Practice. 2nd ed. The World Bank. https://doi.org/10.1596/978-1-4648-0779-4.\n\n\nGordon, Brett, Robert Moakler, and Florian Zettelmeyer. 2022. “Close Enough? A Large-Scale Exploration of Non-Experimental Approaches to Advertising Measurement.” Marketing Science, November. https://doi.org/10.1287/mksc.2022.1413.\n\n\nGordon, Brett, Florian Zettelmeyer, Neha Bhargava, and Dan Chapsky. 2019. “A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook.” Marketing Science 38 (2): 193–225. https://doi.org/10.1287/mksc.2018.1135.\n\n\nGroves, Robert. 2011. “Three Eras of Survey Research.” Public Opinion Quarterly 75 (5): 861–71. https://doi.org/10.1093/poq/nfr057.\n\n\nHeller, Jean. 2022. “AP Exposes the Tuskegee Syphilis Study: The 50th Anniversary.” AP, July. https://apnews.com/article/tuskegee-study-ap-story-investigation-syphilis-53403657e77d76f52df6c2e2892788c9.\n\n\nHill, Austin Bradford. 1965. 
“The Environment and Disease: Association or Causation?” Proceedings of the Royal Society of Medicine 58 (5): 295–300.\n\n\nHolland, Paul. 1986. “Statistics and Causal Inference.” Journal of the American Statistical Association 81 (396): 945–60. https://doi.org/10.2307/2289064.\n\n\nHolliday, Derek, Tyler Reny, Alex Rossell Hayes, Aaron Rudkin, Chris Tausanovitch, and Lynn Vavreck. 2021. “Democracy Fund + UCLA Nationscape Methodology and Representativeness Assessment.”\n\n\nKohavi, Ron, Alex Deng, Brian Frasca, Roger Longbotham, Toby Walker, and Ya Xu. 2012. “Trustworthy Online Controlled Experiments.” In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD 12, 1st ed. ACM Press. https://doi.org/10.1145/2339530.2339653.\n\n\nKohavi, Ron, Diane Tang, and Ya Xu. 2020. Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press.\n\n\nLarmarange, Joseph. 2023. labelled: Manipulating Labelled Data. https://CRAN.R-project.org/package=labelled.\n\n\nLetterman, Clark. 2021. “Q&A: How Pew Research Center surveyed nearly 30,000 people in India,” July. https://medium.com/pew-research-center-decoded/q-a-how-pew-research-center-surveyed-nearly-30-000-people-in-india-7c778f6d650e.\n\n\nLevay, Kevin, Jeremy Freese, and James Druckman. 2016. “The Demographic and Political Composition of Mechanical Turk Samples.” SAGE Open 6 (1): 1–17. https://doi.org/10.1177/2158244016636433.\n\n\nLichand, Guilherme, and Sharon Wolf. 2022. “Measuring Child Labor: Whom Should Be Asked, and Why It Matters,” March. https://doi.org/10.21203/rs.3.rs-1474562/v1.\n\n\nLight, Richard, Judith Singer, and John Willett. 1990. By Design: Planning Research on Higher Education. 1st ed. Cambridge: Harvard University Press.\n\n\nMatias, Nathan, Kevin Munger, Marianne Aubin Le Quere, and Charles Ebersole. 2021. “The Upworthy Research Archive, a time series of 32,487 experiments in U.S. 
media.” Scientific Data 8 (1): 1–8. https://doi.org/10.1038/s41597-021-00934-7.\n\n\nMitrovski, Alen, Xiaoyan Yang, and Matthew Wankiewicz. 2020. “Joe Biden Projected to Win Popular Vote in 2020 US Election.” https://github.com/matthewwankiewicz/US_election_forecast.\n\n\nMorange, Michel. 2016. A History of Biology. New Jersey: Princeton University Press.\n\n\nObermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. 2019. “Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations.” Science 366 (6464): 447–53. https://doi.org/10.1126/science.aax2342.\n\n\nR Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.\n\n\nSalganik, Matthew. 2018. Bit by Bit: Social Research in the Digital Age. New Jersey: Princeton University Press.\n\n\nSmith, Matthew. 2018. “Should Milk Go in a Cup of Tea First or Last?” July. https://yougov.co.uk/topics/consumer/articles-reports/2018/07/30/should-milk-go-cup-tea-first-or-last.\n\n\nStantcheva, Stefanie. 2023. “How to Run Surveys: A Guide to Creating Your Own Identifying Variation and Revealing the Invisible.” Annual Review of Economics 15 (1): 205–34. https://doi.org/10.1146/annurev-economics-091622-010157.\n\n\nStolberg, Michael. 2006. “Inventing the Randomized Double-Blind Trial: The Nuremberg Salt Test of 1835.” Journal of the Royal Society of Medicine 99 (12): 642–43. https://doi.org/10.1177/014107680609901216.\n\n\nStolley, Paul. 1991. “When Genius Errs: R. A. Fisher and the Lung Cancer Controversy.” American Journal of Epidemiology 133 (5): 416–25. https://doi.org/10.1093/oxfordjournals.aje.a115904.\n\n\nSwain, Larry. 1985. “Basic Principles of Questionnaire Design.” Survey Methodology 11 (2): 161–70.\n\n\nTaddy, Matt. 2019. Business Data Science. 1st ed. McGraw Hill.\n\n\nTausanovitch, Chris, and Lynn Vavreck. 2021. 
“Democracy Fund + UCLA Nationscape Project.” https://www.voterstudygroup.org/data/nationscape.\n\n\nThe White House. 2023. “Recommendations on the Best Practices for the Collection of Sexual Orientation and Gender Identity Data on Federal Statistical Survey,” January. https://www.whitehouse.gov/wp-content/uploads/2023/01/SOGI-Best-Practices.pdf.\n\n\nTourangeau, Roger, Lance Rips, and Kenneth Rasinski. 2000. The Psychology of Survey Response. 1st ed. Cambridge University Press. https://doi.org/10.1017/CBO9780511819322.003.\n\n\nUrban, Steve, Rangarajan Sreenivasan, and Vineet Kannan. 2016. “It’s All A/Bout Testing: The Netflix Experimentation Platform.” Netflix Technology Blog, April. https://netflixtechblog.com/its-all-a-bout-testing-the-netflix-experimentation-platform-4e1ca458c15.\n\n\nVavreck, Lynn, and Chris Tausanovitch. 2021. “Democracy Fund + UCLA Nationscape Project User Guide.” https://www.voterstudygroup.org/data/nationscape.\n\n\nWare, James. 1989. “Investigating Therapies of Potentially Great Benefit: ECMO.” Statistical Science 4 (4): 298–306. https://doi.org/10.1214/ss/1177012384.\n\n\nWei, LJ, and S Durham. 1978. “The Randomized Play-the-Winner Rule in Medical Trials.” Journal of the American Statistical Association 73 (364): 840–43. https://doi.org/10.2307/2286290.\n\n\nWickham, Hadley, Mara Averick, Jenny Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.\n\n\nWickham, Hadley, Evan Miller, and Danny Smith. 2023. haven: Import and Export “SPSS” “Stata” and “SAS” Files. https://CRAN.R-project.org/package=haven.\n\n\nXie, Yihui. 2023. knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.\n\n\nXu, Ya. 2020. “Causal Inference Challenges in Industry: A Perspective from Experiences at LinkedIn.” YouTube, July. https://youtu.be/OoKsLAvyIYA.\n\n\nYoshioka, Alan. 
1998. “Use of Randomisation in the Medical Research Council’s Clinical Trial of Streptomycin in Pulmonary Tuberculosis in the 1940s.” BMJ 317 (7167): 1220–23. https://doi.org/10.1136/bmj.317.7167.1220.",
+ "text": "8.6 Exercises\n\nPractice\n\n(Plan) Consider the following scenario: A political candidate is interested in how two polling values change over the course of an election campaign: approval rating and vote-share. The two are measured as percentages, and are somewhat correlated. There tend to be large changes when there is a debate between candidates. Please sketch what a dataset could look like, and then sketch a graph that you could build to show all observations.\n(Simulate) Please simulate the situation, including the relationship, and then write tests for the simulated dataset.\n(Acquire) Please obtain some actual data, similar to the scenario, and add a script updating the simulated tests to these actual data.\n(Explore) Build graphs and tables using the real data.\n(Communicate) Write a short paper using Quarto and submit a link to a high-quality GitHub repo.\n\n\n\nQuiz\n\nWhich of the following best describes the fundamental problem of causal inference (pick one)?\n\nRandomization cannot eliminate all biases in an experiment.\nSurveys cannot accurately measure individual preferences.\nWe cannot observe both the treatment and control outcomes for the same individual simultaneously.\nIt is impossible to establish external validity in any experiment.\n\nIn the Neyman-Rubin potential outcomes framework, what is the primary goal when conducting an experiment (pick one)?\n\nTo estimate the causal effect by comparing treatment and control groups.\nTo focus on external validity over internal validity.\nTo maximize the sample size for greater statistical power.\nTo ensure all participants receive the treatment at some point.\n\nFrom Gertler et al. 
(2016), what does the basic impact evaluation formula \\(\\Delta = (Y_i|t=1) - (Y_i|t=0)\\) represent (pick one)?\n\nThe difference in outcomes between treatment and comparison groups.\nThe average change in a participant’s salary.\nThe effect of external market forces on outcomes.\nThe total cost of a program.\n\nWhy is randomization important in experimental design (pick one)?\n\nIt ensures the sample is representative of the population.\nIt eliminates the need for a control group.\nIt guarantees external validity.\nIt helps create treatment and control groups that are similar except for the treatment.\n\nFrom Gertler et al. (2016), what is a common problem when trying to measure the counterfactual (pick one)?\n\nOnly randomized trials can provide the counterfactual.\nData for control groups are always inaccurate.\nIt is impossible to observe both treatment and non-treatment outcomes for the same individual.\nPrograms typically do not have sufficient participants.\n\nFrom Gertler et al. (2016), when does selection bias happen (pick one)?\n\nProgram evaluation lacks financial support.\nThe program is implemented at a national scale.\nParticipants are not randomly assigned.\nData collection is incomplete.\n\nWhat is external validity (pick one)?\n\nFindings from an experiment that has been repeated many times.\nFindings from an experiment hold in that setting.\nFindings from an experiment for which code and data are available.\nFindings from an experiment hold outside that setting.\n\nWhat is internal validity (pick one)?\n\nFindings from an experiment for which code and data are available.\nFindings from an experiment that has been repeated many times.\nFindings from an experiment hold in that setting.\nFindings from an experiment hold outside that setting.\n\nFrom Gertler et al. 
(2016), what does internal validity refer to in an impact evaluation (pick one)?\n\nThe accuracy of measuring the causal effect of a program.\nThe ability to generalize findings to other populations.\nThe efficiency of program management.\nThe long-term sustainability of a program.\n\nFrom Gertler et al. (2016), what does external validity refer to in an impact evaluation (pick one)?\n\nThe administrative costs of a program.\nThe ability to generalize the results to the eligible population.\nThe effectiveness of a randomized control trial.\nThe extent to which outcomes reflect policy changes.\n\nPlease write some code for the following dataset that would randomly assign people into one of two groups.\n\n\nnetflix_data <-\n tibble(\n person = c(\"Ian\", \"Ian\", \"Roger\", \"Roger\",\n \"Roger\", \"Patricia\", \"Patricia\", \"Helen\"\n ),\n tv_show = c(\n \"Broadchurch\", \"Duty-Shame\", \"Broadchurch\", \"Duty-Shame\",\n \"Shetland\", \"Broadchurch\", \"Shetland\", \"Duty-Shame\"\n ),\n hours = c(6.8, 8.0, 0.8, 9.2, 3.2, 4.0, 0.2, 10.2)\n )\n\n\nFrom Gertler et al. (2016), a valid comparison group must have all of the following characteristics EXCEPT (pick one)?\n\nThe same average characteristics as the treatment group.\nHave outcomes that would change the same way as the treatment group.\nBe affected directly or indirectly by the program.\nReact to the program in a similar way if given the program.\n\nFrom Gertler et al. (2016), why are before-and-after comparisons considered counterfeit estimates (pick one)?\n\nThey involve random assignment.\nThey focus on unimportant metrics.\nThey require large data samples.\nThey assume outcomes do not change over time.\n\nFrom Gertler et al. 
(2016), which scenario could ethically allow the use of randomized assignment as a program allocation tool (pick one)?\n\nAll participants are enrolled based on income levels.\nEvery eligible participant can be accommodated by the program.\nThe program only serves one specific group.\nA program has more eligible participants than available spaces.\n\nThe Tuskegee Syphilis Study is an example of a violation of which ethical principle (pick one)?\n\nMaintaining confidentiality of participant data.\nEnsuring statistical power in experimental design.\nObtaining informed consent from participants.\nProviding monetary compensation to participants.\n\nWhat does “equipoise” refer to in the context of clinical trials (pick one)?\n\nThe statistical equilibrium achieved when sample sizes are equal.\nThe state where all participants have equal access to the treatment.\nThe balance between treatment efficacy and side effects.\nThe ethical requirement of genuine uncertainty about the treatment’s effectiveness.\n\nWare (1989, 299) mentions “randomized-consent” and continues that it was “attractive in this setting because a standard approach to informed consent would require that parents of infants near death be approached to give informed consent for an invasive surgical procedure that would then, in some instances, not be administered. 
Those familiar with the agonizing experience of having a child in a neonatal intensive care unit can appreciate that the process of obtaining informed consent would be both frightening and stressful to parents.” To what extent do you agree with this position, especially given, as Ware (1989, 305), mentions “the need to withhold information about the study from parents of infants receiving Conventional Medical Therapy (CMT)”?\nWhich of the following is a key concern when designing survey questions (pick one)?\n\nAsking multiple questions at once to save time.\nUsing technical jargon to appear more credible.\nEnsuring questions are relevant and easily understood by respondents.\nLeading respondents toward a desired answer.\n\nIn the context of experiments, what is a “confounder” (pick one)?\n\nA participant who does not follow the experimental protocol.\nA variable that is intentionally manipulated by the researcher.\nA variable that is not controlled for and may affect the outcome.\nAn error in data collection leading to invalid results.\n\nThe Oregon Health Insurance Experiment primarily aimed to assess the impact of what (pick one)?\n\nRandomly providing Medicaid to low-income adults to study health outcomes.\nIntroducing a new private health insurance plan.\nEvaluating the cost-effectiveness of health interventions.\nComparing different medical treatments for chronic illnesses.\n\nIn survey design, what is the purpose of a pilot study (pick one)?\n\nTo ensure all respondents understand the study’s hypotheses.\nTo test and refine the survey instrument before full deployment.\nTo increase the sample size for better statistical power.\nTo collect preliminary data for publication.\n\nWhy might an A/A test be conducted in the context of A/B testing (pick one)?\n\nTo test the effectiveness of the control condition.\nTo ensure the randomization process is properly creating comparable groups.\nTo compare two entirely different treatments.\nTo save resources by not 
implementing a new treatment.\n\nWhat ethical concern is particularly relevant to A/B testing in industry settings (pick one)?\n\nThe high cost of conducting experiments.\nDifficulty in measuring long-term effects.\nLack of informed consent from users being experimented on.\nEnsuring statistical significance in large datasets.\n\nPretend that you work as a junior analyst for a large consulting firm. Further, pretend that your consulting firm has taken a contract to put together a facial recognition model for a government border security department. Write at least three paragraphs, with examples and references, discussing your thoughts, with regard to ethics, on this matter.\nWhat does the term “average treatment effect” (ATE) refer to (pick one)?\n\nThe effect of treatment on a single individual.\nThe average outcome observed in the control group.\nThe difference in outcomes between the treatment and control groups across the entire sample.\nThe total sum of all treatment effects observed.\n\nIn the context of experiments, what does “blinding” refer to (pick one)?\n\nUsing complex statistical methods to analyze data.\nEnsuring participants do not know whether they are receiving the treatment or control.\nKeeping the sample size hidden from participants.\nRandomly assigning treatments without recording the assignments.\n\nWhat is a primary reason for doing simulation before analyzing real data in experiments (pick one)?\n\nSimulation is more accurate than real data analyses.\nSimulation requires less computational power.\nSimulation eliminates the need for actual data collection.\nSimulation helps understand expected outcomes and potential errors in analysis.\n\nWhich statement best captures the concept of “selection bias” (pick one)?\n\nThe sample accurately represents the target population.\nAll variables are controlled except for the treatment variable.\nParticipants drop out of a study at random.\nThe method of selecting participants causes the sample to be 
unrepresentative.\n\nPlease redo the Upworthy analysis, but for “!” instead of “?”. What is the difference in clicks (pick one)?\n\n-8.3\n-7.2\n-5.6\n-4.5\n\n\nAs described by Letterman (2021), which sampling methodology was used to increase the likelihood of including respondents from smaller religious groups without introducing bias (pick one)?\n\nSnowball sampling.\nQuota sampling.\nRandom digit dialing.\nComposite measure of size.\n\nAs described by Letterman (2021), how did the researchers ensure that their survey was ethically conducted (pick one)?\n\nThey obtained approval from an institutional research review board (IRB) within India.\nThey only surveyed individuals who volunteered.\nThey anonymized data by not collecting any demographic information.\nThey provided financial incentives for participation.\n\nFrom Stantcheva (2023), what is coverage error in survey sampling (pick one)?\n\nErrors due to respondent’s inattentiveness.\nThe difference between the target population and the sample frame.\nBias from oversampling minorities.\nThe difference between the planned sample and the actual respondents.\n\nFrom Stantcheva (2023), what is the moderacy response bias (pick one)?\n\nThe tendency to choose middle options regardless of question content.\nBias introduced by question order.\nThe tendency to choose extreme values on a scale.\nThe tendency to agree with the surveyor’s expected answer.\n\nFrom Stantcheva (2023), which of the following is a way to minimize social desirability bias in online surveys (pick one)?\n\nOffering high monetary rewards.\nProviding reassurances about the confidentiality of responses.\nMaking the respondent’s identity public.\nKeeping survey questions long and complex.\n\nFrom Stantcheva (2023), what does response order bias refer to (pick one)?\n\nRespondents skipping sensitive questions.\nRespondents systematically choosing extreme values.\nRespondents failing to understand the question.\nRespondents choosing answers based on 
their order.\n\nFrom Stantcheva (2023), while managing a survey, you should do everything APART from (pick one)?\n\nCheck the data.\nMonitor the survey.\nTest statistical hypotheses.\nSoft-launch the survey.\n\nA common approach to minimizing question order effect is to randomize the order of questions. To what extent do you think this is effective?\nFrom Stantcheva (2023), what is a good practice for recruiting respondents in online surveys (pick one)?\n\nOffering the highest possible monetary incentives.\nProviding minimal information about the survey’s purpose initially.\nRevealing the survey’s topic in the invitation email.\nEmphasizing the length of the survey to increase engagement.\n\nFrom Stantcheva (2023), what does attrition in surveys refer to (pick one)?\n\nThe total number of people who received the invitation.\nThe accuracy of the data collected.\nThe differences between respondents and nonrespondents.\nThe rate at which respondents drop out before completing the survey.\n\n\n\n\nActivity\nPlease consider the Special Virtual Issue on Nonresponse Rates and Nonresponse Adjustments of the Journal of Survey Statistics and Methodology. Focus on one aspect of the editorial, and with reference to relevant literature, please discuss it in at least two pages. Use Quarto, and include an appropriate title, author, date, link to a GitHub repo, and citations. Submit a PDF.\n\n\nPaper\nAt about this point the Howrah Paper from Online Appendix E would be appropriate.\n\n\n\n\nAbadie, Alberto, Susan Athey, Guido Imbens, and Jeffrey Wooldridge. 2017. “When Should You Adjust Standard Errors for Clustering?” Working Paper 24003. Working Paper Series. National Bureau of Economic Research. https://doi.org/10.3386/w24003.\n\n\nAcemoglu, Daron, Simon Johnson, and James Robinson. 2001. “The Colonial Origins of Comparative Development: An Empirical Investigation.” American Economic Review 91 (5): 1369–1401. https://doi.org/10.1257/aer.91.5.1369.\n\n\nAkerlof, George. 
1970. “The Market for ‘Lemons’: Quality Uncertainty and the Market Mechanism.” The Quarterly Journal of Economics 84 (3): 488–500. https://doi.org/10.2307/1879431.\n\n\nAlsan, Marcella, and Amy Finkelstein. 2021. “Beyond Causality: Additional Benefits of Randomized Controlled Trials for Improving Health Care Delivery.” The Milbank Quarterly 99 (4): 864–81. https://doi.org/10.1111/1468-0009.12521.\n\n\nAlsan, Marcella, and Marianne Wanamaker. 2018. “Tuskegee and the Health of Black Men.” The Quarterly Journal of Economics 133 (1): 407–55. https://doi.org/10.1093/qje/qjx029.\n\n\nAngrist, Joshua, and Jörn-Steffen Pischke. 2010. “The Credibility Revolution in Empirical Economics: How Better Research Design Is Taking the Con Out of Econometrics.” Journal of Economic Perspectives 24 (2): 3–30. https://doi.org/10.1257/jep.24.2.3.\n\n\nAprameya, Lavanya. 2020. “Improving Duolingo, One Experiment at a Time.” Duolingo Blog, January. https://blog.duolingo.com/improving-duolingo-one-experiment-at-a-time/.\n\n\nAthey, Susan, and Guido Imbens. 2017a. “The Econometrics of Randomized Experiments.” In Handbook of Field Experiments, 73–140. Elsevier. https://doi.org/10.1016/bs.hefe.2016.10.003.\n\n\n———. 2017b. “The State of Applied Econometrics: Causality and Policy Evaluation.” Journal of Economic Perspectives 31 (2): 3–32. https://doi.org/10.1257/jep.31.2.3.\n\n\nBanerjee, Abhijit, and Esther Duflo. 2011. Poor Economics: A Radical Rethinking of the Way to Fight Global Poverty. New York: PublicAffairs.\n\n\nBanerjee, Abhijit, Esther Duflo, Rachel Glennerster, and Cynthia Kinnan. 2015. “The Miracle of Microfinance? Evidence from a Randomized Evaluation.” American Economic Journal: Applied Economics 7 (1): 22–53. https://doi.org/10.1257/app.20130533.\n\n\nBerry, Donald. 1989. “Comment: Ethics and ECMO.” Statistical Science 4 (4): 306–10. https://www.jstor.org/stable/2245830.\n\n\nBlair, Ed, Seymour Sudman, Norman M Bradburn, and Carol Stocking. 1977. 
“How to Ask Questions about Drinking and Sex: Response Effects in Measuring Consumer Behavior.” Journal of Marketing Research 14 (3): 316–21. https://doi.org/10.2307/3150769.\n\n\nBosch, Oriol, and Melanie Revilla. 2022. “When survey science met web tracking: Presenting an error framework for metered data.” Journal of the Royal Statistical Society: Series A (Statistics in Society), November, 1–29. https://doi.org/10.1111/rssa.12956.\n\n\nBouguen, Adrien, Yue Huang, Michael Kremer, and Edward Miguel. 2019. “Using Randomized Controlled Trials to Estimate Long-Run Impacts in Development Economics.” Annual Review of Economics 11 (1): 523–61. https://doi.org/10.1146/annurev-economics-080218-030333.\n\n\nBrandt, Allan. 1978. “Racism and Research: The Case of the Tuskegee Syphilis Study.” Hastings Center Report, 21–29. https://doi.org/10.2307/3561468.\n\n\nBrook, Robert, John Ware, William Rogers, Emmett Keeler, Allyson Ross Davies, Cathy Sherbourne, George Goldberg, Kathleen Lohr, Patricia Camp, and Joseph Newhouse. 1984. “The Effect of Coinsurance on the Health of Adults: Results from the RAND Health Insurance Experiment.” https://www.rand.org/pubs/reports/R3055.html.\n\n\nChen, Weijun, Yan Qi, Yuwen Zhang, Christina Brown, Akos Lada, and Harivardan Jayaraman. 2022. “Notifications: Why Less Is More,” December. https://medium.com/@AnalyticsAtMeta/notifications-why-less-is-more-how-facebook-has-been-increasing-both-user-satisfaction-and-app-9463f7325e7d.\n\n\nChristian, Brian. 2012. “The A/B Test: Inside the Technology That’s Changing the Rules of Business.” Wired, April. https://www.wired.com/2012/04/ff-abtesting/.\n\n\nCohn, Alain. 2019. “Data and code for: Civic Honesty Around the Globe.” Harvard Dataverse. https://doi.org/10.7910/dvn/ykbodn.\n\n\nCohn, Alain, Michel André Maréchal, David Tannenbaum, and Christian Lukas Zünd. 2019a. “Civic Honesty Around the Globe.” Science 365 (6448): 70–73. https://doi.org/10.1126/science.aau8712.\n\n\n———. 2019b. 
“Supplementary Materials for: Civic Honesty Around the Globe.” Science 365 (6448): 70–73.\n\n\nCunningham, Scott. 2021. Causal Inference: The Mixtape. 1st ed. New Haven: Yale Press. https://mixtape.scunning.com.\n\n\nDeaton, Angus. 2010. “Instruments, Randomization, and Learning about Development.” Journal of Economic Literature 48 (2): 424–55. https://doi.org/10.1257/jel.48.2.424.\n\n\nDillman, Don, Jolene Smyth, and Leah Christian. (1978) 2014. Internet, Phone, Mail, and Mixed-Mode Surveys: The Tailored Design Method. 4th ed. Wiley.\n\n\nDruckman, James, and Donald Green. 2021. “A New Era of Experimental Political Science.” In Advances in Experimental Political Science, 1–16. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781108777919.002.\n\n\nDuflo, Esther. 2020. “Field Experiments and the Practice of Policy.” American Economic Review 110 (7): 1952–73. https://doi.org/10.1257/aer.110.7.1952.\n\n\nEdelman, Murray, Liberty Vittert, and Xiao-Li Meng. 2021. “An Interview with Murray Edelman on the History of the Exit Poll.” Harvard Data Science Review 3 (1). https://doi.org/10.1162/99608f92.3a25cd24.\n\n\nEdwards, Jonathan. 2017. “PACE team response shows a disregard for the principles of science.” Journal of Health Psychology 22 (9): 1155–58. https://doi.org/10.1177/1359105317700886.\n\n\nElliott, Michael, Brady West, Xinyu Zhang, and Stephanie Coffey. 2022. “The Anchoring Method: Estimation of Interviewer Effects in the Absence of Interpenetrated Sample Assignment.” Survey Methodology 48 (1): 25–48. http://www.statcan.gc.ca/pub/12-001-x/2022001/article/00005-eng.htm.\n\n\nElson, Malte. 2018. “Question Wording and Item Formulation.” https://doi.org/10.31234/osf.io/e4ktc.\n\n\nEnns, Peter, and Jake Rothschild. 2022. “Do You Know Where Your Survey Data Come From?” May. 
https://medium.com/3streams/surveys-3ec95995dde2.\n\n\nFinkelstein, Amy, Sarah Taubman, Bill Wright, Mira Bernstein, Jonathan Gruber, Joseph Newhouse, Heidi Allen, Katherine Baicker, and Oregon Health Study Group. 2012. “The Oregon Health Insurance Experiment: Evidence from the First Year.” The Quarterly Journal of Economics 127 (3): 1057–1106. https://doi.org/10.1093/qje/qjs020.\n\n\nFisher, Ronald. (1935) 1949. The Design of Experiments. 5th ed. London: Oliver; Boyd.\n\n\nFitts, Alexis Sobel. 2014. “The King of Content: How Upworthy Aims to Alter the Web, and Could End up Altering the World.” Columbia Journalism Review 53: 34–38. https://archives.cjr.org/feature/the%5Fking%5Fof%5Fcontent.php.\n\n\nFry, Hannah. 2020. “Big Tech Is Testing You.” The New Yorker, February, 61–65. https://www.newyorker.com/magazine/2020/03/02/big-tech-is-testing-you.\n\n\nGertler, Paul, Sebastian Martinez, Patrick Premand, Laura Rawlings, and Christel Vermeersch. 2016. Impact Evaluation in Practice. 2nd ed. The World Bank. https://doi.org/10.1596/978-1-4648-0779-4.\n\n\nGordon, Brett, Robert Moakler, and Florian Zettelmeyer. 2022. “Close Enough? A Large-Scale Exploration of Non-Experimental Approaches to Advertising Measurement.” Marketing Science, November. https://doi.org/10.1287/mksc.2022.1413.\n\n\nGordon, Brett, Florian Zettelmeyer, Neha Bhargava, and Dan Chapsky. 2019. “A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook.” Marketing Science 38 (2): 193–225. https://doi.org/10.1287/mksc.2018.1135.\n\n\nGroves, Robert. 2011. “Three Eras of Survey Research.” Public Opinion Quarterly 75 (5): 861–71. https://doi.org/10.1093/poq/nfr057.\n\n\nHeller, Jean. 2022. “AP Exposes the Tuskegee Syphilis Study: The 50th Anniversary.” AP, July. https://apnews.com/article/tuskegee-study-ap-story-investigation-syphilis-53403657e77d76f52df6c2e2892788c9.\n\n\nHill, Austin Bradford. 1965. 
“The Environment and Disease: Association or Causation?” Proceedings of the Royal Society of Medicine 58 (5): 295–300.\n\n\nHolland, Paul. 1986. “Statistics and Causal Inference.” Journal of the American Statistical Association 81 (396): 945–60. https://doi.org/10.2307/2289064.\n\n\nHolliday, Derek, Tyler Reny, Alex Rossell Hayes, Aaron Rudkin, Chris Tausanovitch, and Lynn Vavreck. 2021. “Democracy Fund + UCLA Nationscape Methodology and Representativeness Assessment.”\n\n\nKohavi, Ron, Alex Deng, Brian Frasca, Roger Longbotham, Toby Walker, and Ya Xu. 2012. “Trustworthy Online Controlled Experiments.” In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD 12, 1st ed. ACM Press. https://doi.org/10.1145/2339530.2339653.\n\n\nKohavi, Ron, Diane Tang, and Ya Xu. 2020. Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press.\n\n\nLarmarange, Joseph. 2023. labelled: Manipulating Labelled Data. https://CRAN.R-project.org/package=labelled.\n\n\nLetterman, Clark. 2021. “Q&A: How Pew Research Center surveyed nearly 30,000 people in India,” July. https://medium.com/pew-research-center-decoded/q-a-how-pew-research-center-surveyed-nearly-30-000-people-in-india-7c778f6d650e.\n\n\nLevay, Kevin, Jeremy Freese, and James Druckman. 2016. “The Demographic and Political Composition of Mechanical Turk Samples.” SAGE Open 6 (1): 1–17. https://doi.org/10.1177/2158244016636433.\n\n\nLichand, Guilherme, and Sharon Wolf. 2022. “Measuring Child Labor: Whom Should Be Asked, and Why It Matters,” March. https://doi.org/10.21203/rs.3.rs-1474562/v1.\n\n\nLight, Richard, Judith Singer, and John Willett. 1990. By Design: Planning Research on Higher Education. 1st ed. Cambridge: Harvard University Press.\n\n\nMatias, Nathan, Kevin Munger, Marianne Aubin Le Quere, and Charles Ebersole. 2021. “The Upworthy Research Archive, a time series of 32,487 experiments in U.S. 
media.” Scientific Data 8 (1): 1–8. https://doi.org/10.1038/s41597-021-00934-7.\n\n\nMitrovski, Alen, Xiaoyan Yang, and Matthew Wankiewicz. 2020. “Joe Biden Projected to Win Popular Vote in 2020 US Election.” https://github.com/matthewwankiewicz/US_election_forecast.\n\n\nMorange, Michel. 2016. A History of Biology. New Jersey: Princeton University Press.\n\n\nObermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. 2019. “Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations.” Science 366 (6464): 447–53. https://doi.org/10.1126/science.aax2342.\n\n\nR Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.\n\n\nSalganik, Matthew. 2018. Bit by Bit: Social Research in the Digital Age. New Jersey: Princeton University Press.\n\n\nSmith, Matthew. 2018. “Should Milk Go in a Cup of Tea First or Last?” July. https://yougov.co.uk/topics/consumer/articles-reports/2018/07/30/should-milk-go-cup-tea-first-or-last.\n\n\nStantcheva, Stefanie. 2023. “How to Run Surveys: A Guide to Creating Your Own Identifying Variation and Revealing the Invisible.” Annual Review of Economics 15 (1): 205–34. https://doi.org/10.1146/annurev-economics-091622-010157.\n\n\nStolberg, Michael. 2006. “Inventing the Randomized Double-Blind Trial: The Nuremberg Salt Test of 1835.” Journal of the Royal Society of Medicine 99 (12): 642–43. https://doi.org/10.1177/014107680609901216.\n\n\nStolley, Paul. 1991. “When Genius Errs: R. A. Fisher and the Lung Cancer Controversy.” American Journal of Epidemiology 133 (5): 416–25. https://doi.org/10.1093/oxfordjournals.aje.a115904.\n\n\nSwain, Larry. 1985. “Basic Principles of Questionnaire Design.” Survey Methodology 11 (2): 161–70.\n\n\nTaddy, Matt. 2019. Business Data Science. 1st ed. McGraw Hill.\n\n\nTausanovitch, Chris, and Lynn Vavreck. 2021. 
“Democracy Fund + UCLA Nationscape Project.” https://www.voterstudygroup.org/data/nationscape.\n\n\nThe White House. 2023. “Recommendations on the Best Practices for the Collection of Sexual Orientation and Gender Identity Data on Federal Statistical Survey,” January. https://www.whitehouse.gov/wp-content/uploads/2023/01/SOGI-Best-Practices.pdf.\n\n\nTourangeau, Roger, Lance Rips, and Kenneth Rasinski. 2000. The Psychology of Survey Response. 1st ed. Cambridge University Press. https://doi.org/10.1017/CBO9780511819322.003.\n\n\nUrban, Steve, Rangarajan Sreenivasan, and Vineet Kannan. 2016. “It’s All A/Bout Testing: The Netflix Experimentation Platform.” Netflix Technology Blog, April. https://netflixtechblog.com/its-all-a-bout-testing-the-netflix-experimentation-platform-4e1ca458c15.\n\n\nVavreck, Lynn, and Chris Tausanovitch. 2021. “Democracy Fund + UCLA Nationscape Project User Guide.” https://www.voterstudygroup.org/data/nationscape.\n\n\nWare, James. 1989. “Investigating Therapies of Potentially Great Benefit: ECMO.” Statistical Science 4 (4): 298–306. https://doi.org/10.1214/ss/1177012384.\n\n\nWei, LJ, and S Durham. 1978. “The Randomized Play-the-Winner Rule in Medical Trials.” Journal of the American Statistical Association 73 (364): 840–43. https://doi.org/10.2307/2286290.\n\n\nWickham, Hadley, Mara Averick, Jenny Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.\n\n\nWickham, Hadley, Evan Miller, and Danny Smith. 2023. haven: Import and Export “SPSS” “Stata” and “SAS” Files. https://CRAN.R-project.org/package=haven.\n\n\nXie, Yihui. 2023. knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.\n\n\nXu, Ya. 2020. “Causal Inference Challenges in Industry: A Perspective from Experiences at LinkedIn.” YouTube, July. https://youtu.be/OoKsLAvyIYA.\n\n\nYoshioka, Alan. 
1998. “Use of Randomisation in the Medical Research Council’s Clinical Trial of Streptomycin in Pulmonary Tuberculosis in the 1940s.” BMJ 317 (7167): 1220–23. https://doi.org/10.1136/bmj.317.7167.1220.",
"crumbs": [
"Acquisition",
- "8 Hunt data "
+ "8 Experiments and surveys "
]
},
{
@@ -797,7 +797,7 @@
"href": "09-clean_and_prepare.html#kenyan-census",
"title": "9 Clean and prepare",
"section": "9.7 2019 Kenyan census",
- "text": "9.7 2019 Kenyan census\nAs a final example, let us consider a more extensive situation and gather, clean, and prepare some data from the 2019 Kenyan census. We will focus on creating a dataset of single-year counts, by gender, for Nairobi.\nThe distribution of population by age, sex, and administrative unit from the 2019 Kenyan census can be downloaded here. While this format as a PDF makes it easy to look up a particular result, it is not overly useful if we want to model the data. In order to be able to do that, we need to convert this PDF into a tidy dataset that can be analyzed.\n\n9.7.1 Gather and clean\nWe first need to download and read in the PDF of the 2019 Kenyan census.4\n\ncensus_url <-\n paste0(\n \"https://www.knbs.or.ke/download/2019-kenya-population-and-\",\n \"housing-census-volume-iii-distribution-of-population-by-age-\",\n \"sex-and-administrative-units/?wpdmdl=5729&refresh=\",\n \"620561f1ce3ad1644519921\"\n )\n\ndownload.file(\n url = census_url,\n destfile = \"2019_Kenya_census.pdf\",\n mode = \"wb\"\n)\n\nWe can use pdf_text() from pdftools again here.\n\nkenya_census <-\n pdf_text(\n pdf = \"2019_Kenya_census.pdf\"\n )\n\nIn this example we will focus on the page of the PDF about Nairobi (Figure 9.8).\n\n\n\n\n\n\nFigure 9.8: Page from the 2019 Kenyan census about Nairobi\n\n\n\n\n9.7.1.1 Make rectangular\nThe first challenge is to get the dataset into a format that we can more easily manipulate. We will extract the relevant parts of the page. 
In this case, data about Nairobi is on page 410.\n\n# Focus on the page of interest\njust_nairobi <- stri_split_lines(kenya_census[[410]])[[1]]\n\n# Remove blank lines\njust_nairobi <- just_nairobi[just_nairobi != \"\"]\n\n# Remove titles, headings and other content at the top of the page\njust_nairobi <- just_nairobi[5:length(just_nairobi)]\n\n# Remove page numbers and other content at the bottom of the page\njust_nairobi <- just_nairobi[1:62]\n\n# Convert into a tibble\ndemography_data <- tibble(all = just_nairobi)\n\nAt this point the data are in a tibble. This allows us to use our familiar dplyr verbs. In particular we want to separate the columns.\n\ndemography_data <-\n demography_data |>\n mutate(all = str_squish(all)) |>\n mutate(all = str_replace(all, \"10 -14\", \"10-14\")) |>\n mutate(all = str_replace(all, \"Not Stated\", \"NotStated\")) |>\n # Deal with the two column set-up\n separate(\n col = all,\n into = c(\n \"age\", \"male\", \"female\", \"total\",\n \"age_2\", \"male_2\", \"female_2\", \"total_2\"\n ),\n sep = \" \",\n remove = TRUE,\n fill = \"right\",\n extra = \"drop\"\n )\n\nThey are side by side at the moment. 
We need to instead append to the bottom.\n\ndemography_data_long <-\n rbind(\n demography_data |> select(age, male, female, total),\n demography_data |>\n select(age_2, male_2, female_2, total_2) |>\n rename(\n age = age_2,\n male = male_2,\n female = female_2,\n total = total_2\n )\n )\n\n\n# There is one row of NAs, so remove it\ndemography_data_long <-\n demography_data_long |>\n remove_empty(which = c(\"rows\"))\n\ndemography_data_long\n\n# A tibble: 123 × 4\n age male female total \n <chr> <chr> <chr> <chr> \n 1 Total 2,192,452 2,204,376 4,396,828\n 2 0 57,265 56,523 113,788 \n 3 1 56,019 54,601 110,620 \n 4 2 52,518 51,848 104,366 \n 5 3 51,115 51,027 102,142 \n 6 4 47,182 46,889 94,071 \n 7 0-4 264,099 260,888 524,987 \n 8 5 45,203 44,711 89,914 \n 9 6 43,635 44,226 87,861 \n10 7 43,507 43,655 87,162 \n# ℹ 113 more rows\n\n\nHaving got it into a rectangular format, we now need to clean the dataset to make it useful.\n\n\n9.7.1.2 Validity\nTo attain validity requires a number of steps. The first step is to make the numbers into actual numbers, rather than characters. Before we can convert the type, we need to remove anything that is not a number otherwise that cell will be converted into an NA. We first identify any values that are not numbers so that we can remove them, and distinct() is especially useful.\n\ndemography_data_long |>\n select(male, female, total) |>\n mutate(across(everything(), ~ str_remove_all(., \"[:digit:]\"))) |>\n distinct()\n\n# A tibble: 5 × 3\n male female total\n <chr> <chr> <chr>\n1 \",,\" \",,\" \",,\" \n2 \",\" \",\" \",\" \n3 \"\" \",\" \",\" \n4 \"\" \"\" \",\" \n5 \"\" \"\" \"\" \n\n\nWe need to remove commas. While we could use janitor here, it is worthwhile to at least first look at what is going on because sometimes there is odd stuff that janitor (and other packages) will not deal with in a way that we want. 
Nonetheless, having identified everything that needs to be removed, we can do the actual removal and convert our character column of numbers to integers.\n\ndemography_data_long <-\n demography_data_long |>\n mutate(across(c(male, female, total), ~ str_remove_all(., \",\"))) |>\n mutate(across(c(male, female, total), ~ as.integer(.)))\n\ndemography_data_long\n\n# A tibble: 123 × 4\n age male female total\n <chr> <int> <int> <int>\n 1 Total 2192452 2204376 4396828\n 2 0 57265 56523 113788\n 3 1 56019 54601 110620\n 4 2 52518 51848 104366\n 5 3 51115 51027 102142\n 6 4 47182 46889 94071\n 7 0-4 264099 260888 524987\n 8 5 45203 44711 89914\n 9 6 43635 44226 87861\n10 7 43507 43655 87162\n# ℹ 113 more rows\n\n\n\n\n9.7.1.3 Internal consistency\n\nThe census has done some of the work of putting together age-groups for us, but we want to make it easy to just focus on the counts by single-year age. As such we will add a flag as to the type of age it is: an age-group, such as “ages 0 to 5”, or a single age, such as “1”.\n\ndemography_data_long <-\n demography_data_long |>\n mutate(\n age_type = if_else(str_detect(age, \"-\"), \n \"age-group\", \n \"single-year\"),\n age_type = if_else(str_detect(age, \"Total\"), \n \"age-group\", \n age_type)\n )\n\nAt the moment, age is a character variable. We have a decision to make here. We do not want it to be a character variable (because it will not graph properly), but we do not want it to be numeric, because there is total and 100+ in there. For now, we will just make it into a factor, and at least that will be able to be nicely graphed.\n\ndemography_data_long <-\n demography_data_long |>\n mutate(\n age = as_factor(age)\n )\n\n\n\n\n9.7.2 Check and test\nHaving gathered and cleaned the data, we would like to run a few checks. 
Given the format of the data, we can check that “total” is the sum of “male” and “female”, which are the only two gender categories available.\n\ndemography_data_long |>\n mutate(\n check_sum = male + female,\n totals_match = if_else(total == check_sum, 1, 0)\n ) |>\n filter(totals_match == 0)\n\n# A tibble: 0 × 7\n# ℹ 7 variables: age <fct>, male <int>, female <int>, total <int>,\n# age_type <chr>, check_sum <int>, totals_match <dbl>\n\n\nFinally, we want to check that the single-age counts sum to the age-groups.\n\ndemography_data_long |>\n mutate(age_groups = if_else(age_type == \"age-group\", \n age, \n NA_character_)) |>\n fill(age_groups, .direction = \"up\") |>\n mutate(\n group_sum = sum(total),\n group_sum = group_sum / 2,\n difference = total - group_sum,\n .by = c(age_groups)\n ) |>\n filter(age_type == \"age-group\" & age_groups != \"Total\") |> \n head()\n\n# A tibble: 6 × 8\n age male female total age_type age_groups group_sum difference\n <fct> <int> <int> <int> <chr> <chr> <dbl> <dbl>\n1 0-4 264099 260888 524987 age-group 0-4 524987 0\n2 5-9 215230 217482 432712 age-group 5-9 432712 0\n3 10-14 185008 193542 378550 age-group 10-14 378550 0\n4 15-19 159098 192755 351853 age-group 15-19 351853 0\n5 20-24 249534 313485 563019 age-group 20-24 563019 0\n6 25-29 282703 300845 583548 age-group 25-29 583548 0\n\n\n\n\n9.7.3 Tidy-up\nNow that we are reasonably confident that everything is looking good, we can convert it to tidy format. This will make it easier to work with.\n\ndemography_data_tidy <-\n demography_data_long |>\n rename_with(~paste0(., \"_total\"), male:total) |>\n pivot_longer(cols = contains(\"_total\"), \n names_to = \"type\", \n values_to = \"number\") |>\n separate(\n col = type,\n into = c(\"gender\", \"part_of_area\"),\n sep = \"_\"\n ) |>\n select(age, age_type, gender, number)\n\nThe original purpose of cleaning this dataset was to make a table that is used by Alexander and Alkema (2022). 
We will return to this dataset, but just to bring this all together, we may like to make a graph of single-year counts, by gender, for Nairobi (Figure 9.9).\n\ndemography_data_tidy |>\n filter(age_type == \"single-year\") |>\n select(age, gender, number) |>\n filter(gender != \"total\") |>\n ggplot(aes(x = age, y = number, fill = gender)) +\n geom_col(aes(x = age, y = number, fill = gender), \n position = \"dodge\") +\n scale_y_continuous(labels = comma) +\n scale_x_discrete(breaks = c(seq(from = 0, to = 99, by = 5), \"100+\")) +\n theme_classic() +\n scale_fill_brewer(palette = \"Set1\") +\n labs(\n y = \"Number\",\n x = \"Age\",\n fill = \"Gender\",\n caption = \"Data source: 2019 Kenya Census\"\n ) +\n theme(legend.position = \"bottom\") +\n coord_flip()\n\n\n\n\n\n\n\nFigure 9.9: Distribution of age and gender in Nairobi in 2019, based on Kenyan census\n\n\n\n\n\nA variety of features are clear from Figure 9.9, including age-heaping, a slight difference in the ratio of male-female birth, and a substantial difference between ages 15 and 25.\nFinally, we may wish to use more informative names. For instance, in the Kenyan data example earlier we have the following column names: “area”, “age”, “gender”, and “number”. 
If we were to use our column names as contracts, then these could be: “chr_area”, “fctr_group_age”, “chr_group_gender”, and “int_group_count”.\n\ncolumn_names_as_contracts <-\n demography_data_tidy |>\n filter(age_type == \"single-year\") |>\n select(age, gender, number) |>\n rename(\n \"fctr_group_age\" = \"age\",\n \"chr_group_gender\" = \"gender\",\n \"int_group_count\" = \"number\"\n )\n\nWe can then use pointblank to set-up tests for us.\n\nagent <-\n create_agent(tbl = column_names_as_contracts) |>\n col_is_character(columns = vars(chr_group_gender)) |>\n col_is_factor(columns = vars(fctr_group_age)) |>\n col_is_integer(columns = vars(int_group_count)) |>\n col_vals_in_set(\n columns = chr_group_gender,\n set = c(\"male\", \"female\", \"total\")\n ) |>\n interrogate()\n\nagent\n\n\n\n\n\n\n\n\nPointblank Validation\n\n\n\n\n[2024-10-14|11:01:33]\n\n\ntibble column_names_as_contracts\n\n\n\n\n\n\nSTEP\nCOLUMNS\nVALUES\nTBL\nEVAL\nUNITS\nPASS\nFAIL\nW\nS\nN\nEXT\n\n\n\n\n\n\n1\n\n\n\n\ncol_is_character\n\n \n\n\n col_is_character()\n\n\n▮chr_group_gender\n\n—\n\n\n \n\n\n✓\n1\n1\n1\n0\n0\n—\n—\n—\n—\n\n\n\n\n2\n\n\n\n\ncol_is_factor\n\n \n\n\n col_is_factor()\n\n\n▮fctr_group_age\n\n—\n\n\n \n\n\n✓\n1\n1\n1\n0\n0\n—\n—\n—\n—\n\n\n\n\n3\n\n\n\n\ncol_is_integer\n\n \n\n\n col_is_integer()\n\n\n▮int_group_count\n\n—\n\n\n \n\n\n✓\n1\n1\n1\n0\n0\n—\n—\n—\n—\n\n\n\n\n4\n\n\n\n\ncol_vals_in_set\n\n \n\n\n col_vals_in_set()\n\n\n▮chr_group_gender\n\n\nmale, female, total\n\n\n\n \n\n\n✓\n306\n306\n1\n0\n0\n—\n—\n—\n—\n\n\n\n2024-10-14 11:01:33 EDT < 1 s 2024-10-14 11:01:33 EDT",
+ "text": "9.7 2019 Kenyan census\nAs a final example, let us consider a more extensive situation and gather, clean, and prepare some data from the 2019 Kenyan census. We will focus on creating a dataset of single-year counts, by gender, for Nairobi.\nThe distribution of population by age, sex, and administrative unit from the 2019 Kenyan census can be downloaded here. While this format as a PDF makes it easy to look up a particular result, it is not overly useful if we want to model the data. In order to be able to do that, we need to convert this PDF into a tidy dataset that can be analyzed.\n\n9.7.1 Gather and clean\nWe first need to download and read in the PDF of the 2019 Kenyan census.4\n\ncensus_url <-\n paste0(\n \"https://www.knbs.or.ke/download/2019-kenya-population-and-\",\n \"housing-census-volume-iii-distribution-of-population-by-age-\",\n \"sex-and-administrative-units/?wpdmdl=5729&refresh=\",\n \"620561f1ce3ad1644519921\"\n )\n\ndownload.file(\n url = census_url,\n destfile = \"2019_Kenya_census.pdf\",\n mode = \"wb\"\n)\n\nWe can use pdf_text() from pdftools again here.\n\nkenya_census <-\n pdf_text(\n pdf = \"2019_Kenya_census.pdf\"\n )\n\nIn this example we will focus on the page of the PDF about Nairobi (Figure 9.8).\n\n\n\n\n\n\nFigure 9.8: Page from the 2019 Kenyan census about Nairobi\n\n\n\n\n9.7.1.1 Make rectangular\nThe first challenge is to get the dataset into a format that we can more easily manipulate. We will extract the relevant parts of the page. 
In this case, data about Nairobi is on page 410.\n\n# Focus on the page of interest\njust_nairobi <- stri_split_lines(kenya_census[[410]])[[1]]\n\n# Remove blank lines\njust_nairobi <- just_nairobi[just_nairobi != \"\"]\n\n# Remove titles, headings and other content at the top of the page\njust_nairobi <- just_nairobi[5:length(just_nairobi)]\n\n# Remove page numbers and other content at the bottom of the page\njust_nairobi <- just_nairobi[1:62]\n\n# Convert into a tibble\ndemography_data <- tibble(all = just_nairobi)\n\nAt this point the data are in a tibble. This allows us to use our familiar dplyr verbs. In particular we want to separate the columns.\n\ndemography_data <-\n demography_data |>\n mutate(all = str_squish(all)) |>\n mutate(all = str_replace(all, \"10 -14\", \"10-14\")) |>\n mutate(all = str_replace(all, \"Not Stated\", \"NotStated\")) |>\n # Deal with the two column set-up\n separate(\n col = all,\n into = c(\n \"age\", \"male\", \"female\", \"total\",\n \"age_2\", \"male_2\", \"female_2\", \"total_2\"\n ),\n sep = \" \",\n remove = TRUE,\n fill = \"right\",\n extra = \"drop\"\n )\n\nThey are side by side at the moment. 
We need to instead append to the bottom.\n\ndemography_data_long <-\n rbind(\n demography_data |> select(age, male, female, total),\n demography_data |>\n select(age_2, male_2, female_2, total_2) |>\n rename(\n age = age_2,\n male = male_2,\n female = female_2,\n total = total_2\n )\n )\n\n\n# There is one row of NAs, so remove it\ndemography_data_long <-\n demography_data_long |>\n remove_empty(which = c(\"rows\"))\n\ndemography_data_long\n\n# A tibble: 123 × 4\n age male female total \n <chr> <chr> <chr> <chr> \n 1 Total 2,192,452 2,204,376 4,396,828\n 2 0 57,265 56,523 113,788 \n 3 1 56,019 54,601 110,620 \n 4 2 52,518 51,848 104,366 \n 5 3 51,115 51,027 102,142 \n 6 4 47,182 46,889 94,071 \n 7 0-4 264,099 260,888 524,987 \n 8 5 45,203 44,711 89,914 \n 9 6 43,635 44,226 87,861 \n10 7 43,507 43,655 87,162 \n# ℹ 113 more rows\n\n\nHaving got it into a rectangular format, we now need to clean the dataset to make it useful.\n\n\n9.7.1.2 Validity\nTo attain validity requires a number of steps. The first step is to make the numbers into actual numbers, rather than characters. Before we can convert the type, we need to remove anything that is not a number otherwise that cell will be converted into an NA. We first identify any values that are not numbers so that we can remove them, and distinct() is especially useful.\n\ndemography_data_long |>\n select(male, female, total) |>\n mutate(across(everything(), ~ str_remove_all(., \"[:digit:]\"))) |>\n distinct()\n\n# A tibble: 5 × 3\n male female total\n <chr> <chr> <chr>\n1 \",,\" \",,\" \",,\" \n2 \",\" \",\" \",\" \n3 \"\" \",\" \",\" \n4 \"\" \"\" \",\" \n5 \"\" \"\" \"\" \n\n\nWe need to remove commas. While we could use janitor here, it is worthwhile to at least first look at what is going on because sometimes there is odd stuff that janitor (and other packages) will not deal with in a way that we want. 
Nonetheless, having identified everything that needs to be removed, we can do the actual removal and convert our character column of numbers to integers.\n\ndemography_data_long <-\n demography_data_long |>\n mutate(across(c(male, female, total), ~ str_remove_all(., \",\"))) |>\n mutate(across(c(male, female, total), ~ as.integer(.)))\n\ndemography_data_long\n\n# A tibble: 123 × 4\n age male female total\n <chr> <int> <int> <int>\n 1 Total 2192452 2204376 4396828\n 2 0 57265 56523 113788\n 3 1 56019 54601 110620\n 4 2 52518 51848 104366\n 5 3 51115 51027 102142\n 6 4 47182 46889 94071\n 7 0-4 264099 260888 524987\n 8 5 45203 44711 89914\n 9 6 43635 44226 87861\n10 7 43507 43655 87162\n# ℹ 113 more rows\n\n\n\n\n9.7.1.3 Internal consistency\n\nThe census has done some of the work of putting together age-groups for us, but we want to make it easy to just focus on the counts by single-year age. As such we will add a flag as to the type of age it is: an age-group, such as “ages 0 to 5”, or a single age, such as “1”.\n\ndemography_data_long <-\n demography_data_long |>\n mutate(\n age_type = if_else(str_detect(age, \"-\"), \n \"age-group\", \n \"single-year\"),\n age_type = if_else(str_detect(age, \"Total\"), \n \"age-group\", \n age_type)\n )\n\nAt the moment, age is a character variable. We have a decision to make here. We do not want it to be a character variable (because it will not graph properly), but we do not want it to be numeric, because there is total and 100+ in there. For now, we will just make it into a factor, and at least that will be able to be nicely graphed.\n\ndemography_data_long <-\n demography_data_long |>\n mutate(\n age = as_factor(age)\n )\n\n\n\n\n9.7.2 Check and test\nHaving gathered and cleaned the data, we would like to run a few checks. 
Given the format of the data, we can check that “total” is the sum of “male” and “female”, which are the only two gender categories available.\n\ndemography_data_long |>\n mutate(\n check_sum = male + female,\n totals_match = if_else(total == check_sum, 1, 0)\n ) |>\n filter(totals_match == 0)\n\n# A tibble: 0 × 7\n# ℹ 7 variables: age <fct>, male <int>, female <int>, total <int>,\n# age_type <chr>, check_sum <int>, totals_match <dbl>\n\n\nFinally, we want to check that the single-age counts sum to the age-groups.\n\ndemography_data_long |>\n mutate(age_groups = if_else(age_type == \"age-group\", \n age, \n NA_character_)) |>\n fill(age_groups, .direction = \"up\") |>\n mutate(\n group_sum = sum(total),\n group_sum = group_sum / 2,\n difference = total - group_sum,\n .by = c(age_groups)\n ) |>\n filter(age_type == \"age-group\" & age_groups != \"Total\") |> \n head()\n\n# A tibble: 6 × 8\n age male female total age_type age_groups group_sum difference\n <fct> <int> <int> <int> <chr> <chr> <dbl> <dbl>\n1 0-4 264099 260888 524987 age-group 0-4 524987 0\n2 5-9 215230 217482 432712 age-group 5-9 432712 0\n3 10-14 185008 193542 378550 age-group 10-14 378550 0\n4 15-19 159098 192755 351853 age-group 15-19 351853 0\n5 20-24 249534 313485 563019 age-group 20-24 563019 0\n6 25-29 282703 300845 583548 age-group 25-29 583548 0\n\n\n\n\n9.7.3 Tidy-up\nNow that we are reasonably confident that everything is looking good, we can convert it to tidy format. This will make it easier to work with.\n\ndemography_data_tidy <-\n demography_data_long |>\n rename_with(~paste0(., \"_total\"), male:total) |>\n pivot_longer(cols = contains(\"_total\"), \n names_to = \"type\", \n values_to = \"number\") |>\n separate(\n col = type,\n into = c(\"gender\", \"part_of_area\"),\n sep = \"_\"\n ) |>\n select(age, age_type, gender, number)\n\nThe original purpose of cleaning this dataset was to make a table that is used by Alexander and Alkema (2022). 
We will return to this dataset, but just to bring this all together, we may like to make a graph of single-year counts, by gender, for Nairobi (Figure 9.9).\n\ndemography_data_tidy |>\n filter(age_type == \"single-year\") |>\n select(age, gender, number) |>\n filter(gender != \"total\") |>\n ggplot(aes(x = age, y = number, fill = gender)) +\n geom_col(aes(x = age, y = number, fill = gender), \n position = \"dodge\") +\n scale_y_continuous(labels = comma) +\n scale_x_discrete(breaks = c(seq(from = 0, to = 99, by = 5), \"100+\")) +\n theme_classic() +\n scale_fill_brewer(palette = \"Set1\") +\n labs(\n y = \"Number\",\n x = \"Age\",\n fill = \"Gender\",\n caption = \"Data source: 2019 Kenya Census\"\n ) +\n theme(legend.position = \"bottom\") +\n coord_flip()\n\n\n\n\n\n\n\nFigure 9.9: Distribution of age and gender in Nairobi in 2019, based on Kenyan census\n\n\n\n\n\nA variety of features are clear from Figure 9.9, including age-heaping, a slight difference in the ratio of male-female birth, and a substantial difference between ages 15 and 25.\nFinally, we may wish to use more informative names. For instance, in the Kenyan data example earlier we have the following column names: “area”, “age”, “gender”, and “number”. 
If we were to use our column names as contracts, then these could be: “chr_area”, “fctr_group_age”, “chr_group_gender”, and “int_group_count”.\n\ncolumn_names_as_contracts <-\n demography_data_tidy |>\n filter(age_type == \"single-year\") |>\n select(age, gender, number) |>\n rename(\n \"fctr_group_age\" = \"age\",\n \"chr_group_gender\" = \"gender\",\n \"int_group_count\" = \"number\"\n )\n\nWe can then use pointblank to set-up tests for us.\n\nagent <-\n create_agent(tbl = column_names_as_contracts) |>\n col_is_character(columns = vars(chr_group_gender)) |>\n col_is_factor(columns = vars(fctr_group_age)) |>\n col_is_integer(columns = vars(int_group_count)) |>\n col_vals_in_set(\n columns = chr_group_gender,\n set = c(\"male\", \"female\", \"total\")\n ) |>\n interrogate()\n\nagent\n\n\n\n\n\n\n\n\nPointblank Validation\n\n\n\n\n[2024-10-14|11:37:28]\n\n\ntibble column_names_as_contracts\n\n\n\n\n\n\nSTEP\nCOLUMNS\nVALUES\nTBL\nEVAL\nUNITS\nPASS\nFAIL\nW\nS\nN\nEXT\n\n\n\n\n\n\n1\n\n\n\n\ncol_is_character\n\n \n\n\n col_is_character()\n\n\n▮chr_group_gender\n\n—\n\n\n \n\n\n✓\n1\n1\n1\n0\n0\n—\n—\n—\n—\n\n\n\n\n2\n\n\n\n\ncol_is_factor\n\n \n\n\n col_is_factor()\n\n\n▮fctr_group_age\n\n—\n\n\n \n\n\n✓\n1\n1\n1\n0\n0\n—\n—\n—\n—\n\n\n\n\n3\n\n\n\n\ncol_is_integer\n\n \n\n\n col_is_integer()\n\n\n▮int_group_count\n\n—\n\n\n \n\n\n✓\n1\n1\n1\n0\n0\n—\n—\n—\n—\n\n\n\n\n4\n\n\n\n\ncol_vals_in_set\n\n \n\n\n col_vals_in_set()\n\n\n▮chr_group_gender\n\n\nmale, female, total\n\n\n\n \n\n\n✓\n306\n306\n1\n0\n0\n—\n—\n—\n—\n\n\n\n2024-10-14 11:37:28 EDT < 1 s 2024-10-14 11:37:28 EDT",
"crumbs": [
"Preparation",
"9 Clean and prepare "
@@ -951,7 +951,7 @@
"href": "11-eda.html#united-states-population-and-income-data",
"title": "11 Exploratory data analysis",
"section": "11.2 1975 United States population and income data",
- "text": "11.2 1975 United States population and income data\nAs a first example we consider US state populations as of 1975. This dataset is built into R with state.x77. Here is what the dataset looks like:\n\nus_populations <-\n state.x77 |>\n as_tibble() |>\n clean_names() |>\n mutate(state = rownames(state.x77)) |>\n select(state, population, income)\n\nus_populations\n\n# A tibble: 50 × 3\n state population income\n <chr> <dbl> <dbl>\n 1 Alabama 3615 3624\n 2 Alaska 365 6315\n 3 Arizona 2212 4530\n 4 Arkansas 2110 3378\n 5 California 21198 5114\n 6 Colorado 2541 4884\n 7 Connecticut 3100 5348\n 8 Delaware 579 4809\n 9 Florida 8277 4815\n10 Georgia 4931 4091\n# ℹ 40 more rows\n\n\nWe want to get a quick sense of the data. The first step is to have a look at the top and bottom of it with head() and tail(), then a random selection, and finally to focus on the variables and their class with glimpse(). The random selection is an important aspect, and when you use head() you should also quickly consider a random selection.\n\nus_populations |>\n head()\n\n# A tibble: 6 × 3\n state population income\n <chr> <dbl> <dbl>\n1 Alabama 3615 3624\n2 Alaska 365 6315\n3 Arizona 2212 4530\n4 Arkansas 2110 3378\n5 California 21198 5114\n6 Colorado 2541 4884\n\nus_populations |>\n tail()\n\n# A tibble: 6 × 3\n state population income\n <chr> <dbl> <dbl>\n1 Vermont 472 3907\n2 Virginia 4981 4701\n3 Washington 3559 4864\n4 West Virginia 1799 3617\n5 Wisconsin 4589 4468\n6 Wyoming 376 4566\n\nus_populations |>\n slice_sample(n = 6)\n\n# A tibble: 6 × 3\n state population income\n <chr> <dbl> <dbl>\n1 Indiana 5313 4458\n2 Alabama 3615 3624\n3 Michigan 9111 4751\n4 Georgia 4931 4091\n5 Pennsylvania 11860 4449\n6 Colorado 2541 4884\n\nus_populations |>\n glimpse()\n\nRows: 50\nColumns: 3\n$ state <chr> \"Alabama\", \"Alaska\", \"Arizona\", \"Arkansas\", \"California\", \"…\n$ population <dbl> 3615, 365, 2212, 2110, 21198, 2541, 3100, 579, 8277, 4931, …\n$ income <dbl> 3624, 6315, 
4530, 3378, 5114, 4884, 5348, 4809, 4815, 4091,…\n\n\nWe are then interested in understanding key summary statistics, such as the minimum, median, and maximum values for numeric variables with summary() from base R and the number of observations.\n\nus_populations |>\n summary()\n\n state population income \n Length:50 Min. : 365 Min. :3098 \n Class :character 1st Qu.: 1080 1st Qu.:3993 \n Mode :character Median : 2838 Median :4519 \n Mean : 4246 Mean :4436 \n 3rd Qu.: 4968 3rd Qu.:4814 \n Max. :21198 Max. :6315 \n\n\nFinally, it is especially important to understand the behavior of these key summary statistics at the limits. In particular, one approach is to randomly remove some observations and compare what happens to them. For instance, we can randomly create five datasets that differ on the basis of which observations were removed. We can then compare the summary statistics. If any of them are especially different, then we would want to look at the observations that were removed as they may contain observations with high influence.\n\nsample_means <- tibble(seed = c(), mean = c(), states_ignored = c())\n\nfor (i in c(1:5)) {\n set.seed(i)\n dont_get <- c(sample(x = state.name, size = 5))\n sample_means <-\n sample_means |>\n rbind(tibble(\n seed = i,\n mean =\n us_populations |>\n filter(!state %in% dont_get) |>\n summarise(mean = mean(population)) |>\n pull(),\n states_ignored = str_c(dont_get, collapse = \", \")\n ))\n}\n\nsample_means |>\n kable(\n col.names = c(\"Seed\", \"Mean\", \"Ignored states\"),\n digits = 0,\n format.args = list(big.mark = \",\"),\n booktabs = TRUE\n )\n\n\n\nTable 11.1: Comparing the mean population when different states are randomly removed\n\n\n\n\n\n\n\n\n\n\n\nSeed\nMean\nIgnored states\n\n\n\n\n1\n4,469\nArkansas, Rhode Island, Alabama, North Dakota, Minnesota\n\n\n2\n4,027\nMassachusetts, Iowa, Colorado, West Virginia, New York\n\n\n3\n4,086\nCalifornia, Idaho, Rhode Island, Oklahoma, South Carolina\n\n\n4\n4,391\nHawaii, 
Arizona, Connecticut, Utah, New Jersey\n\n\n5\n4,340\nAlaska, Texas, Iowa, Hawaii, South Dakota\n\n\n\n\n\n\n\n\nIn the case of the populations of US states, we know that larger states, such as California and New York, will have an out sized effect on our estimate of the mean. Table 11.1 supports that, as we can see that when we use seeds 2 and 3, there is a lower mean.",
+ "text": "11.2 1975 United States population and income data\nAs a first example we consider US state populations as of 1975. This dataset is built into R with state.x77. Here is what the dataset looks like:\n\nus_populations <-\n state.x77 |>\n as_tibble() |>\n clean_names() |>\n mutate(state = rownames(state.x77)) |>\n select(state, population, income)\n\nus_populations\n\n# A tibble: 50 × 3\n state population income\n <chr> <dbl> <dbl>\n 1 Alabama 3615 3624\n 2 Alaska 365 6315\n 3 Arizona 2212 4530\n 4 Arkansas 2110 3378\n 5 California 21198 5114\n 6 Colorado 2541 4884\n 7 Connecticut 3100 5348\n 8 Delaware 579 4809\n 9 Florida 8277 4815\n10 Georgia 4931 4091\n# ℹ 40 more rows\n\n\nWe want to get a quick sense of the data. The first step is to have a look at the top and bottom of it with head() and tail(), then a random selection, and finally to focus on the variables and their class with glimpse(). The random selection is an important aspect, and when you use head() you should also quickly consider a random selection.\n\nus_populations |>\n head()\n\n# A tibble: 6 × 3\n state population income\n <chr> <dbl> <dbl>\n1 Alabama 3615 3624\n2 Alaska 365 6315\n3 Arizona 2212 4530\n4 Arkansas 2110 3378\n5 California 21198 5114\n6 Colorado 2541 4884\n\nus_populations |>\n tail()\n\n# A tibble: 6 × 3\n state population income\n <chr> <dbl> <dbl>\n1 Vermont 472 3907\n2 Virginia 4981 4701\n3 Washington 3559 4864\n4 West Virginia 1799 3617\n5 Wisconsin 4589 4468\n6 Wyoming 376 4566\n\nus_populations |>\n slice_sample(n = 6)\n\n# A tibble: 6 × 3\n state population income\n <chr> <dbl> <dbl>\n1 Utah 1203 4022\n2 Hawaii 868 4963\n3 Maine 1058 3694\n4 Delaware 579 4809\n5 Iowa 2861 4628\n6 Kansas 2280 4669\n\nus_populations |>\n glimpse()\n\nRows: 50\nColumns: 3\n$ state <chr> \"Alabama\", \"Alaska\", \"Arizona\", \"Arkansas\", \"California\", \"…\n$ population <dbl> 3615, 365, 2212, 2110, 21198, 2541, 3100, 579, 8277, 4931, …\n$ income <dbl> 3624, 6315, 4530, 3378, 5114, 
4884, 5348, 4809, 4815, 4091,…\n\n\nWe are then interested in understanding key summary statistics, such as the minimum, median, and maximum values for numeric variables with summary() from base R and the number of observations.\n\nus_populations |>\n summary()\n\n state population income \n Length:50 Min. : 365 Min. :3098 \n Class :character 1st Qu.: 1080 1st Qu.:3993 \n Mode :character Median : 2838 Median :4519 \n Mean : 4246 Mean :4436 \n 3rd Qu.: 4968 3rd Qu.:4814 \n Max. :21198 Max. :6315 \n\n\nFinally, it is especially important to understand the behavior of these key summary statistics at the limits. In particular, one approach is to randomly remove some observations and compare what happens to them. For instance, we can randomly create five datasets that differ on the basis of which observations were removed. We can then compare the summary statistics. If any of them are especially different, then we would want to look at the observations that were removed as they may contain observations with high influence.\n\nsample_means <- tibble(seed = c(), mean = c(), states_ignored = c())\n\nfor (i in c(1:5)) {\n set.seed(i)\n dont_get <- c(sample(x = state.name, size = 5))\n sample_means <-\n sample_means |>\n rbind(tibble(\n seed = i,\n mean =\n us_populations |>\n filter(!state %in% dont_get) |>\n summarise(mean = mean(population)) |>\n pull(),\n states_ignored = str_c(dont_get, collapse = \", \")\n ))\n}\n\nsample_means |>\n kable(\n col.names = c(\"Seed\", \"Mean\", \"Ignored states\"),\n digits = 0,\n format.args = list(big.mark = \",\"),\n booktabs = TRUE\n )\n\n\n\nTable 11.1: Comparing the mean population when different states are randomly removed\n\n\n\n\n\n\n\n\n\n\n\nSeed\nMean\nIgnored states\n\n\n\n\n1\n4,469\nArkansas, Rhode Island, Alabama, North Dakota, Minnesota\n\n\n2\n4,027\nMassachusetts, Iowa, Colorado, West Virginia, New York\n\n\n3\n4,086\nCalifornia, Idaho, Rhode Island, Oklahoma, South Carolina\n\n\n4\n4,391\nHawaii, Arizona, Connecticut, 
Utah, New Jersey\n\n\n5\n4,340\nAlaska, Texas, Iowa, Hawaii, South Dakota\n\n\n\n\n\n\n\n\nIn the case of the populations of US states, we know that larger states, such as California and New York, will have an out sized effect on our estimate of the mean. Table 11.1 supports that, as we can see that when we use seeds 2 and 3, there is a lower mean.",
"crumbs": [
"Modeling",
"11 Exploratory data analysis "
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index fcc2f908..8c4c3870 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -30,15 +30,15 @@
https://tellingstorieswithdata.com/06-farm.html
- 2024-10-14T14:56:21.696Z
+ 2024-10-14T15:31:01.098Z
https://tellingstorieswithdata.com/07-gather.html
- 2024-10-14T15:07:39.842Z
+ 2024-10-14T15:31:41.962Z
https://tellingstorieswithdata.com/08-hunt.html
- 2024-10-14T15:07:57.393Z
+ 2024-10-14T15:35:59.666Z
https://tellingstorieswithdata.com/09-clean_and_prepare.html
diff --git a/inputs/data/cancensus/CM_data_cd8148f91ef9529ea3c54ed1dc68fee4.rda.meta b/inputs/data/cancensus/CM_data_cd8148f91ef9529ea3c54ed1dc68fee4.rda.meta
index 4f12d3c8..aa36126a 100644
Binary files a/inputs/data/cancensus/CM_data_cd8148f91ef9529ea3c54ed1dc68fee4.rda.meta and b/inputs/data/cancensus/CM_data_cd8148f91ef9529ea3c54ed1dc68fee4.rda.meta differ