From ab257f237f4c931d2e6c5f29b47f8101d7b7b349 Mon Sep 17 00:00:00 2001 From: Rohan Alexander Date: Tue, 29 Oct 2024 22:20:07 -0400 Subject: [PATCH] Add double brackets --- 00-errata.qmd | 2 +- 01-introduction.qmd | 2 - 03-workflow.qmd | 6 +- 06-farm.qmd | 10 +- 26-sql.qmd | 4 +- docs/00-errata.html | 2 +- docs/01-introduction.html | 1 - docs/02-drinking_from_a_fire_hose.html | 20 +- docs/03-workflow.html | 6 +- docs/06-farm.html | 8 +- .../figure-html/fig-readioovertime-1.png | Bin 110412 -> 110364 bytes docs/09-clean_and_prepare.html | 6 +- docs/11-eda.html | 16 +- docs/23-assessment.html | 880 +++++++++--------- docs/24-interaction.html | 12 +- docs/26-sql.html | 4 +- docs/99-references.html | 5 + docs/search.json | 20 +- docs/sitemap.xml | 10 +- ..._cd8148f91ef9529ea3c54ed1dc68fee4.rda.meta | Bin 343 -> 342 bytes references.bib | 9 + 21 files changed, 524 insertions(+), 499 deletions(-) diff --git a/00-errata.qmd b/00-errata.qmd index 206fe24b..3007d794 100644 --- a/00-errata.qmd +++ b/00-errata.qmd @@ -10,7 +10,7 @@ Chapman and Hall/CRC published this book in July 2023. You can purchase that [he This online version has some updates to what was printed. An online version that matches the print version is available [here](https://rohanalexander.github.io/telling_stories-published/). ::: -*Last updated: 28 October 2024.* +*Last updated: 29 October 2024.* The book was reviewed by Piotr Fryzlewicz in *The American Statistician* [@Fryzlewicz2024] and Nick Cox on [Amazon](https://www.amazon.com/gp/customer-reviews/R3S602G9RUDOF/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&ASIN=1032134771). I am grateful that they gave such a lot of their time to provide the review, as well as their corrections and suggestions. 
diff --git a/01-introduction.qmd b/01-introduction.qmd index 05707f75..0c1276af 100644 --- a/01-introduction.qmd +++ b/01-introduction.qmd @@ -292,5 +292,3 @@ Ultimately, we are all just telling stories with data, but these stories are inc The purpose of this activity is to clarify in your mind the difficulty of measurement, even of seemingly simple things, and hence the likelihood of measurement issues in more complicated areas. Please obtain some seeds for a fast-growing plant such as radishes, mustard greens, or arugula. Plant the seeds and measure how much soil you used. Water them and measure the water you used. Each day take a note of any changes. More generally, measure and record as much as you can. Note your thoughts about the difficulty of measurement. Eventually your seeds will sprout, and you should measure how they grow. - -We will return to use the data that you gathered. diff --git a/03-workflow.qmd b/03-workflow.qmd index c0b7fb9d..35b1448b 100644 --- a/03-workflow.qmd +++ b/03-workflow.qmd @@ -78,7 +78,7 @@ If science is about systematically building and organizing knowledge in terms of - Reproducible research is when "[a]uthors provide all the necessary data and the computer codes to run the analysis again, re-creating the results." - A replication is a study "that arrives at the same scientific findings as another study, collecting new data (possibly with different methods) and completing new analyses." -Regardless of what it is specifically called, @Gelman2016 identifies how large an issue the lack of it is in various social sciences. Work that is not reproducible does not contribute to our stock of knowledge about the world. This is wasteful and potentially even unethical. Since @Gelman2016, a great deal of work has been done in many social sciences and the situation has improved a little, but much work remains. 
That is also the case in the life sciences [@heil2021reproducibility], cancer research [@Begley2012; @Mullard2021], and computer science [@pineau2021improving].\index{computer science} +For our purposes we use the definition from @nationalacademies [p. 46]: "Reproducibility is obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis."\index{reproducibility!definition} Regardless of what it is specifically called, @Gelman2016 identifies how large an issue the lack of it is in various social sciences. Work that is not reproducible does not contribute to our stock of knowledge about the world. This is wasteful and potentially even unethical. Since @Gelman2016, a great deal of work has been done in many social sciences and the situation has improved a little, but much work remains. That is also the case in the life sciences [@heil2021reproducibility], cancer research [@Begley2012; @Mullard2021], and computer science [@pineau2021improving].\index{computer science} Some of the examples that @Gelman2016 talks about are not that important in the scheme of things. But at the same time, we saw, and continue to see, similar approaches being used in areas with big impacts. For instance, many governments have created "nudge" units that implement public policy [@sunstein2017economics] even though there is evidence that some of the claims lack credibility [@nonudge; @gelmannudge].\index{public policy} Governments are increasingly using algorithms that they do not make open [@chouldechova18a]. And @herndon2014does document how research in economics that was used by governments to justify austerity policies following the 2007–2008 financial crisis turned out to not be reproducible.\index{economics} @@ -175,6 +175,8 @@ format: pdf --- ``` +### References + We can include references by specifying a BibTeX\index{BibTeX} file in the top matter and then calling it within the text, as needed. 
``` @@ -214,6 +216,8 @@ To cite\index{citation} R in the Quarto document we then include `@citeR`, which The reference list\index{citation} at the end of the paper is automatically built based on calling the BibTeX file and including references in the paper. At the end of the Quarto document, include a heading "# References" and the actual citations will be included after that. When the Quarto file is rendered, Quarto sees these in the content, goes to the BibTeX file to get the reference details that it needs, builds the reference list, and then adds it at the end of the rendered document. +BibTeX\index{BibTeX!capitalization} will try to adjust the capitalization of titles. This can be helpful, but sometimes it is better to insist on a specific capitalization. To force BibTeX to use a particular capitalization, use double braces instead of single braces around the relevant text. For instance, in the above examples `{{R Core Team}}` will be printed with that exact capitalization, whereas `{Telling Stories with Data}` is subject to the whims of BibTeX. Insisting on a particular capitalization is important when citing R packages, which can have a specific capitalization, and when citing an organization as an author. For instance, when citing `usethis`, you need to use `title = {{usethis: Automate Package and Project Setup}},`, not `title = {usethis: Automate Package and Project Setup},`. And if, say, data were provided by the City of Toronto, then when specifying the author to cite that dataset you would want to use `author = {{City of Toronto}},` not `author = {City of Toronto},`. The latter would result in the incorrect reference list entry "Toronto, City of", while the former would result in the correct reference list entry "City of Toronto", because double braces tell BibTeX to treat the contents as a single unit rather than parsing them as a personal name. + ### Essential commands diff --git a/06-farm.qmd b/06-farm.qmd index 39c1e641..6bc94f9e 100644 --- a/06-farm.qmd +++ b/06-farm.qmd @@ -1319,7 +1319,9 @@ Please use IPUMS to access the 2022 ACS.
Making use of the codebook, how many re If there were 391,171 respondents in California (STATEICP) across all levels of education, then can you please use the ratio estimators approach of Laplace to estimate the total number of respondents in each state i.e. take the ratio that you worked out for California and apply it to the rest of the states. (Hint: You can now work out the ratio between the number of respondents with doctoral degrees in a state and number of respondents in a state and then apply that ratio to your column of the number of respondents with a doctoral degree in each state.) Compare it to the actual number of respondents in each state. -Write a short (2ish pages + appendices + references) paper using Quarto. Submit a link to a GitHub repo (one repo per group) that meets the general expectations of the course. Do not forget to cite the data (but don't upload the raw data to GitHub). Your paper should cover at least: +Write a brief paper using Quarto and submit a link to a GitHub repo (one repo per group) that meets the general expectations of the course. Components of the rubric that are relevant are: "R/Python is cited", "Data are appropriately cited", "Class paper", "LLM usage is documented", "Title", "Author, date, and repo", "Abstract", "Introduction", "Data", "Results", "Discussion", "Prose", "Cross-references", "Captions", "Graphs and tables", "Referencing", "Commits", "Sketches", "Simulation", "Tests", and "Reproducible workflow". + +Your paper should cover at least: - Instructions on how to obtain the data (in the appendix). - A brief overview of the ratio estimators approach. @@ -1327,6 +1329,8 @@ Write a short (2ish pages + appendices + references) paper using Quarto. Submit - Some explanation of why you think they are different i.e. the strengths and weaknesses of using ratio estimators. - A discussion of the strengths and weaknesses of the sampling approaches used by the ACS. 
(Hint: One fun thing could be to find the actual population of one state, then given the number of respondents in that and every state, use the ratio estimators approach to estimate the population of every state and then compare it with the actual populations.) +Do not forget to cite the data (but do not upload the raw data to GitHub). + Note: @@ -1344,7 +1348,7 @@ The purpose of this activity is to: A biobank contains de-identified biomedical data. For instance, UK Biobank contains samples from half a million UK participants. One use is to look at a respondent's whole genome sequencing data and then relate it to various conditions they have to better understand the extent of genetic determinant. -Please simulate the UK Biobank dataset. A whole genome would be a sequence of the letters A, T, C, and G, of a length of around 3 billion. For each individual, simulate only a length of 12 in total, made up of four groups of three letters. Then consider four conditions: colon cancer, cystic fibrosis, Parkinson's disease, and skin cancer. Create some probability association between different three letter combinations in your simulated genome sequences and whether a person has that condition. For instance, perhaps having AAA in the first three positions, is associated with five percent higher rate of colon cancer. Pretend that UK Biobank uses simple random sampling. +Please simulate the UK Biobank dataset. A whole genome would be a sequence of the letters A, T, C, and G, of a length of around 3 billion. For each individual, simulate only a length of 12 in total, made up of four groups of three letters. Then consider four conditions: colon cancer, cystic fibrosis, Parkinson's disease, and skin cancer. Create some probability association between different three letter combinations in your simulated genome sequences and whether a person has that condition. 
For instance, perhaps having AAA in the first three positions is associated with a five percent higher rate of colon cancer. Pretend that UK Biobank uses simple random sampling. Don't forget to set a seed. @Davies2024 argue that UK Biobank should use family-based sampling. That is, they would cluster at the level of families. We would expect that families would have similar (but not necessarily the same) sequences. Please redo your simulation to assume families of sizes one to five. @@ -1352,4 +1356,4 @@ Now analyze your simulations and compare your ability to find relationships betw The other aspect to note is that sampling at the level of the family may be easier for the collectors, because if they are collecting data about one member of the family then it may be slightly more convenient to collect data about another member of that family. Discuss the difference between probability and non-probability sampling and the nuances of the distinction. -Write a short (2ish pages + appendices + references) paper using Quarto. Submit a link to a GitHub repo (one repo per group) that meets the general expectations of the course. \ No newline at end of file +Write a brief paper using Quarto and submit a link to a GitHub repo (one repo per group) that meets the general expectations of the course. Components of the rubric that are relevant are: "R/Python is cited", "Class paper", "LLM usage is documented", "Title", "Author, date, and repo", "Abstract", "Introduction", "Data", "Results", "Discussion", "Prose", "Cross-references", "Captions", "Graphs and tables", "Referencing", "Commits", "Sketches", "Simulation", "Tests", and "Reproducible workflow". diff --git a/26-sql.qmd b/26-sql.qmd index 4b3bd678..df3a84df 100644 --- a/26-sql.qmd +++ b/26-sql.qmd @@ -334,7 +334,7 @@ Please submit a screenshot showing you got at least 70 per cent in the free w3sc Get the SQL dataset from here: https://jacobfilipp.com/hammer/. -Use SQL (not R or Python) to make some finding.
Write a short paper using Quarto. In the discussion please have one sub-section each on of: 1) correlation vs. causation; 2) missing data; 3) sources of bias. +Use SQL (not R or Python) to make some finding using these observational data. Write a short paper using Quarto (you are welcome to use R/Python to make graphs, but not for data preparation/manipulation, which should occur in SQL in a separate script). In the discussion please have one sub-section each on: 1) correlation vs. causation; 2) missing data; 3) sources of bias. -Submit a link to a GitHub repo that meets the general expectations. +Submit a link to a GitHub repo (one repo per group) that meets the general expectations. Components of the rubric that are relevant are: "R/Python is cited", "Data are cited", "Class paper", "LLM usage is documented", "Title", "Author, date, and repo", "Abstract", "Introduction", "Data", "Measurement", "Results", "Discussion", "Prose", "Cross-references", "Captions", "Graphs/tables/etc", "Referencing", "Commits", "Sketches", "Simulation", "Tests", and "Reproducible workflow". diff --git a/docs/00-errata.html b/docs/00-errata.html index b319d447..7e510800 100644 --- a/docs/00-errata.html +++ b/docs/00-errata.html @@ -478,7 +478,7 @@
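The 26-sql.qmd change asks that all data preparation happen in SQL, with R/Python used only afterwards. A minimal sketch of that split, using Python's built-in sqlite3 module; the table and column names here (`prices`, `vendor`, `product`, `price`) are hypothetical toy values, not the actual schema of the dataset, so check what you download:

```python
# Sketch of the "preparation in SQL, not R/Python" split. The schema and
# rows below are invented for illustration only.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE prices (vendor TEXT, product TEXT, price REAL);
INSERT INTO prices VALUES
  ('A', 'milk', 4.50), ('A', 'bread', 3.00),
  ('B', 'milk', 4.90), ('B', 'bread', 2.80);
""")

# All manipulation happens in SQL: here, average price per vendor.
rows = con.execute(
    "SELECT vendor, ROUND(AVG(price), 2) AS avg_price "
    "FROM prices GROUP BY vendor ORDER BY vendor"
).fetchall()

# Python only reports (or graphs) the finished result.
for vendor, avg_price in rows:
    print(vendor, avg_price)
con.close()
```

With the real dataset the same pattern applies: keep the CREATE/SELECT statements in a separate .sql script and have the graphing script read only the query results.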

 Errors and updates
-Last updated: 28 October 2024.
+Last updated: 29 October 2024.
 The book was reviewed by Piotr Fryzlewicz in The American Statistician (Fryzlewicz 2024) and Nick Cox on Amazon. I am grateful that they gave such a lot of their time to provide the review, as well as their corrections and suggestions.
 Since the publication of this book in July 2023, there have been a variety of changes in the world. The rise of generative AI has changed the way that people code, Python has become easier to integrate alongside R because of Quarto, and packages continue to update (not to mention a new cohort of students has started going through the book). One advantage of having an online version is that I can make improvements.
 I am grateful for the corrections and suggestions of: Andrew Black, Clay Ford, Crystal Lewis, David Jankoski, Donna Mulkern, Emi Tanaka, Emily Su, Inessa De Angelis, James Wade, Julia Kim, Krishiv Jain, Seamus Ross, Tino Kanngiesser, and Zak Varty.
diff --git a/docs/01-introduction.html b/docs/01-introduction.html index 3c80e9f5..565c3149 100644 --- a/docs/01-introduction.html +++ b/docs/01-introduction.html @@ -793,7 +793,6 @@

 Quiz
 Activity
 The purpose of this activity is to clarify in your mind the difficulty of measurement, even of seemingly simple things, and hence the likelihood of measurement issues in more complicated areas.
 Please obtain some seeds for a fast-growing plant such as radishes, mustard greens, or arugula. Plant the seeds and measure how much soil you used. Water them and measure the water you used. Each day take a note of any changes. More generally, measure and record as much as you can. Note your thoughts about the difficulty of measurement. Eventually your seeds will sprout, and you should measure how they grow.
-We will return to use the data that you gathered.
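To make the double-brace change in 03-workflow.qmd concrete, here is a sketch of a dataset entry; the citation key, title, year, and URL are placeholders rather than a real reference:

```bibtex
@misc{torontodata,
  author = {{City of Toronto}},
  title = {{Hypothetical Dataset Name}},
  year = {2024},
  url = {https://open.toronto.ca/},
}
```

With double braces, BibTeX treats the author field as a single corporate name and the reference list shows "City of Toronto"; with single braces it would parse the field as a personal name and produce "Toronto, City of".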