Get the SQL dataset from here: https://jacobfilipp.com/hammer/.
-Use SQL (not R or Python) to make some finding. Write a short paper using Quarto. In the discussion please have one sub-section each on of: 1) correlation vs. causation; 2) missing data; 3) sources of bias.
-Submit a link to a GitHub repo that meets the general expectations.
+Use SQL (not R or Python) to make some finding using this observational data. Write a short paper using Quarto (you are welcome to use R/Python to make graphs, but not for data preparation or manipulation, which should occur in SQL in a separate script). In the discussion please have one sub-section each on: 1) correlation vs. causation; 2) missing data; 3) sources of bias.
+Submit a link to a GitHub repo (one repo per group) that meets the general expectations. The relevant rubric components are: “R/Python is cited”, “Data are cited”, “Class paper”, “LLM usage is documented”, “Title”, “Author, date, and repo”, “Abstract”, “Introduction”, “Data”, “Measurement”, “Results”, “Discussion”, “Prose”, “Cross-references”, “Captions”, “Graphs/tables/etc”, “Referencing”, “Commits”, “Sketches”, “Simulation”, “Tests”, and “Reproducible workflow”.
Navarro, Danielle. 2022. “Binding Apache Arrow to R,” January. https://blog.djnavarro.net/posts/2022-01-18%5Fbinding-arrow-to-r/.
diff --git a/docs/search.json b/docs/search.json
index 7d42ff95..c9962524 100644
--- a/docs/search.json
+++ b/docs/search.json
@@ -170,7 +170,7 @@
"href": "01-introduction.html#exercises",
"title": "1 Telling stories with data",
"section": "1.6 Exercises",
- "text": "1.6 Exercises\n\nQuiz\n\nWhat is data science (in your own words)?\nFrom Register (2020), data decisions impact (pick one)?\n\nReal people.\nNo one.\nThose in the training set.\nThose in the test set.\n\nFrom Keyes (2019), what is data science (pick one)?\n\nData science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from many structured and unstructured data.\nThe quantitative analysis of large amounts of data for the purpose of decision-making.\nThe inhumane reduction of humanity down to what can be counted.\n\nFrom Keyes (2019), what is one consequence of data systems that require standardized categories (pick one)?\n\nWorse user experience.\nCompromised security measures.\nIncreased innovation in technology.\nErasure of individual identities and experiences.\n\nFrom Healy (2020), what is a common criticism about working with data (pick one)?\n\nThat it is too time-consuming and inefficient.\nThat it distances one from the reality of human lives behind the numbers.\nThat it requires expensive software and extensive training to analyse.\n\nFrom Healy (2020), what is a response that criticism (pick one)?\n\nWorking with data forces a confrontation with questions of meaning.\nData analysis should not be done.\nData should only be analyzed by automated processes.\nQualitative approaches should be the predominate approach.\n\nHow can you reconcile Keyes (2019) and Healy (2020)?\nWhy is ethics a key element of data science (pick one)?\n\nBecause data science always involves sensitive personal information.\nBecause ethical considerations make the analysis easier to do.\nBecause datasets likely concern humans and require consideration of context.\nBecause regulations require ethics approval for any data analysis.\n\nAccording to Crawford (2021), as described in this chapter, which of the following forces shape our world, and hence our data (select all that 
apply)?\n\nPolitical.\nPhysical.\nHistorical.\nCultural.\nSocial.\n\nFrom Ford (2015), what is a compiler (pick one)?\n\nSoftware that takes the symbols you typed into a file and transforms them into lower-level instructions.\nA sequence of symbols (using typical keyboard characters, saved to a file of some kind) that someone typed in, or copied, or pasted from elsewhere.\nA clock with benefits.\nPutting holes in punch cards, then into a box, then loading them, then the computer flips through the cards, identify where the holes were, and update parts of its memory.\n\nConsider the results of a survey that asked about gender. It finds the following counts: “man: 879”, “woman: 912”, “non-binary: 10” “prefer not to say: 3”, and “other: 1”. What is the appropriate way to consider “prefer not to say” (pick one)?\n\nDrop them.\nIt depends.\nInclude them.\nMerge it into “other”.\n\nImagine that you have a job in which including race and/or sexuality as predictors improves the performance of your model. 
When deciding whether to include these in your analysis, what factors would you consider (in your own words)?\nWhat is meant by reproducibility in data science (pick one)?\n\nBeing able to produce similar results with different datasets.\nEnsuring that all steps of the analysis can be independently redone by others.\nPublishing results in peer-reviewed journals.\nUsing proprietary software to protect data.\n\nWhat is a challenge associated with measurement (pick one)?\n\nIt is usually straightforward and requires little attention.\nDeciding what and how to measure is complex and context-dependent.\nData collection is objective and free from bias.\nMeasurements are always accurate and consistent over time.\n\nIn the analogy to the sculptor, what does the act of sculpting represent in the data workflow (pick one)?\n\nCreating complex models to fit the data.\nAcquiring raw data.\nCleaning and preparing the data to reveal the needed dataset.\nVisualizing the results.\n\nWhy is exploratory data analysis (EDA) an open-ended process (pick one)?\n\nBecause it has a fixed set of steps to follow.\nBecause it requires ongoing iteration to understand the data’s shape and patterns.\nBecause it involves testing hypotheses in a structured way.\nBecause it can be automated.\n\nWhy should statistical models be used carefully (pick one)?\n\nBecause they always provide definitive results.\nBecause they can reflect the decisions made in earlier stages.\nBecause they are too complicated for most audiences.\nBecause they are unnecessary if the data are well-presented.\n\nWhat is one lesson from thinking about the challenges of measuring height (pick one)?\n\nHeight is a straightforward measurement with little variability.\nAll measurements are accurate if done with the right instrument.\nEven simple measurements can have complexities that affect data quality.\nHeight is not a useful variable in data analysis.\n\nWhat is the danger of not considering who is missing from a dataset (pick 
one)?\n\nIt has no significant impact on the analysis.\nIt simplifies the analysis by reducing the amount of data.\nIt can lead to conclusions that do not represent the full context.\n\nWhat is a purpose of statistical modeling (pick one)?\n\nAs a tool to help explore and understand the data.\nTo prove hypotheses.\nTo replace exploratory data analysis.\n\nWhat is meant by “our data are a simplification of the messy, complex world” (pick one)?\n\nData perfectly capture all aspects of reality.\nData simplify reality to make analysis possible, but they cannot capture every detail.\nData are always inaccurate and useless.\n\n\n\n\nActivity\nThe purpose of this activity is to clarify in your mind the difficulty of measurement, even of seemingly simple things, and hence the likelihood of measurement issues in more complicated areas.\nPlease obtain some seeds for a fast-growing plant such as radishes, mustard greens, or arugula. Plant the seeds and measure how much soil you used. Water them and measure the water you used. Each day take a note of any changes. More generally, measure and record as much as you can. Note your thoughts about the difficulty of measurement. Eventually your seeds will sprout, and you should measure how they grow.\nWe will return to use the data that you gathered.\n\n\n\n\nBandy, John, and Nicholas Vincent. 2021. “Addressing ‘Documentation Debt’ in Machine Learning: A Retrospective Datasheet for BookCorpus.” In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, edited by J. Vanschoren and S. Yeung. Vol. 1. https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/54229abfcfa5649e7003b83dd4755294-Paper-round1.pdf.\n\n\nBaumer, Benjamin, Daniel Kaplan, and Nicholas Horton. 2021. Modern Data Science With R. 2nd ed. Chapman; Hall/CRC. https://mdsr-book.github.io/mdsr2e/.\n\n\nBlei, David, Andrew Ng, and Michael Jordan. 2003. 
“Latent Dirichlet Allocation.” Journal of Machine Learning Research 3 (Jan): 993–1022. https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf.\n\n\nBoland, Philip. 1984. “A Biographical Glimpse of William Sealy Gosset.” The American Statistician 38 (3): 179–83. https://doi.org/10.2307/2683648.\n\n\nBowley, Arthur Lyon. 1901. Elements of Statistics. London: P. S. King.\n\n\nBradley, Valerie, Shiro Kuriwaki, Michael Isakov, Dino Sejdinovic, Xiao-Li Meng, and Seth Flaxman. 2021. “Unrepresentative Big Surveys Significantly Overestimated US Vaccine Uptake.” Nature 600 (7890): 695–700. https://doi.org/10.1038/s41586-021-04198-4.\n\n\nCook, Dianne, Nancy Reid, and Emi Tanaka. 2021. “The Foundation Is Available for Thinking about Data Visualization Inferentially.” Harvard Data Science Review 3 (3). https://doi.org/10.1162/99608f92.8453435d.\n\n\nCraiu, Radu. 2019. “The Hiring Gambit: In Search of the Twofer Data Scientist.” Harvard Data Science Review 1 (1). https://doi.org/10.1162/99608f92.440445cb.\n\n\nCrawford, Kate. 2021. Atlas of AI. 1st ed. New Haven: Yale University Press.\n\n\nD’Ignazio, Catherine, and Lauren Klein. 2020. Data Feminism. Massachusetts: The MIT Press. https://data-feminism.mitpress.mit.edu.\n\n\nDattani, Saloni. 2024. “The Rise in Reported Maternal Mortality Rates in the US Is Largely Due to a Change in Measurement.” Our World in Data.\n\n\nDolatsara, Hamidreza Ahady, Ying-Ju Chen, Robert Leonard, Fadel Megahed, and Allison Jones-Farmer. 2021. “Explaining Predictive Model Performance: An Experimental Study of Data Preparation and Model Choice.” Big Data, October. https://doi.org/10.1089/big.2021.0067.\n\n\nEdgeworth, Francis Ysidro. 1885. “Methods of Statistics.” Journal of the Statistical Society of London, 181–217.\n\n\nFord, Paul. 2015. “What Is Code?” Bloomberg Businessweek, June. https://www.bloomberg.com/graphics/2015-paul-ford-what-is-code/.\n\n\nForster, Edward Morgan. 1927. Aspects of the Novel. London: Edward Arnold.\n\n\nFoster, Gordon. 
1968. “Computers, Statistics and Planning: Systems or Chaos?” Geary Lecture. https://www.esri.ie/system/files/publications/GLS2.pdf.\n\n\nFranklin, Laura. 2005. “Exploratory Experiments.” Philosophy of Science 72 (5): 888–99. https://doi.org/10.1086/508117.\n\n\nGould, Elliot, Hannah Fraser, Timothy Parker, Shinichi Nakagawa, Simon Griffith, Peter Vesk, and Fiona Fidler. 2023. “Same Data, Different Analysts: Variation in Effect Sizes Due to Analytical Decisions in Ecology and Evolutionary Biology,” October. https://doi.org/10.32942/x2gg62.\n\n\nHamming, Richard. (1997) 2020. The Art of Doing Science and Engineering. 2nd ed. Stripe Press.\n\n\nHealy, Kieran. 2020. “The Kitchen Counter Observatory,” May. https://kieranhealy.org/blog/archives/2020/05/21/the-kitchen-counter-observatory/.\n\n\nHeil, Benjamin, Michael Hoffman, Florian Markowetz, Su-In Lee, Casey Greene, and Stephanie Hicks. 2021. “Reproducibility Standards for Machine Learning in the Life Sciences.” Nature Methods 18 (10): 1132–35. https://doi.org/10.1038/s41592-021-01256-7.\n\n\nHullman, Jessica, and Andrew Gelman. 2021. “Designing for Interactive Exploratory Data Analysis Requires Theories of Graphical Inference.” Harvard Data Science Review 3 (3). https://doi.org/10.1162/99608f92.3ab8a587.\n\n\nHuntington-Klein, Nick, Andreu Arenas, Emily Beam, Marco Bertoni, Jeffrey Bloem, Pralhad Burli, Naibin Chen, et al. 2021. “The Influence of Hidden Researcher Decisions in Applied Microeconomics.” Economic Inquiry 59: 944–60. https://doi.org/10.1111/ecin.12992.\n\n\nJames, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. (2013) 2021. An Introduction to Statistical Learning with Applications in R. 2nd ed. Springer. https://www.statlearning.com.\n\n\nJordan, Michael. 2019. “Artificial Intelligence–The Revolution Hasn’t Happened Yet.” Harvard Data Science Review 1 (1). 
https://doi.org/10.1162/99608f92.f06c6e61.\n\n\nKennedy, Lauren, Katharine Khanna, Daniel Simpson, Andrew Gelman, Yajun Jia, and Julien Teitler. 2022. “He, She, They: Using Sex and Gender in Survey Adjustment.” https://arxiv.org/abs/2009.14401.\n\n\nKent, William. 1993. “My Height: A Model for Numeric Information.” https://www.bkent.net/Doc/myheight.htm.\n\n\nKeyes, Os. 2019. “Counting the Countless.” Real Life. https://reallifemag.com/counting-the-countless/.\n\n\nLeek, Jeff, and Roger Peng. 2020. “Advanced Data Science 2020.” http://jtleek.com/ads2020/index.html.\n\n\nMacDorman, Marian, and Eugene Declercq. 2018. “The Failure of United States Maternal Mortality Reporting and Its Impact on Women’s Lives.” Birth 45 (2): 105–8. https://doi.org/1111/birt.12333.\n\n\nMcElreath, Richard. (2015) 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. 2nd ed. Chapman; Hall/CRC.\n\n\nMeng, Xiao-Li. 2018. “Statistical Paradises and Paradoxes in Big Data (i): Law of Large Populations, Big Data Paradox, and the 2016 US Presidential Election.” The Annals of Applied Statistics 12 (2): 685–726. https://doi.org/10.1214/18-AOAS1161SF.\n\n\nRegister, Yim. 2020. “Data Science Ethics in 6 Minutes.” YouTube, December. https://youtu.be/mA4gypAiRYU.\n\n\nRilke, Rainer Maria. (1929) 2014. Letters to a Young Poet. Penguin Classics.\n\n\nThornhill, John. 2021. “Lunch with the FT: Mathematician Hannah Fry.” Financial Times, July. https://www.ft.com/content/a5e33e5a-99b9-4bbc-948f-8a527c7675c3.\n\n\nTimbers, Tiffany, Trevor Campbell, and Melissa Lee. 2022. Data Science: A First Introduction. Chapman; Hall/CRC. https://datasciencebook.ca.\n\n\nWasserman, Larry. 2005. All of Statistics. Springer.\n\n\nWickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. (2016) 2023. R for Data Science. 2nd ed. O’Reilly Media. https://r4ds.hadley.nz.\n\n\nWorld Health Organization. 2019. 
“Trends in Maternal Mortality 2000 to 2017: Estimates by WHO, UNICEF, UNFPA, World Bank Group and the United Nations Population Division.” https://apps.who.int/iris/handle/10665/327596.",
+ "text": "1.6 Exercises\n\nQuiz\n\nWhat is data science (in your own words)?\nFrom Register (2020), data decisions impact (pick one)?\n\nReal people.\nNo one.\nThose in the training set.\nThose in the test set.\n\nFrom Keyes (2019), what is data science (pick one)?\n\nData science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from many structured and unstructured data.\nThe quantitative analysis of large amounts of data for the purpose of decision-making.\nThe inhumane reduction of humanity down to what can be counted.\n\nFrom Keyes (2019), what is one consequence of data systems that require standardized categories (pick one)?\n\nWorse user experience.\nCompromised security measures.\nIncreased innovation in technology.\nErasure of individual identities and experiences.\n\nFrom Healy (2020), what is a common criticism about working with data (pick one)?\n\nThat it is too time-consuming and inefficient.\nThat it distances one from the reality of human lives behind the numbers.\nThat it requires expensive software and extensive training to analyse.\n\nFrom Healy (2020), what is a response that criticism (pick one)?\n\nWorking with data forces a confrontation with questions of meaning.\nData analysis should not be done.\nData should only be analyzed by automated processes.\nQualitative approaches should be the predominate approach.\n\nHow can you reconcile Keyes (2019) and Healy (2020)?\nWhy is ethics a key element of data science (pick one)?\n\nBecause data science always involves sensitive personal information.\nBecause ethical considerations make the analysis easier to do.\nBecause datasets likely concern humans and require consideration of context.\nBecause regulations require ethics approval for any data analysis.\n\nAccording to Crawford (2021), as described in this chapter, which of the following forces shape our world, and hence our data (select all that 
apply)?\n\nPolitical.\nPhysical.\nHistorical.\nCultural.\nSocial.\n\nFrom Ford (2015), what is a compiler (pick one)?\n\nSoftware that takes the symbols you typed into a file and transforms them into lower-level instructions.\nA sequence of symbols (using typical keyboard characters, saved to a file of some kind) that someone typed in, or copied, or pasted from elsewhere.\nA clock with benefits.\nPutting holes in punch cards, then into a box, then loading them, then the computer flips through the cards, identify where the holes were, and update parts of its memory.\n\nConsider the results of a survey that asked about gender. It finds the following counts: “man: 879”, “woman: 912”, “non-binary: 10” “prefer not to say: 3”, and “other: 1”. What is the appropriate way to consider “prefer not to say” (pick one)?\n\nDrop them.\nIt depends.\nInclude them.\nMerge it into “other”.\n\nImagine that you have a job in which including race and/or sexuality as predictors improves the performance of your model. 
When deciding whether to include these in your analysis, what factors would you consider (in your own words)?\nWhat is meant by reproducibility in data science (pick one)?\n\nBeing able to produce similar results with different datasets.\nEnsuring that all steps of the analysis can be independently redone by others.\nPublishing results in peer-reviewed journals.\nUsing proprietary software to protect data.\n\nWhat is a challenge associated with measurement (pick one)?\n\nIt is usually straightforward and requires little attention.\nDeciding what and how to measure is complex and context-dependent.\nData collection is objective and free from bias.\nMeasurements are always accurate and consistent over time.\n\nIn the analogy to the sculptor, what does the act of sculpting represent in the data workflow (pick one)?\n\nCreating complex models to fit the data.\nAcquiring raw data.\nCleaning and preparing the data to reveal the needed dataset.\nVisualizing the results.\n\nWhy is exploratory data analysis (EDA) an open-ended process (pick one)?\n\nBecause it has a fixed set of steps to follow.\nBecause it requires ongoing iteration to understand the data’s shape and patterns.\nBecause it involves testing hypotheses in a structured way.\nBecause it can be automated.\n\nWhy should statistical models be used carefully (pick one)?\n\nBecause they always provide definitive results.\nBecause they can reflect the decisions made in earlier stages.\nBecause they are too complicated for most audiences.\nBecause they are unnecessary if the data are well-presented.\n\nWhat is one lesson from thinking about the challenges of measuring height (pick one)?\n\nHeight is a straightforward measurement with little variability.\nAll measurements are accurate if done with the right instrument.\nEven simple measurements can have complexities that affect data quality.\nHeight is not a useful variable in data analysis.\n\nWhat is the danger of not considering who is missing from a dataset (pick 
one)?\n\nIt has no significant impact on the analysis.\nIt simplifies the analysis by reducing the amount of data.\nIt can lead to conclusions that do not represent the full context.\n\nWhat is a purpose of statistical modeling (pick one)?\n\nAs a tool to help explore and understand the data.\nTo prove hypotheses.\nTo replace exploratory data analysis.\n\nWhat is meant by “our data are a simplification of the messy, complex world” (pick one)?\n\nData perfectly capture all aspects of reality.\nData simplify reality to make analysis possible, but they cannot capture every detail.\nData are always inaccurate and useless.\n\n\n\n\nActivity\nThe purpose of this activity is to clarify in your mind the difficulty of measurement, even of seemingly simple things, and hence the likelihood of measurement issues in more complicated areas.\nPlease obtain some seeds for a fast-growing plant such as radishes, mustard greens, or arugula. Plant the seeds and measure how much soil you used. Water them and measure the water you used. Each day take a note of any changes. More generally, measure and record as much as you can. Note your thoughts about the difficulty of measurement. Eventually your seeds will sprout, and you should measure how they grow.\n\n\n\n\nBandy, John, and Nicholas Vincent. 2021. “Addressing ‘Documentation Debt’ in Machine Learning: A Retrospective Datasheet for BookCorpus.” In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, edited by J. Vanschoren and S. Yeung. Vol. 1. https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/54229abfcfa5649e7003b83dd4755294-Paper-round1.pdf.\n\n\nBaumer, Benjamin, Daniel Kaplan, and Nicholas Horton. 2021. Modern Data Science With R. 2nd ed. Chapman; Hall/CRC. https://mdsr-book.github.io/mdsr2e/.\n\n\nBlei, David, Andrew Ng, and Michael Jordan. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3 (Jan): 993–1022. 
https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf.\n\n\nBoland, Philip. 1984. “A Biographical Glimpse of William Sealy Gosset.” The American Statistician 38 (3): 179–83. https://doi.org/10.2307/2683648.\n\n\nBowley, Arthur Lyon. 1901. Elements of Statistics. London: P. S. King.\n\n\nBradley, Valerie, Shiro Kuriwaki, Michael Isakov, Dino Sejdinovic, Xiao-Li Meng, and Seth Flaxman. 2021. “Unrepresentative Big Surveys Significantly Overestimated US Vaccine Uptake.” Nature 600 (7890): 695–700. https://doi.org/10.1038/s41586-021-04198-4.\n\n\nCook, Dianne, Nancy Reid, and Emi Tanaka. 2021. “The Foundation Is Available for Thinking about Data Visualization Inferentially.” Harvard Data Science Review 3 (3). https://doi.org/10.1162/99608f92.8453435d.\n\n\nCraiu, Radu. 2019. “The Hiring Gambit: In Search of the Twofer Data Scientist.” Harvard Data Science Review 1 (1). https://doi.org/10.1162/99608f92.440445cb.\n\n\nCrawford, Kate. 2021. Atlas of AI. 1st ed. New Haven: Yale University Press.\n\n\nD’Ignazio, Catherine, and Lauren Klein. 2020. Data Feminism. Massachusetts: The MIT Press. https://data-feminism.mitpress.mit.edu.\n\n\nDattani, Saloni. 2024. “The Rise in Reported Maternal Mortality Rates in the US Is Largely Due to a Change in Measurement.” Our World in Data.\n\n\nDolatsara, Hamidreza Ahady, Ying-Ju Chen, Robert Leonard, Fadel Megahed, and Allison Jones-Farmer. 2021. “Explaining Predictive Model Performance: An Experimental Study of Data Preparation and Model Choice.” Big Data, October. https://doi.org/10.1089/big.2021.0067.\n\n\nEdgeworth, Francis Ysidro. 1885. “Methods of Statistics.” Journal of the Statistical Society of London, 181–217.\n\n\nFord, Paul. 2015. “What Is Code?” Bloomberg Businessweek, June. https://www.bloomberg.com/graphics/2015-paul-ford-what-is-code/.\n\n\nForster, Edward Morgan. 1927. Aspects of the Novel. London: Edward Arnold.\n\n\nFoster, Gordon. 1968. “Computers, Statistics and Planning: Systems or Chaos?” Geary Lecture. 
https://www.esri.ie/system/files/publications/GLS2.pdf.\n\n\nFranklin, Laura. 2005. “Exploratory Experiments.” Philosophy of Science 72 (5): 888–99. https://doi.org/10.1086/508117.\n\n\nGould, Elliot, Hannah Fraser, Timothy Parker, Shinichi Nakagawa, Simon Griffith, Peter Vesk, and Fiona Fidler. 2023. “Same Data, Different Analysts: Variation in Effect Sizes Due to Analytical Decisions in Ecology and Evolutionary Biology,” October. https://doi.org/10.32942/x2gg62.\n\n\nHamming, Richard. (1997) 2020. The Art of Doing Science and Engineering. 2nd ed. Stripe Press.\n\n\nHealy, Kieran. 2020. “The Kitchen Counter Observatory,” May. https://kieranhealy.org/blog/archives/2020/05/21/the-kitchen-counter-observatory/.\n\n\nHeil, Benjamin, Michael Hoffman, Florian Markowetz, Su-In Lee, Casey Greene, and Stephanie Hicks. 2021. “Reproducibility Standards for Machine Learning in the Life Sciences.” Nature Methods 18 (10): 1132–35. https://doi.org/10.1038/s41592-021-01256-7.\n\n\nHullman, Jessica, and Andrew Gelman. 2021. “Designing for Interactive Exploratory Data Analysis Requires Theories of Graphical Inference.” Harvard Data Science Review 3 (3). https://doi.org/10.1162/99608f92.3ab8a587.\n\n\nHuntington-Klein, Nick, Andreu Arenas, Emily Beam, Marco Bertoni, Jeffrey Bloem, Pralhad Burli, Naibin Chen, et al. 2021. “The Influence of Hidden Researcher Decisions in Applied Microeconomics.” Economic Inquiry 59: 944–60. https://doi.org/10.1111/ecin.12992.\n\n\nJames, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. (2013) 2021. An Introduction to Statistical Learning with Applications in R. 2nd ed. Springer. https://www.statlearning.com.\n\n\nJordan, Michael. 2019. “Artificial Intelligence–The Revolution Hasn’t Happened Yet.” Harvard Data Science Review 1 (1). https://doi.org/10.1162/99608f92.f06c6e61.\n\n\nKennedy, Lauren, Katharine Khanna, Daniel Simpson, Andrew Gelman, Yajun Jia, and Julien Teitler. 2022. 
“He, She, They: Using Sex and Gender in Survey Adjustment.” https://arxiv.org/abs/2009.14401.\n\n\nKent, William. 1993. “My Height: A Model for Numeric Information.” https://www.bkent.net/Doc/myheight.htm.\n\n\nKeyes, Os. 2019. “Counting the Countless.” Real Life. https://reallifemag.com/counting-the-countless/.\n\n\nLeek, Jeff, and Roger Peng. 2020. “Advanced Data Science 2020.” http://jtleek.com/ads2020/index.html.\n\n\nMacDorman, Marian, and Eugene Declercq. 2018. “The Failure of United States Maternal Mortality Reporting and Its Impact on Women’s Lives.” Birth 45 (2): 105–8. https://doi.org/1111/birt.12333.\n\n\nMcElreath, Richard. (2015) 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. 2nd ed. Chapman; Hall/CRC.\n\n\nMeng, Xiao-Li. 2018. “Statistical Paradises and Paradoxes in Big Data (i): Law of Large Populations, Big Data Paradox, and the 2016 US Presidential Election.” The Annals of Applied Statistics 12 (2): 685–726. https://doi.org/10.1214/18-AOAS1161SF.\n\n\nRegister, Yim. 2020. “Data Science Ethics in 6 Minutes.” YouTube, December. https://youtu.be/mA4gypAiRYU.\n\n\nRilke, Rainer Maria. (1929) 2014. Letters to a Young Poet. Penguin Classics.\n\n\nThornhill, John. 2021. “Lunch with the FT: Mathematician Hannah Fry.” Financial Times, July. https://www.ft.com/content/a5e33e5a-99b9-4bbc-948f-8a527c7675c3.\n\n\nTimbers, Tiffany, Trevor Campbell, and Melissa Lee. 2022. Data Science: A First Introduction. Chapman; Hall/CRC. https://datasciencebook.ca.\n\n\nWasserman, Larry. 2005. All of Statistics. Springer.\n\n\nWickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. (2016) 2023. R for Data Science. 2nd ed. O’Reilly Media. https://r4ds.hadley.nz.\n\n\nWorld Health Organization. 2019. “Trends in Maternal Mortality 2000 to 2017: Estimates by WHO, UNICEF, UNFPA, World Bank Group and the United Nations Population Division.” https://apps.who.int/iris/handle/10665/327596.",
"crumbs": [
"Foundations",
"1 Telling stories with data"
@@ -203,7 +203,7 @@
"href": "02-drinking_from_a_fire_hose.html#australian-elections",
"title": "2 Drinking from a fire hose",
"section": "2.2 Australian elections",
- "text": "2.2 Australian elections\nAustralia is a parliamentary democracy with 151 seats in the House of Representatives, which is the lower house and that from which government is formed. There are two major parties—“Liberal” and “Labor”—two minor parties—“Nationals” and “Greens”—and many smaller parties and independents. In this example we will create a graph of the number of seats that each party won in the 2022 Federal Election.\n\n2.2.1 Plan\nFor this example, we need to plan two aspects. The first is what the dataset that we need will look like, and the second is what the final graph will look like.\nThe basic requirement for the dataset is that it has the name of the seat (sometimes called a “division” in Australia) and the party of the person elected. A quick sketch of the dataset that we would need is Figure 2.2 (a).\n\n\n\n\n\n\n\n\n\n\n\n(a) Quick sketch of a dataset that could be useful for analyzing Australian elections\n\n\n\n\n\n\n\n\n\n\n\n(b) Quick sketch of a possible graph of the number of seats won by each party\n\n\n\n\n\n\n\nFigure 2.2: Sketches of a potential dataset and graph related to an Australian election\n\n\n\nWe also need to plan the graph that we are interested in. Given we want to display the number of seats that each party won, a quick sketch of what we might aim for is Figure 2.2 (b).\n\n\n2.2.2 Simulate\nWe now simulate some data, to bring some specificity to our sketches.\nTo get started, within Posit Cloud, make a new Quarto document: “File” \\(\\rightarrow\\) “New File” \\(\\rightarrow\\) “Quarto document\\(\\dots\\)”. Give it a title, such as “Exploring the 2022 Australian Election”, add your name as author, and unclick “Use visual markdown editor” (Figure 2.3 (a)). 
Leave the other options as their default, and then click “Create”.\n\n\n\n\n\n\n\n\n\n\n\n(a) Creating a new Quarto document\n\n\n\n\n\n\n\n\n\n\n\n(b) Installing rmarkdown if necessary\n\n\n\n\n\n\n\n\n\n\n\n\n\n(c) After initial setup and with a preamble\n\n\n\n\n\n\n\n\n\n\n\n(d) Highlighting the green arrow to run the chunk\n\n\n\n\n\n\n\n\n\n\n\n\n\n(e) Highlighting the cross to remove the messages\n\n\n\n\n\n\n\n\n\n\n\n(f) Highlighting the render button\n\n\n\n\n\n\n\nFigure 2.3: Getting started with a Quarto document\n\n\n\nYou may get a notification along the lines of “Package rmarkdown required\\(\\dots\\).” (Figure 2.3 (b)). If that happens, click “Install”. For this example, we will put everything into this one Quarto document. You should save it as “australian_elections.qmd”: “File” \\(\\rightarrow\\) “Save As\\(\\dots\\)”.\nRemove almost all the default content, and then beneath the heading material create a new R code chunk: “Code” \\(\\rightarrow\\) “Insert Chunk”. Then add preamble documentation that explains:\n\nthe purpose of the document;\nthe author and contact details;\nwhen the file was written or last updated; and\nprerequisites that the file relies on.\n\n\n#### Preamble ####\n# Purpose: Read in data from the 2022 Australian Election and make\n# a graph of the number of seats each party won.\n# Author: Rohan Alexander\n# Email: rohan.alexander@utoronto.ca\n# Date: 1 January 2023\n# Prerequisites: Know where to get Australian elections data.\n\nIn R, lines that start with “#” are comments. This means that they are not run as code by R, but are instead designed to be read by humans. Each line of this preamble should start with a “#”. Also make it clear that this is the preamble section by surrounding that with “####”. The result should look like Figure 2.3 (c).\nAfter this we need to setup the workspace. This involves installing and loading any packages that will be needed. 
A package only needs to be installed once for each computer, but needs to be loaded each time it is to be used. In this case we are going to use the tidyverse and janitor packages. They will need to be installed because this is the first time they are being used, and then each will need to be loaded.\n\n\n\n\n\n\nShoulders of giants\n\n\n\nHadley Wickham is Chief Scientist at RStudio. After earning a PhD in Statistics from Iowa State University in 2008 he was appointed as an assistant professor at Rice University, and became Chief Scientist at RStudio, now Posit, in 2013. He developed the tidyverse collection of packages, and has published many books including R for Data Science (Wickham, Çetinkaya-Rundel, and Grolemund [2016] 2023) and Advanced R (Wickham 2019). He was awarded the COPSS Presidents’ Award in 2019.\n\n\nAn example of installing the packages follows. Run this code by clicking the small green arrow associated with the R code chunk (Figure 2.3 (d)).\n\n#### Workspace setup ####\ninstall.packages(\"tidyverse\")\ninstall.packages(\"janitor\")\n\nNow that the packages are installed, they need to be loaded. As that package installation step only needs to be done once per computer, that code should be commented out so that it is not accidentally run, or even just removed. Additionally, we can remove the message that printed when we installed the packages (Figure 2.3 (e)).\n\n#### Workspace setup ####\n# install.packages(\"tidyverse\")\n# install.packages(\"janitor\")\n\nlibrary(tidyverse)\nlibrary(janitor)\n\nWe can render the entire document by clicking “Render” (Figure 2.3 (f)). When you do this, you may be asked to install some packages. If that happens, then you should agree to this. This will result in a HTML document.\nFor an introduction to the packages that were just installed, each package contains a help file that provides information about them and their functions. 
It can be accessed by prepending a question mark to the package name and then running that code in the console. For instance ?tidyverse.\nTo simulate our data, we need to create a dataset with two variables: “Division” and “Party”, and some values for each. In the case of “Division” reasonable values would be a name of one of the 151 Australian divisions. In the case of “Party” reasonable values would be one of the following five: “Liberal”, “Labor”, “National”, “Green”, or “Other”. Again, this code can be run by clicking the small green arrow associated with the R code chunk.\n\nsimulated_data <-\n tibble(\n # Use 1 through to 151 to represent each division\n \"Division\" = 1:151,\n # Randomly pick an option, with replacement, 151 times\n \"Party\" = sample(\n x = c(\"Liberal\", \"Labor\", \"National\", \"Green\", \"Other\"),\n size = 151,\n replace = TRUE\n )\n )\n\nsimulated_data\n\n# A tibble: 151 × 2\n Division Party \n <int> <chr> \n 1 1 Green \n 2 2 Labor \n 3 3 Labor \n 4 4 Other \n 5 5 National\n 6 6 National\n 7 7 Labor \n 8 8 Liberal \n 9 9 Labor \n10 10 Green \n# ℹ 141 more rows\n\n\nAt a certain point, your code will not run and you will want to ask others for help. Do not take a screenshot of a small snippet of the code and expect that someone will be able to help based on that. They, almost surely, cannot. Instead, you need to provide them with your whole script in a way that they can run. We will explain what GitHub is more completely in Chapter 3, but for now, if you need help, then you should naively create a GitHub Gist which will enable you to share your code in a way that is more helpful than taking a screenshot. The first step is to create a free account on GitHub (Figure 2.4 (a)). Thinking about an appropriate username is important because this will become part of your professional profile. It would make sense to have a username that is professional, independent of any course, and ideally related to your real name. 
Then look for a “+” in the top right, and select “New gist” (Figure 2.4 (b)).\n\n\n\n\n\n\n\n\n\n\n\n(a) GitHub sign-up screen\n\n\n\n\n\n\n\n\n\n\n\n(b) New GitHub Gist\n\n\n\n\n\n\n\n\n\n\n\n\n\n(c) Create a public GitHub Gist to share code\n\n\n\n\n\n\n\nFigure 2.4: Creating a Gist to share code when asking for help\n\n\n\nFrom here you should add all the code to that Gist, not just the final bit that is giving an error. And give it a meaningful filename that includes “.R” at the end, for instance, “australian_elections.R”. In Figure 2.4 (c) it will turn out that we have incorrect capitalization, library(Tidyverse) instead of library(tidyverse).\nClick “Create public gist”. We can then share the URL to this Gist with whoever we are asking to help, explain what the problem is, and what we are trying to achieve. It will be easier for them to help, because all the code is available.\n\n\n2.2.3 Acquire\nNow we want to get the actual data. The data we need is from the Australian Electoral Commission (AEC), which is the non-partisan agency that organizes Australian federal elections. We can pass a page of their website to read_csv() from readr. We do not need to explicitly load readr because it is part of the tidyverse. The <- or “assignment operator” allocates the output of read_csv() to an object called “raw_elections_data”.\n\n#### Read in the data ####\nraw_elections_data <-\n read_csv(\n file = \n \"https://results.aec.gov.au/27966/website/Downloads/HouseMembersElectedDownload-27966.csv\",\n show_col_types = FALSE,\n skip = 1\n )\n\n# We have read the data from the AEC website. 
We may like to save\n# it in case something happens or they move it.\nwrite_csv(\n x = raw_elections_data,\n file = \"australian_voting.csv\"\n)\n\nWe can take a quick look at the dataset using head() which will show the first six rows, and tail() which will show the last six rows.\n\nhead(raw_elections_data)\n\n# A tibble: 6 × 8\n DivisionID DivisionNm StateAb CandidateID GivenNm Surname PartyNm PartyAb\n <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> \n1 179 Adelaide SA 36973 Steve GEORGANAS Austral… ALP \n2 197 Aston VIC 36704 Alan TUDGE Liberal LP \n3 198 Ballarat VIC 36409 Catherine KING Austral… ALP \n4 103 Banks NSW 37018 David COLEMAN Liberal LP \n5 180 Barker SA 37083 Tony PASIN Liberal LP \n6 104 Barton NSW 36820 Linda BURNEY Austral… ALP \n\ntail(raw_elections_data)\n\n# A tibble: 6 × 8\n DivisionID DivisionNm StateAb CandidateID GivenNm Surname PartyNm PartyAb\n <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> \n1 152 Wentworth NSW 37451 Allegra SPENDER Indepen… IND \n2 153 Werriwa NSW 36810 Anne Maree STANLEY Austral… ALP \n3 150 Whitlam NSW 36811 Stephen JONES Austral… ALP \n4 178 Wide Bay QLD 37506 Llew O'BRIEN Liberal… LNP \n5 234 Wills VIC 36452 Peter KHALIL Austral… ALP \n6 316 Wright QLD 37500 Scott BUCHHOLZ Liberal… LNP \n\n\nWe need to clean the data so that we can use it. We are trying to make it similar to the dataset that we thought we wanted in the planning stage. While it is fine to move away from the plan, this needs to be a deliberate, reasoned decision. After reading in the dataset that we saved, the first thing that we will do is adjust the names of the variables. 
We will do this using clean_names() from janitor.\n\n#### Basic cleaning ####\nraw_elections_data <-\n read_csv(\n file = \"australian_voting.csv\",\n show_col_types = FALSE\n )\n\n\n# Make the names easier to type\ncleaned_elections_data <-\n clean_names(raw_elections_data)\n\n# Have a look at the first six rows\nhead(cleaned_elections_data)\n\n# A tibble: 6 × 8\n division_id division_nm state_ab candidate_id given_nm surname party_nm \n <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> \n1 179 Adelaide SA 36973 Steve GEORGANAS Australian …\n2 197 Aston VIC 36704 Alan TUDGE Liberal \n3 198 Ballarat VIC 36409 Catherine KING Australian …\n4 103 Banks NSW 37018 David COLEMAN Liberal \n5 180 Barker SA 37083 Tony PASIN Liberal \n6 104 Barton NSW 36820 Linda BURNEY Australian …\n# ℹ 1 more variable: party_ab <chr>\n\n\nThe names are faster to type because RStudio will auto-complete them. To do this, we begin typing the name of a variable and then use the “tab” key to complete it.\nThere are many variables in the dataset, and we are primarily interested in two: “division_nm” and “party_nm”. We can choose certain variables of interest with select() from dplyr which we loaded as part of the tidyverse. The “pipe operator”, |>, pushes the output of one line to be the first input of the function on the next line.\n\ncleaned_elections_data <-\n cleaned_elections_data |>\n select(\n division_nm,\n party_nm\n )\n\nhead(cleaned_elections_data)\n\n# A tibble: 6 × 2\n division_nm party_nm \n <chr> <chr> \n1 Adelaide Australian Labor Party\n2 Aston Liberal \n3 Ballarat Australian Labor Party\n4 Banks Liberal \n5 Barker Liberal \n6 Barton Australian Labor Party\n\n\nSome of the variable names are still not obvious because they are abbreviated. We can look at the names of the columns in this dataset with names(). 
And we can change the names using rename() from dplyr.\n\nnames(cleaned_elections_data)\n\n[1] \"division_nm\" \"party_nm\" \n\n\n\ncleaned_elections_data <-\n cleaned_elections_data |>\n rename(\n division = division_nm,\n elected_party = party_nm\n )\n\nhead(cleaned_elections_data)\n\n# A tibble: 6 × 2\n division elected_party \n <chr> <chr> \n1 Adelaide Australian Labor Party\n2 Aston Liberal \n3 Ballarat Australian Labor Party\n4 Banks Liberal \n5 Barker Liberal \n6 Barton Australian Labor Party\n\n\nWe could now look at the unique values in the “elected_party” column using unique().\n\ncleaned_elections_data$elected_party |>\n unique()\n\n[1] \"Australian Labor Party\" \n[2] \"Liberal\" \n[3] \"Liberal National Party of Queensland\"\n[4] \"The Greens\" \n[5] \"The Nationals\" \n[6] \"Independent\" \n[7] \"Katter's Australian Party (KAP)\" \n[8] \"Centre Alliance\" \n\n\nAs there is more detail in this than we wanted, we may want to simplify the party names to match what we simulated, using case_match() from dplyr.\n\ncleaned_elections_data <-\n cleaned_elections_data |>\n mutate(\n elected_party =\n case_match(\n elected_party,\n \"Australian Labor Party\" ~ \"Labor\",\n \"Liberal National Party of Queensland\" ~ \"Liberal\",\n \"Liberal\" ~ \"Liberal\",\n \"The Nationals\" ~ \"Nationals\",\n \"The Greens\" ~ \"Greens\",\n \"Independent\" ~ \"Other\",\n \"Katter's Australian Party (KAP)\" ~ \"Other\",\n \"Centre Alliance\" ~ \"Other\"\n )\n )\n\nhead(cleaned_elections_data)\n\n# A tibble: 6 × 2\n division elected_party\n <chr> <chr> \n1 Adelaide Labor \n2 Aston Liberal \n3 Ballarat Labor \n4 Banks Liberal \n5 Barker Liberal \n6 Barton Labor \n\n\nOur data now matches our plan (Figure 2.2 (a)). For every electoral division we have the party of the person that won it.\nHaving now nicely cleaned the dataset, we should save it, so that we can start with that cleaned dataset in the next stage. 
We should make sure to save it under a new file name so we are not replacing the raw data, and so that it is easy to identify the cleaned dataset later.\n\nwrite_csv(\n x = cleaned_elections_data,\n file = \"cleaned_elections_data.csv\"\n)\n\n\n\n2.2.4 Explore\nWe may like to explore the dataset that we created. One way to better understand a dataset is to make a graph. In particular, here we would like to build the graph that we planned in Figure 2.2 (b).\nFirst, we read in the dataset that we just created.\n\n#### Read in the data ####\ncleaned_elections_data <-\n read_csv(\n file = \"cleaned_elections_data.csv\",\n show_col_types = FALSE\n )\n\nWe can get a quick count of how many seats each party won using count() from dplyr.\n\ncleaned_elections_data |>\n count(elected_party)\n\n# A tibble: 5 × 2\n elected_party n\n <chr> <int>\n1 Greens 4\n2 Labor 77\n3 Liberal 48\n4 Nationals 10\n5 Other 12\n\n\nTo build the graph that we are interested in, we use ggplot2 which is part of the tidyverse. The key aspect of this package is that we build graphs by adding layers using “+”, which we call the “add operator”. In particular we will create a bar chart using geom_bar() from ggplot2 (Figure 2.5 (a)).\ncleaned_elections_data |>\n ggplot(aes(x = elected_party)) + # aes abbreviates \"aesthetics\" \n geom_bar()\n\ncleaned_elections_data |>\n ggplot(aes(x = elected_party)) +\n geom_bar() +\n theme_minimal() + # Make the theme neater\n labs(x = \"Party\", y = \"Number of seats\") # Make labels more meaningful\n\n\n\n\n\n\n\n\n\n\n\n(a) Default options\n\n\n\n\n\n\n\n\n\n\n\n(b) Improved theme and labels\n\n\n\n\n\n\n\nFigure 2.5: Number of seats won, by political party, at the 2022 Australian Federal Election\n\n\n\nFigure 2.5 (a) accomplishes what we set out to do. But we can make it look a bit nicer by modifying the default options and improving the labels (Figure 2.5 (b)).\n\n\n2.2.5 Share\nTo this point we have downloaded some data, cleaned it, and made a graph. 
We would typically need to communicate what we have done at some length. In this case, we can write a few paragraphs about what we did, why we did it, and what we found to conclude our workflow. An example follows.\n\nAustralia is a parliamentary democracy with 151 seats in the House of Representatives, which is the house from which government is formed. There are two major parties—“Liberal” and “Labor”—two minor parties—“Nationals” and “Greens”—and many smaller parties. The 2022 Federal Election occurred on 21 May, and around 15 million votes were cast. We were interested in the number of seats that were won by each party.\nWe downloaded the results, on a seat-specific basis, from the Australian Electoral Commission website. We cleaned and tidied the dataset using the statistical programming language R (R Core Team 2023) including the tidyverse (Wickham et al. 2019) and janitor (Firke 2023). We then created a graph of the number of seats that each political party won (Figure 2.5).\nWe found that the Labor Party won 77 seats, followed by the Liberal Party with 48 seats. The minor parties won the following number of seats: the Nationals won 10 seats and the Greens won 4 seats. Finally, there were 10 Independents elected as well as candidates from smaller parties.\nThe distribution of seats is skewed toward the two major parties which could reflect relatively stable preferences on the part of Australian voters, or possibly inertia due to the benefits of already being a major party such a national network or funding. A better understanding of the reasons for this distribution are of interest in future work. While the dataset consists of everyone who voted, it worth noting that in Australia some are systematically excluded from voting, and it is much more difficult for some to vote than others.\n\nOne aspect to be especially concerned with is making sure that this communication is focused on the needs of the audience and telling a story. 
Data journalism provides some excellent examples of how analysis needs to be tailored to the audience, for instance, Cardoso (2020) and Bronner (2020).",
+ "text": "2.2 Australian elections\nAustralia is a parliamentary democracy with 151 seats in the House of Representatives, which is the lower house and that from which government is formed. There are two major parties—“Liberal” and “Labor”—two minor parties—“Nationals” and “Greens”—and many smaller parties and independents. In this example we will create a graph of the number of seats that each party won in the 2022 Federal Election.\n\n2.2.1 Plan\nFor this example, we need to plan two aspects. The first is what the dataset that we need will look like, and the second is what the final graph will look like.\nThe basic requirement for the dataset is that it has the name of the seat (sometimes called a “division” in Australia) and the party of the person elected. A quick sketch of the dataset that we would need is Figure 2.2 (a).\n\n\n\n\n\n\n\n\n\n\n\n(a) Quick sketch of a dataset that could be useful for analyzing Australian elections\n\n\n\n\n\n\n\n\n\n\n\n(b) Quick sketch of a possible graph of the number of seats won by each party\n\n\n\n\n\n\n\nFigure 2.2: Sketches of a potential dataset and graph related to an Australian election\n\n\n\nWe also need to plan the graph that we are interested in. Given we want to display the number of seats that each party won, a quick sketch of what we might aim for is Figure 2.2 (b).\n\n\n2.2.2 Simulate\nWe now simulate some data, to bring some specificity to our sketches.\nTo get started, within Posit Cloud, make a new Quarto document: “File” \\(\\rightarrow\\) “New File” \\(\\rightarrow\\) “Quarto document\\(\\dots\\)”. Give it a title, such as “Exploring the 2022 Australian Election”, add your name as author, and unclick “Use visual markdown editor” (Figure 2.3 (a)). 
Leave the other options as their default, and then click “Create”.\n\n\n\n\n\n\n\n\n\n\n\n(a) Creating a new Quarto document\n\n\n\n\n\n\n\n\n\n\n\n(b) Installing rmarkdown if necessary\n\n\n\n\n\n\n\n\n\n\n\n\n\n(c) After initial setup and with a preamble\n\n\n\n\n\n\n\n\n\n\n\n(d) Highlighting the green arrow to run the chunk\n\n\n\n\n\n\n\n\n\n\n\n\n\n(e) Highlighting the cross to remove the messages\n\n\n\n\n\n\n\n\n\n\n\n(f) Highlighting the render button\n\n\n\n\n\n\n\nFigure 2.3: Getting started with a Quarto document\n\n\n\nYou may get a notification along the lines of “Package rmarkdown required\\(\\dots\\).” (Figure 2.3 (b)). If that happens, click “Install”. For this example, we will put everything into this one Quarto document. You should save it as “australian_elections.qmd”: “File” \\(\\rightarrow\\) “Save As\\(\\dots\\)”.\nRemove almost all the default content, and then beneath the heading material create a new R code chunk: “Code” \\(\\rightarrow\\) “Insert Chunk”. Then add preamble documentation that explains:\n\nthe purpose of the document;\nthe author and contact details;\nwhen the file was written or last updated; and\nprerequisites that the file relies on.\n\n\n#### Preamble ####\n# Purpose: Read in data from the 2022 Australian Election and make\n# a graph of the number of seats each party won.\n# Author: Rohan Alexander\n# Email: rohan.alexander@utoronto.ca\n# Date: 1 January 2023\n# Prerequisites: Know where to get Australian elections data.\n\nIn R, lines that start with “#” are comments. This means that they are not run as code by R, but are instead designed to be read by humans. Each line of this preamble should start with a “#”. Also make it clear that this is the preamble section by surrounding that with “####”. The result should look like Figure 2.3 (c).\nAfter this we need to setup the workspace. This involves installing and loading any packages that will be needed. 
A package only needs to be installed once for each computer, but needs to be loaded each time it is to be used. In this case we are going to use the tidyverse and janitor packages. They will need to be installed because this is the first time they are being used, and then each will need to be loaded.\n\n\n\n\n\n\nShoulders of giants\n\n\n\nHadley Wickham is Chief Scientist at RStudio. After earning a PhD in Statistics from Iowa State University in 2008 he was appointed as an assistant professor at Rice University, and became Chief Scientist at RStudio, now Posit, in 2013. He developed the tidyverse collection of packages, and has published many books including R for Data Science (Wickham, Çetinkaya-Rundel, and Grolemund [2016] 2023) and Advanced R (Wickham 2019). He was awarded the COPSS Presidents’ Award in 2019.\n\n\nAn example of installing the packages follows. Run this code by clicking the small green arrow associated with the R code chunk (Figure 2.3 (d)).\n\n#### Workspace setup ####\ninstall.packages(\"tidyverse\")\ninstall.packages(\"janitor\")\n\nNow that the packages are installed, they need to be loaded. As that package installation step only needs to be done once per computer, that code should be commented out so that it is not accidentally run, or even just removed. Additionally, we can remove the message that printed when we installed the packages (Figure 2.3 (e)).\n\n#### Workspace setup ####\n# install.packages(\"tidyverse\")\n# install.packages(\"janitor\")\n\nlibrary(tidyverse)\nlibrary(janitor)\n\nWe can render the entire document by clicking “Render” (Figure 2.3 (f)). When you do this, you may be asked to install some packages. If that happens, then you should agree to this. This will result in a HTML document.\nFor an introduction to the packages that were just installed, each package contains a help file that provides information about them and their functions. 
It can be accessed by prepending a question mark to the package name and then running that code in the console. For instance ?tidyverse.\nTo simulate our data, we need to create a dataset with two variables: “Division” and “Party”, and some values for each. In the case of “Division” reasonable values would be a name of one of the 151 Australian divisions. In the case of “Party” reasonable values would be one of the following five: “Liberal”, “Labor”, “National”, “Green”, or “Other”. Again, this code can be run by clicking the small green arrow associated with the R code chunk.\n\nsimulated_data <-\n tibble(\n # Use 1 through to 151 to represent each division\n \"Division\" = 1:151,\n # Randomly pick an option, with replacement, 151 times\n \"Party\" = sample(\n x = c(\"Liberal\", \"Labor\", \"National\", \"Green\", \"Other\"),\n size = 151,\n replace = TRUE\n )\n )\n\nsimulated_data\n\n# A tibble: 151 × 2\n Division Party \n <int> <chr> \n 1 1 Other \n 2 2 Green \n 3 3 Liberal \n 4 4 National\n 5 5 Other \n 6 6 Liberal \n 7 7 Liberal \n 8 8 Labor \n 9 9 Liberal \n10 10 Labor \n# ℹ 141 more rows\n\n\nAt a certain point, your code will not run and you will want to ask others for help. Do not take a screenshot of a small snippet of the code and expect that someone will be able to help based on that. They, almost surely, cannot. Instead, you need to provide them with your whole script in a way that they can run. We will explain what GitHub is more completely in Chapter 3, but for now, if you need help, then you should naively create a GitHub Gist which will enable you to share your code in a way that is more helpful than taking a screenshot. The first step is to create a free account on GitHub (Figure 2.4 (a)). Thinking about an appropriate username is important because this will become part of your professional profile. It would make sense to have a username that is professional, independent of any course, and ideally related to your real name. 
Then look for a “+” in the top right, and select “New gist” (Figure 2.4 (b)).\n\n\n\n\n\n\n\n\n\n\n\n(a) GitHub sign-up screen\n\n\n\n\n\n\n\n\n\n\n\n(b) New GitHub Gist\n\n\n\n\n\n\n\n\n\n\n\n\n\n(c) Create a public GitHub Gist to share code\n\n\n\n\n\n\n\nFigure 2.4: Creating a Gist to share code when asking for help\n\n\n\nFrom here you should add all the code to that Gist, not just the final bit that is giving an error. And give it a meaningful filename that includes “.R” at the end, for instance, “australian_elections.R”. In Figure 2.4 (c) it will turn out that we have incorrect capitalization, library(Tidyverse) instead of library(tidyverse).\nClick “Create public gist”. We can then share the URL to this Gist with whoever we are asking to help, explain what the problem is, and what we are trying to achieve. It will be easier for them to help, because all the code is available.\n\n\n2.2.3 Acquire\nNow we want to get the actual data. The data we need is from the Australian Electoral Commission (AEC), which is the non-partisan agency that organizes Australian federal elections. We can pass a page of their website to read_csv() from readr. We do not need to explicitly load readr because it is part of the tidyverse. The <- or “assignment operator” allocates the output of read_csv() to an object called “raw_elections_data”.\n\n#### Read in the data ####\nraw_elections_data <-\n read_csv(\n file = \n \"https://results.aec.gov.au/27966/website/Downloads/HouseMembersElectedDownload-27966.csv\",\n show_col_types = FALSE,\n skip = 1\n )\n\n# We have read the data from the AEC website. 
We may like to save\n# it in case something happens or they move it.\nwrite_csv(\n x = raw_elections_data,\n file = \"australian_voting.csv\"\n)\n\nWe can take a quick look at the dataset using head() which will show the first six rows, and tail() which will show the last six rows.\n\nhead(raw_elections_data)\n\n# A tibble: 6 × 8\n DivisionID DivisionNm StateAb CandidateID GivenNm Surname PartyNm PartyAb\n <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> \n1 179 Adelaide SA 36973 Steve GEORGANAS Austral… ALP \n2 197 Aston VIC 36704 Alan TUDGE Liberal LP \n3 198 Ballarat VIC 36409 Catherine KING Austral… ALP \n4 103 Banks NSW 37018 David COLEMAN Liberal LP \n5 180 Barker SA 37083 Tony PASIN Liberal LP \n6 104 Barton NSW 36820 Linda BURNEY Austral… ALP \n\ntail(raw_elections_data)\n\n# A tibble: 6 × 8\n DivisionID DivisionNm StateAb CandidateID GivenNm Surname PartyNm PartyAb\n <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> \n1 152 Wentworth NSW 37451 Allegra SPENDER Indepen… IND \n2 153 Werriwa NSW 36810 Anne Maree STANLEY Austral… ALP \n3 150 Whitlam NSW 36811 Stephen JONES Austral… ALP \n4 178 Wide Bay QLD 37506 Llew O'BRIEN Liberal… LNP \n5 234 Wills VIC 36452 Peter KHALIL Austral… ALP \n6 316 Wright QLD 37500 Scott BUCHHOLZ Liberal… LNP \n\n\nWe need to clean the data so that we can use it. We are trying to make it similar to the dataset that we thought we wanted in the planning stage. While it is fine to move away from the plan, this needs to be a deliberate, reasoned decision. After reading in the dataset that we saved, the first thing that we will do is adjust the names of the variables. 
We will do this using clean_names() from janitor.\n\n#### Basic cleaning ####\nraw_elections_data <-\n read_csv(\n file = \"australian_voting.csv\",\n show_col_types = FALSE\n )\n\n\n# Make the names easier to type\ncleaned_elections_data <-\n clean_names(raw_elections_data)\n\n# Have a look at the first six rows\nhead(cleaned_elections_data)\n\n# A tibble: 6 × 8\n division_id division_nm state_ab candidate_id given_nm surname party_nm \n <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> \n1 179 Adelaide SA 36973 Steve GEORGANAS Australian …\n2 197 Aston VIC 36704 Alan TUDGE Liberal \n3 198 Ballarat VIC 36409 Catherine KING Australian …\n4 103 Banks NSW 37018 David COLEMAN Liberal \n5 180 Barker SA 37083 Tony PASIN Liberal \n6 104 Barton NSW 36820 Linda BURNEY Australian …\n# ℹ 1 more variable: party_ab <chr>\n\n\nThe names are faster to type because RStudio will auto-complete them. To do this, we begin typing the name of a variable and then use the “tab” key to complete it.\nThere are many variables in the dataset, and we are primarily interested in two: “division_nm” and “party_nm”. We can choose certain variables of interest with select() from dplyr which we loaded as part of the tidyverse. The “pipe operator”, |>, pushes the output of one line to be the first input of the function on the next line.\n\ncleaned_elections_data <-\n cleaned_elections_data |>\n select(\n division_nm,\n party_nm\n )\n\nhead(cleaned_elections_data)\n\n# A tibble: 6 × 2\n division_nm party_nm \n <chr> <chr> \n1 Adelaide Australian Labor Party\n2 Aston Liberal \n3 Ballarat Australian Labor Party\n4 Banks Liberal \n5 Barker Liberal \n6 Barton Australian Labor Party\n\n\nSome of the variable names are still not obvious because they are abbreviated. We can look at the names of the columns in this dataset with names(). 
And we can change the names using rename() from dplyr.\n\nnames(cleaned_elections_data)\n\n[1] \"division_nm\" \"party_nm\" \n\n\n\ncleaned_elections_data <-\n cleaned_elections_data |>\n rename(\n division = division_nm,\n elected_party = party_nm\n )\n\nhead(cleaned_elections_data)\n\n# A tibble: 6 × 2\n division elected_party \n <chr> <chr> \n1 Adelaide Australian Labor Party\n2 Aston Liberal \n3 Ballarat Australian Labor Party\n4 Banks Liberal \n5 Barker Liberal \n6 Barton Australian Labor Party\n\n\nWe could now look at the unique values in the “elected_party” column using unique().\n\ncleaned_elections_data$elected_party |>\n unique()\n\n[1] \"Australian Labor Party\" \n[2] \"Liberal\" \n[3] \"Liberal National Party of Queensland\"\n[4] \"The Greens\" \n[5] \"The Nationals\" \n[6] \"Independent\" \n[7] \"Katter's Australian Party (KAP)\" \n[8] \"Centre Alliance\" \n\n\nAs there is more detail in this than we wanted, we may want to simplify the party names to match what we simulated, using case_match() from dplyr.\n\ncleaned_elections_data <-\n cleaned_elections_data |>\n mutate(\n elected_party =\n case_match(\n elected_party,\n \"Australian Labor Party\" ~ \"Labor\",\n \"Liberal National Party of Queensland\" ~ \"Liberal\",\n \"Liberal\" ~ \"Liberal\",\n \"The Nationals\" ~ \"Nationals\",\n \"The Greens\" ~ \"Greens\",\n \"Independent\" ~ \"Other\",\n \"Katter's Australian Party (KAP)\" ~ \"Other\",\n \"Centre Alliance\" ~ \"Other\"\n )\n )\n\nhead(cleaned_elections_data)\n\n# A tibble: 6 × 2\n division elected_party\n <chr> <chr> \n1 Adelaide Labor \n2 Aston Liberal \n3 Ballarat Labor \n4 Banks Liberal \n5 Barker Liberal \n6 Barton Labor \n\n\nOur data now matches our plan (Figure 2.2 (a)). For every electoral division we have the party of the person that won it.\nHaving now nicely cleaned the dataset, we should save it, so that we can start with that cleaned dataset in the next stage. 
We should make sure to save it under a new file name so we are not replacing the raw data, and so that it is easy to identify the cleaned dataset later.\n\nwrite_csv(\n x = cleaned_elections_data,\n file = \"cleaned_elections_data.csv\"\n)\n\n\n\n2.2.4 Explore\nWe may like to explore the dataset that we created. One way to better understand a dataset is to make a graph. In particular, here we would like to build the graph that we planned in Figure 2.2 (b).\nFirst, we read in the dataset that we just created.\n\n#### Read in the data ####\ncleaned_elections_data <-\n read_csv(\n file = \"cleaned_elections_data.csv\",\n show_col_types = FALSE\n )\n\nWe can get a quick count of how many seats each party won using count() from dplyr.\n\ncleaned_elections_data |>\n count(elected_party)\n\n# A tibble: 5 × 2\n elected_party n\n <chr> <int>\n1 Greens 4\n2 Labor 77\n3 Liberal 48\n4 Nationals 10\n5 Other 12\n\n\nTo build the graph that we are interested in, we use ggplot2 which is part of the tidyverse. The key aspect of this package is that we build graphs by adding layers using “+”, which we call the “add operator”. In particular we will create a bar chart using geom_bar() from ggplot2 (Figure 2.5 (a)).\ncleaned_elections_data |>\n ggplot(aes(x = elected_party)) + # aes abbreviates \"aesthetics\" \n geom_bar()\n\ncleaned_elections_data |>\n ggplot(aes(x = elected_party)) +\n geom_bar() +\n theme_minimal() + # Make the theme neater\n labs(x = \"Party\", y = \"Number of seats\") # Make labels more meaningful\n\n\n\n\n\n\n\n\n\n\n\n(a) Default options\n\n\n\n\n\n\n\n\n\n\n\n(b) Improved theme and labels\n\n\n\n\n\n\n\nFigure 2.5: Number of seats won, by political party, at the 2022 Australian Federal Election\n\n\n\nFigure 2.5 (a) accomplishes what we set out to do. But we can make it look a bit nicer by modifying the default options and improving the labels (Figure 2.5 (b)).\n\n\n2.2.5 Share\nTo this point we have downloaded some data, cleaned it, and made a graph. 
We would typically need to communicate what we have done at some length. In this case, we can write a few paragraphs about what we did, why we did it, and what we found to conclude our workflow. An example follows.\n\nAustralia is a parliamentary democracy with 151 seats in the House of Representatives, which is the house from which government is formed. There are two major parties—“Liberal” and “Labor”—two minor parties—“Nationals” and “Greens”—and many smaller parties. The 2022 Federal Election occurred on 21 May, and around 15 million votes were cast. We were interested in the number of seats that were won by each party.\nWe downloaded the results, on a seat-specific basis, from the Australian Electoral Commission website. We cleaned and tidied the dataset using the statistical programming language R (R Core Team 2023) including the tidyverse (Wickham et al. 2019) and janitor (Firke 2023). We then created a graph of the number of seats that each political party won (Figure 2.5).\nWe found that the Labor Party won 77 seats, followed by the Liberal Party with 48 seats. The minor parties won the following number of seats: the Nationals won 10 seats and the Greens won 4 seats. Finally, there were 10 Independents elected as well as candidates from smaller parties.\nThe distribution of seats is skewed toward the two major parties which could reflect relatively stable preferences on the part of Australian voters, or possibly inertia due to the benefits of already being a major party such as a national network or funding. A better understanding of the reasons for this distribution is of interest in future work. While the dataset consists of everyone who voted, it is worth noting that in Australia some are systematically excluded from voting, and it is much more difficult for some to vote than others.\n\nOne aspect to be especially concerned with is making sure that this communication is focused on the needs of the audience and telling a story. 
Data journalism provides some excellent examples of how analysis needs to be tailored to the audience, for instance, Cardoso (2020) and Bronner (2020).",
"crumbs": [
"Foundations",
 "2 Drinking from a fire hose"
@@ -258,7 +258,7 @@
"href": "03-workflow.html",
"title": "3 Reproducible workflows",
"section": "",
- "text": "3.1 Introduction\nIf science is about systematically building and organizing knowledge in terms of testable explanations and predictions, then data science takes this and focuses on data. This means that building, organizing, and sharing knowledge is a critical aspect. Creating knowledge, once, in a way that only you can do it, does not meet this standard. Hence, there is a need for reproducible data science workflows.\nAlexander (2019) defines reproducible research as that which can be exactly redone, given all the materials used. This underscores the importance of providing the code, data, and environment. The minimum expectation is that another person is independently able to use your code, data, and environment to get your results, including figures and tables. Ironically, there are different definitions of reproducibility between disciplines. Barba (2018) surveys a variety of disciplines and concludes that the predominant language usage implies the following definitions:\nRegardless of what it is specifically called, Gelman (2016) identifies how large an issue the lack of it is in various social sciences. Work that is not reproducible does not contribute to our stock of knowledge about the world. This is wasteful and potentially even unethical. Since Gelman (2016), a great deal of work has been done in many social sciences and the situation has improved a little, but much work remains. That is also the case in the life sciences (Heil et al. 2021), cancer research (Begley and Ellis 2012; Mullard 2021), and computer science (Pineau et al. 2021).\nSome of the examples that Gelman (2016) talks about are not that important in the scheme of things. But at the same time, we saw, and continue to see, similar approaches being used in areas with big impacts. For instance, many governments have created “nudge” units that implement public policy (Sunstein and Reisch 2017) even though there is evidence that some of the claims lack credibility (Maier et al. 
2022; Szaszi et al. 2022). Governments are increasingly using algorithms that they do not make open (Chouldechova et al. 2018). And Herndon, Ash, and Pollin (2014) document how research in economics that was used by governments to justify austerity policies following the 2007–2008 financial crisis turned out to not be reproducible.\nAt a minimum, and with few exceptions, we must release our code, datasets, and environment. Without these, it is difficult to know what a finding speaks to (Miyakawa 2020). More banally, we also do not know if there are mistakes or aspects that were inadvertently overlooked (Merali 2010; Hillel 2017; Silver 2020). Increasingly, following Buckheit and Donoho (1995), we consider a paper to be an advertisement, and for the associated code, data, and environment to be the actual work. Steve Jobs, a co-founder of Apple, talked about how people who are the best at their craft ensure that even the aspects of their work that no one else will ever see are as well finished and high quality as the aspects that are public facing (Isaacson 2011). The same is true in data science, where often one of the distinguishing aspects of high-quality work is that the README and code comments are as polished as, say, the abstract of the associated paper.\nWorkflows exist within a cultural and social context, which imposes an additional ethical reason for the need for them to be reproducible. For instance, Wang and Kosinski (2018) train a neural network to distinguish between the faces of gay and heterosexual men. (Murphy (2017) provides a summary of the paper, the associated issues, and comments from its authors.) To do this, Wang and Kosinski (2018, 248) needed a dataset of photos of people that were “adult, Caucasian, fully visible, and of a gender that matched the one reported on the user’s profile”. They verified this using Amazon Mechanical Turk, an online platform that pays workers a small amount of money to complete specific tasks. 
The instructions provided to the Mechanical Turk workers for this task specify that Barack Obama, the 44th US President, who had a white mother and a black father, should be classified as “Black”; and that Latino is an ethnicity, rather than a race (Mattson 2017). The classification task may seem objective, but, perhaps unthinkingly, echoes the views of Americans with a certain class and background.\nThis is just one specific concern about one part of the Wang and Kosinski (2018) workflow. Broader concerns are raised by others including Gelman, Mattson, and Simpson (2018). The main issue is that statistical models are specific to the data on which they were trained. And the only reason that we can identify likely issues in the model of Wang and Kosinski (2018) is because, despite not releasing the specific dataset that they used, they were nonetheless open about their procedure. For our work to be credible, it needs to be reproducible by others.\nSome of the steps that we can take to make our work more reproducible include:\nThe workflow that we advocate in this book is:\n\\[\n\\mbox{Plan}\\rightarrow\\mbox{Simulate}\\rightarrow\\mbox{Acquire}\\rightarrow\\mbox{Explore}\\rightarrow\\mbox{Share}\n\\]\nBut it can be alternatively considered as: “Think an awful lot, mostly read and write, sometimes code”.\nThere are various tools that we can use at the different stages that will improve the reproducibility of this workflow. This includes Quarto, R Projects, and Git and GitHub.",
+ "text": "3.1 Introduction\nIf science is about systematically building and organizing knowledge in terms of testable explanations and predictions, then data science takes this and focuses on data. This means that building, organizing, and sharing knowledge is a critical aspect. Creating knowledge, once, in a way that only you can do it, does not meet this standard. Hence, there is a need for reproducible data science workflows.\nAlexander (2019) defines reproducible research as that which can be exactly redone, given all the materials used. This underscores the importance of providing the code, data, and environment. The minimum expectation is that another person is independently able to use your code, data, and environment to get your results, including figures and tables. Ironically, there are different definitions of reproducibility between disciplines. Barba (2018) surveys a variety of disciplines and concludes that the predominant language usage implies the following definitions:\nFor our purposes we use the definition from National Academies of Sciences, Engineering, and Medicine (2019, 46): “Reproducibility is obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis.” Regardless of what it is specifically called, Gelman (2016) identifies how large an issue the lack of it is in various social sciences. Work that is not reproducible does not contribute to our stock of knowledge about the world. This is wasteful and potentially even unethical. Since Gelman (2016), a great deal of work has been done in many social sciences and the situation has improved a little, but much work remains. That is also the case in the life sciences (Heil et al. 2021), cancer research (Begley and Ellis 2012; Mullard 2021), and computer science (Pineau et al. 2021).\nSome of the examples that Gelman (2016) talks about are not that important in the scheme of things. 
But at the same time, we saw, and continue to see, similar approaches being used in areas with big impacts. For instance, many governments have created “nudge” units that implement public policy (Sunstein and Reisch 2017) even though there is evidence that some of the claims lack credibility (Maier et al. 2022; Szaszi et al. 2022). Governments are increasingly using algorithms that they do not make open (Chouldechova et al. 2018). And Herndon, Ash, and Pollin (2014) document how research in economics that was used by governments to justify austerity policies following the 2007–2008 financial crisis turned out to not be reproducible.\nAt a minimum, and with few exceptions, we must release our code, datasets, and environment. Without these, it is difficult to know what a finding speaks to (Miyakawa 2020). More banally, we also do not know if there are mistakes or aspects that were inadvertently overlooked (Merali 2010; Hillel 2017; Silver 2020). Increasingly, following Buckheit and Donoho (1995), we consider a paper to be an advertisement, and for the associated code, data, and environment to be the actual work. Steve Jobs, a co-founder of Apple, talked about how people who are the best at their craft ensure that even the aspects of their work that no one else will ever see are as well finished and high quality as the aspects that are public facing (Isaacson 2011). The same is true in data science, where often one of the distinguishing aspects of high-quality work is that the README and code comments are as polished as, say, the abstract of the associated paper.\nWorkflows exist within a cultural and social context, which imposes an additional ethical reason for the need for them to be reproducible. For instance, Wang and Kosinski (2018) train a neural network to distinguish between the faces of gay and heterosexual men. (Murphy (2017) provides a summary of the paper, the associated issues, and comments from its authors.) 
To do this, Wang and Kosinski (2018, 248) needed a dataset of photos of people that were “adult, Caucasian, fully visible, and of a gender that matched the one reported on the user’s profile”. They verified this using Amazon Mechanical Turk, an online platform that pays workers a small amount of money to complete specific tasks. The instructions provided to the Mechanical Turk workers for this task specify that Barack Obama, the 44th US President, who had a white mother and a black father, should be classified as “Black”; and that Latino is an ethnicity, rather than a race (Mattson 2017). The classification task may seem objective, but, perhaps unthinkingly, echoes the views of Americans with a certain class and background.\nThis is just one specific concern about one part of the Wang and Kosinski (2018) workflow. Broader concerns are raised by others including Gelman, Mattson, and Simpson (2018). The main issue is that statistical models are specific to the data on which they were trained. And the only reason that we can identify likely issues in the model of Wang and Kosinski (2018) is because, despite not releasing the specific dataset that they used, they were nonetheless open about their procedure. For our work to be credible, it needs to be reproducible by others.\nSome of the steps that we can take to make our work more reproducible include:\nThe workflow that we advocate in this book is:\n\\[\n\\mbox{Plan}\\rightarrow\\mbox{Simulate}\\rightarrow\\mbox{Acquire}\\rightarrow\\mbox{Explore}\\rightarrow\\mbox{Share}\n\\]\nBut it can be alternatively considered as: “Think an awful lot, mostly read and write, sometimes code”.\nThere are various tools that we can use at the different stages that will improve the reproducibility of this workflow. This includes Quarto, R Projects, and Git and GitHub.",
"crumbs": [
"Foundations",
"3 Reproducible workflows"
@@ -280,7 +280,7 @@
"href": "03-workflow.html#quarto",
"title": "3 Reproducible workflows",
"section": "3.2 Quarto",
- "text": "3.2 Quarto\n\n3.2.1 Getting started\nQuarto integrates code and natural language in a way that is called “literate programming” (Knuth 1984). It is the successor to R Markdown, which was a variant of Markdown specifically designed to allow R code chunks to be included. Quarto uses a mark-up language similar to HyperText Markup Language (HTML) or LaTeX, in comparison to a “What You See Is What You Get” (WYSIWYG) language, such as Microsoft Word. This means that all the aspects are consistent, for instance, all top-level headings will look the same. But it means that we must designate or “mark up” how we would like certain aspects to appear. And it is only when we render the document that we get to see what it looks like. A visual editor option can also be used, and this hides the need for the user to do this mark-up themselves.\nWhile it makes sense to use Quarto going forward, there are many resources written for and in R Markdown. For this reason we provide R Markdown equivalents in Online Appendix D.\n\n\n\n\n\n\nShoulders of giants\n\n\n\nFernando Pérez is an associate professor of statistics at the University of California, Berkeley and a Faculty Scientist, Data Science and Technology Division, at Lawrence Berkeley National Laboratory. He earned a PhD in particle physics from the University of Colorado, Boulder. During his PhD he created iPython, which enables Python to be used interactively, and now underpins Project Jupyter, which inspired similar notebook approaches such as R Markdown and now Quarto. Somers (2018) describes how open-source notebook approaches create virtuous feedback loops that result in dramatically improved scientific computing. And Romer (2018) aligns the features of open-source approaches, such as Jupyter, with the features that enable scientific consensus and progress. 
In 2017 Pérez was awarded the Association for Computing Machinery (ACM) Software System Award.\n\n\nOne advantage of literate programming is that we get a “live” document in which code executes and then forms part of the document. Another advantage of Quarto is that similar code can compile into a variety of documents, including HTML and PDFs. Quarto also has default options for including a title, author, and date. One disadvantage is that it can take a while for a document to compile because the code needs to run.\nWe need to download Quarto from here. (Skip this step if you are using Posit Cloud because it is already installed.) We can then create a new Quarto document within RStudio: “File” \\(\\rightarrow\\) “New File” \\(\\rightarrow\\) “Quarto Document\\(\\dots\\)”.\nAfter opening a new Quarto document and selecting “Source” view, you will see the default top matter, contained within a pair of three dashes, as well as some examples of text showing a few of the markdown essential commands and R chunks, each of which are discussed further in the following sections.\n\n\n3.2.2 Top matter\nTop matter consists of defining aspects such as the title, author, and date. It is contained within three dashes at the top of a Quarto document. For instance, the following would specify a title, a date that automatically updated to the date the document was rendered, and an author.\n---\ntitle: \"My document\"\nauthor: \"Rohan Alexander\"\ndate: format(Sys.time(), \"%d %B %Y\")\nformat: html\n---\nAn abstract is a short summary of the paper, and we could add that to the top matter.\n---\ntitle: \"My document\"\nauthor: \"Rohan Alexander\"\ndate: format(Sys.time(), \"%d %B %Y\")\nabstract: \"This is my abstract.\"\nformat: html\n---\nBy default, Quarto will create an HTML document, but we can change the output format to produce a PDF. This uses LaTeX in the background and requires the installation of supporting packages. To do this install tinytex. 
But as it is used in the background we should not need to load it.\n---\ntitle: \"My document\"\nauthor: \"Rohan Alexander\"\ndate: format(Sys.time(), \"%d %B %Y\")\nabstract: \"This is my abstract.\"\nformat: pdf\n---\nWe can include references by specifying a BibTeX file in the top matter and then calling it within the text, as needed.\n---\ntitle: \"My document\"\nauthor: \"Rohan Alexander\"\ndate: format(Sys.time(), \"%d %B %Y\")\nformat: pdf\nabstract: \"This is my abstract.\"\nbibliography: bibliography.bib\n---\nWe would need to make a separate file called “bibliography.bib” and save it next to the Quarto file. In the BibTeX file we need an entry for the item that is to be referenced. For instance, the citation for R can be obtained with citation() and this can be added to the “bibliography.bib” file. The citation for a package can be found by including the package name, for instance citation(\"tidyverse\"), and again adding the output to the “.bib” file. It can be helpful to use Google Scholar or doi2bib to get citations for books or articles.\nWe need to create a unique key that we use to refer to this item in the text. 
This can be anything, provided it is unique, but meaningful ones can be easier to remember, for instance “citeR”.\n@Manual{citeR,\n title = {R: A Language and Environment for Statistical Computing},\n author = {{R Core Team}},\n organization = {R Foundation for Statistical Computing},\n address = {Vienna, Austria},\n year = {2021},\n url = {https://www.R-project.org/},\n }\n@book{tellingstories,\n title = {Telling Stories with Data},\n author = {Rohan Alexander},\n year = {2023},\n publisher = {Chapman and Hall/CRC},\n url = {https://tellingstorieswithdata.com}\n }\nTo cite R in the Quarto document we then include @citeR, which would put brackets around the year: R Core Team (2023), or [@citeR], which would put brackets around the whole thing: (R Core Team 2023).\nThe reference list at the end of the paper is automatically built based on calling the BibTeX file and including references in the paper. At the end of the Quarto document, include a heading “# References” and the actual citations will be included after that. When the Quarto file is rendered, Quarto sees these in the content, goes to the BibTeX file to get the reference details that it needs, builds the reference list, and then adds it at the end of the rendered document.\n\n\n3.2.3 Essential commands\nQuarto uses a variation of Markdown as its underlying syntax. Essential Markdown commands include those for emphasis, headers, lists, links, and images. A reminder of these is included in RStudio: “Help” \\(\\rightarrow\\) “Markdown Quick Reference”. It is your choice as to whether you want to use the visual or source editor. But either way, it is good to understand these essentials because it will not always be possible to use a visual editor (for instance if you are quickly looking at a Quarto document in GitHub). 
As you get more experience it can be useful to use a text editor such as Sublime Text, or an alternative Integrated Development Environment such as VS Code.\n\nEmphasis: *italic*, **bold**\nHeaders (these go on their own line with a blank line before and after):\n\n # First level header\n \n ## Second level header\n \n ### Third level header\n\nUnordered list, with sub-lists:\n\n * Item 1\n * Item 2\n + Item 2a\n + Item 2b\n\nOrdered list, with sub-lists:\n\n 1. Item 1\n 2. Item 2\n 3. Item 3\n + Item 3a\n + Item 3b\n\nURLs can be added: [this book](https://www.tellingstorieswithdata.com) results in this book.\nA paragraph is created by leaving a blank line.\n\nA paragraph about an idea, nicely spaced from the following paragraph.\n\nA paragraph about another idea, again spaced from the earlier paragraph.\nOnce we have added some aspects, then we may want to see the actual document. To build the document click “Render”.\n\n\n3.2.4 R chunks\nWe can include code for R and many other languages in code chunks within a Quarto document. When we render the document the code will run and be included in the document.\nTo create an R chunk, we start with three backticks and then within curly braces we tell Quarto that this is an R chunk. Anything inside this chunk will be considered R code and run as such. We use data from Kleiber and Zeileis (2008) who provide the R package AER to accompany their book Applied Econometrics with R. 
We could load the tidyverse and install and load AER and make a graph of the number of times a survey respondent visited the doctor in the past two weeks.\n```{r}\nlibrary(tidyverse)\nlibrary(AER)\n\ndata(\"DoctorVisits\", package = \"AER\")\n\nDoctorVisits |>\n ggplot(aes(x = illness)) +\n geom_histogram(stat = \"count\")\n```\nThe output of that code is Figure 3.1.\n\n\n\n\n\n\n\n\nFigure 3.1: Number of illnesses in the past two weeks, based on the 1977–1978 Australian Health Survey\n\n\n\n\n\nThere are various evaluation options that are available in chunks. We include these, each on a new line, by opening the line with the chunk-specific comment delimiter “#|” and then the option. Helpful options include:\n\necho: This controls whether the code itself is included in the document. For instance, #| echo: false would mean the code will be run and its output will show, but the code itself would not be included in the document.\ninclude: This controls whether the output of the code is included in the document. For instance, #| include: false would run the code, but would not result in any output, and the code itself would not be included in the document.\neval: This controls whether the code should be included in the document. For instance, #| eval: false would mean that the code is not run, and hence there would not be any output to include, but the code itself would be included in the document.\nwarning: This controls whether warnings should be included in the document. For instance, #| warning: false would mean that warnings are not included.\nmessage: This controls whether messages should be included in the document. 
For instance, #| message: false would mean that messages are not included in the document.\n\nFor instance, we could include the output, but not the code, and suppress any warnings.\n```{r}\n#| echo: false\n#| warning: false\n\nlibrary(tidyverse)\nlibrary(AER)\n\ndata(\"DoctorVisits\", package = \"AER\")\n\nDoctorVisits |>\n ggplot(aes(x = visits)) +\n geom_histogram(stat = \"count\")\n```\nLeave a blank line on either side of an R chunk, otherwise it may not run properly. And use lower case for logical values, i.e. “false” not “FALSE”.\nMost people did not visit a doctor in the past week.\n\n```{r}\n#| echo: false\n#| warning: false\n\nlibrary(tidyverse)\nlibrary(AER)\n\ndata(\"DoctorVisits\", package = \"AER\")\n\nDoctorVisits |>\n ggplot(aes(x = visits)) +\n geom_histogram(stat = \"count\")\n```\n\nThere were some people that visited a doctor once, and then...\nThe Quarto document itself must load any datasets that are needed. It is not enough that they are in the environment. This is because the Quarto document evaluates the code in the document when it is rendered, not necessarily the environment.\nOften when writing code, we may want to make the same change across multiple lines or change all instances of a particular thing. We achieve this with multiple cursors. If we want a cursor across multiple, consecutive lines, then hold “option” on Mac or “Alt” on PC, while you drag your cursor over the relevant lines. If you want to select all instances of something, then highlight one instance, say a variable name, then use Find/Replace (Command + F on Mac or CTRL + F on PC) and select “All”. This will then enable a cursor at all the other instances.\n\n\n3.2.5 Equations\nWe can include equations by using LaTeX, which is based on the programming language TeX. We invoke math mode in LaTeX by using two dollar signs as opening and closing tags. Then whatever is inside is evaluated as LaTeX mark-up. 
For instance we can produce the compound interest formula with:\n$$\nA = P\\left(1+\\frac{r}{n}\\right)^{nt}\n$$\n\\[\nA = P\\left(1+\\frac{r}{n}\\right)^{nt}\n\\]\nLaTeX is a comprehensive mark-up language but we will mostly just use it to specify the model of interest. We include some examples here that contain the critical aspects we will draw on starting in Chapter 12.\n$$\ny_i|\\mu_i, \\sigma \\sim \\mbox{Normal}(\\mu_i, \\sigma)\n$$\n\\[\ny_i|\\mu_i, \\sigma \\sim \\mbox{Normal}(\\mu_i, \\sigma)\n\\]\nUnderscores are used to get subscripts: y_i for \\(y_i\\). And we can get a subscript of more than one item by surrounding it with curly braces: y_{i,c} for \\(y_{i,c}\\). In this case we wanted math mode within the line, and so we surround these with only one dollar sign as opening and closing tags.\nGreek letters are typically preceded by a backslash. Common Greek letters include: \\alpha for \\(\\alpha\\), \\beta for \\(\\beta\\), \\delta for \\(\\delta\\), \\epsilon for \\(\\epsilon\\), \\gamma for \\(\\gamma\\), \\lambda for \\(\\lambda\\), \\mu for \\(\\mu\\), \\phi for \\(\\phi\\), \\pi for \\(\\pi\\), \\Pi for \\(\\Pi\\), \\rho for \\(\\rho\\), \\sigma for \\(\\sigma\\), \\Sigma for \\(\\Sigma\\), \\tau for \\(\\tau\\), and \\theta for \\(\\theta\\).\nLaTeX math mode assumes letters are variables and so makes them italic, but sometimes we want a word to appear in normal font because it is not a variable, such as “Normal”. In that case we surround it with \\mbox{}, for instance \\mbox{Normal} for \\(\\mbox{Normal}\\).\nWe line up equations across multiple lines using \\begin{aligned} and \\end{aligned}. Then the item that is to be lined up is noted by an ampersand. 
The following is a model that we will estimate in Chapter 16.\n$$\n\\begin{aligned}\ny_i|\\pi_i & \\sim \\mbox{Bern}(\\pi_i) \\\\\n\\mbox{logit}(\\pi_i) & = \\beta_0+ \\alpha_{g[i]}^{\\mbox{gender}} + \\alpha_{a[i]}^{\\mbox{age}} + \\alpha_{s[i]}^{\\mbox{state}} + \\alpha_{e[i]}^{\\mbox{edu}} \\\\\n\\beta_0 & \\sim \\mbox{Normal}(0, 2.5)\\\\\n\\alpha_{g}^{\\mbox{gender}} & \\sim \\mbox{Normal}(0, 2.5)\\mbox{ for }g=1, 2\\\\\n\\alpha_{a}^{\\mbox{age}} & \\sim \\mbox{Normal}\\left(0, \\sigma^2_{\\mbox{age}}\\right)\\mbox{ for }a = 1, 2, \\dots, A\\\\\n\\alpha_{s}^{\\mbox{state}} & \\sim \\mbox{Normal}\\left(0, \\sigma^2_{\\mbox{state}}\\right)\\mbox{ for }s = 1, 2, \\dots, S\\\\\n\\alpha_{e}^{\\mbox{edu}} & \\sim \\mbox{Normal}\\left(0, \\sigma^2_{\\mbox{edu}}\\right)\\mbox{ for }e = 1, 2, \\dots, E\\\\\n\\sigma_{\\mbox{gender}} & \\sim \\mbox{Exponential}(1)\\\\\n\\sigma_{\\mbox{state}} & \\sim \\mbox{Exponential}(1)\\\\\n\\sigma_{\\mbox{edu}} & \\sim \\mbox{Exponential}(1)\n\\end{aligned}\n$$\n\\[\n\\begin{aligned}\ny_i|\\pi_i & \\sim \\mbox{Bern}(\\pi_i) \\\\\n\\mbox{logit}(\\pi_i) & = \\beta_0+ \\alpha_{g[i]}^{\\mbox{gender}} + \\alpha_{a[i]}^{\\mbox{age}} + \\alpha_{s[i]}^{\\mbox{state}} + \\alpha_{e[i]}^{\\mbox{edu}} \\\\\n\\beta_0 & \\sim \\mbox{Normal}(0, 2.5)\\\\\n\\alpha_{g}^{\\mbox{gender}} & \\sim \\mbox{Normal}(0, 2.5)\\mbox{ for }g=1, 2\\\\\n\\alpha_{a}^{\\mbox{age}} & \\sim \\mbox{Normal}\\left(0, \\sigma^2_{\\mbox{age}}\\right)\\mbox{ for }a = 1, 2, \\dots, A\\\\\n\\alpha_{s}^{\\mbox{state}} & \\sim \\mbox{Normal}\\left(0, \\sigma^2_{\\mbox{state}}\\right)\\mbox{ for }s = 1, 2, \\dots, S\\\\\n\\alpha_{e}^{\\mbox{edu}} & \\sim \\mbox{Normal}\\left(0, \\sigma^2_{\\mbox{edu}}\\right)\\mbox{ for }e = 1, 2, \\dots, E\\\\\n\\sigma_{\\mbox{gender}} & \\sim \\mbox{Exponential}(1)\\\\\n\\sigma_{\\mbox{state}} & \\sim \\mbox{Exponential}(1)\\\\\n\\sigma_{\\mbox{edu}} & \\sim \\mbox{Exponential}(1)\n\\end{aligned}\n\\]\nFinally, certain functions are built into 
LaTeX. For instance, we can appropriately typeset “log” with \\log.\n\n\n3.2.6 Cross-references\nIt can be useful to cross-reference figures, tables, and equations. This makes it easier to refer to them in the text. To do this for a figure we refer to the name of the R chunk that creates or contains the figure. For instance, consider the following code.\n\n```{r}\n#| label: fig-uniquename\n#| fig-cap: Number of illnesses in the past two weeks, based on the 1977--1978 Australian Health Survey\n#| warning: false\n\ndata(\"DoctorVisits\", package = \"AER\")\n\nDoctorVisits |>\n ggplot(aes(x = illness)) +\n geom_histogram(stat = \"count\")\n```\n\n\n\n\n\n\n\nFigure 3.2: Number of illnesses in the past two weeks, based on the 1977–1978 Australian Health Survey\n\n\n\n\n\nThen (@fig-uniquename) would produce: (Figure 3.2) as the name of the R chunk is fig-uniquename. We need to add “fig” to the start of the chunk name so that Quarto knows that this is a figure. We then include a “fig-cap:” in the R chunk that specifies a caption.\nWe can add #| layout-ncol: 2 in an R chunk within a Quarto document to have two graphs appear side by side (Figure 3.3). Here Figure 3.3 (a) uses the minimal theme, and Figure 3.3 (b) uses the classic theme. These both cross-reference the same label #| label: fig-doctorgraphsidebyside in the R chunk, with an additional option added in the R chunk of #| fig-subcap: [\"Number of illnesses\",\"Number of visits to the doctor\"] which provides the sub-captions. 
The addition of a letter in-text is accomplished by adding “-1” and “-2” to the end of the label when it is used in-text: (@fig-doctorgraphsidebyside), @fig-doctorgraphsidebyside-1, and @fig-doctorgraphsidebyside-2 for (Figure 3.3), Figure 3.3 (a), and Figure 3.3 (b), respectively.\n```{r}\n#| eval: true\n#| warning: false\n#| label: fig-doctorgraphsidebyside\n#| fig-cap: \"Two variants of graphs\"\n#| fig-subcap: [\"Illnesses\",\"Visits to the doctor\"]\n#| layout-ncol: 2\n\nDoctorVisits |>\n ggplot(aes(x = illness)) +\n geom_histogram(stat = \"count\") +\n theme_minimal()\n\nDoctorVisits |>\n ggplot(aes(x = visits)) +\n geom_histogram(stat = \"count\") +\n theme_classic()\n```\n\n\n\n\n\n\n\n\n\n\n\n(a) Illnesses\n\n\n\n\n\n\n\n\n\n\n\n(b) Visits to the doctor\n\n\n\n\n\n\n\nFigure 3.3: Two variants of graphs\n\n\n\nWe can take a similar approach to cross-reference tables. For instance, (@tbl-docvisittable) will produce: (Table 3.1). In this case we specify “tbl” at the start of the label so that Quarto knows that it is a table. And we specify a caption for the table with “tbl-cap:”.\n\n```{r}\n#| label: tbl-docvisittable\n#| tbl-cap: \"Distribution of the number of doctor visits\"\n\nDoctorVisits |>\n count(visits) |>\n kable()\n```\n\n\n\nTable 3.1: Distribution of the number of doctor visits\n\n\n\n\n\n\nvisits\nn\n\n\n\n\n0\n4141\n\n\n1\n782\n\n\n2\n174\n\n\n3\n30\n\n\n4\n24\n\n\n5\n9\n\n\n6\n12\n\n\n7\n12\n\n\n8\n5\n\n\n9\n1\n\n\n\n\n\n\n\n\nFinally, we can also cross-reference equations. To that we need to add a tag such as {#eq-macroidentity} which we then reference.\n$$\nY = C + I + G + (X - M)\n$$ {#eq-gdpidentity}\nFor instance, we then use @eq-gdpidentity to produce Equation 3.1\n\\[\nY = C + I + G + (X - M)\n\\tag{3.1}\\]\nLabels should be relatively simple when using cross-references. In general, try to keep the names simple but unique, avoid punctuation, and stick to letters and hyphens. Try not to use underscores, because they can cause an error.",
+ "text": "3.2 Quarto\n\n3.2.1 Getting started\nQuarto integrates code and natural language in a way that is called “literate programming” (Knuth 1984). It is the successor to R Markdown, which was a variant of Markdown specifically designed to allow R code chunks to be included. Quarto uses a mark-up language similar to HyperText Markup Language (HTML) or LaTeX, in comparison to a “What You See Is What You Get” (WYSIWYG) language, such as Microsoft Word. This means that all the aspects are consistent, for instance, all top-level headings will look the same. But it means that we must designate or “mark up” how we would like certain aspects to appear. And it is only when we render the document that we get to see what it looks like. A visual editor option can also be used, and this hides the need for the user to do this mark-up themselves.\nWhile it makes sense to use Quarto going forward, there are many resources written for and in R Markdown. For this reason we provide R Markdown equivalents in Online Appendix D.\n\n\n\n\n\n\nShoulders of giants\n\n\n\nFernando Pérez is an associate professor of statistics at the University of California, Berkeley and a Faculty Scientist, Data Science and Technology Division, at Lawrence Berkeley National Laboratory. He earned a PhD in particle physics from the University of Colorado, Boulder. During his PhD he created iPython, which enables Python to be used interactively, and now underpins Project Jupyter, which inspired similar notebook approaches such as R Markdown and now Quarto. Somers (2018) describes how open-source notebook approaches create virtuous feedback loops that result in dramatically improved scientific computing. And Romer (2018) aligns the features of open-source approaches, such as Jupyter, with the features that enable scientific consensus and progress. 
In 2017 Pérez was awarded the Association for Computing Machinery (ACM) Software System Award.\n\n\nOne advantage of literate programming is that we get a “live” document in which code executes and then forms part of the document. Another advantage of Quarto is that similar code can compile into a variety of documents, including HTML and PDFs. Quarto also has default options for including a title, author, and date. One disadvantage is that it can take a while for a document to compile because the code needs to run.\nWe need to download Quarto from here. (Skip this step if you are using Posit Cloud because it is already installed.) We can then create a new Quarto document within RStudio: “File” \\(\\rightarrow\\) “New File” \\(\\rightarrow\\) “Quarto Document\\(\\dots\\)”.\nAfter opening a new Quarto document and selecting “Source” view, you will see the default top matter, contained within a pair of three dashes, as well as some examples of text showing a few of the markdown essential commands and R chunks, each of which is discussed further in the following sections.\n\n\n3.2.2 Top matter\nTop matter consists of defining aspects such as the title, author, and date. It is contained within three dashes at the top of a Quarto document. For instance, the following would specify a title, a date that automatically updates to the date the document is rendered, and an author.\n---\ntitle: \"My document\"\nauthor: \"Rohan Alexander\"\ndate: format(Sys.time(), \"%d %B %Y\")\nformat: html\n---\nAn abstract is a short summary of the paper, and we could add that to the top matter.\n---\ntitle: \"My document\"\nauthor: \"Rohan Alexander\"\ndate: format(Sys.time(), \"%d %B %Y\")\nabstract: \"This is my abstract.\"\nformat: html\n---\nBy default, Quarto will create an HTML document, but we can change the output format to produce a PDF. This uses LaTeX in the background and requires the installation of supporting packages. To do this, install tinytex. 
But as it is used in the background we should not need to load it.\n---\ntitle: \"My document\"\nauthor: \"Rohan Alexander\"\ndate: today\nabstract: \"This is my abstract.\"\nformat: pdf\n---\nWe can include references by specifying a BibTeX file in the top matter and then calling it within the text, as needed.\n---\ntitle: \"My document\"\nauthor: \"Rohan Alexander\"\ndate: today\nformat: pdf\nabstract: \"This is my abstract.\"\nbibliography: bibliography.bib\n---\nWe would need to make a separate file called “bibliography.bib” and save it next to the Quarto file. In the BibTeX file we need an entry for the item that is to be referenced. For instance, the citation for R can be obtained with citation() and this can be added to the “bibliography.bib” file. The citation for a package can be found by including the package name, for instance citation(\"tidyverse\"), and again adding the output to the “.bib” file. It can be helpful to use Google Scholar or doi2bib to get citations for books or articles.\nWe need to create a unique key that we use to refer to this item in the text. 
This can be anything, provided it is unique, but meaningful ones can be easier to remember, for instance “citeR”.\n@Manual{citeR,\n title = {R: A Language and Environment for Statistical Computing},\n author = {{R Core Team}},\n organization = {R Foundation for Statistical Computing},\n address = {Vienna, Austria},\n year = {2021},\n url = {https://www.R-project.org/},\n }\n@book{tellingstories,\n title = {Telling Stories with Data},\n author = {Rohan Alexander},\n year = {2023},\n publisher = {Chapman and Hall/CRC},\n url = {https://tellingstorieswithdata.com}\n }\nTo cite R in the Quarto document we then include @citeR, which would put brackets around the year: R Core Team (2021), or [@citeR], which would put brackets around the whole thing: (R Core Team 2021).\nThe reference list at the end of the paper is automatically built based on calling the BibTeX file and including references in the paper. At the end of the Quarto document, include a heading “# References” and the actual citations will be included after that. When the Quarto file is rendered, Quarto sees these in the content, goes to the BibTeX file to get the reference details that it needs, builds the reference list, and then adds it at the end of the rendered document.\nBibTeX will try to adjust the capitalization of entries. This can be helpful, but sometimes it is better to insist on a specific capitalization. To force BibTeX to use a particular capitalization use double braces instead of single braces around the entry. For instance, in the above examples {R Core Team} will be printed with that exact capitalization, whereas {Telling Stories with Data} is subject to the whims of BibTeX. Insisting on a particular capitalization is important when citing R packages, which can have a specific capitalization, and when citing an organization as an author. 
For instance, when citing usethis, you need to use title = {{usethis: Automate Package and Project Setup}},, not title = {usethis: Automate Package and Project Setup},. And if, say, data were provided by the City of Toronto, then when specifying the author to cite that dataset you would want to use author = {{City of Toronto}}, not author = {City of Toronto},. The latter would result in the incorrect reference list entry “Toronto, City of” while the former would result in the correct reference list entry of “City of Toronto”.\n\n\n3.2.3 Essential commands\nQuarto uses a variation of Markdown as its underlying syntax. Essential Markdown commands include those for emphasis, headers, lists, links, and images. A reminder of these is included in RStudio: “Help” \\(\\rightarrow\\) “Markdown Quick Reference”. It is your choice as to whether you want to use the visual or source editor. But either way, it is good to understand these essentials because it will not always be possible to use a visual editor (for instance if you are quickly looking at a Quarto document in GitHub). As you get more experience it can be useful to use a text editor such as Sublime Text, or an alternative Integrated Development Environment such as VS Code.\n\nEmphasis: *italic*, **bold**\nHeaders (these go on their own line with a blank line before and after):\n\n # First level header\n \n ## Second level header\n \n ### Third level header\n\nUnordered list, with sub-lists:\n\n * Item 1\n * Item 2\n + Item 2a\n + Item 2b\n\nOrdered list, with sub-lists:\n\n 1. Item 1\n 2. Item 2\n 3. Item 3\n + Item 3a\n + Item 3b\n\nURLs can be added: [this book](https://www.tellingstorieswithdata.com) results in this book.\nA paragraph is created by leaving a blank line.\n\nA paragraph about an idea, nicely spaced from the following paragraph.\n\nA paragraph about another idea, again spaced from the earlier paragraph.\nOnce we have added some aspects, then we may want to see the actual document. 
To build the document click “Render”.\n\n\n3.2.4 R chunks\nWe can include code for R and many other languages in code chunks within a Quarto document. When we render the document the code will run and be included in the document.\nTo create an R chunk, we start with three backticks and then within curly braces we tell Quarto that this is an R chunk. Anything inside this chunk will be considered R code and run as such. We use data from Kleiber and Zeileis (2008) who provide the R package AER to accompany their book Applied Econometrics with R. We could load the tidyverse and AER (installing AER first if needed) and make a graph of the number of illnesses a survey respondent reported in the past two weeks.\n```{r}\nlibrary(tidyverse)\nlibrary(AER)\n\ndata(\"DoctorVisits\", package = \"AER\")\n\nDoctorVisits |>\n ggplot(aes(x = illness)) +\n geom_histogram(stat = \"count\")\n```\nThe output of that code is Figure 3.1.\n\n\n\n\n\n\n\n\nFigure 3.1: Number of illnesses in the past two weeks, based on the 1977–1978 Australian Health Survey\n\n\n\n\n\nThere are various evaluation options that are available in chunks. We include these, each on a new line, by opening the line with the chunk-specific comment delimiter “#|” and then the option. Helpful options include:\n\necho: This controls whether the code itself is included in the document. For instance, #| echo: false would mean the code will be run and its output will show, but the code itself would not be included in the document.\ninclude: This controls whether the output of the code is included in the document. For instance, #| include: false would run the code, but would not result in any output, and the code itself would not be included in the document.\neval: This controls whether the code should be run. 
For instance, #| eval: false would mean that the code is not run, and hence there would not be any output to include, but the code itself would be included in the document.\nwarning: This controls whether warnings should be included in the document. For instance, #| warning: false would mean that warnings are not included.\nmessage: This controls whether messages should be included in the document. For instance, #| message: false would mean that messages are not included in the document.\n\nFor instance, we could include the output, but not the code, and suppress any warnings.\n```{r}\n#| echo: false\n#| warning: false\n\nlibrary(tidyverse)\nlibrary(AER)\n\ndata(\"DoctorVisits\", package = \"AER\")\n\nDoctorVisits |>\n ggplot(aes(x = visits)) +\n geom_histogram(stat = \"count\")\n```\nLeave a blank line on either side of an R chunk, otherwise it may not run properly. And use lower case for logical values, i.e. “false” not “FALSE”.\nMost people did not visit a doctor in the past two weeks.\n\n```{r}\n#| echo: false\n#| warning: false\n\nlibrary(tidyverse)\nlibrary(AER)\n\ndata(\"DoctorVisits\", package = \"AER\")\n\nDoctorVisits |>\n ggplot(aes(x = visits)) +\n geom_histogram(stat = \"count\")\n```\n\nThere were some people who visited a doctor once, and then...\nThe Quarto document itself must load any datasets that are needed. It is not enough that they are in the environment. This is because the code in the Quarto document is evaluated when the document is rendered, and rendering does not rely on whatever happens to be in the current environment.\nOften when writing code, we may want to make the same change across multiple lines or change all instances of a particular thing. We achieve this with multiple cursors. If you want a cursor across multiple consecutive lines, then hold “option” on Mac or “Alt” on PC, while you drag your cursor over the relevant lines. 
If you want to select all instances of something, then highlight one instance, say a variable name, then use Find/Replace (Command + F on Mac or CTRL + F on PC) and select “All”. This will then enable a cursor at all the other instances.\n\n\n3.2.5 Equations\nWe can include equations by using LaTeX, which is based on the programming language TeX. We invoke math mode in LaTeX by using two dollar signs as opening and closing tags. Then whatever is inside is evaluated as LaTeX mark-up. For instance we can produce the compound interest formula with:\n$$\nA = P\\left(1+\\frac{r}{n}\\right)^{nt}\n$$\n\\[\nA = P\\left(1+\\frac{r}{n}\\right)^{nt}\n\\]\nLaTeX is a comprehensive mark-up language but we will mostly just use it to specify the model of interest. We include some examples here that contain the critical aspects we will draw on starting in Chapter 12.\n$$\ny_i|\\mu_i, \\sigma \\sim \\mbox{Normal}(\\mu_i, \\sigma)\n$$\n\\[\ny_i|\\mu_i, \\sigma \\sim \\mbox{Normal}(\\mu_i, \\sigma)\n\\]\nUnderscores are used to get subscripts: y_i for \\(y_i\\). And we can get a subscript of more than one item by surrounding it with curly braces: y_{i,c} for \\(y_{i,c}\\). In this case we wanted math mode within the line, and so we surround these with only one dollar sign as opening and closing tags.\nGreek letters are typically preceded by a backslash. Common Greek letters include: \\alpha for \\(\\alpha\\), \\beta for \\(\\beta\\), \\delta for \\(\\delta\\), \\epsilon for \\(\\epsilon\\), \\gamma for \\(\\gamma\\), \\lambda for \\(\\lambda\\), \\mu for \\(\\mu\\), \\phi for \\(\\phi\\), \\pi for \\(\\pi\\), \\Pi for \\(\\Pi\\), \\rho for \\(\\rho\\), \\sigma for \\(\\sigma\\), \\Sigma for \\(\\Sigma\\), \\tau for \\(\\tau\\), and \\theta for \\(\\theta\\).\nLaTeX math mode assumes letters are variables and so makes them italic, but sometimes we want a word to appear in normal font because it is not a variable, such as “Normal”. 
In that case we surround it with \\mbox{}, for instance \\mbox{Normal} for \\(\\mbox{Normal}\\).\nWe line up equations across multiple lines using \\begin{aligned} and \\end{aligned}. Then the item that is to be lined up is noted by an ampersand. The following is a model that we will estimate in Chapter 16.\n$$\n\\begin{aligned}\ny_i|\\pi_i & \\sim \\mbox{Bern}(\\pi_i) \\\\\n\\mbox{logit}(\\pi_i) & = \\beta_0+ \\alpha_{g[i]}^{\\mbox{gender}} + \\alpha_{a[i]}^{\\mbox{age}} + \\alpha_{s[i]}^{\\mbox{state}} + \\alpha_{e[i]}^{\\mbox{edu}} \\\\\n\\beta_0 & \\sim \\mbox{Normal}(0, 2.5)\\\\\n\\alpha_{g}^{\\mbox{gender}} & \\sim \\mbox{Normal}(0, 2.5)\\mbox{ for }g=1, 2\\\\\n\\alpha_{a}^{\\mbox{age}} & \\sim \\mbox{Normal}\\left(0, \\sigma^2_{\\mbox{age}}\\right)\\mbox{ for }a = 1, 2, \\dots, A\\\\\n\\alpha_{s}^{\\mbox{state}} & \\sim \\mbox{Normal}\\left(0, \\sigma^2_{\\mbox{state}}\\right)\\mbox{ for }s = 1, 2, \\dots, S\\\\\n\\alpha_{e}^{\\mbox{edu}} & \\sim \\mbox{Normal}\\left(0, \\sigma^2_{\\mbox{edu}}\\right)\\mbox{ for }e = 1, 2, \\dots, E\\\\\n\\sigma_{\\mbox{age}} & \\sim \\mbox{Exponential}(1)\\\\\n\\sigma_{\\mbox{state}} & \\sim \\mbox{Exponential}(1)\\\\\n\\sigma_{\\mbox{edu}} & \\sim \\mbox{Exponential}(1)\n\\end{aligned}\n$$\n\\[\n\\begin{aligned}\ny_i|\\pi_i & \\sim \\mbox{Bern}(\\pi_i) \\\\\n\\mbox{logit}(\\pi_i) & = \\beta_0+ \\alpha_{g[i]}^{\\mbox{gender}} + \\alpha_{a[i]}^{\\mbox{age}} + \\alpha_{s[i]}^{\\mbox{state}} + \\alpha_{e[i]}^{\\mbox{edu}} \\\\\n\\beta_0 & \\sim \\mbox{Normal}(0, 2.5)\\\\\n\\alpha_{g}^{\\mbox{gender}} & \\sim \\mbox{Normal}(0, 2.5)\\mbox{ for }g=1, 2\\\\\n\\alpha_{a}^{\\mbox{age}} & \\sim \\mbox{Normal}\\left(0, \\sigma^2_{\\mbox{age}}\\right)\\mbox{ for }a = 1, 2, \\dots, A\\\\\n\\alpha_{s}^{\\mbox{state}} & \\sim \\mbox{Normal}\\left(0, \\sigma^2_{\\mbox{state}}\\right)\\mbox{ for }s = 1, 2, \\dots, S\\\\\n\\alpha_{e}^{\\mbox{edu}} & \\sim \\mbox{Normal}\\left(0, \\sigma^2_{\\mbox{edu}}\\right)\\mbox{ for }e = 1, 2, 
\\dots, E\\\\\n\\sigma_{\\mbox{age}} & \\sim \\mbox{Exponential}(1)\\\\\n\\sigma_{\\mbox{state}} & \\sim \\mbox{Exponential}(1)\\\\\n\\sigma_{\\mbox{edu}} & \\sim \\mbox{Exponential}(1)\n\\end{aligned}\n\\]\nFinally, certain functions are built into LaTeX. For instance, we can appropriately typeset “log” with \\log.\n\n\n3.2.6 Cross-references\nIt can be useful to cross-reference figures, tables, and equations. This makes it easier to refer to them in the text. To do this for a figure we refer to the name of the R chunk that creates or contains the figure. For instance, consider the following code.\n\n```{r}\n#| label: fig-uniquename\n#| fig-cap: Number of illnesses in the past two weeks, based on the 1977--1978 Australian Health Survey\n#| warning: false\n\ndata(\"DoctorVisits\", package = \"AER\")\n\nDoctorVisits |>\n ggplot(aes(x = illness)) +\n geom_histogram(stat = \"count\")\n```\n\n\n\n\n\n\n\nFigure 3.2: Number of illnesses in the past two weeks, based on the 1977–1978 Australian Health Survey\n\n\n\n\n\nThen (@fig-uniquename) would produce: (Figure 3.2) as the name of the R chunk is fig-uniquename. We need to add “fig” to the start of the chunk name so that Quarto knows that this is a figure. We then include a “fig-cap:” in the R chunk that specifies a caption.\nWe can add #| layout-ncol: 2 in an R chunk within a Quarto document to have two graphs appear side by side (Figure 3.3). Here Figure 3.3 (a) uses the minimal theme, and Figure 3.3 (b) uses the classic theme. These both cross-reference the same label #| label: fig-doctorgraphsidebyside in the R chunk, with an additional option added in the R chunk of #| fig-subcap: [\"Illnesses\",\"Visits to the doctor\"] which provides the sub-captions. 
The addition of a letter in-text is accomplished by adding “-1” and “-2” to the end of the label when it is used in-text: (@fig-doctorgraphsidebyside), @fig-doctorgraphsidebyside-1, and @fig-doctorgraphsidebyside-2 for (Figure 3.3), Figure 3.3 (a), and Figure 3.3 (b), respectively.\n```{r}\n#| eval: true\n#| warning: false\n#| label: fig-doctorgraphsidebyside\n#| fig-cap: \"Two variants of graphs\"\n#| fig-subcap: [\"Illnesses\",\"Visits to the doctor\"]\n#| layout-ncol: 2\n\nDoctorVisits |>\n ggplot(aes(x = illness)) +\n geom_histogram(stat = \"count\") +\n theme_minimal()\n\nDoctorVisits |>\n ggplot(aes(x = visits)) +\n geom_histogram(stat = \"count\") +\n theme_classic()\n```\n\n\n\n\n\n\n\n\n\n\n\n(a) Illnesses\n\n\n\n\n\n\n\n\n\n\n\n(b) Visits to the doctor\n\n\n\n\n\n\n\nFigure 3.3: Two variants of graphs\n\n\n\nWe can take a similar approach to cross-reference tables. For instance, (@tbl-docvisittable) will produce: (Table 3.1). In this case we specify “tbl” at the start of the label so that Quarto knows that it is a table. And we specify a caption for the table with “tbl-cap:”.\n\n```{r}\n#| label: tbl-docvisittable\n#| tbl-cap: \"Distribution of the number of doctor visits\"\n\nDoctorVisits |>\n count(visits) |>\n kable()\n```\n\n\n\nTable 3.1: Distribution of the number of doctor visits\n\n\n\n\n\n\nvisits\nn\n\n\n\n\n0\n4141\n\n\n1\n782\n\n\n2\n174\n\n\n3\n30\n\n\n4\n24\n\n\n5\n9\n\n\n6\n12\n\n\n7\n12\n\n\n8\n5\n\n\n9\n1\n\n\n\n\n\n\n\n\nFinally, we can also cross-reference equations. To do that we need to add a tag such as {#eq-gdpidentity} which we then reference.\n$$\nY = C + I + G + (X - M)\n$$ {#eq-gdpidentity}\nFor instance, we then use @eq-gdpidentity to produce Equation 3.1.\n\\[\nY = C + I + G + (X - M)\n\\tag{3.1}\\]\nLabels should be relatively simple when using cross-references. In general, try to keep the names simple but unique, avoid punctuation, and stick to letters and hyphens. Try not to use underscores, because they can cause an error.",
"crumbs": [
"Foundations",
"3 Reproducible workflows"
@@ -346,7 +346,7 @@
"href": "03-workflow.html#exercises",
"title": "3 Reproducible workflows",
"section": "3.8 Exercises",
- "text": "3.8 Exercises\n\nPractice\n\n(Plan) Consider the following scenario: In a certain country there are only ever four parties that could win a seat in parliament. Whichever candidate has a plurality of votes in the area associated with a given seat wins that seat. The parliament is made up of 175 total seats. An analyst is interested in the number of votes for each party by seat. Please sketch what a dataset could look like, and then sketch a graph that you could build to show all observations.\n(Simulate) Please further consider the scenario described and simulate the situation. Carefully specify an appropriate situation using the code below. Then write five tests based on the simulated data.\n\n\nlibrary(tidyverse)\n\nelection_results <-\n tibble(\n seat = rep(1:175, each = 4),\n party = rep(x = 1:4, times = 175),\n votes = runif(n = 175 * 4, min = 0, max = 1000) |> floor()\n )\n\n\n(Acquire) Please specify a source of actual data about voting in a country of interest to you.\n(Explore) Start with the following code and create a table of the number of seats won by each party.\n\n\nlibrary(tidyverse)\n\nelection_results |> \n slice_max(votes, n = 1, by = seat) |> \n count(party) |>\n kable()\n\n\n(Share) Please write two paragraphs as if you had gathered data from the source you identified (rather than simulated), and if the table that you built using simulated data reflected the actual situation. The exact details contained in the paragraphs do not have to be factual but they should be reasonable (i.e. you do not actually have to get the data nor create the graphs). Separate the code appropriately into R files and a Quarto doc. 
Submit a link to a GitHub repo with a README.\n\n\n\nQuiz\n\nFrom Gelman (2016), which statistical concept refers to researchers exploiting flexibility in data analysis to find significant results (pick one)?\n\nRandom sampling.\nP-hacking.\nNull hypothesis testing.\nBayesian inference.\n\nFrom Gelman (2016), what is “p-hacking” (pick one)?\n\nA method for correcting p-values.\nManipulating data or analyses until nonsignificant results become significant.\nA technique for improving computational efficiency.\nAn ethical approach to data sharing.\n\nFrom Gelman (2016), what is the file drawer problem (pick one)?\n\nBias introduced by only publishing significant findings.\nDifficulty in accessing archived data.\nErrors in data coding and entry.\nChallenges in replicating old experiments.\n\nFrom Gelman (2016), which term refers to the tendency to publish only positive findings (pick one)?\n\nData mining.\nPublication bias.\nConfirmation bias.\nSampling error.\n\nFrom Gelman (2016), which term describes the multitude of choices researchers have in data analysis that can lead to significant results (pick one)?\n\nResearcher degrees of freedom.\nData mining.\nSample bias.\nEffect size manipulation.\n\nFrom Gelman (2016), the garden of forking paths refers to what issue (pick one)?\n\nThe complexity of decision trees in machine learning.\nThe multiple potential analyses that can be conducted with the same data.\nThe branching of theory and applied work over time.\nThe divergence of academic disciplines.\n\nFrom Gelman (2016), what is a “replication” in research (pick one)?\n\nA study that reproduces the original findings using new data.\nA study that critiques previous methodologies.\nA meta-analysis of several studies.\nAn exact copy of the original study’s manuscript.\n\nFrom Gelman (2016), what contributes to non-reproducible results in social sciences (pick one)?\n\nInadequate sample sizes.\nLack of advanced statistical software.\nResearcher degrees of freedom leading 
to selective reporting.\nOver-reliance on qualitative data.\n\nFrom Gelman (2016), what does the replication crisis refer to (pick one)?\n\nDifficulty in creating new theory.\nOverproduction of similar research.\nChallenges in replicating the findings of previous studies.\nShortage of participants for experiments.\n\nFrom Gelman (2016), what helps mitigate the replication crisis (pick one)?\n\nKeeping data confidential.\nPublishing only significant results.\nPreregistering studies and analysis plans.\nIncreasing the use of proprietary software.\n\nGelman (2016) focuses on the replication crisis in psychology. Pick another discipline to focus on, based on your own experience, perhaps in other classes, and write about the extent to which you think there may be replication issues in that discipline and why.\nPick a discipline you’re familiar with. What practices could improve reproducibility in that field? Provide a brief explanation.\nFrom Wilson et al. (2017), which of the following are important data management practices (select all that apply)?\n\nSaving both the raw data and cleaned versions.\nDocumenting the data processing steps.\nUsing non-proprietary file formats for data storage.\n\nFrom Wilson et al. (2017), why is it important to create a README file in the project’s home directory (pick one)?\n\nTo store raw data files.\nTo explain the purpose of the project and provide an overview.\nTo list all the errors and bugs in the project.\nTo keep track of all the versions of the project files.\n\nFrom Wilson et al. (2017), what is a primary benefit of using version control (pick one)?\n\nIt automatically writes code for the researcher.\nIt tracks changes and helps collaboration.\nIt replaces the need for backing up data.\nIt ensures that all data is encrypted.\n\nFrom Wilson et al. 
(2017), what is a recommended practice for naming files in a project (pick one)?\n\nReflect their content or function in the file names.\nUsing sequential numbers like result1.csv, result2.csv.\nIncluding special characters to make file names unique.\nUsing spaces and punctuation in file names.\n\nFrom Wilson et al. (2017), why should an unmodified copy of the raw data be saved (pick one)?\n\nTo preserve data storage space.\nTo comply with legal regulations.\nTo ensure an unaltered source for verification and reproducibility.\nTo maintain compatibility with software updates.\n\nFrom Wilson et al. (2017), what is a key advantage of using open file formats (pick one)?\n\nThey are faster to process.\nThey are accessible without proprietary software.\nThey compress data more efficiently.\nThey enhance data security.\n\nFrom Wilson et al. (2017), which of the following is a recommended practice when organizing data files (select all that apply)?\n\nUse meaningful and consistent file names.\nStore all files in a single folder.\nOrganize files into a clear directory structure.\nInclude dates in file names for version tracking.\n\nFrom Wilson et al. 
(2017), why is documenting data processing steps crucial (pick one)?\n\nIt speeds up data analysis.\nIt helps in data encryption.\nIt reduces storage requirements.\nIt allows others to understand and reproduce the analysis.\n\nWhat is a benefit of reproducibility?\n\nIt allows results to be independently verified.\nIt speeds up code execution.\nIt makes data visualization easier.\nIt reduces the need for documentation.\n\nAccording to Alexander (2019) research is reproducible if (pick one)?\n\nIt is published in peer-reviewed journals.\nAll the materials used in the study are provided.\nIt can be reproduced exactly without the authors providing materials.\nIt can be reproduced exactly, given all the materials used in the study.\n\nWhat is literate programming (pick one)?\n\nIt separates code and documentation into different files.\nIt automatically fixes syntax errors in code.\nIt automates the generation of code documentation.\nIt integrates code and natural language in the same document.\n\nWhat is the main function of git in a reproducible workflow (pick one)?\n\nTo automate data cleaning.\nTo run code in parallel.\nTo integrate data visualization into reports.\nTo provide a version control system for code.\n\nAccording to Wickham (2021), how would the files “00_get_data.R” and “get data.R” be classified (pick one)?\n\nbad; bad.\ngood; bad.\nbad; good.\ngood; good.\n\nWhat is a benefit of using Quarto for reproducible research (pick one)?\n\nIt automates statistical analysis.\nIt integrates code and text.\nIt replaces the need for version control.\nIt enhances data visualization capabilities.\n\nIn Quarto, how do you denote a top-level heading (pick one)?\n\n3.8.1 Heading\nHeading\n4 Heading\n\nHeading\n\n\nWhich of the following would result in bold text in Quarto (pick one)?\n\n**bold**\n##bold##\n*bold*\n#bold#\n\nWhat does the “echo” option do in a Quarto R code chunk (pick one)?\n\nTo suppress code output.\nTo control whether the code is displayed in the 
document.\nTo evaluate the code conditionally.\nTo include warnings in the output.\n\nWhich option would hide the warnings in a Quarto R chunk (pick one)?\n\necho: false\neval: false\nwarning: false\nmessage: false\n\nWhich option would run the R code chunk and display the results, but not show the code in a Quarto R chunk (pick one)?\n\necho: false\ninclude: false\neval: false\nwarning: false\nmessage: false\n\nWhy are R Projects important (select all that apply)?\n\nThey help with reproducibility.\nThey make it easier to share code.\nThey make your workspace more organized.\n\nWhy is it important that your R Project name reflects the content of your repo (select all that apply)?\n\nConsistency.\nProfessionalism.\nAttention to detail.\n\nAssume the packages and datasets have been loaded; what is the mistake in this code: DoctorVisits |> filter(visits) (pick one)?\n\nDoctorVisits\n|>\nfilter\nvisits\n\nWhat is a reprex and why is it important to be able to make one (select all that apply)?\n\nA reproducible example that enables your error to be reproduced.\nA reproducible example that helps others help you.\nA reproducible example during the construction of which you may solve your own problem.\nA reproducible example that demonstrates you have actually tried to help yourself.\n\nAccording to Gelfand (2021), what is the key part of “If you need help getting unstuck, the first step is to create a reprex, or reproducible example. The goal of a reprex is to package your problematic code in such a way that other people can run it and feel your pain. 
Then, hopefully, they can provide a solution and put you out of your misery.” (pick one)?\n\npackage your problematic code\nother people can run it and feel your pain\nthe first step is to create a reprex\nthey can provide a solution and put you out of your misery\n\nFrom Gelfand (2021), why is creating a reproducible example important when seeking help (pick one)?\n\nIt reduces the need for documentation.\nIt showcases your coding skills.\nIt allows others to replicate the issue and provide solutions.\nIt complies with software licensing.\n\nWhich practice enhances the efficiency of code when collaborating with others (pick one)?\n\nUsing absolute file paths.\nWriting clear comments and documentation.\nMinimizing the use of functions.\nObfuscating code to protect intellectual property.\n\nWhich of the following is an advantage of using Git for version control (select all that apply)?\n\nTracking changes over time.\nFacilitating collaboration among multiple users.\nAutomating data backups.\nEnhancing code execution speed.\n\nWhy do we avoid using setwd() in R scripts (pick one)?\n\nIt can slow down code execution.\nIt requires administrative privileges.\nIt makes code less portable and reproducible.\nIt is deprecated in recent R versions.\n\nIn the context of reproducibility, what is the function of the renv package (pick one)?\n\nTo run code chunks in parallel.\nTo document and share the software environment.\nTo automate code linting.\nTo improve code efficiency in simulations.\n\nWhich of the following does NOT contribute to a reproducible workflow (pick one)?\n\nUsing setwd() to set the working directory.\nSharing code and data as well as results.\nUsing Quarto for integrating R and Python code in papers.\nUsing version control with Git and GitHub.\n\nFrom Wickham (2021), which of the following variable names follows the recommended style (pick one)?\n\ntotal-Sales\nTotalSales\ntotal_sales\ntotal sales\n\nWhat is the primary function of the lintr package in R 
(pick one)?\n\nTo execute code in parallel.\nTo install package dependencies.\nTo provide code linting for stylistic consistency.\nTo visualize data distributions.\n\nWhat is code refactoring (pick one)?\n\nDebugging code to fix errors.\nRewriting code to improve its structure without changing its behavior.\nAdding new features to existing code.\nConverting code from one language to another.\n\nWhy should “magic numbers” be avoided in code (pick one)?\n\nThey slow down execution.\nThey reduce code readability and maintainability.\nThey are incompatible with certain software.\nThey cause syntax errors.\n\nWhat is the main purpose of using a linter when writing code (pick one)?\n\nTo find logical errors in algorithms.\nTo execute code faster.\nTo enforce coding style guidelines.\nTo compile code into machine language.\n\nIn the context of reproducibility, what does “future-you” refer to (pick one)?\n\nAutomated code generation.\nYour ability to understand and reuse your code at a later time.\nPredictive analytics in your code.\nCollaborative work with future colleagues.\n\n\n\n\nActivity\nThe purpose of this activity is to give and review peer review. Peer review generally, and code review specifically (Sadowski et al. 2018), is an important part of working as a professional.\nPlease update your work from the Activity in Chapter 2, to use the starter folder. This would involve, amongst other things, moving the downloading and cleaning to appropriate scripts, updating the README, adding a title, etc. Generally, you should look at the rubric for the Donaldson Paper in Online Appendix E and quickly try to comply with as much as possible, without doing too much extra work. Then exchange it with someone else.\nRead Google (2022) and Feldman (2024). Then using GitHub Issues please conduct peer review of the repo content. 
Following Feldman (2024), the peer review should use the following structure, and be nicely formatted:\n\nSummary\n[Add a brief summary of the manuscript you are reviewing.]\nStrong positive points:\n[Please keep this brief. Two or three dot points.]\nCritical improvements needed:\n[This is the most important section. These are issues that the authors of the paper must fix and/or address. Be very constructive and polite, be gentle but clear, and provide as much information as you can to assist authors. This might include mistakes/errors, missing information, oversights, misunderstandings, etc. If you can, explain why these are mistakes and provide either a correction or a link to where correct information can be found.]\nSuggestions for improvement:\n[This is you trying to help the authors do even better, as nice to have. You can comment on things that you are not sure of, or state an opinion, point out typos, or minor code issues, but be humble about this being a suggestion, and be very positive and constructive. There should be about five/six dot points.]\nEvaluation:\n[Add each element of the rubric and provide a comment and mark against it. This will NOT be used for grading, this is ONLY to provide authors with some idea of how much work they need to put into each element to improve.]\nEstimated overall mark:\n[X] out of [Y].\nAny other comments:\n[Any other .]\n\n\n\nPaper\nAt about this point the Donaldson Paper from Online Appendix E would be appropriate.\n\n\n\n\nAlexander, Monica. 2019. “Reproducibility in Demographic Research.” https://www.monicaalexander.com/posts/2019-10-20-reproducibility/.\n\n\n———. 2021. “Overcoming Barriers to Sharing Code.” YouTube, February. https://youtu.be/yvM2C6aZ94k.\n\n\nBackus, John. 1981. “The History of FORTRAN I, II, and III.” In History of Programming Languages, edited by Richard Wexelblat, 25–74. Academic Press.\n\n\nBarba, Lorena. 2018. 
“Terminologies for Reproducible Research.” https://arxiv.org/abs/1802.03311.\n\n\nBegley, Glenn, and Lee Ellis. 2012. “Raise Standards for Preclinical Cancer Research.” Nature 483 (7391): 531--533. https://doi.org/10.1038/483531a.\n\n\nBengtsson, Henrik. 2021. “A Unifying Framework for Parallel and Distributed Processing in R using Futures.” The R Journal 13 (2): 208–27. https://doi.org/10.32614/RJ-2021-048.\n\n\nBowers, Jake, and Maarten Voors. 2016. “How to Improve Your Relationship with Your Future Self.” Revista de Ciencia Polı́tica 36 (3): 829–48. https://doi.org/10.4067/S0718-090X2016000300011.\n\n\nBrown, Zack. 2018. “A Git Origin Story.” Linux Journal, July. https://www.linuxjournal.com/content/git-origin-story.\n\n\nBryan, Jenny. 2018a. “Excuse Me, Do You Have a Moment to Talk about Version Control?” The American Statistician 72 (1): 20–27. https://doi.org/10.1080/00031305.2017.1399928.\n\n\n———. 2018b. “Code Smells and Feels.” YouTube, July. https://youtu.be/7oyiPBjLAWY.\n\n\n———. 2020. Happy Git and GitHub for the useR. https://happygitwithr.com.\n\n\nBryan, Jenny, and Jim Hester. 2020. What They Forgot to Teach You About R. https://rstats.wtf/index.html.\n\n\nBryan, Jenny, Jim Hester, David Robinson, Hadley Wickham, and Christophe Dervieux. 2022. reprex: Prepare Reproducible Example Code via the Clipboard. https://CRAN.R-project.org/package=reprex.\n\n\nBuckheit, Jonathan, and David Donoho. 1995. “Wavelab and Reproducible Research.” In Wavelets and Statistics, 55–81. Springer. https://doi.org/10.1007/978-1-4612-2544-7_5.\n\n\nBurton, Jason, Nicole Cruz, and Ulrike Hahn. 2021. “Reconsidering Evidence of Moral Contagion in Online Social Networks.” Nature Human Behaviour 5 (12): 1629–35. https://doi.org/10.1038/s41562-021-01133-5.\n\n\nChawla, Dalmeet Singh. 2020. “Critiqued Coronavirus Simulation Gets Thumbs up from Code-Checking Efforts.” Nature 582: 323–24. 
https://doi.org/10.1038/d41586-020-01685-y.\n\n\nChouldechova, Alexandra, Diana Benavides-Prado, Oleksandr Fialko, and Rhema Vaithianathan. 2018. “A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions.” In Proceedings of the 1st Conference on Fairness, Accountability and Transparency, edited by Sorelle Friedler and Christo Wilson, 81:134–48. Proceedings of Machine Learning Research. https://proceedings.mlr.press/v81/chouldechova18a.html.\n\n\nCohen, Jason, Steven Teleki, and Eric Brown. 2006. Best Kept Secrets of Peer Code Review. Smart Bear Incorporated.\n\n\nCsárdi, Gábor. 2022. gitcreds: Query “git” Credentials from “R”. https://CRAN.R-project.org/package=gitcreds.\n\n\nDolatsara, Hamidreza Ahady, Ying-Ju Chen, Robert Leonard, Fadel Megahed, and Allison Jones-Farmer. 2021. “Explaining Predictive Model Performance: An Experimental Study of Data Preparation and Model Choice.” Big Data, October. https://doi.org/10.1089/big.2021.0067.\n\n\nEghbal, Nadia. 2020. Working in Public: The Making and Maintenance of Open Source Software. California: Stripe Press.\n\n\nFeldman, Gilad. 2024. RRR Assessment Peer Review. https://mgto.org/rrrassessmentreviewtemplate.\n\n\nFowler, Martin, and Kent Beck. 2018. Refactoring: Improving the Design of Existing Code. 2nd ed. New York: Addison-Wesley Professional.\n\n\nGelfand, Sharla. 2021. “Make a ReprEx... Please.” YouTube, February. https://youtu.be/G5Nm-GpmrLw.\n\n\nGelman, Andrew. 2016. “What has happened down here is the winds have changed,” September. https://statmodeling.stat.columbia.edu/2016/09/21/what-has-happened-down-here-is-the-winds-have-changed/.\n\n\nGelman, Andrew, Greggor Mattson, and Daniel Simpson. 2018. “Gaydar and the Fallacy of Decontextualized Measurement.” Sociological Science 5 (12): 270–80. https://doi.org/10.15195/v5.a12.\n\n\nGoogle. 2022. “What to Look for in a Code Review.” Google Engineering Practices Documentation. 
https://google.github.io/eng-practices/review/reviewer/looking-for.html.\n\n\nHeil, Benjamin, Michael Hoffman, Florian Markowetz, Su-In Lee, Casey Greene, and Stephanie Hicks. 2021. “Reproducibility Standards for Machine Learning in the Life Sciences.” Nature Methods 18 (10): 1132–35. https://doi.org/10.1038/s41592-021-01256-7.\n\n\nHerndon, Thomas, Michael Ash, and Robert Pollin. 2014. “Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff.” Cambridge Journal of Economics 38 (2): 257–79. https://doi.org/10.1093/cje/bet075.\n\n\nHester, Jim, Florent Angly, Russ Hyde, Michael Chirico, Kun Ren, Alexander Rosenstock, and Indrajeet Patil. 2022. lintr: A “Linter” for R Code. https://CRAN.R-project.org/package=lintr.\n\n\nHillel, Wayne. 2017. How Do We Trust Our Science Code? https://www.hillelwayne.com/how-do-we-trust-science-code/.\n\n\nIrving, Damien, Kate Hertweck, Luke Johnston, Joel Ostblom, Charlotte Wickham, and Greg Wilson. 2021. Research Software Engineering with Python. Chapman; Hall/CRC.\n\n\nIsaacson, Walter. 2011. Steve Jobs. 1st ed. Simon & Schuster.\n\n\nKleiber, Christian, and Achim Zeileis. 2008. Applied Econometrics with R. New York: Springer-Verlag. https://CRAN.R-project.org/package=AER.\n\n\nKnuth, Donald. 1984. “Literate Programming.” The Computer Journal 27 (2): 97–111. https://doi.org/10.1093/comjnl/27.2.97.\n\n\nLandau, William Michael. 2021. “The targets R Package: A Dynamic Make-Like Function-Oriented Pipeline Toolkit for Reproducibility and High-Performance Computing.” Journal of Open Source Software 6 (57): 2959. https://doi.org/10.21105/joss.02959.\n\n\nMaier, Maximilian, František Bartoš, Tom Stanley, David Shanks, Adam Harris, and Eric-Jan Wagenmakers. 2022. “No Evidence for Nudging After Adjusting for Publication Bias.” Proceedings of the National Academy of Sciences 119 (31): e2200300119. https://doi.org/10.1073/pnas.2200300119.\n\n\nMatsumoto, Yukihiro. 2007. 
“Treating Code as an Essay.” In Beautiful Code, edited by Andy Oram and Greg Wilson, 477–81. O’Reilly.\n\n\nMattson, Greggor. 2017. “Artificial Intelligence Discovers Gayface. Sigh.” https://greggormattson.com/2017/09/09/artificial-intelligence-discovers-gayface/amp/.\n\n\nMerali, Zeeya. 2010. “Computational Science:... Error.” Nature 467 (7317): 775–77. https://doi.org/10.1038/467775a.\n\n\nMineault, Patrick, and The Good Research Code Handbook Community. 2021. “The Good Research Code Handbook.” https://doi.org/10.5281/zenodo.5796873.\n\n\nMinsky, Yaron. 2011. “OCaml for the masses.” Communications of the ACM 54 (11): 53–58. https://doi.org/10.1145/2018396.2018413.\n\n\n———. 2015. “Automated Trading and OCaml with Yaron Minsky.” Hackers — Software Engineering Daily, November. https://softwareengineeringdaily.com/2015/11/09/automated-trading-and-ocaml-with-yaron-minsky/.\n\n\nMiyakawa, Tsuyoshi. 2020. “No Raw Data, No Science: Another Possible Source of the Reproducibility Crisis.” Molecular Brain 13 (1): 1–6. https://doi.org/10.1186/s13041-020-0552-2.\n\n\nMullard, Asher. 2021. “Half of Top Cancer Studies Fail High-Profile Reproducibility Effort.” Nature 600 (7889): 368--369. https://doi.org/10.1038/d41586-021-03691-0.\n\n\nMüller, Kirill, and Lorenz Walthert. 2022. styler: Non-Invasive Pretty Printing of R Code. https://CRAN.R-project.org/package=styler.\n\n\nMurphy, Heather. 2017. “Why Stanford Researchers Tried to Create a ‘Gaydar’ Machine.” The New York Times, October. https://www.nytimes.com/2017/10/09/science/stanford-sexual-orientation-study.html.\n\n\nPerkel, Jeffrey. 2023. “The Sleight-of-Hand Trick That Can Simplify Scientific Computing.” Nature 617 (7959): 212--213. https://doi.org/10.1038/d41586-023-01469-0.\n\n\nPineau, Joelle, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivière, Alina Beygelzimer, Florence d’Alché-Buc, Emily Fox, and Hugo Larochelle. 2021. 
“Improving Reproducibility in Machine Learning Research (a Report from the NeurIPS 2019 Reproducibility Program).” Journal of Machine Learning Research 22 (164): 1–20. http://jmlr.org/papers/v22/20-303.html.\n\n\nR Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.\n\n\nRilke, Rainer Maria. (1929) 2014. Letters to a Young Poet. Penguin Classics.\n\n\nRomer, Paul. 2018. “Jupyter, Mathematica, and the Future of the Research Paper,” April. https://paulromer.net/jupyter-mathematica-and-the-future-of-the-research-paper/.\n\n\nSadowski, Caitlin, Emma Söderberg, Luke Church, Michal Sipko, and Alberto Bacchelli. 2018. “Modern Code Review: A Case Study at Google.” In Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice, 181–90. ICSE-SEIP ’18. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3183519.3183525.\n\n\nSilver, Nate. 2020. “We Fixed an Issue with How Our Primary Forecast Was Calculating Candidates’ Demographic Strengths.” FiveThirtyEight, February. https://fivethirtyeight.com/features/we-fixed-a-mistake-in-how-our-primary-forecast-was-calculating-candidates-demographic-strengths/.\n\n\nSimpkinson, Scott. 1971. “Testing to Ensure Mission Success.” In What Made Apollo a Success, edited by NASA, 21–29.\n\n\nSomers, James. 2018. “The Scientific Paper Is Obsolete.” The Atlantic, April. https://www.theatlantic.com/science/archive/2018/04/the-scientific-paper-is-obsolete/556676/.\n\n\nSprint, Gina, and Jason Conci. 2019. “Mining GitHub Classroom Commit Behavior in Elective and Introductory Computer Science Courses.” Journal of Computing Sciences in Colleges 35 (1): 76–84.\n\n\nSunstein, Cass, and Lucia Reisch. 2017. The Economics of Nudge. 
Routledge.\n\n\nSzaszi, Barnabas, Anthony Higney, Aaron Charlton, Andrew Gelman, Ignazio Ziano, Balazs Aczel, Daniel Goldstein, David Yeager, and Elizabeth Tipton. 2022. “No Reason to Expect Large and Consistent Effects of Nudge Interventions.” Proceedings of the National Academy of Sciences 119 (31): e2200732119. https://doi.org/10.1073/pnas.2200732119.\n\n\nTrisovic, Ana, Matthew Lau, Thomas Pasquier, and Mercè Crosas. 2022. “A Large-Scale Study on Research Code Quality and Execution.” Scientific Data 9 (1). https://doi.org/10.1038/s41597-022-01143-6.\n\n\nUshey, Kevin. 2022. renv: Project Environments. https://CRAN.R-project.org/package=renv.\n\n\nVidoni, Melina. 2021. “Evaluating Unit Testing Practices in R Packages.” In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), 1523–34. https://doi.org/10.1109/ICSE43902.2021.00136.\n\n\nWang, Yilun, and Michal Kosinski. 2018. “Deep Neural Networks Are More Accurate Than Humans at Detecting Sexual Orientation from Facial Images.” Journal of Personality and Social Psychology 114 (2): 246–57. https://doi.org/10.1037/pspa0000098.\n\n\nWickham, Hadley. 2021. The Tidyverse Style Guide. https://style.tidyverse.org/index.html.\n\n\nWickham, Hadley, Mara Averick, Jenny Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.\n\n\nWickham, Hadley, Jennifer Bryan, and Malcolm Barrett. 2022. usethis: Automate Package and Project Setup. https://CRAN.R-project.org/package=usethis.\n\n\nWilson, Greg, Jenny Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, and Tracy Teal. 2017. “Good Enough Practices in Scientific Computing.” PLOS Computational Biology 13 (6): 1–20. https://doi.org/10.1371/journal.pcbi.1005510.\n\n\nXie, Yihui. 2019. “TinyTeX: A lightweight, cross-platform, and easy-to-maintain LaTeX distribution based on TeX Live.” TUGboat, no. 1: 30–32. 
https://tug.org/TUGboat/Contents/contents40-1.html.\n\n\n———. 2023. knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.",
+ "text": "3.8 Exercises\n\nPractice\n\n(Plan) Consider the following scenario: In a certain country there are only ever four parties that could win a seat in parliament. Whichever candidate has a plurality of votes in the area associated with a given seat wins that seat. The parliament is made up of 175 total seats. An analyst is interested in the number of votes for each party by seat. Please sketch what a dataset could look like, and then sketch a graph that you could build to show all observations.\n(Simulate) Please further consider the scenario described and simulate the situation. Carefully specify an appropriate situation using the code below. Then write five tests based on the simulated data.\n\n\nlibrary(tidyverse)\n\nelection_results <-\n tibble(\n seat = rep(1:175, each = 4),\n party = rep(x = 1:4, times = 175),\n votes = runif(n = 175 * 4, min = 0, max = 1000) |> floor()\n )\n\n\n(Acquire) Please specify a source of actual data about voting in a country of interest to you.\n(Explore) Start with the following code and create a table of the number of seats won by each party.\n\n\nlibrary(tidyverse)\n\nelection_results |> \n slice_max(votes, n = 1, by = seat) |> \n count(party) |>\n knitr::kable()\n\n\n(Share) Please write two paragraphs as if you had gathered data from the source you identified (rather than simulated), and as if the table that you built using simulated data reflected the actual situation. The exact details contained in the paragraphs do not have to be factual but they should be reasonable (i.e. you do not actually have to get the data nor create the graphs). Separate the code appropriately into R files and a Quarto doc. 
Submit a link to a GitHub repo with a README.\n\n\n\nQuiz\n\nFrom Gelman (2016), which statistical concept refers to researchers exploiting flexibility in data analysis to find significant results (pick one)?\n\nRandom sampling.\nP-hacking.\nNull hypothesis testing.\nBayesian inference.\n\nFrom Gelman (2016), what is “p-hacking” (pick one)?\n\nA method for correcting p-values.\nManipulating data or analyses until nonsignificant results become significant.\nA technique for improving computational efficiency.\nAn ethical approach to data sharing.\n\nFrom Gelman (2016), what is the file drawer problem (pick one)?\n\nBias introduced by only publishing significant findings.\nDifficulty in accessing archived data.\nErrors in data coding and entry.\nChallenges in replicating old experiments.\n\nFrom Gelman (2016), which term refers to the tendency to publish only positive findings (pick one)?\n\nData mining.\nPublication bias.\nConfirmation bias.\nSampling error.\n\nFrom Gelman (2016), which term describes the multitude of choices researchers have in data analysis that can lead to significant results (pick one)?\n\nResearcher degrees of freedom.\nData mining.\nSample bias.\nEffect size manipulation.\n\nFrom Gelman (2016), the garden of forking paths refers to what issue (pick one)?\n\nThe complexity of decision trees in machine learning.\nThe multiple potential analyses that can be conducted with the same data.\nThe branching of theory and applied work over time.\nThe divergence of academic disciplines.\n\nFrom Gelman (2016), what is a “replication” in research (pick one)?\n\nA study that reproduces the original findings using new data.\nA study that critiques previous methodologies.\nA meta-analysis of several studies.\nAn exact copy of the original study’s manuscript.\n\nFrom Gelman (2016), what contributes to non-reproducible results in social sciences (pick one)?\n\nInadequate sample sizes.\nLack of advanced statistical software.\nResearcher degrees of freedom leading 
to selective reporting.\nOver-reliance on qualitative data.\n\nFrom Gelman (2016), what does the replication crisis refer to (pick one)?\n\nDifficulty in creating new theory.\nOverproduction of similar research.\nChallenges in replicating the findings of previous studies.\nShortage of participants for experiments.\n\nFrom Gelman (2016), what helps mitigate the replication crisis (pick one)?\n\nKeeping data confidential.\nPublishing only significant results.\nPreregistering studies and analysis plans.\nIncreasing the use of proprietary software.\n\nGelman (2016) focuses on the replication crisis in psychology. Pick another discipline to focus on, based on your own experience, perhaps in other classes, and write about the extent to which you think there may be replication issues in that discipline and why.\nPick a discipline you’re familiar with. What practices could improve reproducibility in that field? Provide a brief explanation.\nFrom Wilson et al. (2017), which of the following are important data management practices (select all that apply)?\n\nSaving both the raw data and cleaned versions.\nDocumenting the data processing steps.\nUsing non-proprietary file formats for data storage.\n\nFrom Wilson et al. (2017), why is it important to create a README file in the project’s home directory (pick one)?\n\nTo store raw data files.\nTo explain the purpose of the project and provide an overview.\nTo list all the errors and bugs in the project.\nTo keep track of all the versions of the project files.\n\nFrom Wilson et al. (2017), what is a primary benefit of using version control (pick one)?\n\nIt automatically writes code for the researcher.\nIt tracks changes and helps collaboration.\nIt replaces the need for backing up data.\nIt ensures that all data is encrypted.\n\nFrom Wilson et al. 
(2017), what is a recommended practice for naming files in a project (pick one)?\n\nReflect their content or function in the file names.\nUsing sequential numbers like result1.csv, result2.csv.\nIncluding special characters to make file names unique.\nUsing spaces and punctuation in file names.\n\nFrom Wilson et al. (2017), why should an unmodified copy of the raw data be saved (pick one)?\n\nTo preserve data storage space.\nTo comply with legal regulations.\nTo ensure an unaltered source for verification and reproducibility.\nTo maintain compatibility with software updates.\n\nFrom Wilson et al. (2017), what is a key advantage of using open file formats (pick one)?\n\nThey are faster to process.\nThey are accessible without proprietary software.\nThey compress data more efficiently.\nThey enhance data security.\n\nFrom Wilson et al. (2017), which of the following is a recommended practice when organizing data files (select all that apply)?\n\nUse meaningful and consistent file names.\nStore all files in a single folder.\nOrganize files into a clear directory structure.\nInclude dates in file names for version tracking.\n\nFrom Wilson et al. 
(2017), why is documenting data processing steps crucial (pick one)?\n\nIt speeds up data analysis.\nIt helps in data encryption.\nIt reduces storage requirements.\nIt allows others to understand and reproduce the analysis.\n\nWhat is a benefit of reproducibility?\n\nIt allows results to be independently verified.\nIt speeds up code execution.\nIt makes data visualization easier.\nIt reduces the need for documentation.\n\nAccording to Alexander (2019) research is reproducible if (pick one)?\n\nIt is published in peer-reviewed journals.\nAll the materials used in the study are provided.\nIt can be reproduced exactly without the authors providing materials.\nIt can be reproduced exactly, given all the materials used in the study.\n\nWhat is literate programming (pick one)?\n\nIt separates code and documentation into different files.\nIt automatically fixes syntax errors in code.\nIt automates the generation of code documentation.\nIt integrates code and natural language in the same document.\n\nWhat is the main function of git in a reproducible workflow (pick one)?\n\nTo automate data cleaning.\nTo run code in parallel.\nTo integrate data visualization into reports.\nTo provide a version control system for code.\n\nAccording to Wickham (2021), how would the files “00_get_data.R” and “get data.R” be classified (pick one)?\n\nbad; bad.\ngood; bad.\nbad; good.\ngood; good.\n\nWhat is a benefit of using Quarto for reproducible research (pick one)?\n\nIt automates statistical analysis.\nIt integrates code and text.\nIt replaces the need for version control.\nIt enhances data visualization capabilities.\n\nIn Quarto, how do you denote a top-level heading (pick one)?\n\n3.8.1 Heading\nHeading\n4 Heading\n\nHeading\n\n\nWhich of the following would result in bold text in Quarto (pick one)?\n\n**bold**\n##bold##\n*bold*\n#bold#\n\nWhat does the “echo” option do in a Quarto R code chunk (pick one)?\n\nTo suppress code output.\nTo control whether the code is displayed in the 
document.\nTo evaluate the code conditionally.\nTo include warnings in the output.\n\nWhich option would hide the warnings in a Quarto R chunk (pick one)?\n\necho: false\neval: false\nwarning: false\nmessage: false\n\nWhich option would run the R code chunk and display the results, but not show the code in a Quarto R chunk (pick one)?\n\necho: false\ninclude: false\neval: false\nwarning: false\nmessage: false\n\nWhy are R Projects important (select all that apply)?\n\nThey help with reproducibility.\nThey make it easier to share code.\nThey make your workspace more organized.\n\nWhy is it important that your R Project name reflects the content of your repo (select all that apply)?\n\nConsistency.\nProfessionalism.\nAttention to detail.\n\nAssume the packages and datasets have been loaded; what is the mistake in this code: DoctorVisits |> filter(visits) (pick one)?\n\nDoctorVisits\n|>\nfilter\nvisits\n\nWhat is a reprex and why is it important to be able to make one (select all that apply)?\n\nA reproducible example that enables your error to be reproduced.\nA reproducible example that helps others help you.\nA reproducible example during the construction of which you may solve your own problem.\nA reproducible example that demonstrates you have actually tried to help yourself.\n\nAccording to Gelfand (2021), what is the key part of “If you need help getting unstuck, the first step is to create a reprex, or reproducible example. The goal of a reprex is to package your problematic code in such a way that other people can run it and feel your pain. 
Then, hopefully, they can provide a solution and put you out of your misery.” (pick one)?\n\npackage your problematic code\nother people can run it and feel your pain\nthe first step is to create a reprex\nthey can provide a solution and put you out of your misery\n\nFrom Gelfand (2021), why is creating a reproducible example important when seeking help (pick one)?\n\nIt reduces the need for documentation.\nIt showcases your coding skills.\nIt allows others to replicate the issue and provide solutions.\nIt complies with software licensing.\n\nWhich practice enhances the efficiency of code when collaborating with others (pick one)?\n\nUsing absolute file paths.\nWriting clear comments and documentation.\nMinimizing the use of functions.\nObfuscating code to protect intellectual property.\n\nWhich of the following is an advantage of using Git for version control (select all that apply)?\n\nTracking changes over time.\nFacilitating collaboration among multiple users.\nAutomating data backups.\nEnhancing code execution speed.\n\nWhy do we avoid using setwd() in R scripts (pick one)?\n\nIt can slow down code execution.\nIt requires administrative privileges.\nIt makes code less portable and reproducible.\nIt is deprecated in recent R versions.\n\nIn the context of reproducibility, what is the function of the renv package (pick one)?\n\nTo run code chunks in parallel.\nTo document and share the software environment.\nTo automate code linting.\nTo improve code efficiency in simulations.\n\nWhich of the following does NOT contribute to a reproducible workflow (pick one)?\n\nUsing setwd() to set the working directory.\nSharing code and data as well as results.\nUsing Quarto for integrating R and Python code in papers.\nUsing version control with Git and GitHub.\n\nFrom Wickham (2021), which of the following variable names follows the recommended style (pick one)?\n\ntotal-Sales\nTotalSales\ntotal_sales\ntotal sales\n\nWhat is the primary function of the lintr package in R 
(pick one)?\n\nTo execute code in parallel.\nTo install package dependencies.\nTo provide code linting for stylistic consistency.\nTo visualize data distributions.\n\nWhat is code refactoring (pick one)?\n\nDebugging code to fix errors.\nRewriting code to improve its structure without changing its behavior.\nAdding new features to existing code.\nConverting code from one language to another.\n\nWhy should “magic numbers” be avoided in code (pick one)?\n\nThey slow down execution.\nThey reduce code readability and maintainability.\nThey are incompatible with certain software.\nThey cause syntax errors.\n\nWhat is the main purpose of using a linter when writing code (pick one)?\n\nTo find logical errors in algorithms.\nTo execute code faster.\nTo enforce coding style guidelines.\nTo compile code into machine language.\n\nIn the context of reproducibility, what does “future-you” refer to (pick one)?\n\nAutomated code generation.\nYour ability to understand and reuse your code at a later time.\nPredictive analytics in your code.\nCollaborative work with future colleagues.\n\n\n\n\nActivity\nThe purpose of this activity is to give and review peer review. Peer review generally, and code review specifically (Sadowski et al. 2018), is an important part of working as a professional.\nPlease update your work from the Activity in Chapter 2, to use the starter folder. This would involve, amongst other things, moving the downloading and cleaning to appropriate scripts, updating the README, adding a title, etc. Generally, you should look at the rubric for the Donaldson Paper in Online Appendix E and quickly try to comply with as much as possible, without doing too much extra work. Then exchange it with someone else.\nRead Google (2022) and Feldman (2024). Then using GitHub Issues please conduct peer review of the repo content. 
Following Feldman (2024), the peer review should use the following structure, and be nicely formatted:\n\nSummary\n[Add a brief summary of the manuscript you are reviewing.]\nStrong positive points:\n[Please keep this brief. Two or three dot points.]\nCritical improvements needed:\n[This is the most important section. These are issues that the authors of the paper must fix and/or address. Be very constructive and polite, be gentle but clear, and provide as much information as you can to assist authors. This might include mistakes/errors, missing information, oversights, misunderstandings, etc. If you can, explain why these are mistakes and provide either a correction or a link to where correct information can be found.]\nSuggestions for improvement:\n[This is you trying to help the authors do even better, as nice to have. You can comment on things that you are not sure of, or state an opinion, point out typos, or minor code issues, but be humble about this being a suggestion, and be very positive and constructive. There should be about five/six dot points.]\nEvaluation:\n[Add each element of the rubric and provide a comment and mark against it. This will NOT be used for grading, this is ONLY to provide authors with some idea of how much work they need to put into each element to improve.]\nEstimated overall mark:\n[X] out of [Y].\nAny other comments:\n[Any other comments.]\n\n\n\nPaper\nAt about this point the Donaldson Paper from Online Appendix E would be appropriate.\n\n\n\n\nAlexander, Monica. 2019. “Reproducibility in Demographic Research.” https://www.monicaalexander.com/posts/2019-10-20-reproducibility/.\n\n\n———. 2021. “Overcoming Barriers to Sharing Code.” YouTube, February. https://youtu.be/yvM2C6aZ94k.\n\n\nBackus, John. 1981. “The History of FORTRAN I, II, and III.” In History of Programming Languages, edited by Richard Wexelblat, 25–74. Academic Press.\n\n\nBarba, Lorena. 2018. 
“Terminologies for Reproducible Research.” https://arxiv.org/abs/1802.03311.\n\n\nBegley, Glenn, and Lee Ellis. 2012. “Raise Standards for Preclinical Cancer Research.” Nature 483 (7391): 531–33. https://doi.org/10.1038/483531a.\n\n\nBengtsson, Henrik. 2021. “A Unifying Framework for Parallel and Distributed Processing in R using Futures.” The R Journal 13 (2): 208–27. https://doi.org/10.32614/RJ-2021-048.\n\n\nBowers, Jake, and Maarten Voors. 2016. “How to Improve Your Relationship with Your Future Self.” Revista de Ciencia Polı́tica 36 (3): 829–48. https://doi.org/10.4067/S0718-090X2016000300011.\n\n\nBrown, Zack. 2018. “A Git Origin Story.” Linux Journal, July. https://www.linuxjournal.com/content/git-origin-story.\n\n\nBryan, Jenny. 2018a. “Excuse Me, Do You Have a Moment to Talk about Version Control?” The American Statistician 72 (1): 20–27. https://doi.org/10.1080/00031305.2017.1399928.\n\n\n———. 2018b. “Code Smells and Feels.” YouTube, July. https://youtu.be/7oyiPBjLAWY.\n\n\n———. 2020. Happy Git and GitHub for the useR. https://happygitwithr.com.\n\n\nBryan, Jenny, and Jim Hester. 2020. What They Forgot to Teach You About R. https://rstats.wtf/index.html.\n\n\nBryan, Jenny, Jim Hester, David Robinson, Hadley Wickham, and Christophe Dervieux. 2022. reprex: Prepare Reproducible Example Code via the Clipboard. https://CRAN.R-project.org/package=reprex.\n\n\nBuckheit, Jonathan, and David Donoho. 1995. “Wavelab and Reproducible Research.” In Wavelets and Statistics, 55–81. Springer. https://doi.org/10.1007/978-1-4612-2544-7_5.\n\n\nBurton, Jason, Nicole Cruz, and Ulrike Hahn. 2021. “Reconsidering Evidence of Moral Contagion in Online Social Networks.” Nature Human Behaviour 5 (12): 1629–35. https://doi.org/10.1038/s41562-021-01133-5.\n\n\nChawla, Dalmeet Singh. 2020. “Critiqued Coronavirus Simulation Gets Thumbs up from Code-Checking Efforts.” Nature 582: 323–24. 
https://doi.org/10.1038/d41586-020-01685-y.\n\n\nChouldechova, Alexandra, Diana Benavides-Prado, Oleksandr Fialko, and Rhema Vaithianathan. 2018. “A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions.” In Proceedings of the 1st Conference on Fairness, Accountability and Transparency, edited by Sorelle Friedler and Christo Wilson, 81:134–48. Proceedings of Machine Learning Research. https://proceedings.mlr.press/v81/chouldechova18a.html.\n\n\nCohen, Jason, Steven Teleki, and Eric Brown. 2006. Best Kept Secrets of Peer Code Review. Smart Bear Incorporated.\n\n\nCsárdi, Gábor. 2022. gitcreds: Query “git” Credentials from “R”. https://CRAN.R-project.org/package=gitcreds.\n\n\nDolatsara, Hamidreza Ahady, Ying-Ju Chen, Robert Leonard, Fadel Megahed, and Allison Jones-Farmer. 2021. “Explaining Predictive Model Performance: An Experimental Study of Data Preparation and Model Choice.” Big Data, October. https://doi.org/10.1089/big.2021.0067.\n\n\nEghbal, Nadia. 2020. Working in Public: The Making and Maintenance of Open Source Software. California: Stripe Press.\n\n\nFeldman, Gilad. 2024. RRR Assessment Peer Review. https://mgto.org/rrrassessmentreviewtemplate.\n\n\nFowler, Martin, and Kent Beck. 2018. Refactoring: Improving the Design of Existing Code. 2nd ed. New York: Addison-Wesley Professional.\n\n\nGelfand, Sharla. 2021. “Make a ReprEx... Please.” YouTube, February. https://youtu.be/G5Nm-GpmrLw.\n\n\nGelman, Andrew. 2016. “What has happened down here is the winds have changed,” September. https://statmodeling.stat.columbia.edu/2016/09/21/what-has-happened-down-here-is-the-winds-have-changed/.\n\n\nGelman, Andrew, Greggor Mattson, and Daniel Simpson. 2018. “Gaydar and the Fallacy of Decontextualized Measurement.” Sociological Science 5 (12): 270–80. https://doi.org/10.15195/v5.a12.\n\n\nGoogle. 2022. “What to Look for in a Code Review.” Google Engineering Practices Documentation. 
https://google.github.io/eng-practices/review/reviewer/looking-for.html.\n\n\nHeil, Benjamin, Michael Hoffman, Florian Markowetz, Su-In Lee, Casey Greene, and Stephanie Hicks. 2021. “Reproducibility Standards for Machine Learning in the Life Sciences.” Nature Methods 18 (10): 1132–35. https://doi.org/10.1038/s41592-021-01256-7.\n\n\nHerndon, Thomas, Michael Ash, and Robert Pollin. 2014. “Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff.” Cambridge Journal of Economics 38 (2): 257–79. https://doi.org/10.1093/cje/bet075.\n\n\nHester, Jim, Florent Angly, Russ Hyde, Michael Chirico, Kun Ren, Alexander Rosenstock, and Indrajeet Patil. 2022. lintr: A “Linter” for R Code. https://CRAN.R-project.org/package=lintr.\n\n\nHillel, Wayne. 2017. How Do We Trust Our Science Code? https://www.hillelwayne.com/how-do-we-trust-science-code/.\n\n\nIrving, Damien, Kate Hertweck, Luke Johnston, Joel Ostblom, Charlotte Wickham, and Greg Wilson. 2021. Research Software Engineering with Python. Chapman and Hall/CRC.\n\n\nIsaacson, Walter. 2011. Steve Jobs. 1st ed. Simon & Schuster.\n\n\nKleiber, Christian, and Achim Zeileis. 2008. Applied Econometrics with R. New York: Springer-Verlag. https://CRAN.R-project.org/package=AER.\n\n\nKnuth, Donald. 1984. “Literate Programming.” The Computer Journal 27 (2): 97–111. https://doi.org/10.1093/comjnl/27.2.97.\n\n\nLandau, William Michael. 2021. “The targets R Package: A Dynamic Make-Like Function-Oriented Pipeline Toolkit for Reproducibility and High-Performance Computing.” Journal of Open Source Software 6 (57): 2959. https://doi.org/10.21105/joss.02959.\n\n\nMaier, Maximilian, František Bartoš, Tom Stanley, David Shanks, Adam Harris, and Eric-Jan Wagenmakers. 2022. “No Evidence for Nudging After Adjusting for Publication Bias.” Proceedings of the National Academy of Sciences 119 (31): e2200300119. https://doi.org/10.1073/pnas.2200300119.\n\n\nMatsumoto, Yukihiro. 2007. 
“Treating Code as an Essay.” In Beautiful Code, edited by Andy Oram and Greg Wilson, 477–81. O’Reilly.\n\n\nMattson, Greggor. 2017. “Artificial Intelligence Discovers Gayface. Sigh.” https://greggormattson.com/2017/09/09/artificial-intelligence-discovers-gayface/amp/.\n\n\nMerali, Zeeya. 2010. “Computational Science:... Error.” Nature 467 (7317): 775–77. https://doi.org/10.1038/467775a.\n\n\nMineault, Patrick, and The Good Research Code Handbook Community. 2021. “The Good Research Code Handbook.” https://doi.org/10.5281/zenodo.5796873.\n\n\nMinsky, Yaron. 2011. “OCaml for the masses.” Communications of the ACM 54 (11): 53–58. https://doi.org/10.1145/2018396.2018413.\n\n\n———. 2015. “Automated Trading and OCaml with Yaron Minsky.” Hackers — Software Engineering Daily, November. https://softwareengineeringdaily.com/2015/11/09/automated-trading-and-ocaml-with-yaron-minsky/.\n\n\nMiyakawa, Tsuyoshi. 2020. “No Raw Data, No Science: Another Possible Source of the Reproducibility Crisis.” Molecular Brain 13 (1): 1–6. https://doi.org/10.1186/s13041-020-0552-2.\n\n\nMullard, Asher. 2021. “Half of Top Cancer Studies Fail High-Profile Reproducibility Effort.” Nature 600 (7889): 368--369. https://doi.org/10.1038/d41586-021-03691-0.\n\n\nMüller, Kirill, and Lorenz Walthert. 2022. styler: Non-Invasive Pretty Printing of R Code. https://CRAN.R-project.org/package=styler.\n\n\nMurphy, Heather. 2017. “Why Stanford Researchers Tried to Create a ‘Gaydar’ Machine.” The New York Times, October. https://www.nytimes.com/2017/10/09/science/stanford-sexual-orientation-study.html.\n\n\nNational Academies of Sciences, Engineering, and Medicine. 2019. Reproducibility and Replicability in Science. 1st ed. National Academies Press. https://doi.org/10.17226/25303.\n\n\nPerkel, Jeffrey. 2023. “The Sleight-of-Hand Trick That Can Simplify Scientific Computing.” Nature 617 (7959): 212--213. 
https://doi.org/10.1038/d41586-023-01469-0.\n\n\nPineau, Joelle, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivière, Alina Beygelzimer, Florence d’Alché-Buc, Emily Fox, and Hugo Larochelle. 2021. “Improving Reproducibility in Machine Learning Research (a Report from the NeurIPS 2019 Reproducibility Program).” Journal of Machine Learning Research 22 (164): 1–20. http://jmlr.org/papers/v22/20-303.html.\n\n\nR Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.\n\n\nRilke, Rainer Maria. (1929) 2014. Letters to a Young Poet. Penguin Classics.\n\n\nRomer, Paul. 2018. “Jupyter, Mathematica, and the Future of the Research Paper,” April. https://paulromer.net/jupyter-mathematica-and-the-future-of-the-research-paper/.\n\n\nSadowski, Caitlin, Emma Söderberg, Luke Church, Michal Sipko, and Alberto Bacchelli. 2018. “Modern Code Review: A Case Study at Google.” In Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice, 181–90. ICSE-SEIP ’18. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3183519.3183525.\n\n\nSilver, Nate. 2020. “We Fixed an Issue with How Our Primary Forecast Was Calculating Candidates’ Demographic Strengths.” FiveThirtyEight, February. https://fivethirtyeight.com/features/we-fixed-a-mistake-in-how-our-primary-forecast-was-calculating-candidates-demographic-strengths/.\n\n\nSimpkinson, Scott. 1971. “Testing to Ensure Mission Success.” In What Made Apollo a Success, edited by NASA, 21–29.\n\n\nSomers, James. 2018. “The Scientific Paper Is Obsolete.” The Atlantic, April. https://www.theatlantic.com/science/archive/2018/04/the-scientific-paper-is-obsolete/556676/.\n\n\nSprint, Gina, and Jason Conci. 2019. 
“Mining GitHub Classroom Commit Behavior in Elective and Introductory Computer Science Courses.” Journal of Computing Sciences in Colleges 35 (1): 76–84.\n\n\nSunstein, Cass, and Lucia Reisch. 2017. The Economics of Nudge. Routledge.\n\n\nSzaszi, Barnabas, Anthony Higney, Aaron Charlton, Andrew Gelman, Ignazio Ziano, Balazs Aczel, Daniel Goldstein, David Yeager, and Elizabeth Tipton. 2022. “No Reason to Expect Large and Consistent Effects of Nudge Interventions.” Proceedings of the National Academy of Sciences 119 (31): e2200732119. https://doi.org/10.1073/pnas.2200732119.\n\n\nTrisovic, Ana, Matthew Lau, Thomas Pasquier, and Mercè Crosas. 2022. “A Large-Scale Study on Research Code Quality and Execution.” Scientific Data 9 (1). https://doi.org/10.1038/s41597-022-01143-6.\n\n\nUshey, Kevin. 2022. renv: Project Environments. https://CRAN.R-project.org/package=renv.\n\n\nVidoni, Melina. 2021. “Evaluating Unit Testing Practices in R Packages.” In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), 1523–34. https://doi.org/10.1109/ICSE43902.2021.00136.\n\n\nWang, Yilun, and Michal Kosinski. 2018. “Deep Neural Networks Are More Accurate Than Humans at Detecting Sexual Orientation from Facial Images.” Journal of Personality and Social Psychology 114 (2): 246–57. https://doi.org/10.1037/pspa0000098.\n\n\nWickham, Hadley. 2021. The Tidyverse Style Guide. https://style.tidyverse.org/index.html.\n\n\nWickham, Hadley, Mara Averick, Jenny Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.\n\n\nWickham, Hadley, Jennifer Bryan, and Malcolm Barrett. 2022. usethis: Automate Package and Project Setup. https://CRAN.R-project.org/package=usethis.\n\n\nWilson, Greg, Jenny Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, and Tracy Teal. 2017. 
“Good Enough Practices in Scientific Computing.” PLOS Computational Biology 13 (6): 1–20. https://doi.org/10.1371/journal.pcbi.1005510.\n\n\nXie, Yihui. 2019. “TinyTeX: A lightweight, cross-platform, and easy-to-maintain LaTeX distribution based on TeX Live.” TUGboat, no. 1: 30–32. https://tug.org/TUGboat/Contents/contents40-1.html.\n\n\n———. 2023. knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.",
"crumbs": [
"Foundations",
"
3 Reproducible workflows"
@@ -566,7 +566,7 @@
"href": "06-farm.html#exercises",
"title": "6 Measurement, censuses, and sampling",
"section": "6.5 Exercises",
- "text": "6.5 Exercises\n\nPractice\n\n(Plan) Consider the following scenario: Every day for a year two people—Mark and Lauren—record the amount of snow that fell that day in the two different states they are from. Please sketch what a dataset could look like, and then sketch a graph that you could build to show all observations.\n(Simulate) Please further consider the scenario described and simulate the situation with every variable independent of each other. Then write five tests based on the simulated data.\n(Acquire) Please obtain some actual data about snowfall and add a script updating the simulated tests to these actual data.\n(Explore) Build a graph and table using the real data.\n(Communicate) Please write some text to accompany the graph and table. Separate the code appropriately into R files and a Quarto doc. Submit a link to a high-quality GitHub repo.\n\n\n\nQuiz\n\nWhat is a challenge when converting some phenomena in the world into a dataset (pick one)?\n\nThe high cost of data storage solutions.\nThe overabundance of unbiased data.\nThe lack of available data collection tools.\nDeciding what to measure and how to measure it appropriately.\n\nWith reference to Daston (2000), please discuss whether GDP and counts of population are invented or discovered?\nAccording to metrology, which of the following best defines measurement (pick one)?\n\nThe estimation of unknown variables using predictive models.\nThe calculation of statistical significance in data analysis.\nThe act of assigning numbers to objects arbitrarily.\nThe process of experimentally obtaining quantity values that can be attributed to a property of a phenomenon, body, or substance.\n\nIn at least two paragraphs, and using your own words, please define measurement error and provide an example from your own experience.\nWith reference to Gargiulo (2022), please discuss challenges of measurement in the real world.\nWhat does the validity of a measurement refer to (pick one)?\n\nThe 
statistical significance of the measurement results.\nThe degree to which a measurement accurately reflects the concept it is intended to measure.\nThe speed at which data can be collected.\nThe precision with which a measurement can be replicated.\n\nHow do Kennedy et al. (2022) define ethics (pick one)?\n\nRespecting the perspectives and dignity of individual survey respondents.\nGenerating estimates of the general population and for subpopulations of interest.\nUsing more complicated procedures only when they serve some useful function.\n\nWhich of the following best describes measurement error (pick one)?\n\nThe difference between the observed value and the true value of what is being measured.\nAn error that occurs only when data is not normally distributed.\nA type of error that can be eliminated with better instruments.\nA deliberate alteration of data to mislead analysis.\n\nWhat is censored data (pick one)?\n\nData that have been corrupted and are unreadable.\nData that have been intentionally omitted due to privacy concerns.\nData where the value of an observation is only partially known.\nData collected from unauthorized sources.\n\nHow do truncated data differ from censored data (pick one)?\n\nTruncated data deal with overestimation, censored data with underestimation.\nIn truncated data, certain values are omitted from the dataset, whereas in censored data, the values are partially known but incomplete.\nTruncated data are less accurate than censored data.\n\nWhat does missing completely at random (MCAR) mean (pick one)?\n\nMissing data that can be easily predicted using statistical models.\nMissing data where the likelihood of being missing is related to the unobserved data.\nMissing data where the likelihood of being missing is related to the observed data.\nMissing data where the missingness is entirely random and unrelated to any data, observed or unobserved.\n\nWhy are censuses considered crucial datasets (pick one)?\n\nThey are conducted 
infrequently and thus have novel information.\nThey aim to collect data on every unit in a population, providing comprehensive datasets designed for analysis.\nThey are private datasets that offer exclusive insights.\nThey focus exclusively on agricultural data, which is vital for economic planning.\n\nFrom Statistics Canada (2023), why is the quality of census data evaluated (pick one)?\n\nTo reduce the cost of future censuses.\nTo limit data dissemination.\nTo improve data privacy.\nTo ensure census data are reliable and meet user needs.\n\nFrom Statistics Canada (2023), what is the primary source of sampling error (pick one)?\n\nNon-response from individuals.\nThe use of a sample, rather than a population.\nData capture errors during processing.\nMisclassification of dwellings.\n\nFrom Statistics Canada (2023), which error occurs when people or dwellings are omitted or counted more than once (pick one)?\n\nNon-response error.\nCoverage error.\nSampling error.\nProcessing error.\n\nFrom Statistics Canada (2023), which type of error is related to misunderstanding or misreporting by respondents or enumerators (pick one)?\n\nSampling error.\nResponse error.\nProcessing error.\nCoverage error.\n\nFrom Statistics Canada (2023), what is the purpose of the Dwelling Classification Survey (DCS) (pick one)?\n\nTo improve the long-form questionnaire.\nTo classify new housing developments.\nTo collect data on household income.\nTo study classification errors in dwellings.\n\nFrom Statistics Canada (2023), what is the Census Undercoverage Study (CUS) designed to estimate (pick one)?\n\nThe number of people omitted from the census.\nThe variance of sampling errors.\nImputation rates for missing data.\nNon-response rates for long-form questionnaires.\n\nFrom Statistics Canada (2023), what does the Census Overcoverage Study (COS) identify (pick one)?\n\nDwellings that were misclassified.\nCases where individuals were counted more than once.\nPeople who were missed by the 
census.\nInvalid data entries during processing.\n\nFrom Statistics Canada (2023), how is the Total Non-Response (TNR) rate for the census defined (pick one)?\n\nThe number of imputed values in the census data.\nThe percentage of incorrect responses in the census.\nThe percentage of questionnaires with partial responses.\nThe proportion of dwellings where the questionnaire does not meet the minimum content.\n\nWith reference to W. Chen et al. (2019) and Martínez (2022), to what extent do you think we can trust government statistics? Please write at least a page and compare at least two governments in your answer.\nThe 2021 census in Canada asked, firstly, “What was this person’s sex at birth? Sex refers to sex assigned at birth. Male/Female”, and then “What is this person’s gender? Refers to current gender which may be different from sex assigned at birth and may be different from what is indicated on legal documents. Male/Female/Or please specify this person’s gender (space for a typed or handwritten answer)”. With reference to Statistics Canada (2020), please discuss the extent to which you think this is an appropriate way for the census to have proceeded. You are welcome to discuss the case of a different country if you are more familiar with that.\nPlease use IPUMS to access the 2020 ACS. Making use of the codebook, how many respondents were there in California (STATEICP) that had a Doctoral degree as their highest educational attainment (EDUC) (pick one)?\n\n2,007\n732\n5,765\n4,684\n\nPlease use IPUMS to access the 1940 1% sample. 
Making use of the codebook, how many respondents were there in California (STATEICP) with 5+ years of college as their highest educational attainment (EDUC) (pick one)?\n\n532\n1,056\n904\n1,789\n\nWith reference to Dean (2022), please discuss the difference between probability and non-probability sampling.\nWhat is a target population (pick one)?\n\nThe entire group about which we want to draw conclusions.\nThe list of all units in the target population from which a sample can be drawn.\nA subset of the population that is easily accessible for sampling.\nThe list of individuals who have agreed to participate in the study.\n\nWhat is a sampling frame (pick one)?\n\nThe method used to collect data from respondents.\nThe list of all units in the target population from which a sample can be drawn.\nThe timeframe during which data collection occurs.\nThe entire group about which the researcher wants to draw conclusions.\n\nWhat is a difference between probability and non-probability sampling (pick one)?\n\nProbability sampling does not require a sampling frame.\nProbability sampling is always more cost-effective than non-probability sampling.\nIn probability sampling every unit has a known chance of being selected, whereas in non-probability sampling selection is not based on probabilities.\nNon-probability sampling methods are more accurate than probability sampling methods.\n\nWith reference to Beaumont (2020), do you think that probability surveys will disappear, and why or why not (please write a paragraph or two)?\nWhich sampling method involves selecting units such that every observation in the sampling frame has an equal chance of being chosen (pick one)?\n\nSystematic sampling.\nStratified sampling.\nSimple random sampling.\nCluster sampling.\n\nIn which sampling method is the first unit selected at random, and subsequent units selected at regular intervals (pick one)?\n\nSystematic sampling.\nConvenience sampling.\nStratified sampling.\nCluster 
sampling.\n\nWhat characterizes stratified sampling (pick one)?\n\nThe population is divided into subgroups, and a random sample is taken from each subgroup.\nParticipants recruit other participants through their networks.\nEntire clusters are randomly selected, and all units within them are sampled.\nBy randomly selecting entire groups or clusters, then sampling all or some units within them.\n\nHow are units selected in cluster sampling (pick one)?\n\nThe population is divided into subgroups, and a random sample is taken from each subgroup.\nBy randomly selecting entire groups or clusters, then sampling all or some units within them.\nBy selecting every nth unit from a list.\nBy choosing units based on specific quotas.\n\nPlease name some reasons why you may wish to use cluster sampling (select all that apply)?\n\nBalance in responses in terms of sub-populations.\nAdministrative convenience.\nEfficiency in terms of money.\n\nPlease consider the integers [1:100]. If I were interested in implementing a sampling approach, based on a sample of only 10, to estimate the median, which approach would I choose (pick one)?\n\nSimple random sampling.\nSystematic sampling.\nCluster sampling.\nStratified sampling.\n\nWrite R code that considers the numbers 1 to 100, and estimates the mean, based on a cluster sample of 20 numbers. Re-run this code one hundred times, noting the estimate of the mean each time, and then plot the histogram. What do you notice about the graph? 
Add a paragraph of explanation and discussion.\nFrom Bowley (1913), how was the sample for the study of Reading’s households selected (pick one)?\n\nBy random selection of streets.\nBy selecting households based on income.\nBy marking one building in ten from the local directory.\nBy interviewing every fifth household.\n\nWhat is the sampling approach used by Bowley (1913) (pick one)?\n\nCluster sampling.\nSimple random sampling.\nStratified sampling.\nSystematic sampling.\n\nFrom Bowley (1913), which method was employed to provide estimates for the whole of Reading, based on the sample data (pick one)?\n\nUsing proportional census data.\nCalculating median income levels.\nApplying a multiplier.\n\nFrom Bowley (1913), what was the method of collecting rent and earnings data from the working-class households in Reading (pick one)?\n\nBy interviewing landlords.\nBy inspecting census records.\nBy collecting data from tax records.\nThrough volunteer interviews with households.\n\nPlease discuss the following statement from Bowley (1913, 673) “It may appear to persons who are not familiar with processes of sampling that a proportion of i in 21 is too small for any conclusion, and that in any case not more than a vague probability can be obtained. 
… [but] the precision of a sample depends not on its proportion to the whole, but on its own magnitude, if the conditions of random sampling are secured, as it is believed they have been in this inquiry.”\nFrom Neyman (1934), what is the primary goal of stratified sampling (pick one)?\n\nTo reduce the need for randomization.\nTo ensure all strata of the population are represented.\nTo increase bias in sample selection.\n\nWhat was the main focus of Neyman (1934) (pick one)?\n\nIntroduction of the simple random sample.\nDevelopment of quota sampling techniques.\nElimination of all biases in sample selection.\nThe distinction between stratified sampling and non-probability sampling.\n\nFrom Neyman (1934), what is one advantage of stratified sampling over simple random sampling (pick one)?\n\nIt eliminates the need for a sampling frame.\nIt requires fewer samples.\nIt is cheaper.\nIt may increase the precision of estimates.\n\nFrom Neyman (1934, sec. V), which approach allows a consistent estimate of the average collective character of a population, whatever the properties of the population (pick one)?\n\nRandom sampling.\nPurposive sampling.\n\nWhich of the following best describes convenience sampling (pick one)?\n\nSampling that ensures every subgroup is represented proportionally.\nSelecting participants based on random numbers.\nChoosing participants who are easiest to access.\nUsing algorithms to select a sample.\n\nWhat is snowball sampling commonly used for (pick one)?\n\nStudying well-defined populations with comprehensive sampling frames.\nResearching hidden or hard-to-reach populations by having existing participants recruit future participants.\nEnsuring equal representation across different demographic groups.\nEstimating the total population size using capture-recapture methods.\n\nWhat is respondent-driven sampling (pick one)?\n\nA sampling technique that uses automated systems to select respondents.\nA form of non-probability sampling where respondents 
refer other respondents, often used for hidden populations, and includes incentives for recruitment.\nA type of random sampling where respondents are selected based on a probability mechanism.\nA sampling method that relies on respondents volunteering without any recruitment efforts.\n\nPretend that we have conducted a survey of everyone in Canada, where we asked for age, sex, and gender. Your friend claims that there is no need to worry about uncertainty “because we have the whole population”. Is your friend right or wrong, and why?\nWith reference to Meng (2018), please discuss the claim: “When you have one million responses, you do not need to worry about randomization”.\nImagine you take a job at a bank and they already have a dataset for you to use. What are some questions that you should explore when deciding whether that data will be useful to you?\n\n\n\nActivity I\nThe purpose of this activity is to develop comfort with:\n\ndealing with larger datasets,\nunderstand ratio estimators, and\nsampling.\n\nPlease use IPUMS to access the 2022 ACS. Making use of the codebook, how many respondents were there in each state (STATEICP) that had a doctoral degree as their highest educational attainment (EDUC)? (Hint: Make this a column in a tibble.)\nIf there were 391,171 respondents in California (STATEICP) across all levels of education, then can you please use the ratio estimators approach of Laplace to estimate the total number of respondents in each state i.e. take the ratio that you worked out for California and apply it to the rest of the states. (Hint: You can now work out the ratio between the number of respondents with doctoral degrees in a state and number of respondents in a state and then apply that ratio to your column of the number of respondents with a doctoral degree in each state.) Compare it to the actual number of respondents in each state.\nWrite a short (2ish pages + appendices + references) paper using Quarto. 
Submit a link to a GitHub repo (one repo per group) that meets the general expectations of the course. Do not forget to cite the data (but don’t upload the raw data to GitHub). Your paper should cover at least:\n\nInstructions on how to obtain the data (in the appendix).\nA brief overview of the ratio estimators approach.\nYour estimates and the actual number of respondents.\nSome explanation of why you think they are different i.e. the strengths and weaknesses of using ratio estimators.\nA discussion of the strengths and weaknesses of the sampling approaches used by the ACS. (Hint: One fun thing could be to find the actual population of one state, then given the number of respondents in that and every state, use the ratio estimators approach to estimate the population of every state and then compare it with the actual populations.)\n\nNote:\n\nIf you have issues opening the zipped file, open a terminal, navigate to the folder then use: gunzip usa_00004.csv.gz (change the filename to whatever yours is).\nTo focus on doctoral degrees you need EDUCD rather than EDUC (but you can select that one once you have downloaded the data.)\nDo not upload the raw data to GitHub - it is too big and also IPUMS asks that you do not.\n\n\n\nActivity II\nThe purpose of this activity is to:\n\ncompare simple random with cluster sampling.\nconsider the difference between probability and non-probability sampling.\n\nA biobank contains de-identified biomedical data. For instance, UK Biobank contains samples from half a million UK participants. One use is to look at a respondent’s whole genome sequencing data and then relate it to various conditions they have to better understand the extent of genetic determination.\nPlease simulate the UK Biobank dataset. A whole genome would be a sequence of the letters A, T, C, and G, of a length of around 3 billion. For each individual, simulate only a length of 12 in total, made up of four groups of three letters. 
Then consider four conditions: colon cancer, cystic fibrosis, Parkinson’s disease, and skin cancer. Create some probabilistic association between different three letter combinations in your simulated genome sequences and whether a person has that condition. For instance, perhaps having AAA in the first three positions is associated with a five percent higher rate of colon cancer. Pretend that UK Biobank uses simple random sampling.\nDavies et al. (2024) argue that UK Biobank should use family-based sampling. That is, they would cluster at the level of families. We would expect that families would have similar (but not necessarily the same) sequences. Please redo your simulation to assume families of sizes one to five.\nNow analyze your simulations and compare your ability to find relationships between three letter positions and various conditions. Discuss this in relation to the difference between simple random sampling and cluster sampling.\nThe other aspect to note is that sampling at the level of the family may be easier for the collectors, because if they are collecting data about one member of the family then it may be slightly more convenient to collect data about another member of that family. Discuss the difference between probability and non-probability sampling and the nuances of the distinction.\nWrite a short (2ish pages + appendices + references) paper using Quarto. Submit a link to a GitHub repo (one repo per group) that meets the general expectations of the course.\n\n\n\n\nAchen, Christopher. 1978. “Measuring Representation.” American Journal of Political Science 22 (3): 475–510. https://doi.org/10.2307/2110458.\n\n\nAnderson, Margo. (1988) 2015. The American Census: A Social History. 2nd ed. Yale University Press.\n\n\nAnderson, Margo, and Stephen Fienberg. 1999. Who Counts?: The Politics of Census-Taking in Contemporary America. Russell Sage Foundation. http://www.jstor.org/stable/10.7758/9781610440059.\n\n\nAu, Randy. 2022. 
“Celebrating Everyone Counting Things,” February. https://counting.substack.com/p/celebrating-everyone-counting-things.\n\n\nBaker, Reg, Michael Brick, Nancy Bates, Mike Battaglia, Mick Couper, Jill Dever, Krista Gile, and Roger Tourangeau. 2013. “Summary Report of the AAPOR Task Force on Non-Probability Sampling.” Journal of Survey Statistics and Methodology 1 (2): 90–143. https://doi.org/10.1093/jssam/smt008.\n\n\nBeaumont, Jean-Francois. 2020. “Are Probability Surveys Bound to Disappear for the Production of Official Statistics?” Survey Methodology 46 (1): 1–29.\n\n\nBecker, Richard, Allan Wilks, Ray Brownrigg, Thomas Minka, and Alex Deckmyn. 2022. maps: Draw Geographical Maps. https://CRAN.R-project.org/package=maps.\n\n\nBerdine, Gilbert, Vincent Geloso, and Benjamin Powell. 2018. “Cuban Infant Mortality and Longevity: Health Care or Repression?” Health Policy and Planning 33 (6): 755–57. https://doi.org/10.1093/heapol/czy033.\n\n\nBlair, Graeme, Jasper Cooper, Alexander Coppock, and Macartan Humphreys. 2019. “Declaring and Diagnosing Research Designs.” American Political Science Review 113 (3): 838–59. https://doi.org/10.1017/S0003055419000194.\n\n\nBowen, Claire McKay. 2022. Protecting Your Privacy in a Data-Driven World. 1st ed. Chapman; Hall/CRC. https://doi.org/10.1201/9781003122043.\n\n\nBowley, Arthur Lyon. 1913. “Working-Class Households in Reading.” Journal of the Royal Statistical Society 76 (7): 672–701. https://doi.org/10.2307/2339708.\n\n\nBreiman, Leo. 1994. “The 1991 Census Adjustment: Undercount or Bad Data?” Statistical Science 9 (4). https://doi.org/10.1214/ss/1177010259.\n\n\nBrewer, Ken. 2013. “Three Controversies in the History of Survey Sampling.” Survey Methodology 39 (2): 249–63.\n\n\nCardoso, Tom. 2020. “Bias behind bars: A Globe investigation finds a prison system stacked against Black and Indigenous inmates.” The Globe and Mail, October. 
https://www.theglobeandmail.com/canada/article-investigation-racial-bias-in-canadian-prison-risk-assessments/.\n\n\nChambliss, Daniel. 1989. “The Mundanity of Excellence: An Ethnographic Report on Stratification and Olympic Swimmers.” Sociological Theory 7 (1): 70–86. https://doi.org/10.2307/202063.\n\n\nChen, Heng, Marie-Hélène Felt, and Christopher Henry. 2018. “2017 Methods-of-Payment Survey: Sample Calibration and Variance Estimation.” Bank of Canada. https://doi.org/10.34989/tr-114.\n\n\nChen, Wei, Xilu Chen, Chang-Tai Hsieh, and Zheng Song. 2019. “A Forensic Examination of China’s National Accounts.” Brookings Papers on Economic Activity, 77–127. https://www.jstor.org/stable/26798817.\n\n\nColombo, Tommaso, Holger Fröning, Pedro Javier García, and Wainer Vandelli. 2016. “Optimizing the Data-Collection Time of a Large-Scale Data-Acquisition System Through a Simulation Framework.” The Journal of Supercomputing 72 (12): 4546–72. https://doi.org/10.1007/s11227-016-1764-1.\n\n\nCrawford, Kate. 2021. Atlas of AI. 1st ed. New Haven: Yale University Press.\n\n\nCrosby, Alfred. 1997. The Measure of Reality: Quantification in Western Europe, 1250-1600. Cambridge: Cambridge University Press.\n\n\nDaston, Lorraine. 2000. “Why Statistics Tend Not Only to Describe the World but to Change It.” London Review of Books 22 (8). https://www.lrb.co.uk/the-paper/v22/n08/lorraine-daston/why-statistics-tend-not-only-to-describe-the-world-but-to-change-it.\n\n\nDavies, Neil M., Gibran Hemani, Jenae M. Neiderhiser, Hilary C. Martin, Melinda C. Mills, Peter M. Visscher, Loïc Yengo, Alexander Strudwick Young, and Matthew C. Keller. 2024. “The Importance of Family-Based Sampling for Biobanks.” Nature 634 (8035): 795–803. https://doi.org/10.1038/s41586-024-07721-5.\n\n\nDavis, Darren. 1997. “Nonrandom Measurement Error and Race of Interviewer Effects Among African Americans.” The Public Opinion Quarterly 61 (1): 183–207. https://doi.org/10.1086/297792.\n\n\nDean, Natalie. 2022. 
“Tracking COVID-19 Infections: Time for Change.” Nature 602 (7896): 185. https://doi.org/10.1038/d41586-022-00336-8.\n\n\nFisher, Ronald. (1925) 1928. Statistical Methods for Research Workers. 2nd ed. London: Oliver; Boyd.\n\n\nFlake, Jessica, and Eiko Fried. 2020. “Measurement Schmeasurement: Questionable Measurement Practices and How to Avoid Them.” Advances in Methods and Practices in Psychological Science 3 (4): 456–65. https://doi.org/10.1177/2515245920952393.\n\n\nFrandell, Ashlee, Mary Feeney, Timothy Johnson, Eric Welch, Lesley Michalegko, and Heyjie Jung. 2021. “The Effects of Electronic Alert Letters for Internet Surveys of Academic Scientists.” Scientometrics 126 (8): 7167–81. https://doi.org/10.1007/s11192-021-04029-3.\n\n\nFried, Eiko, Jessica Flake, and Donald Robinaugh. 2022. “Revisiting the Theoretical and Methodological Foundations of Depression Measurement.” Nature Reviews Psychology 1 (6): 358–68. https://doi.org/10.1038/s44159-022-00050-2.\n\n\nFuller, Mark, and James Mosher. 1987. “Raptor Survey Techniques.” In Raptor Management Techniques Manual, edited by Beth Pendleton, Brian Millsap, Keith Cline, and David Bird, 37–65. National Wildlife Federation. https://www.sandiegocounty.gov/content/dam/sdc/pds/ceqa/JVR/AdminRecord/IncorporatedByReference/Appendices/Appendix-D---Biological-Resources-Report/Fuller%20and%20Mosher%201987.pdf.\n\n\nGarfinkel, Irwin, Lee Rainwater, and Timothy Smeeding. 2006. “A Re-Examination of Welfare States and Inequality in Rich Nations: How in-Kind Transfers and Indirect Taxes Change the Story.” Journal of Policy Analysis and Management 25 (4): 897–919. https://doi.org/10.1002/pam.20213.\n\n\nGargiulo, Maria. 2022. “Statistical Biases, Measurement Challenges, and Recommendations for Studying Patterns of Femicide in Conflict.” Peace Review 34 (2): 163–76. 
https://doi.org/10.1080/10402659.2022.2049002.\n\n\nGazeley, Ursula, Georges Reniers, Hallie Eilerts-Spinelli, Julio Romero Prieto, Momodou Jasseh, Sammy Khagayi, and Veronique Filippi. 2022. “Women’s Risk of Death Beyond 42 Days Post Partum: A Pooled Analysis of Longitudinal Health and Demographic Surveillance System Data in Sub-Saharan Africa.” The Lancet Global Health 10 (11): e1582–89. https://doi.org/10.1016/s2214-109x(22)00339-4.\n\n\nGelman, Andrew, Sharad Goel, Douglas Rivers, and David Rothschild. 2016. “The Mythical Swing Voter.” Quarterly Journal of Political Science 11 (1): 103–30. https://doi.org/10.1561/100.00015031.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and Other Stories. Cambridge University Press. https://avehtari.github.io/ROS-Examples/.\n\n\nGibney, Elizabeth. 2022. “The leap second’s time is up: world votes to stop pausing clocks.” Nature 612 (7938): 18–18. https://doi.org/10.1038/d41586-022-03783-5.\n\n\nGleick, James. 1990. “The Census: Why We Can’t Count.” The New York Times, July. https://www.nytimes.com/1990/07/15/magazine/the-census-why-we-can-t-count.html.\n\n\nGodfrey, Ernest. 1918. “History and Development of Statistics in Canada.” In The History of Statistics–Their Development and Progress in Many Countries. New York: Macmillan, edited by John Koren, 179–98. Macmillan Company of New York.\n\n\nGoodman, Leo. 1961. “Snowball Sampling.” The Annals of Mathematical Statistics 32 (1): 148–70. https://doi.org/10.1214/aoms/1177705148.\n\n\nGroves, Robert, and Lars Lyberg. 2010. “Total Survey Error: Past, Present, and Future.” Public Opinion Quarterly 74 (5): 849–79. https://doi.org/10.1093/poq/nfq065.\n\n\nGutman, Robert. 1958. “Birth and Death Registration in Massachusetts: II. The Inauguration of a Modern System, 1800-1849.” The Milbank Memorial Fund Quarterly 36 (4): 373–402.\n\n\nHandcock, Mark, and Krista Gile. 2011. “Comment: On the Concept of Snowball Sampling.” Sociological Methodology 41 (1): 367–71. 
https://doi.org/10.1111/j.1467-9531.2011.01243.x.\n\n\nHartocollis, Anemona. 2022. “U.S. News Ranked Columbia No. 2, but a Math Professor Has His Doubts.” The New York Times, March. https://www.nytimes.com/2022/03/17/us/columbia-university-rank.html.\n\n\nHawes, Michael. 2020. “Implementing Differential Privacy: Seven Lessons From the 2020 United States Census.” Harvard Data Science Review 2 (2). https://doi.org/10.1162/99608f92.353c6f99.\n\n\nHeckathorn, Douglas. 1997. “Respondent-Driven Sampling: A New Approach to the Study of Hidden Populations.” Social Problems 44 (2): 174–99. https://doi.org/10.2307/3096941.\n\n\nHopper, Nate. 2022. “The Thorny Problem of Keeping the Internet’s Time.” The New Yorker, September. https://www.newyorker.com/tech/annals-of-technology/the-thorny-problem-of-keeping-the-internets-time.\n\n\nHyman, Michael, Luca Sartore, and Linda J Young. 2021. “Capture-Recapture Estimation of Characteristics of U.S. Local Food Farms Using a Web-Scraped List Frame.” Journal of Survey Statistics and Methodology 10 (4): 979–1004. https://doi.org/10.1093/jssam/smab008.\n\n\nInternational Organization Of Legal Metrology. 2007. International Vocabulary of Metrology – Basic and General Concepts and Associated Terms. 3rd ed. https://www.oiml.org/en/files/pdf%5Fv/v002-200-e07.pdf.\n\n\nJones, Arnold. 1953. “Census Records of the Later Roman Empire.” The Journal of Roman Studies 43: 49–64. https://doi.org/10.2307/297781.\n\n\nKalgin, Alexander. 2014. “Implementation of Performance Management in Regional Government in Russia: Evidence of Data Manipulation.” Public Management Review 18 (1): 110–38. https://doi.org/10.1080/14719037.2014.965271.\n\n\nKennedy, Lauren, Katharine Khanna, Daniel Simpson, Andrew Gelman, Yajun Jia, and Julien Teitler. 2022. “He, She, They: Using Sex and Gender in Survey Adjustment.” https://arxiv.org/abs/2009.14401.\n\n\nKoitsalu, Marie, Martin Eklund, Jan Adolfsson, Henrik Grönberg, and Yvonne Brandberg. 2018. 
“Effects of Pre-Notification, Invitation Length, Questionnaire Length and Reminder on Participation Rate: A Quasi-Randomised Controlled Trial.” BMC Medical Research Methodology 18 (3): 1–5. https://doi.org/10.1186/s12874-017-0467-5.\n\n\nLane, Nick. 2015. “The Unseen World: Reflections on Leeuwenhoek (1677) ‘Concerning Little Animals’.” Philosophical Transactions of the Royal Society B: Biological Sciences 370 (1666): 20140344. https://doi.org/10.1098/rstb.2014.0344.\n\n\nLeos-Barajas, Vianey, Theoni Photopoulou, Roland Langrock, Toby Patterson, Yuuki Watanabe, Megan Murgatroyd, and Yannis Papastamatiou. 2016. “Analysis of Animal Accelerometer Data Using Hidden Markov Models.” Methods in Ecology and Evolution 8 (2): 161–73. https://doi.org/10.1111/2041-210x.12657.\n\n\nLevine, Judah, Patrizia Tavella, and Martin Milton. 2022. “Towards a Consensus on a Continuous Coordinated Universal Time.” Metrologia 60 (1): 014001. https://doi.org/10.1088/1681-7575/ac9da5.\n\n\nLips, Hilary. 2020. Sex and Gender: An Introduction. 7th ed. Illinois: Waveland Press.\n\n\nLohr, Sharon. (1999) 2022. Sampling: Design and Analysis. 3rd ed. Chapman; Hall/CRC.\n\n\nLuebke, David Martin, and Sybil Milton. 1994. “Locating the Victim: An Overview of Census-Taking, Tabulation Technology, and Persecution in Nazi Germany.” IEEE Annals of the History of Computing 16 (3): 25–39. https://doi.org/10.1109/MAHC.1994.298418.\n\n\nLumley, Thomas. 2020. “survey: analysis of complex survey samples.” https://cran.r-project.org/web/packages/survey/index.html.\n\n\nMartı́nez, Luis. 2022. “How Much Should We Trust the Dictator’s GDP Growth Estimates?” Journal of Political Economy 130 (10): 2731–69. https://doi.org/10.1086/720458.\n\n\nMeng, Xiao-Li. 2018. “Statistical Paradises and Paradoxes in Big Data (i): Law of Large Populations, Big Data Paradox, and the 2016 US Presidential Election.” The Annals of Applied Statistics 12 (2): 685–726. https://doi.org/10.1214/18-AOAS1161SF.\n\n\nMill, James. 1817. 
The History of British India. 1st ed. https://books.google.ca/books?id=Orw_AAAAcAAJ.\n\n\nMills, David L. 1991. “Internet Time Synchronization: The Network Time Protocol.” IEEE Transactions on Communications 39 (10): 1482–93.\n\n\nMitchell, Alanna. 2022a. “Get Ready for the New, Improved Second.” The New York Times, April. https://www.nytimes.com/2022/04/25/science/time-second-measurement.html.\n\n\n———. 2022b. “Time Has Run Out for the Leap Second.” The New York Times, November. https://www.nytimes.com/2022/11/14/science/time-leap-second.html.\n\n\nMitrovski, Alen, Xiaoyan Yang, and Matthew Wankiewicz. 2020. “Joe Biden Projected to Win Popular Vote in 2020 US Election.” https://github.com/matthewwankiewicz/US_election_forecast.\n\n\nMolanphy, Chris. 2012. “100 & Single: Three Rules to Define the Term ‘One-Hit Wonder’ in 2012.” The Village Voice, September. https://www.villagevoice.com/2012/09/10/100-single-three-rules-to-define-the-term-one-hit-wonder-in-2012/.\n\n\nMorange, Michel. 2016. A History of Biology. New Jersey: Princeton University Press.\n\n\nNewman, Daniel. 2014. “Missing Data: Five Practical Guidelines.” Organizational Research Methods 17 (4): 372–411. https://doi.org/10.1177/1094428114548590.\n\n\nNeyman, Jerzy. 1934. “On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection.” Journal of the Royal Statistical Society 97 (4): 558–625. https://doi.org/10.2307/2342192.\n\n\nNobles, Melissa. 2002. “Racial Categorization and Censuses.” In Census and Identity: The Politics of Race, Ethnicity, and Language in National Censuses, edited by David Kertzer and Dominique Arel, 43–70. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511606045.003.\n\n\nPlant, Anne, and Robert Hanisch. 2020. “Reproducibility in Science: A Metrology Perspective.” Harvard Data Science Review 2 (4). https://doi.org/10.1162/99608f92.eb6ddee4.\n\n\nPrévost, Jean-Guy, and Jean-Pierre Beaud. 
2015. Statistics, Public Debate and the State, 1800–1945: A Social, Political and Intellectual History of Numbers. Routledge.\n\n\nR Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.\n\n\nRegister, Yim. 2020. “Introduction to Sampling and Randomization.” YouTube, November. https://youtu.be/U272FFxG8LE.\n\n\nRockoff, Hugh. 2019. “On the Controversies Behind the Origins of the Federal Economic Statistics.” Journal of Economic Perspectives 33 (1): 147–64. https://doi.org/10.1257/jep.33.1.147.\n\n\nRose, Angela, Rebecca Grais, Denis Coulombier, and Helga Ritter. 2006. “A Comparison of Cluster and Systematic Sampling Methods for Measuring Crude Mortality.” Bulletin of the World Health Organization 84: 290–96. https://doi.org/10.2471/blt.05.029181.\n\n\nRuggles, Steven, Sarah Flood, Sophia Foster, Ronald Goeken, Jose Pacas, Megan Schouweiler, and Matthew Sobek. 2021. “IPUMS USA: Version 11.0.” Minneapolis, MN: IPUMS. https://doi.org/10.18128/d010.v11.0.\n\n\nSakshaug, Joseph, Ting Yan, and Roger Tourangeau. 2010. “Nonresponse Error, Measurement Error, and Mode of Data Collection: Tradeoffs in a Multi-Mode Survey of Sensitive and Non-Sensitive Items.” Public Opinion Quarterly 74 (5): 907–33. https://doi.org/10.1093/poq/nfq057.\n\n\nSalganik, Matthew, and Douglas Heckathorn. 2004. “Sampling and Estimation in Hidden Populations Using Respondent-Driven Sampling.” Sociological Methodology 34 (1): 193–240. https://doi.org/10.1111/j.0081-1750.2004.00152.x.\n\n\nScott, James. 1998. Seeing Like a State. Yale University Press.\n\n\nSobek, Matthew, and Steven Ruggles. 1999. “The IPUMS Project: An Update.” Historical Methods: A Journal of Quantitative and Interdisciplinary History 32 (3): 102–10. https://doi.org/10.1080/01615449909598930.\n\n\nSomers, James. 2017. “Torching the Modern-Day Library of Alexandria.” The Atlantic, April. 
https://www.theatlantic.com/technology/archive/2017/04/the-tragedy-of-google-books/523320/.\n\n\nStatistics Canada. 2020. “Sex at Birth and Gender: Technical Report on Changes for the 2021 Census.” Statistics Canada. https://www12.statcan.gc.ca/census-recensement/2021/ref/98-20-0002/982000022020002-eng.pdf.\n\n\n———. 2023. “Guide to the Census of Population, 2021.” Statistics Canada. https://www12.statcan.gc.ca/census-recensement/2021/ref/98-304/98-304-x2021001-eng.pdf.\n\n\nSteckel, Richard. 1991. “The Quality of Census Data for Historical Inquiry: A Research Agenda.” Social Science History 15 (4): 579–99. https://doi.org/10.2307/1171470.\n\n\nStigler, Stephen. 1986. The History of Statistics. Massachusetts: Belknap Harvard.\n\n\nStoler, Ann Laura. 2002. “Colonial Archives and the Arts of Governance.” Archival Science 2 (March): 87–109. https://doi.org/10.1007/bf02435632.\n\n\nTal, Eran. 2020. “Measurement in Science.” In The Stanford Encyclopedia of Philosophy, edited by Edward Zalta, Fall 2020. https://plato.stanford.edu/archives/fall2020/entries/measurement-science/; Metaphysics Research Lab, Stanford University.\n\n\nTaylor, Adam. 2015. “New Zealand Says No to Jedis.” The Washington Post, September. https://www.washingtonpost.com/news/worldviews/wp/2015/09/29/new-zealand-says-no-to-jedis/.\n\n\nTimbers, Tiffany. 2020. canlang: Canadian Census language data. https://ttimbers.github.io/canlang/.\n\n\nVanhoenacker, Mark. 2015. Skyfaring: A Journey with a Pilot. 1st ed. Alfred A. Knopf.\n\n\nvon Bergmann, Jens, Dmitry Shkolnik, and Aaron Jacobs. 2021. cancensus: R package to access, retrieve, and work with Canadian Census data and geography. https://mountainmath.github.io/cancensus/.\n\n\nWalby, Kevin, and Alex Luscombe. 2019. Freedom of Information and Social Science Research Design. Routledge.\n\n\nWalker, Kyle. 2022. Analyzing US Census Data. Chapman; Hall/CRC. https://walker-data.com/census-r/index.html.\n\n\nWalker, Kyle, and Matt Herman. 2022. 
tidycensus: Load US Census Boundary and Attribute Data as “tidyverse” and “sf”-Ready Data Frames. https://CRAN.R-project.org/package=tidycensus.\n\n\nWhitby, Andrew. 2020. The Sum of the People. New York: Basic Books.\n\n\nWhitelaw, James. 1805. An Essay on the Population of Dublin. Being the Result of an Actual Survey Taken in 1798, with Great Care and Precision, and Arranged in a Manner Entirely New. Graisberry; Campbell.\n\n\nWickham, Hadley, Mara Averick, Jenny Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.\n\n\nWu, Changbao, and Mary Thompson. 2020. Sampling Theory and Practice. Springer.\n\n\nXie, Yihui. 2023. knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.\n\n\nZhang, Ping, XunPeng Shi, YongPing Sun, Jingbo Cui, and Shuai Shao. 2019. “Have China’s provinces achieved their targets of energy intensity reduction? Reassessment based on nighttime lighting data.” Energy Policy 128 (May): 276–83. https://doi.org/10.1016/j.enpol.2019.01.014.",
+ "text": "6.5 Exercises\n\nPractice\n\n(Plan) Consider the following scenario: Every day for a year two people—Mark and Lauren—record the amount of snow that fell that day in the two different states they are from. Please sketch what a dataset could look like, and then sketch a graph that you could build to show all observations.\n(Simulate) Please further consider the scenario described and simulate the situation with every variable independent of each other. Then write five tests based on the simulated data.\n(Acquire) Please obtain some actual data about snowfall and add a script updating the simulated tests to these actual data.\n(Explore) Build a graph and table using the real data.\n(Communicate) Please write some text to accompany the graph and table. Separate the code appropriately into R files and a Quarto doc. Submit a link to a high-quality GitHub repo.\n\n\n\nQuiz\n\nWhat is a challenge when converting some phenomena in the world into a dataset (pick one)?\n\nThe high cost of data storage solutions.\nThe overabundance of unbiased data.\nThe lack of available data collection tools.\nDeciding what to measure and how to measure it appropriately.\n\nWith reference to Daston (2000), please discuss whether GDP and counts of population are invented or discovered?\nAccording to metrology, which of the following best defines measurement (pick one)?\n\nThe estimation of unknown variables using predictive models.\nThe calculation of statistical significance in data analysis.\nThe act of assigning numbers to objects arbitrarily.\nThe process of experimentally obtaining quantity values that can be attributed to a property of a phenomenon, body, or substance.\n\nIn at least two paragraphs, and using your own words, please define measurement error and provide an example from your own experience.\nWith reference to Gargiulo (2022), please discuss challenges of measurement in the real world.\nWhat does the validity of a measurement refer to (pick one)?\n\nThe 
statistical significance of the measurement results.\nThe degree to which a measurement accurately reflects the concept it is intended to measure.\nThe speed at which data can be collected.\nThe precision with which a measurement can be replicated.\n\nHow do Kennedy et al. (2022) define ethics (pick one)?\n\nRespecting the perspectives and dignity of individual survey respondents.\nGenerating estimates of the general population and for subpopulations of interest.\nUsing more complicated procedures only when they serve some useful function.\n\nWhich of the following best describes measurement error (pick one)?\n\nThe difference between the observed value and the true value of what is being measured.\nAn error that occurs only when data is not normally distributed.\nA type of error that can be eliminated with better instruments.\nA deliberate alteration of data to mislead analysis.\n\nWhat is censored data (pick one)?\n\nData that have been corrupted and are unreadable.\nData that have been intentionally omitted due to privacy concerns.\nData where the value of an observation is only partially known.\nData collected from unauthorized sources.\n\nHow do truncated data differ from censored data (pick one)?\n\nTruncated data deal with overestimation, censored data with underestimation.\nIn truncated data, certain values are omitted from the dataset, whereas in censored data, the values are partially known but incomplete.\nTruncated data are less accurate than censored data.\n\nWhat does missing completely at random (MCAR) mean (pick one)?\n\nMissing data that can be easily predicted using statistical models.\nMissing data where the likelihood of being missing is related to the unobserved data.\nMissing data where the likelihood of being missing is related to the observed data.\nMissing data where the missingness is entirely random and unrelated to any data, observed or unobserved.\n\nWhy are censuses considered crucial datasets (pick one)?\n\nThey are conducted 
infrequently and thus have novel information.\nThey aim to collect data on every unit in a population, providing comprehensive datasets designed for analysis.\nThey are private datasets that offer exclusive insights.\nThey focus exclusively on agricultural data, which is vital for economic planning.\n\nFrom Statistics Canada (2023), why is the quality of census data evaluated (pick one)?\n\nTo reduce the cost of future censuses.\nTo limit data dissemination.\nTo improve data privacy.\nTo ensure census data are reliable and meet user needs.\n\nFrom Statistics Canada (2023), what is the primary source of sampling error (pick one)?\n\nNon-response from individuals.\nThe use of a sample, rather than a population.\nData capture errors during processing.\nMisclassification of dwellings.\n\nFrom Statistics Canada (2023), which error occurs when people or dwellings are omitted or counted more than once (pick one)?\n\nNon-response error.\nCoverage error.\nSampling error.\nProcessing error.\n\nFrom Statistics Canada (2023), which type of error is related to misunderstanding or misreporting by respondents or enumerators (pick one)?\n\nSampling error.\nResponse error.\nProcessing error.\nCoverage error.\n\nFrom Statistics Canada (2023), what is the purpose of the Dwelling Classification Survey (DCS) (pick one)?\n\nTo improve the long-form questionnaire.\nTo classify new housing developments.\nTo collect data on household income.\nTo study classification errors in dwellings.\n\nFrom Statistics Canada (2023), what is the Census Undercoverage Study (CUS) designed to estimate (pick one)?\n\nThe number of people omitted from the census.\nThe variance of sampling errors.\nImputation rates for missing data.\nNon-response rates for long-form questionnaires.\n\nFrom Statistics Canada (2023), what does the Census Overcoverage Study (COS) identify (pick one)?\n\nDwellings that were misclassified.\nCases where individuals were counted more than once.\nPeople who were missed by the 
census.\nInvalid data entries during processing.\n\nFrom Statistics Canada (2023), how is the Total Non-Response (TNR) rate for the census defined (pick one)?\n\nThe number of imputed values in the census data.\nThe percentage of incorrect responses in the census.\nThe percentage of questionnaires with partial responses.\nThe proportion of dwellings where the questionnaire does not meet the minimum content.\n\nWith reference to W. Chen et al. (2019) and Martínez (2022), to what extent do you think we can trust government statistics? Please write at least a page and compare at least two governments in your answer.\nThe 2021 census in Canada asked, firstly, “What was this person’s sex at birth? Sex refers to sex assigned at birth. Male/Female”, and then “What is this person’s gender? Refers to current gender which may be different from sex assigned at birth and may be different from what is indicated on legal documents. Male/Female/Or please specify this person’s gender (space for a typed or handwritten answer)”. With reference to Statistics Canada (2020), please discuss the extent to which you think this is an appropriate way for the census to have proceeded. You are welcome to discuss the case of a different country if you are more familiar with that.\nPlease use IPUMS to access the 2020 ACS. Making use of the codebook, how many respondents were there in California (STATEICP) that had a doctoral degree as their highest educational attainment (EDUC) (pick one)?\n\n2,007\n732\n5,765\n4,684\n\nPlease use IPUMS to access the 1940 1% sample. 
Making use of the codebook, how many respondents were there in California (STATEICP) with 5+ years of college as their highest educational attainment (EDUC) (pick one)?\n\n532\n1,056\n904\n1,789\n\nWith reference to Dean (2022), please discuss the difference between probability and non-probability sampling.\nWhat is a target population (pick one)?\n\nThe entire group about which we want to draw conclusions.\nThe list of all units in the target population from which a sample can be drawn.\nA subset of the population that is easily accessible for sampling.\nThe list of individuals who have agreed to participate in the study.\n\nWhat is a sampling frame (pick one)?\n\nThe method used to collect data from respondents.\nThe list of all units in the target population from which a sample can be drawn.\nThe timeframe during which data collection occurs.\nThe entire group about which the researcher wants to draw conclusions.\n\nWhat is a difference between probability and non-probability sampling (pick one)?\n\nProbability sampling does not require a sampling frame.\nProbability sampling is always more cost-effective than non-probability sampling.\nIn probability sampling every unit has a known chance of being selected, whereas in non-probability sampling selection is not based on probabilities.\nNon-probability sampling methods are more accurate than probability sampling methods.\n\nWith reference to Beaumont (2020), do you think that probability surveys will disappear, and why or why not (please write a paragraph or two)?\nWhich sampling method involves selecting units such that every observation in the sampling frame has an equal chance of being chosen (pick one)?\n\nSystematic sampling.\nStratified sampling.\nSimple random sampling.\nCluster sampling.\n\nIn which sampling method is the first unit selected at random, and subsequent units selected at regular intervals (pick one)?\n\nSystematic sampling.\nConvenience sampling.\nStratified sampling.\nCluster 
sampling.\n\nWhat characterizes stratified sampling (pick one)?\n\nThe population is divided into subgroups, and a random sample is taken from each subgroup.\nParticipants recruit other participants through their networks.\nEntire clusters are randomly selected, and all units within them are sampled.\nBy randomly selecting entire groups or clusters, then sampling all or some units within them.\n\nHow are units selected in cluster sampling (pick one)?\n\nThe population is divided into subgroups, and a random sample is taken from each subgroup.\nBy randomly selecting entire groups or clusters, then sampling all or some units within them.\nBy selecting every nth unit from a list.\nBy choosing units based on specific quotas.\n\nPlease name some reasons why you may wish to use cluster sampling (select all that apply)?\n\nBalance in responses in terms of sub-populations.\nAdministrative convenience.\nEfficiency in terms of money.\n\nPlease consider the integers [1:100]. If I were interested in implementing a sampling approach, based on a sample of only 10, to estimate the median, which approach would I choose (pick one)?\n\nSimple random sampling.\nSystematic sampling.\nCluster sampling.\nStratified sampling.\n\nWrite R code that considers the numbers 1 to 100, and estimates the mean, based on a cluster sample of 20 numbers. Re-run this code one hundred times, noting the estimate of the mean each time, and then plot the histogram. What do you notice about the graph? 
Add a paragraph of explanation and discussion.\nFrom Bowley (1913), how was the sample for the study of Reading’s households selected (pick one)?\n\nBy random selection of streets.\nBy selecting households based on income.\nBy marking one building in ten from the local directory.\nBy interviewing every fifth household.\n\nWhat is the sampling approach used by Bowley (1913) (pick one)?\n\nCluster sampling.\nSimple random sampling.\nStratified sampling.\nSystematic sampling.\n\nFrom Bowley (1913), which method was employed to provide estimates for the whole of Reading, based on the sample data (pick one)?\n\nUsing proportional census data.\nCalculating median income levels.\nApplying a multiplier.\n\nFrom Bowley (1913), what was the method of collecting rent and earnings data from the working-class households in Reading (pick one)?\n\nBy interviewing landlords.\nBy inspecting census records.\nBy collecting data from tax records.\nThrough volunteer interviews with households.\n\nPlease discuss the following statement from Bowley (1913, 673) “It may appear to persons who are not familiar with processes of sampling that a proportion of 1 in 21 is too small for any conclusion, and that in any case not more than a vague probability can be obtained. 
… [but] the precision of a sample depends not on its proportion to the whole, but on its own magnitude, if the conditions of random sampling are secured, as it is believed they have been in this inquiry.”\nFrom Neyman (1934), what is the primary goal of stratified sampling (pick one)?\n\nTo reduce the need for randomization.\nTo ensure all strata of the population are represented.\nTo increase bias in sample selection.\n\nWhat was the main focus of Neyman (1934) (pick one)?\n\nIntroduction of the simple random sample.\nDevelopment of quota sampling techniques.\nElimination of all biases in sample selection.\nThe distinction between stratified sampling and non-probability sampling.\n\nFrom Neyman (1934), what is one advantage of stratified sampling over simple random sampling (pick one)?\n\nIt eliminates the need for a sampling frame.\nIt requires fewer samples.\nIt is cheaper.\nIt may increase the precision of estimates.\n\nFrom Neyman (1934, sec. V), which approach allows a consistent estimate of the average collective character of a population, whatever the properties of the population (pick one)?\n\nRandom sampling.\nPurposive sampling.\n\nWhich of the following best describes convenience sampling (pick one)?\n\nSampling that ensures every subgroup is represented proportionally.\nSelecting participants based on random numbers.\nChoosing participants who are easiest to access.\nUsing algorithms to select a sample.\n\nWhat is snowball sampling commonly used for (pick one)?\n\nStudying well-defined populations with comprehensive sampling frames.\nResearching hidden or hard-to-reach populations by having existing participants recruit future participants.\nEnsuring equal representation across different demographic groups.\nEstimating the total population size using capture-recapture methods.\n\nWhat is respondent-driven sampling (pick one)?\n\nA sampling technique that uses automated systems to select respondents.\nA form of non-probability sampling where respondents 
refer other respondents, often used for hidden populations, and includes incentives for recruitment.\nA type of random sampling where respondents are selected based on a probability mechanism.\nA sampling method that relies on respondents volunteering without any recruitment efforts.\n\nPretend that we have conducted a survey of everyone in Canada, where we asked for age, sex, and gender. Your friend claims that there is no need to worry about uncertainty “because we have the whole population”. Is your friend right or wrong, and why?\nWith reference to Meng (2018), please discuss the claim: “When you have one million responses, you do not need to worry about randomization”.\nImagine you take a job at a bank and they already have a dataset for you to use. What are some questions that you should explore when deciding whether those data will be useful to you?\n\n\n\nActivity I\nThe purpose of this activity is to develop comfort with:\n\ndealing with larger datasets,\nunderstanding ratio estimators, and\nsampling.\n\nPlease use IPUMS to access the 2022 ACS. Making use of the codebook, how many respondents were there in each state (STATEICP) that had a doctoral degree as their highest educational attainment (EDUC)? (Hint: Make this a column in a tibble.)\nIf there were 391,171 respondents in California (STATEICP) across all levels of education, then please use the ratio estimators approach of Laplace to estimate the total number of respondents in each state, i.e. take the ratio that you worked out for California and apply it to the rest of the states. (Hint: You can now work out the ratio between the number of respondents with doctoral degrees in a state and the number of respondents in a state and then apply that ratio to your column of the number of respondents with a doctoral degree in each state.) 
Compare it to the actual number of respondents in each state.\nWrite a brief paper using Quarto and submit a link to a GitHub repo (one repo per group) that meets the general expectations of the course. Components of the rubric that are relevant are: “R/Python is cited”, “Data are appropriately cited”, “Class paper”, “LLM usage is documented”, “Title”, “Author, date, and repo”, “Abstract”, “Introduction”, “Data”, “Results”, “Discussion”, “Prose”, “Cross-references”, “Captions”, “Graphs and tables”, “Referencing”, “Commits”, “Sketches”, “Simulation”, “Tests”, and “Reproducible workflow”.\nYour paper should cover at least:\n\nInstructions on how to obtain the data (in the appendix).\nA brief overview of the ratio estimators approach.\nYour estimates and the actual number of respondents.\nSome explanation of why you think they are different, i.e. the strengths and weaknesses of using ratio estimators.\nA discussion of the strengths and weaknesses of the sampling approaches used by the ACS. (Hint: One fun thing could be to find the actual population of one state, then given the number of respondents in that and every state, use the ratio estimators approach to estimate the population of every state and then compare it with the actual populations.)\n\nDo not forget to cite the data (but do not upload the raw data to GitHub).\nNote:\n\nIf you have issues opening the zipped file, open terminal, navigate to the folder then use: gunzip usa_00004.csv.gz (change the filename to whatever yours is).\nTo focus on doctoral degrees you need EDUCD rather than EDUC (but you can select that one once you have downloaded the data).\nDo not upload the raw data to GitHub - it is too big and also IPUMS asks that you do not.\n\n\n\nActivity II\nThe purpose of this activity is to:\n\ncompare simple random with cluster sampling.\nconsider the difference between probability and non-probability sampling.\n\nA biobank contains de-identified biomedical data. 
For instance, UK Biobank contains samples from half a million UK participants. One use is to look at a respondent’s whole genome sequencing data and then relate it to various conditions they have to better understand the extent of genetic determination.\nPlease simulate the UK Biobank dataset. A whole genome would be a sequence of the letters A, T, C, and G, of a length of around 3 billion. For each individual, simulate only a length of 12 in total, made up of four groups of three letters. Then consider four conditions: colon cancer, cystic fibrosis, Parkinson’s disease, and skin cancer. Create some probabilistic association between different three letter combinations in your simulated genome sequences and whether a person has that condition. For instance, perhaps having AAA in the first three positions is associated with a five percent higher rate of colon cancer. Pretend that UK Biobank uses simple random sampling. Don’t forget to set a seed.\nDavies et al. (2024) argue that UK Biobank should use family-based sampling. That is, they would cluster at the level of families. We would expect that families would have similar (but not necessarily the same) sequences. Please redo your simulation to assume families of sizes one to five.\nNow analyze your simulations and compare your ability to find relationships between three letter positions and various conditions. Discuss this in relation to the difference between simple random sampling and cluster sampling.\nThe other aspect to note is that sampling at the level of the family may be easier for the collectors, because if they are collecting data about one member of the family then it may be slightly more convenient to collect data about another member of that family. Discuss the difference between probability and non-probability sampling and the nuances of the distinction.\nWrite a brief paper using Quarto and submit a link to a GitHub repo (one repo per group) that meets the general expectations of the course. 
Components of the rubric that are relevant are: “R/Python is cited”, “Class paper”, “LLM usage is documented”, “Title”, “Author, date, and repo”, “Abstract”, “Introduction”, “Data”, “Results”, “Discussion”, “Prose”, “Cross-references”, “Captions”, “Graphs and tables”, “Referencing”, “Commits”, “Sketches”, “Simulation”, “Tests”, and “Reproducible workflow”.\n\n\n\n\nAchen, Christopher. 1978. “Measuring Representation.” American Journal of Political Science 22 (3): 475–510. https://doi.org/10.2307/2110458.\n\n\nAnderson, Margo. (1988) 2015. The American Census: A Social History. 2nd ed. Yale University Press.\n\n\nAnderson, Margo, and Stephen Fienberg. 1999. Who Counts?: The Politics of Census-Taking in Contemporary America. Russell Sage Foundation. http://www.jstor.org/stable/10.7758/9781610440059.\n\n\nAu, Randy. 2022. “Celebrating Everyone Counting Things,” February. https://counting.substack.com/p/celebrating-everyone-counting-things.\n\n\nBaker, Reg, Michael Brick, Nancy Bates, Mike Battaglia, Mick Couper, Jill Dever, Krista Gile, and Roger Tourangeau. 2013. “Summary Report of the AAPOR Task Force on Non-Probability Sampling.” Journal of Survey Statistics and Methodology 1 (2): 90–143. https://doi.org/10.1093/jssam/smt008.\n\n\nBeaumont, Jean-Francois. 2020. “Are Probability Surveys Bound to Disappear for the Production of Official Statistics?” Survey Methodology 46 (1): 1–29.\n\n\nBecker, Richard, Allan Wilks, Ray Brownrigg, Thomas Minka, and Alex Deckmyn. 2022. maps: Draw Geographical Maps. https://CRAN.R-project.org/package=maps.\n\n\nBerdine, Gilbert, Vincent Geloso, and Benjamin Powell. 2018. “Cuban Infant Mortality and Longevity: Health Care or Repression?” Health Policy and Planning 33 (6): 755–57. https://doi.org/10.1093/heapol/czy033.\n\n\nBlair, Graeme, Jasper Cooper, Alexander Coppock, and Macartan Humphreys. 2019. “Declaring and Diagnosing Research Designs.” American Political Science Review 113 (3): 838–59. 
https://doi.org/10.1017/S0003055419000194.\n\n\nBowen, Claire McKay. 2022. Protecting Your Privacy in a Data-Driven World. 1st ed. Chapman; Hall/CRC. https://doi.org/10.1201/9781003122043.\n\n\nBowley, Arthur Lyon. 1913. “Working-Class Households in Reading.” Journal of the Royal Statistical Society 76 (7): 672–701. https://doi.org/10.2307/2339708.\n\n\nBreiman, Leo. 1994. “The 1991 Census Adjustment: Undercount or Bad Data?” Statistical Science 9 (4). https://doi.org/10.1214/ss/1177010259.\n\n\nBrewer, Ken. 2013. “Three Controversies in the History of Survey Sampling.” Survey Methodology 39 (2): 249–63.\n\n\nCardoso, Tom. 2020. “Bias behind bars: A Globe investigation finds a prison system stacked against Black and Indigenous inmates.” The Globe and Mail, October. https://www.theglobeandmail.com/canada/article-investigation-racial-bias-in-canadian-prison-risk-assessments/.\n\n\nChambliss, Daniel. 1989. “The Mundanity of Excellence: An Ethnographic Report on Stratification and Olympic Swimmers.” Sociological Theory 7 (1): 70–86. https://doi.org/10.2307/202063.\n\n\nChen, Heng, Marie-Hélène Felt, and Christopher Henry. 2018. “2017 Methods-of-Payment Survey: Sample Calibration and Variance Estimation.” Bank of Canada. https://doi.org/10.34989/tr-114.\n\n\nChen, Wei, Xilu Chen, Chang-Tai Hsieh, and Zheng Song. 2019. “A Forensic Examination of China’s National Accounts.” Brookings Papers on Economic Activity, 77–127. https://www.jstor.org/stable/26798817.\n\n\nColombo, Tommaso, Holger Fröning, Pedro Javier Garcı̀a, and Wainer Vandelli. 2016. “Optimizing the Data-Collection Time of a Large-Scale Data-Acquisition System Through a Simulation Framework.” The Journal of Supercomputing 72 (12): 4546–72. https://doi.org/10.1007/s11227-016-1764-1.\n\n\nCrawford, Kate. 2021. Atlas of AI. 1st ed. New Haven: Yale University Press.\n\n\nCrosby, Alfred. 1997. The Measure of Reality: Quantification in Western Europe, 1250-1600. 
Cambridge: Cambridge University Press.\n\n\nDaston, Lorraine. 2000. “Why Statistics Tend Not Only to Describe the World but to Change It.” London Review of Books 22 (8). https://www.lrb.co.uk/the-paper/v22/n08/lorraine-daston/why-statistics-tend-not-only-to-describe-the-world-but-to-change-it.\n\n\nDavies, Neil M., Gibran Hemani, Jenae M. Neiderhiser, Hilary C. Martin, Melinda C. Mills, Peter M. Visscher, Loïc Yengo, Alexander Strudwick Young, and Matthew C. Keller. 2024. “The Importance of Family-Based Sampling for Biobanks.” Nature 634 (8035): 795–803. https://doi.org/10.1038/s41586-024-07721-5.\n\n\nDavis, Darren. 1997. “Nonrandom Measurement Error and Race of Interviewer Effects Among African Americans.” The Public Opinion Quarterly 61 (1): 183–207. https://doi.org/10.1086/297792.\n\n\nDean, Natalie. 2022. “Tracking COVID-19 Infections: Time for Change.” Nature 602 (7896): 185. https://doi.org/10.1038/d41586-022-00336-8.\n\n\nFisher, Ronald. (1925) 1928. Statistical Methods for Research Workers. 2nd ed. London: Oliver; Boyd.\n\n\nFlake, Jessica, and Eiko Fried. 2020. “Measurement Schmeasurement: Questionable Measurement Practices and How to Avoid Them.” Advances in Methods and Practices in Psychological Science 3 (4): 456–65. https://doi.org/10.1177/2515245920952393.\n\n\nFrandell, Ashlee, Mary Feeney, Timothy Johnson, Eric Welch, Lesley Michalegko, and Heyjie Jung. 2021. “The Effects of Electronic Alert Letters for Internet Surveys of Academic Scientists.” Scientometrics 126 (8): 7167–81. https://doi.org/10.1007/s11192-021-04029-3.\n\n\nFried, Eiko, Jessica Flake, and Donald Robinaugh. 2022. “Revisiting the Theoretical and Methodological Foundations of Depression Measurement.” Nature Reviews Psychology 1 (6): 358–68. https://doi.org/10.1038/s44159-022-00050-2.\n\n\nFuller, Mark, and James Mosher. 1987. “Raptor Survey Techniques.” In Raptor Management Techniques Manual, edited by Beth Pendleton, Brian Millsap, Keith Cline, and David Bird, 37–65. 
National Wildlife Federation. https://www.sandiegocounty.gov/content/dam/sdc/pds/ceqa/JVR/AdminRecord/IncorporatedByReference/Appendices/Appendix-D---Biological-Resources-Report/Fuller%20and%20Mosher%201987.pdf.\n\n\nGarfinkel, Irwin, Lee Rainwater, and Timothy Smeeding. 2006. “A Re-Examination of Welfare States and Inequality in Rich Nations: How in-Kind Transfers and Indirect Taxes Change the Story.” Journal of Policy Analysis and Management 25 (4): 897–919. https://doi.org/10.1002/pam.20213.\n\n\nGargiulo, Maria. 2022. “Statistical Biases, Measurement Challenges, and Recommendations for Studying Patterns of Femicide in Conflict.” Peace Review 34 (2): 163–76. https://doi.org/10.1080/10402659.2022.2049002.\n\n\nGazeley, Ursula, Georges Reniers, Hallie Eilerts-Spinelli, Julio Romero Prieto, Momodou Jasseh, Sammy Khagayi, and Veronique Filippi. 2022. “Women’s Risk of Death Beyond 42 Days Post Partum: A Pooled Analysis of Longitudinal Health and Demographic Surveillance System Data in Sub-Saharan Africa.” The Lancet Global Health 10 (11): e1582–89. https://doi.org/10.1016/s2214-109x(22)00339-4.\n\n\nGelman, Andrew, Sharad Goel, Douglas Rivers, and David Rothschild. 2016. “The Mythical Swing Voter.” Quarterly Journal of Political Science 11 (1): 103–30. https://doi.org/10.1561/100.00015031.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and Other Stories. Cambridge University Press. https://avehtari.github.io/ROS-Examples/.\n\n\nGibney, Elizabeth. 2022. “The leap second’s time is up: world votes to stop pausing clocks.” Nature 612 (7938): 18–18. https://doi.org/10.1038/d41586-022-03783-5.\n\n\nGleick, James. 1990. “The Census: Why We Can’t Count.” The New York Times, July. https://www.nytimes.com/1990/07/15/magazine/the-census-why-we-can-t-count.html.\n\n\nGodfrey, Ernest. 1918. “History and Development of Statistics in Canada.” In The History of Statistics–Their Development and Progress in Many Countries. 
New York: Macmillan, edited by John Koren, 179–98. Macmillan Company of New York.\n\n\nGoodman, Leo. 1961. “Snowball Sampling.” The Annals of Mathematical Statistics 32 (1): 148–70. https://doi.org/10.1214/aoms/1177705148.\n\n\nGroves, Robert, and Lars Lyberg. 2010. “Total Survey Error: Past, Present, and Future.” Public Opinion Quarterly 74 (5): 849–79. https://doi.org/10.1093/poq/nfq065.\n\n\nGutman, Robert. 1958. “Birth and Death Registration in Massachusetts: II. The Inauguration of a Modern System, 1800-1849.” The Milbank Memorial Fund Quarterly 36 (4): 373–402.\n\n\nHandcock, Mark, and Krista Gile. 2011. “Comment: On the Concept of Snowball Sampling.” Sociological Methodology 41 (1): 367–71. https://doi.org/10.1111/j.1467-9531.2011.01243.x.\n\n\nHartocollis, Anemona. 2022. “U.S. News Ranked Columbia No. 2, but a Math Professor Has His Doubts.” The New York Times, March. https://www.nytimes.com/2022/03/17/us/columbia-university-rank.html.\n\n\nHawes, Michael. 2020. “Implementing Differential Privacy: Seven Lessons From the 2020 United States Census.” Harvard Data Science Review 2 (2). https://doi.org/10.1162/99608f92.353c6f99.\n\n\nHeckathorn, Douglas. 1997. “Respondent-Driven Sampling: A New Approach to the Study of Hidden Populations.” Social Problems 44 (2): 174–99. https://doi.org/10.2307/3096941.\n\n\nHopper, Nate. 2022. “The Thorny Problem of Keeping the Internet’s Time.” The New Yorker, September. https://www.newyorker.com/tech/annals-of-technology/the-thorny-problem-of-keeping-the-internets-time.\n\n\nHyman, Michael, Luca Sartore, and Linda J Young. 2021. “Capture-Recapture Estimation of Characteristics of U.S. Local Food Farms Using a Web-Scraped List Frame.” Journal of Survey Statistics and Methodology 10 (4): 979–1004. https://doi.org/10.1093/jssam/smab008.\n\n\nInternational Organization Of Legal Metrology. 2007. International Vocabulary of Metrology – Basic and General Concepts and Associated Terms. 3rd ed. 
https://www.oiml.org/en/files/pdf%5Fv/v002-200-e07.pdf.\n\n\nJones, Arnold. 1953. “Census Records of the Later Roman Empire.” The Journal of Roman Studies 43: 49–64. https://doi.org/10.2307/297781.\n\n\nKalgin, Alexander. 2014. “Implementation of Performance Management in Regional Government in Russia: Evidence of Data Manipulation.” Public Management Review 18 (1): 110–38. https://doi.org/10.1080/14719037.2014.965271.\n\n\nKennedy, Lauren, Katharine Khanna, Daniel Simpson, Andrew Gelman, Yajun Jia, and Julien Teitler. 2022. “He, She, They: Using Sex and Gender in Survey Adjustment.” https://arxiv.org/abs/2009.14401.\n\n\nKoitsalu, Marie, Martin Eklund, Jan Adolfsson, Henrik Grönberg, and Yvonne Brandberg. 2018. “Effects of Pre-Notification, Invitation Length, Questionnaire Length and Reminder on Participation Rate: A Quasi-Randomised Controlled Trial.” BMC Medical Research Methodology 18 (3): 1–5. https://doi.org/10.1186/s12874-017-0467-5.\n\n\nLane, Nick. 2015. “The Unseen World: Reflections on Leeuwenhoek (1677) ‘Concerning Little Animals’.” Philosophical Transactions of the Royal Society B: Biological Sciences 370 (1666): 20140344. https://doi.org/10.1098/rstb.2014.0344.\n\n\nLeos-Barajas, Vianey, Theoni Photopoulou, Roland Langrock, Toby Patterson, Yuuki Watanabe, Megan Murgatroyd, and Yannis Papastamatiou. 2016. “Analysis of Animal Accelerometer Data Using Hidden Markov Models.” Methods in Ecology and Evolution 8 (2): 161–73. https://doi.org/10.1111/2041-210x.12657.\n\n\nLevine, Judah, Patrizia Tavella, and Martin Milton. 2022. “Towards a Consensus on a Continuous Coordinated Universal Time.” Metrologia 60 (1): 014001. https://doi.org/10.1088/1681-7575/ac9da5.\n\n\nLips, Hilary. 2020. Sex and Gender: An Introduction. 7th ed. Illinois: Waveland Press.\n\n\nLohr, Sharon. (1999) 2022. Sampling: Design and Analysis. 3rd ed. Chapman; Hall/CRC.\n\n\nLuebke, David Martin, and Sybil Milton. 1994. 
“Locating the Victim: An Overview of Census-Taking, Tabulation Technology, and Persecution in Nazi Germany.” IEEE Annals of the History of Computing 16 (3): 25–39. https://doi.org/10.1109/MAHC.1994.298418.\n\n\nLumley, Thomas. 2020. “survey: analysis of complex survey samples.” https://cran.r-project.org/web/packages/survey/index.html.\n\n\nMartı́nez, Luis. 2022. “How Much Should We Trust the Dictator’s GDP Growth Estimates?” Journal of Political Economy 130 (10): 2731–69. https://doi.org/10.1086/720458.\n\n\nMeng, Xiao-Li. 2018. “Statistical Paradises and Paradoxes in Big Data (i): Law of Large Populations, Big Data Paradox, and the 2016 US Presidential Election.” The Annals of Applied Statistics 12 (2): 685–726. https://doi.org/10.1214/18-AOAS1161SF.\n\n\nMill, James. 1817. The History of British India. 1st ed. https://books.google.ca/books?id=Orw_AAAAcAAJ.\n\n\nMills, David L. 1991. “Internet Time Synchronization: The Network Time Protocol.” IEEE Transactions on Communications 39 (10): 1482–93.\n\n\nMitchell, Alanna. 2022a. “Get Ready for the New, Improved Second.” The New York Times, April. https://www.nytimes.com/2022/04/25/science/time-second-measurement.html.\n\n\n———. 2022b. “Time Has Run Out for the Leap Second.” The New York Times, November. https://www.nytimes.com/2022/11/14/science/time-leap-second.html.\n\n\nMitrovski, Alen, Xiaoyan Yang, and Matthew Wankiewicz. 2020. “Joe Biden Projected to Win Popular Vote in 2020 US Election.” https://github.com/matthewwankiewicz/US_election_forecast.\n\n\nMolanphy, Chris. 2012. “100 & Single: Three Rules to Define the Term ‘One-Hit Wonder’ in 2012.” The Village Voice, September. https://www.villagevoice.com/2012/09/10/100-single-three-rules-to-define-the-term-one-hit-wonder-in-2012/.\n\n\nMorange, Michel. 2016. A History of Biology. New Jersey: Princeton University Press.\n\n\nNewman, Daniel. 2014. “Missing Data: Five Practical Guidelines.” Organizational Research Methods 17 (4): 372–411. 
https://doi.org/10.1177/1094428114548590.\n\n\nNeyman, Jerzy. 1934. “On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection.” Journal of the Royal Statistical Society 97 (4): 558–625. https://doi.org/10.2307/2342192.\n\n\nNobles, Melissa. 2002. “Racial Categorization and Censuses.” In Census and Identity: The Politics of Race, Ethnicity, and Language in National Censuses, edited by David Kertzer and Dominique Arel, 43–70. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511606045.003.\n\n\nPlant, Anne, and Robert Hanisch. 2020. “Reproducibility in Science: A Metrology Perspective.” Harvard Data Science Review 2 (4). https://doi.org/10.1162/99608f92.eb6ddee4.\n\n\nPrévost, Jean-Guy, and Jean-Pierre Beaud. 2015. Statistics, Public Debate and the State, 1800–1945: A Social, Political and Intellectual History of Numbers. Routledge.\n\n\nR Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.\n\n\nRegister, Yim. 2020. “Introduction to Sampling and Randomization.” YouTube, November. https://youtu.be/U272FFxG8LE.\n\n\nRockoff, Hugh. 2019. “On the Controversies Behind the Origins of the Federal Economic Statistics.” Journal of Economic Perspectives 33 (1): 147–64. https://doi.org/10.1257/jep.33.1.147.\n\n\nRose, Angela, Rebecca Grais, Denis Coulombier, and Helga Ritter. 2006. “A Comparison of Cluster and Systematic Sampling Methods for Measuring Crude Mortality.” Bulletin of the World Health Organization 84: 290–96. https://doi.org/10.2471/blt.05.029181.\n\n\nRuggles, Steven, Sarah Flood, Sophia Foster, Ronald Goeken, Jose Pacas, Megan Schouweiler, and Matthew Sobek. 2021. “IPUMS USA: Version 11.0.” Minneapolis, MN: IPUMS. https://doi.org/10.18128/d010.v11.0.\n\n\nSakshaug, Joseph, Ting Yan, and Roger Tourangeau. 2010. 
“Nonresponse Error, Measurement Error, and Mode of Data Collection: Tradeoffs in a Multi-Mode Survey of Sensitive and Non-Sensitive Items.” Public Opinion Quarterly 74 (5): 907–33. https://doi.org/10.1093/poq/nfq057.\n\n\nSalganik, Matthew, and Douglas Heckathorn. 2004. “Sampling and Estimation in Hidden Populations Using Respondent-Driven Sampling.” Sociological Methodology 34 (1): 193–240. https://doi.org/10.1111/j.0081-1750.2004.00152.x.\n\n\nScott, James. 1998. Seeing Like a State. Yale University Press.\n\n\nSobek, Matthew, and Steven Ruggles. 1999. “The IPUMS Project: An Update.” Historical Methods: A Journal of Quantitative and Interdisciplinary History 32 (3): 102–10. https://doi.org/10.1080/01615449909598930.\n\n\nSomers, James. 2017. “Torching the Modern-Day Library of Alexandria.” The Atlantic, April. https://www.theatlantic.com/technology/archive/2017/04/the-tragedy-of-google-books/523320/.\n\n\nStatistics Canada. 2020. “Sex at Birth and Gender: Technical Report on Changes for the 2021 Census.” Statistics Canada. https://www12.statcan.gc.ca/census-recensement/2021/ref/98-20-0002/982000022020002-eng.pdf.\n\n\n———. 2023. “Guide to the Census of Population, 2021.” Statistics Canada. https://www12.statcan.gc.ca/census-recensement/2021/ref/98-304/98-304-x2021001-eng.pdf.\n\n\nSteckel, Richard. 1991. “The Quality of Census Data for Historical Inquiry: A Research Agenda.” Social Science History 15 (4): 579–99. https://doi.org/10.2307/1171470.\n\n\nStigler, Stephen. 1986. The History of Statistics. Massachusetts: Belknap Harvard.\n\n\nStoler, Ann Laura. 2002. “Colonial Archives and the Arts of Governance.” Archival Science 2 (March): 87–109. https://doi.org/10.1007/bf02435632.\n\n\nTal, Eran. 2020. “Measurement in Science.” In The Stanford Encyclopedia of Philosophy, edited by Edward Zalta, Fall 2020. https://plato.stanford.edu/archives/fall2020/entries/measurement-science/; Metaphysics Research Lab, Stanford University.\n\n\nTaylor, Adam. 2015. 
“New Zealand Says No to Jedis.” The Washington Post, September. https://www.washingtonpost.com/news/worldviews/wp/2015/09/29/new-zealand-says-no-to-jedis/.\n\n\nTimbers, Tiffany. 2020. canlang: Canadian Census language data. https://ttimbers.github.io/canlang/.\n\n\nVanhoenacker, Mark. 2015. Skyfaring: A Journey with a Pilot. 1st ed. Alfred A. Knopf.\n\n\nvon Bergmann, Jens, Dmitry Shkolnik, and Aaron Jacobs. 2021. cancensus: R package to access, retrieve, and work with Canadian Census data and geography. https://mountainmath.github.io/cancensus/.\n\n\nWalby, Kevin, and Alex Luscombe. 2019. Freedom of Information and Social Science Research Design. Routledge.\n\n\nWalker, Kyle. 2022. Analyzing US Census Data. Chapman; Hall/CRC. https://walker-data.com/census-r/index.html.\n\n\nWalker, Kyle, and Matt Herman. 2022. tidycensus: Load US Census Boundary and Attribute Data as “tidyverse” and “sf”-Ready Data Frames. https://CRAN.R-project.org/package=tidycensus.\n\n\nWhitby, Andrew. 2020. The Sum of the People. New York: Basic Books.\n\n\nWhitelaw, James. 1805. An Essay on the Population of Dublin. Being the Result of an Actual Survey Taken in 1798, with Great Care and Precision, and Arranged in a Manner Entirely New. Graisberry; Campbell.\n\n\nWickham, Hadley, Mara Averick, Jenny Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.\n\n\nWu, Changbao, and Mary Thompson. 2020. Sampling Theory and Practice. Springer.\n\n\nXie, Yihui. 2023. knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.\n\n\nZhang, Ping, XunPeng Shi, YongPing Sun, Jingbo Cui, and Shuai Shao. 2019. “Have China’s provinces achieved their targets of energy intensity reduction? Reassessment based on nighttime lighting data.” Energy Policy 128 (May): 276–83. https://doi.org/10.1016/j.enpol.2019.01.014.",
"crumbs": [
"Acquisition",
"
6 Measurement, censuses, and sampling"
@@ -797,7 +797,7 @@
"href": "09-clean_and_prepare.html#kenyan-census",
"title": "9 Clean and prepare",
"section": "9.7 2019 Kenyan census",
- "text": "9.7 2019 Kenyan census\nAs a final example, let us consider a more extensive situation and gather, clean, and prepare some data from the 2019 Kenyan census. We will focus on creating a dataset of single-year counts, by gender, for Nairobi.\nThe distribution of population by age, sex, and administrative unit from the 2019 Kenyan census can be downloaded here. While this format as a PDF makes it easy to look up a particular result, it is not overly useful if we want to model the data. In order to be able to do that, we need to convert this PDF into a tidy dataset that can be analyzed.\n\n9.7.1 Gather and clean\nWe first need to download and read in the PDF of the 2019 Kenyan census.4\n\ncensus_url <-\n paste0(\n \"https://www.knbs.or.ke/download/2019-kenya-population-and-\",\n \"housing-census-volume-iii-distribution-of-population-by-age-\",\n \"sex-and-administrative-units/?wpdmdl=5729&refresh=\",\n \"620561f1ce3ad1644519921\"\n )\n\ndownload.file(\n url = census_url,\n destfile = \"2019_Kenya_census.pdf\",\n mode = \"wb\"\n)\n\nWe can use pdf_text() from pdftools again here.\n\nkenya_census <-\n pdf_text(\n pdf = \"2019_Kenya_census.pdf\"\n )\n\nIn this example we will focus on the page of the PDF about Nairobi (Figure 9.8).\n\n\n\n\n\n\nFigure 9.8: Page from the 2019 Kenyan census about Nairobi\n\n\n\n\n9.7.1.1 Make rectangular\nThe first challenge is to get the dataset into a format that we can more easily manipulate. We will extract the relevant parts of the page. 
In this case, data about Nairobi is on page 410.\n\n# Focus on the page of interest\njust_nairobi <- stri_split_lines(kenya_census[[410]])[[1]]\n\n# Remove blank lines\njust_nairobi <- just_nairobi[just_nairobi != \"\"]\n\n# Remove titles, headings and other content at the top of the page\njust_nairobi <- just_nairobi[5:length(just_nairobi)]\n\n# Remove page numbers and other content at the bottom of the page\njust_nairobi <- just_nairobi[1:62]\n\n# Convert into a tibble\ndemography_data <- tibble(all = just_nairobi)\n\nAt this point the data are in a tibble. This allows us to use our familiar dplyr verbs. In particular we want to separate the columns.\n\ndemography_data <-\n demography_data |>\n mutate(all = str_squish(all)) |>\n mutate(all = str_replace(all, \"10 -14\", \"10-14\")) |>\n mutate(all = str_replace(all, \"Not Stated\", \"NotStated\")) |>\n # Deal with the two column set-up\n separate(\n col = all,\n into = c(\n \"age\", \"male\", \"female\", \"total\",\n \"age_2\", \"male_2\", \"female_2\", \"total_2\"\n ),\n sep = \" \",\n remove = TRUE,\n fill = \"right\",\n extra = \"drop\"\n )\n\nThey are side by side at the moment. 
We need to instead append to the bottom.\n\ndemography_data_long <-\n rbind(\n demography_data |> select(age, male, female, total),\n demography_data |>\n select(age_2, male_2, female_2, total_2) |>\n rename(\n age = age_2,\n male = male_2,\n female = female_2,\n total = total_2\n )\n )\n\n\n# There is one row of NAs, so remove it\ndemography_data_long <-\n demography_data_long |>\n remove_empty(which = c(\"rows\"))\n\ndemography_data_long\n\n# A tibble: 123 × 4\n age male female total \n <chr> <chr> <chr> <chr> \n 1 Total 2,192,452 2,204,376 4,396,828\n 2 0 57,265 56,523 113,788 \n 3 1 56,019 54,601 110,620 \n 4 2 52,518 51,848 104,366 \n 5 3 51,115 51,027 102,142 \n 6 4 47,182 46,889 94,071 \n 7 0-4 264,099 260,888 524,987 \n 8 5 45,203 44,711 89,914 \n 9 6 43,635 44,226 87,861 \n10 7 43,507 43,655 87,162 \n# ℹ 113 more rows\n\n\nHaving got it into a rectangular format, we now need to clean the dataset to make it useful.\n\n\n9.7.1.2 Validity\nTo attain validity requires a number of steps. The first step is to make the numbers into actual numbers, rather than characters. Before we can convert the type, we need to remove anything that is not a number otherwise that cell will be converted into an NA. We first identify any values that are not numbers so that we can remove them, and distinct() is especially useful.\n\ndemography_data_long |>\n select(male, female, total) |>\n mutate(across(everything(), ~ str_remove_all(., \"[:digit:]\"))) |>\n distinct()\n\n# A tibble: 5 × 3\n male female total\n <chr> <chr> <chr>\n1 \",,\" \",,\" \",,\" \n2 \",\" \",\" \",\" \n3 \"\" \",\" \",\" \n4 \"\" \"\" \",\" \n5 \"\" \"\" \"\" \n\n\nWe need to remove commas. While we could use janitor here, it is worthwhile to at least first look at what is going on because sometimes there is odd stuff that janitor (and other packages) will not deal with in a way that we want. 
Nonetheless, having identified everything that needs to be removed, we can do the actual removal and convert our character column of numbers to integers.\n\ndemography_data_long <-\n demography_data_long |>\n mutate(across(c(male, female, total), ~ str_remove_all(., \",\"))) |>\n mutate(across(c(male, female, total), ~ as.integer(.)))\n\ndemography_data_long\n\n# A tibble: 123 × 4\n age male female total\n <chr> <int> <int> <int>\n 1 Total 2192452 2204376 4396828\n 2 0 57265 56523 113788\n 3 1 56019 54601 110620\n 4 2 52518 51848 104366\n 5 3 51115 51027 102142\n 6 4 47182 46889 94071\n 7 0-4 264099 260888 524987\n 8 5 45203 44711 89914\n 9 6 43635 44226 87861\n10 7 43507 43655 87162\n# ℹ 113 more rows\n\n\n\n\n9.7.1.3 Internal consistency\n\nThe census has done some of the work of putting together age-groups for us, but we want to make it easy to just focus on the counts by single-year age. As such we will add a flag as to the type of age it is: an age-group, such as “ages 0 to 5”, or a single age, such as “1”.\n\ndemography_data_long <-\n demography_data_long |>\n mutate(\n age_type = if_else(str_detect(age, \"-\"), \n \"age-group\", \n \"single-year\"),\n age_type = if_else(str_detect(age, \"Total\"), \n \"age-group\", \n age_type)\n )\n\nAt the moment, age is a character variable. We have a decision to make here. We do not want it to be a character variable (because it will not graph properly), but we do not want it to be numeric, because there is total and 100+ in there. For now, we will just make it into a factor, and at least that will be able to be nicely graphed.\n\ndemography_data_long <-\n demography_data_long |>\n mutate(\n age = as_factor(age)\n )\n\n\n\n\n9.7.2 Check and test\nHaving gathered and cleaned the data, we would like to run a few checks. 
Given the format of the data, we can check that “total” is the sum of “male” and “female”, which are the only two gender categories available.\n\ndemography_data_long |>\n mutate(\n check_sum = male + female,\n totals_match = if_else(total == check_sum, 1, 0)\n ) |>\n filter(totals_match == 0)\n\n# A tibble: 0 × 7\n# ℹ 7 variables: age <fct>, male <int>, female <int>, total <int>,\n# age_type <chr>, check_sum <int>, totals_match <dbl>\n\n\nFinally, we want to check that the single-age counts sum to the age-groups.\n\ndemography_data_long |>\n mutate(age_groups = if_else(age_type == \"age-group\", \n age, \n NA_character_)) |>\n fill(age_groups, .direction = \"up\") |>\n mutate(\n group_sum = sum(total),\n group_sum = group_sum / 2,\n difference = total - group_sum,\n .by = c(age_groups)\n ) |>\n filter(age_type == \"age-group\" & age_groups != \"Total\") |> \n head()\n\n# A tibble: 6 × 8\n age male female total age_type age_groups group_sum difference\n <fct> <int> <int> <int> <chr> <chr> <dbl> <dbl>\n1 0-4 264099 260888 524987 age-group 0-4 524987 0\n2 5-9 215230 217482 432712 age-group 5-9 432712 0\n3 10-14 185008 193542 378550 age-group 10-14 378550 0\n4 15-19 159098 192755 351853 age-group 15-19 351853 0\n5 20-24 249534 313485 563019 age-group 20-24 563019 0\n6 25-29 282703 300845 583548 age-group 25-29 583548 0\n\n\n\n\n9.7.3 Tidy-up\nNow that we are reasonably confident that everything is looking good, we can convert it to tidy format. This will make it easier to work with.\n\ndemography_data_tidy <-\n demography_data_long |>\n rename_with(~paste0(., \"_total\"), male:total) |>\n pivot_longer(cols = contains(\"_total\"), \n names_to = \"type\", \n values_to = \"number\") |>\n separate(\n col = type,\n into = c(\"gender\", \"part_of_area\"),\n sep = \"_\"\n ) |>\n select(age, age_type, gender, number)\n\nThe original purpose of cleaning this dataset was to make a table that is used by Alexander and Alkema (2022). 
We will return to this dataset, but just to bring this all together, we may like to make a graph of single-year counts, by gender, for Nairobi (Figure 9.9).\n\ndemography_data_tidy |>\n filter(age_type == \"single-year\") |>\n select(age, gender, number) |>\n filter(gender != \"total\") |>\n ggplot(aes(x = age, y = number, fill = gender)) +\n geom_col(aes(x = age, y = number, fill = gender), \n position = \"dodge\") +\n scale_y_continuous(labels = comma) +\n scale_x_discrete(breaks = c(seq(from = 0, to = 99, by = 5), \"100+\")) +\n theme_classic() +\n scale_fill_brewer(palette = \"Set1\") +\n labs(\n y = \"Number\",\n x = \"Age\",\n fill = \"Gender\",\n caption = \"Data source: 2019 Kenya Census\"\n ) +\n theme(legend.position = \"bottom\") +\n coord_flip()\n\n\n\n\n\n\n\nFigure 9.9: Distribution of age and gender in Nairobi in 2019, based on Kenyan census\n\n\n\n\n\nA variety of features are clear from Figure 9.9, including age-heaping, a slight difference in the ratio of male-female birth, and a substantial difference between ages 15 and 25.\nFinally, we may wish to use more informative names. For instance, in the Kenyan data example earlier we have the following column names: “area”, “age”, “gender”, and “number”. 
If we were to use our column names as contracts, then these could be: “chr_area”, “fctr_group_age”, “chr_group_gender”, and “int_group_count”.\n\ncolumn_names_as_contracts <-\n demography_data_tidy |>\n filter(age_type == \"single-year\") |>\n select(age, gender, number) |>\n rename(\n \"fctr_group_age\" = \"age\",\n \"chr_group_gender\" = \"gender\",\n \"int_group_count\" = \"number\"\n )\n\nWe can then use pointblank to set-up tests for us.\n\nagent <-\n create_agent(tbl = column_names_as_contracts) |>\n col_is_character(columns = vars(chr_group_gender)) |>\n col_is_factor(columns = vars(fctr_group_age)) |>\n col_is_integer(columns = vars(int_group_count)) |>\n col_vals_in_set(\n columns = chr_group_gender,\n set = c(\"male\", \"female\", \"total\")\n ) |>\n interrogate()\n\nagent\n\n\n\n\n\n\n\n\nPointblank Validation\n\n\n\n\n[2024-10-28|20:38:53]\n\n\ntibble column_names_as_contracts\n\n\n\n\n\n\nSTEP\nCOLUMNS\nVALUES\nTBL\nEVAL\nUNITS\nPASS\nFAIL\nW\nS\nN\nEXT\n\n\n\n\n\n\n1\n\n\n\n\ncol_is_character\n\n \n\n\n col_is_character()\n\n\n▮chr_group_gender\n\n—\n\n\n \n\n\n✓\n1\n1\n1\n0\n0\n—\n—\n—\n—\n\n\n\n\n2\n\n\n\n\ncol_is_factor\n\n \n\n\n col_is_factor()\n\n\n▮fctr_group_age\n\n—\n\n\n \n\n\n✓\n1\n1\n1\n0\n0\n—\n—\n—\n—\n\n\n\n\n3\n\n\n\n\ncol_is_integer\n\n \n\n\n col_is_integer()\n\n\n▮int_group_count\n\n—\n\n\n \n\n\n✓\n1\n1\n1\n0\n0\n—\n—\n—\n—\n\n\n\n\n4\n\n\n\n\ncol_vals_in_set\n\n \n\n\n col_vals_in_set()\n\n\n▮chr_group_gender\n\n\nmale, female, total\n\n\n\n \n\n\n✓\n306\n306\n1\n0\n0\n—\n—\n—\n—\n\n\n\n2024-10-28 20:38:53 EDT < 1 s 2024-10-28 20:38:53 EDT",
+ "text": "9.7 2019 Kenyan census\nAs a final example, let us consider a more extensive situation and gather, clean, and prepare some data from the 2019 Kenyan census. We will focus on creating a dataset of single-year counts, by gender, for Nairobi.\nThe distribution of population by age, sex, and administrative unit from the 2019 Kenyan census can be downloaded here. While this format as a PDF makes it easy to look up a particular result, it is not overly useful if we want to model the data. In order to be able to do that, we need to convert this PDF into a tidy dataset that can be analyzed.\n\n9.7.1 Gather and clean\nWe first need to download and read in the PDF of the 2019 Kenyan census.4\n\ncensus_url <-\n paste0(\n \"https://www.knbs.or.ke/download/2019-kenya-population-and-\",\n \"housing-census-volume-iii-distribution-of-population-by-age-\",\n \"sex-and-administrative-units/?wpdmdl=5729&refresh=\",\n \"620561f1ce3ad1644519921\"\n )\n\ndownload.file(\n url = census_url,\n destfile = \"2019_Kenya_census.pdf\",\n mode = \"wb\"\n)\n\nWe can use pdf_text() from pdftools again here.\n\nkenya_census <-\n pdf_text(\n pdf = \"2019_Kenya_census.pdf\"\n )\n\nIn this example we will focus on the page of the PDF about Nairobi (Figure 9.8).\n\n\n\n\n\n\nFigure 9.8: Page from the 2019 Kenyan census about Nairobi\n\n\n\n\n9.7.1.1 Make rectangular\nThe first challenge is to get the dataset into a format that we can more easily manipulate. We will extract the relevant parts of the page. 
In this case, data about Nairobi is on page 410.\n\n# Focus on the page of interest\njust_nairobi <- stri_split_lines(kenya_census[[410]])[[1]]\n\n# Remove blank lines\njust_nairobi <- just_nairobi[just_nairobi != \"\"]\n\n# Remove titles, headings and other content at the top of the page\njust_nairobi <- just_nairobi[5:length(just_nairobi)]\n\n# Remove page numbers and other content at the bottom of the page\njust_nairobi <- just_nairobi[1:62]\n\n# Convert into a tibble\ndemography_data <- tibble(all = just_nairobi)\n\nAt this point the data are in a tibble. This allows us to use our familiar dplyr verbs. In particular we want to separate the columns.\n\ndemography_data <-\n demography_data |>\n mutate(all = str_squish(all)) |>\n mutate(all = str_replace(all, \"10 -14\", \"10-14\")) |>\n mutate(all = str_replace(all, \"Not Stated\", \"NotStated\")) |>\n # Deal with the two column set-up\n separate(\n col = all,\n into = c(\n \"age\", \"male\", \"female\", \"total\",\n \"age_2\", \"male_2\", \"female_2\", \"total_2\"\n ),\n sep = \" \",\n remove = TRUE,\n fill = \"right\",\n extra = \"drop\"\n )\n\nThey are side by side at the moment. 
We need to instead append to the bottom.\n\ndemography_data_long <-\n rbind(\n demography_data |> select(age, male, female, total),\n demography_data |>\n select(age_2, male_2, female_2, total_2) |>\n rename(\n age = age_2,\n male = male_2,\n female = female_2,\n total = total_2\n )\n )\n\n\n# There is one row of NAs, so remove it\ndemography_data_long <-\n demography_data_long |>\n remove_empty(which = c(\"rows\"))\n\ndemography_data_long\n\n# A tibble: 123 × 4\n age male female total \n <chr> <chr> <chr> <chr> \n 1 Total 2,192,452 2,204,376 4,396,828\n 2 0 57,265 56,523 113,788 \n 3 1 56,019 54,601 110,620 \n 4 2 52,518 51,848 104,366 \n 5 3 51,115 51,027 102,142 \n 6 4 47,182 46,889 94,071 \n 7 0-4 264,099 260,888 524,987 \n 8 5 45,203 44,711 89,914 \n 9 6 43,635 44,226 87,861 \n10 7 43,507 43,655 87,162 \n# ℹ 113 more rows\n\n\nHaving got it into a rectangular format, we now need to clean the dataset to make it useful.\n\n\n9.7.1.2 Validity\nTo attain validity requires a number of steps. The first step is to make the numbers into actual numbers, rather than characters. Before we can convert the type, we need to remove anything that is not a number otherwise that cell will be converted into an NA. We first identify any values that are not numbers so that we can remove them, and distinct() is especially useful.\n\ndemography_data_long |>\n select(male, female, total) |>\n mutate(across(everything(), ~ str_remove_all(., \"[:digit:]\"))) |>\n distinct()\n\n# A tibble: 5 × 3\n male female total\n <chr> <chr> <chr>\n1 \",,\" \",,\" \",,\" \n2 \",\" \",\" \",\" \n3 \"\" \",\" \",\" \n4 \"\" \"\" \",\" \n5 \"\" \"\" \"\" \n\n\nWe need to remove commas. While we could use janitor here, it is worthwhile to at least first look at what is going on because sometimes there is odd stuff that janitor (and other packages) will not deal with in a way that we want. 
Nonetheless, having identified everything that needs to be removed, we can do the actual removal and convert our character column of numbers to integers.\n\ndemography_data_long <-\n demography_data_long |>\n mutate(across(c(male, female, total), ~ str_remove_all(., \",\"))) |>\n mutate(across(c(male, female, total), ~ as.integer(.)))\n\ndemography_data_long\n\n# A tibble: 123 × 4\n age male female total\n <chr> <int> <int> <int>\n 1 Total 2192452 2204376 4396828\n 2 0 57265 56523 113788\n 3 1 56019 54601 110620\n 4 2 52518 51848 104366\n 5 3 51115 51027 102142\n 6 4 47182 46889 94071\n 7 0-4 264099 260888 524987\n 8 5 45203 44711 89914\n 9 6 43635 44226 87861\n10 7 43507 43655 87162\n# ℹ 113 more rows\n\n\n\n\n9.7.1.3 Internal consistency\n\nThe census has done some of the work of putting together age-groups for us, but we want to make it easy to just focus on the counts by single-year age. As such we will add a flag as to the type of age it is: an age-group, such as “ages 0 to 5”, or a single age, such as “1”.\n\ndemography_data_long <-\n demography_data_long |>\n mutate(\n age_type = if_else(str_detect(age, \"-\"), \n \"age-group\", \n \"single-year\"),\n age_type = if_else(str_detect(age, \"Total\"), \n \"age-group\", \n age_type)\n )\n\nAt the moment, age is a character variable. We have a decision to make here. We do not want it to be a character variable (because it will not graph properly), but we do not want it to be numeric, because there is total and 100+ in there. For now, we will just make it into a factor, and at least that will be able to be nicely graphed.\n\ndemography_data_long <-\n demography_data_long |>\n mutate(\n age = as_factor(age)\n )\n\n\n\n\n9.7.2 Check and test\nHaving gathered and cleaned the data, we would like to run a few checks. 
Given the format of the data, we can check that “total” is the sum of “male” and “female”, which are the only two gender categories available.\n\ndemography_data_long |>\n mutate(\n check_sum = male + female,\n totals_match = if_else(total == check_sum, 1, 0)\n ) |>\n filter(totals_match == 0)\n\n# A tibble: 0 × 7\n# ℹ 7 variables: age <fct>, male <int>, female <int>, total <int>,\n# age_type <chr>, check_sum <int>, totals_match <dbl>\n\n\nFinally, we want to check that the single-age counts sum to the age-groups.\n\ndemography_data_long |>\n mutate(age_groups = if_else(age_type == \"age-group\", \n age, \n NA_character_)) |>\n fill(age_groups, .direction = \"up\") |>\n mutate(\n group_sum = sum(total),\n group_sum = group_sum / 2,\n difference = total - group_sum,\n .by = c(age_groups)\n ) |>\n filter(age_type == \"age-group\" & age_groups != \"Total\") |> \n head()\n\n# A tibble: 6 × 8\n age male female total age_type age_groups group_sum difference\n <fct> <int> <int> <int> <chr> <chr> <dbl> <dbl>\n1 0-4 264099 260888 524987 age-group 0-4 524987 0\n2 5-9 215230 217482 432712 age-group 5-9 432712 0\n3 10-14 185008 193542 378550 age-group 10-14 378550 0\n4 15-19 159098 192755 351853 age-group 15-19 351853 0\n5 20-24 249534 313485 563019 age-group 20-24 563019 0\n6 25-29 282703 300845 583548 age-group 25-29 583548 0\n\n\n\n\n9.7.3 Tidy-up\nNow that we are reasonably confident that everything is looking good, we can convert it to tidy format. This will make it easier to work with.\n\ndemography_data_tidy <-\n demography_data_long |>\n rename_with(~paste0(., \"_total\"), male:total) |>\n pivot_longer(cols = contains(\"_total\"), \n names_to = \"type\", \n values_to = \"number\") |>\n separate(\n col = type,\n into = c(\"gender\", \"part_of_area\"),\n sep = \"_\"\n ) |>\n select(age, age_type, gender, number)\n\nThe original purpose of cleaning this dataset was to make a table that is used by Alexander and Alkema (2022). 
We will return to this dataset, but just to bring this all together, we may like to make a graph of single-year counts, by gender, for Nairobi (Figure 9.9).\n\ndemography_data_tidy |>\n filter(age_type == \"single-year\") |>\n select(age, gender, number) |>\n filter(gender != \"total\") |>\n ggplot(aes(x = age, y = number, fill = gender)) +\n geom_col(aes(x = age, y = number, fill = gender), \n position = \"dodge\") +\n scale_y_continuous(labels = comma) +\n scale_x_discrete(breaks = c(seq(from = 0, to = 99, by = 5), \"100+\")) +\n theme_classic() +\n scale_fill_brewer(palette = \"Set1\") +\n labs(\n y = \"Number\",\n x = \"Age\",\n fill = \"Gender\",\n caption = \"Data source: 2019 Kenya Census\"\n ) +\n theme(legend.position = \"bottom\") +\n coord_flip()\n\n\n\n\n\n\n\nFigure 9.9: Distribution of age and gender in Nairobi in 2019, based on Kenyan census\n\n\n\n\n\nA variety of features are clear from Figure 9.9, including age-heaping, a slight difference in the ratio of male-female birth, and a substantial difference between ages 15 and 25.\nFinally, we may wish to use more informative names. For instance, in the Kenyan data example earlier we have the following column names: “area”, “age”, “gender”, and “number”. 
If we were to use our column names as contracts, then these could be: “chr_area”, “fctr_group_age”, “chr_group_gender”, and “int_group_count”.\n\ncolumn_names_as_contracts <-\n demography_data_tidy |>\n filter(age_type == \"single-year\") |>\n select(age, gender, number) |>\n rename(\n \"fctr_group_age\" = \"age\",\n \"chr_group_gender\" = \"gender\",\n \"int_group_count\" = \"number\"\n )\n\nWe can then use pointblank to set-up tests for us.\n\nagent <-\n create_agent(tbl = column_names_as_contracts) |>\n col_is_character(columns = vars(chr_group_gender)) |>\n col_is_factor(columns = vars(fctr_group_age)) |>\n col_is_integer(columns = vars(int_group_count)) |>\n col_vals_in_set(\n columns = chr_group_gender,\n set = c(\"male\", \"female\", \"total\")\n ) |>\n interrogate()\n\nagent\n\n\n\n\n\n\n\n\nPointblank Validation\n\n\n\n\n[2024-10-29|22:11:58]\n\n\ntibble column_names_as_contracts\n\n\n\n\n\n\nSTEP\nCOLUMNS\nVALUES\nTBL\nEVAL\nUNITS\nPASS\nFAIL\nW\nS\nN\nEXT\n\n\n\n\n\n\n1\n\n\n\n\ncol_is_character\n\n \n\n\n col_is_character()\n\n\n▮chr_group_gender\n\n—\n\n\n \n\n\n✓\n1\n1\n1\n0\n0\n—\n—\n—\n—\n\n\n\n\n2\n\n\n\n\ncol_is_factor\n\n \n\n\n col_is_factor()\n\n\n▮fctr_group_age\n\n—\n\n\n \n\n\n✓\n1\n1\n1\n0\n0\n—\n—\n—\n—\n\n\n\n\n3\n\n\n\n\ncol_is_integer\n\n \n\n\n col_is_integer()\n\n\n▮int_group_count\n\n—\n\n\n \n\n\n✓\n1\n1\n1\n0\n0\n—\n—\n—\n—\n\n\n\n\n4\n\n\n\n\ncol_vals_in_set\n\n \n\n\n col_vals_in_set()\n\n\n▮chr_group_gender\n\n\nmale, female, total\n\n\n\n \n\n\n✓\n306\n306\n1\n0\n0\n—\n—\n—\n—\n\n\n\n2024-10-29 22:11:58 EDT < 1 s 2024-10-29 22:11:58 EDT",
"crumbs": [
"Preparation",
"9 Clean and prepare"
@@ -951,7 +951,7 @@
"href": "11-eda.html#united-states-population-and-income-data",
"title": "11 Exploratory data analysis",
"section": "11.2 1975 United States population and income data",
- "text": "11.2 1975 United States population and income data\nAs a first example we consider US state populations as of 1975. This dataset is built into R with state.x77. Here is what the dataset looks like:\n\nus_populations <-\n state.x77 |>\n as_tibble() |>\n clean_names() |>\n mutate(state = rownames(state.x77)) |>\n select(state, population, income)\n\nus_populations\n\n# A tibble: 50 × 3\n state population income\n <chr> <dbl> <dbl>\n 1 Alabama 3615 3624\n 2 Alaska 365 6315\n 3 Arizona 2212 4530\n 4 Arkansas 2110 3378\n 5 California 21198 5114\n 6 Colorado 2541 4884\n 7 Connecticut 3100 5348\n 8 Delaware 579 4809\n 9 Florida 8277 4815\n10 Georgia 4931 4091\n# ℹ 40 more rows\n\n\nWe want to get a quick sense of the data. The first step is to have a look at the top and bottom of it with head() and tail(), then a random selection, and finally to focus on the variables and their class with glimpse(). The random selection is an important aspect, and when you use head() you should also quickly consider a random selection.\n\nus_populations |>\n head()\n\n# A tibble: 6 × 3\n state population income\n <chr> <dbl> <dbl>\n1 Alabama 3615 3624\n2 Alaska 365 6315\n3 Arizona 2212 4530\n4 Arkansas 2110 3378\n5 California 21198 5114\n6 Colorado 2541 4884\n\nus_populations |>\n tail()\n\n# A tibble: 6 × 3\n state population income\n <chr> <dbl> <dbl>\n1 Vermont 472 3907\n2 Virginia 4981 4701\n3 Washington 3559 4864\n4 West Virginia 1799 3617\n5 Wisconsin 4589 4468\n6 Wyoming 376 4566\n\nus_populations |>\n slice_sample(n = 6)\n\n# A tibble: 6 × 3\n state population income\n <chr> <dbl> <dbl>\n1 Texas 12237 4188\n2 Tennessee 4173 3821\n3 Colorado 2541 4884\n4 Oklahoma 2715 3983\n5 Connecticut 3100 5348\n6 Florida 8277 4815\n\nus_populations |>\n glimpse()\n\nRows: 50\nColumns: 3\n$ state <chr> \"Alabama\", \"Alaska\", \"Arizona\", \"Arkansas\", \"California\", \"…\n$ population <dbl> 3615, 365, 2212, 2110, 21198, 2541, 3100, 579, 8277, 4931, …\n$ income <dbl> 3624, 6315, 
4530, 3378, 5114, 4884, 5348, 4809, 4815, 4091,…\n\n\nWe are then interested in understanding key summary statistics, such as the minimum, median, and maximum values for numeric variables with summary() from base R and the number of observations.\n\nus_populations |>\n summary()\n\n state population income \n Length:50 Min. : 365 Min. :3098 \n Class :character 1st Qu.: 1080 1st Qu.:3993 \n Mode :character Median : 2838 Median :4519 \n Mean : 4246 Mean :4436 \n 3rd Qu.: 4968 3rd Qu.:4814 \n Max. :21198 Max. :6315 \n\n\nFinally, it is especially important to understand the behavior of these key summary statistics at the limits. In particular, one approach is to randomly remove some observations and compare what happens to them. For instance, we can randomly create five datasets that differ on the basis of which observations were removed. We can then compare the summary statistics. If any of them are especially different, then we would want to look at the observations that were removed as they may contain observations with high influence.\n\nsample_means <- tibble(seed = c(), mean = c(), states_ignored = c())\n\nfor (i in c(1:5)) {\n set.seed(i)\n dont_get <- c(sample(x = state.name, size = 5))\n sample_means <-\n sample_means |>\n rbind(tibble(\n seed = i,\n mean =\n us_populations |>\n filter(!state %in% dont_get) |>\n summarise(mean = mean(population)) |>\n pull(),\n states_ignored = str_c(dont_get, collapse = \", \")\n ))\n}\n\nsample_means |>\n kable(\n col.names = c(\"Seed\", \"Mean\", \"Ignored states\"),\n digits = 0,\n format.args = list(big.mark = \",\"),\n booktabs = TRUE\n )\n\n\n\nTable 11.1: Comparing the mean population when different states are randomly removed\n\n\n\n\n\n\n\n\n\n\n\nSeed\nMean\nIgnored states\n\n\n\n\n1\n4,469\nArkansas, Rhode Island, Alabama, North Dakota, Minnesota\n\n\n2\n4,027\nMassachusetts, Iowa, Colorado, West Virginia, New York\n\n\n3\n4,086\nCalifornia, Idaho, Rhode Island, Oklahoma, South Carolina\n\n\n4\n4,391\nHawaii, 
Arizona, Connecticut, Utah, New Jersey\n\n\n5\n4,340\nAlaska, Texas, Iowa, Hawaii, South Dakota\n\n\n\n\n\n\n\n\nIn the case of the populations of US states, we know that larger states, such as California and New York, will have an out sized effect on our estimate of the mean. Table 11.1 supports that, as we can see that when we use seeds 2 and 3, there is a lower mean.",
+ "text": "11.2 1975 United States population and income data\nAs a first example we consider US state populations as of 1975. This dataset is built into R with state.x77. Here is what the dataset looks like:\n\nus_populations <-\n state.x77 |>\n as_tibble() |>\n clean_names() |>\n mutate(state = rownames(state.x77)) |>\n select(state, population, income)\n\nus_populations\n\n# A tibble: 50 × 3\n state population income\n <chr> <dbl> <dbl>\n 1 Alabama 3615 3624\n 2 Alaska 365 6315\n 3 Arizona 2212 4530\n 4 Arkansas 2110 3378\n 5 California 21198 5114\n 6 Colorado 2541 4884\n 7 Connecticut 3100 5348\n 8 Delaware 579 4809\n 9 Florida 8277 4815\n10 Georgia 4931 4091\n# ℹ 40 more rows\n\n\nWe want to get a quick sense of the data. The first step is to have a look at the top and bottom of it with head() and tail(), then a random selection, and finally to focus on the variables and their class with glimpse(). The random selection is an important aspect, and when you use head() you should also quickly consider a random selection.\n\nus_populations |>\n head()\n\n# A tibble: 6 × 3\n state population income\n <chr> <dbl> <dbl>\n1 Alabama 3615 3624\n2 Alaska 365 6315\n3 Arizona 2212 4530\n4 Arkansas 2110 3378\n5 California 21198 5114\n6 Colorado 2541 4884\n\nus_populations |>\n tail()\n\n# A tibble: 6 × 3\n state population income\n <chr> <dbl> <dbl>\n1 Vermont 472 3907\n2 Virginia 4981 4701\n3 Washington 3559 4864\n4 West Virginia 1799 3617\n5 Wisconsin 4589 4468\n6 Wyoming 376 4566\n\nus_populations |>\n slice_sample(n = 6)\n\n# A tibble: 6 × 3\n state population income\n <chr> <dbl> <dbl>\n1 North Carolina 5441 3875\n2 Pennsylvania 11860 4449\n3 Maine 1058 3694\n4 Vermont 472 3907\n5 Wyoming 376 4566\n6 New Jersey 7333 5237\n\nus_populations |>\n glimpse()\n\nRows: 50\nColumns: 3\n$ state <chr> \"Alabama\", \"Alaska\", \"Arizona\", \"Arkansas\", \"California\", \"…\n$ population <dbl> 3615, 365, 2212, 2110, 21198, 2541, 3100, 579, 8277, 4931, …\n$ income <dbl> 3624, 6315, 
4530, 3378, 5114, 4884, 5348, 4809, 4815, 4091,…\n\n\nWe are then interested in understanding key summary statistics, such as the minimum, median, and maximum values for numeric variables with summary() from base R and the number of observations.\n\nus_populations |>\n summary()\n\n state population income \n Length:50 Min. : 365 Min. :3098 \n Class :character 1st Qu.: 1080 1st Qu.:3993 \n Mode :character Median : 2838 Median :4519 \n Mean : 4246 Mean :4436 \n 3rd Qu.: 4968 3rd Qu.:4814 \n Max. :21198 Max. :6315 \n\n\nFinally, it is especially important to understand the behavior of these key summary statistics at the limits. In particular, one approach is to randomly remove some observations and compare what happens to them. For instance, we can randomly create five datasets that differ on the basis of which observations were removed. We can then compare the summary statistics. If any of them are especially different, then we would want to look at the observations that were removed as they may contain observations with high influence.\n\nsample_means <- tibble(seed = c(), mean = c(), states_ignored = c())\n\nfor (i in c(1:5)) {\n set.seed(i)\n dont_get <- c(sample(x = state.name, size = 5))\n sample_means <-\n sample_means |>\n rbind(tibble(\n seed = i,\n mean =\n us_populations |>\n filter(!state %in% dont_get) |>\n summarise(mean = mean(population)) |>\n pull(),\n states_ignored = str_c(dont_get, collapse = \", \")\n ))\n}\n\nsample_means |>\n kable(\n col.names = c(\"Seed\", \"Mean\", \"Ignored states\"),\n digits = 0,\n format.args = list(big.mark = \",\"),\n booktabs = TRUE\n )\n\n\n\nTable 11.1: Comparing the mean population when different states are randomly removed\n\n\n\n\n\n\n\n\n\n\n\nSeed\nMean\nIgnored states\n\n\n\n\n1\n4,469\nArkansas, Rhode Island, Alabama, North Dakota, Minnesota\n\n\n2\n4,027\nMassachusetts, Iowa, Colorado, West Virginia, New York\n\n\n3\n4,086\nCalifornia, Idaho, Rhode Island, Oklahoma, South Carolina\n\n\n4\n4,391\nHawaii, 
Arizona, Connecticut, Utah, New Jersey\n\n\n5\n4,340\nAlaska, Texas, Iowa, Hawaii, South Dakota\n\n\n\n\n\n\n\n\nIn the case of the populations of US states, we know that larger states, such as California and New York, will have an outsized effect on our estimate of the mean. Table 11.1 supports that: the mean is lower for seeds 2 and 3, which remove New York and California, respectively.",
"crumbs": [
"Modeling",
"11 Exploratory data analysis"
@@ -1919,7 +1919,7 @@
"href": "26-sql.html#exercises",
"title": "Online Appendix H — SQL essentials",
"section": "H.3 Exercises",
- "text": "H.3 Exercises\n\nPractice\nPlease submit a screenshot showing you got at least 70 per cent in the free w3school SQL Quiz. You may like to go through their tutorial, but the SQL content in this chapter (combined with your dplyr experience) is sufficient to get 70 per cent. Please include the time and date in the screenshot i.e. take a screenshot of your whole screen, not just the browser.\n\n\nQuiz\n\n\nActivity\nGet the SQL dataset from here: https://jacobfilipp.com/hammer/.\nUse SQL (not R or Python) to make some finding. Write a short paper using Quarto. In the discussion please have one sub-section each on of: 1) correlation vs. causation; 2) missing data; 3) sources of bias.\nSubmit a link to a GitHub repo that meets the general expectations.\n\n\n\n\nChamberlin, Donald. 2012. “Early History of SQL.” IEEE Annals of the History of Computing 34 (4): 78–82. https://doi.org/10.1109/mahc.2012.61.\n\n\nR Special Interest Group on Databases (R-SIG-DB), Hadley Wickham, and Kirill Müller. 2022. DBI: R Database Interface. https://CRAN.R-project.org/package=DBI.\n\n\nRobinson, Emily, and Jacqueline Nolis. 2020. Build a Career in Data Science. Shelter Island: Manning Publications. https://livebook.manning.com/book/build-a-career-in-data-science.\n\n\nWickham, Hadley, Maximilian Girlich, and Edgar Ruiz. 2022. dbplyr: A “dplyr” Back End for Databases. https://CRAN.R-project.org/package=dbplyr.",
+ "text": "H.3 Exercises\n\nPractice\nPlease submit a screenshot showing you got at least 70 per cent in the free W3Schools SQL Quiz. You may like to go through their tutorial, but the SQL content in this chapter (combined with your dplyr experience) is sufficient to get 70 per cent. Please include the time and date in the screenshot, i.e., take a screenshot of your whole screen, not just the browser.\n\n\nQuiz\n\n\nActivity\nGet the SQL dataset from here: https://jacobfilipp.com/hammer/.\nUse SQL (not R or Python) to make a finding using this observational data. Write a short paper using Quarto (you are welcome to use R/Python to make graphs, but not for data preparation or manipulation, which should occur in SQL in a separate script). In the discussion please have one sub-section each on: 1) correlation vs. causation; 2) missing data; 3) sources of bias.\nSubmit a link to a GitHub repo (one repo per group) that meets the general expectations. The relevant components of the rubric are: “R/Python is cited”, “Data are cited”, “Class paper”, “LLM usage is documented”, “Title”, “Author, date, and repo”, “Abstract”, “Introduction”, “Data”, “Measurement”, “Results”, “Discussion”, “Prose”, “Cross-references”, “Captions”, “Graphs/tables/etc”, “Referencing”, “Commits”, “Sketches”, “Simulation”, “Tests”, and “Reproducible workflow”.\n\n\n\n\nChamberlin, Donald. 2012. “Early History of SQL.” IEEE Annals of the History of Computing 34 (4): 78–82. https://doi.org/10.1109/mahc.2012.61.\n\n\nR Special Interest Group on Databases (R-SIG-DB), Hadley Wickham, and Kirill Müller. 2022. DBI: R Database Interface. https://CRAN.R-project.org/package=DBI.\n\n\nRobinson, Emily, and Jacqueline Nolis. 2020. Build a Career in Data Science. Shelter Island: Manning Publications. https://livebook.manning.com/book/build-a-career-in-data-science.\n\n\nWickham, Hadley, Maximilian Girlich, and Edgar Ruiz. 2022. dbplyr: A “dplyr” Back End for Databases. https://CRAN.R-project.org/package=dbplyr.",
"crumbs": [
"Appendices",
"H SQL essentials"
@@ -2205,7 +2205,7 @@
"href": "99-references.html",
"title": "References",
"section": "",
- "text": "Abadie, Alberto, Susan Athey, Guido Imbens, and Jeffrey Wooldridge.\n2017. “When Should You Adjust Standard Errors for\nClustering?” Working Paper 24003. Working Paper Series. National\nBureau of Economic Research. https://doi.org/10.3386/w24003.\n\n\nAbelson, Harold, and Gerald Jay Sussman. 1996. Structure and\nInterpretation of Computer Programs. Cambridge: The MIT Press.\n\n\nAbeysooriya, Mandhri, Megan Soria, Mary Sravya Kasu, and Mark Ziemann.\n2021. “Gene Name Errors: Lessons Not Learned.” PLOS\nComputational Biology 17 (7): 1–13. https://doi.org/10.1371/journal.pcbi.1008984.\n\n\nAcemoglu, Daron, Simon Johnson, and James Robinson. 2001. “The\nColonial Origins of Comparative Development: An Empirical\nInvestigation.” American Economic Review 91\n(5): 1369–1401. https://doi.org/10.1257/aer.91.5.1369.\n\n\nAchen, Christopher. 1978. “Measuring Representation.”\nAmerican Journal of Political Science 22 (3): 475–510. https://doi.org/10.2307/2110458.\n\n\nAkerlof, George. 1970. “The Market for ‘Lemons’:\nQuality Uncertainty and the Market Mechanism.” The Quarterly\nJournal of Economics 84 (3): 488–500. https://doi.org/10.2307/1879431.\n\n\nAlexander, Monica. 2019a. “Reproducibility in Demographic\nResearch.” https://www.monicaalexander.com/posts/2019-10-20-reproducibility/.\n\n\n———. 2019b. “The Concentration and Uniqueness of Baby Names in\nAustralia and the US,” January. https://www.monicaalexander.com/posts/2019-20-01-babynames/.\n\n\n———. 2019c. “Analyzing Name Changes After Marriage Using a\nNon-Representative Survey,” August. https://www.monicaalexander.com/posts/2019-08-07-mrp/.\n\n\n———. 2021. “Overcoming Barriers to Sharing Code.”\nYouTube, February. https://youtu.be/yvM2C6aZ94k.\n\n\nAlexander, Monica, and Leontine Alkema. 2022. “A Bayesian Cohort Component Projection Model to Estimate\nWomen of Reproductive Age at the Subnational Level in Data-Sparse\nSettings.” Demography 59 (5): 1713–37. 
https://doi.org/10.1215/00703370-10216406.\n\n\nAlexander, Monica, Mathew Kiang, and Magali Barbieri. 2018.\n“Trends in Black and White Opioid Mortality in the United States,\n1979–2015.” Epidemiology 29 (5): 707–15. https://doi.org/10.1097/EDE.0000000000000858.\n\n\nAlexander, Rohan, and Monica Alexander. 2021. “The Increased\nEffect of Elections and Changing Prime Ministers on Topics Discussed in\nthe Australian Federal Parliament Between 1901 and 2018.” https://doi.org/10.48550/arXiv.2111.09299.\n\n\nAlexander, Rohan, and Paul Hodgetts. 2021.\nAustralianPoliticians: Provides Datasets About Australian\nPoliticians. https://CRAN.R-project.org/package=AustralianPoliticians.\n\n\nAlexander, Rohan, and A Mahfouz. 2021. heapsofpapers: Easily Download Heaps of PDF and CSV\nFiles. https://CRAN.R-project.org/package=heapsofpapers.\n\n\nAlexander, Rohan, and Zachary Ward. 2018. “Age at Arrival and\nAssimilation During the Age of Mass Migration.” The Journal\nof Economic History 78 (3): 904–37. https://doi.org/10.1017/S0022050718000335.\n\n\nAlexopoulos, Michelle, and Jon Cohen. 2015. “The power of print: Uncertainty shocks, markets, and the\neconomy.” International Review of Economics\n& Finance 40 (November): 8–28. https://doi.org/10.1016/j.iref.2015.02.002.\n\n\nAllen, Jeff. 2021. plumberDeploy: Plumber\nDeployment. https://CRAN.R-project.org/package=plumberDeploy.\n\n\nAlsan, Marcella, and Amy Finkelstein. 2021. “Beyond Causality:\nAdditional Benefits of Randomized Controlled Trials for Improving Health\nCare Delivery.” The Milbank Quarterly 99 (4): 864–81. https://doi.org/10.1111/1468-0009.12521.\n\n\nAlsan, Marcella, and Marianne Wanamaker. 2018. “Tuskegee and the\nHealth of Black Men.” The Quarterly Journal of Economics\n133 (1): 407–55. https://doi.org/10.1093/qje/qjx029.\n\n\nAltman, Douglas, and Martin Bland. 1995. “Statistics notes: The normal distribution.”\nBMJ 310 (6975): 298–98. https://doi.org/10.1136/bmj.310.6975.298.\n\n\nAmaka, Ofunne, and Amber Thomas. 
2021. “The Naked Truth: How the\nNames of 6,816 Complexion Products Can Reveal Bias in Beauty.”\nThe Pudding, March. https://pudding.cool/2021/03/foundation-names/.\n\n\nAmerican Medical Association and New York Academy of Medicine. 1848.\nCode of Medical Ethics. Academy of Medicine. https://hdl.handle.net/2027/chi.57108026.\n\n\nAndersen, Robert, and David Armstrong. 2021. Presenting Statistical\nResults Effectively. London: Sage.\n\n\nAnderson, Margo. (1988) 2015. The American Census: A Social\nHistory. 2nd ed. Yale University Press.\n\n\nAnderson, Margo, and Stephen Fienberg. 1999. Who Counts?: The Politics of Census-Taking in\nContemporary America. Russell Sage Foundation. http://www.jstor.org/stable/10.7758/9781610440059.\n\n\nAndrews, David, and Agnes Herzberg. 2012. Data: A Collection of\nProblems from Many Fields for the Student and Research Worker. New\nYork: Springer Science & Business Media.\n\n\nAngelucci, Charles, and Julia Cagé. 2019. “Newspapers in Times of\nLow Advertising Revenues.” American Economic Journal:\nMicroeconomics 11 (3): 319–64. https://doi.org/10.1257/mic.20170306.\n\n\nAngrist, Joshua, and Alan Krueger. 2001. “Instrumental Variables\nand the Search for Identification: From Supply and Demand to Natural\nExperiments.” Journal of Economic Perspectives 15 (4):\n69–85. https://doi.org/10.1257/jep.15.4.69.\n\n\nAngrist, Joshua, and Jörn-Steffen Pischke. 2010. “The Credibility\nRevolution in Empirical Economics: How Better Research Design Is Taking\nthe Con Out of Econometrics.” Journal of Economic\nPerspectives 24 (2): 3–30. https://doi.org/10.1257/jep.24.2.3.\n\n\nAnnas, George. 2003. “HIPAA Regulations: A New Era of\nMedical-Record Privacy?” New England Journal of Medicine\n348 (15): 1486–90. https://doi.org/10.1056/NEJMlim035027.\n\n\nAnsolabehere, Stephen, Brian Schaffner, and Sam Luks. 2021. “Guide to the 2020 Cooperative Election\nStudy.” https://doi.org/10.7910/DVN/E9N6PH.\n\n\nAprameya, Lavanya. 2020. 
“Improving Duolingo, One Experiment at a\nTime.” Duolingo Blog, January. https://blog.duolingo.com/improving-duolingo-one-experiment-at-a-time/.\n\n\nArel-Bundock, Vincent. 2021. WDI: World\nDevelopment Indicators and Other World Bank Data. https://CRAN.R-project.org/package=WDI.\n\n\n———. 2022. “modelsummary: Data and\nModel Summaries in R.” Journal of Statistical\nSoftware 103 (1): 1–23. https://doi.org/10.18637/jss.v103.i01.\n\n\n———. 2023. marginaleffects: Predictions,\nComparisons, Slopes, Marginal Means, and Hypothesis Tests.\nhttps://vincentarelbundock.github.io/marginaleffects/.\n\n\nArel-Bundock, Vincent, Ryan Briggs, Hristos Doucouliagos, Marco Mendoza\nAviña, and T. D. Stanley. 2022. “Quantitative Political Science\nResearch Is Greatly Underpowered.” https://osf.io/bzj9y/.\n\n\nArmstrong, Zan. 2022. “Stop Aggregating Away the Signal in Your\nData.” The Overflow, March. https://stackoverflow.blog/2022/03/03/stop-aggregating-away-the-signal-in-your-data/.\n\n\nArnold, Jeffrey. 2021. ggthemes: Extra Themes,\nScales and Geoms for “ggplot2”. https://CRAN.R-project.org/package=ggthemes.\n\n\nAsher, Sam, Tobias Lunt, Ryu Matsuura, and Paul Novosad. 2021.\n“Development Research at High Geographic Resolution: An Analysis\nof Night Lights, Firms, and Poverty in India Using the SHRUG Open Data\nPlatform.” World Bank Economic Review 35 (4). https://shrug-assets-ddl.s3.amazonaws.com/static/main/assets/other/almn-shrug.pdf.\n\n\nAthey, Susan, and Guido Imbens. 2017a. “The Econometrics of\nRandomized Experiments.” In Handbook of Field\nExperiments, 73–140. Elsevier. https://doi.org/10.1016/bs.hefe.2016.10.003.\n\n\n———. 2017b. “The State of Applied Econometrics: Causality and\nPolicy Evaluation.” Journal of Economic Perspectives 31\n(2): 3–32. https://doi.org/10.1257/jep.31.2.3.\n\n\nAthey, Susan, Guido Imbens, Jonas Metzger, and Evan Munro. 2021.\n“Using Wasserstein Generative Adversarial Networks for the Design\nof Monte Carlo Simulations.” Journal of Econometrics. 
https://doi.org/10.1016/j.jeconom.2020.09.013.\n\n\nAu, Randy. 2020. “Data Cleaning IS Analysis, Not Grunt\nWork,” September. https://counting.substack.com/p/data-cleaning-is-analysis-not-grunt.\n\n\n———. 2022. “Celebrating Everyone Counting Things,”\nFebruary. https://counting.substack.com/p/celebrating-everyone-counting-things.\n\n\nBååth, Rasmus. 2018. beepr: Easily Play\nNotification Sounds on any Platform. https://CRAN.R-project.org/package=beepr.\n\n\nBache, Stefan Milton, and Hadley Wickham. 2022. magrittr: A Forward-Pipe Operator for R. https://CRAN.R-project.org/package=magrittr.\n\n\nBackus, John. 1981. “The History of FORTRAN\nI, II, and III.” In History of Programming\nLanguages, edited by Richard Wexelblat, 25–74. Academic Press.\n\n\nBailey, Rosemary. 2008. Design of Comparative Experiments.\nCambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511611483.\n\n\nBaio, Gianluca, and Marta Blangiardo. 2010. “Bayesian Hierarchical\nModel for the Prediction of Football Results.” Journal of\nApplied Statistics 37 (2): 253–64. https://doi.org/10.1080/02664760802684177.\n\n\nBaker, Dominique. 2023. “Scams Will Not Save Us (Tuition\nDollars),” February. http://www.dominiquebaker.com/blog/2023/2/16/scams-will-not-save-us-tuition-dollars.\n\n\nBaker, Reg, Michael Brick, Nancy Bates, Mike Battaglia, Mick Couper,\nJill Dever, Krista Gile, and Roger Tourangeau. 2013. “Summary Report of the AAPOR Task Force on Non-Probability\nSampling.” Journal of Survey Statistics and\nMethodology 1 (2): 90–143. https://doi.org/10.1093/jssam/smt008.\n\n\nBandy, John, and Nicholas Vincent. 2021. “Addressing\n‘Documentation Debt’ in Machine Learning: A Retrospective\nDatasheet for BookCorpus.” In Proceedings of the Neural\nInformation Processing Systems Track on Datasets and Benchmarks,\nedited by J. Vanschoren and S. Yeung. Vol. 1. 
https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/54229abfcfa5649e7003b83dd4755294-Paper-round1.pdf.\n\n\nBanerjee, Abhijit, and Esther Duflo. 2011. Poor Economics: A Radical\nRethinking of the Way to Fight Global Poverty. New York:\nPublicAffairs.\n\n\nBanerjee, Abhijit, Esther Duflo, Rachel Glennerster, and Cynthia Kinnan.\n2015. “The Miracle of Microfinance? Evidence from a Randomized\nEvaluation.” American Economic Journal: Applied\nEconomics 7 (1): 22–53. https://doi.org/10.1257/app.20130533.\n\n\nBanes, Graham, Emily Fountain, Alyssa Karklus, Robert Fulton, Lucinda\nAntonacci-Fulton, and Joanne Nelson. 2022. “Nine out of ten samples were mistakenly switched by The\nOrang-utan Genome Consortium.” Scientific Data 9\n(1). https://doi.org/10.1038/s41597-022-01602-0.\n\n\nBarba, Lorena. 2018. “Terminologies for Reproducible\nResearch.” https://arxiv.org/abs/1802.03311.\n\n\nBarrett, Malcolm. 2021a. Data Science as an Atomic Habit. https://malco.io/articles/2021-01-04-data-science-as-an-atomic-habit.\n\n\n———. 2021b. ggdag: Analyze and Create Elegant\nDirected Acyclic Graphs. https://CRAN.R-project.org/package=ggdag.\n\n\nBarron, Alexander, Jenny Huang, Rebecca Spang, and Simon DeDeo. 2018.\n“Individuals, Institutions, and Innovation in the Debates of the\nFrench Revolution.” Proceedings of the National Academy of\nSciences 115 (18): 4607–12. https://doi.org/10.1073/pnas.1717729115.\n\n\nBaumer, Benjamin, Daniel Kaplan, and Nicholas Horton. 2021.\nModern Data Science With R. 2nd ed. Chapman;\nHall/CRC. https://mdsr-book.github.io/mdsr2e/.\n\n\nBaumgartner, Jason, Savvas Zannettou, Brian Keegan, Megan Squire, and\nJeremy Blackburn. 2020. “The Pushshift Reddit Dataset.”\narXiv. https://doi.org/10.48550/arxiv.2001.08435.\n\n\nBaumgartner, Peter. 2021. “Ways I Use Testing\nas a Data Scientist,” December. https://www.peterbaumgartner.com/blog/testing-for-data-science/.\n\n\nBeaumont, Jean-Francois. 2020. 
“Are Probability Surveys Bound to\nDisappear for the Production of Official Statistics?” Survey\nMethodology 46 (1): 1–29.\n\n\nBeauregard, Katrine, and Jill Sheppard. 2021. “Antiwomen but\nProquota: Disaggregating Sexism and Support for Gender Quota\nPolicies.” Political Psychology 42 (2): 219–37. https://doi.org/10.1111/pops.12696.\n\n\nBecker, Richard, Allan Wilks, Ray Brownrigg, Thomas Minka, and Alex\nDeckmyn. 2022. maps: Draw Geographical\nMaps. https://CRAN.R-project.org/package=maps.\n\n\nBeelen, Kaspar, Timothy Alberdingk Thim, Christopher Cochrane, Kees\nHalvemaan, Graeme Hirst, Michael Kimmins, Sander Lijbrink, et al. 2017.\n“Digitization of the Canadian Parliamentary Debates.”\nCanadian Journal of Political Science 50 (3): 849–64.\n\n\nBegley, Glenn, and Lee Ellis. 2012. “Raise Standards for\nPreclinical Cancer Research.” Nature 483 (7391):\n531–33. https://doi.org/10.1038/483531a.\n\n\nBender, Emily, Timnit Gebru, Angelina McMillan-Major, and Shmargaret\nShmitchell. 2021. “On the Dangers of Stochastic Parrots: Can\nLanguage Models Be Too Big?” In Proceedings of the 2021\nACM Conference on Fairness, Accountability, and\nTransparency. ACM. https://doi.org/10.1145/3442188.3445922.\n\n\nBengtsson, Henrik. 2021. “A Unifying\nFramework for Parallel and Distributed Processing in R using\nFutures.” The R Journal 13 (2): 208–27. https://doi.org/10.32614/RJ-2021-048.\n\n\nBenoit, Kenneth. 2020. “Text as Data: An Overview.” In\nThe SAGE Handbook of Research Methods in Political Science and\nInternational Relations, edited by Luigi Curini and Robert\nFranzese, 461–97. London: SAGE Publishing. https://doi.org/10.4135/9781526486387.n29.\n\n\nBenoit, Kenneth, and Michael Laver. 2006. Party\nPolicy in Modern Democracies. Routledge.\n\n\n———. 2007. “Estimating Party Policy Positions: Comparing Expert\nSurveys and Hand-Coded Content Analysis.” Electoral\nStudies 26 (1): 90–107. 
https://doi.org/10.1016/j.electstud.2006.04.008.\n\n\nBenoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng,\nStefan Müller, and Akitaka Matsuo. 2018. “quanteda: An R package for the quantitative analysis of\ntextual data.” Journal of Open Source Software 3\n(30): 774. https://doi.org/10.21105/joss.00774.\n\n\nBensinger, Greg. 2020. “Google Redraws the Borders on Maps\nDepending on Who’s Looking.” The Washington Post,\nFebruary. https://www.washingtonpost.com/technology/2020/02/14/google-maps-political-borders/.\n\n\nBerdine, Gilbert, Vincent Geloso, and Benjamin Powell. 2018.\n“Cuban Infant Mortality and Longevity: Health Care or\nRepression?” Health Policy and Planning 33 (6): 755–57.\nhttps://doi.org/10.1093/heapol/czy033.\n\n\nBerkson, Joseph. 1946. “Limitations of the Application of Fourfold\nTable Analysis to Hospital Data.” Biometrics Bulletin 2\n(3): 47–53. https://doi.org/10.2307/3002000.\n\n\nBerners-Lee, Timothy. 1989. “Information Management: A\nProposal.” https://www.w3.org/History/1989/proposal.html.\n\n\nBerry, Donald. 1989. “Comment: Ethics and ECMO.”\nStatistical Science 4 (4): 306–10. https://www.jstor.org/stable/2245830.\n\n\nBertrand, Marianne, and Sendhil Mullainathan. 2004. “Are Emily and\nGreg More Employable Than Lakisha and Jamal? A Field Experiment on Labor\nMarket Discrimination.” American Economic Review 94 (4):\n991–1013. https://doi.org/10.1257/0002828042002561.\n\n\nBethlehem, R. A. I., J. Seidlitz, S. R. White, J. W. Vogel, K. M.\nAnderson, C. Adamson, S. Adler, et al. 2022. “Brain Charts for the\nHuman Lifespan.” Nature 604 (7906): 525–33. https://doi.org/10.1038/s41586-022-04554-y.\n\n\nBetz, Timm, Scott Cook, and Florian Hollenbach. 2018. “On the Use\nand Abuse of Spatial Instruments.” Political Analysis 26\n(4): 474–79. https://doi.org/10.1017/pan.2018.10.\n\n\nBickel, Peter, Eugene Hammel, and William O’Connell. 1975. 
“Sex\nBias in Graduate Admissions: Data from Berkeley: Measuring Bias Is\nHarder Than Is Usually Assumed, and the Evidence Is Sometimes Contrary\nto Expectation.” Science 187 (4175): 398–404. https://doi.org/10.1126/science.187.4175.398.\n\n\nBiderman, Stella, Kieran Bicheno, and Leo Gao. 2022. “Datasheet\nfor the Pile.” https://arxiv.org/abs/2201.07311.\n\n\nBirkmeyer, John, Jonathan Finks, Amanda O’Reilly, Mary Oerline, Arthur\nCarlin, Andre Nunn, Justin Dimick, Mousumi Banerjee, and Nancy\nBirkmeyer. 2013. “Surgical Skill and Complication Rates After\nBariatric Surgery.” New England Journal of Medicine 369\n(15): 1434–42. https://doi.org/10.1056/nejmsa1300625.\n\n\nBlair, Ed, Seymour Sudman, Norman M Bradburn, and Carol Stocking. 1977.\n“How to Ask Questions about Drinking and Sex: Response Effects in\nMeasuring Consumer Behavior.” Journal of Marketing\nResearch 14 (3): 316–21. https://doi.org/10.2307/3150769.\n\n\nBlair, Graeme, Jasper Cooper, Alexander Coppock, and Macartan Humphreys.\n2019. “Declaring and Diagnosing Research Designs.”\nAmerican Political Science Review 113 (3): 838–59. https://doi.org/10.1017/S0003055419000194.\n\n\nBlair, Graeme, Jasper Cooper, Alexander Coppock, Macartan Humphreys, and\nLuke Sonnet. 2021. estimatr: Fast Estimators\nfor Design-Based Inference. https://CRAN.R-project.org/package=estimatr.\n\n\nBlair, James. 2019. Democratizing R with\nPlumber APIs. https://posit.co/resources/videos/democratizing-r-with-plumber-apis/.\n\n\nBland, Martin, and Douglas Altman. 1986. “Statistical Methods for\nAssessing Agreement Between Two Methods of Clinical Measurement.”\nThe Lancet 327 (8476): 307–10. https://doi.org/10.1016/S0140-6736(86)90837-8.\n\n\nBlei, David. 2012. “Probabilistic Topic Models.”\nCommunications of the ACM 55 (4): 77–84. https://doi.org/10.1145/2133806.2133826.\n\n\nBlei, David, Andrew Ng, and Michael Jordan. 2003. “Latent\nDirichlet Allocation.” Journal of Machine Learning\nResearch 3 (Jan): 993–1022. 
https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf.\n\n\nBloom, Howard, Andrew Bell, and Kayla Reiman. 2020. “Using Data\nfrom Randomized Trials to Assess the Likely Generalizability of\nEducational Treatment-Effect Estimates from Regression Discontinuity\nDesigns.” Journal of Research on Educational\nEffectiveness 13 (3): 488–517. https://doi.org/10.1080/19345747.2019.1634169.\n\n\nBlumenthal, Mark. 2014. “Polls, Forecasts, and\nAggregators.” PS: Political Science & Politics 47\n(02): 297–300. https://doi.org/10.1017/s1049096514000055.\n\n\nBoland, Philip. 1984. “A Biographical Glimpse of William Sealy\nGosset.” The American Statistician 38 (3): 179–83. https://doi.org/10.2307/2683648.\n\n\nBolker, Ben, and David Robinson. 2022. broom.mixed: Tidying Methods for Mixed\nModels. https://CRAN.R-project.org/package=broom.mixed.\n\n\nBolton, Ruth, and Randall Chapman. 1986. “Searching for Positive\nReturns at the Track.” Management Science 32 (August):\n1040–60. https://doi.org/10.1287/mnsc.32.8.1040.\n\n\nBombieri, Giulia, Vincenzo Penteriani, Kamran Almasieh, Hüseyin Ambarlı,\nMohammad Reza Ashrafzadeh, Chandan Surabhi Das, Nishith Dharaiya, et al.\n2023. “A Worldwide Perspective on Large Carnivore Attacks on\nHumans.” PLOS Biology 21 (1): e3001946. https://doi.org/10.1371/journal.pbio.3001946.\n\n\nBor, Jacob, Atheendar Venkataramani, David Williams, and Alexander Tsai.\n2018. “Police Killings and Their Spillover Effects on the Mental\nHealth of Black Americans: A Population-Based, Quasi-Experimental\nStudy.” The Lancet 392 (10144): 302–10. https://doi.org/10.1016/s0140-6736(18)31130-9.\n\n\nBorer, Elizabeth T., Eric W. Seabloom, Matthew B. Jones, and Mark\nSchildhauer. 2009. “Some Simple Guidelines for Effective Data\nManagement.” Bulletin of the Ecological Society of\nAmerica 90 (2): 205–14. https://doi.org/10.1890/0012-9623-90.2.205.\n\n\nBorghi, John, and Ana Van Gulick. 2022. 
“Promoting Open Science\nThrough Research Data Management.” Harvard Data Science\nReview 4 (3). https://doi.org/10.1162/99608f92.9497f68e.\n\n\nBorkin, Michelle, Zoya Bylinskii, Nam Wook Kim, Constance May\nBainbridge, Chelsea Yeh, Daniel Borkin, Hanspeter Pfister, and Aude\nOliva. 2015. “Beyond Memorability: Visualization Recognition and\nRecall.” IEEE Transactions on Visualization and Computer\nGraphics 22 (1): 519–28. https://doi.org/10.1109/TVCG.2015.2467732.\n\n\nBosch, Oriol, and Melanie Revilla. 2022. “When survey science met web tracking: Presenting an error\nframework for metered data.” Journal of the Royal\nStatistical Society: Series A (Statistics in Society), November,\n1–29. https://doi.org/10.1111/rssa.12956.\n\n\nBouguen, Adrien, Yue Huang, Michael Kremer, and Edward Miguel. 2019.\n“Using Randomized Controlled Trials to Estimate Long-Run Impacts\nin Development Economics.” Annual Review of Economics 11\n(1): 523–61. https://doi.org/10.1146/annurev-economics-080218-030333.\n\n\nBouie, Jamelle. 2022. “We Still Can’t See American Slavery for\nWhat It Was.” The New York Times, January. https://www.nytimes.com/2022/01/28/opinion/slavery-voyages-data-sets.html.\n\n\nBowen, Claire McKay. 2022. Protecting Your\nPrivacy in a Data-Driven World. 1st ed. Chapman; Hall/CRC.\nhttps://doi.org/10.1201/9781003122043.\n\n\nBowers, Jake, and Maarten Voors. 2016. “How to Improve Your\nRelationship with Your Future Self.” Revista de Ciencia\nPolı́tica 36 (3): 829–48. https://doi.org/10.4067/S0718-090X2016000300011.\n\n\nBowley, Arthur Lyon. 1901. Elements of Statistics. London: P.\nS. King.\n\n\n———. 1913. “Working-Class Households in Reading.”\nJournal of the Royal Statistical Society 76 (7): 672–701. https://doi.org/10.2307/2339708.\n\n\nBox, George E. P. 1976. “Science and Statistics.”\nJournal of the American Statistical Association 71 (356):\n791–99. https://doi.org/10.1080/01621459.1976.10480949.\n\n\nBoykis, Vicki. 2019. “A Deep Dive on Python Type Hints,”\nJuly. 
https://vickiboykis.com/2019/07/08/a-deep-dive-on-python-type-hints/.\n\n\nBoysel, Sam, and Davis Vaughan. 2021. fredr: An\nR Client for the “FRED” API. https://CRAN.R-project.org/package=fredr.\n\n\nBradley, Valerie, Shiro Kuriwaki, Michael Isakov, Dino Sejdinovic,\nXiao-Li Meng, and Seth Flaxman. 2021. “Unrepresentative Big\nSurveys Significantly Overestimated US Vaccine\nUptake.” Nature 600 (7890): 695–700. https://doi.org/10.1038/s41586-021-04198-4.\n\n\nBraginsky, Mika. 2020. wordbankr: Accessing the\nWordbank Database. https://CRAN.R-project.org/package=wordbankr.\n\n\nBrandt, Allan. 1978. “Racism and Research: The Case of the\nTuskegee Syphilis Study.” Hastings Center Report, 21–29.\nhttps://doi.org/10.2307/3561468.\n\n\nBreiman, Leo. 1994. “The 1991 Census Adjustment: Undercount or Bad\nData?” Statistical Science 9 (4). https://doi.org/10.1214/ss/1177010259.\n\n\n———. 2001. “Statistical Modeling: The Two Cultures.”\nStatistical Science 16 (3): 199–231. https://doi.org/10.1214/ss/1009213726.\n\n\nBremer, Nadieh, and Shirley Wu. 2021. Data Sketches. A K\nPeters/CRC Press. https://doi.org/10.1201/9780429445019.\n\n\nBrewer, Cynthia. 2015. Designing Better Maps: A Guide for GIS\nUsers. 2nd ed.\n\n\nBrewer, Ken. 2013. “Three Controversies in the History of Survey\nSampling.” Survey Methodology 39 (2): 249–63.\n\n\nBreznau, Nate, Eike Mark Rinke, Alexander Wuttke, Hung HV Nguyen, Muna\nAdem, Jule Adriaans, Amalia Alvarez-Benjumea, et al. 2022.\n“Observing Many Researchers Using the Same Data and Hypothesis\nReveals a Hidden Universe of Uncertainty.” Proceedings of the\nNational Academy of Sciences 119 (44): e2203150119. https://doi.org/10.1073/pnas.2203150119.\n\n\nBriggs, Ryan. 2021. “Why Does Aid Not Target the Poorest?”\nInternational Studies Quarterly 65 (3): 739–52. https://doi.org/10.1093/isq/sqab035.\n\n\nBrodeur, Abel, Nikolai Cook, and Anthony Heyes. 2020. 
“Methods Matter: p-Hacking and Publication Bias in Causal\nAnalysis in Economics.” American Economic Review\n110 (11): 3634–60. https://doi.org/10.1257/aer.20190687.\n\n\nBrokowski, Carolyn, and Mazhar Adli. 2019. “CRISPR Ethics: Moral\nConsiderations for Applications of a Powerful Tool.” Journal\nof Molecular Biology 431 (1): 88–101. https://doi.org/10.1016/j.jmb.2018.05.044.\n\n\nBronner, Laura. 2020. “Why Statistics Don’t Capture the Full\nExtent of the Systemic Bias in Policing.”\nFiveThirtyEight, June. https://fivethirtyeight.com/features/why-statistics-dont-capture-the-full-extent-of-the-systemic-bias-in-policing/.\n\n\n———. 2021. “Quantitative Editing.” YouTube, June.\nhttps://youtu.be/LI5m9RzJgWc.\n\n\nBrontë, Charlotte. 1847. Jane Eyre. https://www.gutenberg.org/files/1260/1260-h/1260-h.htm.\n\n\n———. 1857. The Professor. https://www.gutenberg.org/files/1028/1028-h/1028-h.htm.\n\n\nBrook, Robert, John Ware, William Rogers, Emmett Keeler, Allyson Ross\nDavies, Cathy Sherbourne, George Goldberg, Kathleen Lohr, Patricia Camp,\nand Joseph Newhouse. 1984. “The Effect of Coinsurance on the\nHealth of Adults: Results from the RAND Health Insurance\nExperiment.” https://www.rand.org/pubs/reports/R3055.html.\n\n\nBrown, Zack. 2018. “A Git Origin Story.” Linux\nJournal, July. https://www.linuxjournal.com/content/git-origin-story.\n\n\nBryan, Jenny. 2015. “Naming Things.” Reproducible\nScience Workshop, May. https://speakerdeck.com/jennybc/how-to-name-files.\n\n\n———. 2018a. “Excuse Me, Do You Have a Moment to Talk about Version\nControl?” The American Statistician 72 (1): 20–27. https://doi.org/10.1080/00031305.2017.1399928.\n\n\n———. 2018b. “Code Smells and Feels.” YouTube,\nJuly. https://youtu.be/7oyiPBjLAWY.\n\n\n———. 2020. Happy Git and GitHub for the\nuseR. https://happygitwithr.com.\n\n\nBryan, Jenny, and Jim Hester. 2020. What They\nForgot to Teach You About R. 
https://rstats.wtf/index.html.\n\n\nBryan, Jenny, Jim Hester, David Robinson, Hadley Wickham, and Christophe\nDervieux. 2022. reprex: Prepare Reproducible\nExample Code via the Clipboard. https://CRAN.R-project.org/package=reprex.\n\n\nBryan, Jenny, and Hadley Wickham. 2021. gh:\nGitHub API. https://CRAN.R-project.org/package=gh.\n\n\nBuckheit, Jonathan, and David Donoho. 1995. “Wavelab and\nReproducible Research.” In Wavelets and Statistics,\n55–81. Springer. https://doi.org/10.1007/978-1-4612-2544-7_5.\n\n\nBueno de Mesquita, Ethan, and Anthony Fowler. 2021. Thinking Clearly\nwith Data: A Guide to Quantitative Reasoning and Analysis. New\nJersey: Princeton University Press.\n\n\nBuhr, Ray. 2017. Using R as a Production\nMachine Learning Language (Part I). https://raybuhr.github.io/blog/posts/making-predictions-over-http/.\n\n\nBuja, Andreas, Dianne Cook, Heike Hofmann, Michael Lawrence, Eun-Kyung\nLee, Deborah F. Swayne, and Hadley Wickham. 2009. “Statistical\nInference for Exploratory Data Analysis and Model Diagnostics.”\nPhilosophical Transactions of the Royal Society A:\nMathematical, Physical and Engineering Sciences 367 (1906):\n4361–83. https://doi.org/10.1098/rsta.2009.0120.\n\n\nBuja, Andreas, Dianne Cook, and Deborah Swayne. 1996. “Interactive\nHigh-Dimensional Data Visualization.” Journal of\nComputational and Graphical Statistics 5 (1): 78–99. https://doi.org/10.2307/1390754.\n\n\nBuneman, Peter, Sanjeev Khanna, and Tan Wang-Chiew. 2001. “Why and\nWhere: A Characterization of Data Provenance.” In Database\nTheory ICDT 2001, 316–30. Springer\nBerlin Heidelberg. https://doi.org/10.1007/3-540-44503-x_20.\n\n\nBuolamwini, Joy, and Timnit Gebru. 2018. “Gender Shades:\nIntersectional Accuracy Disparities in Commercial Gender\nClassification.” In Conference on Fairness, Accountability\nand Transparency, 77–91.\n\n\nBurch, Tyler James. 2023. “2023 NHL Playoff\nPredictions,” April. 
https://tylerjamesburch.com/blog/misc/nhl-predictions.\n\n\nBurton, Jason, Nicole Cruz, and Ulrike Hahn. 2021. “Reconsidering\nEvidence of Moral Contagion in Online Social Networks.”\nNature Human Behaviour 5 (12): 1629–35. https://doi.org/10.1038/s41562-021-01133-5.\n\n\nBush, Vannevar. 1945. “As We May Think.” The Atlantic\nMonthly, July. https://www.theatlantic.com/magazine/archive/1945/07/as-we-may-think/303881/.\n\n\nByrd, James Brian, Anna Greene, Deepashree Venkatesh Prasad, Xiaoqian\nJiang, and Casey Greene. 2020. “Responsible, Practical Genomic\nData Sharing That Accelerates Research.” Nature Reviews\nGenetics 21 (10): 615–29. https://doi.org/10.1038/s41576-020-0257-5.\n\n\nCahill, Niamh, Michelle Weinberger, and Leontine Alkema. 2020.\n“What Increase in Modern Contraceptive Use Is Needed in FP2020\nCountries to Reach 75% Demand Satisfied by 2030? An Assessment Using the\nAccelerated Transition Method and Family Planning Estimation\nModel.” Gates Open Research 4. https://doi.org/10.12688/gatesopenres.13125.1.\n\n\nCalonico, Sebastian, Matias Cattaneo, Max Farrell, and Rocio Titiunik.\n2021. rdrobust: Robust Data-Driven Statistical\nInference in Regression-Discontinuity Designs. https://CRAN.R-project.org/package=rdrobust.\n\n\nCambon, Jesse, and Christopher Belanger. 2021. “tidygeocoder: Geocoding Made Easy.” Zenodo.\nhttps://doi.org/10.5281/zenodo.3981510.\n\n\nCanty, Angelo, and B. D. Ripley. 2021. boot:\nBootstrap R (S-Plus) Functions.\n\n\nCardoso, Tom. 2020. “Bias behind bars: A\nGlobe investigation finds a prison system stacked against Black and\nIndigenous inmates.” The Globe and Mail, October.\nhttps://www.theglobeandmail.com/canada/article-investigation-racial-bias-in-canadian-prison-risk-assessments/.\n\n\nCarl, Sebastian, Ben Baldwin, Lee Sharpe, Tan Ho, and John Edwards.\n2023. Nflverse: Easily Install and Load the ’Nflverse’. https://CRAN.R-project.org/package=nflverse.\n\n\nCarleton, Chris. 2021. 
“wccarleton/conflict-europe: Acce.” Zenodo.\nhttps://doi.org/10.5281/zenodo.4550688.\n\n\nCarleton, Chris, Dave Campbell, and Mark Collard. 2021. “A\nReassessment of the Impact of Temperature Change on European Conflict\nDuring the Second Millennium CE Using a Bespoke Bayesian Time-Series\nModel.” Climatic Change 165 (1): 1–16. https://doi.org/10.1007/s10584-021-03022-2.\n\n\nCaro, Robert. 2019. Working. 1st ed. New York: Knopf.\n\n\nCarpenter, Christopher, and Carlos Dobkin. 2014. “Replication data for: The Minimum Legal Drinking Age and\nCrime.” https://doi.org/10.7910/DVN/27070.\n\n\n———. 2015. “The Minimum Legal Drinking Age\nand Crime.” The Review of Economics and\nStatistics 97 (2): 521–24. https://doi.org/10.1162/REST_a_00489.\n\n\nCarroll, Lewis. 1871. Through the Looking-Glass. Macmillan. https://www.gutenberg.org/files/12/12-h/12-h.htm.\n\n\nCastro, Marcia, Susie Gurzenda, Cassio Turra, Sun Kim, Theresa\nAndrasfay, and Noreen Goldman. 2023. “Research Note:\nCOVID-19 Is Not an Independent Cause of Death.”\nDemography, February. https://doi.org/10.1215/00703370-10575276.\n\n\nCaughey, Devin, and Jasjeet Sekhon. 2011. “Elections and the Regression Discontinuity Design:\nLessons from Close U.S. House Races, 1942–2008.”\nPolitical Analysis 19 (4): 385–408. https://doi.org/10.1093/pan/mpr032.\n\n\nChamberlain, Scott, Hadley Wickham, Winston Chang, and Mauricio Vargas.\n2022. Analogsea: Interface to “Digital Ocean”. https://CRAN.R-project.org/package=analogsea.\n\n\nChamberlin, Donald. 2012. “Early History of\nSQL.” IEEE Annals of the History of\nComputing 34 (4): 78–82. https://doi.org/10.1109/mahc.2012.61.\n\n\nChambliss, Daniel. 1989. “The Mundanity of Excellence: An\nEthnographic Report on Stratification and Olympic Swimmers.”\nSociological Theory 7 (1): 70–86. https://doi.org/10.2307/202063.\n\n\nChambru, Cédric, and Paul Maneuvrier-Hervieu. 2022. 
“Introducing HiSCoD: A new gateway for the study of\nhistorical social conflict.” Working Paper Series,\nDepartment of Economics, University of Zurich. https://doi.org/10.5167/uzh-217109.\n\n\nChan, Duo. 2021. “Combining Statistical, Physical, and Historical\nEvidence to Improve Historical Sea-Surface Temperature Records.”\nHarvard Data Science Review 3 (1). https://doi.org/10.1162/99608f92.edcee38f.\n\n\nChang, Winston, Joe Cheng, JJ Allaire, Carson Sievert, Barret Schloerke,\nYihui Xie, Jeff Allen, Jonathan McPherson, Alan Dipert, and Barbara\nBorges. 2021. shiny: Web Application Framework\nfor R. https://CRAN.R-project.org/package=shiny.\n\n\nChase, William. 2020. “The Glamour of Graphics.”\nRStudio Conference, January. https://posit.co/resources/videos/the-glamour-of-graphics/.\n\n\nChawla, Dalmeet Singh. 2020. “Critiqued Coronavirus Simulation\nGets Thumbs up from Code-Checking Efforts.” Nature 582:\n323–24. https://doi.org/10.1038/d41586-020-01685-y.\n\n\nChellel, Kit. 2018. “The Gambler Who Cracked the Horse-Racing\nCode.” Bloomberg Businessweek, May. https://www.bloomberg.com/news/features/2018-05-03/the-gambler-who-cracked-the-horse-racing-code.\n\n\nChen, Heng, Marie-Hélène Felt, and Christopher Henry. 2018. “2017\nMethods-of-Payment Survey: Sample Calibration and Variance\nEstimation.” Bank of Canada. https://doi.org/10.34989/tr-114.\n\n\nChen, Wei, Xilu Chen, Chang-Tai Hsieh, and Zheng Song. 2019. “A\nForensic Examination of China’s National Accounts.” Brookings\nPapers on Economic Activity, 77–127. https://www.jstor.org/stable/26798817.\n\n\nChen, Weijun, Yan Qi, Yuwen Zhang, Christina Brown, Akos Lada, and\nHarivardan Jayaraman. 2022. “Notifications: Why Less Is\nMore,” December. https://medium.com/@AnalyticsAtMeta/notifications-why-less-is-more-how-facebook-has-been-increasing-both-user-satisfaction-and-app-9463f7325e7d.\n\n\nCheng, Joe, Bhaskar Karambelkar, and Yihui Xie. 2021. 
leaflet: Create Interactive Web Maps with the JavaScript\n“Leaflet” Library. https://CRAN.R-project.org/package=leaflet.\n\n\nCheriet, Mohamed, Nawwaf Kharma, Cheng-Lin Liu, and Ching Suen. 2007.\nCharacter Recognition Systems: A Guide for Students and\nPractitioners. Wiley.\n\n\nChouldechova, Alexandra, Diana Benavides-Prado, Oleksandr Fialko, and\nRhema Vaithianathan. 2018. “A Case Study of Algorithm-Assisted\nDecision Making in Child Maltreatment Hotline Screening\nDecisions.” In Proceedings of the 1st Conference on Fairness,\nAccountability and Transparency, edited by Sorelle Friedler and\nChristo Wilson, 81:134–48. Proceedings of Machine Learning Research. https://proceedings.mlr.press/v81/chouldechova18a.html.\n\n\nChrétien, Jean. 2007. My Years as Prime Minister. 1st ed.\nToronto: Knopf Canada.\n\n\nChristensen, Garret, Allan Dafoe, Edward Miguel, Don Moore, and Andrew\nRose. 2019. “A Study of the Impact of Data Sharing on Article\nCitations Using Journal Policies as a Natural Experiment.”\nPLOS ONE 14 (12): e0225883. https://doi.org/10.1371/journal.pone.0225883.\n\n\nChristensen, Garret, Jeremy Freese, and Edward Miguel. 2019.\nTransparent and Reproducible Social Science Research.\nCalifornia: University of California Press.\n\n\nChristian, Brian. 2012. “The A/B Test: Inside\nthe Technology That’s Changing the Rules of Business.”\nWired, April. https://www.wired.com/2012/04/ff-abtesting/.\n\n\nCirone, Alexandra, and Arthur Spirling. 2021. “Turning History\ninto Data: Data Collection, Measurement, and Inference in HPE.”\nJournal of Historical Political Economy 1 (1): 127–54. https://doi.org/10.1561/115.00000005.\n\n\nCity of Toronto. 2021. 2021 Street Needs Assessment. https://www.toronto.ca/city-government/data-research-maps/research-reports/housing-and-homelessness-research-and-reports/.\n\n\nCleveland, William. (1985) 1994. The Elements of Graphing Data.\n2nd ed. New Jersey: Hobart Press.\n\n\nClinton, Joshua, John Lapinski, and Marc Trussler. 
2022.\n“Reluctant Republicans, Eager Democrats?” Public\nOpinion Quarterly 86 (2): 247–69. https://doi.org/10.1093/poq/nfac011.\n\n\nCohen, Glenn, and Michelle Mello. 2018. “HIPAA and\nProtecting Health Information in the 21st Century.”\nJAMA 320 (3): 231. https://doi.org/10.1001/jama.2018.5630.\n\n\nCohen, Jason, Steven Teleki, and Eric Brown. 2006. Best Kept Secrets\nof Peer Code Review. Smart Bear Incorporated.\n\n\nCohn, Alain. 2019. “Data and code for: Civic\nHonesty Around the Globe.” Harvard Dataverse. https://doi.org/10.7910/dvn/ykbodn.\n\n\nCohn, Alain, Michel André Maréchal, David Tannenbaum, and Christian\nLukas Zünd. 2019a. “Civic Honesty Around the Globe.”\nScience 365 (6448): 70–73. https://doi.org/10.1126/science.aau8712.\n\n\n———. 2019b. “Supplementary Materials for: Civic Honesty Around the\nGlobe.” Science 365 (6448): 70–73.\n\n\nCohn, Nate. 2016. “We Gave Four Good Pollsters the Same Raw Data.\nThey Had Four Different Results.” The New York Times,\nSeptember. https://www.nytimes.com/interactive/2016/09/20/upshot/the-error-the-polling-world-rarely-talks-about.html.\n\n\nCollins, Annie, and Rohan Alexander. 2022. “Reproducibility of\nCOVID-19 Pre-Prints.” Scientometrics 127: 4655–73. https://doi.org/10.1007/s11192-022-04418-2.\n\n\nColombo, Tommaso, Holger Fröning, Pedro Javier Garcı̀a, and Wainer\nVandelli. 2016. “Optimizing the Data-Collection Time of a\nLarge-Scale Data-Acquisition System Through a Simulation\nFramework.” The Journal of Supercomputing 72 (12):\n4546–72. https://doi.org/10.1007/s11227-016-1764-1.\n\n\nComer, Benjamin P., and Jason R. Ingram. 2022. “Comparing Fatal\nEncounters, Mapping Police Violence, and Washington Post Fatal Police\nShooting Data from 2015-2019: A Research Note.” Criminal\nJustice Review, January, 073401682110710. https://doi.org/10.1177/07340168211071014.\n\n\nCongelio, Bradley. 2024. Introduction to NFL\nAnalytics with R. 1st ed. Chapman; Hall/CRC. 
https://bradcongelio.com/nfl-analytics-with-r-book/.\n\n\nCook, Dianne, Andreas Buja, Javier Cabrera, and Catherine Hurley. 1995.\n“Grand Tour and Projection\nPursuit.” Journal of Computational and Graphical\nStatistics 4 (3): 155–72. https://doi.org/10.1080/10618600.1995.10474674.\n\n\nCook, Dianne, Nancy Reid, and Emi Tanaka. 2021. “The Foundation Is\nAvailable for Thinking about Data Visualization Inferentially.”\nHarvard Data Science Review 3 (3). https://doi.org/10.1162/99608f92.8453435d.\n\n\nCook, Dianne, and Deborah Swayne. 2007. Interactive and Dynamic Graphics for Data Analysis: With\nR and GGobi. 1st ed. Springer.\n\n\nCooley, David. 2020. mapdeck: Interactive Maps\nUsing “Mapbox GL JS” and\n“Deck.gl”. https://CRAN.R-project.org/package=mapdeck.\n\n\nCouncil of European Union. 2016. “General Data Protection\nRegulation 2016/679.” https://eur-lex.europa.eu/eli/reg/2016/679/oj.\n\n\nCowen, Tyler. 2021. “Episode 132: Amia Srinivasan on Utopian\nFeminism.” Conversations with Tyler, September. https://conversationswithtyler.com/episodes/amia-srinivasan/.\n\n\n———. 2023. “Episode 168: Katherine Rundell on the Art of\nWords.” Conversations with Tyler, January. https://conversationswithtyler.com/episodes/katherine-rundell/.\n\n\nCox, David. 2018. “In Gentle Praise of Significance Tests.”\nYouTube, October. https://youtu.be/txLj%5FP9UlCQ.\n\n\nCox, David, and Nancy Reid. 1987. “Parameter Orthogonality and\nApproximate Conditional Inference.” Journal of the Royal\nStatistical Society: Series B (Methodological) 49 (1): 1–18. https://doi.org/10.1111/j.2517-6161.1987.tb01422.x.\n\n\nCox, Murray. 2021. “Inside Airbnb—Toronto\nData.” http://insideairbnb.com/get-the-data.html.\n\n\nCoyle, Edward, Andrew Coggan, Mari Hopper, and Thomas Walters. 1988.\n“Determinants of Endurance in Well-Trained\nCyclists.” Journal of Applied Physiology 64 (6):\n2622–30. https://doi.org/10.1152/jappl.1988.64.6.2622.\n\n\nCraiu, Radu. 2019. 
“The Hiring Gambit: In Search of the Twofer\nData Scientist.” Harvard Data Science Review 1 (1). https://doi.org/10.1162/99608f92.440445cb.\n\n\nCramer, Jan Salomon. 2003. “The Origins of Logistic\nRegression.” SSRN Electronic Journal. https://doi.org/10.2139/ssrn.360300.\n\n\nCrane, Nicola, Stephanie Hazlitt, and Apache Arrow. 2023.\nApache Arrow R Cookbook. https://arrow.apache.org/cookbook/r/.\n\n\nCrawford, Kate. 2021. Atlas of AI.\n1st ed. New Haven: Yale University Press.\n\n\nCrosby, Alfred. 1997. The Measure of Reality: Quantification in\nWestern Europe, 1250-1600. Cambridge: Cambridge University Press.\n\n\nCsárdi, Gábor. 2022. gitcreds: Query\n“git” Credentials from “R”. https://CRAN.R-project.org/package=gitcreds.\n\n\nCsárdi, Gábor, Jim Hester, Hadley Wickham, Winston Chang, Martin Morgan,\nand Dan Tenenbaum. 2021. remotes: R Package\nInstallation from Remote Repositories, Including\n“GitHub”. https://CRAN.R-project.org/package=remotes.\n\n\nCummins, Neil. 2022. “The Hidden Wealth of English Dynasties,\n1892–2016.” The Economic History Review 75 (3): 667–702.\nhttps://doi.org/10.1111/ehr.13120.\n\n\nCunningham, Scott. 2021. Causal Inference: The Mixtape. 1st ed.\nNew Haven: Yale University Press. https://mixtape.scunning.com.\n\n\nD’Ignazio, Catherine, and Lauren Klein. 2020. Data Feminism.\nMassachusetts: The MIT Press. https://data-feminism.mitpress.mit.edu.\n\n\nda Silva, Natalia, Dianne Cook, and Eun-Kyung Lee. 2023. “Interactive graphics for visually diagnosing forest\nclassifiers in R.” Computational Statistics,\nJanuary. https://doi.org/10.1007/s00180-023-01323-x.\n\n\nDagan, Noa, Noam Barda, Eldad Kepten, Oren Miron, Shay Perchik, Mark\nKatz, Miguel Hernán, Marc Lipsitch, Ben Reis, and Ran Balicer. 2021.\n“BNT162b2 mRNA Covid-19 Vaccine in a Nationwide Mass Vaccination\nSetting.” New England Journal of Medicine 384 (15):\n1412–23. https://doi.org/10.1056/NEJMoa2101765.\n\n\nDaston, Lorraine. 2000. 
“Why Statistics Tend Not Only to Describe\nthe World but to Change It.” London Review of Books 22\n(8). https://www.lrb.co.uk/the-paper/v22/n08/lorraine-daston/why-statistics-tend-not-only-to-describe-the-world-but-to-change-it.\n\n\nData and Justice Criminology Lab, Institute of Criminology and Criminal\nJustice, Carleton University; The Centre for Research & Innovation\nfor Black Survivors of Homicide Victims (The CRIB), at the\nFactor-Inwentash Faculty of Social Work, University of Toronto; Canadian\nCivil Liberties Association; Ethics and Technology Lab, Queen’s\nUniversity. 2022. “Tracking (in)justice: A Living Data Set\nTracking Canadian Police-Involved Deaths.” https://trackinginjustice.ca.\n\n\nDattani, Saloni. 2024. “The Rise in Reported Maternal Mortality\nRates in the US Is Largely Due to a Change in Measurement.”\nOur World in Data.\n\n\nDavidson, Thomas, Debasmita Bhattacharya, and Ingmar Weber. 2019.\n“Racial Bias in Hate Speech and Abusive Language Detection\nDatasets.” In Proceedings of the Third Workshop on Abusive\nLanguage Online, 25–35.\n\n\nDavies, Neil M., Gibran Hemani, Jenae M. Neiderhiser, Hilary C. Martin,\nMelinda C. Mills, Peter M. Visscher, Loïc Yengo, Alexander Strudwick\nYoung, and Matthew C. Keller. 2024. “The Importance of\nFamily-Based Sampling for Biobanks.” Nature 634 (8035):\n795–803. https://doi.org/10.1038/s41586-024-07721-5.\n\n\nDavies, Rhian, Steph Locke, and Lucy D’Agostino McGowan. 2022. datasauRus: Datasets from the Datasaurus\nDozen. https://CRAN.R-project.org/package=datasauRus.\n\n\nDavis, Darren. 1997. “Nonrandom Measurement Error and Race of\nInterviewer Effects Among African Americans.” The Public\nOpinion Quarterly 61 (1): 183–207. https://doi.org/10.1086/297792.\n\n\nDavison, A. C., and D. V. Hinkley. 1997. Bootstrap Methods and Their\nApplications. Cambridge: Cambridge University Press. http://statwww.epfl.ch/davison/BMA/.\n\n\nDe Jonge, Edwin, and Mark van der Loo. 2013. 
An\nintroduction to data cleaning with R. Statistics Netherlands\nHeerlen. https://cran.r-project.org/doc/contrib/de%5FJonge+van%5Fder%5FLoo-Introduction%5Fto%5Fdata%5Fcleaning%5Fwith%5FR.pdf.\n\n\nDean, Natalie. 2022. “Tracking COVID-19 Infections:\nTime for Change.” Nature 602 (7896): 185. https://doi.org/10.1038/d41586-022-00336-8.\n\n\nDeaton, Angus. 2010. “Instruments, Randomization, and Learning\nabout Development.” Journal of Economic Literature 48\n(2): 424–55. https://doi.org/10.1257/jel.48.2.424.\n\n\nDenby, Lorraine, and Colin Mallows. 2009. “Variations on the\nHistogram.” Journal of Computational and Graphical\nStatistics 18 (1): 21–31. https://doi.org/10.1198/jcgs.2009.0002.\n\n\nDeWitt, Helen. 2000. The Last Samurai. 1st ed. United States:\nTalk Miramax Books.\n\n\nDillman, Don, Jolene Smyth, and Leah Christian. (1978) 2014.\nInternet, Phone, Mail, and Mixed-Mode Surveys: The Tailored Design\nMethod. 4th ed. Wiley.\n\n\nDoggers, Peter. 2021. “Carlsen Wins Game 6, Longest World Chess\nChampionship Game of All Time,” December. https://www.chess.com/news/view/fide-world-chess-championship-2021-game-6.\n\n\nDolatsara, Hamidreza Ahady, Ying-Ju Chen, Robert Leonard, Fadel Megahed,\nand Allison Jones-Farmer. 2021. “Explaining Predictive Model\nPerformance: An Experimental Study of Data Preparation and Model\nChoice.” Big Data, October. https://doi.org/10.1089/big.2021.0067.\n\n\nDoll, Richard, and Bradford Hill. 1950. “Smoking and Carcinoma of\nthe Lung.” British Medical Journal 2 (4682): 739–48. https://doi.org/10.1136/bmj.2.4682.739.\n\n\nDruckman, James, and Donald Green. 2021. “A New Era of\nExperimental Political Science.” In Advances in Experimental\nPolitical Science, 1–16. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781108777919.002.\n\n\nDu, Kai, Steven Huddart, and Xin Daniel Jiang. 2022. 
“Lost in\nStandardization: Effects of Financial Statement Database Discrepancies\non Inference.” Journal of Accounting and Economics,\nDecember, 101573. https://doi.org/10.1016/j.jacceco.2022.101573.\n\n\nDuflo, Esther. 2020. “Field Experiments and the Practice of\nPolicy.” American Economic Review 110 (7): 1952–73. https://doi.org/10.1257/aer.110.7.1952.\n\n\nDwork, Cynthia, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006.\n“Calibrating Noise to Sensitivity in Private Data\nAnalysis.” In Theory of Cryptography Conference, 265–84.\nSpringer. https://doi.org/10.1007/11681878_14.\n\n\nDwork, Cynthia, and Aaron Roth. 2013. “The Algorithmic Foundations\nof Differential Privacy.” Foundations and Trends in\nTheoretical Computer Science 9 (3-4): 211–407. https://doi.org/10.1561/0400000042.\n\n\nEdelman, Murray, Liberty Vittert, and Xiao-Li Meng. 2021. “An\nInterview with Murray Edelman on the History of the Exit Poll.”\nHarvard Data Science Review 3 (1). https://doi.org/10.1162/99608f92.3a25cd24.\n\n\nEdgeworth, Francis Ysidro. 1885. “Methods of Statistics.”\nJournal of the Statistical Society of London, 181–217.\n\n\nEdwards, Jonathan. 2017. “PACE team response\nshows a disregard for the principles of science.”\nJournal of Health Psychology 22 (9): 1155–58. https://doi.org/10.1177/1359105317700886.\n\n\nEfron, Bradley, and Carl Morris. 1977. “Stein’s Paradox in\nStatistics.” Scientific American 236 (May): 119–27. https://doi.org/10.1038/scientificamerican0577-119.\n\n\nEghbal, Nadia. 2020. Working in Public: The Making and Maintenance\nof Open Source Software. California: Stripe Press.\n\n\nEisenstein, Michael. 2022. “Need Web Data? Here’s How to Harvest\nThem.” Nature 607: 200–201. https://doi.org/10.1038/d41586-022-01830-9.\n\n\nElliott, Michael, Brady West, Xinyu Zhang, and Stephanie Coffey. 2022.\n“The Anchoring Method: Estimation of Interviewer Effects in the\nAbsence of Interpenetrated Sample Assignment.” Survey\nMethodology 48 (1): 25–48. 
http://www.statcan.gc.ca/pub/12-001-x/2022001/article/00005-eng.htm.\n\n\nElson, Malte. 2018. “Question Wording and Item\nFormulation.” https://doi.org/10.31234/osf.io/e4ktc.\n\n\nEnns, Peter, and Jake Rothschild. 2022. “Do You Know Where Your\nSurvey Data Come From?” May. https://medium.com/3streams/surveys-3ec95995dde2.\n\n\nFarrugia, Patricia, Bradley Petrisor, Forough Farrokhyar, and Mohit\nBhandari. 2010. “Research Questions, Hypotheses and\nObjectives.” Canadian Journal of Surgery 53 (4): 278.\n\n\nFeldman, Gilad. 2024. RRR Assessment Peer Review.\nhttps://mgto.org/rrrassessmentreviewtemplate.\n\n\nFinkelstein, Amy, Sarah Taubman, Bill Wright, Mira Bernstein, Jonathan\nGruber, Joseph Newhouse, Heidi Allen, Katherine Baicker, and Oregon\nHealth Study Group. 2012. “The Oregon Health Insurance Experiment:\nEvidence from the First Year.” The Quarterly Journal of\nEconomics 127 (3): 1057–1106. https://doi.org/10.1093/qje/qjs020.\n\n\nFirke, Sam. 2023. janitor: Simple Tools for\nExamining and Cleaning Dirty Data. https://CRAN.R-project.org/package=janitor.\n\n\nFisher, Ronald. (1925) 1928. Statistical Methods for Research\nWorkers. 2nd ed. London: Oliver & Boyd.\n\n\n———. (1935) 1949. The Design of Experiments. 5th ed. London:\nOliver & Boyd.\n\n\nFiske, Susan, and Shiro Kuriwaki. 2021. “Words to the Wise on\nWriting Scientific Papers,” November. https://doi.org/10.31234/osf.io/n32qw.\n\n\nFitts, Alexis Sobel. 2014. “The King of Content: How Upworthy Aims\nto Alter the Web, and Could End up Altering the World.”\nColumbia Journalism Review 53: 34–38. https://archives.cjr.org/feature/the%5Fking%5Fof%5Fcontent.php.\n\n\nFlake, Jessica, and Eiko Fried. 2020. “Measurement Schmeasurement:\nQuestionable Measurement Practices and How to Avoid Them.”\nAdvances in Methods and Practices in Psychological Science 3\n(4): 456–65. https://doi.org/10.1177/2515245920952393.\n\n\nFlynn, Michael. 2022. troopdata: Tools for\nAnalyzing Cross-National Military Deployment and Basing\nData. 
https://CRAN.R-project.org/package=troopdata.\n\n\nFord, Paul. 2015. “What Is Code?” Bloomberg\nBusinessweek, June. https://www.bloomberg.com/graphics/2015-paul-ford-what-is-code/.\n\n\nForster, Edward Morgan. 1927. Aspects of the Novel. London:\nEdward Arnold.\n\n\nFoster, Gordon. 1968. “Computers, Statistics and Planning: Systems\nor Chaos?” Geary Lecture. https://www.esri.ie/system/files/publications/GLS2.pdf.\n\n\nFourcade, Marion, and Kieran Healy. 2017. “Seeing Like a\nMarket.” Socio-Economic Review 15 (1): 9–29. https://doi.org/10.1093/ser/mww033.\n\n\nFowler, Martin, and Kent Beck. 2018. Refactoring: Improving the Design of Existing\nCode. 2nd ed. New York: Addison-Wesley Professional.\n\n\nFox, John, and Robert Andersen. 2006. “Effect Displays for\nMultinomial and Proportional-Odds Logit Models.” Sociological\nMethodology 36 (1): 225–55. https://doi.org/10.1111/j.1467-9531.2006.00180.\n\n\nFox, John, Sanford Weisberg, and Brad Price. 2022. carData:\nCompanion to Applied Regression Data Sets. https://CRAN.R-project.org/package=carData.\n\n\nFranconeri, Steven, Lace Padilla, Priti Shah, Jeffrey Zacks, and Jessica\nHullman. 2021. “The Science of Visual Data Communication: What\nWorks.” Psychological Science in the Public Interest 22\n(3): 110–61. https://doi.org/10.1177/15291006211051956.\n\n\nFrandell, Ashlee, Mary Feeney, Timothy Johnson, Eric Welch, Lesley\nMichalegko, and Heyjie Jung. 2021. “The Effects of Electronic\nAlert Letters for Internet Surveys of Academic Scientists.”\nScientometrics 126 (8): 7167–81. https://doi.org/10.1007/s11192-021-04029-3.\n\n\nFranklin, Laura. 2005. “Exploratory Experiments.”\nPhilosophy of Science 72 (5): 888–99. https://doi.org/10.1086/508117.\n\n\nFrei, Christoph, and Liam Welsh. 2022. “How\nthe Closure of a U.S. Tax Loophole May Affect Investor\nPortfolios.” Journal of Risk and Financial\nManagement 15 (5): 209. 
https://doi.org/10.3390/jrfm15050209.\n\n\nFrick, Hannah, Fanny Chow, Max Kuhn, Michael Mahoney, Julia Silge, and\nHadley Wickham. 2022. rsample: General\nResampling Infrastructure. https://CRAN.R-project.org/package=rsample.\n\n\nFried, Eiko, Jessica Flake, and Donald Robinaugh. 2022.\n“Revisiting the Theoretical and Methodological Foundations of\nDepression Measurement.” Nature Reviews Psychology 1\n(6): 358–68. https://doi.org/10.1038/s44159-022-00050-2.\n\n\nFriedman, Jerome, Robert Tibshirani, and Trevor Hastie. 2009. The\nElements of Statistical Learning. 2nd ed. Springer. https://hastie.su.domains/ElemStatLearn/.\n\n\nFriendly, Michael. 2021. HistData: Data Sets from the History of\nStatistics and Data Visualization. https://CRAN.R-project.org/package=HistData.\n\n\nFriendly, Michael, and Howard Wainer. 2021. A History of Data\nVisualization and Graphic Communication. 1st ed. Massachusetts:\nHarvard University Press.\n\n\nFry, Hannah. 2020. “Big Tech Is Testing You.” The New\nYorker, February, 61–65. https://www.newyorker.com/magazine/2020/03/02/big-tech-is-testing-you.\n\n\nFryzlewicz, Piotr. 2024. “Telling Stories\nwith Data: With Applications in R.” The American\nStatistician, April, 1–5. https://doi.org/10.1080/00031305.2024.2339562.\n\n\nFuller, Mark, and James Mosher. 1987. “Raptor Survey\nTechniques.” In Raptor Management Techniques Manual,\nedited by Beth Pendleton, Brian Millsap, Keith Cline, and David Bird,\n37–65. National Wildlife Federation. https://www.sandiegocounty.gov/content/dam/sdc/pds/ceqa/JVR/AdminRecord/IncorporatedByReference/Appendices/Appendix-D---Biological-Resources-Report/Fuller%20and%20Mosher%201987.pdf.\n\n\nFunkhouser, Gray. 1937. “Historical Development of the Graphical\nRepresentation of Statistical Data.” Osiris 3: 269–404.\nhttps://doi.org/10.1086/368480.\n\n\nGagolewski, Marek. 2022. “stringi:\nFast and Portable Character String Processing in\nR.” Journal of Statistical Software 103\n(2): 1–59. 
https://doi.org/10.18637/jss.v103.i02.\n\n\nGalef, Julia. 2020. “Episode 248: Are Democrats Being Irrational?\n(David Shor).” Rationally Speaking, December. http://rationallyspeakingpodcast.org/248-are-democrats-being-irrational-david-shor/.\n\n\nGao, Lucy, Jacob Bien, and Daniela Witten. 2022. “Selective\nInference for Hierarchical Clustering.” Journal of the\nAmerican Statistical Association, October, 1–11. https://doi.org/10.1080/01621459.2022.2116331.\n\n\nGao, Zheng, Christian Bird, and Earl T. Barr. 2017. “To Type or\nNot to Type: Quantifying Detectable Bugs in\nJavaScript.” In 2017\nIEEE/ACM 39th International Conference on\nSoftware Engineering (ICSE). IEEE. https://doi.org/10.1109/icse.2017.75.\n\n\nGarfinkel, Irwin, Lee Rainwater, and Timothy Smeeding. 2006. “A\nRe-Examination of Welfare States and Inequality in Rich Nations: How\nin-Kind Transfers and Indirect Taxes Change the Story.”\nJournal of Policy Analysis and Management 25 (4): 897–919. https://doi.org/10.1002/pam.20213.\n\n\nGargiulo, Maria. 2022. “Statistical Biases, Measurement\nChallenges, and Recommendations for Studying Patterns of Femicide in\nConflict.” Peace Review 34 (2): 163–76. https://doi.org/10.1080/10402659.2022.2049002.\n\n\nGarnier, Simon, Noam Ross, Robert Rudis, Antônio Camargo, Marco Sciaini,\nand Cédric Scherer. 2021. viridis –\nColorblind-Friendly Color Maps for R. https://doi.org/10.5281/zenodo.4679424.\n\n\nGazeley, Ursula, Georges Reniers, Hallie Eilerts-Spinelli, Julio Romero\nPrieto, Momodou Jasseh, Sammy Khagayi, and Veronique Filippi. 2022.\n“Women’s Risk of Death Beyond 42 Days Post Partum: A Pooled\nAnalysis of Longitudinal Health and Demographic Surveillance System Data\nin Sub-Saharan Africa.” The Lancet Global Health 10\n(11): e1582–89. https://doi.org/10.1016/s2214-109x(22)00339-4.\n\n\nGebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman\nVaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 
2021.\n“Datasheets for Datasets.” Communications of the\nACM 64 (12): 86–92. https://doi.org/10.1145/3458723.\n\n\nGelfand, Sharla. 2021. “Make a ReprEx... Please.”\nYouTube, February. https://youtu.be/G5Nm-GpmrLw.\n\n\n———. 2022a. Astrologer: Chani Nicholas Weekly Horoscopes\n(2013-2017). http://github.com/sharlagelfand/astrologer.\n\n\n———. 2022b. opendatatoronto: Access the City of\nToronto Open Data Portal. https://CRAN.R-project.org/package=opendatatoronto.\n\n\nGelman, Andrew. 2016. “What has happened down\nhere is the winds have changed,” September. https://statmodeling.stat.columbia.edu/2016/09/21/what-has-happened-down-here-is-the-winds-have-changed/.\n\n\n———. 2019. “Another Regression Discontinuity Disaster and What Can\nWe Learn from It,” June. https://statmodeling.stat.columbia.edu/2019/06/25/another-regression-discontinuity-disaster-and-what-can-we-learn-from-it/.\n\n\n———. 2020. “Statistical Models of Election Outcomes.”\nYouTube, August. https://youtu.be/7gjDnrbLQ4k.\n\n\nGelman, Andrew, John Carlin, Hal Stern, David Dunson, Aki Vehtari, and\nDonald Rubin. (1995) 2014. Bayesian Data Analysis. 3rd ed.\nChapman & Hall/CRC.\n\n\nGelman, Andrew, Sharad Goel, Douglas Rivers, and David Rothschild. 2016.\n“The Mythical Swing Voter.” Quarterly Journal of\nPolitical Science 11 (1): 103–30. https://doi.org/10.1561/100.00015031.\n\n\nGelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using\nRegression and Multilevel/Hierarchical Models. 1st ed. Cambridge\nUniversity Press.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and\nOther Stories. Cambridge University Press. https://avehtari.github.io/ROS-Examples/.\n\n\nGelman, Andrew, and Guido Imbens. 2019. “Why High-Order\nPolynomials Should Not Be Used in Regression Discontinuity\nDesigns.” Journal of Business & Economic Statistics\n37 (3): 447–56. https://doi.org/10.1080/07350015.2017.1366909.\n\n\nGelman, Andrew, and Eric Loken. 2013. 
“The Garden of Forking\nPaths: Why Multiple Comparisons Can Be a Problem, Even When There Is No\n‘Fishing Expedition’ or ‘p-Hacking’ and the\nResearch Hypothesis Was Posited Ahead of Time.” Department of\nStatistics, Columbia University. http://www.stat.columbia.edu/~gelman/research/unpublished/p%5Fhacking.pdf.\n\n\nGelman, Andrew, Greggor Mattson, and Daniel Simpson. 2018. “Gaydar\nand the Fallacy of Decontextualized Measurement.”\nSociological Science 5 (12): 270–80. https://doi.org/10.15195/v5.a12.\n\n\nGelman, Andrew, Cristian Pasarica, and Rahul Dodhia. 2002. “Let’s\nPractice What We Preach: Turning Tables into Graphs.” The\nAmerican Statistician 56 (2): 121–30. https://doi.org/10.1198/000313002317572790.\n\n\nGelman, Andrew, and Aki Vehtari. 2021. “What Are the Most\nImportant Statistical Ideas of the Past 50 Years?” Journal of\nthe American Statistical Association 116 (536): 2087–97. https://doi.org/10.1080/01621459.2021.1938081.\n\n\n———. 2023. Learn Statistics: Hundreds of Stories, Activities, and\nExamples.\n\n\nGelman, Andrew, Aki Vehtari, Daniel Simpson, Charles Margossian, Bob\nCarpenter, Yuling Yao, Lauren Kennedy, Jonah Gabry, Paul-Christian\nBürkner, and Martin Modrák. 2020. “Bayesian Workflow.”\narXiv. https://doi.org/10.48550/arXiv.2011.01808.\n\n\nGentemann, Chelle Leigh, Chris Holdgraf, Ryan Abernathey, Daniel\nCrichton, James Colliander, Edward Joseph Kearns, Yuvi Panda, and\nRichard Signell. 2021. “Science Storms the Cloud.”\nAGU Advances 2 (2). https://doi.org/10.1029/2020av000354.\n\n\nGerber, Alan, and Donald Green. 2012. Field Experiments: Design,\nAnalysis, and Interpretation. New York: WW Norton.\n\n\nGerring, John. 2012. “Mere Description.” British\nJournal of Political Science 42 (4): 721–46. https://doi.org/10.1017/s0007123412000130.\n\n\nGertler, Paul, Sebastian Martinez, Patrick Premand, Laura Rawlings, and\nChristel Vermeersch. 2016. Impact Evaluation in Practice. 2nd\ned. The World Bank. 
https://doi.org/10.1596/978-1-4648-0779-4.\n\n\nGeuenich, Michael, Jinyu Hou, Sunyun Lee, Shanza Ayub, Hartland Jackson,\nand Kieran Campbell. 2021a. “Automated Assignment of Cell Identity\nfrom Single-Cell Multiplexed Imaging and Proteomic Data.”\nCell Systems 12 (12): 1173–86. https://doi.org/10.1016/j.cels.2021.08.012.\n\n\n———. 2021b. “Replication Materials: \"Automated Assignment of Cell\nIdentity from Single-Cell Multiplexed Imaging and Proteomic\nData\".” https://doi.org/10.5281/ZENODO.5156049.\n\n\nGhitza, Yair, and Andrew Gelman. 2020. “Voter Registration\nDatabases and MRP: Toward the Use of Large-Scale Databases in Public\nOpinion Research.” Political Analysis 28 (4): 507–31. https://doi.org/10.1017/pan.2020.3.\n\n\nGibney, Elizabeth. 2022. “The leap second’s\ntime is up: world votes to stop pausing clocks.”\nNature 612 (7938): 18–18. https://doi.org/10.1038/d41586-022-03783-5.\n\n\nGleick, James. 1990. “The Census: Why We Can’t Count.”\nThe New York Times, July. https://www.nytimes.com/1990/07/15/magazine/the-census-why-we-can-t-count.html.\n\n\nGodfrey, Ernest. 1918. “History and Development of Statistics in\nCanada.” In The History of Statistics–Their Development and\nProgress in Many Countries. New York: Macmillan, edited by John\nKoren, 179–98. Macmillan Company of New York.\n\n\nGoodman, Leo. 1961. “Snowball Sampling.” The Annals of\nMathematical Statistics 32 (1): 148–70. https://doi.org/10.1214/aoms/1177705148.\n\n\nGoodrich, Ben, Jonah Gabry, Imad Ali, and Sam Brilleman. 2023.\n“rstanarm: Bayesian applied\nregression modeling via Stan.” https://mc-stan.org/rstanarm.\n\n\nGoogle. 2022. “What to Look for in a Code Review.” Google\nEngineering Practices Documentation. https://google.github.io/eng-practices/review/reviewer/looking-for.html.\n\n\nGordon, Brett, Robert Moakler, and Florian Zettelmeyer. 2022.\n“Close Enough? A Large-Scale Exploration of Non-Experimental\nApproaches to Advertising Measurement.” Marketing\nScience, November. 
https://doi.org/10.1287/mksc.2022.1413.\n\n\nGordon, Brett, Florian Zettelmeyer, Neha Bhargava, and Dan Chapsky.\n2019. “A Comparison of Approaches to Advertising Measurement:\nEvidence from Big Field Experiments at Facebook.” Marketing\nScience 38 (2): 193–225. https://doi.org/10.1287/mksc.2018.1135.\n\n\nGould, Elliot, Hannah Fraser, Timothy Parker, Shinichi Nakagawa, Simon\nGriffith, Peter Vesk, and Fiona Fidler. 2023. “Same Data,\nDifferent Analysts: Variation in Effect Sizes Due to Analytical\nDecisions in Ecology and Evolutionary Biology,” October. https://doi.org/10.32942/x2gg62.\n\n\nGraham, Paul. 2020. “How to Write Usefully,” February. http://paulgraham.com/useful.html.\n\n\nGray, Charles T., and Ben Marwick. 2019. “Truth, Proof, and\nReproducibility: There’s No Counter-Attack for the Codeless.” In\nCommunications in Computer and Information Science, 111–29.\nSpringer Singapore. https://doi.org/10.1007/978-981-15-1960-4_8.\n\n\nGreen, Donald, Terence Leong, Holger Kern, Alan Gerber, and Christopher\nLarimer. 2009. “Testing the Accuracy of Regression Discontinuity\nAnalysis Using Experimental Benchmarks.” Political\nAnalysis 17 (4): 400–417. https://doi.org/10.1093/pan/mpp018.\n\n\nGreen, Eric. 2020. “Nivi Research: Mister P\nhelps us understand vaccine hesitancy,” December. https://research.nivi.io/posts/2020-12-08-mister-p-helps-us-understand-vaccine-hesitancy/.\n\n\nGreenberg, Bernard, Abdel-Latif Abul-Ela, Walt Simmons, and Daniel\nHorvitz. 1969. “The Unrelated Question Randomized Response Model:\nTheoretical Framework.” Journal of the American Statistical\nAssociation 64 (326): 520–39. https://doi.org/10.1080/01621459.1969.10500991.\n\n\nGreenland, Sander, Stephen Senn, Kenneth Rothman, John Carlin, Charles\nPoole, Steven Goodman, and Douglas Altman. 2016. “Statistical Tests, P values, Confidence Intervals, and\nPower: A Guide to Misinterpretations.” European\nJournal of Epidemiology 31 (4): 337–50. 
https://doi.org/10.1007/s10654-016-0149-3.\n\n\nGreifer, Noah. 2021. “Why Do We Do Matching for Causal Inference\nVs Regressing on Confounders?” Cross Validated,\nSeptember. https://stats.stackexchange.com/q/544958.\n\n\nGrimmer, Justin, Margaret Roberts, and Brandon Stewart. 2022. Text As Data: A New Framework for Machine Learning and\nthe Social Sciences. New Jersey: Princeton University Press.\n\n\nGrolemund, Garrett, and Hadley Wickham. 2011. “Dates and Times\nMade Easy with lubridate.”\nJournal of Statistical Software 40 (3): 1–25. https://doi.org/10.18637/jss.v040.i03.\n\n\nGronsbell, Jessica, Jessica Minnier, Sheng Yu, Katherine Liao, and\nTianxi Cai. 2019. “Automated Feature Selection of Predictors in\nElectronic Medical Records Data.” Biometrics 75 (1):\n268–77. https://doi.org/10.1111/biom.12987.\n\n\nGroves, Robert. 2011. “Three Eras of Survey Research.”\nPublic Opinion Quarterly 75 (5): 861–71. https://doi.org/10.1093/poq/nfr057.\n\n\nGroves, Robert, and Lars Lyberg. 2010. “Total\nSurvey Error: Past, Present, and Future.” Public\nOpinion Quarterly 74 (5): 849–79. https://doi.org/10.1093/poq/nfq065.\n\n\nGrün, Bettina, and Kurt Hornik. 2011. “topicmodels: An R Package for Fitting\nTopic Models.” Journal of Statistical Software 40 (13):\n1–30. https://doi.org/10.18637/jss.v040.i13.\n\n\nGustafsson, Karl, and Linus Hagström. 2017. “What Is the Point?\nTeaching Graduate Students How to Construct Political Science Research\nPuzzles.” European Political Science 17 (4): 634–48. https://doi.org/10.1057/s41304-017-0130-y.\n\n\nGutman, Robert. 1958. “Birth and Death Registration in\nMassachusetts: II. The Inauguration of a Modern System,\n1800-1849.” The Milbank Memorial Fund Quarterly 36 (4):\n373–402.\n\n\nHackett, Robert. 2016. “Researchers Caused an\nUproar By Publishing Data From 70,000 OkCupid Users.”\nFortune, May. https://fortune.com/2016/05/18/okcupid-data-research/.\n\n\nHalberstam, David. 1972. The Best and the\nBrightest. 1st ed. 
New York: Random House.\n\n\nHamming, Richard. (1997) 2020. The Art of Doing\nScience and Engineering. 2nd ed. Stripe Press.\n\n\nHammond, Jennifer, Heidi Leister-Tebbe, Annie Gardner, Paula Abreu,\nWeihang Bao, Wayne Wisemandle, MaryLynn Baniecki, et al. 2022.\n“Oral Nirmatrelvir for High-Risk, Nonhospitalized Adults with\nCovid-19.” New England Journal of Medicine 386 (15):\n1397–1408. https://doi.org/10.1056/nejmoa2118542.\n\n\nHand, David. 2018. “Statistical Challenges of Administrative and\nTransaction Data.” Journal of the Royal Statistical Society:\nSeries A (Statistics in Society) 181 (3): 555–605. https://doi.org/10.1111/rssa.12315.\n\n\nHandcock, Mark, and Krista Gile. 2011. “Comment: On the Concept of\nSnowball Sampling.” Sociological Methodology 41 (1):\n367–71. https://doi.org/10.1111/j.1467-9531.2011.01243.x.\n\n\nHangartner, Dominik, Daniel Kopp, and Michael Siegenthaler. 2021.\n“Monitoring Hiring Discrimination Through Online Recruitment\nPlatforms.” Nature 589 (7843): 572–76. https://doi.org/10.1038/s41586-020-03136-0.\n\n\nHanretty, Chris. 2020. “An Introduction to Multilevel Regression\nand Post-Stratification for Estimating Constituency Opinion.”\nPolitical Studies Review 18 (4): 630–45. https://doi.org/10.1177/1478929919864773.\n\n\nHao, Karen. 2019. “This is How AI Bias Really\nHappens—And Why It’s So Hard To Fix.” MIT Technology\nReview, February. https://www.technologyreview.com/2019/02/04/137602/this-is-how-ai-bias-really-happensand-why-its-so-hard-to-fix/.\n\n\nHart, Edmund, Pauline Barmby, David LeBauer, François Michonneau, Sarah\nMount, Patrick Mulrooney, Timothée Poisot, Kara Woo, Naupaka Zimmerman,\nand Jeffrey Hollister. 2016. “Ten Simple Rules for Digital Data\nStorage.” PLOS Computational Biology 12\n(10): e1005097. https://doi.org/10.1371/journal.pcbi.1005097.\n\n\nHartocollis, Anemona. 2022. “U.S. News Ranked\nColumbia No. 2, but a Math Professor Has His Doubts.”\nThe New York Times, March. 
https://www.nytimes.com/2022/03/17/us/columbia-university-rank.html.\n\n\nHassan, Mai. 2022. “New Insights on Africa’s Autocratic\nPast.” African Affairs 121 (483): 321–33. https://doi.org/10.1093/afraf/adac002.\n\n\nHastie, Trevor, and Robert Tibshirani. 1990. Generalized Additive\nModels. 1st ed. Boca Raton: Chapman & Hall/CRC.\n\n\nHawes, Michael. 2020. “Implementing Differential\nPrivacy: Seven Lessons From the\n2020 United States\nCensus.” Harvard Data Science Review 2 (2).\nhttps://doi.org/10.1162/99608f92.353c6f99.\n\n\nHayot, Eric. 2014. The Elements of Academic Style. New York:\nColumbia University Press.\n\n\nHealy, Kieran. 2018. Data Visualization. New Jersey: Princeton\nUniversity Press. https://socviz.co.\n\n\n———. 2020. “The Kitchen Counter Observatory,” May. https://kieranhealy.org/blog/archives/2020/05/21/the-kitchen-counter-observatory/.\n\n\n———. 2022. “Unhappy in Its Own Way,” July. https://kieranhealy.org/blog/archives/2022/07/22/unhappy-in-its-own-way/.\n\n\nHeckathorn, Douglas. 1997. “Respondent-Driven Sampling: A New\nApproach to the Study of Hidden Populations.” Social\nProblems 44 (2): 174–99. https://doi.org/10.2307/3096941.\n\n\nHeil, Benjamin, Michael Hoffman, Florian Markowetz, Su-In Lee, Casey\nGreene, and Stephanie Hicks. 2021. “Reproducibility Standards for\nMachine Learning in the Life Sciences.” Nature Methods\n18 (10): 1132–35. https://doi.org/10.1038/s41592-021-01256-7.\n\n\nHeller, Jean. 2022. “AP Exposes the Tuskegee Syphilis Study: The\n50th Anniversary.” AP, July. https://apnews.com/article/tuskegee-study-ap-story-investigation-syphilis-53403657e77d76f52df6c2e2892788c9.\n\n\nHermans, Felienne. 2017. “Peter Hilton on Naming.” IEEE\nSoftware 34 (3): 117–20. https://doi.org/10.1109/MS.2017.81.\n\n\n———. 2021. The Programmer’s Brain: What Every Programmer Needs to\nKnow about Cognition. 1st ed. New York: Simon & Schuster. https://www.manning.com/books/the-programmers-brain.\n\n\nHernán, Miguel, David Clayton, and Niels Keiding. 
2011. “The\nSimpson’s Paradox Unraveled.” International Journal of\nEpidemiology 40 (3): 780–85. https://doi.org/10.1093/ije/dyr041.\n\n\nHernán, Miguel, and James Robins. 2023. What If. 1st ed. Boca\nRaton: Chapman & Hall/CRC. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/.\n\n\nHerndon, Thomas, Michael Ash, and Robert Pollin. 2014. “Does High\nPublic Debt Consistently Stifle Economic Growth? A Critique of Reinhart\nand Rogoff.” Cambridge Journal of Economics 38 (2):\n257–79. https://doi.org/10.1093/cje/bet075.\n\n\nHester, Jim, Florent Angly, Russ Hyde, Michael Chirico, Kun Ren,\nAlexander Rosenstock, and Indrajeet Patil. 2022. lintr: A “Linter” for R Code. https://CRAN.R-project.org/package=lintr.\n\n\nHester, Jim, Hadley Wickham, and Gábor Csárdi. 2021. fs: Cross-Platform File System Operations Based on\n“libuv”. https://CRAN.R-project.org/package=fs.\n\n\nHill, Austin Bradford. 1965. “The Environment and Disease:\nAssociation or Causation?” Proceedings of the Royal Society\nof Medicine 58 (5): 295–300.\n\n\nHillel, Wayne. 2017. How Do We Trust Our Science Code? https://www.hillelwayne.com/how-do-we-trust-science-code/.\n\n\nHo, Daniel, Kosuke Imai, Gary King, and Elizabeth Stuart. 2011.\n“MatchIt: Nonparametric Preprocessing for Parametric\nCausal Inference.” Journal of Statistical Software 42\n(8): 1–28. https://doi.org/10.18637/jss.v042.i08.\n\n\nHodgetts, Paul. 2022. “The Negative Space of Data,” March.\nhttps://hodgettsp.netlify.app/post/data-negativespace/.\n\n\nHofmeister, Johannes, Janet Siegmund, and Daniel Holt. 2017.\n“Shorter Identifier Names Take Longer to Comprehend.” In\n2017 IEEE 24th International Conference on Software Analysis,\nEvolution and Reengineering (SANER), 217–27. https://doi.org/10.1109/saner.2017.7884623.\n\n\nHolland, Paul. 1986. “Statistics and Causal Inference.”\nJournal of the American Statistical Association 81 (396):\n945–60. 
https://doi.org/10.2307/2289064.\n\n\nHolliday, Derek, Tyler Reny, Alex Rossell Hayes, Aaron Rudkin, Chris\nTausanovitch, and Lynn Vavreck. 2021. “Democracy Fund + UCLA Nationscape Methodology and\nRepresentativeness Assessment.”\n\n\nHopper, Nate. 2022. “The Thorny Problem of Keeping the Internet’s\nTime.” The New Yorker, September. https://www.newyorker.com/tech/annals-of-technology/the-thorny-problem-of-keeping-the-internets-time.\n\n\nHorst, Allison Marie, Alison Presmanes Hill, and Kristen Gorman. 2020.\npalmerpenguins: Palmer Archipelago (Antarctica)\npenguin data. https://doi.org/10.5281/zenodo.3960218.\n\n\nHorton, Nicholas, Rohan Alexander, Micaela Parker, Aneta Piekut, and\nColin Rundel. 2022. “The Growing Importance of Reproducibility and\nResponsible Workflow in the Data Science and Statistics\nCurriculum.” Journal of Statistics and Data Science\nEducation 30 (3): 207–8. https://doi.org/10.1080/26939169.2022.2141001.\n\n\nHorton, Nicholas, and Stuart Lipsitz. 2001. “Multiple Imputation\nin Practice.” The American Statistician 55 (3): 244–54.\nhttps://doi.org/10.1198/000313001317098266.\n\n\nHotz, Joseph, Christopher Bollinger, Tatiana Komarova, Charles Manski,\nRobert Moffitt, Denis Nekipelov, Aaron Sojourner, and Bruce Spencer.\n2022. “Balancing Data Privacy and Usability in the Federal\nStatistical System.” Proceedings of the National Academy of\nSciences 119 (31): 1–10. https://doi.org/10.1073/pnas.2104906119.\n\n\nHowes, Adam. 2022. “Representing Uncertainty Using Significant\nFigures,” April. https://athowes.github.io/posts/2022-04-24-representing-uncertainty-using-significant-figures/.\n\n\nHug, Lucia, Monica Alexander, Danzhen You, Leontine Alkema, and UN\nInter-agency Group for Child. 2019. “National, Regional, and\nGlobal Levels and Trends in Neonatal Mortality Between 1990 and 2017,\nwith Scenario-Based Projections to 2030: A Systematic Analysis.”\nLancet Global Health 7 (6): e710–20. 
https://doi.org/10.1016/S2214-109X(19)30163-9.\n\n\nHughes, Nicola, and Jill Rutter. 2016. “Ministers Reflect:\nInterview with Oliver Letwin,” December. https://www.instituteforgovernment.org.uk/ministers-reflect/person/oliver-letwin/.\n\n\nHulley, Stephen, Steven Cummings, Warren Browner, Deborah Grady, and\nThomas Newman. 2007. Designing Clinical Research. 3rd ed.\nLippincott Williams & Wilkins.\n\n\nHullman, Jessica, and Andrew Gelman. 2021. “Designing for\nInteractive Exploratory Data Analysis Requires Theories of Graphical\nInference.” Harvard Data Science Review 3 (3). https://doi.org/10.1162/99608f92.3ab8a587.\n\n\nHuntington-Klein, Nick. 2021. The Effect: An Introduction to\nResearch Design and Causality. 1st ed. Chapman & Hall. https://theeffectbook.net.\n\n\n———. 2022. “Library of Statistical Techniques.” https://lost-stats.github.io.\n\n\nHuntington-Klein, Nick, Andreu Arenas, Emily Beam, Marco Bertoni,\nJeffrey Bloem, Pralhad Burli, Naibin Chen, et al. 2021. “The\nInfluence of Hidden Researcher Decisions in Applied\nMicroeconomics.” Economic Inquiry 59: 944–60. https://doi.org/10.1111/ecin.12992.\n\n\nHuyen, Chip. 2020. “Machine Learning Is Going Real-Time,”\nDecember. https://huyenchip.com/2020/12/27/real-time-machine-learning.html.\n\n\nHvitfeldt, Emil, and Julia Silge. 2021. Supervised Machine Learning for Text Analysis in\nR. 1st ed. Chapman & Hall/CRC. https://doi.org/10.1201/9781003093459.\n\n\nHyman, Michael, Luca Sartore, and Linda J Young. 2021. “Capture-Recapture Estimation of Characteristics of U.S.\nLocal Food Farms Using a Web-Scraped List Frame.”\nJournal of Survey Statistics and Methodology 10 (4): 979–1004.\nhttps://doi.org/10.1093/jssam/smab008.\n\n\nHyndman, Rob, Timothy Hyndman, Charles Gray, Sayani Gupta, and Jacquie\nTran. 2022. cricketdata: International Cricket\nData. https://CRAN.R-project.org/package=cricketdata.\n\n\nIannone, Richard. 2022. DiagrammeR: Graph/Network\nVisualization. 
https://CRAN.R-project.org/package=DiagrammeR.\n\n\nIannone, Richard, Joe Cheng, Barret Schloerke, Ellis Hughes, Alexandra\nLauer, and JooYoung Seo. 2022. gt: Easily\nCreate Presentation-Ready Display Tables.\n\n\nIannone, Richard, and Mauricio Vargas. 2022. pointblank: Data Validation and Organization of Metadata\nfor Local and Remote Tables. https://CRAN.R-project.org/package=pointblank.\n\n\nInternational Organization Of Legal Metrology. 2007. International\nVocabulary of Metrology – Basic and General Concepts and Associated\nTerms. 3rd ed. https://www.oiml.org/en/files/pdf%5Fv/v002-200-e07.pdf.\n\n\nIoannidis, John. 2005. “Why Most Published Research Findings Are\nFalse.” PLOS Medicine 2 (8): e124. https://doi.org/10.1371/journal.pmed.0020124.\n\n\nIrizarry, Rafael. 2020. “The Role of Academia\nin Data Science Education.” Harvard Data Science\nReview 2 (1). https://doi.org/10.1162/99608f92.dd363929.\n\n\nIrving, Damien, Kate Hertweck, Luke Johnston, Joel Ostblom, Charlotte\nWickham, and Greg Wilson. 2021. Research Software Engineering with\nPython. Chapman & Hall/CRC.\n\n\nIsaacson, Walter. 2011. Steve Jobs. 1st ed. Simon &\nSchuster.\n\n\nIshiguro, Kazuo. 1989. The Remains of the Day. 1st ed. Faber &\nFaber.\n\n\nIzrailev, Sergei. 2022. tictoc: Functions for\nTiming R Scripts, as Well as Implementations of “Stack” and\n“List” Structures. https://CRAN.R-project.org/package=tictoc.\n\n\nJames, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani.\n(2013) 2021. An Introduction to Statistical\nLearning with Applications in R. 2nd ed. Springer. https://www.statlearning.com.\n\n\nJenkins, Jennifer, Steven Rich, Andrew Ba Tran, Paige Moody, Julie Tate,\nand Ted Mellnik. 2022. “How the Washington Post Examines Police\nShootings in the United States.” https://www.washingtonpost.com/investigations/2022/12/05/washington-post-fatal-police-shootings-methodology/.\n\n\nJet Propulsion Laboratory. 2009. 
“JPL\nInstitutional Coding Standard for the C Programming\nLanguage.” Document Number D-60411, March. https://web.archive.org/web/20111015064908/http://lars-lab.jpl.nasa.gov/JPL_Coding_Standard_C.pdf.\n\n\nJohnson, Alicia, Miles Ott, and Mine Dogucu. 2022. Bayes Rules! An Introduction to Bayesian Modeling with\nR. 1st ed. Chapman; Hall/CRC. https://www.bayesrulesbook.com.\n\n\nJohnson, Kaneesha. 2021. “Two Regimes of Prison Data\nCollection.” Harvard Data Science Review 3 (3). https://doi.org/10.1162/99608f92.72825001.\n\n\nJohnston, Myfanwy, and David Robinson. 2022. gutenbergr: Download and Process Public Domain Works from\nProject Gutenberg. https://CRAN.R-project.org/package=gutenbergr.\n\n\nJones, Arnold. 1953. “Census Records of the Later Roman\nEmpire.” The Journal of Roman Studies 43: 49–64. https://doi.org/10.2307/297781.\n\n\nJordan, Michael. 2004. “Graphical Models.” Statistical\nScience 19 (1). https://doi.org/10.1214/088342304000000026.\n\n\n———. 2019. “Artificial Intelligence–The\nRevolution Hasn’t Happened Yet.” Harvard Data Science\nReview 1 (1). https://doi.org/10.1162/99608f92.f06c6e61.\n\n\nJoyner, Michael. 1991. “Modeling: Optimal Marathon Performance on\nthe Basis of Physiological Factors.” Journal of Applied\nPhysiology 70 (2): 683–87. https://doi.org/10.1152/jappl.1991.70.2.683.\n\n\nJurafsky, Dan, and James Martin. (2000) 2023. Speech and Language\nProcessing. 3rd ed. https://web.stanford.edu/~jurafsky/slp3/.\n\n\nKahan, Brennan, Suzie Cro, Fan Li, and Michael Harhay. 2023.\n“Eliminating Ambiguous Treatment Effects Using Estimands.”\nAmerican Journal of Epidemiology, February. https://doi.org/10.1093/aje/kwad036.\n\n\nKahan, Brennan, Joanna Hindley, Mark Edwards, Suzie Cro, and Tim Morris.\n2024. “The estimands framework: a primer on\nthe ICH E9(R1) addendum.” BMJ, January, e076316.\nhttps://doi.org/10.1136/bmj-2023-076316.\n\n\nKahan, Brennan, Fan Li, Andrew Copas, and Michael Harhay. 
2022.\n“Estimands in Cluster-Randomized Trials: Choosing Analyses That\nAnswer the Right Question.” International Journal of\nEpidemiology, July. https://doi.org/10.1093/ije/dyac131.\n\n\nKahle, David, and Hadley Wickham. 2013. “ggmap: Spatial Visualization with ggplot2.”\nThe R Journal 5 (1): 144–61. http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf.\n\n\nKahneman, Daniel, Olivier Sibony, and Cass Sunstein. 2021. Noise: A\nFlaw in Human Judgment. William Collins.\n\n\nKalamara, Eleni, Arthur Turrell, Chris Redl, George Kapetanios, and\nSujit Kapadia. 2022. “Making text count:\nEconomic forecasting using newspaper text.”\nJournal of Applied Econometrics 37 (5): 896–919.\nhttps://doi.org/10.1002/jae.2907.\n\n\nKalgin, Alexander. 2014. “Implementation of\nPerformance Management in Regional Government in Russia: Evidence of\nData Manipulation.” Public Management Review 18\n(1): 110–38. https://doi.org/10.1080/14719037.2014.965271.\n\n\nKapoor, Sayash, and Arvind Narayanan. 2023. “Leakage and the\nReproducibility Crisis in Machine-Learning-Based Science.”\nPatterns 4 (9): 1–12. https://doi.org/10.1016/j.patter.2023.100804.\n\n\nKarsten, Karl. 1923. Charts and Graphs. New York:\nPrentice-Hall.\n\n\nKasy, Maximilian, and Alexander Teytelboym. 2023. “Matching with\nSemi-Bandits.” The Econometrics Journal 26 (1): 45–66.\nhttps://doi.org/10.1093/ectj/utac021.\n\n\nKatz, Lindsay, and Rohan Alexander. 2023a. “A\nnew, comprehensive database of all proceedings of the Australian\nParliamentary Debates (1998-2022).” Zenodo. https://doi.org/10.5281/zenodo.7799678.\n\n\n———. 2023b. “Digitization of the Australian Parliamentary Debates,\n1998–2022.” Scientific Data 10 (1): 1–14. https://doi.org/10.1038/s41597-023-02464-w.\n\n\nKay, Matthew. 2022. tidybayes: Tidy Data\nand Geoms for Bayesian Models. https://doi.org/10.5281/zenodo.1308151.\n\n\nKennedy, Lauren, and Jonah Gabry. 2020. “MRP\nwith rstanarm,” July. 
https://mc-stan.org/rstanarm/articles/mrp.html.\n\n\nKennedy, Lauren, and Andrew Gelman. 2021. “Know Your Population\nand Know Your Model: Using Model-Based Regression and Poststratification\nto Generalize Findings Beyond the Observed Sample.”\nPsychological Methods 26 (5): 547–58. https://doi.org/10.1037/met0000362.\n\n\nKennedy, Lauren, Katharine Khanna, Daniel Simpson, Andrew Gelman, Yajun\nJia, and Julien Teitler. 2022. “He, She, They: Using Sex and\nGender in Survey Adjustment.” https://arxiv.org/abs/2009.14401.\n\n\nKenny, Christopher T., Shiro Kuriwaki, Cory McCartan, Evan T. R.\nRosenman, Tyler Simko, and Kosuke Imai. 2021. “The use of differential privacy for census data and its\nimpact on redistricting: The case of the 2020 U.S.\nCensus.” Science Advances 7 (41). https://doi.org/10.1126/sciadv.abk3283.\n\n\n———. 2023. “Comment: The Essential Role of Policy Evaluation for\nthe 2020 Census Disclosure Avoidance System.” Harvard Data\nScience Review, no. Special Issue 2. https://doi.org/10.1162/99608f92.abc2c765.\n\n\nKent, William. 1993. “My Height: A Model for Numeric\nInformation.” https://www.bkent.net/Doc/myheight.htm.\n\n\nKeshav, Srinivasan. 2007. “How to Read a Paper.”\nACM SIGCOMM Computer Communication\nReview 37 (3): 83–84. https://doi.org/10.1145/1273445.1273458.\n\n\nKeyes, Os. 2019. “Counting the Countless.” Real\nLife. https://reallifemag.com/counting-the-countless/.\n\n\nKharecha, Pushker, and James Hansen. 2013. “Prevented Mortality\nand Greenhouse Gas Emissions from Historical and Projected Nuclear\nPower.” Environmental Science & Technology 47 (9):\n4889–95. https://doi.org/10.1021/es3051197.\n\n\nKiang, Mathew, Alexander Tsai, Monica Alexander, David Rehkopf, and\nSanjay Basu. 2021. “Racial/Ethnic Disparities in Opioid-Related\nMortality in the USA, 1999–2019: The Extreme Case of Washington\nDC.” Journal of Urban Health 98 (5): 589–95. https://doi.org/10.1007/s11524-021-00573-8.\n\n\nKing, Gary. 2006. 
“Publication, Publication.” PS:\nPolitical Science & Politics 39 (1): 119–25. https://doi.org/10.1017/S1049096506060252.\n\n\nKing, Gary, and Richard Nielsen. 2019. “Why Propensity Scores\nShould Not Be Used for Matching.” Political Analysis 27\n(4): 435–54. https://doi.org/10.1017/pan.2019.11.\n\n\nKing, Stephen. 2000. On Writing: A Memoir of the Craft. 1st ed.\nScribner.\n\n\nKirkegaard, Emil, and Julius Bjerrekær. 2016. “The OKCupid\nDataset: A Very Large Public Dataset of Dating Site Users.”\nOpen Differential Psychology, 1–10. https://doi.org/10.26775/ODP.2016.11.03.\n\n\nKish, Leslie. 1959. “Some Statistical Problems in Research\nDesign.” American Sociological Review 24 (3): 328–38. https://doi.org/10.2307/2089381.\n\n\nKleiber, Christian, and Achim Zeileis. 2008. Applied Econometrics\nwith R. New York: Springer-Verlag. https://CRAN.R-project.org/package=AER.\n\n\nKnuth, Donald. 1984. “Literate Programming.” The\nComputer Journal 27 (2): 97–111. https://doi.org/10.1093/comjnl/27.2.97.\n\n\n———. 1998. Art of Computer Programming, Volume 2: Seminumerical\nAlgorithms. 2nd ed.\n\n\nKnutson, Victoria, Serge Aleshin-Guendel, Ariel Karlinsky, William\nMsemburi, and Jon Wakefield. 2022. “Estimating Global and\nCountry-Specific Excess Mortality During the COVID-19 Pandemic,”\nMay. https://cdn.who.int/media/docs/default-source/world-health-data-platform/covid-19-excessmortality/covid-methods-paper-revision.pdf.\n\n\nKoenecke, Allison, Andrew Nam, Emily Lake, Joe Nudell, Minnie Quartey,\nZion Mengesha, Connor Toups, John Rickford, Dan Jurafsky, and Sharad\nGoel. 2020. “Racial Disparities in Automated Speech\nRecognition.” Proceedings of the National Academy of\nSciences 117 (14): 7684–89. https://doi.org/10.1073/pnas.1915768117.\n\n\nKoenecke, Allison, and Hal Varian. 2020. “Synthetic Data\nGeneration for Economists.” https://arxiv.org/abs/2011.01374.\n\n\nKoenker, Roger, and Achim Zeileis. 2009. 
“On Reproducible\nEconometric Research.” Journal of Applied Econometrics\n24 (5): 833–47. https://doi.org/10.1002/jae.1083.\n\n\nKoerner, Lisbet. 2000. Linnaeus: Nature and Nation. Cambridge:\nHarvard University Press.\n\n\nKohavi, Ron, Alex Deng, Brian Frasca, Roger Longbotham, Toby Walker, and\nYa Xu. 2012. “Trustworthy Online Controlled Experiments.”\nIn Proceedings of the 18th ACM SIGKDD\nInternational Conference on Knowledge Discovery and Data Mining -\nKDD 12, 1st ed. ACM Press.\nhttps://doi.org/10.1145/2339530.2339653.\n\n\nKohavi, Ron, Diane Tang, and Ya Xu. 2020. Trustworthy Online Controlled Experiments: A Practical\nGuide to A/B Testing. Cambridge University Press.\n\n\nKoitsalu, Marie, Martin Eklund, Jan Adolfsson, Henrik Grönberg, and\nYvonne Brandberg. 2018. “Effects of Pre-Notification, Invitation\nLength, Questionnaire Length and Reminder on Participation Rate: A\nQuasi-Randomised Controlled Trial.” BMC Medical Research\nMethodology 18 (3): 1–5. https://doi.org/10.1186/s12874-017-0467-5.\n\n\nKrantz, Sebastian. 2023. collapse: Advanced and\nFast Data Transformation. https://CRAN.R-project.org/package=collapse.\n\n\nKuhn, Max. 2022. tune: Tidy Tuning\nTools. https://CRAN.R-project.org/package=tune.\n\n\nKuhn, Max, and Hannah Frick. 2022. poissonreg:\nModel Wrappers for Poisson Regression. https://CRAN.R-project.org/package=poissonreg.\n\n\nKuhn, Max, and Davis Vaughan. 2022. parsnip: A\nCommon API to Modeling and Analysis Functions. https://CRAN.R-project.org/package=parsnip.\n\n\nKuhn, Max, Davis Vaughan, and Emil Hvitfeldt. 2022. yardstick: Tidy Characterizations of Model\nPerformance. https://CRAN.R-project.org/package=yardstick.\n\n\nKuhn, Max, and Hadley Wickham. 2020. tidymodels: a collection of packages for modeling and\nmachine learning using tidyverse principles. https://www.tidymodels.org.\n\n\n———. 2022. recipes: Preprocessing and Feature\nEngineering Steps for Modeling. 
https://CRAN.R-project.org/package=recipes.\n\n\nKuriwaki, Shiro, Will Beasley, and Thomas Leeper. 2023. dataverse: R Client for Dataverse 4+\nRepositories.\n\n\nKuznets, Simon, Lillian Epstein, and Elizabeth Jenks. 1941. National Income and Its Composition,\n1919-1938. National Bureau of Economic Research.\n\n\nLamott, Anne. 1994. Bird by Bird: Some Instructions on Writing and\nLife. Anchor Books.\n\n\nLandau, William Michael. 2021. “The targets R\nPackage: A Dynamic Make-Like Function-Oriented Pipeline Toolkit for\nReproducibility and High-Performance Computing.”\nJournal of Open Source Software 6 (57): 2959. https://doi.org/10.21105/joss.02959.\n\n\nLane, Nick. 2015. “The Unseen World: Reflections on Leeuwenhoek\n(1677) ‘Concerning Little Animals’.”\nPhilosophical Transactions of the Royal Society B: Biological\nSciences 370 (1666): 20140344. https://doi.org/10.1098/rstb.2014.0344.\n\n\nLaouenan, Morgane, Palaash Bhargava, Jean-Benoît Eyméoud, Olivier\nGergaud, Guillaume Plique, and Etienne Wasmer. 2022. “A Cross-Verified Database of Notable People,\n3500BC–2018AD.” Scientific Data 9 (290). https://doi.org/10.1038/s41597-022-01369-4.\n\n\nLarmarange, Joseph. 2023. labelled:\nManipulating Labelled Data. https://CRAN.R-project.org/package=labelled.\n\n\nLatour, Bruno. 1996. “On Actor-Network Theory: A Few\nClarifications.” Soziale Welt 47 (4): 369–81. http://www.jstor.org/stable/40878163.\n\n\nLauderdale, Benjamin, Delia Bailey, Jack Blumenau, and Douglas Rivers.\n2020. “Model-Based Pre-Election Polling for National and\nSub-National Outcomes in the US and UK.” International\nJournal of Forecasting 36 (2): 399–413. https://doi.org/10.1016/j.ijforecast.2019.05.012.\n\n\nLaver, Michael, Kenneth Benoit, and John Garry. 2003. “Extracting\nPolicy Positions from Political Texts Using Words as Data.”\nAmerican Political Science Review 97 (2): 311–31. 
https://doi.org/10.1017/S0003055403000698.\n\n\nLeek, Jeff, Blakeley McShane, Andrew Gelman, David Colquhoun, Michèle\nNuijten, and Steven Goodman. 2017. “Five Ways to Fix\nStatistics.” Nature 551 (7682): 557–59. https://doi.org/10.1038/d41586-017-07522-z.\n\n\nLeek, Jeff, and Roger Peng. 2020. “Advanced Data Science\n2020.” http://jtleek.com/ads2020/index.html.\n\n\nLeonelli, Sabina. 2020. “Learning from Data Journeys.” In\nData Journeys in the Sciences, 1–24. Springer International\nPublishing. https://doi.org/10.1007/978-3-030-37177-7_1.\n\n\nLeos-Barajas, Vianey, Theoni Photopoulou, Roland Langrock, Toby\nPatterson, Yuuki Watanabe, Megan Murgatroyd, and Yannis Papastamatiou.\n2016. “Analysis of Animal Accelerometer Data Using Hidden Markov\nModels.” Methods in Ecology and Evolution 8 (2): 161–73.\nhttps://doi.org/10.1111/2041-210x.12657.\n\n\nLetterman, Clark. 2021. “Q&A: How Pew\nResearch Center surveyed nearly 30,000 people in India,”\nJuly. https://medium.com/pew-research-center-decoded/q-a-how-pew-research-center-surveyed-nearly-30-000-people-in-india-7c778f6d650e.\n\n\nLevay, Kevin, Jeremy Freese, and James Druckman. 2016. “The\nDemographic and Political Composition of Mechanical Turk\nSamples.” SAGE Open 6 (1): 1–17. https://doi.org/10.1177/2158244016636433.\n\n\nLevine, Judah, Patrizia Tavella, and Martin Milton. 2022. “Towards\na Consensus on a Continuous Coordinated Universal Time.”\nMetrologia 60 (1): 014001. https://doi.org/10.1088/1681-7575/ac9da5.\n\n\nLewis, Crystal. 2024. Data Management in Large-Scale Education\nResearch. 1st ed. Chapman; Hall/CRC. https://datamgmtinedresearch.com/index.html.\n\n\nLichand, Guilherme, and Sharon Wolf. 2022. “Measuring Child Labor:\nWhom Should Be Asked, and Why It Matters,” March. https://doi.org/10.21203/rs.3.rs-1474562/v1.\n\n\nLight, Richard, Judith Singer, and John Willett. 1990. By Design: Planning Research on Higher\nEducation. 1st ed. 
Cambridge: Harvard University Press.\n\n\nLima, Renato de, Oliver Phillips, Alvaro Duque, Sebastian Tello, Stuart\nDavies, Alexandre Adalardo de Oliveira, Sandra Muller, et al. 2022.\n“Making Forest Data Fair and Open.” Nature Ecology\n& Evolution 6 (April): 656–58. https://doi.org/10.1038/s41559-022-01738-7.\n\n\nLin, Herbert. 2014. “A Proposal to Reduce Government\nOverclassification of Information Related to National Security.”\nJournal of National Security Law and Policy 7: 443–63.\n\n\nLin, Sarah, Ibraheem Ali, and Greg Wilson. 2021. “Ten Quick Tips\nfor Making Things Findable.” PLOS Computational Biology\n16 (12): 1–10. https://doi.org/10.1371/journal.pcbi.1008469.\n\n\nLips, Hilary. 2020. Sex and Gender: An Introduction. 7th ed.\nIllinois: Waveland Press.\n\n\nLittle, Roderick, and Roger Lewis. 2021. “Estimands, Estimators,\nand Estimates.” JAMA 326 (10): 967. https://doi.org/10.1001/jama.2021.2886.\n\n\nLiu, Emily, Lenny Bronner, and Jeremy Bowers. 2022. “What the\nWashington Post Elections Engineering Team Had to Learn about Election\nData.” Washington Post Engineering, April. https://washpost.engineering/what-the-washington-post-elections-engineering-team-had-to-learn-about-election-data-a41603daf9ca.\n\n\nLockheed Martin. 2005. “Joint Strike Fighter Air Vehicle C++\nCoding Standards For The System Development And Demonstration\nProgram.” Document Number 2RDU00001 Rev C,\nDecember. https://www.stroustrup.com/JSF-AV-rules.pdf.\n\n\nLohr, Sharon. (1999) 2022. Sampling: Design and Analysis. 3rd\ned. Chapman; Hall/CRC.\n\n\nLoken, Meredith, and Hilary Matfess. 2023. “Introducing the\nWomen’s Activities in Armed Rebellion (WAAR) Project, 1946-2015.”\nJournal of Peace Research.\n\n\nLovelace, Robin, Jakub Nowosad, and Jannes Muenchow. 2019. Geocomputation with R. 1st ed. Chapman;\nHall/CRC. https://geocompr.robinlovelace.net.\n\n\nLucas, Jack, Reed Merrill, Kelly Blidook, Sandra Breux, Laura Conrad,\nGabriel Eidelman, Royce Koop, et al. 2020. 
“Canadian\nMunicipal Elections Database.” Scholars Portal Dataverse.\nhttps://doi.org/10.5683/sp2/4mzjpq.\n\n\nLucas, Robert. 1978. “Asset Prices in an Exchange Economy.”\nEconometrica 46 (6): 1429–45. https://doi.org/10.2307/1913837.\n\n\nLuebke, David Martin, and Sybil Milton. 1994. “Locating the\nVictim: An Overview of Census-Taking, Tabulation Technology, and\nPersecution in Nazi Germany.” IEEE Annals of the History of\nComputing 16 (3): 25–39. https://doi.org/10.1109/MAHC.1994.298418.\n\n\nLumley, Thomas. 2020. “survey: analysis of\ncomplex survey samples.” https://cran.r-project.org/web/packages/survey/index.html.\n\n\nLundberg, Ian, Rebecca Johnson, and Brandon Stewart. 2021. “What\nIs Your Estimand? Defining the Target Quantity Connects Statistical\nEvidence to Theory.” American Sociological Review 86\n(3): 532–65. https://doi.org/10.1177/00031224211004187.\n\n\nLuscombe, Alex, Kevin Dick, and Kevin Walby. 2021. “Algorithmic\nThinking in the Public Interest: Navigating Technical, Legal, and\nEthical Hurdles to Web Scraping in the Social Sciences.”\nQuality & Quantity 56 (3): 1–22. https://doi.org/10.1007/s11135-021-01164-0.\n\n\nLuscombe, Alex, Jamie Duncan, and Kevin Walby. 2022. “Jumpstarting\nthe Justice Disciplines: A Computational-Qualitative Approach to\nCollecting and Analyzing Text and Image Data in Criminology and Criminal\nJustice Studies.” Journal of Criminal Justice Education\n33 (2): 151–71. https://doi.org/10.1080/10511253.2022.2027477.\n\n\nLuscombe, Alex, and Alexander McClelland. 2020. “Policing the\nPandemic: Tracking the Policing of Covid-19 Across Canada,”\nApril. https://doi.org/10.31235/osf.io/9pn27.\n\n\nLyman, Frank. 1981. “The Responsive Classroom Discussion: The\nInclusion of All Students.” Mainstreaming Digest 109:\n109–13.\n\n\nMacDorman, Marian, and Eugene Declercq. 2018. “The Failure of\nUnited States Maternal Mortality Reporting and Its Impact on Women’s\nLives.” Birth 45 (2): 105–8. 
https://doi.org/10.1111/birt.12333.\n\n\nMaher, Michael. 1982. “Modelling Association Football\nScores.” Statistica Neerlandica 36 (3): 109–18. https://doi.org/10.1111/j.1467-9574.1982.tb00782.x.\n\n\nMaier, Maximilian, František Bartoš, Tom Stanley, David Shanks, Adam\nHarris, and Eric-Jan Wagenmakers. 2022. “No Evidence for Nudging\nAfter Adjusting for Publication Bias.” Proceedings of the\nNational Academy of Sciences 119 (31): e2200300119. https://doi.org/10.1073/pnas.2200300119.\n\n\nMammoliti, Anthony, Petr Smirnov, Minoru Nakano, Zhaleh Safikhani,\nChristopher Eeles, Heewon Seo, Sisira Kadambat Nair, et al. 2021.\n“Orchestrating and Sharing Large Multimodal Data for Transparent\nand Reproducible Research.” Nature Communications 12\n(1). https://doi.org/10.1038/s41467-021-25974-w.\n\n\nManski, Charles. 2022. “Inference with Imputed Data: The Allure of\nMaking Stuff Up.” arXiv. https://doi.org/10.48550/arXiv.2205.07388.\n\n\nMarchese, David. 2022. “Her Discovery Changed the World. How Does\nShe Think We Should Use It?” The New York Times, August.\nhttps://www.nytimes.com/interactive/2022/08/15/magazine/jennifer-doudna-crispr-interview.html.\n\n\nMartin, Charles, and Ben Popper. 2021. “Don’t Push That Button:\nExploring the Software That Flies SpaceX Rockets and Starships.”\nThe Overflow, December. https://stackoverflow.blog/2021/12/27/dont-push-that-button-exploring-the-software-that-flies-spacex-starships/.\n\n\nMartínez, Luis. 2022. “How Much Should We Trust the Dictator’s\nGDP Growth Estimates?” Journal of Political\nEconomy 130 (10): 2731–69. https://doi.org/10.1086/720458.\n\n\nMatias, Nathan, Kevin Munger, Marianne Aubin Le Quere, and Charles\nEbersole. 2021. “The Upworthy Research\nArchive, a time series of 32,487 experiments in U.S.\nmedia.” Scientific Data 8 (1): 1–8. https://doi.org/10.1038/s41597-021-00934-7.\n\n\nMatsumoto, Yukihiro. 2007. “Treating Code as\nan Essay.” In Beautiful Code, edited by Andy Oram\nand Greg Wilson, 477–81. 
O’Reilly.\n\n\nMattson, Greggor. 2017. “Artificial Intelligence Discovers\nGayface. Sigh.” https://greggormattson.com/2017/09/09/artificial-intelligence-discovers-gayface/amp/.\n\n\nMcCarthy, Fiona M., Tamsin E. M. Jones, Anne E. Kwitek, Cynthia L.\nSmith, Peter D. Vize, Monte Westerfield, and Elspeth A. Bruford. 2023.\n“The Case for Standardizing Gene Nomenclature in\nVertebrates.” Nature 614 (7948): E31–32. https://doi.org/10.1038/s41586-022-05633-w.\n\n\nMcClelland, Alexander. 2019. “‘Lock This Whore up’:\nLegal Violence and Flows of Information Precipitating Personal Violence\nAgainst People Criminalised for HIV-Related Crimes in Canada.”\nEuropean Journal of Risk Regulation 10 (1): 132–47. https://doi.org/10.1017/err.2019.20.\n\n\nMcElreath, Richard. (2015) 2020. Statistical\nRethinking: A Bayesian Course with Examples in R and Stan.\n2nd ed. Chapman; Hall/CRC.\n\n\n———. 2020. “Science as Amateur Software Development.”\nYouTube, September. https://youtu.be/zwRdO9%5FGGhY.\n\n\nMcIlroy, Doug, Ray Brownrigg, Thomas Minka, and Roger Bivand. 2023.\nmapproj: Map Projections. https://CRAN.R-project.org/package=mapproj.\n\n\nMcKenzie, David. 2021. “What Do You Need To\nDo To Make A Matching Estimator Convincing? Rhetorical vs Statistical\nChecks.” World Bank Blogs—Development Impact,\nFebruary. https://blogs.worldbank.org/impactevaluations/what-do-you-need-do-make-matching-estimator-convincing-rhetorical-vs-statistical.\n\n\nMcKinney, Wes. (2011) 2022. Python for Data Analysis. 3rd ed.\nhttps://wesmckinney.com/book/.\n\n\nMcPhee, John. 2017. Draft No. 4. 1st ed. Farrar, Straus;\nGiroux.\n\n\nMcQuire, Scott. 2019. “One Map to Rule Them All? Google Maps as\nDigital Technical Object.” Communication and the Public\n4 (2): 150–65. https://doi.org/10.1177/2057047319850192.\n\n\nMellon, Jonathan. 2024. “Rain, Rain, Go Away: 194 Potential\nExclusion‐restriction Violations for Studies Using Weather as an\nInstrumental Variable.” American Journal of Political\nScience, 1–18. 
https://doi.org/10.1111/ajps.12894.\n\n\nMeng, Xiao-Li. 1994. “Multiple-Imputation Inferences with\nUncongenial Sources of Input.” Statistical Science 9\n(4): 538–58. https://doi.org/10.1214/ss/1177010269.\n\n\n———. 2012. “You Want Me to Analyze Data I Don’t Have? Are You\nInsane?” Shanghai Archives of Psychiatry 24 (5):\n297–301. https://doi.org/10.3969/j.issn.1002-0829.2012.05.011.\n\n\n———. 2018. “Statistical Paradises and Paradoxes in Big Data (I):\nLaw of Large Populations, Big Data Paradox, and the 2016 US Presidential\nElection.” The Annals of Applied Statistics 12 (2):\n685–726. https://doi.org/10.1214/18-AOAS1161SF.\n\n\n———. 2021. “What Are the Values of Data, Data Science, or Data\nScientists?” Harvard Data Science Review 3 (1). https://doi.org/10.1162/99608f92.ee717cf7.\n\n\nMerali, Zeeya. 2010. “Computational Science:... Error.”\nNature 467 (7317): 775–77. https://doi.org/10.1038/467775a.\n\n\nMiceli, Milagros, Julian Posada, and Tianling Yang. 2022.\n“Studying up Machine Learning Data.” Proceedings of the\nACM on Human-Computer Interaction 6 (January): 1–14.\nhttps://doi.org/10.1145/3492853.\n\n\nMichener, William. 2015. “Ten Simple Rules for Creating a Good\nData Management Plan.” PLOS Computational Biology 11\n(10): e1004525. https://doi.org/10.1371/journal.pcbi.1004525.\n\n\nMill, James. 1817. The History of British India. 1st ed. https://books.google.ca/books?id=Orw_AAAAcAAJ.\n\n\nMiller, Greg. 2014. “The Cartographer Who’s\nTransforming Map Design.” Wired, October. https://www.wired.com/2014/10/cindy-brewer-map-design/.\n\n\nMiller, Michael, and Joseph Sutherland. 2022. “The Effect of\nGender on Interruptions at Congressional Hearings.” American\nPolitical Science Review, 1–19. https://doi.org/10.1017/S0003055422000260.\n\n\nMills, David L. 1991. “Internet Time Synchronization: The Network\nTime Protocol.” IEEE Transactions on Communications 39\n(10): 1482–93.\n\n\nMindell, David. 2008. Digital Apollo: Human and\nMachine in Spaceflight. 1st ed. 
New York: The MIT Press.\n\n\nMineault, Patrick, and The Good Research Code Handbook Community. 2021.\n“The Good Research Code Handbook.” https://doi.org/10.5281/zenodo.5796873.\n\n\nMinsky, Yaron. 2011. “OCaml for the\nmasses.” Communications of the ACM 54 (11):\n53–58. https://doi.org/10.1145/2018396.2018413.\n\n\n———. 2015. “Automated Trading and OCaml with Yaron Minsky.”\nHackers — Software Engineering Daily, November. https://softwareengineeringdaily.com/2015/11/09/automated-trading-and-ocaml-with-yaron-minsky/.\n\n\nMitchell, Alanna. 2022a. “Get Ready for the New, Improved\nSecond.” The New York Times, April. https://www.nytimes.com/2022/04/25/science/time-second-measurement.html.\n\n\n———. 2022b. “Time Has Run Out for the Leap Second.” The\nNew York Times, November. https://www.nytimes.com/2022/11/14/science/time-leap-second.html.\n\n\nMitchell, Margaret, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy\nVasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and\nTimnit Gebru. 2019. “Model Cards for Model Reporting.”\nProceedings of the Conference on Fairness, Accountability, and\nTransparency, January. https://doi.org/10.1145/3287560.3287596.\n\n\nMitrovski, Alen, Xiaoyan Yang, and Matthew Wankiewicz. 2020. “Joe\nBiden Projected to Win Popular Vote in 2020 US Election.” https://github.com/matthewwankiewicz/US_election_forecast.\n\n\nMiyakawa, Tsuyoshi. 2020. “No Raw Data, No Science: Another\nPossible Source of the Reproducibility Crisis.” Molecular\nBrain 13 (1): 1–6. https://doi.org/10.1186/s13041-020-0552-2.\n\n\nMok, Lillio, Samuel Way, Lucas Maystre, and Ashton Anderson. 2022.\n“The Dynamics of Exploration on Spotify.” In\nProceedings of the International AAAI Conference on Web and Social\nMedia, 16:663–74. https://doi.org/10.1609/icwsm.v16i1.19324.\n\n\nMolanphy, Chris. 2012. “100 & Single: Three Rules to Define\nthe Term ‘One-Hit Wonder’ in 2012.” The Village\nVoice, September. 
https://www.villagevoice.com/2012/09/10/100-single-three-rules-to-define-the-term-one-hit-wonder-in-2012/.\n\n\nMorange, Michel. 2016. A History of Biology. New Jersey:\nPrinceton University Press.\n\n\nMoyer, Brian, and Abe Dunn. 2020. “Measuring the\nGross Domestic Product\n(GDP): The Ultimate Data\nScience Project.” Harvard Data\nScience Review 2 (1). https://doi.org/10.1162/99608f92.414caadb.\n\n\nMullard, Asher. 2021. “Half of Top Cancer Studies Fail\nHigh-Profile Reproducibility Effort.” Nature 600 (7889):\n368–69. https://doi.org/10.1038/d41586-021-03691-0.\n\n\nMüller, Kirill. 2020. here: A Simpler Way to\nFind Your Files. https://CRAN.R-project.org/package=here.\n\n\nMüller, Kirill, Tobias Schieferdecker, and Patrick Schratz. 2019.\nVisualization, Transformation and Reporting with the Tidyverse.\nhttps://krlmlr.github.io/vistransrep/.\n\n\nMüller, Kirill, and Lorenz Walthert. 2022. styler: Non-Invasive Pretty Printing of R\nCode. https://CRAN.R-project.org/package=styler.\n\n\nMüller, Kirill, and Hadley Wickham. 2022. tibble: Simple Data Frames. https://CRAN.R-project.org/package=tibble.\n\n\nMurphy, Heather. 2017. “Why Stanford Researchers Tried to Create a\n‘Gaydar’ Machine.” The New York Times,\nOctober. https://www.nytimes.com/2017/10/09/science/stanford-sexual-orientation-study.html.\n\n\nNavarro, Danielle. 2022. “Binding Apache\nArrow to R,” January. https://blog.djnavarro.net/posts/2022-01-18%5Fbinding-arrow-to-r/.\n\n\nNavarro, Danielle, Jonathan Keane, and Stephanie Hazlitt. 2022.\n“Larger-Than-Memory Data Workflows with\nApache Arrow,” June. https://arrow-user2022.netlify.app.\n\n\nNelder, John. 1999. “From Statistics to Statistical\nScience.” Journal of the Royal Statistical Society: Series D\n(The Statistician) 48 (2): 257–69. https://doi.org/10.1111/1467-9884.00187.\n\n\nNelder, John, and Robert Wedderburn. 1972. “Generalized Linear\nModels.” Journal of the Royal Statistical Society: Series A\n(General) 135 (3): 370–84. 
https://doi.org/10.2307/2344614.\n\n\nNeufeld, Anna, and Daniela Witten. 2021. “Discussion of Breiman’s\n\"Two Cultures\": From Two Cultures to One.” Observational\nStudies 7 (1): 171–74. https://doi.org/10.1353/obs.2021.0004.\n\n\nNeufeld, Michael. 2002. “Wernher von Braun, the SS, and\nConcentration Camp Labor: Questions of Moral, Political, and Criminal\nResponsibility.” German Studies Review 25 (1): 57–78. https://doi.org/10.2307/1433245.\n\n\nNeuwirth, Erich. 2022. RColorBrewer: ColorBrewer\nPalettes. https://CRAN.R-project.org/package=RColorBrewer.\n\n\nNewman, Daniel. 2014. “Missing Data: Five Practical\nGuidelines.” Organizational Research Methods 17 (4):\n372–411. https://doi.org/10.1177/1094428114548590.\n\n\nNeyman, Jerzy. 1934. “On the Two Different Aspects of the\nRepresentative Method: The Method of Stratified Sampling and the Method\nof Purposive Selection.” Journal of the Royal Statistical\nSociety 97 (4): 558–625. https://doi.org/10.2307/2342192.\n\n\nNix, Justin, and M. James Lozada. 2020. “Police Killings of\nUnarmed Black Americans: A Reassessment of Community Mental Health\nSpillover Effects,” January. https://doi.org/10.31235/osf.io/ajz2q.\n\n\nNobles, Melissa. 2002. “Racial Categorization and\nCensuses.” In Census and Identity: The Politics of Race,\nEthnicity, and Language in National Censuses, edited by David\nKertzer and Dominique Arel, 43–70. Cambridge: Cambridge University\nPress. https://doi.org/10.1017/CBO9780511606045.003.\n\n\nNorthcutt, Curtis, Anish Athalye, and Jonas Mueller. 2021.\n“Pervasive Label Errors in Test Sets Destabilize Machine Learning\nBenchmarks.” In Proceedings of the 35th Conference on Neural\nInformation Processing Systems Track on Datasets and Benchmarks. https://doi.org/10.48550/arXiv.2103.14749.\n\n\nObermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil\nMullainathan. 2019. “Dissecting Racial Bias in an Algorithm Used\nto Manage the Health of Populations.” Science 366\n(6464): 447–53. 
https://doi.org/10.1126/science.aax2342.\n\n\nOberski, Daniel, and Frauke Kreuter. 2020. “Differential Privacy\nand Social Science: An Urgent\nPuzzle.” Harvard Data Science Review 2 (1).\nhttps://doi.org/10.1162/99608f92.63a22079.\n\n\nOECD. 2014. “The Essential Macroeconomic Aggregates.” In\nUnderstanding National Accounts, 13–46. OECD. https://doi.org/10.1787/9789264214637-2-en.\n\n\n———. 2022. Quarterly GDP. https://data.oecd.org/gdp/quarterly-gdp.htm.\n\n\nOoms, Jeroen. 2014. “The jsonlite Package: A\nPractical and Consistent Mapping Between JSON Data and R\nObjects.” arXiv:1403.2805 [Stat.CO]. https://arxiv.org/abs/1403.2805.\n\n\n———. 2022a. openssl: Toolkit for Encryption,\nSignatures and Certificates Based on OpenSSL. https://CRAN.R-project.org/package=openssl.\n\n\n———. 2022b. pdftools: Text Extraction,\nRendering and Converting of PDF Documents. https://CRAN.R-project.org/package=pdftools.\n\n\n———. 2022c. ssh: Secure Shell (SSH) Client for\nR. https://CRAN.R-project.org/package=ssh.\n\n\n———. 2022d. tesseract: Open Source OCR\nEngine. https://CRAN.R-project.org/package=tesseract.\n\n\nOpen Science Collaboration. 2015. “Estimating the Reproducibility\nof Psychological Science.” Science 349 (6251): aac4716.\nhttps://doi.org/10.1126/science.aac4716.\n\n\nOrwell, George. 1946. Politics and the English Language. https://www.orwellfoundation.com/the-orwell-foundation/orwell/essays-and-other-works/politics-and-the-english-language/.\n\n\nOsborne, Jason. 2012. Best Practices in Data\nCleaning: A Complete Guide to Everything You Need to Do Before and After\nCollecting Your Data. SAGE Publications.\n\n\nOsgood, D. Wayne. 2000. “Poisson-Based Regression Analysis of\nAggregate Crime Rates.” Journal of Quantitative\nCriminology 16 (1): 21–43. https://doi.org/10.1023/a:1007521427059.\n\n\nPalmer Station Antarctica LTER, and Gorman, Kristen. 
2020.\n“Structural Size Measurements and Isotopic Signatures of Foraging\nAmong Adult Male and Female Adélie Penguins (Pygoscelis Adeliae) Nesting\nAlong the Palmer Archipelago Near Palmer Station, 2007-2009.” https://doi.org/10.6073/PASTA/98B16D7D563F265CB52372C8CA99E60F.\n\n\nPasek, Josh. 2015. “Predicting Elections:\nConsidering Tools to Pool the Polls.” Public Opinion\nQuarterly 79 (2): 594–619. https://doi.org/10.1093/poq/nfu060.\n\n\nPatki, Neha, Roy Wedge, and Kalyan Veeramachaneni. 2016. “The\nSynthetic Data Vault.” In 2016 IEEE International Conference\non Data Science and Advanced Analytics (DSAA), 399–410. https://doi.org/10.1109/DSAA.2016.49.\n\n\nPaullada, Amandalynne, Inioluwa Deborah Raji, Emily Bender, Emily\nDenton, and Alex Hanna. 2021. “Data and Its (Dis)contents: A\nSurvey of Dataset Development and Use in Machine Learning\nResearch.” Patterns 2 (11): 100336. https://doi.org/10.1016/j.patter.2021.100336.\n\n\nPavlik, Kaylin. 2019. “Understanding + Classifying Genres Using\nSpotify Audio Features.” https://www.kaylinpavlik.com/classifying-songs-genres/.\n\n\nPedersen, Thomas Lin. 2022. patchwork: The\nComposer of Plots. https://CRAN.R-project.org/package=patchwork.\n\n\nPerepolkin, Dmytro. 2022. polite: Be Nice on\nthe Web. https://CRAN.R-project.org/package=polite.\n\n\nPerkel, Jeffrey. 2021. “Ten Computer Codes That Transformed\nScience.” Nature 589 (7842): 344–48. https://doi.org/10.1038/d41586-021-00075-2.\n\n\n———. 2023. “The Sleight-of-Hand Trick That Can Simplify Scientific\nComputing.” Nature 617 (7959): 212–13. https://doi.org/10.1038/d41586-023-01469-0.\n\n\nPhillips, Alban. 1958. “The Relation Between Unemployment and the\nRate of Change of Money Wage Rates in the United Kingdom,\n1861-1957.” Economica 25 (100): 283–99. https://doi.org/10.1111/j.1468-0335.1958.tb00003.x.\n\n\nPiller, Charles. 2022. “Blots on a Field?” Science\n377 (6604): 358–63. 
https://doi.org/10.1126/science.ade0209.\n\n\nPineau, Joelle, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent\nLarivière, Alina Beygelzimer, Florence d’Alché-Buc, Emily Fox, and Hugo\nLarochelle. 2021. “Improving Reproducibility in Machine Learning\nResearch (a Report from the NeurIPS 2019 Reproducibility\nProgram).” Journal of Machine Learning Research 22\n(164): 1–20. http://jmlr.org/papers/v22/20-303.html.\n\n\nPitman, Jim. 1993. Probability. 1st ed. New York: Springer. https://doi.org/10.1007/978-1-4612-4374-8.\n\n\nPlant, Anne, and Robert Hanisch. 2020. “Reproducibility in\nScience: A Metrology Perspective.” Harvard Data Science\nReview 2 (4). https://doi.org/10.1162/99608f92.eb6ddee4.\n\n\nPodlogar, Tim, Peter Leo, and James Spragg. 2022. “Using VO2max as a marker of training status in\nathletes—Can we do better?” Journal of Applied\nPhysiology 133 (6): 144–47. https://doi.org/10.1152/japplphysiol.00723.2021.\n\n\nPreece, Donald Arthur. 1981. “Distributions of Final Digits in\nData.” The Statistician 30 (1): 31. https://doi.org/10.2307/2987702.\n\n\nPrévost, Jean-Guy, and Jean-Pierre Beaud. 2015. Statistics, Public\nDebate and the State, 1800–1945: A Social, Political and Intellectual\nHistory of Numbers. Routledge.\n\n\nR Core Team. 2023. R: A Language and Environment for Statistical\nComputing. Vienna, Austria: R Foundation for Statistical Computing.\nhttps://www.R-project.org/.\n\n\nR Special Interest Group on Databases (R-SIG-DB), Hadley Wickham, and\nKirill Müller. 2022. DBI: R Database Interface. https://CRAN.R-project.org/package=DBI.\n\n\nRadcliffe, Nicholas. 2023. Test-Driven Data\nAnalysis (Python TDDA library). https://tdda.readthedocs.io/en/latest/index.html.\n\n\nRegister, Yim. 2020a. “Introduction to Sampling and\nRandomization.” YouTube, November. https://youtu.be/U272FFxG8LE.\n\n\n———. 2020b. “Data Science Ethics in 6 Minutes.”\nYouTube, December. https://youtu.be/mA4gypAiRYU.\n\n\nRehaag, Sean. 2023. 
“Supreme Court of Canada Bulk Decisions\nDataset.” Refugee Law Laboratory. https://refugeelab.ca/bulk-data/scc.\n\n\nReid, Nancy. 2003. “Asymptotics and the Theory of\nInference.” The Annals of Statistics 31 (6): 1695–1731.\nhttps://doi.org/10.1214/aos/1074290325.\n\n\nRichardson, Neal, Ian Cook, Nic Crane, Dewey Dunnington, Romain\nFrançois, Jonathan Keane, Dragoș Moldovan-Grünfeld, Jeroen Ooms, and\nApache Arrow. 2023. arrow: Integration to\nApache Arrow. https://CRAN.R-project.org/package=arrow.\n\n\nRiederer, Emily. 2020. “Column Names as Contracts,”\nSeptember. https://emilyriederer.netlify.app/post/column-name-contracts/.\n\n\n———. 2021. “Causal Design Patterns for Data Analysts,”\nJanuary. https://emilyriederer.netlify.app/post/causal-design-patterns/.\n\n\nRiffe, Tim, Enrique Acosta, Enrique José Acosta, Diego Manuel Aburto,\nAnna Alburez-Gutierrez, Ainhoa Altová, Ugofilippo Alustiza, et al. 2021.\n“Data Resource Profile: COVerAGE-DB: A\nGlobal Demographic Database of COVID-19 Cases and\nDeaths.” International Journal of Epidemiology 50 (2):\n390–390f. https://doi.org/10.1093/ije/dyab027.\n\n\nRiley, Richard, Tim Cole, Jon Deeks, Jamie Kirkham, Julie Morris, Rafael\nPerera, Angie Wade, and Gary Collins. 2022. “On the 12th Day of\nChristmas, a Statistician Sent to Me...”\nBMJ, December, e072883. https://doi.org/10.1136/bmj-2022-072883.\n\n\nRilke, Rainer Maria. (1929) 2014. Letters to a Young Poet.\nPenguin Classics.\n\n\nRoberts, Margaret, Brandon Stewart, and Dustin Tingley. 2019.\n“stm: An R Package for\nStructural Topic Models.” Journal of Statistical\nSoftware 91 (2): 1–40. https://doi.org/10.18637/jss.v091.i02.\n\n\nRobinson, David, Alex Hayes, and Simon Couch. 2022. broom: Convert Statistical Objects into Tidy\nTibbles. https://CRAN.R-project.org/package=broom.\n\n\nRobinson, Emily, and Jacqueline Nolis. 2020. Build a Career in Data\nScience. Shelter Island: Manning Publications. 
https://livebook.manning.com/book/build-a-career-in-data-science.\n\n\nRockoff, Hugh. 2019. “On the Controversies Behind the Origins of\nthe Federal Economic Statistics.” Journal of Economic\nPerspectives 33 (1): 147–64. https://doi.org/10.1257/jep.33.1.147.\n\n\nRomer, Paul. 2018. “Jupyter, Mathematica, and the Future of the\nResearch Paper,” April. https://paulromer.net/jupyter-mathematica-and-the-future-of-the-research-paper/.\n\n\nRose, Angela, Rebecca Grais, Denis Coulombier, and Helga Ritter. 2006.\n“A Comparison of Cluster and Systematic Sampling Methods for\nMeasuring Crude Mortality.” Bulletin of the World Health\nOrganization 84: 290–96. https://doi.org/10.2471/blt.05.029181.\n\n\nRosenau, James N. 1999. “A Transformed Observer in a Transforming\nWorld.” Studia Diplomatica 52 (1/2): 5–14. http://www.jstor.org/stable/44838096.\n\n\nRoss, Casey. 2022. “How a Decades-Old Database Became a Hugely\nProfitable Dossier on the Health of 270 Million Americans.”\nStat, February. https://www.statnews.com/2022/02/01/ibm-watson-health-marketscan-data/.\n\n\nRubinstein, Benjamin, and Francesco Alda. 2017. “Pain-Free Random\nDifferential Privacy with Sensitivity Sampling.” In 34th\nInternational Conference on Machine Learning (ICML’2017).\n\n\nRudis, Bob. 2020. hrbrthemes: Additional\nThemes, Theme Components and Utilities for\n“ggplot2”. https://CRAN.R-project.org/package=hrbrthemes.\n\n\nRuggles, Steven, Catherine Fitch, Diana Magnuson, and Jonathan\nSchroeder. 2019. “Differential Privacy and Census Data:\nImplications for Social and Economic Research.” AEA Papers\nand Proceedings 109 (May): 403–8. https://doi.org/10.1257/pandp.20191107.\n\n\nRuggles, Steven, Sarah Flood, Sophia Foster, Ronald Goeken, Jose Pacas,\nMegan Schouweiler, and Matthew Sobek. 2021. “IPUMS USA: Version\n11.0.” Minneapolis, MN: IPUMS. https://doi.org/10.18128/d010.v11.0.\n\n\nRyan, Philip. 2015. “Keeping a Lab Notebook.”\nYouTube, May. 
https://youtu.be/-MAIuaOL64I.\n\n\nSadowski, Caitlin, Emma Söderberg, Luke Church, Michal Sipko, and\nAlberto Bacchelli. 2018. “Modern Code Review: A Case Study at\nGoogle.” In Proceedings of the 40th International Conference\non Software Engineering: Software Engineering in Practice, 181–90.\nICSE-SEIP ’18. New York, NY, USA: Association for Computing Machinery.\nhttps://doi.org/10.1145/3183519.3183525.\n\n\nSakshaug, Joseph, Ting Yan, and Roger Tourangeau. 2010.\n“Nonresponse Error, Measurement Error, and Mode of Data\nCollection: Tradeoffs in a Multi-Mode Survey of Sensitive and\nNon-Sensitive Items.” Public Opinion Quarterly 74 (5):\n907–33. https://doi.org/10.1093/poq/nfq057.\n\n\nSalganik, Matthew. 2018. Bit by Bit: Social Research in the Digital\nAge. New Jersey: Princeton University Press.\n\n\nSalganik, Matthew, Peter Sheridan Dodds, and Duncan Watts. 2006.\n“Experimental Study of Inequality and Unpredictability in an\nArtificial Cultural Market.” Science 311 (5762): 854–56.\nhttps://doi.org/10.1126/science.1121066.\n\n\nSalganik, Matthew, and Douglas Heckathorn. 2004. “Sampling and\nEstimation in Hidden Populations Using Respondent-Driven\nSampling.” Sociological Methodology 34 (1): 193–240. https://doi.org/10.1111/j.0081-1750.2004.00152.x.\n\n\nSambasivan, Nithya, Shivani Kapania, Hannah Highfill, Diana Akrong,\nPraveen Paritosh, and Lora Aroyo. 2021. “‘Everyone Wants to\nDo the Model Work, Not the Data Work’: Data Cascades in\nHigh-Stakes AI.” In Proceedings of the 2021\nCHI Conference on Human Factors in Computing Systems.\nACM. https://doi.org/10.1145/3411764.3445518.\n\n\nSamuel, Arthur. 1959. “Some Studies in Machine Learning Using the\nGame of Checkers.” IBM Journal of Research and\nDevelopment 3 (3): 210–29. https://doi.org/10.1147/rd.33.0210.\n\n\nSaulnier, Lucile, Siddharth Karamcheti, Hugo Laurençon, Léo Tronchon,\nThomas Wang, Victor Sanh, Amanpreet Singh, et al. 2022. 
“Putting\nEthical Principles at the Core of the Research Lifecycle.” https://huggingface.co/blog/ethical-charter-multimodal.\n\n\nSavage, Van, and Pamela Yeh. 2019. “Novelist Cormac\nMcCarthy’s Tips on How to Write a Great Science\nPaper.” Nature 574 (7778): 441–42. https://doi.org/10.1038/d41586-019-02918-5.\n\n\nSchaffner, Brian, Stephen Ansolabehere, and Sam Luks. 2021.\n“Cooperative Election Study Common Content,\n2020.” Harvard Dataverse. https://doi.org/10.7910/DVN/E9N6PH.\n\n\nSchloerke, Barret, and Jeff Allen. 2022. plumber: An API Generator for R. https://CRAN.R-project.org/package=plumber.\n\n\nSchmertmann, Carl. 2022. “UN API Test,” July. https://bonecave.schmert.net/un-api-example.html.\n\n\nSchofield, Alexandra, Måns Magnusson, and David Mimno. 2017.\n“Pulling Out the Stops: Rethinking Stopword Removal for Topic\nModels.” In Proceedings of the 15th Conference of the\nEuropean Chapter of the Association for Computational\nLinguistics: Volume 2, Short Papers, 432–36. Valencia, Spain:\nAssociation for Computational Linguistics. https://aclanthology.org/E17-2069.\n\n\nSchofield, Alexandra, Måns Magnusson, Laure Thompson, and David Mimno.\n2017. “Understanding Text Pre-Processing for Latent Dirichlet\nAllocation.” In ACL Workshop for Women in NLP (WiNLP).\nhttps://www.cs.cornell.edu/~xanda/winlp2017.pdf.\n\n\nSchofield, Alexandra, Laure Thompson, and David Mimno. 2017.\n“Quantifying the Effects of Text Duplication on Semantic\nModels.” In Proceedings of the 2017 Conference on Empirical\nMethods in Natural Language Processing, 2737–47. Copenhagen,\nDenmark: Association for Computational Linguistics. https://doi.org/10.18653/v1/D17-1290.\n\n\nScott, James. 1998. Seeing Like a State. Yale University Press.\n\n\nSekhon, Jasjeet, and Rocío Titiunik. 2017. “Understanding\nRegression Discontinuity Designs as Observational Studies.”\nObservational Studies 3 (2): 174–82. https://doi.org/10.1353/obs.2017.0005.\n\n\nSen, Amartya. 1980. 
“Description as\nChoice.” Oxford Economic Papers 32 (3): 353–69.\nhttps://doi.org/10.1093/oxfordjournals.oep.a041484.\n\n\nShankar, Shreya, Rolando Garcia, Joseph Hellerstein, and Aditya\nParameswaran. 2022. “Operationalizing Machine Learning: An\nInterview Study.” arXiv. https://doi.org/10.48550/ARXIV.2209.09125.\n\n\nSi, Yajuan. 2020. “On the Use of Auxiliary Variables in Multilevel\nRegression and Poststratification.” https://arxiv.org/abs/2011.00360.\n\n\nSides, John, Lynn Vavreck, and Christopher Warshaw. 2021. “The\nEffect of Television Advertising in United States Elections.”\nAmerican Political Science Review, 1–17. https://doi.org/10.1017/s000305542100112x.\n\n\nSilberzahn, Raphael, Eric Uhlmann, Daniel Martin, Pasquale Anselmi,\nFrederik Aust, Eli Awtrey, Štěpán Bahník, et al. 2018. “Many\nAnalysts, One Data Set: Making Transparent How Variations in Analytic\nChoices Affect Results.” Advances in Methods and Practices in\nPsychological Science 1 (3): 337–56. https://doi.org/10.1177/2515245917747646.\n\n\nSilge, Julia, and David Robinson. 2016. “tidytext: Text Mining and Analysis Using Tidy Data\nPrinciples in R.” The Journal of Open Source\nSoftware 1 (3). https://doi.org/10.21105/joss.00037.\n\n\nSilver, Nate. 2020. “We Fixed an Issue with How Our Primary\nForecast Was Calculating Candidates’ Demographic Strengths.”\nFiveThirtyEight, February. https://fivethirtyeight.com/features/we-fixed-a-mistake-in-how-our-primary-forecast-was-calculating-candidates-demographic-strengths/.\n\n\nSimonsohn, Uri. 2013. “Just Post It: The Lesson from Two Cases of\nFabricated Data Detected by Statistics Alone.” Psychological\nScience 24 (10): 1875–88. https://doi.org/10.1177/0956797613480366.\n\n\nSimpkinson, Scott. 1971. “Testing to Ensure\nMission Success.” In What Made Apollo a Success,\nedited by NASA, 21–29.\n\n\nSimpson, Edward. 1951. 
“The Interpretation of Interaction in\nContingency Tables.” Journal of the Royal Statistical\nSociety: Series B (Methodological) 13 (2): 238–41. https://doi.org/10.1111/j.2517-6161.1951.tb00088.x.\n\n\nSmith, Jessie, Saleema Amershi, Solon Barocas, Hanna Wallach, and\nJennifer Wortman Vaughan. 2022. “REAL ML: Recognizing, Exploring,\nand Articulating Limitations of Machine Learning Research.”\n2022 ACM Conference on Fairness, Accountability, and Transparency\n(FAccT ’22). https://doi.org/10.1145/3531146.3533122.\n\n\nSmith, Matthew. 2018. “Should Milk Go in a Cup of Tea First or\nLast?” July. https://yougov.co.uk/topics/consumer/articles-reports/2018/07/30/should-milk-go-cup-tea-first-or-last.\n\n\nSmith, Richard. 2002. “A Statistical Assessment of Buchanan’s Vote\nin Palm Beach County.” Statistical Science 17 (4):\n441–57. https://doi.org/10.1214/ss/1049993203.\n\n\nSobek, Matthew, and Steven Ruggles. 1999. “The IPUMS Project: An\nUpdate.” Historical Methods: A Journal of Quantitative and\nInterdisciplinary History 32 (3): 102–10. https://doi.org/10.1080/01615449909598930.\n\n\nSomers, James. 2015. “Toolkits for the\nMind.” MIT Technology Review, April. https://www.technologyreview.com/2015/04/02/168469/toolkits-for-the-mind/.\n\n\n———. 2017. “Torching the Modern-Day Library of Alexandria.”\nThe Atlantic, April. https://www.theatlantic.com/technology/archive/2017/04/the-tragedy-of-google-books/523320/.\n\n\n———. 2018. “The Scientific Paper Is Obsolete.” The\nAtlantic, April. https://www.theatlantic.com/science/archive/2018/04/the-scientific-paper-is-obsolete/556676/.\n\n\nSpear, Mary Eleanor. 1952. Charting Statistics. https://archive.org/details/ChartingStatistics_201801/.\n\n\nSprint, Gina, and Jason Conci. 2019. “Mining GitHub Classroom\nCommit Behavior in Elective and Introductory Computer Science\nCourses.” Journal of Computing Sciences in Colleges 35\n(1): 76–84.\n\n\nStaicu, Ana-Maria. 2017. 
“Interview with Nancy Reid.”\nInternational Statistical Review 85 (3): 381–403. https://doi.org/10.1111/insr.12237.\n\n\nStaniak, Mateusz, and Przemysław Biecek. 2019. “The Landscape of R Packages for Automated Exploratory\nData Analysis.” The R Journal 11\n(2): 347–69. https://doi.org/10.32614/RJ-2019-033.\n\n\nStantcheva, Stefanie. 2023. “How to Run Surveys: A Guide to\nCreating Your Own Identifying Variation and Revealing the\nInvisible.” Annual Review of Economics 15 (1): 205–34.\nhttps://doi.org/10.1146/annurev-economics-091622-010157.\n\n\nStatistics Canada. 2020. “Sex at Birth and Gender: Technical\nReport on Changes for the 2021 Census.” Statistics Canada. https://www12.statcan.gc.ca/census-recensement/2021/ref/98-20-0002/982000022020002-eng.pdf.\n\n\n———. 2023. “Guide to the Census of Population, 2021.”\nStatistics Canada. https://www12.statcan.gc.ca/census-recensement/2021/ref/98-304/98-304-x2021001-eng.pdf.\n\n\nSteckel, Richard. 1991. “The Quality of Census Data for Historical\nInquiry: A Research Agenda.” Social Science History 15\n(4): 579–99. https://doi.org/10.2307/1171470.\n\n\nSteele, Fiona. 2007. “Multilevel Models for Longitudinal\nData.” Journal of the Royal Statistical Society Series\nA: Statistics in Society 171 (1): 5–19. https://doi.org/10.1111/j.1467-985x.2007.00509.x.\n\n\nSteele, Fiona, Anna Vignoles, and Andrew Jenkins. 2007. “The\nEffect of School Resources on Pupil Attainment: A Multilevel\nSimultaneous Equation Modelling Approach.” Journal of the\nRoyal Statistical Society Series A: Statistics in Society 170 (3):\n801–24. https://doi.org/10.1111/j.1467-985x.2007.00476.x.\n\n\nStevens, Wallace. 1934. The Idea of Order at Key West. https://www.poetryfoundation.org/poems/43431/the-idea-of-order-at-key-west.\n\n\nSteyvers, Mark, and Tom Griffiths. 2006. “Probabilistic Topic\nModels.” In Latent Semantic Analysis: A Road to Meaning,\nedited by T. Landauer, D. McNamara, S. Dennis, and W. Kintsch. 
https://cocosci.princeton.edu/tom/papers/SteyversGriffiths.pdf.\n\n\nStigler, Stephen. 1978. “Francis Ysidro Edgeworth,\nStatistician.” Journal of the Royal Statistical\nSociety. Series A (General) 141 (3): 287–322. https://doi.org/10.2307/2344804.\n\n\n———. 1986. The History of Statistics. Massachusetts: Belknap\nHarvard.\n\n\nStock, James, and Francesco Trebbi. 2003. “Retrospectives: Who\nInvented Instrumental Variable Regression?” Journal of\nEconomic Perspectives 17 (3): 177–94. https://doi.org/10.1257/089533003769204416.\n\n\nStolberg, Michael. 2006. “Inventing the Randomized Double-Blind\nTrial: The Nuremberg Salt Test of 1835.” Journal of the Royal\nSociety of Medicine 99 (12): 642–43. https://doi.org/10.1177/014107680609901216.\n\n\nStoler, Ann Laura. 2002. “Colonial Archives and the Arts of\nGovernance.” Archival Science 2 (March): 87–109. https://doi.org/10.1007/bf02435632.\n\n\nStolley, Paul. 1991. “When Genius Errs: R. A. Fisher and the Lung\nCancer Controversy.” American Journal of Epidemiology\n133 (5): 416–25. https://doi.org/10.1093/oxfordjournals.aje.a115904.\n\n\nStommes, Drew, P. M. Aronow, and Fredrik Sävje. 2023. “On the\nReliability of Published Findings Using the Regression Discontinuity\nDesign in Political Science.” Research & Politics 10\n(2). https://doi.org/10.1177/2053168023116645.\n\n\nStudent. 1908. “The Probable Error of a Mean.”\nBiometrika 6 (1): 1–25. https://doi.org/10.2307/2331554.\n\n\nSunstein, Cass, and Lucia Reisch. 2017. The Economics of Nudge.\nRoutledge.\n\n\nSuriyakumar, Vinith, Nicolas Papernot, Anna Goldenberg, and Marzyeh\nGhassemi. 2021. “Chasing Your Long Tails.” In\nProceedings of the 2021 ACM Conference on Fairness,\nAccountability, and Transparency. https://doi.org/10.1145/3442188.3445934.\n\n\nSwain, Larry. 1985. 
“Basic Principles of Questionnaire\nDesign.” Survey Methodology 11 (2): 161–70.\n\n\nSylvester, Christine, Anastasia Ershova, Aleksandra Khokhlova, Nikoleta\nYordanova, and Zachary Greene. 2023. “ParlEE\nplenary speeches V2 data set: Annotated full-text of 15.1 million\nsentence-level plenary speeches of six EU legislative\nchambers.” Harvard Dataverse. https://doi.org/10.7910/DVN/VOPK0E.\n\n\nSzaszi, Barnabas, Anthony Higney, Aaron Charlton, Andrew Gelman, Ignazio\nZiano, Balazs Aczel, Daniel Goldstein, David Yeager, and Elizabeth\nTipton. 2022. “No Reason to Expect Large and Consistent Effects of\nNudge Interventions.” Proceedings of the National Academy of\nSciences 119 (31): e2200732119. https://doi.org/10.1073/pnas.2200732119.\n\n\nTaddy, Matt. 2019. Business Data Science. 1st ed. McGraw Hill.\n\n\nTaflaga, Marija, and Matthew Kerby. 2019. “Who Does What Work in a\nMinisterial Office: Politically Appointed Staff and the Descriptive\nRepresentation of Women in Australian Political Offices,\n1979–2010.” Political Studies 68 (2):\n463–85. https://doi.org/10.1177/0032321719853459.\n\n\nTal, Eran. 2020. “Measurement in\nScience.” In The Stanford Encyclopedia of\nPhilosophy, edited by Edward Zalta, Fall 2020. https://plato.stanford.edu/archives/fall2020/entries/measurement-science/;\nMetaphysics Research Lab, Stanford University.\n\n\nTang, John. 2015. “Pollution havens and the\ntrade in toxic chemicals: Evidence from U.S. trade flows.”\nEcological Economics 112 (April): 150–60. https://doi.org/10.1016/j.ecolecon.2015.02.022.\n\n\nTang, Jun, Aleksandra Korolova, Xiaolong Bai, Xueqiang Wang, and\nXiaofeng Wang. 2017. “Privacy Loss in Apple’s Implementation of\nDifferential Privacy on MacOS 10.12.” arXiv. https://doi.org/10.48550/arXiv.1709.02753.\n\n\nTausanovitch, Chris, and Lynn Vavreck. 2021. “Democracy Fund\n+ UCLA Nationscape Project.” https://www.voterstudygroup.org/data/nationscape.\n\n\nTaylor, Adam. 2015. 
“New Zealand Says No to Jedis.” The\nWashington Post, September. https://www.washingtonpost.com/news/worldviews/wp/2015/09/29/new-zealand-says-no-to-jedis/.\n\n\nTeate, Renée. 2022. SQL for Data Scientists. Wiley.\n\n\nThe Economist. 2013. “Johnson: Those Six Little Rules: George\nOrwell on Writing,” July. https://www.economist.com/prospero/2013/07/29/johnson-those-six-little-rules.\n\n\n———. 2022a. “What Spotify Data Show about the Decline of\nEnglish,” January. https://www.economist.com/interactives/graphic-detail/2022/01/29/what-spotify-data-show-about-the-decline-of-english.\n\n\n———. 2022b. “Will Emmanuel Macron Win a Second Term?”\nApril. https://www.economist.com/interactive/france-2022/forecast.\n\n\n———. 2022c. “France’s Presidential Election: The Second Round in\nDetail,” April. https://www.economist.com/interactive/france-2022/results-round-two.\n\n\nThe Washington Post. 2023. “Fatal Force Database.” https://github.com/washingtonpost/data-police-shootings.\n\n\nThe White House. 2023. “Recommendations on the Best Practices for\nthe Collection of Sexual Orientation and Gender Identity Data on Federal\nStatistical Survey,” January. https://www.whitehouse.gov/wp-content/uploads/2023/01/SOGI-Best-Practices.pdf.\n\n\nThieme, Nick. 2018. “R Generation.” Significance\n15 (4): 14–19. https://doi.org/10.1111/j.1740-9713.2018.01169.x.\n\n\nThistlethwaite, Donald, and Donald Campbell. 1960.\n“Regression-Discontinuity Analysis: An Alternative to the Ex Post\nFacto Experiment.” Journal of Educational Psychology 51\n(6): 309–17. https://doi.org/10.1037/h0044319.\n\n\nThompson, Charlie, Daniel Antal, Josiah Parry, Donal Phipps, and Tom\nWolff. 2022. spotifyr: R Wrapper for the\n“Spotify” Web API. https://CRAN.R-project.org/package=spotifyr.\n\n\nThomson-DeVeaux, Amelia, Laura Bronner, and Damini Sharma. 2021.\n“Cities Spend Millions On Police Misconduct\nEvery Year. Here’s Why It’s So Difficult to Hold Departments\nAccountable.” FiveThirtyEight, February. 
https://fivethirtyeight.com/features/police-misconduct-costs-cities-millions-every-year-but-thats-where-the-accountability-ends/.\n\n\nThornhill, John. 2021. “Lunch with the FT: Mathematician Hannah\nFry.” Financial Times, July. https://www.ft.com/content/a5e33e5a-99b9-4bbc-948f-8a527c7675c3.\n\n\nTierney, Nicholas, Di Cook, Miles McBain, and Colin Fay. 2021. naniar: Data Structures, Summaries, and Visualisations\nfor Missing Data. https://CRAN.R-project.org/package=naniar.\n\n\nTierney, Nicholas, and Karthik Ram. 2020. “A Realistic Guide to\nMaking Data Available Alongside Code to Improve Reproducibility.”\nhttps://arxiv.org/abs/2002.11626.\n\n\n———. 2021. “Common-Sense Approaches to Sharing Tabular Data\nAlongside Publication.” Patterns 2 (12): 100368. https://doi.org/10.1016/j.patter.2021.100368.\n\n\nTimbers, Tiffany. 2020. canlang: Canadian\nCensus language data. https://ttimbers.github.io/canlang/.\n\n\nTimbers, Tiffany, Trevor Campbell, and Melissa Lee. 2022. Data\nScience: A First Introduction. Chapman; Hall/CRC. https://datasciencebook.ca.\n\n\nTolley, Erin, and Mireille Paquet. 2021. “Gender, Municipal Party\nPolitics, and Montreal’s First Woman Mayor.” Canadian Journal\nof Urban Research 30 (1): 40–52. https://cjur.uwinnipeg.ca/index.php/cjur/article/view/323.\n\n\nTourangeau, Roger, Lance Rips, and Kenneth Rasinski. 2000. The\nPsychology of Survey Response. 1st ed. Cambridge University Press.\nhttps://doi.org/10.1017/CBO9780511819322.003.\n\n\nTouvron, Hugo, Thibaut Lavril, Gautier Izacard, Xavier Martinet,\nMarie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, et al. 2023.\n“LLaMA: Open and Efficient Foundation\nLanguage Models.” arXiv. https://doi.org/10.48550/ARXIV.2302.13971.\n\n\nTrisovic, Ana, Matthew Lau, Thomas Pasquier, and Mercè Crosas. 2022.\n“A Large-Scale Study on Research Code Quality and\nExecution.” Scientific Data 9 (1). https://doi.org/10.1038/s41597-022-01143-6.\n\n\nTukey, John. 1962. 
“The Future of Data Analysis.” The\nAnnals of Mathematical Statistics 33 (1): 1–67. https://doi.org/10.1214/aoms/1177704711.\n\n\n———. 1977. Exploratory Data Analysis.\n\n\nTurcotte, Alexi, Aviral Goel, Filip Křikava, and Jan Vitek. 2020.\n“Designing Types for R, Empirically.” Proceedings of\nthe ACM on Programming Languages 4\n(OOPSLA): 1–25. https://doi.org/10.1145/3428249.\n\n\nUN IGME. 2021. “Levels and Trends in Child Mortality,\n2021.” https://childmortality.org/wp-content/uploads/2021/12/UNICEF-2021-Child-Mortality-Report.pdf.\n\n\nUrban, Steve, Rangarajan Sreenivasan, and Vineet Kannan. 2016.\n“It’s All A/Bout Testing: The Netflix\nExperimentation Platform.” Netflix Technology\nBlog, April. https://netflixtechblog.com/its-all-a-bout-testing-the-netflix-experimentation-platform-4e1ca458c15.\n\n\nUshey, Kevin. 2022. renv: Project\nEnvironments. https://CRAN.R-project.org/package=renv.\n\n\nvan Buuren, Stef, and Karin Groothuis-Oudshoorn. 2011. “mice: Multivariate Imputation by Chained Equations in\nR.” Journal of Statistical Software 45 (3): 1–67.\nhttps://doi.org/10.18637/jss.v045.i03.\n\n\nVan den Broeck, Jan, Solveig Argeseanu Cunningham, Roger Eeckels, and\nKobus Herbst. 2005. “Data Cleaning: Detecting, Diagnosing, and\nEditing Data Abnormalities.” PLOS Medicine 2 (10): e267.\nhttps://doi.org/10.1371/journal.pmed.0020267.\n\n\nvan der Loo, Mark. 2022. The Data Validation Cookbook. https://data-cleaning.github.io/validate/.\n\n\nvan der Loo, Mark, and Edwin De Jonge. 2021. “Data Validation Infrastructure for R.”\nJournal of Statistical Software 97 (10): 1–33. https://doi.org/10.18637/jss.v097.i10.\n\n\nVanderplas, Susan, Dianne Cook, and Heike Hofmann. 2020. “Testing\nStatistical Charts: What Makes a Good Graph?” Annual Review\nof Statistics and Its Application 7: 61–88. https://doi.org/10.1146/annurev-statistics-031219-041252.\n\n\nVanhoenacker, Mark. 2015. Skyfaring: A Journey with a Pilot.\n1st ed. Alfred A. 
Knopf.\n\n\nVarin, Cristiano, Nancy Reid, and David Firth. 2011. “An Overview\nof Composite Likelihood Methods.” Statistica Sinica,\n5–42. https://www.jstor.org/stable/24309261.\n\n\nVarner, Maddy, and Aaron Sankin. 2020. “Suckers List: How\nAllstate’s Secret Auto Insurance Algorithm Squeezes Big\nSpenders.” The Markup, February. https://themarkup.org/allstates-algorithm/2020/02/25/car-insurance-suckers-list.\n\n\nVavreck, Lynn, and Chris Tausanovitch. 2021. “Democracy Fund\n+ UCLA Nationscape Project User Guide.” https://www.voterstudygroup.org/data/nationscape.\n\n\nVickers, Andrew, and Emily Vertosick. 2016. “An Empirical Study of\nRace Times in Recreational Endurance Runners.”\nBMC Sports Science, Medicine and Rehabilitation 8\n(1). https://doi.org/10.1186/s13102-016-0052-y.\n\n\nVidoni, Melina. 2021. “Evaluating Unit\nTesting Practices in R Packages.” In 2021 IEEE/ACM\n43rd International Conference on Software Engineering (ICSE),\n1523–34. https://doi.org/10.1109/ICSE43902.2021.00136.\n\n\nvon Bergmann, Jens, Dmitry Shkolnik, and Aaron Jacobs. 2021. cancensus: R package to access, retrieve, and work with\nCanadian Census data and geography. https://mountainmath.github.io/cancensus/.\n\n\nWalby, Kevin, and Alex Luscombe. 2019. Freedom of Information and\nSocial Science Research Design. Routledge.\n\n\nWalker, Kyle. 2022. Analyzing US Census Data. Chapman;\nHall/CRC. https://walker-data.com/census-r/index.html.\n\n\nWalker, Kyle, and Matt Herman. 2022. tidycensus: Load US Census Boundary and Attribute Data as\n“tidyverse” and “sf”-Ready Data\nFrames. https://CRAN.R-project.org/package=tidycensus.\n\n\nWallach, Hanna. 2018. “Computational Social Science ≠ Computer Science + Social Data.”\nCommunications of the ACM 61 (3): 42–44. https://doi.org/10.1145/3132698.\n\n\nWan, Mengting, and Julian J. McAuley. 2018. 
“Item Recommendation\non Monotonic Behavior Chains.” In Proceedings of the 12th\nACM Conference on Recommender Systems, RecSys 2018,\nVancouver, BC, Canada, October 2-7, 2018, edited by Sole Pera,\nMichael D. Ekstrand, Xavier Amatriain, and John O’Donovan, 86–94.\nACM. https://doi.org/10.1145/3240323.3240369.\n\n\nWan, Mengting, Rishabh Misra, Ndapa Nakashole, and Julian J. McAuley.\n2019. “Fine-Grained Spoiler Detection from Large-Scale Review\nCorpora.” In Proceedings of the 57th Conference of the\nAssociation for Computational Linguistics, ACL 2019,\nFlorence, Italy, July 28- August 2, 2019, Volume 1: Long Papers,\nedited by Anna Korhonen, David R. Traum, and Lluís Màrquez, 2605–10.\nAssociation for Computational Linguistics. https://doi.org/10.18653/v1/p19-1248.\n\n\nWang, Wei, David Rothschild, Sharad Goel, and Andrew Gelman. 2015.\n“Forecasting Elections with Non-Representative Polls.”\nInternational Journal of Forecasting 31 (3): 980–91. https://doi.org/10.1016/j.ijforecast.2014.06.001.\n\n\nWang, Yilun, and Michal Kosinski. 2018. “Deep Neural Networks Are\nMore Accurate Than Humans at Detecting Sexual Orientation from Facial\nImages.” Journal of Personality and Social Psychology\n114 (2): 246–57. https://doi.org/10.1037/pspa0000098.\n\n\nWardrop, Robert. 1995. “Simpson’s Paradox and the Hot Hand in\nBasketball.” The American Statistician 49 (1): 24–28. https://doi.org/10.2307/2684806.\n\n\nWare, James. 1989. “Investigating Therapies of Potentially Great\nBenefit: ECMO.” Statistical Science 4 (4): 298–306. https://doi.org/10.1214/ss/1177012384.\n\n\nWasserman, Larry. 2005. All of Statistics. Springer.\n\n\nWei, LJ, and S Durham. 1978. “The Randomized Play-the-Winner Rule\nin Medical Trials.” Journal of the American Statistical\nAssociation 73 (364): 840–43. https://doi.org/10.2307/2286290.\n\n\nWeinberg, Gerald. 1971. 
The Psychology of Computer Programming.\nNew York: Van Nostrand Reinhold Company.\n\n\nWeissgerber, Tracey, Natasa Milic, Stacey Winham, and Vesna Garovic.\n2015. “Beyond Bar and Line Graphs: Time for a New Data\nPresentation Paradigm.” PLoS Biology 13 (4): e1002128.\nhttps://doi.org/10.1371/journal.pbio.1002128.\n\n\nWhitby, Andrew. 2020. The Sum of the\nPeople. New York: Basic Books.\n\n\nWhitelaw, James. 1805. An Essay on the Population of Dublin. Being\nthe Result of an Actual Survey Taken in 1798, with Great Care and\nPrecision, and Arranged in a Manner Entirely New. Graisberry;\nCampbell.\n\n\nWicherts, Jelte, Marjan Bakker, and Dylan Molenaar. 2011.\n“Willingness to Share Research Data Is Related to the Strength of\nthe Evidence and the Quality of Reporting of Statistical\nResults.” PLOS ONE 6 (11): e26828. https://doi.org/10.1371/journal.pone.0026828.\n\n\nWickham, Hadley. 2009. “Manipulating Data.” In ggplot2, 157–75. Springer New York. https://doi.org/10.1007/978-0-387-98141-3_9.\n\n\n———. 2011. “testthat: Get Started with\nTesting.” The R Journal 3: 5–10. https://journal.r-project.org/archive/2011-1/RJournal%5F2011-1%5FWickham.pdf.\n\n\n———. 2014. “Tidy Data.” Journal of Statistical\nSoftware 59 (10): 1–23. https://doi.org/10.18637/jss.v059.i10.\n\n\n———. 2016. ggplot2: Elegant Graphics for Data\nAnalysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.\n\n\n———. 2017. tidyverse: Easily Install and Load\nthe “Tidyverse”. https://CRAN.R-project.org/package=tidyverse.\n\n\n———. 2018. “Whole Game.” YouTube, January. https://youtu.be/go5Au01Jrvs.\n\n\n———. 2019. Advanced R. 2nd ed. Chapman; Hall/CRC.\nhttps://adv-r.hadley.nz.\n\n\n———. 2020. Tidyverse. https://www.tidyverse.org/.\n\n\n———. 2021a. babynames: US Baby Names\n1880-2017. https://CRAN.R-project.org/package=babynames.\n\n\n———. 2021b. Mastering Shiny. 1st ed. O’Reilly Media. https://mastering-shiny.org.\n\n\n———. 2021c. The Tidyverse Style Guide. 
https://style.tidyverse.org/index.html.\n\n\n———. 2022a. R Packages. 2nd ed. O’Reilly Media. https://r-pkgs.org.\n\n\n———. 2022b. rvest: Easily Harvest (Scrape) Web\nPages. https://CRAN.R-project.org/package=rvest.\n\n\n———. 2022c. stringr: Simple, Consistent\nWrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.\n\n\n———. 2023a. forcats: Tools for Working with\nCategorical Variables (Factors). https://CRAN.R-project.org/package=forcats.\n\n\n———. 2023b. httr: Tools for Working with URLs\nand HTTP. https://CRAN.R-project.org/package=httr.\n\n\nWickham, Hadley, Mara Averick, Jenny Bryan, Winston Chang, Lucy\nD’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019.\n“Welcome to the Tidyverse.”\nJournal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.\n\n\nWickham, Hadley, and Jennifer Bryan. 2023. readxl: Read Excel Files. https://CRAN.R-project.org/package=readxl.\n\n\nWickham, Hadley, Jennifer Bryan, and Malcolm Barrett. 2022. usethis: Automate Package and Project Setup.\nhttps://CRAN.R-project.org/package=usethis.\n\n\nWickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. (2016)\n2023. R for Data Science. 2nd ed. O’Reilly Media. https://r4ds.hadley.nz.\n\n\nWickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2022.\ndplyr: A Grammar of Data\nManipulation. https://CRAN.R-project.org/package=dplyr.\n\n\nWickham, Hadley, Maximilian Girlich, and Edgar Ruiz. 2022. dbplyr: A “dplyr” Back End for\nDatabases. https://CRAN.R-project.org/package=dbplyr.\n\n\nWickham, Hadley, and Lionel Henry. 2022. purrr:\nFunctional Programming Tools. https://CRAN.R-project.org/package=purrr.\n\n\nWickham, Hadley, Jim Hester, and Jenny Bryan. 2022. readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.\n\n\nWickham, Hadley, Jim Hester, Winston Chang, and Jenny Bryan. 2022.\ndevtools: Tools to Make Developing R Packages\nEasier. 
https://CRAN.R-project.org/package=devtools.\n\n\nWickham, Hadley, Jim Hester, and Jeroen Ooms. 2021. xml2: Parse XML. https://CRAN.R-project.org/package=xml2.\n\n\nWickham, Hadley, Evan Miller, and Danny Smith. 2023. haven: Import and Export “SPSS”\n“Stata” and “SAS” Files. https://CRAN.R-project.org/package=haven.\n\n\nWickham, Hadley, and Dana Seidel. 2022. scales:\nScale Functions for Visualization. https://CRAN.R-project.org/package=scales.\n\n\nWickham, Hadley, and Lisa Stryjewski. 2011. “40 Years of\nBoxplots,” November. https://vita.had.co.nz/papers/boxplots.pdf.\n\n\nWickham, Hadley, Davis Vaughan, and Maximilian Girlich. 2023. tidyr: Tidy Messy Data. https://CRAN.R-project.org/package=tidyr.\n\n\nWiessner, Polly. 2014. “Embers of Society: Firelight Talk Among\nthe Ju/’hoansi Bushmen.” Proceedings of the National Academy\nof Sciences 111 (39): 14027–35. https://doi.org/10.1073/pnas.1404212111.\n\n\nWilde, Oscar. 1891. The Picture of Dorian Gray. https://www.gutenberg.org/files/174/174-h/174-h.htm.\n\n\nWilford, John Noble. 1977. “Wernher von Braun, Rocket Pioneer,\nDies.” The New York Times, June. https://www.nytimes.com/1977/06/18/archives/wernher-von-braun-rocket-pioneer-dies-wernher-von-braun-pioneer-in.html.\n\n\nWilkinson, Leland. 2005. The Grammar of Graphics. 2nd ed.\nSpringer.\n\n\nWilkinson, Mark, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle\nAppleton, Myles Axton, Arie Baak, Niklas Blomberg, et al. 2016.\n“The FAIR Guiding Principles for Scientific Data Management and\nStewardship.” Scientific Data 3 (1): 1–9. https://doi.org/10.1038/sdata.2016.18.\n\n\nWilson, Greg, Jenny Bryan, Karen Cranston, Justin Kitzes, Lex\nNederbragt, and Tracy Teal. 2017. “Good Enough Practices in\nScientific Computing.” PLOS Computational Biology 13\n(6): 1–20. https://doi.org/10.1371/journal.pcbi.1005510.\n\n\nWong, Julia Carrie. 2020. “One Year Inside Trump’s Monumental\nFacebook Campaign.” The Guardian, January. 
https://www.theguardian.com/us-news/2020/jan/28/donald-trump-facebook-ad-campaign-2020-election.\n\n\nWood, Simon. 2015. Core Statistics. Cambridge University Press.\nhttps://www.maths.ed.ac.uk/\\%7Eswood34/core-statistics.pdf.\n\n\nWorld Health Organization. 2019. “Trends in Maternal Mortality\n2000 to 2017: Estimates by WHO, UNICEF, UNFPA, World Bank Group and the\nUnited Nations Population Division.” https://apps.who.int/iris/handle/10665/327596.\n\n\nWright, Philip. 1928. The Tariff on Animal and Vegetable Oils.\nNew York: Macmillan Company.\n\n\nWu, Changbao, and Mary Thompson. 2020. Sampling Theory and\nPractice. Springer.\n\n\nXie, Yihui. 2019. “TinyTeX: A lightweight,\ncross-platform, and easy-to-maintain LaTeX distribution based on TeX\nLive.” TUGboat, no. 1: 30–32. https://tug.org/TUGboat/Contents/contents40-1.html.\n\n\n———. 2023. knitr: A General-Purpose Package for\nDynamic Report Generation in R. https://yihui.org/knitr/.\n\n\nXu, Ya. 2020. “Causal Inference Challenges in Industry: A\nPerspective from Experiences at LinkedIn.” YouTube,\nJuly. https://youtu.be/OoKsLAvyIYA.\n\n\nYoshioka, Alan. 1998. “Use of Randomisation in the Medical\nResearch Council’s Clinical Trial of Streptomycin in Pulmonary\nTuberculosis in the 1940s.” BMJ 317 (7167): 1220–23. https://doi.org/10.1136/bmj.317.7167.1220.\n\n\nZhang, Ping, XunPeng Shi, YongPing Sun, Jingbo Cui, and Shuai Shao.\n2019. “Have China’s provinces achieved their\ntargets of energy intensity reduction? Reassessment based on nighttime\nlighting data.” Energy Policy 128 (May): 276–83.\nhttps://doi.org/10.1016/j.enpol.2019.01.014.\n\n\nZhang, Susan, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen,\nShuohui Chen, Christopher Dewan, et al. 2022. “OPT: Open\nPre-Trained Transformer Language Models.” arXiv. https://doi.org/10.48550/arXiv.2205.01068.\n\n\nZimmer, Michael. 2018. 
“Addressing Conceptual Gaps in Big Data\nResearch Ethics: An Application of Contextual Integrity.”\nSocial Media + Society 4 (2): 1–11. https://doi.org/10.1177/2056305118768300.\n\n\nZinsser, William. 1976. On Writing Well. New York:\nHarperCollins.\n\n\nZook, Matthew, Solon Barocas, danah boyd, Kate Crawford, Emily Keller,\nSeeta Peña Gangadharan, Alyssa Goodman, et al. 2017. “Ten Simple\nRules for Responsible Big Data Research.” PLOS Computational\nBiology 13 (3): e1005399. https://doi.org/10.1371/journal.pcbi.1005399.",
+ "text": "Abadie, Alberto, Susan Athey, Guido Imbens, and Jeffrey Wooldridge.\n2017. “When Should You Adjust Standard Errors for\nClustering?” Working Paper 24003. Working Paper Series. National\nBureau of Economic Research. https://doi.org/10.3386/w24003.\n\n\nAbelson, Harold, and Gerald Jay Sussman. 1996. Structure and\nInterpretation of Computer Programs. Cambridge: The MIT Press.\n\n\nAbeysooriya, Mandhri, Megan Soria, Mary Sravya Kasu, and Mark Ziemann.\n2021. “Gene Name Errors: Lessons Not Learned.” PLOS\nComputational Biology 17 (7): 1–13. https://doi.org/10.1371/journal.pcbi.1008984.\n\n\nAcemoglu, Daron, Simon Johnson, and James Robinson. 2001. “The\nColonial Origins of Comparative Development: An Empirical\nInvestigation.” American Economic Review 91\n(5): 1369–1401. https://doi.org/10.1257/aer.91.5.1369.\n\n\nAchen, Christopher. 1978. “Measuring Representation.”\nAmerican Journal of Political Science 22 (3): 475–510. https://doi.org/10.2307/2110458.\n\n\nAkerlof, George. 1970. “The Market for ‘Lemons’:\nQuality Uncertainty and the Market Mechanism.” The Quarterly\nJournal of Economics 84 (3): 488–500. https://doi.org/10.2307/1879431.\n\n\nAlexander, Monica. 2019a. “Reproducibility in Demographic\nResearch.” https://www.monicaalexander.com/posts/2019-10-20-reproducibility/.\n\n\n———. 2019b. “The Concentration and Uniqueness of Baby Names in\nAustralia and the US,” January. https://www.monicaalexander.com/posts/2019-20-01-babynames/.\n\n\n———. 2019c. “Analyzing Name Changes After Marriage Using a\nNon-Representative Survey,” August. https://www.monicaalexander.com/posts/2019-08-07-mrp/.\n\n\n———. 2021. “Overcoming Barriers to Sharing Code.”\nYouTube, February. https://youtu.be/yvM2C6aZ94k.\n\n\nAlexander, Monica, and Leontine Alkema. 2022. “A Bayesian Cohort Component Projection Model to Estimate\nWomen of Reproductive Age at the Subnational Level in Data-Sparse\nSettings.” Demography 59 (5): 1713–37. 
https://doi.org/10.1215/00703370-10216406.\n\n\nAlexander, Monica, Mathew Kiang, and Magali Barbieri. 2018.\n“Trends in Black and White Opioid Mortality in the United States,\n1979–2015.” Epidemiology 29 (5): 707–15. https://doi.org/10.1097/EDE.0000000000000858.\n\n\nAlexander, Rohan, and Monica Alexander. 2021. “The Increased\nEffect of Elections and Changing Prime Ministers on Topics Discussed in\nthe Australian Federal Parliament Between 1901 and 2018.” https://doi.org/10.48550/arXiv.2111.09299.\n\n\nAlexander, Rohan, and Paul Hodgetts. 2021.\nAustralianPoliticians: Provides Datasets About Australian\nPoliticians. https://CRAN.R-project.org/package=AustralianPoliticians.\n\n\nAlexander, Rohan, and A Mahfouz. 2021. heapsofpapers: Easily Download Heaps of PDF and CSV\nFiles. https://CRAN.R-project.org/package=heapsofpapers.\n\n\nAlexander, Rohan, and Zachary Ward. 2018. “Age at Arrival and\nAssimilation During the Age of Mass Migration.” The Journal\nof Economic History 78 (3): 904–37. https://doi.org/10.1017/S0022050718000335.\n\n\nAlexopoulos, Michelle, and Jon Cohen. 2015. “The power of print: Uncertainty shocks, markets, and the\neconomy.” International Review of Economics\n& Finance 40 (November): 8–28. https://doi.org/10.1016/j.iref.2015.02.002.\n\n\nAllen, Jeff. 2021. plumberDeploy: Plumber\nDeployment. https://CRAN.R-project.org/package=plumberDeploy.\n\n\nAlsan, Marcella, and Amy Finkelstein. 2021. “Beyond Causality:\nAdditional Benefits of Randomized Controlled Trials for Improving Health\nCare Delivery.” The Milbank Quarterly 99 (4): 864–81. https://doi.org/10.1111/1468-0009.12521.\n\n\nAlsan, Marcella, and Marianne Wanamaker. 2018. “Tuskegee and the\nHealth of Black Men.” The Quarterly Journal of Economics\n133 (1): 407–55. https://doi.org/10.1093/qje/qjx029.\n\n\nAltman, Douglas, and Martin Bland. 1995. “Statistics notes: The normal distribution.”\nBMJ 310 (6975): 298–98. https://doi.org/10.1136/bmj.310.6975.298.\n\n\nAmaka, Ofunne, and Amber Thomas. 
2021. “The Naked Truth: How the\nNames of 6,816 Complexion Products Can Reveal Bias in Beauty.”\nThe Pudding, March. https://pudding.cool/2021/03/foundation-names/.\n\n\nAmerican Medical Association and New York Academy of Medicine. 1848.\nCode of Medical Ethics. Academy of Medicine. https://hdl.handle.net/2027/chi.57108026.\n\n\nAndersen, Robert, and David Armstrong. 2021. Presenting Statistical\nResults Effectively. London: Sage.\n\n\nAnderson, Margo. (1988) 2015. The American Census: A Social\nHistory. 2nd ed. Yale University Press.\n\n\nAnderson, Margo, and Stephen Fienberg. 1999. Who Counts?: The Politics of Census-Taking in\nContemporary America. Russell Sage Foundation. http://www.jstor.org/stable/10.7758/9781610440059.\n\n\nAndrews, David, and Agnes Herzberg. 2012. Data: A Collection of\nProblems from Many Fields for the Student and Research Worker. New\nYork: Springer Science & Business Media.\n\n\nAngelucci, Charles, and Julia Cagé. 2019. “Newspapers in Times of\nLow Advertising Revenues.” American Economic Journal:\nMicroeconomics 11 (3): 319–64. https://doi.org/10.1257/mic.20170306.\n\n\nAngrist, Joshua, and Alan Krueger. 2001. “Instrumental Variables\nand the Search for Identification: From Supply and Demand to Natural\nExperiments.” Journal of Economic Perspectives 15 (4):\n69–85. https://doi.org/10.1257/jep.15.4.69.\n\n\nAngrist, Joshua, and Jörn-Steffen Pischke. 2010. “The Credibility\nRevolution in Empirical Economics: How Better Research Design Is Taking\nthe Con Out of Econometrics.” Journal of Economic\nPerspectives 24 (2): 3–30. https://doi.org/10.1257/jep.24.2.3.\n\n\nAnnas, George. 2003. “HIPAA Regulations: A New Era of\nMedical-Record Privacy?” New England Journal of Medicine\n348 (15): 1486–90. https://doi.org/10.1056/NEJMlim035027.\n\n\nAnsolabehere, Stephen, Brian Schaffner, and Sam Luks. 2021. “Guide to the 2020 Cooperative Election\nStudy.” https://doi.org/10.7910/DVN/E9N6PH.\n\n\nAprameya, Lavanya. 2020. 
“Improving Duolingo, One Experiment at a\nTime.” Duolingo Blog, January. https://blog.duolingo.com/improving-duolingo-one-experiment-at-a-time/.\n\n\nArel-Bundock, Vincent. 2021. WDI: World\nDevelopment Indicators and Other World Bank Data. https://CRAN.R-project.org/package=WDI.\n\n\n———. 2022. “modelsummary: Data and\nModel Summaries in R.” Journal of Statistical\nSoftware 103 (1): 1–23. https://doi.org/10.18637/jss.v103.i01.\n\n\n———. 2023. marginaleffects: Predictions,\nComparisons, Slopes, Marginal Means, and Hypothesis Tests.\nhttps://vincentarelbundock.github.io/marginaleffects/.\n\n\nArel-Bundock, Vincent, Ryan Briggs, Hristos Doucouliagos, Marco Mendoza\nAviña, and T. D. Stanley. 2022. “Quantitative Political Science\nResearch Is Greatly Underpowered.” https://osf.io/bzj9y/.\n\n\nArmstrong, Zan. 2022. “Stop Aggregating Away the Signal in Your\nData.” The Overflow, March. https://stackoverflow.blog/2022/03/03/stop-aggregating-away-the-signal-in-your-data/.\n\n\nArnold, Jeffrey. 2021. ggthemes: Extra Themes,\nScales and Geoms for “ggplot2”. https://CRAN.R-project.org/package=ggthemes.\n\n\nAsher, Sam, Tobias Lunt, Ryu Matsuura, and Paul Novosad. 2021.\n“Development Research at High Geographic Resolution: An Analysis\nof Night Lights, Firms, and Poverty in India Using the SHRUG Open Data\nPlatform.” World Bank Economic Review 35 (4). https://shrug-assets-ddl.s3.amazonaws.com/static/main/assets/other/almn-shrug.pdf.\n\n\nAthey, Susan, and Guido Imbens. 2017a. “The Econometrics of\nRandomized Experiments.” In Handbook of Field\nExperiments, 73–140. Elsevier. https://doi.org/10.1016/bs.hefe.2016.10.003.\n\n\n———. 2017b. “The State of Applied Econometrics: Causality and\nPolicy Evaluation.” Journal of Economic Perspectives 31\n(2): 3–32. https://doi.org/10.1257/jep.31.2.3.\n\n\nAthey, Susan, Guido Imbens, Jonas Metzger, and Evan Munro. 2021.\n“Using Wasserstein Generative Adversarial Networks for the Design\nof Monte Carlo Simulations.” Journal of Econometrics. 
https://doi.org/10.1016/j.jeconom.2020.09.013.\n\n\nAu, Randy. 2020. “Data Cleaning IS Analysis, Not Grunt\nWork,” September. https://counting.substack.com/p/data-cleaning-is-analysis-not-grunt.\n\n\n———. 2022. “Celebrating Everyone Counting Things,”\nFebruary. https://counting.substack.com/p/celebrating-everyone-counting-things.\n\n\nBååth, Rasmus. 2018. beepr: Easily Play\nNotification Sounds on any Platform. https://CRAN.R-project.org/package=beepr.\n\n\nBache, Stefan Milton, and Hadley Wickham. 2022. magrittr: A Forward-Pipe Operator for R. https://CRAN.R-project.org/package=magrittr.\n\n\nBackus, John. 1981. “The History of FORTRAN\nI, II, and III.” In History of Programming\nLanguages, edited by Richard Wexelblat, 25–74. Academic Press.\n\n\nBailey, Rosemary. 2008. Design of Comparative Experiments.\nCambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511611483.\n\n\nBaio, Gianluca, and Marta Blangiardo. 2010. “Bayesian Hierarchical\nModel for the Prediction of Football Results.” Journal of\nApplied Statistics 37 (2): 253–64. https://doi.org/10.1080/02664760802684177.\n\n\nBaker, Dominique. 2023. “Scams Will Not Save Us (Tuition\nDollars),” February. http://www.dominiquebaker.com/blog/2023/2/16/scams-will-not-save-us-tuition-dollars.\n\n\nBaker, Reg, Michael Brick, Nancy Bates, Mike Battaglia, Mick Couper,\nJill Dever, Krista Gile, and Roger Tourangeau. 2013. “Summary Report of the AAPOR Task Force on Non-Probability\nSampling.” Journal of Survey Statistics and\nMethodology 1 (2): 90–143. https://doi.org/10.1093/jssam/smt008.\n\n\nBandy, John, and Nicholas Vincent. 2021. “Addressing\n‘Documentation Debt’ in Machine Learning: A Retrospective\nDatasheet for BookCorpus.” In Proceedings of the Neural\nInformation Processing Systems Track on Datasets and Benchmarks,\nedited by J. Vanschoren and S. Yeung. Vol. 1. 
https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/54229abfcfa5649e7003b83dd4755294-Paper-round1.pdf.\n\n\nBanerjee, Abhijit, and Esther Duflo. 2011. Poor Economics: A Radical\nRethinking of the Way to Fight Global Poverty. New York:\nPublicAffairs.\n\n\nBanerjee, Abhijit, Esther Duflo, Rachel Glennerster, and Cynthia Kinnan.\n2015. “The Miracle of Microfinance? Evidence from a Randomized\nEvaluation.” American Economic Journal: Applied\nEconomics 7 (1): 22–53. https://doi.org/10.1257/app.20130533.\n\n\nBanes, Graham, Emily Fountain, Alyssa Karklus, Robert Fulton, Lucinda\nAntonacci-Fulton, and Joanne Nelson. 2022. “Nine out of ten samples were mistakenly switched by The\nOrang-utan Genome Consortium.” Scientific Data 9\n(1). https://doi.org/10.1038/s41597-022-01602-0.\n\n\nBarba, Lorena. 2018. “Terminologies for Reproducible\nResearch.” https://arxiv.org/abs/1802.03311.\n\n\nBarrett, Malcolm. 2021a. Data Science as an Atomic Habit. https://malco.io/articles/2021-01-04-data-science-as-an-atomic-habit.\n\n\n———. 2021b. ggdag: Analyze and Create Elegant\nDirected Acyclic Graphs. https://CRAN.R-project.org/package=ggdag.\n\n\nBarron, Alexander, Jenny Huang, Rebecca Spang, and Simon DeDeo. 2018.\n“Individuals, Institutions, and Innovation in the Debates of the\nFrench Revolution.” Proceedings of the National Academy of\nSciences 115 (18): 4607–12. https://doi.org/10.1073/pnas.1717729115.\n\n\nBaumer, Benjamin, Daniel Kaplan, and Nicholas Horton. 2021.\nModern Data Science With R. 2nd ed. Chapman;\nHall/CRC. https://mdsr-book.github.io/mdsr2e/.\n\n\nBaumgartner, Jason, Savvas Zannettou, Brian Keegan, Megan Squire, and\nJeremy Blackburn. 2020. “The Pushshift Reddit Dataset.”\narXiv. https://doi.org/10.48550/arxiv.2001.08435.\n\n\nBaumgartner, Peter. 2021. “Ways I Use Testing\nas a Data Scientist,” December. https://www.peterbaumgartner.com/blog/testing-for-data-science/.\n\n\nBeaumont, Jean-Francois. 2020. 
“Are Probability Surveys Bound to\nDisappear for the Production of Official Statistics?” Survey\nMethodology 46 (1): 1–29.\n\n\nBeauregard, Katrine, and Jill Sheppard. 2021. “Antiwomen but\nProquota: Disaggregating Sexism and Support for Gender Quota\nPolicies.” Political Psychology 42 (2): 219–37. https://doi.org/10.1111/pops.12696.\n\n\nBecker, Richard, Allan Wilks, Ray Brownrigg, Thomas Minka, and Alex\nDeckmyn. 2022. maps: Draw Geographical\nMaps. https://CRAN.R-project.org/package=maps.\n\n\nBeelen, Kaspar, Timothy Alberdingk Thim, Christopher Cochrane, Kees\nHalvemaan, Graeme Hirst, Michael Kimmins, Sander Lijbrink, et al. 2017.\n“Digitization of the Canadian Parliamentary Debates.”\nCanadian Journal of Political Science 50 (3): 849–64.\n\n\nBegley, Glenn, and Lee Ellis. 2012. “Raise Standards for\nPreclinical Cancer Research.” Nature 483 (7391):\n531–33. https://doi.org/10.1038/483531a.\n\n\nBender, Emily, Timnit Gebru, Angelina McMillan-Major, and Shmargaret\nShmitchell. 2021. “On the Dangers of Stochastic Parrots: Can\nLanguage Models Be Too Big?” In Proceedings of the 2021\nACM Conference on Fairness, Accountability, and\nTransparency. ACM. https://doi.org/10.1145/3442188.3445922.\n\n\nBengtsson, Henrik. 2021. “A Unifying\nFramework for Parallel and Distributed Processing in R using\nFutures.” The R Journal 13 (2): 208–27. https://doi.org/10.32614/RJ-2021-048.\n\n\nBenoit, Kenneth. 2020. “Text as Data: An Overview.” In\nThe SAGE Handbook of Research Methods in Political Science and\nInternational Relations, edited by Luigi Curini and Robert\nFranzese, 461–97. London: SAGE Publishing. https://doi.org/10.4135/9781526486387.n29.\n\n\nBenoit, Kenneth, and Michael Laver. 2006. Party\nPolicy in Modern Democracies. Routledge.\n\n\n———. 2007. “Estimating Party Policy Positions: Comparing Expert\nSurveys and Hand-Coded Content Analysis.” Electoral\nStudies 26 (1): 90–107. 
https://doi.org/10.1016/j.electstud.2006.04.008.\n\n\nBenoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng,\nStefan Müller, and Akitaka Matsuo. 2018. “quanteda: An R package for the quantitative analysis of\ntextual data.” Journal of Open Source Software 3\n(30): 774. https://doi.org/10.21105/joss.00774.\n\n\nBensinger, Greg. 2020. “Google Redraws the Borders on Maps\nDepending on Who’s Looking.” The Washington Post,\nFebruary. https://www.washingtonpost.com/technology/2020/02/14/google-maps-political-borders/.\n\n\nBerdine, Gilbert, Vincent Geloso, and Benjamin Powell. 2018.\n“Cuban Infant Mortality and Longevity: Health Care or\nRepression?” Health Policy and Planning 33 (6): 755–57.\nhttps://doi.org/10.1093/heapol/czy033.\n\n\nBerkson, Joseph. 1946. “Limitations of the Application of Fourfold\nTable Analysis to Hospital Data.” Biometrics Bulletin 2\n(3): 47–53. https://doi.org/10.2307/3002000.\n\n\nBerners-Lee, Timothy. 1989. “Information Management: A\nProposal.” https://www.w3.org/History/1989/proposal.html.\n\n\nBerry, Donald. 1989. “Comment: Ethics and ECMO.”\nStatistical Science 4 (4): 306–10. https://www.jstor.org/stable/2245830.\n\n\nBertrand, Marianne, and Sendhil Mullainathan. 2004. “Are Emily and\nGreg More Employable Than Lakisha and Jamal? A Field Experiment on Labor\nMarket Discrimination.” American Economic Review 94 (4):\n991–1013. https://doi.org/10.1257/0002828042002561.\n\n\nBethlehem, R. A. I., J. Seidlitz, S. R. White, J. W. Vogel, K. M.\nAnderson, C. Adamson, S. Adler, et al. 2022. “Brain Charts for the\nHuman Lifespan.” Nature 604 (7906): 525–33. https://doi.org/10.1038/s41586-022-04554-y.\n\n\nBetz, Timm, Scott Cook, and Florian Hollenbach. 2018. “On the Use\nand Abuse of Spatial Instruments.” Political Analysis 26\n(4): 474–79. https://doi.org/10.1017/pan.2018.10.\n\n\nBickel, Peter, Eugene Hammel, and William O’Connell. 1975. 
“Sex\nBias in Graduate Admissions: Data from Berkeley: Measuring Bias Is\nHarder Than Is Usually Assumed, and the Evidence Is Sometimes Contrary\nto Expectation.” Science 187 (4175): 398–404. https://doi.org/10.1126/science.187.4175.398.\n\n\nBiderman, Stella, Kieran Bicheno, and Leo Gao. 2022. “Datasheet\nfor the Pile.” https://arxiv.org/abs/2201.07311.\n\n\nBirkmeyer, John, Jonathan Finks, Amanda O’Reilly, Mary Oerline, Arthur\nCarlin, Andre Nunn, Justin Dimick, Mousumi Banerjee, and Nancy\nBirkmeyer. 2013. “Surgical Skill and Complication Rates After\nBariatric Surgery.” New England Journal of Medicine 369\n(15): 1434–42. https://doi.org/10.1056/nejmsa1300625.\n\n\nBlair, Ed, Seymour Sudman, Norman M Bradburn, and Carol Stocking. 1977.\n“How to Ask Questions about Drinking and Sex: Response Effects in\nMeasuring Consumer Behavior.” Journal of Marketing\nResearch 14 (3): 316–21. https://doi.org/10.2307/3150769.\n\n\nBlair, Graeme, Jasper Cooper, Alexander Coppock, and Macartan Humphreys.\n2019. “Declaring and Diagnosing Research Designs.”\nAmerican Political Science Review 113 (3): 838–59. https://doi.org/10.1017/S0003055419000194.\n\n\nBlair, Graeme, Jasper Cooper, Alexander Coppock, Macartan Humphreys, and\nLuke Sonnet. 2021. estimatr: Fast Estimators\nfor Design-Based Inference. https://CRAN.R-project.org/package=estimatr.\n\n\nBlair, James. 2019. Democratizing R with\nPlumber APIs. https://posit.co/resources/videos/democratizing-r-with-plumber-apis/.\n\n\nBland, Martin, and Douglas Altman. 1986. “Statistical Methods for\nAssessing Agreement Between Two Methods of Clinical Measurement.”\nThe Lancet 327 (8476): 307–10. https://doi.org/10.1016/S0140-6736(86)90837-8.\n\n\nBlei, David. 2012. “Probabilistic Topic Models.”\nCommunications of the ACM 55 (4): 77–84. https://doi.org/10.1145/2133806.2133826.\n\n\nBlei, David, Andrew Ng, and Michael Jordan. 2003. “Latent\nDirichlet Allocation.” Journal of Machine Learning\nResearch 3 (Jan): 993–1022. 
https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf.\n\n\nBloom, Howard, Andrew Bell, and Kayla Reiman. 2020. “Using Data\nfrom Randomized Trials to Assess the Likely Generalizability of\nEducational Treatment-Effect Estimates from Regression Discontinuity\nDesigns.” Journal of Research on Educational\nEffectiveness 13 (3): 488–517. https://doi.org/10.1080/19345747.2019.1634169.\n\n\nBlumenthal, Mark. 2014. “Polls, Forecasts, and\nAggregators.” PS: Political Science & Politics 47\n(02): 297–300. https://doi.org/10.1017/s1049096514000055.\n\n\nBoland, Philip. 1984. “A Biographical Glimpse of William Sealy\nGosset.” The American Statistician 38 (3): 179–83. https://doi.org/10.2307/2683648.\n\n\nBolker, Ben, and David Robinson. 2022. broom.mixed: Tidying Methods for Mixed\nModels. https://CRAN.R-project.org/package=broom.mixed.\n\n\nBolton, Ruth, and Randall Chapman. 1986. “Searching for Positive\nReturns at the Track.” Management Science 32 (August):\n1040–60. https://doi.org/10.1287/mnsc.32.8.1040.\n\n\nBombieri, Giulia, Vincenzo Penteriani, Kamran Almasieh, Hüseyin Ambarlı,\nMohammad Reza Ashrafzadeh, Chandan Surabhi Das, Nishith Dharaiya, et al.\n2023. “A Worldwide Perspective on Large Carnivore Attacks on\nHumans.” PLOS Biology 21 (1): e3001946. https://doi.org/10.1371/journal.pbio.3001946.\n\n\nBor, Jacob, Atheendar Venkataramani, David Williams, and Alexander Tsai.\n2018. “Police Killings and Their Spillover Effects on the Mental\nHealth of Black Americans: A Population-Based, Quasi-Experimental\nStudy.” The Lancet 392 (10144): 302–10. https://doi.org/10.1016/s0140-6736(18)31130-9.\n\n\nBorer, Elizabeth T., Eric W. Seabloom, Matthew B. Jones, and Mark\nSchildhauer. 2009. “Some Simple Guidelines for Effective Data\nManagement.” Bulletin of the Ecological Society of\nAmerica 90 (2): 205–14. https://doi.org/10.1890/0012-9623-90.2.205.\n\n\nBorghi, John, and Ana Van Gulick. 2022. 
“Promoting Open Science\nThrough Research Data Management.” Harvard Data Science\nReview 4 (3). https://doi.org/10.1162/99608f92.9497f68e.\n\n\nBorkin, Michelle, Zoya Bylinskii, Nam Wook Kim, Constance May\nBainbridge, Chelsea Yeh, Daniel Borkin, Hanspeter Pfister, and Aude\nOliva. 2015. “Beyond Memorability: Visualization Recognition and\nRecall.” IEEE Transactions on Visualization and Computer\nGraphics 22 (1): 519–28. https://doi.org/10.1109/TVCG.2015.2467732.\n\n\nBosch, Oriol, and Melanie Revilla. 2022. “When survey science met web tracking: Presenting an error\nframework for metered data.” Journal of the Royal\nStatistical Society: Series A (Statistics in Society), November,\n1–29. https://doi.org/10.1111/rssa.12956.\n\n\nBouguen, Adrien, Yue Huang, Michael Kremer, and Edward Miguel. 2019.\n“Using Randomized Controlled Trials to Estimate Long-Run Impacts\nin Development Economics.” Annual Review of Economics 11\n(1): 523–61. https://doi.org/10.1146/annurev-economics-080218-030333.\n\n\nBouie, Jamelle. 2022. “We Still Can’t See American Slavery for\nWhat It Was.” The New York Times, January. https://www.nytimes.com/2022/01/28/opinion/slavery-voyages-data-sets.html.\n\n\nBowen, Claire McKay. 2022. Protecting Your\nPrivacy in a Data-Driven World. 1st ed. Chapman; Hall/CRC.\nhttps://doi.org/10.1201/9781003122043.\n\n\nBowers, Jake, and Maarten Voors. 2016. “How to Improve Your\nRelationship with Your Future Self.” Revista de Ciencia\nPolı́tica 36 (3): 829–48. https://doi.org/10.4067/S0718-090X2016000300011.\n\n\nBowley, Arthur Lyon. 1901. Elements of Statistics. London: P.\nS. King.\n\n\n———. 1913. “Working-Class Households in Reading.”\nJournal of the Royal Statistical Society 76 (7): 672–701. https://doi.org/10.2307/2339708.\n\n\nBox, George E. P. 1976. “Science and Statistics.”\nJournal of the American Statistical Association 71 (356):\n791–99. https://doi.org/10.1080/01621459.1976.10480949.\n\n\nBoykis, Vicki. 2019. “A Deep Dive on Python Type Hints,”\nJuly. 
https://vickiboykis.com/2019/07/08/a-deep-dive-on-python-type-hints/.\n\n\nBoysel, Sam, and Davis Vaughan. 2021. fredr: An\nR Client for the “FRED” API. https://CRAN.R-project.org/package=fredr.\n\n\nBradley, Valerie, Shiro Kuriwaki, Michael Isakov, Dino Sejdinovic,\nXiao-Li Meng, and Seth Flaxman. 2021. “Unrepresentative Big\nSurveys Significantly Overestimated US Vaccine\nUptake.” Nature 600 (7890): 695–700. https://doi.org/10.1038/s41586-021-04198-4.\n\n\nBraginsky, Mika. 2020. wordbankr: Accessing the\nWordbank Database. https://CRAN.R-project.org/package=wordbankr.\n\n\nBrandt, Allan. 1978. “Racism and Research: The Case of the\nTuskegee Syphilis Study.” Hastings Center Report, 21–29.\nhttps://doi.org/10.2307/3561468.\n\n\nBreiman, Leo. 1994. “The 1991 Census Adjustment: Undercount or Bad\nData?” Statistical Science 9 (4). https://doi.org/10.1214/ss/1177010259.\n\n\n———. 2001. “Statistical Modeling: The Two Cultures.”\nStatistical Science 16 (3): 199–231. https://doi.org/10.1214/ss/1009213726.\n\n\nBremer, Nadieh, and Shirley Wu. 2021. Data Sketches. A K\nPeters/CRC Press. https://doi.org/10.1201/9780429445019.\n\n\nBrewer, Cynthia. 2015. Designing Better Maps: A Guide for GIS\nUsers. 2nd ed.\n\n\nBrewer, Ken. 2013. “Three Controversies in the History of Survey\nSampling.” Survey Methodology 39 (2): 249–63.\n\n\nBreznau, Nate, Eike Mark Rinke, Alexander Wuttke, Hung HV Nguyen, Muna\nAdem, Jule Adriaans, Amalia Alvarez-Benjumea, et al. 2022.\n“Observing Many Researchers Using the Same Data and Hypothesis\nReveals a Hidden Universe of Uncertainty.” Proceedings of the\nNational Academy of Sciences 119 (44): e2203150119. https://doi.org/10.1073/pnas.2203150119.\n\n\nBriggs, Ryan. 2021. “Why Does Aid Not Target the Poorest?”\nInternational Studies Quarterly 65 (3): 739–52. https://doi.org/10.1093/isq/sqab035.\n\n\nBrodeur, Abel, Nikolai Cook, and Anthony Heyes. 2020. 
“Methods Matter: p-Hacking and Publication Bias in Causal\nAnalysis in Economics.” American Economic Review\n110 (11): 3634–60. https://doi.org/10.1257/aer.20190687.\n\n\nBrokowski, Carolyn, and Mazhar Adli. 2019. “CRISPR Ethics: Moral\nConsiderations for Applications of a Powerful Tool.” Journal\nof Molecular Biology 431 (1): 88–101. https://doi.org/10.1016/j.jmb.2018.05.044.\n\n\nBronner, Laura. 2020. “Why Statistics Don’t Capture the Full\nExtent of the Systemic Bias in Policing.”\nFiveThirtyEight, June. https://fivethirtyeight.com/features/why-statistics-dont-capture-the-full-extent-of-the-systemic-bias-in-policing/.\n\n\n———. 2021. “Quantitative Editing.” YouTube, June.\nhttps://youtu.be/LI5m9RzJgWc.\n\n\nBrontë, Charlotte. 1847. Jane Eyre. https://www.gutenberg.org/files/1260/1260-h/1260-h.htm.\n\n\n———. 1857. The Professor. https://www.gutenberg.org/files/1028/1028-h/1028-h.htm.\n\n\nBrook, Robert, John Ware, William Rogers, Emmett Keeler, Allyson Ross\nDavies, Cathy Sherbourne, George Goldberg, Kathleen Lohr, Patricia Camp,\nand Joseph Newhouse. 1984. “The Effect of Coinsurance on the\nHealth of Adults: Results from the RAND Health Insurance\nExperiment.” https://www.rand.org/pubs/reports/R3055.html.\n\n\nBrown, Zack. 2018. “A Git Origin Story.” Linux\nJournal, July. https://www.linuxjournal.com/content/git-origin-story.\n\n\nBryan, Jenny. 2015. “Naming Things.” Reproducible\nScience Workshop, May. https://speakerdeck.com/jennybc/how-to-name-files.\n\n\n———. 2018a. “Excuse Me, Do You Have a Moment to Talk about Version\nControl?” The American Statistician 72 (1): 20–27. https://doi.org/10.1080/00031305.2017.1399928.\n\n\n———. 2018b. “Code Smells and Feels.” YouTube,\nJuly. https://youtu.be/7oyiPBjLAWY.\n\n\n———. 2020. Happy Git and GitHub for the\nuseR. https://happygitwithr.com.\n\n\nBryan, Jenny, and Jim Hester. 2020. What They\nForgot to Teach You About R. 
https://rstats.wtf/index.html.\n\n\nBryan, Jenny, Jim Hester, David Robinson, Hadley Wickham, and Christophe\nDervieux. 2022. reprex: Prepare Reproducible\nExample Code via the Clipboard. https://CRAN.R-project.org/package=reprex.\n\n\nBryan, Jenny, and Hadley Wickham. 2021. gh:\nGitHub API. https://CRAN.R-project.org/package=gh.\n\n\nBuckheit, Jonathan, and David Donoho. 1995. “Wavelab and\nReproducible Research.” In Wavelets and Statistics,\n55–81. Springer. https://doi.org/10.1007/978-1-4612-2544-7_5.\n\n\nBueno de Mesquita, Ethan, and Anthony Fowler. 2021. Thinking Clearly\nwith Data: A Guide to Quantitative Reasoning and Analysis. New\nJersey: Princeton University Press.\n\n\nBuhr, Ray. 2017. Using R as a Production\nMachine Learning Language (Part I). https://raybuhr.github.io/blog/posts/making-predictions-over-http/.\n\n\nBuja, Andreas, Dianne Cook, Heike Hofmann, Michael Lawrence, Eun-Kyung\nLee, Deborah F. Swayne, and Hadley Wickham. 2009. “Statistical\nInference for Exploratory Data Analysis and Model Diagnostics.”\nPhilosophical Transactions of the Royal Society A:\nMathematical, Physical and Engineering Sciences 367 (1906):\n4361–83. https://doi.org/10.1098/rsta.2009.0120.\n\n\nBuja, Andreas, Dianne Cook, and Deborah Swayne. 1996. “Interactive\nHigh-Dimensional Data Visualization.” Journal of\nComputational and Graphical Statistics 5 (1): 78–99. https://doi.org/10.2307/1390754.\n\n\nBuneman, Peter, Sanjeev Khanna, and Tan Wang-Chiew. 2001. “Why and\nWhere: A Characterization of Data Provenance.” In Database\nTheory ICDT 2001, 316–30. Springer\nBerlin Heidelberg. https://doi.org/10.1007/3-540-44503-x_20.\n\n\nBuolamwini, Joy, and Timnit Gebru. 2018. “Gender Shades:\nIntersectional Accuracy Disparities in Commercial Gender\nClassification.” In Conference on Fairness, Accountability\nand Transparency, 77–91.\n\n\nBurch, Tyler James. 2023. “2023 NHL Playoff\nPredictions,” April. 
https://tylerjamesburch.com/blog/misc/nhl-predictions.\n\n\nBurton, Jason, Nicole Cruz, and Ulrike Hahn. 2021. “Reconsidering\nEvidence of Moral Contagion in Online Social Networks.”\nNature Human Behaviour 5 (12): 1629–35. https://doi.org/10.1038/s41562-021-01133-5.\n\n\nBush, Vannevar. 1945. “As We May Think.” The Atlantic\nMonthly, July. https://www.theatlantic.com/magazine/archive/1945/07/as-we-may-think/303881/.\n\n\nByrd, James Brian, Anna Greene, Deepashree Venkatesh Prasad, Xiaoqian\nJiang, and Casey Greene. 2020. “Responsible, Practical Genomic\nData Sharing That Accelerates Research.” Nature Reviews\nGenetics 21 (10): 615–29. https://doi.org/10.1038/s41576-020-0257-5.\n\n\nCahill, Niamh, Michelle Weinberger, and Leontine Alkema. 2020.\n“What Increase in Modern Contraceptive Use Is Needed in FP2020\nCountries to Reach 75% Demand Satisfied by 2030? An Assessment Using the\nAccelerated Transition Method and Family Planning Estimation\nModel.” Gates Open Research 4. https://doi.org/10.12688/gatesopenres.13125.1.\n\n\nCalonico, Sebastian, Matias Cattaneo, Max Farrell, and Rocio Titiunik.\n2021. rdrobust: Robust Data-Driven Statistical\nInference in Regression-Discontinuity Designs. https://CRAN.R-project.org/package=rdrobust.\n\n\nCambon, Jesse, and Christopher Belanger. 2021. “tidygeocoder: Geocoding Made Easy.” Zenodo.\nhttps://doi.org/10.5281/zenodo.3981510.\n\n\nCanty, Angelo, and B. D. Ripley. 2021. boot:\nBootstrap R (S-Plus) Functions.\n\n\nCardoso, Tom. 2020. “Bias behind bars: A\nGlobe investigation finds a prison system stacked against Black and\nIndigenous inmates.” The Globe and Mail, October.\nhttps://www.theglobeandmail.com/canada/article-investigation-racial-bias-in-canadian-prison-risk-assessments/.\n\n\nCarl, Sebastian, Ben Baldwin, Lee Sharpe, Tan Ho, and John Edwards.\n2023. Nflverse: Easily Install and Load the ’Nflverse’. https://CRAN.R-project.org/package=nflverse.\n\n\nCarleton, Chris. 2021. 
“wccarleton/conflict-europe: Acce.” Zenodo.\nhttps://doi.org/10.5281/zenodo.4550688.\n\n\nCarleton, Chris, Dave Campbell, and Mark Collard. 2021. “A\nReassessment of the Impact of Temperature Change on European Conflict\nDuring the Second Millennium CE Using a Bespoke Bayesian Time-Series\nModel.” Climatic Change 165 (1): 1–16. https://doi.org/10.1007/s10584-021-03022-2.\n\n\nCaro, Robert. 2019. Working. 1st ed. New York: Knopf.\n\n\nCarpenter, Christopher, and Carlos Dobkin. 2014. “Replication data for: The Minimum Legal Drinking Age and\nCrime.” https://doi.org/10.7910/DVN/27070.\n\n\n———. 2015. “The Minimum Legal Drinking Age\nand Crime.” The Review of Economics and\nStatistics 97 (2): 521–24. https://doi.org/10.1162/REST_a_00489.\n\n\nCarroll, Lewis. 1871. Through the Looking-Glass. Macmillan. https://www.gutenberg.org/files/12/12-h/12-h.htm.\n\n\nCastro, Marcia, Susie Gurzenda, Cassio Turra, Sun Kim, Theresa\nAndrasfay, and Noreen Goldman. 2023. “Research Note:\nCOVID-19 Is Not an Independent Cause of Death.”\nDemography, February. https://doi.org/10.1215/00703370-10575276.\n\n\nCaughey, Devin, and Jasjeet Sekhon. 2011. “Elections and the Regression Discontinuity Design:\nLessons from Close U.S. House Races, 1942–2008.”\nPolitical Analysis 19 (4): 385–408. https://doi.org/10.1093/pan/mpr032.\n\n\nChamberlain, Scott, Hadley Wickham, Winston Chang, and Mauricio Vargas.\n2022. Analogsea: Interface to “Digital Ocean”. https://CRAN.R-project.org/package=analogsea.\n\n\nChamberlin, Donald. 2012. “Early History of\nSQL.” IEEE Annals of the History of\nComputing 34 (4): 78–82. https://doi.org/10.1109/mahc.2012.61.\n\n\nChambliss, Daniel. 1989. “The Mundanity of Excellence: An\nEthnographic Report on Stratification and Olympic Swimmers.”\nSociological Theory 7 (1): 70–86. https://doi.org/10.2307/202063.\n\n\nChambru, Cédric, and Paul Maneuvrier-Hervieu. 2022. 
“Introducing HiSCoD: A new gateway for the study of\nhistorical social conflict.” Working Paper Series,\nDepartment of Economics, University of Zurich. https://doi.org/10.5167/uzh-217109.\n\n\nChan, Duo. 2021. “Combining Statistical, Physical, and Historical\nEvidence to Improve Historical Sea-Surface Temperature Records.”\nHarvard Data Science Review 3 (1). https://doi.org/10.1162/99608f92.edcee38f.\n\n\nChang, Winston, Joe Cheng, JJ Allaire, Carson Sievert, Barret Schloerke,\nYihui Xie, Jeff Allen, Jonathan McPherson, Alan Dipert, and Barbara\nBorges. 2021. shiny: Web Application Framework\nfor R. https://CRAN.R-project.org/package=shiny.\n\n\nChase, William. 2020. “The Glamour of Graphics.”\nRStudio Conference, January. https://posit.co/resources/videos/the-glamour-of-graphics/.\n\n\nChawla, Dalmeet Singh. 2020. “Critiqued Coronavirus Simulation\nGets Thumbs up from Code-Checking Efforts.” Nature 582:\n323–24. https://doi.org/10.1038/d41586-020-01685-y.\n\n\nChellel, Kit. 2018. “The Gambler Who Cracked the Horse-Racing\nCode.” Bloomberg Businessweek, May. https://www.bloomberg.com/news/features/2018-05-03/the-gambler-who-cracked-the-horse-racing-code.\n\n\nChen, Heng, Marie-Hélène Felt, and Christopher Henry. 2018. “2017\nMethods-of-Payment Survey: Sample Calibration and Variance\nEstimation.” Bank of Canada. https://doi.org/10.34989/tr-114.\n\n\nChen, Wei, Xilu Chen, Chang-Tai Hsieh, and Zheng Song. 2019. “A\nForensic Examination of China’s National Accounts.” Brookings\nPapers on Economic Activity, 77–127. https://www.jstor.org/stable/26798817.\n\n\nChen, Weijun, Yan Qi, Yuwen Zhang, Christina Brown, Akos Lada, and\nHarivardan Jayaraman. 2022. “Notifications: Why Less Is\nMore,” December. https://medium.com/@AnalyticsAtMeta/notifications-why-less-is-more-how-facebook-has-been-increasing-both-user-satisfaction-and-app-9463f7325e7d.\n\n\nCheng, Joe, Bhaskar Karambelkar, and Yihui Xie. 2021. 
leaflet: Create Interactive Web Maps with the JavaScript\n“Leaflet” Library. https://CRAN.R-project.org/package=leaflet.\n\n\nCheriet, Mohamed, Nawwaf Kharma, Cheng-Lin Liu, and Ching Suen. 2007.\nCharacter Recognition Systems: A Guide for Students and\nPractitioners. Wiley.\n\n\nChouldechova, Alexandra, Diana Benavides-Prado, Oleksandr Fialko, and\nRhema Vaithianathan. 2018. “A Case Study of Algorithm-Assisted\nDecision Making in Child Maltreatment Hotline Screening\nDecisions.” In Proceedings of the 1st Conference on Fairness,\nAccountability and Transparency, edited by Sorelle Friedler and\nChristo Wilson, 81:134–48. Proceedings of Machine Learning Research. https://proceedings.mlr.press/v81/chouldechova18a.html.\n\n\nChrétien, Jean. 2007. My Years as Prime Minister. 1st ed.\nToronto: Knopf Canada.\n\n\nChristensen, Garret, Allan Dafoe, Edward Miguel, Don Moore, and Andrew\nRose. 2019. “A Study of the Impact of Data Sharing on Article\nCitations Using Journal Policies as a Natural Experiment.”\nPLOS ONE 14 (12): e0225883. https://doi.org/10.1371/journal.pone.0225883.\n\n\nChristensen, Garret, Jeremy Freese, and Edward Miguel. 2019.\nTransparent and Reproducible Social Science Research.\nCalifornia: University of California Press.\n\n\nChristian, Brian. 2012. “The A/B Test: Inside\nthe Technology That’s Changing the Rules of Business.”\nWired, April. https://www.wired.com/2012/04/ff-abtesting/.\n\n\nCirone, Alexandra, and Arthur Spirling. 2021. “Turning History\ninto Data: Data Collection, Measurement, and Inference in HPE.”\nJournal of Historical Political Economy 1 (1): 127–54. https://doi.org/10.1561/115.00000005.\n\n\nCity of Toronto. 2021. 2021 Street Needs Assessment. https://www.toronto.ca/city-government/data-research-maps/research-reports/housing-and-homelessness-research-and-reports/.\n\n\nCleveland, William. (1985) 1994. The Elements of Graphing Data.\n2nd ed. New Jersey: Hobart Press.\n\n\nClinton, Joshua, John Lapinski, and Marc Trussler. 
2022.\n“Reluctant Republicans, Eager Democrats?” Public\nOpinion Quarterly 86 (2): 247–69. https://doi.org/10.1093/poq/nfac011.\n\n\nCohen, Glenn, and Michelle Mello. 2018. “HIPAA and\nProtecting Health Information in the 21st Century.”\nJAMA 320 (3): 231. https://doi.org/10.1001/jama.2018.5630.\n\n\nCohen, Jason, Steven Teleki, and Eric Brown. 2006. Best Kept Secrets\nof Peer Code Review. Smart Bear Incorporated.\n\n\nCohn, Alain. 2019. “Data and code for: Civic\nHonesty Around the Globe.” Harvard Dataverse. https://doi.org/10.7910/dvn/ykbodn.\n\n\nCohn, Alain, Michel André Maréchal, David Tannenbaum, and Christian\nLukas Zünd. 2019a. “Civic Honesty Around the Globe.”\nScience 365 (6448): 70–73. https://doi.org/10.1126/science.aau8712.\n\n\n———. 2019b. “Supplementary Materials for: Civic Honesty Around the\nGlobe.” Science 365 (6448): 70–73.\n\n\nCohn, Nate. 2016. “We Gave Four Good Pollsters the Same Raw Data.\nThey Had Four Different Results.” The New York Times,\nSeptember. https://www.nytimes.com/interactive/2016/09/20/upshot/the-error-the-polling-world-rarely-talks-about.html.\n\n\nCollins, Annie, and Rohan Alexander. 2022. “Reproducibility of\nCOVID-19 Pre-Prints.” Scientometrics 127: 4655–73. https://doi.org/10.1007/s11192-022-04418-2.\n\n\nColombo, Tommaso, Holger Fröning, Pedro Javier García, and Wainer\nVandelli. 2016. “Optimizing the Data-Collection Time of a\nLarge-Scale Data-Acquisition System Through a Simulation\nFramework.” The Journal of Supercomputing 72 (12):\n4546–72. https://doi.org/10.1007/s11227-016-1764-1.\n\n\nComer, Benjamin P., and Jason R. Ingram. 2022. “Comparing Fatal\nEncounters, Mapping Police Violence, and Washington Post Fatal Police\nShooting Data from 2015-2019: A Research Note.” Criminal\nJustice Review, January, 073401682110710. https://doi.org/10.1177/07340168211071014.\n\n\nCongelio, Bradley. 2024. Introduction to NFL\nAnalytics with R. 1st ed. Chapman; Hall/CRC. 
https://bradcongelio.com/nfl-analytics-with-r-book/.\n\n\nCook, Dianne, Andreas Buja, Javier Cabrera, and Catherine Hurley. 1995.\n“Grand Tour and Projection\nPursuit.” Journal of Computational and Graphical\nStatistics 4 (3): 155–72. https://doi.org/10.1080/10618600.1995.10474674.\n\n\nCook, Dianne, Nancy Reid, and Emi Tanaka. 2021. “The Foundation Is\nAvailable for Thinking about Data Visualization Inferentially.”\nHarvard Data Science Review 3 (3). https://doi.org/10.1162/99608f92.8453435d.\n\n\nCook, Dianne, and Deborah Swayne. 2007. Interactive and Dynamic Graphics for Data Analysis: With\nR and GGobi. 1st ed. Springer.\n\n\nCooley, David. 2020. mapdeck: Interactive Maps\nUsing “Mapbox GL JS” and\n“Deck.gl”. https://CRAN.R-project.org/package=mapdeck.\n\n\nCouncil of European Union. 2016. “General Data Protection\nRegulation 2016/679.” https://eur-lex.europa.eu/eli/reg/2016/679/oj.\n\n\nCowen, Tyler. 2021. “Episode 132: Amia Srinivasan on Utopian\nFeminism.” Conversations with Tyler, September. https://conversationswithtyler.com/episodes/amia-srinivasan/.\n\n\n———. 2023. “Episode 168: Katherine Rundell on the Art of\nWords.” Conversations with Tyler, January. https://conversationswithtyler.com/episodes/katherine-rundell/.\n\n\nCox, David. 2018. “In Gentle Praise of Significance Tests.”\nYouTube, October. https://youtu.be/txLj%5FP9UlCQ.\n\n\nCox, David, and Nancy Reid. 1987. “Parameter Orthogonality and\nApproximate Conditional Inference.” Journal of the Royal\nStatistical Society: Series B (Methodological) 49 (1): 1–18. https://doi.org/10.1111/j.2517-6161.1987.tb01422.x.\n\n\nCox, Murray. 2021. “Inside Airbnb—Toronto\nData.” http://insideairbnb.com/get-the-data.html.\n\n\nCoyle, Edward, Andrew Coggan, Mari Hopper, and Thomas Walters. 1988.\n“Determinants of Endurance in Well-Trained\nCyclists.” Journal of Applied Physiology 64 (6):\n2622–30. https://doi.org/10.1152/jappl.1988.64.6.2622.\n\n\nCraiu, Radu. 2019. 
“The Hiring Gambit: In Search of the Twofer\nData Scientist.” Harvard Data Science Review 1 (1). https://doi.org/10.1162/99608f92.440445cb.\n\n\nCramer, Jan Salomon. 2003. “The Origins of Logistic\nRegression.” SSRN Electronic Journal. https://doi.org/10.2139/ssrn.360300.\n\n\nCrane, Nicola, Stephanie Hazlitt, and Apache Arrow. 2023.\nApache Arrow R Cookbook. https://arrow.apache.org/cookbook/r/.\n\n\nCrawford, Kate. 2021. Atlas of AI.\n1st ed. New Haven: Yale University Press.\n\n\nCrosby, Alfred. 1997. The Measure of Reality: Quantification in\nWestern Europe, 1250-1600. Cambridge: Cambridge University Press.\n\n\nCsárdi, Gábor. 2022. gitcreds: Query\n“git” Credentials from “R”. https://CRAN.R-project.org/package=gitcreds.\n\n\nCsárdi, Gábor, Jim Hester, Hadley Wickham, Winston Chang, Martin Morgan,\nand Dan Tenenbaum. 2021. remotes: R Package\nInstallation from Remote Repositories, Including\n“GitHub”. https://CRAN.R-project.org/package=remotes.\n\n\nCummins, Neil. 2022. “The Hidden Wealth of English Dynasties,\n1892–2016.” The Economic History Review 75 (3): 667–702.\nhttps://doi.org/10.1111/ehr.13120.\n\n\nCunningham, Scott. 2021. Causal Inference: The Mixtape. 1st ed.\nNew Haven: Yale University Press. https://mixtape.scunning.com.\n\n\nD’Ignazio, Catherine, and Lauren Klein. 2020. Data Feminism.\nMassachusetts: The MIT Press. https://data-feminism.mitpress.mit.edu.\n\n\nda Silva, Natalia, Dianne Cook, and Eun-Kyung Lee. 2023. “Interactive graphics for visually diagnosing forest\nclassifiers in R.” Computational Statistics,\nJanuary. https://doi.org/10.1007/s00180-023-01323-x.\n\n\nDagan, Noa, Noam Barda, Eldad Kepten, Oren Miron, Shay Perchik, Mark\nKatz, Miguel Hernán, Marc Lipsitch, Ben Reis, and Ran Balicer. 2021.\n“BNT162b2 mRNA Covid-19 Vaccine in a Nationwide Mass Vaccination\nSetting.” New England Journal of Medicine 384 (15):\n1412–23. https://doi.org/10.1056/NEJMoa2101765.\n\n\nDaston, Lorraine. 2000. 
“Why Statistics Tend Not Only to Describe\nthe World but to Change It.” London Review of Books 22\n(8). https://www.lrb.co.uk/the-paper/v22/n08/lorraine-daston/why-statistics-tend-not-only-to-describe-the-world-but-to-change-it.\n\n\nData and Justice Criminology Lab, Institute of Criminology and Criminal\nJustice, Carleton University; The Centre for Research & Innovation\nfor Black Survivors of Homicide Victims (The CRIB), at the\nFactor-Inwentash Faculty of Social Work, University of Toronto; Canadian\nCivil Liberties Association; Ethics and Technology Lab, Queen’s\nUniversity. 2022. “Tracking (in)justice: A Living Data Set\nTracking Canadian Police-Involved Deaths.” https://trackinginjustice.ca.\n\n\nDattani, Saloni. 2024. “The Rise in Reported Maternal Mortality\nRates in the US Is Largely Due to a Change in Measurement.”\nOur World in Data.\n\n\nDavidson, Thomas, Debasmita Bhattacharya, and Ingmar Weber. 2019.\n“Racial Bias in Hate Speech and Abusive Language Detection\nDatasets.” In Proceedings of the Third Workshop on Abusive\nLanguage Online, 25–35.\n\n\nDavies, Neil M., Gibran Hemani, Jenae M. Neiderhiser, Hilary C. Martin,\nMelinda C. Mills, Peter M. Visscher, Loïc Yengo, Alexander Strudwick\nYoung, and Matthew C. Keller. 2024. “The Importance of\nFamily-Based Sampling for Biobanks.” Nature 634 (8035):\n795–803. https://doi.org/10.1038/s41586-024-07721-5.\n\n\nDavies, Rhian, Steph Locke, and Lucy D’Agostino McGowan. 2022. datasauRus: Datasets from the Datasaurus\nDozen. https://CRAN.R-project.org/package=datasauRus.\n\n\nDavis, Darren. 1997. “Nonrandom Measurement Error and Race of\nInterviewer Effects Among African Americans.” The Public\nOpinion Quarterly 61 (1): 183–207. https://doi.org/10.1086/297792.\n\n\nDavison, A. C., and D. V. Hinkley. 1997. Bootstrap Methods and Their\nApplications. Cambridge: Cambridge University Press. http://statwww.epfl.ch/davison/BMA/.\n\n\nDe Jonge, Edwin, and Mark van der Loo. 2013. 
An\nintroduction to data cleaning with R. Statistics Netherlands\nHeerlen. https://cran.r-project.org/doc/contrib/de%5FJonge+van%5Fder%5FLoo-Introduction%5Fto%5Fdata%5Fcleaning%5Fwith%5FR.pdf.\n\n\nDean, Natalie. 2022. “Tracking COVID-19 Infections:\nTime for Change.” Nature 602 (7896): 185. https://doi.org/10.1038/d41586-022-00336-8.\n\n\nDeaton, Angus. 2010. “Instruments, Randomization, and Learning\nabout Development.” Journal of Economic Literature 48\n(2): 424–55. https://doi.org/10.1257/jel.48.2.424.\n\n\nDenby, Lorraine, and Colin Mallows. 2009. “Variations on the\nHistogram.” Journal of Computational and Graphical\nStatistics 18 (1): 21–31. https://doi.org/10.1198/jcgs.2009.0002.\n\n\nDeWitt, Helen. 2000. The Last Samurai. 1st ed. United States:\nTalk Miramax Books.\n\n\nDillman, Don, Jolene Smyth, and Leah Christian. (1978) 2014.\nInternet, Phone, Mail, and Mixed-Mode Surveys: The Tailored Design\nMethod. 4th ed. Wiley.\n\n\nDoggers, Peter. 2021. “Carlsen Wins Game 6, Longest World Chess\nChampionship Game of All Time,” December. https://www.chess.com/news/view/fide-world-chess-championship-2021-game-6.\n\n\nDolatsara, Hamidreza Ahady, Ying-Ju Chen, Robert Leonard, Fadel Megahed,\nand Allison Jones-Farmer. 2021. “Explaining Predictive Model\nPerformance: An Experimental Study of Data Preparation and Model\nChoice.” Big Data, October. https://doi.org/10.1089/big.2021.0067.\n\n\nDoll, Richard, and Bradford Hill. 1950. “Smoking and Carcinoma of\nthe Lung.” British Medical Journal 2 (4682): 739–48. https://doi.org/10.1136/bmj.2.4682.739.\n\n\nDruckman, James, and Donald Green. 2021. “A New Era of\nExperimental Political Science.” In Advances in Experimental\nPolitical Science, 1–16. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781108777919.002.\n\n\nDu, Kai, Steven Huddart, and Xin Daniel Jiang. 2022. 
“Lost in\nStandardization: Effects of Financial Statement Database Discrepancies\non Inference.” Journal of Accounting and Economics,\nDecember, 101573. https://doi.org/10.1016/j.jacceco.2022.101573.\n\n\nDuflo, Esther. 2020. “Field Experiments and the Practice of\nPolicy.” American Economic Review 110 (7): 1952–73. https://doi.org/10.1257/aer.110.7.1952.\n\n\nDwork, Cynthia, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006.\n“Calibrating Noise to Sensitivity in Private Data\nAnalysis.” In Theory of Cryptography Conference, 265–84.\nSpringer. https://doi.org/10.1007/11681878_14.\n\n\nDwork, Cynthia, and Aaron Roth. 2013. “The Algorithmic Foundations\nof Differential Privacy.” Foundations and Trends in\nTheoretical Computer Science 9 (3-4): 211–407. https://doi.org/10.1561/0400000042.\n\n\nEdelman, Murray, Liberty Vittert, and Xiao-Li Meng. 2021. “An\nInterview with Murray Edelman on the History of the Exit Poll.”\nHarvard Data Science Review 3 (1). https://doi.org/10.1162/99608f92.3a25cd24.\n\n\nEdgeworth, Francis Ysidro. 1885. “Methods of Statistics.”\nJournal of the Statistical Society of London, 181–217.\n\n\nEdwards, Jonathan. 2017. “PACE team response\nshows a disregard for the principles of science.”\nJournal of Health Psychology 22 (9): 1155–58. https://doi.org/10.1177/1359105317700886.\n\n\nEfron, Bradley, and Carl Morris. 1977. “Stein’s Paradox in\nStatistics.” Scientific American 236 (May): 119–27. https://doi.org/10.1038/scientificamerican0577-119.\n\n\nEghbal, Nadia. 2020. Working in Public: The Making and Maintenance\nof Open Source Software. California: Stripe Press.\n\n\nEisenstein, Michael. 2022. “Need Web Data? Here’s How to Harvest\nThem.” Nature 607: 200–201. https://doi.org/10.1038/d41586-022-01830-9.\n\n\nElliott, Michael, Brady West, Xinyu Zhang, and Stephanie Coffey. 2022.\n“The Anchoring Method: Estimation of Interviewer Effects in the\nAbsence of Interpenetrated Sample Assignment.” Survey\nMethodology 48 (1): 25–48. 
http://www.statcan.gc.ca/pub/12-001-x/2022001/article/00005-eng.htm.\n\n\nElson, Malte. 2018. “Question Wording and Item\nFormulation.” https://doi.org/10.31234/osf.io/e4ktc.\n\n\nEnns, Peter, and Jake Rothschild. 2022. “Do You Know Where Your\nSurvey Data Come From?” May. https://medium.com/3streams/surveys-3ec95995dde2.\n\n\nFarrugia, Patricia, Bradley Petrisor, Forough Farrokhyar, and Mohit\nBhandari. 2010. “Research Questions, Hypotheses and\nObjectives.” Canadian Journal of Surgery 53 (4): 278.\n\n\nFeldman, Gilad. 2024. RRR Assessment Peer Review.\nhttps://mgto.org/rrrassessmentreviewtemplate.\n\n\nFinkelstein, Amy, Sarah Taubman, Bill Wright, Mira Bernstein, Jonathan\nGruber, Joseph Newhouse, Heidi Allen, Katherine Baicker, and Oregon\nHealth Study Group. 2012. “The Oregon Health Insurance Experiment:\nEvidence from the First Year.” The Quarterly Journal of\nEconomics 127 (3): 1057–1106. https://doi.org/10.1093/qje/qjs020.\n\n\nFirke, Sam. 2023. janitor: Simple Tools for\nExamining and Cleaning Dirty Data. https://CRAN.R-project.org/package=janitor.\n\n\nFisher, Ronald. (1925) 1928. Statistical Methods for Research\nWorkers. 2nd ed. London: Oliver; Boyd.\n\n\n———. (1935) 1949. The Design of Experiments. 5th ed. London:\nOliver; Boyd.\n\n\nFiske, Susan, and Shiro Kuriwaki. 2021. “Words to the Wise on\nWriting Scientific Papers,” November. https://doi.org/10.31234/osf.io/n32qw.\n\n\nFitts, Alexis Sobel. 2014. “The King of Content: How Upworthy Aims\nto Alter the Web, and Could End up Altering the World.”\nColumbia Journalism Review 53: 34–38. https://archives.cjr.org/feature/the%5Fking%5Fof%5Fcontent.php.\n\n\nFlake, Jessica, and Eiko Fried. 2020. “Measurement Schmeasurement:\nQuestionable Measurement Practices and How to Avoid Them.”\nAdvances in Methods and Practices in Psychological Science 3\n(4): 456–65. https://doi.org/10.1177/2515245920952393.\n\n\nFlynn, Michael. 2022. troopdata: Tools for\nAnalyzing Cross-National Military Deployment and Basing\nData. 
https://CRAN.R-project.org/package=troopdata.\n\n\nFord, Paul. 2015. “What Is Code?” Bloomberg\nBusinessweek, June. https://www.bloomberg.com/graphics/2015-paul-ford-what-is-code/.\n\n\nForster, Edward Morgan. 1927. Aspects of the Novel. London:\nEdward Arnold.\n\n\nFoster, Gordon. 1968. “Computers, Statistics and Planning: Systems\nor Chaos?” Geary Lecture. https://www.esri.ie/system/files/publications/GLS2.pdf.\n\n\nFourcade, Marion, and Kieran Healy. 2017. “Seeing Like a\nMarket.” Socio-Economic Review 15 (1): 9–29. https://doi.org/10.1093/ser/mww033.\n\n\nFowler, Martin, and Kent Beck. 2018. Refactoring: Improving the Design of Existing\nCode. 2nd ed. New York: Addison-Wesley Professional.\n\n\nFox, John, and Robert Andersen. 2006. “Effect Displays for\nMultinomial and Proportional-Odds Logit Models.” Sociological\nMethodology 36 (1): 225–55. https://doi.org/10.1111/j.1467-9531.2006.00180.\n\n\nFox, John, Sanford Weisberg, and Brad Price. 2022. carData:\nCompanion to Applied Regression Data Sets. https://CRAN.R-project.org/package=carData.\n\n\nFranconeri, Steven, Lace Padilla, Priti Shah, Jeffrey Zacks, and Jessica\nHullman. 2021. “The Science of Visual Data Communication: What\nWorks.” Psychological Science in the Public Interest 22\n(3): 110–61. https://doi.org/10.1177/15291006211051956.\n\n\nFrandell, Ashlee, Mary Feeney, Timothy Johnson, Eric Welch, Lesley\nMichalegko, and Heyjie Jung. 2021. “The Effects of Electronic\nAlert Letters for Internet Surveys of Academic Scientists.”\nScientometrics 126 (8): 7167–81. https://doi.org/10.1007/s11192-021-04029-3.\n\n\nFranklin, Laura. 2005. “Exploratory Experiments.”\nPhilosophy of Science 72 (5): 888–99. https://doi.org/10.1086/508117.\n\n\nFrei, Christoph, and Liam Welsh. 2022. “How\nthe Closure of a U.S. Tax Loophole May Affect Investor\nPortfolios.” Journal of Risk and Financial\nManagement 15 (5): 209. 
https://doi.org/10.3390/jrfm15050209.\n\n\nFrick, Hannah, Fanny Chow, Max Kuhn, Michael Mahoney, Julia Silge, and\nHadley Wickham. 2022. rsample: General\nResampling Infrastructure. https://CRAN.R-project.org/package=rsample.\n\n\nFried, Eiko, Jessica Flake, and Donald Robinaugh. 2022.\n“Revisiting the Theoretical and Methodological Foundations of\nDepression Measurement.” Nature Reviews Psychology 1\n(6): 358–68. https://doi.org/10.1038/s44159-022-00050-2.\n\n\nFriedman, Jerome, Robert Tibshirani, and Trevor Hastie. 2009. The\nElements of Statistical Learning. 2nd ed. Springer. https://hastie.su.domains/ElemStatLearn/.\n\n\nFriendly, Michael. 2021. HistData: Data Sets from the History of\nStatistics and Data Visualization. https://CRAN.R-project.org/package=HistData.\n\n\nFriendly, Michael, and Howard Wainer. 2021. A History of Data\nVisualization and Graphic Communication. 1st ed. Massachusetts:\nHarvard University Press.\n\n\nFry, Hannah. 2020. “Big Tech Is Testing You.” The New\nYorker, February, 61–65. https://www.newyorker.com/magazine/2020/03/02/big-tech-is-testing-you.\n\n\nFryzlewicz, Piotr. 2024. “Telling Stories\nwith Data: With Applications in R.” The American\nStatistician, April, 1–5. https://doi.org/10.1080/00031305.2024.2339562.\n\n\nFuller, Mark, and James Mosher. 1987. “Raptor Survey\nTechniques.” In Raptor Management Techniques Manual,\nedited by Beth Pendleton, Brian Millsap, Keith Cline, and David Bird,\n37–65. National Wildlife Federation. https://www.sandiegocounty.gov/content/dam/sdc/pds/ceqa/JVR/AdminRecord/IncorporatedByReference/Appendices/Appendix-D---Biological-Resources-Report/Fuller%20and%20Mosher%201987.pdf.\n\n\nFunkhouser, Gray. 1937. “Historical Development of the Graphical\nRepresentation of Statistical Data.” Osiris 3: 269–404.\nhttps://doi.org/10.1086/368480.\n\n\nGagolewski, Marek. 2022. “stringi:\nFast and Portable Character String Processing in\nR.” Journal of Statistical Software 103\n(2): 1–59. 
https://doi.org/10.18637/jss.v103.i02.\n\n\nGalef, Julia. 2020. “Episode 248: Are Democrats Being Irrational?\n(David Shor).” Rationally Speaking, December. http://rationallyspeakingpodcast.org/248-are-democrats-being-irrational-david-shor/.\n\n\nGao, Lucy, Jacob Bien, and Daniela Witten. 2022. “Selective\nInference for Hierarchical Clustering.” Journal of the\nAmerican Statistical Association, October, 1–11. https://doi.org/10.1080/01621459.2022.2116331.\n\n\nGao, Zheng, Christian Bird, and Earl T. Barr. 2017. “To Type or\nNot to Type: Quantifying Detectable Bugs in\nJavaScript.” In 2017\nIEEE/ACM 39th International Conference on\nSoftware Engineering (ICSE). IEEE. https://doi.org/10.1109/icse.2017.75.\n\n\nGarfinkel, Irwin, Lee Rainwater, and Timothy Smeeding. 2006. “A\nRe-Examination of Welfare States and Inequality in Rich Nations: How\nin-Kind Transfers and Indirect Taxes Change the Story.”\nJournal of Policy Analysis and Management 25 (4): 897–919. https://doi.org/10.1002/pam.20213.\n\n\nGargiulo, Maria. 2022. “Statistical Biases, Measurement\nChallenges, and Recommendations for Studying Patterns of Femicide in\nConflict.” Peace Review 34 (2): 163–76. https://doi.org/10.1080/10402659.2022.2049002.\n\n\nGarnier, Simon, Noam Ross, Robert Rudis, Antônio Camargo, Marco Sciaini,\nand Cédric Scherer. 2021. viridis –\nColorblind-Friendly Color Maps for R. https://doi.org/10.5281/zenodo.4679424.\n\n\nGazeley, Ursula, Georges Reniers, Hallie Eilerts-Spinelli, Julio Romero\nPrieto, Momodou Jasseh, Sammy Khagayi, and Veronique Filippi. 2022.\n“Women’s Risk of Death Beyond 42 Days Post Partum: A Pooled\nAnalysis of Longitudinal Health and Demographic Surveillance System Data\nin Sub-Saharan Africa.” The Lancet Global Health 10\n(11): e1582–89. https://doi.org/10.1016/s2214-109x(22)00339-4.\n\n\nGebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman\nVaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 
2021.\n“Datasheets for Datasets.” Communications of the\nACM 64 (12): 86–92. https://doi.org/10.1145/3458723.\n\n\nGelfand, Sharla. 2021. “Make a ReprEx... Please.”\nYouTube, February. https://youtu.be/G5Nm-GpmrLw.\n\n\n———. 2022a. Astrologer: Chani Nicholas Weekly Horoscopes\n(2013-2017). http://github.com/sharlagelfand/astrologer.\n\n\n———. 2022b. opendatatoronto: Access the City of\nToronto Open Data Portal. https://CRAN.R-project.org/package=opendatatoronto.\n\n\nGelman, Andrew. 2016. “What has happened down\nhere is the winds have changed,” September. https://statmodeling.stat.columbia.edu/2016/09/21/what-has-happened-down-here-is-the-winds-have-changed/.\n\n\n———. 2019. “Another Regression Discontinuity Disaster and What Can\nWe Learn from It,” June. https://statmodeling.stat.columbia.edu/2019/06/25/another-regression-discontinuity-disaster-and-what-can-we-learn-from-it/.\n\n\n———. 2020. “Statistical Models of Election Outcomes.”\nYouTube, August. https://youtu.be/7gjDnrbLQ4k.\n\n\nGelman, Andrew, John Carlin, Hal Stern, David Dunson, Aki Vehtari, and\nDonald Rubin. (1995) 2014. Bayesian Data Analysis. 3rd ed.\nChapman; Hall/CRC.\n\n\nGelman, Andrew, Sharad Goel, Douglas Rivers, and David Rothschild. 2016.\n“The Mythical Swing Voter.” Quarterly Journal of\nPolitical Science 11 (1): 103–30. https://doi.org/10.1561/100.00015031.\n\n\nGelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using\nRegression and Multilevel/Hierarchical Models. 1st ed. Cambridge\nUniversity Press.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and\nOther Stories. Cambridge University Press. https://avehtari.github.io/ROS-Examples/.\n\n\nGelman, Andrew, and Guido Imbens. 2019. “Why High-Order\nPolynomials Should Not Be Used in Regression Discontinuity\nDesigns.” Journal of Business & Economic Statistics\n37 (3): 447–56. https://doi.org/10.1080/07350015.2017.1366909.\n\n\nGelman, Andrew, and Eric Loken. 2013. 
“The Garden of Forking\nPaths: Why Multiple Comparisons Can Be a Problem, Even When There Is No\n‘Fishing Expedition’ or ‘p-Hacking’ and the\nResearch Hypothesis Was Posited Ahead of Time.” Department of\nStatistics, Columbia University. http://www.stat.columbia.edu/~gelman/research/unpublished/p%5Fhacking.pdf.\n\n\nGelman, Andrew, Greggor Mattson, and Daniel Simpson. 2018. “Gaydar\nand the Fallacy of Decontextualized Measurement.”\nSociological Science 5 (12): 270–80. https://doi.org/10.15195/v5.a12.\n\n\nGelman, Andrew, Cristian Pasarica, and Rahul Dodhia. 2002. “Let’s\nPractice What We Preach: Turning Tables into Graphs.” The\nAmerican Statistician 56 (2): 121–30. https://doi.org/10.1198/000313002317572790.\n\n\nGelman, Andrew, and Aki Vehtari. 2021. “What Are the Most\nImportant Statistical Ideas of the Past 50 Years?” Journal of\nthe American Statistical Association 116 (536): 2087–97. https://doi.org/10.1080/01621459.2021.1938081.\n\n\n———. 2023. Learn Statistics: Hundreds of Stories, Activities, and\nExamples.\n\n\nGelman, Andrew, Aki Vehtari, Daniel Simpson, Charles Margossian, Bob\nCarpenter, Yuling Yao, Lauren Kennedy, Jonah Gabry, Paul-Christian\nBürkner, and Martin Modrák. 2020. “Bayesian Workflow.”\narXiv. https://doi.org/10.48550/arXiv.2011.01808.\n\n\nGentemann, Chelle Leigh, Chris Holdgraf, Ryan Abernathey, Daniel\nCrichton, James Colliander, Edward Joseph Kearns, Yuvi Panda, and\nRichard Signell. 2021. “Science Storms the Cloud.”\nAGU Advances 2 (2). https://doi.org/10.1029/2020av000354.\n\n\nGerber, Alan, and Donald Green. 2012. Field Experiments: Design,\nAnalysis, and Interpretation. New York: WW Norton.\n\n\nGerring, John. 2012. “Mere Description.” British\nJournal of Political Science 42 (4): 721–46. https://doi.org/10.1017/s0007123412000130.\n\n\nGertler, Paul, Sebastian Martinez, Patrick Premand, Laura Rawlings, and\nChristel Vermeersch. 2016. Impact Evaluation in Practice. 2nd\ned. The World Bank. 
https://doi.org/10.1596/978-1-4648-0779-4.\n\n\nGeuenich, Michael, Jinyu Hou, Sunyun Lee, Shanza Ayub, Hartland Jackson,\nand Kieran Campbell. 2021a. “Automated Assignment of Cell Identity\nfrom Single-Cell Multiplexed Imaging and Proteomic Data.”\nCell Systems 12 (12): 1173–86. https://doi.org/10.1016/j.cels.2021.08.012.\n\n\n———. 2021b. “Replication Materials: \"Automated Assignment of Cell\nIdentity from Single-Cell Multiplexed Imaging and Proteomic\nData\".” https://doi.org/10.5281/ZENODO.5156049.\n\n\nGhitza, Yair, and Andrew Gelman. 2020. “Voter Registration\nDatabases and MRP: Toward the Use of Large-Scale Databases in Public\nOpinion Research.” Political Analysis 28 (4): 507–31. https://doi.org/10.1017/pan.2020.3.\n\n\nGibney, Elizabeth. 2022. “The leap second’s\ntime is up: world votes to stop pausing clocks.”\nNature 612 (7938): 18–18. https://doi.org/10.1038/d41586-022-03783-5.\n\n\nGleick, James. 1990. “The Census: Why We Can’t Count.”\nThe New York Times, July. https://www.nytimes.com/1990/07/15/magazine/the-census-why-we-can-t-count.html.\n\n\nGodfrey, Ernest. 1918. “History and Development of Statistics in\nCanada.” In The History of Statistics–Their Development and\nProgress in Many Countries. New York: Macmillan, edited by John\nKoren, 179–98. Macmillan Company of New York.\n\n\nGoodman, Leo. 1961. “Snowball Sampling.” The Annals of\nMathematical Statistics 32 (1): 148–70. https://doi.org/10.1214/aoms/1177705148.\n\n\nGoodrich, Ben, Jonah Gabry, Imad Ali, and Sam Brilleman. 2023.\n“rstanarm: Bayesian applied\nregression modeling via Stan.” https://mc-stan.org/rstanarm.\n\n\nGoogle. 2022. “What to Look for in a Code Review.” Google\nEngineering Practices Documentation. https://google.github.io/eng-practices/review/reviewer/looking-for.html.\n\n\nGordon, Brett, Robert Moakler, and Florian Zettelmeyer. 2022.\n“Close Enough? A Large-Scale Exploration of Non-Experimental\nApproaches to Advertising Measurement.” Marketing\nScience, November. 
https://doi.org/10.1287/mksc.2022.1413.\n\n\nGordon, Brett, Florian Zettelmeyer, Neha Bhargava, and Dan Chapsky.\n2019. “A Comparison of Approaches to Advertising Measurement:\nEvidence from Big Field Experiments at Facebook.” Marketing\nScience 38 (2): 193–225. https://doi.org/10.1287/mksc.2018.1135.\n\n\nGould, Elliot, Hannah Fraser, Timothy Parker, Shinichi Nakagawa, Simon\nGriffith, Peter Vesk, and Fiona Fidler. 2023. “Same Data,\nDifferent Analysts: Variation in Effect Sizes Due to Analytical\nDecisions in Ecology and Evolutionary Biology,” October. https://doi.org/10.32942/x2gg62.\n\n\nGraham, Paul. 2020. “How to Write Usefully,” February. http://paulgraham.com/useful.html.\n\n\nGray, Charles T., and Ben Marwick. 2019. “Truth, Proof, and\nReproducibility: There’s No Counter-Attack for the Codeless.” In\nCommunications in Computer and Information Science, 111–29.\nSpringer Singapore. https://doi.org/10.1007/978-981-15-1960-4_8.\n\n\nGreen, Donald, Terence Leong, Holger Kern, Alan Gerber, and Christopher\nLarimer. 2009. “Testing the Accuracy of Regression Discontinuity\nAnalysis Using Experimental Benchmarks.” Political\nAnalysis 17 (4): 400–417. https://doi.org/10.1093/pan/mpp018.\n\n\nGreen, Eric. 2020. “Nivi Research: Mister P\nhelps us understand vaccine hesitancy,” December. https://research.nivi.io/posts/2020-12-08-mister-p-helps-us-understand-vaccine-hesitancy/.\n\n\nGreenberg, Bernard, Abdel-Latif Abul-Ela, Walt Simmons, and Daniel\nHorvitz. 1969. “The Unrelated Question Randomized Response Model:\nTheoretical Framework.” Journal of the American Statistical\nAssociation 64 (326): 520–39. https://doi.org/10.1080/01621459.1969.10500991.\n\n\nGreenland, Sander, Stephen Senn, Kenneth Rothman, John Carlin, Charles\nPoole, Steven Goodman, and Douglas Altman. 2016. “Statistical Tests, P values, Confidence Intervals, and\nPower: A Guide to Misinterpretations.” European\nJournal of Epidemiology 31 (4): 337–50. 
https://doi.org/10.1007/s10654-016-0149-3.\n\n\nGreifer, Noah. 2021. “Why Do We Do Matching for Causal Inference\nVs Regressing on Confounders?” Cross Validated,\nSeptember. https://stats.stackexchange.com/q/544958.\n\n\nGrimmer, Justin, Margaret Roberts, and Brandon Stewart. 2022. Text As Data: A New Framework for Machine Learning and\nthe Social Sciences. New Jersey: Princeton University Press.\n\n\nGrolemund, Garrett, and Hadley Wickham. 2011. “Dates and Times\nMade Easy with lubridate.”\nJournal of Statistical Software 40 (3): 1–25. https://doi.org/10.18637/jss.v040.i03.\n\n\nGronsbell, Jessica, Jessica Minnier, Sheng Yu, Katherine Liao, and\nTianxi Cai. 2019. “Automated Feature Selection of Predictors in\nElectronic Medical Records Data.” Biometrics 75 (1):\n268–77. https://doi.org/10.1111/biom.12987.\n\n\nGroves, Robert. 2011. “Three Eras of Survey Research.”\nPublic Opinion Quarterly 75 (5): 861–71. https://doi.org/10.1093/poq/nfr057.\n\n\nGroves, Robert, and Lars Lyberg. 2010. “Total\nSurvey Error: Past, Present, and Future.” Public\nOpinion Quarterly 74 (5): 849–79. https://doi.org/10.1093/poq/nfq065.\n\n\nGrün, Bettina, and Kurt Hornik. 2011. “topicmodels: An R Package for Fitting\nTopic Models.” Journal of Statistical Software 40 (13):\n1–30. https://doi.org/10.18637/jss.v040.i13.\n\n\nGustafsson, Karl, and Linus Hagström. 2017. “What Is the Point?\nTeaching Graduate Students How to Construct Political Science Research\nPuzzles.” European Political Science 17 (4): 634–48. https://doi.org/10.1057/s41304-017-0130-y.\n\n\nGutman, Robert. 1958. “Birth and Death Registration in\nMassachusetts: II. The Inauguration of a Modern System,\n1800-1849.” The Milbank Memorial Fund Quarterly 36 (4):\n373–402.\n\n\nHackett, Robert. 2016. “Researchers Caused an\nUproar By Publishing Data From 70,000 OkCupid Users.”\nFortune, May. https://fortune.com/2016/05/18/okcupid-data-research/.\n\n\nHalberstam, David. 1972. The Best and the\nBrightest. 1st ed. 
New York: Random House.\n\n\nHamming, Richard. (1997) 2020. The Art of Doing\nScience and Engineering. 2nd ed. Stripe Press.\n\n\nHammond, Jennifer, Heidi Leister-Tebbe, Annie Gardner, Paula Abreu,\nWeihang Bao, Wayne Wisemandle, MaryLynn Baniecki, et al. 2022.\n“Oral Nirmatrelvir for High-Risk, Nonhospitalized Adults with\nCovid-19.” New England Journal of Medicine 386 (15):\n1397–1408. https://doi.org/10.1056/nejmoa2118542.\n\n\nHand, David. 2018. “Statistical Challenges of Administrative and\nTransaction Data.” Journal of the Royal Statistical Society:\nSeries A (Statistics in Society) 181 (3): 555–605. https://doi.org/10.1111/rssa.12315.\n\n\nHandcock, Mark, and Krista Gile. 2011. “Comment: On the Concept of\nSnowball Sampling.” Sociological Methodology 41 (1):\n367–71. https://doi.org/10.1111/j.1467-9531.2011.01243.x.\n\n\nHangartner, Dominik, Daniel Kopp, and Michael Siegenthaler. 2021.\n“Monitoring Hiring Discrimination Through Online Recruitment\nPlatforms.” Nature 589 (7843): 572–76. https://doi.org/10.1038/s41586-020-03136-0.\n\n\nHanretty, Chris. 2020. “An Introduction to Multilevel Regression\nand Post-Stratification for Estimating Constituency Opinion.”\nPolitical Studies Review 18 (4): 630–45. https://doi.org/10.1177/1478929919864773.\n\n\nHao, Karen. 2019. “This is How AI Bias Really\nHappens—And Why It’s So Hard To Fix.” MIT Technology\nReview, February. https://www.technologyreview.com/2019/02/04/137602/this-is-how-ai-bias-really-happensand-why-its-so-hard-to-fix/.\n\n\nHart, Edmund, Pauline Barmby, David LeBauer, François Michonneau, Sarah\nMount, Patrick Mulrooney, Timothée Poisot, Kara Woo, Naupaka Zimmerman,\nand Jeffrey Hollister. 2016. “Ten Simple Rules for Digital Data\nStorage.” PLOS Computational Biology 12\n(10): e1005097. https://doi.org/10.1371/journal.pcbi.1005097.\n\n\nHartocollis, Anemona. 2022. “U.S. News Ranked\nColumbia No. 2, but a Math Professor Has His Doubts.”\nThe New York Times, March. 
https://www.nytimes.com/2022/03/17/us/columbia-university-rank.html.\n\n\nHassan, Mai. 2022. “New Insights on Africa’s Autocratic\nPast.” African Affairs 121 (483): 321–33. https://doi.org/10.1093/afraf/adac002.\n\n\nHastie, Trevor, and Robert Tibshirani. 1990. Generalized Additive\nModels. 1st ed. Boca Raton: Chapman; Hall/CRC.\n\n\nHawes, Michael. 2020. “Implementing Differential\nPrivacy: Seven Lessons From the\n2020 United States\nCensus.” Harvard Data Science Review 2 (2).\nhttps://doi.org/10.1162/99608f92.353c6f99.\n\n\nHayot, Eric. 2014. The Elements of Academic Style. New York:\nColumbia University Press.\n\n\nHealy, Kieran. 2018. Data Visualization. New Jersey: Princeton\nUniversity Press. https://socviz.co.\n\n\n———. 2020. “The Kitchen Counter Observatory,” May. https://kieranhealy.org/blog/archives/2020/05/21/the-kitchen-counter-observatory/.\n\n\n———. 2022. “Unhappy in Its Own Way,” July. https://kieranhealy.org/blog/archives/2022/07/22/unhappy-in-its-own-way/.\n\n\nHeckathorn, Douglas. 1997. “Respondent-Driven Sampling: A New\nApproach to the Study of Hidden Populations.” Social\nProblems 44 (2): 174–99. https://doi.org/10.2307/3096941.\n\n\nHeil, Benjamin, Michael Hoffman, Florian Markowetz, Su-In Lee, Casey\nGreene, and Stephanie Hicks. 2021. “Reproducibility Standards for\nMachine Learning in the Life Sciences.” Nature Methods\n18 (10): 1132–35. https://doi.org/10.1038/s41592-021-01256-7.\n\n\nHeller, Jean. 2022. “AP Exposes the Tuskegee Syphilis Study: The\n50th Anniversary.” AP, July. https://apnews.com/article/tuskegee-study-ap-story-investigation-syphilis-53403657e77d76f52df6c2e2892788c9.\n\n\nHermans, Felienne. 2017. “Peter Hilton on Naming.” IEEE\nSoftware 34 (3): 117–20. https://doi.org/10.1109/MS.2017.81.\n\n\n———. 2021. The Programmer’s Brain: What Every Programmer Needs to\nKnow about Cognition. 1st ed. New York: Simon; Schuster. https://www.manning.com/books/the-programmers-brain.\n\n\nHernán, Miguel, David Clayton, and Niels Keiding. 
2011. “The\nSimpson’s Paradox Unraveled.” International Journal of\nEpidemiology 40 (3): 780–85. https://doi.org/10.1093/ije/dyr041.\n\n\nHernán, Miguel, and James Robins. 2023. What If. 1st ed. Boca\nRaton: Chapman & Hall/CRC. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/.\n\n\nHerndon, Thomas, Michael Ash, and Robert Pollin. 2014. “Does High\nPublic Debt Consistently Stifle Economic Growth? A Critique of Reinhart\nand Rogoff.” Cambridge Journal of Economics 38 (2):\n257–79. https://doi.org/10.1093/cje/bet075.\n\n\nHester, Jim, Florent Angly, Russ Hyde, Michael Chirico, Kun Ren,\nAlexander Rosenstock, and Indrajeet Patil. 2022. lintr: A “Linter” for R Code. https://CRAN.R-project.org/package=lintr.\n\n\nHester, Jim, Hadley Wickham, and Gábor Csárdi. 2021. fs: Cross-Platform File System Operations Based on\n“libuv”. https://CRAN.R-project.org/package=fs.\n\n\nHill, Austin Bradford. 1965. “The Environment and Disease:\nAssociation or Causation?” Proceedings of the Royal Society\nof Medicine 58 (5): 295–300.\n\n\nHillel, Wayne. 2017. How Do We Trust Our Science Code? https://www.hillelwayne.com/how-do-we-trust-science-code/.\n\n\nHo, Daniel, Kosuke Imai, Gary King, and Elizabeth Stuart. 2011.\n“MatchIt: Nonparametric Preprocessing for Parametric\nCausal Inference.” Journal of Statistical Software 42\n(8): 1–28. https://doi.org/10.18637/jss.v042.i08.\n\n\nHodgetts, Paul. 2022. “The Negative Space of Data,” March.\nhttps://hodgettsp.netlify.app/post/data-negativespace/.\n\n\nHofmeister, Johannes, Janet Siegmund, and Daniel Holt. 2017.\n“Shorter Identifier Names Take Longer to Comprehend.” In\n2017 IEEE 24th International Conference on Software Analysis,\nEvolution and Reengineering (SANER), 217–27. https://doi.org/10.1109/saner.2017.7884623.\n\n\nHolland, Paul. 1986. “Statistics and Causal Inference.”\nJournal of the American Statistical Association 81 (396):\n945–60. 
https://doi.org/10.2307/2289064.\n\n\nHolliday, Derek, Tyler Reny, Alex Rossell Hayes, Aaron Rudkin, Chris\nTausanovitch, and Lynn Vavreck. 2021. “Democracy Fund + UCLA Nationscape Methodology and\nRepresentativeness Assessment.”\n\n\nHopper, Nate. 2022. “The Thorny Problem of Keeping the Internet’s\nTime.” The New Yorker, September. https://www.newyorker.com/tech/annals-of-technology/the-thorny-problem-of-keeping-the-internets-time.\n\n\nHorst, Allison Marie, Alison Presmanes Hill, and Kristen Gorman. 2020.\npalmerpenguins: Palmer Archipelago (Antarctica)\npenguin data. https://doi.org/10.5281/zenodo.3960218.\n\n\nHorton, Nicholas, Rohan Alexander, Micaela Parker, Aneta Piekut, and\nColin Rundel. 2022. “The Growing Importance of Reproducibility and\nResponsible Workflow in the Data Science and Statistics\nCurriculum.” Journal of Statistics and Data Science\nEducation 30 (3): 207–8. https://doi.org/10.1080/26939169.2022.2141001.\n\n\nHorton, Nicholas, and Stuart Lipsitz. 2001. “Multiple Imputation\nin Practice.” The American Statistician 55 (3): 244–54.\nhttps://doi.org/10.1198/000313001317098266.\n\n\nHotz, Joseph, Christopher Bollinger, Tatiana Komarova, Charles Manski,\nRobert Moffitt, Denis Nekipelov, Aaron Sojourner, and Bruce Spencer.\n2022. “Balancing Data Privacy and Usability in the Federal\nStatistical System.” Proceedings of the National Academy of\nSciences 119 (31): 1–10. https://doi.org/10.1073/pnas.2104906119.\n\n\nHowes, Adam. 2022. “Representing Uncertainty Using Significant\nFigures,” April. https://athowes.github.io/posts/2022-04-24-representing-uncertainty-using-significant-figures/.\n\n\nHug, Lucia, Monica Alexander, Danzhen You, Leontine Alkema, and UN\nInter-agency Group for Child. 2019. “National, Regional, and\nGlobal Levels and Trends in Neonatal Mortality Between 1990 and 2017,\nwith Scenario-Based Projections to 2030: A Systematic Analysis.”\nLancet Global Health 7 (6): e710–20. 
https://doi.org/10.1016/S2214-109X(19)30163-9.\n\n\nHughes, Nicola, and Jill Rutter. 2016. “Ministers Reflect:\nInterview with Oliver Letwin,” December. https://www.instituteforgovernment.org.uk/ministers-reflect/person/oliver-letwin/.\n\n\nHulley, Stephen, Steven Cummings, Warren Browner, Deborah Grady, and\nThomas Newman. 2007. Designing Clinical Research. 3rd ed.\nLippincott Williams & Wilkins.\n\n\nHullman, Jessica, and Andrew Gelman. 2021. “Designing for\nInteractive Exploratory Data Analysis Requires Theories of Graphical\nInference.” Harvard Data Science Review 3 (3). https://doi.org/10.1162/99608f92.3ab8a587.\n\n\nHuntington-Klein, Nick. 2021. The Effect: An Introduction to\nResearch Design and Causality. 1st ed. Chapman & Hall. https://theeffectbook.net.\n\n\n———. 2022. “Library of Statistical Techniques.” https://lost-stats.github.io.\n\n\nHuntington-Klein, Nick, Andreu Arenas, Emily Beam, Marco Bertoni,\nJeffrey Bloem, Pralhad Burli, Naibin Chen, et al. 2021. “The\nInfluence of Hidden Researcher Decisions in Applied\nMicroeconomics.” Economic Inquiry 59: 944–60. https://doi.org/10.1111/ecin.12992.\n\n\nHuyen, Chip. 2020. “Machine Learning Is Going Real-Time,”\nDecember. https://huyenchip.com/2020/12/27/real-time-machine-learning.html.\n\n\nHvitfeldt, Emil, and Julia Silge. 2021. Supervised Machine Learning for Text Analysis in\nR. 1st ed. Chapman; Hall/CRC. https://doi.org/10.1201/9781003093459.\n\n\nHyman, Michael, Luca Sartore, and Linda J Young. 2021. “Capture-Recapture Estimation of Characteristics of U.S.\nLocal Food Farms Using a Web-Scraped List Frame.”\nJournal of Survey Statistics and Methodology 10 (4): 979–1004.\nhttps://doi.org/10.1093/jssam/smab008.\n\n\nHyndman, Rob, Timothy Hyndman, Charles Gray, Sayani Gupta, and Jacquie\nTran. 2022. cricketdata: International Cricket\nData. https://CRAN.R-project.org/package=cricketdata.\n\n\nIannone, Richard. 2022. DiagrammeR: Graph/Network\nVisualization. 
https://CRAN.R-project.org/package=DiagrammeR.\n\n\nIannone, Richard, Joe Cheng, Barret Schloerke, Ellis Hughes, Alexandra\nLauer, and JooYoung Seo. 2022. gt: Easily\nCreate Presentation-Ready Display Tables.\n\n\nIannone, Richard, and Mauricio Vargas. 2022. pointblank: Data Validation and Organization of Metadata\nfor Local and Remote Tables. https://CRAN.R-project.org/package=pointblank.\n\n\nInternational Organization Of Legal Metrology. 2007. International\nVocabulary of Metrology – Basic and General Concepts and Associated\nTerms. 3rd ed. https://www.oiml.org/en/files/pdf%5Fv/v002-200-e07.pdf.\n\n\nIoannidis, John. 2005. “Why Most Published Research Findings Are\nFalse.” PLOS Medicine 2 (8): e124. https://doi.org/10.1371/journal.pmed.0020124.\n\n\nIrizarry, Rafael. 2020. “The Role of Academia\nin Data Science Education.” Harvard Data Science\nReview 2 (1). https://doi.org/10.1162/99608f92.dd363929.\n\n\nIrving, Damien, Kate Hertweck, Luke Johnston, Joel Ostblom, Charlotte\nWickham, and Greg Wilson. 2021. Research Software Engineering with\nPython. Chapman; Hall/CRC.\n\n\nIsaacson, Walter. 2011. Steve Jobs. 1st ed. Simon &\nSchuster.\n\n\nIshiguro, Kazuo. 1989. The Remains of the Day. 1st ed. Faber;\nFaber.\n\n\nIzrailev, Sergei. 2022. tictoc: Functions for\nTiming R Scripts, as Well as Implementations of “Stack” and\n“List” Structures. https://CRAN.R-project.org/package=tictoc.\n\n\nJames, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani.\n(2013) 2021. An Introduction to Statistical\nLearning with Applications in R. 2nd ed. Springer. https://www.statlearning.com.\n\n\nJenkins, Jennifer, Steven Rich, Andrew Ba Tran, Paige Moody, Julie Tate,\nand Ted Mellnik. 2022. “How the Washington Post Examines Police\nShootings in the United States.” https://www.washingtonpost.com/investigations/2022/12/05/washington-post-fatal-police-shootings-methodology/.\n\n\nJet Propulsion Laboratory. 2009. 
“JPL\nInstitutional Coding Standard for the C Programming\nLanguage.” Document Number D-60411, March. https://web.archive.org/web/20111015064908/http://lars-lab.jpl.nasa.gov/JPL_Coding_Standard_C.pdf.\n\n\nJohnson, Alicia, Miles Ott, and Mine Dogucu. 2022. Bayes Rules! An Introduction to Bayesian Modeling with\nR. 1st ed. Chapman; Hall/CRC. https://www.bayesrulesbook.com.\n\n\nJohnson, Kaneesha. 2021. “Two Regimes of Prison Data\nCollection.” Harvard Data Science Review 3 (3). https://doi.org/10.1162/99608f92.72825001.\n\n\nJohnston, Myfanwy, and David Robinson. 2022. gutenbergr: Download and Process Public Domain Works from\nProject Gutenberg. https://CRAN.R-project.org/package=gutenbergr.\n\n\nJones, Arnold. 1953. “Census Records of the Later Roman\nEmpire.” The Journal of Roman Studies 43: 49–64. https://doi.org/10.2307/297781.\n\n\nJordan, Michael. 2004. “Graphical Models.” Statistical\nScience 19 (1). https://doi.org/10.1214/088342304000000026.\n\n\n———. 2019. “Artificial Intelligence–The\nRevolution Hasn’t Happened Yet.” Harvard Data Science\nReview 1 (1). https://doi.org/10.1162/99608f92.f06c6e61.\n\n\nJoyner, Michael. 1991. “Modeling: Optimal Marathon Performance on\nthe Basis of Physiological Factors.” Journal of Applied\nPhysiology 70 (2): 683–87. https://doi.org/10.1152/jappl.1991.70.2.683.\n\n\nJurafsky, Dan, and James Martin. (2000) 2023. Speech and Language\nProcessing. 3rd ed. https://web.stanford.edu/~jurafsky/slp3/.\n\n\nKahan, Brennan, Suzie Cro, Fan Li, and Michael Harhay. 2023.\n“Eliminating Ambiguous Treatment Effects Using Estimands.”\nAmerican Journal of Epidemiology, February. https://doi.org/10.1093/aje/kwad036.\n\n\nKahan, Brennan, Joanna Hindley, Mark Edwards, Suzie Cro, and Tim Morris.\n2024. “The estimands framework: a primer on\nthe ICH E9(R1) addendum.” BMJ, January, e076316.\nhttps://doi.org/10.1136/bmj-2023-076316.\n\n\nKahan, Brennan, Fan Li, Andrew Copas, and Michael Harhay. 
2022.\n“Estimands in Cluster-Randomized Trials: Choosing Analyses That\nAnswer the Right Question.” International Journal of\nEpidemiology, July. https://doi.org/10.1093/ije/dyac131.\n\n\nKahle, David, and Hadley Wickham. 2013. “ggmap: Spatial Visualization with ggplot2.”\nThe R Journal 5 (1): 144–61. http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf.\n\n\nKahneman, Daniel, Olivier Sibony, and Cass Sunstein. 2021. Noise: A\nFlaw in Human Judgment. William Collins.\n\n\nKalamara, Eleni, Arthur Turrell, Chris Redl, George Kapetanios, and\nSujit Kapadia. 2022. “Making text count:\nEconomic forecasting using newspaper text.”\nJournal of Applied Econometrics 37 (5): 896–919.\nhttps://doi.org/10.1002/jae.2907.\n\n\nKalgin, Alexander. 2014. “Implementation of\nPerformance Management in Regional Government in Russia: Evidence of\nData Manipulation.” Public Management Review 18\n(1): 110–38. https://doi.org/10.1080/14719037.2014.965271.\n\n\nKapoor, Sayash, and Arvind Narayanan. 2023. “Leakage and the\nReproducibility Crisis in Machine-Learning-Based Science.”\nPatterns 4 (9): 1–12. https://doi.org/10.1016/j.patter.2023.100804.\n\n\nKarsten, Karl. 1923. Charts and Graphs. New York:\nPrentice-Hall.\n\n\nKasy, Maximilian, and Alexander Teytelboym. 2023. “Matching with\nSemi-Bandits.” The Econometrics Journal 26 (1): 45–66.\nhttps://doi.org/10.1093/ectj/utac021.\n\n\nKatz, Lindsay, and Rohan Alexander. 2023a. “A\nnew, comprehensive database of all proceedings of the Australian\nParliamentary Debates (1998-2022).” Zenodo. https://doi.org/10.5281/zenodo.7799678.\n\n\n———. 2023b. “Digitization of the Australian Parliamentary Debates,\n1998–2022.” Scientific Data 10 (1): 1–14. https://doi.org/10.1038/s41597-023-02464-w.\n\n\nKay, Matthew. 2022. tidybayes: Tidy Data\nand Geoms for Bayesian Models. https://doi.org/10.5281/zenodo.1308151.\n\n\nKennedy, Lauren, and Jonah Gabry. 2020. “MRP\nwith rstanarm,” July. 
https://mc-stan.org/rstanarm/articles/mrp.html.\n\n\nKennedy, Lauren, and Andrew Gelman. 2021. “Know Your Population\nand Know Your Model: Using Model-Based Regression and Poststratification\nto Generalize Findings Beyond the Observed Sample.”\nPsychological Methods 26 (5): 547–58. https://doi.org/10.1037/met0000362.\n\n\nKennedy, Lauren, Katharine Khanna, Daniel Simpson, Andrew Gelman, Yajun\nJia, and Julien Teitler. 2022. “He, She, They: Using Sex and\nGender in Survey Adjustment.” https://arxiv.org/abs/2009.14401.\n\n\nKenny, Christopher T., Shiro Kuriwaki, Cory McCartan, Evan T. R.\nRosenman, Tyler Simko, and Kosuke Imai. 2021. “The use of differential privacy for census data and its\nimpact on redistricting: The case of the 2020 U.S.\nCensus.” Science Advances 7 (41). https://doi.org/10.1126/sciadv.abk3283.\n\n\n———. 2023. “Comment: The Essential Role of Policy Evaluation for\nthe 2020 Census Disclosure Avoidance System.” Harvard Data\nScience Review, no. Special Issue 2. https://doi.org/10.1162/99608f92.abc2c765.\n\n\nKent, William. 1993. “My Height: A Model for Numeric\nInformation.” https://www.bkent.net/Doc/myheight.htm.\n\n\nKeshav, Srinivasan. 2007. “How to Read a Paper.”\nACM SIGCOMM Computer Communication\nReview 37 (3): 83–84. https://doi.org/10.1145/1273445.1273458.\n\n\nKeyes, Os. 2019. “Counting the Countless.” Real\nLife. https://reallifemag.com/counting-the-countless/.\n\n\nKharecha, Pushker, and James Hansen. 2013. “Prevented Mortality\nand Greenhouse Gas Emissions from Historical and Projected Nuclear\nPower.” Environmental Science & Technology 47 (9):\n4889–95. https://doi.org/10.1021/es3051197.\n\n\nKiang, Mathew, Alexander Tsai, Monica Alexander, David Rehkopf, and\nSanjay Basu. 2021. “Racial/Ethnic Disparities in Opioid-Related\nMortality in the USA, 1999–2019: The Extreme Case of Washington\nDC.” Journal of Urban Health 98 (5): 589–95. https://doi.org/10.1007/s11524-021-00573-8.\n\n\nKing, Gary. 2006. 
“Publication, Publication.” PS:\nPolitical Science & Politics 39 (1): 119–25. https://doi.org/10.1017/S1049096506060252.\n\n\nKing, Gary, and Richard Nielsen. 2019. “Why Propensity Scores\nShould Not Be Used for Matching.” Political Analysis 27\n(4): 435–54. https://doi.org/10.1017/pan.2019.11.\n\n\nKing, Stephen. 2000. On Writing: A Memoir of the Craft. 1st ed.\nScribner.\n\n\nKirkegaard, Emil, and Julius Bjerrekær. 2016. “The OKCupid\nDataset: A Very Large Public Dataset of Dating Site Users.”\nOpen Differential Psychology, 1–10. https://doi.org/10.26775/ODP.2016.11.03.\n\n\nKish, Leslie. 1959. “Some Statistical Problems in Research\nDesign.” American Sociological Review 24 (3): 328–38. https://doi.org/10.2307/2089381.\n\n\nKleiber, Christian, and Achim Zeileis. 2008. Applied Econometrics\nwith R. New York: Springer-Verlag. https://CRAN.R-project.org/package=AER.\n\n\nKnuth, Donald. 1984. “Literate Programming.” The\nComputer Journal 27 (2): 97–111. https://doi.org/10.1093/comjnl/27.2.97.\n\n\n———. 1998. Art of Computer Programming, Volume 2: Seminumerical\nAlgorithms. 2nd ed.\n\n\nKnutson, Victoria, Serge Aleshin-Guendel, Ariel Karlinsky, William\nMsemburi, and Jon Wakefield. 2022. “Estimating Global and\nCountry-Specific Excess Mortality During the COVID-19 Pandemic,”\nMay. https://cdn.who.int/media/docs/default-source/world-health-data-platform/covid-19-excessmortality/covid-methods-paper-revision.pdf.\n\n\nKoenecke, Allison, Andrew Nam, Emily Lake, Joe Nudell, Minnie Quartey,\nZion Mengesha, Connor Toups, John Rickford, Dan Jurafsky, and Sharad\nGoel. 2020. “Racial Disparities in Automated Speech\nRecognition.” Proceedings of the National Academy of\nSciences 117 (14): 7684–89. https://doi.org/10.1073/pnas.1915768117.\n\n\nKoenecke, Allison, and Hal Varian. 2020. “Synthetic Data\nGeneration for Economists.” https://arxiv.org/abs/2011.01374.\n\n\nKoenker, Roger, and Achim Zeileis. 2009. 
“On Reproducible\nEconometric Research.” Journal of Applied Econometrics\n24 (5): 833–47. https://doi.org/10.1002/jae.1083.\n\n\nKoerner, Lisbet. 2000. Linnaeus: Nature and Nation. Cambridge:\nHarvard University Press.\n\n\nKohavi, Ron, Alex Deng, Brian Frasca, Roger Longbotham, Toby Walker, and\nYa Xu. 2012. “Trustworthy Online Controlled Experiments.”\nIn Proceedings of the 18th ACM SIGKDD\nInternational Conference on Knowledge Discovery and Data Mining -\nKDD 12, 1st ed. ACM Press.\nhttps://doi.org/10.1145/2339530.2339653.\n\n\nKohavi, Ron, Diane Tang, and Ya Xu. 2020. Trustworthy Online Controlled Experiments: A Practical\nGuide to A/B Testing. Cambridge University Press.\n\n\nKoitsalu, Marie, Martin Eklund, Jan Adolfsson, Henrik Grönberg, and\nYvonne Brandberg. 2018. “Effects of Pre-Notification, Invitation\nLength, Questionnaire Length and Reminder on Participation Rate: A\nQuasi-Randomised Controlled Trial.” BMC Medical Research\nMethodology 18 (3): 1–5. https://doi.org/10.1186/s12874-017-0467-5.\n\n\nKrantz, Sebastian. 2023. collapse: Advanced and\nFast Data Transformation. https://CRAN.R-project.org/package=collapse.\n\n\nKuhn, Max. 2022. tune: Tidy Tuning\nTools. https://CRAN.R-project.org/package=tune.\n\n\nKuhn, Max, and Hannah Frick. 2022. poissonreg:\nModel Wrappers for Poisson Regression. https://CRAN.R-project.org/package=poissonreg.\n\n\nKuhn, Max, and Davis Vaughan. 2022. parsnip: A\nCommon API to Modeling and Analysis Functions. https://CRAN.R-project.org/package=parsnip.\n\n\nKuhn, Max, Davis Vaughan, and Emil Hvitfeldt. 2022. yardstick: Tidy Characterizations of Model\nPerformance. https://CRAN.R-project.org/package=yardstick.\n\n\nKuhn, Max, and Hadley Wickham. 2020. tidymodels: a collection of packages for modeling and\nmachine learning using tidyverse principles. https://www.tidymodels.org.\n\n\n———. 2022. recipes: Preprocessing and Feature\nEngineering Steps for Modeling. 
https://CRAN.R-project.org/package=recipes.\n\n\nKuriwaki, Shiro, Will Beasley, and Thomas Leeper. 2023. dataverse: R Client for Dataverse 4+\nRepositories.\n\n\nKuznets, Simon, Lillian Epstein, and Elizabeth Jenks. 1941. National Income and Its Composition,\n1919-1938. National Bureau of Economic Research.\n\n\nLamott, Anne. 1994. Bird by Bird: Some Instructions on Writing and\nLife. Anchor Books.\n\n\nLandau, William Michael. 2021. “The targets R\nPackage: A Dynamic Make-Like Function-Oriented Pipeline Toolkit for\nReproducibility and High-Performance Computing.”\nJournal of Open Source Software 6 (57): 2959. https://doi.org/10.21105/joss.02959.\n\n\nLane, Nick. 2015. “The Unseen World: Reflections on Leeuwenhoek\n(1677) ‘Concerning Little Animals’.”\nPhilosophical Transactions of the Royal Society B: Biological\nSciences 370 (1666): 20140344. https://doi.org/10.1098/rstb.2014.0344.\n\n\nLaouenan, Morgane, Palaash Bhargava, Jean-Benoît Eyméoud, Olivier\nGergaud, Guillaume Plique, and Etienne Wasmer. 2022. “A Cross-Verified Database of Notable People,\n3500BC–2018AD.” Scientific Data 9 (290). https://doi.org/10.1038/s41597-022-01369-4.\n\n\nLarmarange, Joseph. 2023. labelled:\nManipulating Labelled Data. https://CRAN.R-project.org/package=labelled.\n\n\nLatour, Bruno. 1996. “On Actor-Network Theory: A Few\nClarifications.” Soziale Welt 47 (4): 369–81. http://www.jstor.org/stable/40878163.\n\n\nLauderdale, Benjamin, Delia Bailey, Jack Blumenau, and Douglas Rivers.\n2020. “Model-Based Pre-Election Polling for National and\nSub-National Outcomes in the US and UK.” International\nJournal of Forecasting 36 (2): 399–413. https://doi.org/10.1016/j.ijforecast.2019.05.012.\n\n\nLaver, Michael, Kenneth Benoit, and John Garry. 2003. “Extracting\nPolicy Positions from Political Texts Using Words as Data.”\nAmerican Political Science Review 97 (2): 311–31. 
https://doi.org/10.1017/S0003055403000698.\n\n\nLeek, Jeff, Blakeley McShane, Andrew Gelman, David Colquhoun, Michèle\nNuijten, and Steven Goodman. 2017. “Five Ways to Fix\nStatistics.” Nature 551 (7682): 557–59. https://doi.org/10.1038/d41586-017-07522-z.\n\n\nLeek, Jeff, and Roger Peng. 2020. “Advanced Data Science\n2020.” http://jtleek.com/ads2020/index.html.\n\n\nLeonelli, Sabina. 2020. “Learning from Data Journeys.” In\nData Journeys in the Sciences, 1–24. Springer International\nPublishing. https://doi.org/10.1007/978-3-030-37177-7_1.\n\n\nLeos-Barajas, Vianey, Theoni Photopoulou, Roland Langrock, Toby\nPatterson, Yuuki Watanabe, Megan Murgatroyd, and Yannis Papastamatiou.\n2016. “Analysis of Animal Accelerometer Data Using Hidden Markov\nModels.” Methods in Ecology and Evolution 8 (2): 161–73.\nhttps://doi.org/10.1111/2041-210x.12657.\n\n\nLetterman, Clark. 2021. “Q&A: How Pew\nResearch Center surveyed nearly 30,000 people in India,”\nJuly. https://medium.com/pew-research-center-decoded/q-a-how-pew-research-center-surveyed-nearly-30-000-people-in-india-7c778f6d650e.\n\n\nLevay, Kevin, Jeremy Freese, and James Druckman. 2016. “The\nDemographic and Political Composition of Mechanical Turk\nSamples.” SAGE Open 6 (1): 1–17. https://doi.org/10.1177/2158244016636433.\n\n\nLevine, Judah, Patrizia Tavella, and Martin Milton. 2022. “Towards\na Consensus on a Continuous Coordinated Universal Time.”\nMetrologia 60 (1): 014001. https://doi.org/10.1088/1681-7575/ac9da5.\n\n\nLewis, Crystal. 2024. Data Management in Large-Scale Education\nResearch. 1st ed. Chapman; Hall/CRC. https://datamgmtinedresearch.com/index.html.\n\n\nLichand, Guilherme, and Sharon Wolf. 2022. “Measuring Child Labor:\nWhom Should Be Asked, and Why It Matters,” March. https://doi.org/10.21203/rs.3.rs-1474562/v1.\n\n\nLight, Richard, Judith Singer, and John Willett. 1990. By Design: Planning Research on Higher\nEducation. 1st ed. 
Cambridge: Harvard University Press.\n\n\nLima, Renato de, Oliver Phillips, Alvaro Duque, Sebastian Tello, Stuart\nDavies, Alexandre Adalardo de Oliveira, Sandra Muller, et al. 2022.\n“Making Forest Data Fair and Open.” Nature Ecology\n& Evolution 6 (April): 656–58. https://doi.org/10.1038/s41559-022-01738-7.\n\n\nLin, Herbert. 2014. “A Proposal to Reduce Government\nOverclassification of Information Related to National Security.”\nJournal of National Security Law and Policy 7: 443–63.\n\n\nLin, Sarah, Ibraheem Ali, and Greg Wilson. 2021. “Ten Quick Tips\nfor Making Things Findable.” PLOS Computational Biology\n16 (12): 1–10. https://doi.org/10.1371/journal.pcbi.1008469.\n\n\nLips, Hilary. 2020. Sex and Gender: An Introduction. 7th ed.\nIllinois: Waveland Press.\n\n\nLittle, Roderick, and Roger Lewis. 2021. “Estimands, Estimators,\nand Estimates.” JAMA 326 (10): 967. https://doi.org/10.1001/jama.2021.2886.\n\n\nLiu, Emily, Lenny Bronner, and Jeremy Bowers. 2022. “What the\nWashington Post Elections Engineering Team Had to Learn about Election\nData.” Washington Post Engineering, April. https://washpost.engineering/what-the-washington-post-elections-engineering-team-had-to-learn-about-election-data-a41603daf9ca.\n\n\nLockheed Martin. 2005. “Joint Strike Fighter Air Vehicle C++\nCoding Standards For The System Development And Demonstration\nProgram.” Document Number 2RDU00001 Rev C,\nDecember. https://www.stroustrup.com/JSF-AV-rules.pdf.\n\n\nLohr, Sharon. (1999) 2022. Sampling: Design and Analysis. 3rd\ned. Chapman; Hall/CRC.\n\n\nLoken, Meredith, and Hilary Matfess. 2023. “Introducing the\nWomen’s Activities in Armed Rebellion (WAAR) Project, 1946-2015.”\nJournal of Peace Research.\n\n\nLovelace, Robin, Jakub Nowosad, and Jannes Muenchow. 2019. Geocomputation with R. 1st ed. Chapman;\nHall/CRC. https://geocompr.robinlovelace.net.\n\n\nLucas, Jack, Reed Merrill, Kelly Blidook, Sandra Breux, Laura Conrad,\nGabriel Eidelman, Royce Koop, et al. 2020. 
“Canadian\nMunicipal Elections Database.” Scholars Portal Dataverse.\nhttps://doi.org/10.5683/sp2/4mzjpq.\n\n\nLucas, Robert. 1978. “Asset Prices in an Exchange Economy.”\nEconometrica 46 (6): 1429–45. https://doi.org/10.2307/1913837.\n\n\nLuebke, David Martin, and Sybil Milton. 1994. “Locating the\nVictim: An Overview of Census-Taking, Tabulation Technology, and\nPersecution in Nazi Germany.” IEEE Annals of the History of\nComputing 16 (3): 25–39. https://doi.org/10.1109/MAHC.1994.298418.\n\n\nLumley, Thomas. 2020. “survey: analysis of\ncomplex survey samples.” https://cran.r-project.org/web/packages/survey/index.html.\n\n\nLundberg, Ian, Rebecca Johnson, and Brandon Stewart. 2021. “What\nIs Your Estimand? Defining the Target Quantity Connects Statistical\nEvidence to Theory.” American Sociological Review 86\n(3): 532–65. https://doi.org/10.1177/00031224211004187.\n\n\nLuscombe, Alex, Kevin Dick, and Kevin Walby. 2021. “Algorithmic\nThinking in the Public Interest: Navigating Technical, Legal, and\nEthical Hurdles to Web Scraping in the Social Sciences.”\nQuality & Quantity 56 (3): 1–22. https://doi.org/10.1007/s11135-021-01164-0.\n\n\nLuscombe, Alex, Jamie Duncan, and Kevin Walby. 2022. “Jumpstarting\nthe Justice Disciplines: A Computational-Qualitative Approach to\nCollecting and Analyzing Text and Image Data in Criminology and Criminal\nJustice Studies.” Journal of Criminal Justice Education\n33 (2): 151–71. https://doi.org/10.1080/10511253.2022.2027477.\n\n\nLuscombe, Alex, and Alexander McClelland. 2020. “Policing the\nPandemic: Tracking the Policing of Covid-19 Across Canada,”\nApril. https://doi.org/10.31235/osf.io/9pn27.\n\n\nLyman, Frank. 1981. “The Responsive Classroom Discussion: The\nInclusion of All Students.” Mainstreaming Digest 109:\n109–13.\n\n\nMacDorman, Marian, and Eugene Declercq. 2018. “The Failure of\nUnited States Maternal Mortality Reporting and Its Impact on Women’s\nLives.” Birth 45 (2): 105–8. 
https://doi.org/10.1111/birt.12333.\n\n\nMaher, Michael. 1982. “Modelling Association Football\nScores.” Statistica Neerlandica 36 (3): 109–18. https://doi.org/10.1111/j.1467-9574.1982.tb00782.x.\n\n\nMaier, Maximilian, František Bartoš, Tom Stanley, David Shanks, Adam\nHarris, and Eric-Jan Wagenmakers. 2022. “No Evidence for Nudging\nAfter Adjusting for Publication Bias.” Proceedings of the\nNational Academy of Sciences 119 (31): e2200300119. https://doi.org/10.1073/pnas.2200300119.\n\n\nMammoliti, Anthony, Petr Smirnov, Minoru Nakano, Zhaleh Safikhani,\nChristopher Eeles, Heewon Seo, Sisira Kadambat Nair, et al. 2021.\n“Orchestrating and Sharing Large Multimodal Data for Transparent\nand Reproducible Research.” Nature Communications 12\n(1). https://doi.org/10.1038/s41467-021-25974-w.\n\n\nManski, Charles. 2022. “Inference with Imputed Data: The Allure of\nMaking Stuff Up.” arXiv. https://doi.org/10.48550/arXiv.2205.07388.\n\n\nMarchese, David. 2022. “Her Discovery Changed the World. How Does\nShe Think We Should Use It?” The New York Times, August.\nhttps://www.nytimes.com/interactive/2022/08/15/magazine/jennifer-doudna-crispr-interview.html.\n\n\nMartin, Charles, and Ben Popper. 2021. “Don’t Push That Button:\nExploring the Software That Flies SpaceX Rockets and Starships.”\nThe Overflow, December. https://stackoverflow.blog/2021/12/27/dont-push-that-button-exploring-the-software-that-flies-spacex-starships/.\n\n\nMartínez, Luis. 2022. “How Much Should We Trust the Dictator’s\nGDP Growth Estimates?” Journal of Political\nEconomy 130 (10): 2731–69. https://doi.org/10.1086/720458.\n\n\nMatias, Nathan, Kevin Munger, Marianne Aubin Le Quere, and Charles\nEbersole. 2021. “The Upworthy Research\nArchive, a time series of 32,487 experiments in U.S.\nmedia.” Scientific Data 8 (1): 1–8. https://doi.org/10.1038/s41597-021-00934-7.\n\n\nMatsumoto, Yukihiro. 2007. “Treating Code as\nan Essay.” In Beautiful Code, edited by Andy Oram\nand Greg Wilson, 477–81. 
O’Reilly.\n\n\nMattson, Greggor. 2017. “Artificial Intelligence Discovers\nGayface. Sigh.” https://greggormattson.com/2017/09/09/artificial-intelligence-discovers-gayface/amp/.\n\n\nMcCarthy, Fiona M., Tamsin E. M. Jones, Anne E. Kwitek, Cynthia L.\nSmith, Peter D. Vize, Monte Westerfield, and Elspeth A. Bruford. 2023.\n“The Case for Standardizing Gene Nomenclature in\nVertebrates.” Nature 614 (7948): E31–32. https://doi.org/10.1038/s41586-022-05633-w.\n\n\nMcClelland, Alexander. 2019. “‘Lock This Whore up’:\nLegal Violence and Flows of Information Precipitating Personal Violence\nAgainst People Criminalised for HIV-Related Crimes in Canada.”\nEuropean Journal of Risk Regulation 10 (1): 132–47. https://doi.org/10.1017/err.2019.20.\n\n\nMcElreath, Richard. (2015) 2020. Statistical\nRethinking: A Bayesian Course with Examples in R and Stan.\n2nd ed. Chapman; Hall/CRC.\n\n\n———. 2020. “Science as Amateur Software Development.”\nYouTube, September. https://youtu.be/zwRdO9%5FGGhY.\n\n\nMcIlroy, Doug, Ray Brownrigg, Thomas Minka, and Roger Bivand. 2023.\nmapproj: Map Projections. https://CRAN.R-project.org/package=mapproj.\n\n\nMcKenzie, David. 2021. “What Do You Need To\nDo To Make A Matching Estimator Convincing? Rhetorical vs Statistical\nChecks.” World Bank Blogs—Development Impact,\nFebruary. https://blogs.worldbank.org/impactevaluations/what-do-you-need-do-make-matching-estimator-convincing-rhetorical-vs-statistical.\n\n\nMcKinney, Wes. (2011) 2022. Python for Data Analysis. 3rd ed.\nhttps://wesmckinney.com/book/.\n\n\nMcPhee, John. 2017. Draft No. 4. 1st ed. Farrar, Straus;\nGiroux.\n\n\nMcQuire, Scott. 2019. “One Map to Rule Them All? Google Maps as\nDigital Technical Object.” Communication and the Public\n4 (2): 150–65. https://doi.org/10.1177/2057047319850192.\n\n\nMellon, Jonathan. 2024. “Rain, Rain, Go Away: 194 Potential\nExclusion‐restriction Violations for Studies Using Weather as an\nInstrumental Variable.” American Journal of Political\nScience, 1–18. 
https://doi.org/10.1111/ajps.12894.\n\n\nMeng, Xiao-Li. 1994. “Multiple-Imputation Inferences with\nUncongenial Sources of Input.” Statistical Science 9\n(4): 538–58. https://doi.org/10.1214/ss/1177010269.\n\n\n———. 2012. “You Want Me to Analyze Data i Don’t Have? Are You\nInsane?” Shanghai Archives of Psychiatry 24 (5):\n297–301. https://doi.org/10.3969/j.issn.1002-0829.2012.05.011.\n\n\n———. 2018. “Statistical Paradises and Paradoxes in Big Data (i):\nLaw of Large Populations, Big Data Paradox, and the 2016 US Presidential\nElection.” The Annals of Applied Statistics 12 (2):\n685–726. https://doi.org/10.1214/18-AOAS1161SF.\n\n\n———. 2021. “What Are the Values of Data, Data Science, or Data\nScientists?” Harvard Data Science Review 3 (1). https://doi.org/10.1162/99608f92.ee717cf7.\n\n\nMerali, Zeeya. 2010. “Computational Science:... Error.”\nNature 467 (7317): 775–77. https://doi.org/10.1038/467775a.\n\n\nMiceli, Milagros, Julian Posada, and Tianling Yang. 2022.\n“Studying up Machine Learning Data.” Proceedings of the\nACM on Human-Computer Interaction 6 (January): 1–14.\nhttps://doi.org/10.1145/3492853.\n\n\nMichener, William. 2015. “Ten Simple Rules for Creating a Good\nData Management Plan.” PLOS Computational Biology 11\n(10): e1004525. https://doi.org/10.1371/journal.pcbi.1004525.\n\n\nMill, James. 1817. The History of British India. 1st ed. https://books.google.ca/books?id=Orw_AAAAcAAJ.\n\n\nMiller, Greg. 2014. “The Cartographer Who’s\nTransforming Map Design.” Wired, October. https://www.wired.com/2014/10/cindy-brewer-map-design/.\n\n\nMiller, Michael, and Joseph Sutherland. 2022. “The Effect of\nGender on Interruptions at Congressional Hearings.” American\nPolitical Science Review, 1–19. https://doi.org/10.1017/S0003055422000260.\n\n\nMills, David L. 1991. “Internet Time Synchronization: The Network\nTime Protocol.” IEEE Transactions on Communications 39\n(10): 1482–93.\n\n\nMindell, David. 2008. Digital Apollo: Human and\nMachine in Spaceflight. 1st ed. 
New York: The MIT Press.\n\n\nMineault, Patrick, and The Good Research Code Handbook Community. 2021.\n“The Good Research Code Handbook.” https://doi.org/10.5281/zenodo.5796873.\n\n\nMinsky, Yaron. 2011. “OCaml for the\nmasses.” Communications of the ACM 54 (11):\n53–58. https://doi.org/10.1145/2018396.2018413.\n\n\n———. 2015. “Automated Trading and OCaml with Yaron Minsky.”\nHackers — Software Engineering Daily, November. https://softwareengineeringdaily.com/2015/11/09/automated-trading-and-ocaml-with-yaron-minsky/.\n\n\nMitchell, Alanna. 2022a. “Get Ready for the New, Improved\nSecond.” The New York Times, April. https://www.nytimes.com/2022/04/25/science/time-second-measurement.html.\n\n\n———. 2022b. “Time Has Run Out for the Leap Second.” The\nNew York Times, November. https://www.nytimes.com/2022/11/14/science/time-leap-second.html.\n\n\nMitchell, Margaret, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy\nVasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and\nTimnit Gebru. 2019. “Model Cards for Model Reporting.”\nProceedings of the Conference on Fairness, Accountability, and\nTransparency, January. https://doi.org/10.1145/3287560.3287596.\n\n\nMitrovski, Alen, Xiaoyan Yang, and Matthew Wankiewicz. 2020. “Joe\nBiden Projected to Win Popular Vote in 2020 US Election.” https://github.com/matthewwankiewicz/US_election_forecast.\n\n\nMiyakawa, Tsuyoshi. 2020. “No Raw Data, No Science: Another\nPossible Source of the Reproducibility Crisis.” Molecular\nBrain 13 (1): 1–6. https://doi.org/10.1186/s13041-020-0552-2.\n\n\nMok, Lillio, Samuel Way, Lucas Maystre, and Ashton Anderson. 2022.\n“The Dynamics of Exploration on Spotify.” In\nProceedings of the International AAAI Conference on Web and Social\nMedia, 16:663–74. https://doi.org/10.1609/icwsm.v16i1.19324.\n\n\nMolanphy, Chris. 2012. “100 & Single: Three Rules to Define\nthe Term ‘One-Hit Wonder’ in 2012.” The Village\nVoice, September. 
https://www.villagevoice.com/2012/09/10/100-single-three-rules-to-define-the-term-one-hit-wonder-in-2012/.\n\n\nMorange, Michel. 2016. A History of Biology. New Jersey:\nPrinceton University Press.\n\n\nMoyer, Brian, and Abe Dunn. 2020. “Measuring the\nGross Domestic Product\n(GDP): The Ultimate Data\nScience Project.” Harvard Data\nScience Review 2 (1). https://doi.org/10.1162/99608f92.414caadb.\n\n\nMullard, Asher. 2021. “Half of Top Cancer Studies Fail\nHigh-Profile Reproducibility Effort.” Nature 600 (7889):\n368–69. https://doi.org/10.1038/d41586-021-03691-0.\n\n\nMüller, Kirill. 2020. here: A Simpler Way to\nFind Your Files. https://CRAN.R-project.org/package=here.\n\n\nMüller, Kirill, Tobias Schieferdecker, and Patrick Schratz. 2019.\nVisualization, Transformation and Reporting with the Tidyverse.\nhttps://krlmlr.github.io/vistransrep/.\n\n\nMüller, Kirill, and Lorenz Walthert. 2022. styler: Non-Invasive Pretty Printing of R\nCode. https://CRAN.R-project.org/package=styler.\n\n\nMüller, Kirill, and Hadley Wickham. 2022. tibble: Simple Data Frames. https://CRAN.R-project.org/package=tibble.\n\n\nMurphy, Heather. 2017. “Why Stanford Researchers Tried to Create a\n‘Gaydar’ Machine.” The New York Times,\nOctober. https://www.nytimes.com/2017/10/09/science/stanford-sexual-orientation-study.html.\n\n\nNational Academies of Sciences, Engineering, and Medicine. 2019.\nReproducibility and Replicability in Science. 1st ed. National\nAcademies Press. https://doi.org/10.17226/25303.\n\n\nNavarro, Danielle. 2022. “Binding Apache\nArrow to R,” January. https://blog.djnavarro.net/posts/2022-01-18%5Fbinding-arrow-to-r/.\n\n\nNavarro, Danielle, Jonathan Keane, and Stephanie Hazlitt. 2022.\n“Larger-Than-Memory Data Workflows with\nApache Arrow,” June. https://arrow-user2022.netlify.app.\n\n\nNelder, John. 1999. “From Statistics to Statistical\nScience.” Journal of the Royal Statistical Society: Series D\n(The Statistician) 48 (2): 257–69. 
https://doi.org/10.1111/1467-9884.00187.\n\n\nNelder, John, and Robert Wedderburn. 1972. “Generalized Linear\nModels.” Journal of the Royal Statistical Society: Series A\n(General) 135 (3): 370–84. https://doi.org/10.2307/2344614.\n\n\nNeufeld, Anna, and Daniela Witten. 2021. “Discussion of Breiman’s\n\"Two Cultures\": From Two Cultures to One.” Observational\nStudies 7 (1): 171–74. https://doi.org/10.1353/obs.2021.0004.\n\n\nNeufeld, Michael. 2002. “Wernher von Braun, the SS, and\nConcentration Camp Labor: Questions of Moral, Political, and Criminal\nResponsibility.” German Studies Review 25 (1): 57–78. https://doi.org/10.2307/1433245.\n\n\nNeuwirth, Erich. 2022. RColorBrewer: ColorBrewer\nPalettes. https://CRAN.R-project.org/package=RColorBrewer.\n\n\nNewman, Daniel. 2014. “Missing Data: Five Practical\nGuidelines.” Organizational Research Methods 17 (4):\n372–411. https://doi.org/10.1177/1094428114548590.\n\n\nNeyman, Jerzy. 1934. “On the Two Different Aspects of the\nRepresentative Method: The Method of Stratified Sampling and the Method\nof Purposive Selection.” Journal of the Royal Statistical\nSociety 97 (4): 558–625. https://doi.org/10.2307/2342192.\n\n\nNix, Justin, and M. James Lozada. 2020. “Police Killings of\nUnarmed Black Americans: A Reassessment of Community Mental Health\nSpillover Effects,” January. https://doi.org/10.31235/osf.io/ajz2q.\n\n\nNobles, Melissa. 2002. “Racial Categorization and\nCensuses.” In Census and Identity: The Politics of Race,\nEthnicity, and Language in National Censuses, edited by David\nKertzer and Dominique Arel, 43–70. Cambridge: Cambridge University\nPress. https://doi.org/10.1017/CBO9780511606045.003.\n\n\nNorthcutt, Curtis, Anish Athalye, and Jonas Mueller. 2021.\n“Pervasive Label Errors in Test Sets Destabilize Machine Learning\nBenchmarks.” In Proceedings of the 35th Conference on Neural\nInformation Processing Systems Track on Datasets and Benchmarks. 
https://doi.org/10.48550/arXiv.2103.14749.\n\n\nObermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil\nMullainathan. 2019. “Dissecting Racial Bias in an Algorithm Used\nto Manage the Health of Populations.” Science 366\n(6464): 447–53. https://doi.org/10.1126/science.aax2342.\n\n\nOberski, Daniel, and Frauke Kreuter. 2020. “Differential Privacy\nand Social Science: An Urgent\nPuzzle.” Harvard Data Science Review 2 (1).\nhttps://doi.org/10.1162/99608f92.63a22079.\n\n\nOECD. 2014. “The Essential Macroeconomic Aggregates.” In\nUnderstanding National Accounts, 13–46. OECD. https://doi.org/10.1787/9789264214637-2-en.\n\n\n———. 2022. Quarterly GDP. https://data.oecd.org/gdp/quarterly-gdp.htm.\n\n\nOoms, Jeroen. 2014. “The jsonlite Package: A\nPractical and Consistent Mapping Between JSON Data and R\nObjects.” arXiv:1403.2805 [Stat.CO]. https://arxiv.org/abs/1403.2805.\n\n\n———. 2022a. openssl: Toolkit for Encryption,\nSignatures and Certificates Based on OpenSSL. https://CRAN.R-project.org/package=openssl.\n\n\n———. 2022b. pdftools: Text Extraction,\nRendering and Converting of PDF Documents. https://CRAN.R-project.org/package=pdftools.\n\n\n———. 2022c. ssh: Secure Shell (SSH) Client for\nR. https://CRAN.R-project.org/package=ssh.\n\n\n———. 2022d. tesseract: Open Source OCR\nEngine. https://CRAN.R-project.org/package=tesseract.\n\n\nOpen Science Collaboration. 2015. “Estimating the Reproducibility\nof Psychological Science.” Science 349 (6251): aac4716.\nhttps://doi.org/10.1126/science.aac4716.\n\n\nOrwell, George. 1946. Politics and the English Language. https://www.orwellfoundation.com/the-orwell-foundation/orwell/essays-and-other-works/politics-and-the-english-language/.\n\n\nOsborne, Jason. 2012. Best Practices in Data\nCleaning: A Complete Guide to Everything You Need to Do Before and After\nCollecting Your Data. SAGE Publications.\n\n\nOsgood, D. Wayne. 2000. 
“Poisson-Based Regression Analysis of\nAggregate Crime Rates.” Journal of Quantitative\nCriminology 16 (1): 21–43. https://doi.org/10.1023/a:1007521427059.\n\n\nPalmer Station Antarctica LTER, and Gorman, Kristen. 2020.\n“Structural Size Measurements and Isotopic Signatures of Foraging\nAmong Adult Male and Female Adélie Penguins (Pygoscelis Adeliae) Nesting\nAlong the Palmer Archipelago Near Palmer Station, 2007-2009.” https://doi.org/10.6073/PASTA/98B16D7D563F265CB52372C8CA99E60F.\n\n\nPasek, Josh. 2015. “Predicting Elections:\nConsidering Tools to Pool the Polls.” Public Opinion\nQuarterly 79 (2): 594–619. https://doi.org/10.1093/poq/nfu060.\n\n\nPatki, Neha, Roy Wedge, and Kalyan Veeramachaneni. 2016. “The\nSynthetic Data Vault.” In 2016 IEEE International Conference\non Data Science and Advanced Analytics (DSAA), 399–410. https://doi.org/10.1109/DSAA.2016.49.\n\n\nPaullada, Amandalynne, Inioluwa Deborah Raji, Emily Bender, Emily\nDenton, and Alex Hanna. 2021. “Data and Its (Dis)contents: A\nSurvey of Dataset Development and Use in Machine Learning\nResearch.” Patterns 2 (11): 100336. https://doi.org/10.1016/j.patter.2021.100336.\n\n\nPavlik, Kaylin. 2019. “Understanding + Classifying Genres Using\nSpotify Audio Features.” https://www.kaylinpavlik.com/classifying-songs-genres/.\n\n\nPedersen, Thomas Lin. 2022. patchwork: The\nComposer of Plots. https://CRAN.R-project.org/package=patchwork.\n\n\nPerepolkin, Dmytro. 2022. polite: Be Nice on\nthe Web. https://CRAN.R-project.org/package=polite.\n\n\nPerkel, Jeffrey. 2021. “Ten Computer Codes That Transformed\nScience.” Nature 589 (7842): 344–48. https://doi.org/10.1038/d41586-021-00075-2.\n\n\n———. 2023. “The Sleight-of-Hand Trick That Can Simplify Scientific\nComputing.” Nature 617 (7959): 212–13. https://doi.org/10.1038/d41586-023-01469-0.\n\n\nPhillips, Alban. 1958. “The Relation Between Unemployment and the\nRate of Change of Money Wage Rates in the United Kingdom,\n1861-1957.” Economica 25 (100): 283–99. 
https://doi.org/10.1111/j.1468-0335.1958.tb00003.x.\n\n\nPiller, Charles. 2022. “Blots on a Field?” Science\n377 (6604): 358–63. https://doi.org/10.1126/science.ade0209.\n\n\nPineau, Joelle, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent\nLarivière, Alina Beygelzimer, Florence d’Alché-Buc, Emily Fox, and Hugo\nLarochelle. 2021. “Improving Reproducibility in Machine Learning\nResearch (a Report from the NeurIPS 2019 Reproducibility\nProgram).” Journal of Machine Learning Research 22\n(164): 1–20. http://jmlr.org/papers/v22/20-303.html.\n\n\nPitman, Jim. 1993. Probability. 1st ed. New York: Springer. https://doi.org/10.1007/978-1-4612-4374-8.\n\n\nPlant, Anne, and Robert Hanisch. 2020. “Reproducibility in\nScience: A Metrology Perspective.” Harvard Data Science\nReview 2 (4). https://doi.org/10.1162/99608f92.eb6ddee4.\n\n\nPodlogar, Tim, Peter Leo, and James Spragg. 2022. “Using VO2max as a marker of training status in\nathletes—Can we do better?” Journal of Applied\nPhysiology 133 (6): 144–47. https://doi.org/10.1152/japplphysiol.00723.2021.\n\n\nPreece, Donald Arthur. 1981. “Distributions of Final Digits in\nData.” The Statistician 30 (1): 31. https://doi.org/10.2307/2987702.\n\n\nPrévost, Jean-Guy, and Jean-Pierre Beaud. 2015. Statistics, Public\nDebate and the State, 1800–1945: A Social, Political and Intellectual\nHistory of Numbers. Routledge.\n\n\nR Core Team. 2023. R: A Language and Environment for Statistical\nComputing. Vienna, Austria: R Foundation for Statistical Computing.\nhttps://www.R-project.org/.\n\n\nR Special Interest Group on Databases (R-SIG-DB), Hadley Wickham, and\nKirill Müller. 2022. DBI: R Database Interface. https://CRAN.R-project.org/package=DBI.\n\n\nRadcliffe, Nicholas. 2023. Test-Driven Data\nAnalysis (Python TDDA library). https://tdda.readthedocs.io/en/latest/index.html.\n\n\nRegister, Yim. 2020a. “Introduction to Sampling and\nRandomization.” YouTube, November. https://youtu.be/U272FFxG8LE.\n\n\n———. 2020b. 
“Data Science Ethics in 6 Minutes.”\nYouTube, December. https://youtu.be/mA4gypAiRYU.\n\n\nRehaag, Sean. 2023. “Supreme Court of Canada Bulk Decisions\nDataset.” Refugee Law Laboratory. https://refugeelab.ca/bulk-data/scc.\n\n\nReid, Nancy. 2003. “Asymptotics and the Theory of\nInference.” The Annals of Statistics 31 (6): 1695–1731.\nhttps://doi.org/10.1214/aos/1074290325.\n\n\nRichardson, Neal, Ian Cook, Nic Crane, Dewey Dunnington, Romain\nFrançois, Jonathan Keane, Dragoș Moldovan-Grünfeld, Jeroen Ooms, and\nApache Arrow. 2023. arrow: Integration to\nApache Arrow. https://CRAN.R-project.org/package=arrow.\n\n\nRiederer, Emily. 2020. “Column Names as Contracts,”\nSeptember. https://emilyriederer.netlify.app/post/column-name-contracts/.\n\n\n———. 2021. “Causal Design Patterns for Data Analysts,”\nJanuary. https://emilyriederer.netlify.app/post/causal-design-patterns/.\n\n\nRiffe, Tim, Enrique Acosta, Enrique José Acosta, Diego Manuel Aburto,\nAnna Alburez-Gutierrez, Ainhoa Altová, Ugofilippo Alustiza, et al. 2021.\n“Data Resource Profile: COVerAGE-DB: A\nGlobal Demographic Database of COVID-19 Cases and\nDeaths.” International Journal of Epidemiology 50 (2):\n390–390f. https://doi.org/10.1093/ije/dyab027.\n\n\nRiley, Richard, Tim Cole, Jon Deeks, Jamie Kirkham, Julie Morris, Rafael\nPerera, Angie Wade, and Gary Collins. 2022. “On the 12th Day of\nChristmas, a Statistician Sent to Me...”\nBMJ, December, e072883. https://doi.org/10.1136/bmj-2022-072883.\n\n\nRilke, Rainer Maria. (1929) 2014. Letters to a Young Poet.\nPenguin Classics.\n\n\nRoberts, Margaret, Brandon Stewart, and Dustin Tingley. 2019.\n“stm: An R Package for\nStructural Topic Models.” Journal of Statistical\nSoftware 91 (2): 1–40. https://doi.org/10.18637/jss.v091.i02.\n\n\nRobinson, David, Alex Hayes, and Simon Couch. 2022. broom: Convert Statistical Objects into Tidy\nTibbles. https://CRAN.R-project.org/package=broom.\n\n\nRobinson, Emily, and Jacqueline Nolis. 2020. Build a Career in Data\nScience. 
Shelter Island: Manning Publications. https://livebook.manning.com/book/build-a-career-in-data-science.\n\n\nRockoff, Hugh. 2019. “On the Controversies Behind the Origins of\nthe Federal Economic Statistics.” Journal of Economic\nPerspectives 33 (1): 147–64. https://doi.org/10.1257/jep.33.1.147.\n\n\nRomer, Paul. 2018. “Jupyter, Mathematica, and the Future of the\nResearch Paper,” April. https://paulromer.net/jupyter-mathematica-and-the-future-of-the-research-paper/.\n\n\nRose, Angela, Rebecca Grais, Denis Coulombier, and Helga Ritter. 2006.\n“A Comparison of Cluster and Systematic Sampling Methods for\nMeasuring Crude Mortality.” Bulletin of the World Health\nOrganization 84: 290–96. https://doi.org/10.2471/blt.05.029181.\n\n\nRosenau, James N. 1999. “A Transformed Observer in a Transforming\nWorld.” Studia Diplomatica 52 (1/2): 5–14. http://www.jstor.org/stable/44838096.\n\n\nRoss, Casey. 2022. “How a Decades-Old Database Became a Hugely\nProfitable Dossier on the Health of 270 Million Americans.”\nStat, February. https://www.statnews.com/2022/02/01/ibm-watson-health-marketscan-data/.\n\n\nRubinstein, Benjamin, and Francesco Alda. 2017. “Pain-Free Random\nDifferential Privacy with Sensitivity Sampling.” In 34th\nInternational Conference on Machine Learning (ICML’2017).\n\n\nRudis, Bob. 2020. hrbrthemes: Additional\nThemes, Theme Components and Utilities for\n“ggplot2”. https://CRAN.R-project.org/package=hrbrthemes.\n\n\nRuggles, Steven, Catherine Fitch, Diana Magnuson, and Jonathan\nSchroeder. 2019. “Differential Privacy and Census Data:\nImplications for Social and Economic Research.” AEA Papers\nand Proceedings 109 (May): 403–8. https://doi.org/10.1257/pandp.20191107.\n\n\nRuggles, Steven, Sarah Flood, Sophia Foster, Ronald Goeken, Jose Pacas,\nMegan Schouweiler, and Matthew Sobek. 2021. “IPUMS USA: Version\n11.0.” Minneapolis, MN: IPUMS. https://doi.org/10.18128/d010.v11.0.\n\n\nRyan, Philip. 2015. “Keeping a Lab Notebook.”\nYouTube, May. 
https://youtu.be/-MAIuaOL64I.\n\n\nSadowski, Caitlin, Emma Söderberg, Luke Church, Michal Sipko, and\nAlberto Bacchelli. 2018. “Modern Code Review: A Case Study at\nGoogle.” In Proceedings of the 40th International Conference\non Software Engineering: Software Engineering in Practice, 181–90.\nICSE-SEIP ’18. New York, NY, USA: Association for Computing Machinery.\nhttps://doi.org/10.1145/3183519.3183525.\n\n\nSakshaug, Joseph, Ting Yan, and Roger Tourangeau. 2010.\n“Nonresponse Error, Measurement Error, and Mode of Data\nCollection: Tradeoffs in a Multi-Mode Survey of Sensitive and\nNon-Sensitive Items.” Public Opinion Quarterly 74 (5):\n907–33. https://doi.org/10.1093/poq/nfq057.\n\n\nSalganik, Matthew. 2018. Bit by Bit: Social Research in the Digital\nAge. New Jersey: Princeton University Press.\n\n\nSalganik, Matthew, Peter Sheridan Dodds, and Duncan Watts. 2006.\n“Experimental Study of Inequality and Unpredictability in an\nArtificial Cultural Market.” Science 311 (5762): 854–56.\nhttps://doi.org/10.1126/science.1121066.\n\n\nSalganik, Matthew, and Douglas Heckathorn. 2004. “Sampling and\nEstimation in Hidden Populations Using Respondent-Driven\nSampling.” Sociological Methodology 34 (1): 193–240. https://doi.org/10.1111/j.0081-1750.2004.00152.x.\n\n\nSambasivan, Nithya, Shivani Kapania, Hannah Highfill, Diana Akrong,\nPraveen Paritosh, and Lora Aroyo. 2021. “‘Everyone Wants to\nDo the Model Work, Not the Data Work’: Data Cascades in\nHigh-Stakes AI.” In Proceedings of the 2021\nCHI Conference on Human Factors in Computing Systems.\nACM. https://doi.org/10.1145/3411764.3445518.\n\n\nSamuel, Arthur. 1959. “Some Studies in Machine Learning Using the\nGame of Checkers.” IBM Journal of Research and\nDevelopment 3 (3): 210–29. https://doi.org/10.1147/rd.33.0210.\n\n\nSaulnier, Lucile, Siddharth Karamcheti, Hugo Laurençon, Léo Tronchon,\nThomas Wang, Victor Sanh, Amanpreet Singh, et al. 2022. 
“Putting\nEthical Principles at the Core of the Research Lifecycle.” https://huggingface.co/blog/ethical-charter-multimodal.\n\n\nSavage, Van, and Pamela Yeh. 2019. “Novelist Cormac\nMcCarthy’s Tips on How to Write a Great Science\nPaper.” Nature 574 (7778): 441–42. https://doi.org/10.1038/d41586-019-02918-5.\n\n\nSchaffner, Brian, Stephen Ansolabehere, and Sam Luks. 2021.\n“Cooperative Election Study Common Content,\n2020.” Harvard Dataverse. https://doi.org/10.7910/DVN/E9N6PH.\n\n\nSchloerke, Barret, and Jeff Allen. 2022. plumber: An API Generator for R. https://CRAN.R-project.org/package=plumber.\n\n\nSchmertmann, Carl. 2022. “UN API Test,” July. https://bonecave.schmert.net/un-api-example.html.\n\n\nSchofield, Alexandra, Måns Magnusson, and David Mimno. 2017.\n“Pulling Out the Stops: Rethinking Stopword Removal for Topic\nModels.” In Proceedings of the 15th Conference of the\nEuropean Chapter of the Association for Computational\nLinguistics: Volume 2, Short Papers, 432–36. Valencia, Spain:\nAssociation for Computational Linguistics. https://aclanthology.org/E17-2069.\n\n\nSchofield, Alexandra, Måns Magnusson, Laure Thompson, and David Mimno.\n2017. “Understanding Text Pre-Processing for Latent Dirichlet\nAllocation.” In ACL Workshop for Women in NLP (WiNLP).\nhttps://www.cs.cornell.edu/~xanda/winlp2017.pdf.\n\n\nSchofield, Alexandra, Laure Thompson, and David Mimno. 2017.\n“Quantifying the Effects of Text Duplication on Semantic\nModels.” In Proceedings of the 2017 Conference on Empirical\nMethods in Natural Language Processing, 2737–47. Copenhagen,\nDenmark: Association for Computational Linguistics. https://doi.org/10.18653/v1/D17-1290.\n\n\nScott, James. 1998. Seeing Like a State. Yale University Press.\n\n\nSekhon, Jasjeet, and Rocío Titiunik. 2017. “Understanding\nRegression Discontinuity Designs as Observational Studies.”\nObservational Studies 3 (2): 174–82. https://doi.org/10.1353/obs.2017.0005.\n\n\nSen, Amartya. 1980. 
“Description as\nChoice.” Oxford Economic Papers 32 (3): 353–69.\nhttps://doi.org/10.1093/oxfordjournals.oep.a041484.\n\n\nShankar, Shreya, Rolando Garcia, Joseph Hellerstein, and Aditya\nParameswaran. 2022. “Operationalizing Machine Learning: An\nInterview Study.” arXiv. https://doi.org/10.48550/ARXIV.2209.09125.\n\n\nSi, Yajuan. 2020. “On the Use of Auxiliary Variables in Multilevel\nRegression and Poststratification.” https://arxiv.org/abs/2011.00360.\n\n\nSides, John, Lynn Vavreck, and Christopher Warshaw. 2021. “The\nEffect of Television Advertising in United States Elections.”\nAmerican Political Science Review, 1–17. https://doi.org/10.1017/s000305542100112x.\n\n\nSilberzahn, Raphael, Eric Uhlmann, Daniel Martin, Pasquale Anselmi,\nFrederik Aust, Eli Awtrey, Štěpán Bahník, et al. 2018. “Many\nAnalysts, One Data Set: Making Transparent How Variations in Analytic\nChoices Affect Results.” Advances in Methods and Practices in\nPsychological Science 1 (3): 337–56. https://doi.org/10.1177/2515245917747646.\n\n\nSilge, Julia, and David Robinson. 2016. “tidytext: Text Mining and Analysis Using Tidy Data\nPrinciples in R.” The Journal of Open Source\nSoftware 1 (3). https://doi.org/10.21105/joss.00037.\n\n\nSilver, Nate. 2020. “We Fixed an Issue with How Our Primary\nForecast Was Calculating Candidates’ Demographic Strengths.”\nFiveThirtyEight, February. https://fivethirtyeight.com/features/we-fixed-a-mistake-in-how-our-primary-forecast-was-calculating-candidates-demographic-strengths/.\n\n\nSimonsohn, Uri. 2013. “Just Post It: The Lesson from Two Cases of\nFabricated Data Detected by Statistics Alone.” Psychological\nScience 24 (10): 1875–88. https://doi.org/10.1177/0956797613480366.\n\n\nSimpkinson, Scott. 1971. “Testing to Ensure\nMission Success.” In What Made Apollo a Success,\nedited by NASA, 21–29.\n\n\nSimpson, Edward. 1951. 
“The Interpretation of Interaction in\nContingency Tables.” Journal of the Royal Statistical\nSociety: Series B (Methodological) 13 (2): 238–41. https://doi.org/10.1111/j.2517-6161.1951.tb00088.x.\n\n\nSmith, Jessie, Saleema Amershi, Solon Barocas, Hanna Wallach, and\nJennifer Wortman Vaughan. 2022. “REAL ML: Recognizing, Exploring,\nand Articulating Limitations of Machine Learning Research.”\n2022 ACM Conference on Fairness, Accountability, and Transparency\n(FAccT ’22). https://doi.org/10.1145/3531146.3533122.\n\n\nSmith, Matthew. 2018. “Should Milk Go in a Cup of Tea First or\nLast?” July. https://yougov.co.uk/topics/consumer/articles-reports/2018/07/30/should-milk-go-cup-tea-first-or-last.\n\n\nSmith, Richard. 2002. “A Statistical Assessment of Buchanan’s Vote\nin Palm Beach County.” Statistical Science 17 (4):\n441–57. https://doi.org/10.1214/ss/1049993203.\n\n\nSobek, Matthew, and Steven Ruggles. 1999. “The IPUMS Project: An\nUpdate.” Historical Methods: A Journal of Quantitative and\nInterdisciplinary History 32 (3): 102–10. https://doi.org/10.1080/01615449909598930.\n\n\nSomers, James. 2015. “Toolkits for the\nMind.” MIT Technology Review, April. https://www.technologyreview.com/2015/04/02/168469/toolkits-for-the-mind/.\n\n\n———. 2017. “Torching the Modern-Day Library of Alexandria.”\nThe Atlantic, April. https://www.theatlantic.com/technology/archive/2017/04/the-tragedy-of-google-books/523320/.\n\n\n———. 2018. “The Scientific Paper Is Obsolete.” The\nAtlantic, April. https://www.theatlantic.com/science/archive/2018/04/the-scientific-paper-is-obsolete/556676/.\n\n\nSpear, Mary Eleanor. 1952. Charting Statistics. https://archive.org/details/ChartingStatistics_201801/.\n\n\nSprint, Gina, and Jason Conci. 2019. “Mining GitHub Classroom\nCommit Behavior in Elective and Introductory Computer Science\nCourses.” Journal of Computing Sciences in Colleges 35\n(1): 76–84.\n\n\nStaicu, Ana-Maria. 2017. 
“Interview with Nancy Reid.”\nInternational Statistical Review 85 (3): 381–403. https://doi.org/10.1111/insr.12237.\n\n\nStaniak, Mateusz, and Przemysław Biecek. 2019. “The Landscape of R Packages for Automated Exploratory\nData Analysis.” The R Journal 11\n(2): 347–69. https://doi.org/10.32614/RJ-2019-033.\n\n\nStantcheva, Stefanie. 2023. “How to Run Surveys: A Guide to\nCreating Your Own Identifying Variation and Revealing the\nInvisible.” Annual Review of Economics 15 (1): 205–34.\nhttps://doi.org/10.1146/annurev-economics-091622-010157.\n\n\nStatistics Canada. 2020. “Sex at Birth and Gender: Technical\nReport on Changes for the 2021 Census.” Statistics Canada. https://www12.statcan.gc.ca/census-recensement/2021/ref/98-20-0002/982000022020002-eng.pdf.\n\n\n———. 2023. “Guide to the Census of Population, 2021.”\nStatistics Canada. https://www12.statcan.gc.ca/census-recensement/2021/ref/98-304/98-304-x2021001-eng.pdf.\n\n\nSteckel, Richard. 1991. “The Quality of Census Data for Historical\nInquiry: A Research Agenda.” Social Science History 15\n(4): 579–99. https://doi.org/10.2307/1171470.\n\n\nSteele, Fiona. 2007. “Multilevel Models for Longitudinal\nData.” Journal of the Royal Statistical Society Series\nA: Statistics in Society 171 (1): 5–19. https://doi.org/10.1111/j.1467-985x.2007.00509.x.\n\n\nSteele, Fiona, Anna Vignoles, and Andrew Jenkins. 2007. “The\nEffect of School Resources on Pupil Attainment: A Multilevel\nSimultaneous Equation Modelling Approach.” Journal of the\nRoyal Statistical Society Series A: Statistics in Society 170 (3):\n801–24. https://doi.org/10.1111/j.1467-985x.2007.00476.x.\n\n\nStevens, Wallace. 1934. The Idea of Order at Key West. https://www.poetryfoundation.org/poems/43431/the-idea-of-order-at-key-west.\n\n\nSteyvers, Mark, and Tom Griffiths. 2006. “Probabilistic Topic\nModels.” In Latent Semantic Analysis: A Road to Meaning,\nedited by T. Landauer, D McNamara, S. Dennis, and W. Kintsch. 
https://cocosci.princeton.edu/tom/papers/SteyversGriffiths.pdf.\n\n\nStigler, Stephen. 1978. “Francis Ysidro Edgeworth,\nStatistician.” Journal of the Royal Statistical\nSociety. Series A (General) 141 (3): 287–322. https://doi.org/10.2307/2344804.\n\n\n———. 1986. The History of Statistics. Massachusetts: Belknap\nHarvard.\n\n\nStock, James, and Francesco Trebbi. 2003. “Retrospectives: Who\nInvented Instrumental Variable Regression?” Journal of\nEconomic Perspectives 17 (3): 177–94. https://doi.org/10.1257/089533003769204416.\n\n\nStolberg, Michael. 2006. “Inventing the Randomized Double-Blind\nTrial: The Nuremberg Salt Test of 1835.” Journal of the Royal\nSociety of Medicine 99 (12): 642–43. https://doi.org/10.1177/014107680609901216.\n\n\nStoler, Ann Laura. 2002. “Colonial Archives and the Arts of\nGovernance.” Archival Science 2 (March): 87–109. https://doi.org/10.1007/bf02435632.\n\n\nStolley, Paul. 1991. “When Genius Errs: R. A. Fisher and the Lung\nCancer Controversy.” American Journal of Epidemiology\n133 (5): 416–25. https://doi.org/10.1093/oxfordjournals.aje.a115904.\n\n\nStommes, Drew, P. M. Aronow, and Fredrik Sävje. 2023. “On the\nReliability of Published Findings Using the Regression Discontinuity\nDesign in Political Science.” Research & Politics 10\n(2). https://doi.org/10.1177/2053168023116645.\n\n\nStudent. 1908. “The Probable Error of a Mean.”\nBiometrika 6 (1): 1–25. https://doi.org/10.2307/2331554.\n\n\nSunstein, Cass, and Lucia Reisch. 2017. The Economics of Nudge.\nRoutledge.\n\n\nSuriyakumar, Vinith, Nicolas Papernot, Anna Goldenberg, and Marzyeh\nGhassemi. 2021. “Chasing Your Long Tails.” In\nProceedings of the 2021 ACM Conference on Fairness,\nAccountability, and Transparency. https://doi.org/10.1145/3442188.3445934.\n\n\nSwain, Larry. 1985. 
“Basic Principles of Questionnaire\nDesign.” Survey Methodology 11 (2): 161–70.\n\n\nSylvester, Christine, Anastasia Ershova, Aleksandra Khokhlova, Nikoleta\nYordanova, and Zachary Greene. 2023. “ParlEE\nplenary speeches V2 data set: Annotated full-text of 15.1 million\nsentence-level plenary speeches of six EU legislative\nchambers.” Harvard Dataverse. https://doi.org/10.7910/DVN/VOPK0E.\n\n\nSzaszi, Barnabas, Anthony Higney, Aaron Charlton, Andrew Gelman, Ignazio\nZiano, Balazs Aczel, Daniel Goldstein, David Yeager, and Elizabeth\nTipton. 2022. “No Reason to Expect Large and Consistent Effects of\nNudge Interventions.” Proceedings of the National Academy of\nSciences 119 (31): e2200732119. https://doi.org/10.1073/pnas.2200732119.\n\n\nTaddy, Matt. 2019. Business Data Science. 1st ed. McGraw Hill.\n\n\nTaflaga, Marija, and Matthew Kerby. 2019. “Who Does What Work in a\nMinisterial Office: Politically Appointed Staff and the Descriptive\nRepresentation of Women in Australian Political Offices,\n1979–2010.” Political Studies 68 (2):\n463–85. https://doi.org/10.1177/0032321719853459.\n\n\nTal, Eran. 2020. “Measurement in\nScience.” In The Stanford Encyclopedia of\nPhilosophy, edited by Edward Zalta, Fall 2020. https://plato.stanford.edu/archives/fall2020/entries/measurement-science/;\nMetaphysics Research Lab, Stanford University.\n\n\nTang, John. 2015. “Pollution havens and the\ntrade in toxic chemicals: Evidence from U.S. trade flows.”\nEcological Economics 112 (April): 150–60. https://doi.org/10.1016/j.ecolecon.2015.02.022.\n\n\nTang, Jun, Aleksandra Korolova, Xiaolong Bai, Xueqiang Wang, and\nXiaofeng Wang. 2017. “Privacy Loss in Apple’s Implementation of\nDifferential Privacy on MacOS 10.12.” arXiv. https://doi.org/10.48550/arXiv.1709.02753.\n\n\nTausanovitch, Chris, and Lynn Vavreck. 2021. “Democracy Fund\n+ UCLA Nationscape Project.” https://www.voterstudygroup.org/data/nationscape.\n\n\nTaylor, Adam. 2015. 
“New Zealand Says No to Jedis.” The\nWashington Post, September. https://www.washingtonpost.com/news/worldviews/wp/2015/09/29/new-zealand-says-no-to-jedis/.\n\n\nTeate, Renée. 2022. SQL for Data Scientists. Wiley.\n\n\nThe Economist. 2013. “Johnson: Those Six Little Rules: George\nOrwell on Writing,” July. https://www.economist.com/prospero/2013/07/29/johnson-those-six-little-rules.\n\n\n———. 2022a. “What Spotify Data Show about the Decline of\nEnglish,” January. https://www.economist.com/interactives/graphic-detail/2022/01/29/what-spotify-data-show-about-the-decline-of-english.\n\n\n———. 2022b. “Will Emmanuel Macron Win a Second Term?”\nApril. https://www.economist.com/interactive/france-2022/forecast.\n\n\n———. 2022c. “France’s Presidential Election: The Second Round in\nDetail,” April. https://www.economist.com/interactive/france-2022/results-round-two.\n\n\nThe Washington Post. 2023. “Fatal Force Database.” https://github.com/washingtonpost/data-police-shootings.\n\n\nThe White House. 2023. “Recommendations on the Best Practices for\nthe Collection of Sexual Orientation and Gender Identity Data on Federal\nStatistical Survey,” January. https://www.whitehouse.gov/wp-content/uploads/2023/01/SOGI-Best-Practices.pdf.\n\n\nThieme, Nick. 2018. “R Generation.” Significance\n15 (4): 14–19. https://doi.org/10.1111/j.1740-9713.2018.01169.x.\n\n\nThistlethwaite, Donald, and Donald Campbell. 1960.\n“Regression-Discontinuity Analysis: An Alternative to the Ex Post\nFacto Experiment.” Journal of Educational Psychology 51\n(6): 309–17. https://doi.org/10.1037/h0044319.\n\n\nThompson, Charlie, Daniel Antal, Josiah Parry, Donal Phipps, and Tom\nWolff. 2022. spotifyr: R Wrapper for the\n“Spotify” Web API. https://CRAN.R-project.org/package=spotifyr.\n\n\nThomson-DeVeaux, Amelia, Laura Bronner, and Damini Sharma. 2021.\n“Cities Spend Millions On Police Misconduct\nEvery Year. Here’s Why It’s So Difficult to Hold Departments\nAccountable.” FiveThirtyEight, February. 
https://fivethirtyeight.com/features/police-misconduct-costs-cities-millions-every-year-but-thats-where-the-accountability-ends/.\n\n\nThornhill, John. 2021. “Lunch with the FT: Mathematician Hannah\nFry.” Financial Times, July. https://www.ft.com/content/a5e33e5a-99b9-4bbc-948f-8a527c7675c3.\n\n\nTierney, Nicholas, Di Cook, Miles McBain, and Colin Fay. 2021. naniar: Data Structures, Summaries, and Visualisations\nfor Missing Data. https://CRAN.R-project.org/package=naniar.\n\n\nTierney, Nicholas, and Karthik Ram. 2020. “A Realistic Guide to\nMaking Data Available Alongside Code to Improve Reproducibility.”\nhttps://arxiv.org/abs/2002.11626.\n\n\n———. 2021. “Common-Sense Approaches to Sharing Tabular Data\nAlongside Publication.” Patterns 2 (12): 100368. https://doi.org/10.1016/j.patter.2021.100368.\n\n\nTimbers, Tiffany. 2020. canlang: Canadian\nCensus language data. https://ttimbers.github.io/canlang/.\n\n\nTimbers, Tiffany, Trevor Campbell, and Melissa Lee. 2022. Data\nScience: A First Introduction. Chapman; Hall/CRC. https://datasciencebook.ca.\n\n\nTolley, Erin, and Mireille Paquet. 2021. “Gender, Municipal Party\nPolitics, and Montreal’s First Woman Mayor.” Canadian Journal\nof Urban Research 30 (1): 40–52. https://cjur.uwinnipeg.ca/index.php/cjur/article/view/323.\n\n\nTourangeau, Roger, Lance Rips, and Kenneth Rasinski. 2000. The\nPsychology of Survey Response. 1st ed. Cambridge University Press.\nhttps://doi.org/10.1017/CBO9780511819322.003.\n\n\nTouvron, Hugo, Thibaut Lavril, Gautier Izacard, Xavier Martinet,\nMarie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, et al. 2023.\n“LLaMA: Open and Efficient Foundation\nLanguage Models.” arXiv. https://doi.org/10.48550/ARXIV.2302.13971.\n\n\nTrisovic, Ana, Matthew Lau, Thomas Pasquier, and Mercè Crosas. 2022.\n“A Large-Scale Study on Research Code Quality and\nExecution.” Scientific Data 9 (1). https://doi.org/10.1038/s41597-022-01143-6.\n\n\nTukey, John. 1962. 
“The Future of Data Analysis.” The\nAnnals of Mathematical Statistics 33 (1): 1–67. https://doi.org/10.1214/aoms/1177704711.\n\n\n———. 1977. Exploratory Data Analysis.\n\n\nTurcotte, Alexi, Aviral Goel, Filip Křikava, and Jan Vitek. 2020.\n“Designing Types for r, Empirically.” Proceedings of\nthe ACM on Programming Languages 4\n(OOPSLA): 1–25. https://doi.org/10.1145/3428249.\n\n\nUN IGME. 2021. “Levels and Trends in Child Mortality,\n2021.” https://childmortality.org/wp-content/uploads/2021/12/UNICEF-2021-Child-Mortality-Report.pdf.\n\n\nUrban, Steve, Rangarajan Sreenivasan, and Vineet Kannan. 2016.\n“It’s All A/Bout Testing: The Netflix\nExperimentation Platform.” Netflix Technology\nBlog, April. https://netflixtechblog.com/its-all-a-bout-testing-the-netflix-experimentation-platform-4e1ca458c15.\n\n\nUshey, Kevin. 2022. renv: Project\nEnvironments. https://CRAN.R-project.org/package=renv.\n\n\nvan Buuren, Stef, and Karin Groothuis-Oudshoorn. 2011. “mice: Multivariate Imputation by Chained Equations in\nR.” Journal of Statistical Software 45 (3): 1–67.\nhttps://doi.org/10.18637/jss.v045.i03.\n\n\nVan den Broeck, Jan, Solveig Argeseanu Cunningham, Roger Eeckels, and\nKobus Herbst. 2005. “Data Cleaning: Detecting, Diagnosing, and\nEditing Data Abnormalities.” PLOS Medicine 2 (10): e267.\nhttps://doi.org/10.1371/journal.pmed.0020267.\n\n\nvan der Loo, Mark. 2022. The Data Validation Cookbook. https://data-cleaning.github.io/validate/.\n\n\nvan der Loo, Mark, and Edwin De Jonge. 2021. “Data Validation Infrastructure for R.”\nJournal of Statistical Software 97 (10): 1–33. https://doi.org/10.18637/jss.v097.i10.\n\n\nVanderplas, Susan, Dianne Cook, and Heike Hofmann. 2020. “Testing\nStatistical Charts: What Makes a Good Graph?” Annual Review\nof Statistics and Its Application 7: 61–88. https://doi.org/10.1146/annurev-statistics-031219-041252.\n\n\nVanhoenacker, Mark. 2015. Skyfaring: A Journey with a Pilot.\n1st ed. Alfred A. 
Knopf.\n\n\nVarin, Cristiano, Nancy Reid, and David Firth. 2011. “An Overview\nof Composite Likelihood Methods.” Statistica Sinica,\n5–42. https://www.jstor.org/stable/24309261.\n\n\nVarner, Maddy, and Aaron Sankin. 2020. “Suckers List: How\nAllstate’s Secret Auto Insurance Algorithm Squeezes Big\nSpenders.” The Markup, February. https://themarkup.org/allstates-algorithm/2020/02/25/car-insurance-suckers-list.\n\n\nVavreck, Lynn, and Chris Tausanovitch. 2021. “Democracy Fund\n+ UCLA Nationscape Project User Guide.” https://www.voterstudygroup.org/data/nationscape.\n\n\nVickers, Andrew, and Emily Vertosick. 2016. “An Empirical Study of\nRace Times in Recreational Endurance Runners.”\nBMC Sports Science, Medicine and Rehabilitation 8\n(1). https://doi.org/10.1186/s13102-016-0052-y.\n\n\nVidoni, Melina. 2021. “Evaluating Unit\nTesting Practices in R Packages.” In 2021 IEEE/ACM\n43rd International Conference on Software Engineering (ICSE),\n1523–34. https://doi.org/10.1109/ICSE43902.2021.00136.\n\n\nvon Bergmann, Jens, Dmitry Shkolnik, and Aaron Jacobs. 2021. cancensus: R package to access, retrieve, and work with\nCanadian Census data and geography. https://mountainmath.github.io/cancensus/.\n\n\nWalby, Kevin, and Alex Luscombe. 2019. Freedom of Information and\nSocial Science Research Design. Routledge.\n\n\nWalker, Kyle. 2022. Analyzing US Census Data. Chapman;\nHall/CRC. https://walker-data.com/census-r/index.html.\n\n\nWalker, Kyle, and Matt Herman. 2022. tidycensus: Load US Census Boundary and Attribute Data as\n“tidyverse” and “sf”-Ready Data\nFrames. https://CRAN.R-project.org/package=tidycensus.\n\n\nWallach, Hanna. 2018. “Computational Social Science ≠ Computer Science + Social Data.”\nCommunications of the ACM 61 (3): 42–44. https://doi.org/10.1145/3132698.\n\n\nWan, Mengting, and Julian J. McAuley. 2018. 
“Item Recommendation\non Monotonic Behavior Chains.” In Proceedings of the 12th\nACM Conference on Recommender Systems, RecSys 2018,\nVancouver, BC, Canada, October 2-7, 2018, edited by Sole Pera,\nMichael D. Ekstrand, Xavier Amatriain, and John O’Donovan, 86–94.\nACM. https://doi.org/10.1145/3240323.3240369.\n\n\nWan, Mengting, Rishabh Misra, Ndapa Nakashole, and Julian J. McAuley.\n2019. “Fine-Grained Spoiler Detection from Large-Scale Review\nCorpora.” In Proceedings of the 57th Conference of the\nAssociation for Computational Linguistics, ACL 2019,\nFlorence, Italy, July 28- August 2, 2019, Volume 1: Long Papers,\nedited by Anna Korhonen, David R. Traum, and Lluı́s Màrquez, 2605–10.\nAssociation for Computational Linguistics. https://doi.org/10.18653/v1/p19-1248.\n\n\nWang, Wei, David Rothschild, Sharad Goel, and Andrew Gelman. 2015.\n“Forecasting Elections with Non-Representative Polls.”\nInternational Journal of Forecasting 31 (3): 980–91. https://doi.org/10.1016/j.ijforecast.2014.06.001.\n\n\nWang, Yilun, and Michal Kosinski. 2018. “Deep Neural Networks Are\nMore Accurate Than Humans at Detecting Sexual Orientation from Facial\nImages.” Journal of Personality and Social Psychology\n114 (2): 246–57. https://doi.org/10.1037/pspa0000098.\n\n\nWardrop, Robert. 1995. “Simpson’s Paradox and the Hot Hand in\nBasketball.” The American Statistician 49 (1): 24–28. https://doi.org/10.2307/2684806.\n\n\nWare, James. 1989. “Investigating Therapies of Potentially Great\nBenefit: ECMO.” Statistical Science 4 (4): 298–306. https://doi.org/10.1214/ss/1177012384.\n\n\nWasserman, Larry. 2005. All of Statistics. Springer.\n\n\nWei, LJ, and S Durham. 1978. “The Randomized Play-the-Winner Rule\nin Medical Trials.” Journal of the American Statistical\nAssociation 73 (364): 840–43. https://doi.org/10.2307/2286290.\n\n\nWeinberg, Gerald. 1971. 
The Psychology of Computer Programming.\nNew York: Van Nostrand Reinhold Company.\n\n\nWeissgerber, Tracey, Natasa Milic, Stacey Winham, and Vesna Garovic.\n2015. “Beyond Bar and Line Graphs: Time for a New Data\nPresentation Paradigm.” PLoS Biology 13 (4): e1002128.\nhttps://doi.org/10.1371/journal.pbio.1002128.\n\n\nWhitby, Andrew. 2020. The Sum of the\nPeople. New York: Basic Books.\n\n\nWhitelaw, James. 1805. An Essay on the Population of Dublin. Being\nthe Result of an Actual Survey Taken in 1798, with Great Care and\nPrecision, and Arranged in a Manner Entirely New. Graisberry;\nCampbell.\n\n\nWicherts, Jelte, Marjan Bakker, and Dylan Molenaar. 2011.\n“Willingness to Share Research Data Is Related to the Strength of\nthe Evidence and the Quality of Reporting of Statistical\nResults.” PLOS ONE 6 (11): e26828. https://doi.org/10.1371/journal.pone.0026828.\n\n\nWickham, Hadley. 2009. “Manipulating Data.” In ggplot2, 157–75. Springer New York. https://doi.org/10.1007/978-0-387-98141-3_9.\n\n\n———. 2011. “testthat: Get Started with\nTesting.” The R Journal 3: 5–10. https://journal.r-project.org/archive/2011-1/RJournal%5F2011-1%5FWickham.pdf.\n\n\n———. 2014. “Tidy Data.” Journal of Statistical\nSoftware 59 (1): 1–23. https://doi.org/10.18637/jss.v059.i10.\n\n\n———. 2016. ggplot2: Elegant Graphics for Data\nAnalysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.\n\n\n———. 2017. tidyverse: Easily Install and Load\nthe “Tidyverse”. https://CRAN.R-project.org/package=tidyverse.\n\n\n———. 2018. “Whole Game.” YouTube, January. https://youtu.be/go5Au01Jrvs.\n\n\n———. 2019. Advanced R. 2nd ed. Chapman; Hall/CRC.\nhttps://adv-r.hadley.nz.\n\n\n———. 2020. Tidyverse. https://www.tidyverse.org/.\n\n\n———. 2021a. babynames: US Baby Names\n1880-2017. https://CRAN.R-project.org/package=babynames.\n\n\n———. 2021b. Mastering Shiny. 1st ed. O’Reilly Media. https://mastering-shiny.org.\n\n\n———. 2021c. The Tidyverse Style Guide. 
https://style.tidyverse.org/index.html.\n\n\n———. 2022a. R Packages. 2nd ed. O’Reilly Media. https://r-pkgs.org.\n\n\n———. 2022b. rvest: Easily Harvest (Scrape) Web\nPages. https://CRAN.R-project.org/package=rvest.\n\n\n———. 2022c. stringr: Simple, Consistent\nWrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.\n\n\n———. 2023a. forcats: Tools for Working with\nCategorical Variables (Factors). https://CRAN.R-project.org/package=forcats.\n\n\n———. 2023b. httr: Tools for Working with URLs\nand HTTP. https://CRAN.R-project.org/package=httr.\n\n\nWickham, Hadley, Mara Averick, Jenny Bryan, Winston Chang, Lucy\nD’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019.\n“Welcome to the Tidyverse.”\nJournal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.\n\n\nWickham, Hadley, and Jennifer Bryan. 2023. readxl: Read Excel Files. https://CRAN.R-project.org/package=readxl.\n\n\nWickham, Hadley, Jennifer Bryan, and Malcolm Barrett. 2022. usethis: Automate Package and Project Setup.\nhttps://CRAN.R-project.org/package=usethis.\n\n\nWickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. (2016)\n2023. R for Data Science. 2nd ed. O’Reilly Media. https://r4ds.hadley.nz.\n\n\nWickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2022.\ndplyr: A Grammar of Data\nManipulation. https://CRAN.R-project.org/package=dplyr.\n\n\nWickham, Hadley, Maximilian Girlich, and Edgar Ruiz. 2022. dbplyr: A “dplyr” Back End for\nDatabases. https://CRAN.R-project.org/package=dbplyr.\n\n\nWickham, Hadley, and Lionel Henry. 2022. purrr:\nFunctional Programming Tools. https://CRAN.R-project.org/package=purrr.\n\n\nWickham, Hadley, Jim Hester, and Jenny Bryan. 2022. readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.\n\n\nWickham, Hadley, Jim Hester, Winston Chang, and Jenny Bryan. 2022.\ndevtools: Tools to Make Developing R Packages\nEasier. 
https://CRAN.R-project.org/package=devtools.\n\n\nWickham, Hadley, Jim Hester, and Jeroen Ooms. 2021. xml2: Parse XML. https://CRAN.R-project.org/package=xml2.\n\n\nWickham, Hadley, Evan Miller, and Danny Smith. 2023. haven: Import and Export “SPSS”\n“Stata” and “SAS” Files. https://CRAN.R-project.org/package=haven.\n\n\nWickham, Hadley, and Dana Seidel. 2022. scales:\nScale Functions for Visualization. https://CRAN.R-project.org/package=scales.\n\n\nWickham, Hadley, and Lisa Stryjewski. 2011. “40 Years of\nBoxplots,” November. https://vita.had.co.nz/papers/boxplots.pdf.\n\n\nWickham, Hadley, Davis Vaughan, and Maximilian Girlich. 2023. tidyr: Tidy Messy Data. https://CRAN.R-project.org/package=tidyr.\n\n\nWiessner, Polly. 2014. “Embers of Society: Firelight Talk Among\nthe Ju/’hoansi Bushmen.” Proceedings of the National Academy\nof Sciences 111 (39): 14027–35. https://doi.org/10.1073/pnas.1404212111.\n\n\nWilde, Oscar. 1891. The Picture of Dorian Gray. https://www.gutenberg.org/files/174/174-h/174-h.htm.\n\n\nWilford, John Noble. 1977. “Wernher von Braun, Rocket Pioneer,\nDies.” The New York Times, June. https://www.nytimes.com/1977/06/18/archives/wernher-von-braun-rocket-pioneer-dies-wernher-von-braun-pioneer-in.html.\n\n\nWilkinson, Leland. 2005. The Grammar of Graphics. 2nd ed.\nSpringer.\n\n\nWilkinson, Mark, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle\nAppleton, Myles Axton, Arie Baak, Niklas Blomberg, et al. 2016.\n“The FAIR Guiding Principles for Scientific Data Management and\nStewardship.” Scientific Data 3 (1): 1–9. https://doi.org/10.1038/sdata.2016.18.\n\n\nWilson, Greg, Jenny Bryan, Karen Cranston, Justin Kitzes, Lex\nNederbragt, and Tracy Teal. 2017. “Good Enough Practices in\nScientific Computing.” PLOS Computational Biology 13\n(6): 1–20. https://doi.org/10.1371/journal.pcbi.1005510.\n\n\nWong, Julia Carrie. 2020. “One Year Inside Trump’s Monumental\nFacebook Campaign.” The Guardian, January. 
https://www.theguardian.com/us-news/2020/jan/28/donald-trump-facebook-ad-campaign-2020-election.\n\n\nWood, Simon. 2015. Core Statistics. Cambridge University Press.\nhttps://www.maths.ed.ac.uk/%7Eswood34/core-statistics.pdf.\n\n\nWorld Health Organization. 2019. “Trends in Maternal Mortality\n2000 to 2017: Estimates by WHO, UNICEF, UNFPA, World Bank Group and the\nUnited Nations Population Division.” https://apps.who.int/iris/handle/10665/327596.\n\n\nWright, Philip. 1928. The Tariff on Animal and Vegetable Oils.\nNew York: Macmillan Company.\n\n\nWu, Changbao, and Mary Thompson. 2020. Sampling Theory and\nPractice. Springer.\n\n\nXie, Yihui. 2019. “TinyTeX: A lightweight,\ncross-platform, and easy-to-maintain LaTeX distribution based on TeX\nLive.” TUGboat, no. 1: 30–32. https://tug.org/TUGboat/Contents/contents40-1.html.\n\n\n———. 2023. knitr: A General-Purpose Package for\nDynamic Report Generation in R. https://yihui.org/knitr/.\n\n\nXu, Ya. 2020. “Causal Inference Challenges in Industry: A\nPerspective from Experiences at LinkedIn.” YouTube,\nJuly. https://youtu.be/OoKsLAvyIYA.\n\n\nYoshioka, Alan. 1998. “Use of Randomisation in the Medical\nResearch Council’s Clinical Trial of Streptomycin in Pulmonary\nTuberculosis in the 1940s.” BMJ 317 (7167): 1220–23. https://doi.org/10.1136/bmj.317.7167.1220.\n\n\nZhang, Ping, XunPeng Shi, YongPing Sun, Jingbo Cui, and Shuai Shao.\n2019. “Have China’s provinces achieved their\ntargets of energy intensity reduction? Reassessment based on nighttime\nlighting data.” Energy Policy 128 (May): 276–83.\nhttps://doi.org/10.1016/j.enpol.2019.01.014.\n\n\nZhang, Susan, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen,\nShuohui Chen, Christopher Dewan, et al. 2022. “OPT: Open\nPre-Trained Transformer Language Models.” arXiv. https://doi.org/10.48550/arXiv.2205.01068.\n\n\nZimmer, Michael. 2018. 
“Addressing Conceptual Gaps in Big Data\nResearch Ethics: An Application of Contextual Integrity.”\nSocial Media + Society 4 (2): 1–11. https://doi.org/10.1177/2056305118768300.\n\n\nZinsser, William. 1976. On Writing Well. New York:\nHarperCollins.\n\n\nZook, Matthew, Solon Barocas, danah boyd, Kate Crawford, Emily Keller,\nSeeta Peña Gangadharan, Alyssa Goodman, et al. 2017. “Ten Simple\nRules for Responsible Big Data Research.” PLOS Computational\nBiology 13 (3): e1005399. https://doi.org/10.1371/journal.pcbi.1005399.",
"crumbs": [
"Appendices",
"References"
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 58a5a532..4c05887e 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -6,11 +6,11 @@
https://tellingstorieswithdata.com/00-errata.html
- 2024-10-29T00:37:38.634Z
+ 2024-10-30T01:50:54.477Z
https://tellingstorieswithdata.com/01-introduction.html
- 2024-10-14T16:29:23.129Z
+ 2024-10-30T01:42:17.265Z
https://tellingstorieswithdata.com/02-drinking_from_a_fire_hose.html
@@ -18,7 +18,7 @@
https://tellingstorieswithdata.com/03-workflow.html
- 2024-10-22T18:51:09.688Z
+ 2024-10-30T02:09:58.669Z
https://tellingstorieswithdata.com/04-writing_research.html
@@ -30,7 +30,7 @@
https://tellingstorieswithdata.com/06-farm.html
- 2024-10-29T00:37:24.426Z
+ 2024-10-30T00:30:07.169Z
https://tellingstorieswithdata.com/07-gather.html
@@ -110,7 +110,7 @@
https://tellingstorieswithdata.com/26-sql.html
- 2024-10-22T12:58:27.051Z
+ 2024-10-30T00:26:25.761Z
https://tellingstorieswithdata.com/28-deploy.html
diff --git a/inputs/data/cancensus/CM_data_cd8148f91ef9529ea3c54ed1dc68fee4.rda.meta b/inputs/data/cancensus/CM_data_cd8148f91ef9529ea3c54ed1dc68fee4.rda.meta
index a17c0223..a4aa7a8a 100644
Binary files a/inputs/data/cancensus/CM_data_cd8148f91ef9529ea3c54ed1dc68fee4.rda.meta and b/inputs/data/cancensus/CM_data_cd8148f91ef9529ea3c54ed1dc68fee4.rda.meta differ
diff --git a/references.bib b/references.bib
index 5062b822..a6fd928c 100644
--- a/references.bib
+++ b/references.bib
@@ -1,3 +1,12 @@
+@book{nationalacademies,
+ title = {Reproducibility and Replicability in Science},
+ author = {{National Academies of Sciences, Engineering, and Medicine}},
+ DOI = {10.17226/25303},
+ publisher = {National Academies Press},
+ year = {2019},
+ month = sep,
+ edition = 1
+}
@article{Davies2024,
title = {The importance of family-based sampling for biobanks},
volume = {634},