diff --git a/00-errata.qmd b/00-errata.qmd
index cccf3974..25ec6ba9 100644
--- a/00-errata.qmd
+++ b/00-errata.qmd
@@ -10,7 +10,7 @@ Chapman and Hall/CRC published this book in July 2023. You can purchase that [he
This online version has some updates to what was printed. An online version that matches the print version is available [here](https://rohanalexander.github.io/telling_stories-published/).
:::
-*Last updated: 12 October 2024.*
+*Last updated: 13 October 2024.*
The book was reviewed by Piotr Fryzlewicz in *The American Statistician* [@Fryzlewicz2024] and Nick Cox on [Amazon](https://www.amazon.com/gp/customer-reviews/R3S602G9RUDOF/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&ASIN=1032134771). I am grateful they gave such a lot of their time to provide the review, as well as their corrections and suggestions.
diff --git a/01-introduction.qmd b/01-introduction.qmd
index 15c519b6..2ad02bee 100644
--- a/01-introduction.qmd
+++ b/01-introduction.qmd
@@ -4,6 +4,10 @@ engine: knitr
# Telling stories with data {#sec-introduction}
+::: {.callout-note}
+Chapman and Hall/CRC published this book in July 2023. You can purchase that [here](https://www.routledge.com/Telling-Stories-with-Data-With-Applications-in-R/Alexander/p/book/9781032134772). This online version has some updates to what was printed.
+:::
+
**Prerequisites**
- Read *Counting the Countless*, [@keyes2019]
@@ -194,90 +198,94 @@ Ultimately, we are all just telling stories with data, but these stories are inc
### Quiz {.unnumbered}
1. What is data science (in your own words)?
-2. Based on @register2020, data decisions impact (pick one)?
+2. From @register2020, data decisions impact (pick one)?
a. Real people.
b. No one.
c. Those in the training set.
d. Those in the test set.
-3. Based on @keyes2019, what is data science (pick one)?
- a. The inhumane reduction of humanity down to what can be counted.
+3. From @keyes2019, what is data science (pick one)?
+ a. Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from many structured and unstructured data.
b. The quantitative analysis of large amounts of data for the purpose of decision-making.
- c. Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from many structured and unstructured data.
-4. Based on @keyes2019, what is one consequence of data systems that require standardized categories?
- a. Improved user experience
- b. Enhanced security measures
- c. Erasure of individual identities and experiences
- d. Increased innovation in technology
-5. Based on @kieranskitchen, what criticism about working with quantitative data is addressed?
+ c. The inhumane reduction of humanity down to what can be counted.
+4. From @keyes2019, what is one consequence of data systems that require standardized categories (pick one)?
+ a. Worse user experience.
+ b. Compromised security measures.
+ c. Increased innovation in technology.
+ d. Erasure of individual identities and experiences.
+5. From @kieranskitchen, what is a common criticism of working with data (pick one)?
a. That it is too time-consuming and inefficient.
- b. That it requires expensive equipment.
- c. That it distances one from the reality of human lives behind the numbers.
- d. That it is only suitable for scientists.
-6. Based on @kieranskitchen, what response to the criticism that quantitative data inures people to human realities is made?
- a. We should stop analyzing data.
- b. Working with data forces a confrontation with questions of meaning.
+ b. That it distances one from the reality of human lives behind the numbers.
+ c. That it requires expensive software and extensive training to analyze.
+6. From @kieranskitchen, what is a response to that criticism (pick one)?
+ a. Working with data forces a confrontation with questions of meaning.
+ b. Data analysis should not be done.
c. Data should only be analyzed by automated processes.
d. Qualitative approaches should be the predominant approach.
7. How can you reconcile @keyes2019 and @kieranskitchen?
-8. Why is ethics a key element of *Telling Stories with Data* (pick one)?
+8. Why is ethics a key element of data science (pick one)?
a. Because data science always involves sensitive personal information.
- b. Because datasets likely concern humans and require careful consideration of their context.
- c. Because ethical considerations make the analysis more complex.
+ b. Because ethical considerations make the analysis easier to do.
+ c. Because datasets likely concern humans and require consideration of context.
d. Because regulations require ethics approval for any data analysis.
9. According to @crawford, as described in this chapter, which of the following forces shape our world, and hence our data (select all that apply)?
- a. Political.
- b. Historical.
- c. Cultural.
- d. Social.
-10. Consider the results of a survey that asked about gender. It finds the following counts: "man: 879", "woman: 912", "non-binary: 10" "prefer not to say: 3", and "other: 1". What is the appropriate way to consider "prefer not to say" (pick one)?
+ a. Political.
+ b. Physical.
+ c. Historical.
+ d. Cultural.
+ e. Social.
+10. From @nottomford, what is a compiler (pick one)?
+ a. Software that takes the symbols you typed into a file and transforms them into lower-level instructions.
+ b. A sequence of symbols (using typical keyboard characters, saved to a file of some kind) that someone typed in, or copied, or pasted from elsewhere.
+ c. A clock with benefits.
+ d. Putting holes in punch cards, then into a box, then loading them, then the computer flips through the cards, identifies where the holes were, and updates parts of its memory.
+11. Consider the results of a survey that asked about gender. It finds the following counts: "man: 879", "woman: 912", "non-binary: 10", "prefer not to say: 3", and "other: 1". What is the appropriate way to consider "prefer not to say" (pick one)?
a. Drop them.
- b. Merge it into "other".
+ b. It depends.
c. Include them.
- d. It depends.
-11. Imagine that you have a job in which including race and/or sexuality as predictors improves the performance of your model. When deciding whether to include these in your analysis, what factors would you consider (in your own words)?
-12. In *Telling Stories with Data* what is meant by reproducibility in data science (pick one)?
+ d. Merge it into "other".
+12. Imagine that you have a job in which including race and/or sexuality as predictors improves the performance of your model. When deciding whether to include these in your analysis, what factors would you consider (in your own words)?
+13. What is meant by reproducibility in data science (pick one)?
a. Being able to produce similar results with different datasets.
b. Ensuring that all steps of the analysis can be independently redone by others.
c. Publishing results in peer-reviewed journals.
d. Using proprietary software to protect data.
-13. What challenge is associated with the measurement and data collection stage (pick one)?
+14. What is a challenge associated with measurement (pick one)?
a. It is usually straightforward and requires little attention.
- b. Measurements are always accurate and consistent over time.
- c. Data collection is entirely objective and free from bias.
- d. Deciding what to measure and how to measure it is complex and context-dependent.
-14. In the analogy to the sculptor, what does the act of sculpting represent in the data workflow (pick one)?
+ b. Deciding what and how to measure is complex and context-dependent.
+ c. Data collection is objective and free from bias.
+ d. Measurements are always accurate and consistent over time.
+15. In the analogy to the sculptor, what does the act of sculpting represent in the data workflow (pick one)?
a. Creating complex models to fit the data.
b. Acquiring raw data.
c. Cleaning and preparing the data to reveal the needed dataset.
- d. Visualizing the final results.
-15. Why is exploratory data analysis (EDA) considered an open-ended process (pick one)?
+ d. Visualizing the results.
+16. Why is exploratory data analysis (EDA) an open-ended process (pick one)?
a. Because it has a fixed set of steps to follow.
- b. Because it involves testing hypotheses in a structured way.
- c. Because it requires ongoing iteration to understand the data's shape and patterns.
- d. Because it can be automated with modern software.
-16. Why should statistical models be used carefully (pick one)?
+ b. Because it requires ongoing iteration to understand the data's shape and patterns.
+ c. Because it involves testing hypotheses in a structured way.
+ d. Because it can be automated.
+17. Why should statistical models be used carefully (pick one)?
a. Because they always provide definitive results.
b. Because they can reflect the decisions made in earlier stages.
c. Because they are too complicated for most audiences.
d. Because they are unnecessary if the data are well-presented.
-17. What is one of the key messages from measuring height (pick one)?
+18. What is one lesson from thinking about the challenges of measuring height (pick one)?
a. Height is a straightforward measurement with little variability.
- b. All measurements are accurate if done with modern tools.
+ b. All measurements are accurate if done with the right instrument.
c. Even simple measurements can have complexities that affect data quality.
d. Height is not a useful variable in data analysis.
-18. What is the danger of not considering who is missing from a dataset (pick one)?
+19. What is the danger of not considering who is missing from a dataset (pick one)?
a. It has no significant impact on the analysis.
- b. It can lead to conclusions that do not represent the full context.
- c. It simplifies the analysis by reducing the amount of data.
-19. What is the primary purpose of statistical modeling (pick one)?
- a. To prove hypotheses.
- b. As a tool to help explore and understand the data.
+ b. It simplifies the analysis by reducing the amount of data.
+ c. It can lead to conclusions that do not represent the full context.
+20. What is a purpose of statistical modeling (pick one)?
+ a. As a tool to help explore and understand the data.
+ b. To prove hypotheses.
c. To replace exploratory data analysis.
-20. What is meant by "our data are a simplification of the messy, complex world" (pick one)?
+21. What is meant by "our data are a simplification of the messy, complex world" (pick one)?
a. Data perfectly capture all aspects of reality.
- b. Data are always inaccurate and useless.
- c. Data simplify reality to make analysis possible, but they cannot capture every detail.
-
+ b. Data simplify reality to make analysis possible, but they cannot capture every detail.
+ c. Data are always inaccurate and useless.
### Activity {.unnumbered}
diff --git a/05-static_communication.qmd b/05-static_communication.qmd
index 40536683..8cef031f 100644
--- a/05-static_communication.qmd
+++ b/05-static_communication.qmd
@@ -1634,7 +1634,7 @@ beps |>
b. `geom_bar()`
c. `geom_col()`
d. `geom_density()`
-23. When creating a histogram, adjusting the number of bins or binwidth affects:
+23. What does adjusting the number of bins, or changing the binwidth, affect in a histogram (pick one)?
a. The overall shape of the underlying data distribution.
b. The size of the data points.
c. The colors used in the plot.
@@ -1644,21 +1644,21 @@ beps |>
b. They hide the underlying distribution of the data.
c. They take too long to compute.
d. They cannot show outliers.
-25. Which of the following is a recommended way to enhance a boxplot to show more information about the data (pick one)?
+25. How can you address that disadvantage (pick one)?
a. Increase the box width.
b. Overlay the actual data points using `geom_jitter()`.
- c. Change the box color to gradient.
+ c. Add colors for each category.
d. Remove the whiskers from the boxplot.
-26. In the context of `ggplot2`, what does `stat_ecdf()` compute (pick one)?
+26. What does `stat_ecdf()` compute (pick one)?
a. A histogram.
b. A scatterplot with error bars.
c. A boxplot.
d. A cumulative distribution function.
-27. What is geocoding in the context of mapping data (pick one)?
- a. The process of converting latitude and longitude into place names.
- b. The process of selecting a map projection.
- c. The process of drawing map boundaries.
- d. The process of converting place names into latitude and longitude coordinates.
+27. What is geocoding (pick one)?
+ a. Converting latitude and longitude into place names.
+ b. Picking a map projection.
+ c. Drawing map boundaries.
+ d. Converting place names into latitude and longitude.
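+
+To make the histogram question concrete, here is a minimal sketch using simulated data (the `sim_data` tibble is invented for illustration) of how `binwidth` changes how much detail a histogram shows:
+
+```{r}
+library(tidyverse)
+
+set.seed(853)
+sim_data <- tibble(value = rnorm(n = 1000))
+
+# A narrow binwidth shows more structure; a wide one smooths it away
+sim_data |>
+  ggplot(aes(x = value)) +
+  geom_histogram(binwidth = 0.1)
+```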
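+
+Similarly, a sketch (again with simulated data) of overlaying the raw observations on a boxplot with `geom_jitter()`:
+
+```{r}
+library(tidyverse)
+
+set.seed(853)
+sim_data <-
+  tibble(
+    group = rep(c("A", "B"), each = 100),
+    value = rnorm(n = 200)
+  )
+
+# The boxplot summarizes each group; the jittered points show the raw data
+sim_data |>
+  ggplot(aes(x = group, y = value)) +
+  geom_boxplot() +
+  geom_jitter(alpha = 0.3, width = 0.2)
+```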
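+
+And a sketch of the empirical cumulative distribution function that `stat_ecdf()` computes:
+
+```{r}
+library(tidyverse)
+
+set.seed(853)
+sim_data <- tibble(value = rnorm(n = 500))
+
+# stat_ecdf() plots the proportion of observations at or below each value
+sim_data |>
+  ggplot(aes(x = value)) +
+  stat_ecdf()
+```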
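+
+Finally, one way to geocode place names is with the tidygeocoder package (using that package here is an assumption for illustration, and its free "osm" (Nominatim) backend requires an internet connection):
+
+```{r}
+library(tidyverse)
+library(tidygeocoder)
+
+# Geocoding: place names in, latitude and longitude out
+place_names <- tibble(address = c("Sydney Opera House", "CN Tower, Toronto"))
+
+place_names |>
+  geocode(address = address, method = "osm")
+```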
### Activity {.unnumbered}
diff --git a/docs/00-errata.html b/docs/00-errata.html
index 1541cd6a..e71e4e61 100644
--- a/docs/00-errata.html
+++ b/docs/00-errata.html
@@ -478,7 +478,7 @@
Errors and updates
-Last updated: 12 October 2024.
+Last updated: 13 October 2024.
The book was reviewed by Piotr Fryzlewicz in The American Statistician (Fryzlewicz 2024 ) and Nick Cox on Amazon . I am grateful they gave such a lot of their time to provide the review, as well as their corrections and suggestions.
Since the publication of this book in July 2023, there have been a variety of changes in the world. The rise of generative AI has changed the way that people code, Python has become easier to integrate alongside R because of Quarto, and packages continue to update (not to mention a new cohort of students has started going through the book). One advantage of having an online version is that I can make improvements.
I am grateful for the corrections and suggestions of: Andrew Black, Clay Ford, Crystal Lewis, David Jankoski, Donna Mulkern, Emi Tanaka, Emily Su, Inessa De Angelis, James Wade, Julia Kim, Krishiv Jain, Seamus Ross, Tino Kanngiesser, and Zak Varty.
diff --git a/docs/01-introduction.html b/docs/01-introduction.html
index 70b61504..9de0c6e8 100644
--- a/docs/01-introduction.html
+++ b/docs/01-introduction.html
@@ -471,6 +471,16 @@
+
+
+
+
+
+
+
Chapman and Hall/CRC published this book in July 2023. You can purchase that here . This online version has some updates to what was printed.
+
+
+
Prerequisites
Read Counting the Countless , (Keyes 2019 )
@@ -653,90 +663,97 @@ Quiz
What is data science (in your own words)?
-Based on Register (2020 ) , data decisions impact (pick one)?
+ From Register (2020 ) , data decisions impact (pick one)?
Real people.
No one.
Those in the training set.
Those in the test set.
-Based on Keyes (2019 ) , what is data science (pick one)?
+ From Keyes (2019 ) , what is data science (pick one)?
-The inhumane reduction of humanity down to what can be counted.
-The quantitative analysis of large amounts of data for the purpose of decision-making.
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from many structured and unstructured data.
+The quantitative analysis of large amounts of data for the purpose of decision-making.
+The inhumane reduction of humanity down to what can be counted.
-Based on Keyes (2019 ) , what is one consequence of data systems that require standardized categories?
+ From Keyes (2019 ) , what is one consequence of data systems that require standardized categories (pick one)?
-Improved user experience
-Enhanced security measures
-Erasure of individual identities and experiences
-Increased innovation in technology
+Worse user experience.
+Compromised security measures.
+Increased innovation in technology.
+Erasure of individual identities and experiences.
-Based on Healy (2020 ) , what criticism about working with quantitative data is addressed?
+ From Healy (2020 ) , what is a common criticism of working with data (pick one)?
That it is too time-consuming and inefficient.
-That it requires expensive equipment.
That it distances one from the reality of human lives behind the numbers.
-That it is only suitable for scientists.
+That it requires expensive software and extensive training to analyze.
-Based on Healy (2020 ) , what response to the criticism that quantitative data inures people to human realities is made?
+ From Healy (2020 ) , what is a response to that criticism (pick one)?
-We should stop analyzing data.
Working with data forces a confrontation with questions of meaning.
+Data analysis should not be done.
Data should only be analyzed by automated processes.
Qualitative approaches should be the predominant approach.
How can you reconcile Keyes (2019 ) and Healy (2020 ) ?
-Why is ethics a key element of Telling Stories with Data (pick one)?
+ Why is ethics a key element of data science (pick one)?
Because data science always involves sensitive personal information.
-Because datasets likely concern humans and require careful consideration of their context.
-Because ethical considerations make the analysis more complex.
+Because ethical considerations make the analysis easier to do.
+Because datasets likely concern humans and require consideration of context.
Because regulations require ethics approval for any data analysis.
According to Crawford (2021 ) , as described in this chapter, which of the following forces shape our world, and hence our data (select all that apply)?
Political.
+Physical.
Historical.
Cultural.
Social.
+From Ford (2015 ) , what is a compiler (pick one)?
+
+Software that takes the symbols you typed into a file and transforms them into lower-level instructions.
+A sequence of symbols (using typical keyboard characters, saved to a file of some kind) that someone typed in, or copied, or pasted from elsewhere.
+A clock with benefits.
+Putting holes in punch cards, then into a box, then loading them, then the computer flips through the cards, identifies where the holes were, and updates parts of its memory.
+
Consider the results of a survey that asked about gender. It finds the following counts: “man: 879”, “woman: 912”, “non-binary: 10”, “prefer not to say: 3”, and “other: 1”. What is the appropriate way to consider “prefer not to say” (pick one)?
Drop them.
-Merge it into “other”.
-Include them.
It depends.
+Include them.
+Merge it into “other”.
Imagine that you have a job in which including race and/or sexuality as predictors improves the performance of your model. When deciding whether to include these in your analysis, what factors would you consider (in your own words)?
-In Telling Stories with Data what is meant by reproducibility in data science (pick one)?
+ What is meant by reproducibility in data science (pick one)?
Being able to produce similar results with different datasets.
Ensuring that all steps of the analysis can be independently redone by others.
Publishing results in peer-reviewed journals.
Using proprietary software to protect data.
-What challenge is associated with the measurement and data collection stage (pick one)?
+ What is a challenge associated with measurement (pick one)?
It is usually straightforward and requires little attention.
+Deciding what and how to measure is complex and context-dependent.
+Data collection is objective and free from bias.
Measurements are always accurate and consistent over time.
-Data collection is entirely objective and free from bias.
-Deciding what to measure and how to measure it is complex and context-dependent.
In the analogy to the sculptor, what does the act of sculpting represent in the data workflow (pick one)?
Creating complex models to fit the data.
Acquiring raw data.
Cleaning and preparing the data to reveal the needed dataset.
-Visualizing the final results.
+Visualizing the results.
-Why is exploratory data analysis (EDA) considered an open-ended process (pick one)?
+ Why is exploratory data analysis (EDA) an open-ended process (pick one)?
Because it has a fixed set of steps to follow.
-Because it involves testing hypotheses in a structured way.
Because it requires ongoing iteration to understand the data’s shape and patterns.
-Because it can be automated with modern software.
+Because it involves testing hypotheses in a structured way.
+Because it can be automated.
Why should statistical models be used carefully (pick one)?
@@ -745,30 +762,30 @@ Quiz
Because they are too complicated for most audiences.
Because they are unnecessary if the data are well-presented.
-What is one of the key messages from measuring height (pick one)?
+ What is one lesson from thinking about the challenges of measuring height (pick one)?
Height is a straightforward measurement with little variability.
-All measurements are accurate if done with modern tools.
+All measurements are accurate if done with the right instrument.
Even simple measurements can have complexities that affect data quality.
Height is not a useful variable in data analysis.
What is the danger of not considering who is missing from a dataset (pick one)?
It has no significant impact on the analysis.
-It can lead to conclusions that do not represent the full context.
It simplifies the analysis by reducing the amount of data.
+It can lead to conclusions that do not represent the full context.
-What is the primary purpose of statistical modeling (pick one)?
+ What is a purpose of statistical modeling (pick one)?
-To prove hypotheses.
As a tool to help explore and understand the data.
+To prove hypotheses.
To replace exploratory data analysis.
What is meant by “our data are a simplification of the messy, complex world” (pick one)?
Data perfectly capture all aspects of reality.
-Data are always inaccurate and useless.
Data simplify reality to make analysis possible, but they cannot capture every detail.
+Data are always inaccurate and useless.
diff --git a/docs/02-drinking_from_a_fire_hose.html b/docs/02-drinking_from_a_fire_hose.html
index 3f432c33..13057ef9 100644
--- a/docs/02-drinking_from_a_fire_hose.html
+++ b/docs/02-drinking_from_a_fire_hose.html
@@ -836,16 +836,16 @@ Quiz
geom_col()
geom_density()
-When creating a histogram, adjusting the number of bins or binwidth affects:
+ What does adjusting the number of bins, or changing the binwidth, affect in a histogram (pick one)?
The overall shape of the underlying data distribution.
The size of the data points.
@@ -4204,26 +4204,26 @@ Quiz
They take too long to compute.
They cannot show outliers.
-Which of the following is a recommended way to enhance a boxplot to show more information about the data (pick one)?
+ How can you address that disadvantage (pick one)?
Increase the box width.
Overlay the actual data points using geom_jitter()
.
-Change the box color to gradient.
+Add colors for each category.
Remove the whiskers from the boxplot.
-In the context of ggplot2
, what does stat_ecdf()
compute (pick one)?
+ What does stat_ecdf()
compute (pick one)?
A histogram.
A scatterplot with error bars.
A boxplot.
A cumulative distribution function.
-What is geocoding in the context of mapping data (pick one)?
+ What is geocoding (pick one)?
-The process of converting latitude and longitude into place names.
-The process of selecting a map projection.
-The process of drawing map boundaries.
-The process of converting place names into latitude and longitude coordinates.
+Converting latitude and longitude into place names.
+Picking a map projection.
+Drawing map boundaries.
+Converting place names into latitude and longitude.
diff --git a/docs/07-gather_files/figure-html/fig-readioovertime-1.png b/docs/07-gather_files/figure-html/fig-readioovertime-1.png
index ed06cc42..def26b0d 100644
Binary files a/docs/07-gather_files/figure-html/fig-readioovertime-1.png and b/docs/07-gather_files/figure-html/fig-readioovertime-1.png differ
diff --git a/docs/09-clean_and_prepare.html b/docs/09-clean_and_prepare.html
index 1f17882e..07c40559 100644
--- a/docs/09-clean_and_prepare.html
+++ b/docs/09-clean_and_prepare.html
@@ -3058,9 +3058,9 @@ Pointblank Validation
-
+
-[2024-10-12|21:02:03]
+[2024-10-13|09:48:08]
tibble column_names_as_contracts
@@ -3229,7 +3229,7 @@
-2024-10-12 21:02:03 EDT < 1 s 2024-10-12 21:02:03 EDT
+2024-10-13 09:48:08 EDT < 1 s 2024-10-13 09:48:08 EDT
diff --git a/docs/11-eda.html b/docs/11-eda.html
index adcd1571..e352e462 100644
--- a/docs/11-eda.html
+++ b/docs/11-eda.html
@@ -646,14 +646,14 @@ slice_sample (n = 6 )
# A tibble: 6 × 3
- state population income
- <chr> <dbl> <dbl>
-1 North Dakota 637 5087
-2 Maryland 4122 5299
-3 Texas 12237 4188
-4 Louisiana 3806 3545
-5 Alaska 365 6315
-6 North Carolina 5441 3875
+ state population income
+ <chr> <dbl> <dbl>
+1 Mississippi 2341 3098
+2 Oklahoma 2715 3983
+3 New Jersey 7333 5237
+4 Wisconsin 4589 4468
+5 Utah 1203 4022
+6 Kentucky 3387 3712
us_populations |>
glimpse ()
diff --git a/docs/23-assessment.html b/docs/23-assessment.html
index a4852e7a..adf76fb1 100644
--- a/docs/23-assessment.html
+++ b/docs/23-assessment.html
@@ -573,23 +573,23 @@
-
-
@@ -1211,23 +1211,23 @@
-
-
@@ -1860,23 +1860,23 @@
-
-
@@ -2512,23 +2512,23 @@
-
-
@@ -3159,23 +3159,23 @@
-
-
@@ -3816,23 +3816,23 @@
-
-
@@ -4465,23 +4465,23 @@
-
-
@@ -5110,23 +5110,23 @@
-
-
diff --git a/docs/24-interaction.html b/docs/24-interaction.html
index cd2a2904..40624fc8 100644
--- a/docs/24-interaction.html
+++ b/docs/24-interaction.html
@@ -714,8 +714,8 @@
Figure F.2: Interactive map of US bases
@@ -767,8 +767,8 @@
Figure F.3: Interactive map of US bases with colored circles to indicate spend
@@ -801,8 +801,8 @@
Figure F.4: Interactive map of US bases using Mapdeck
diff --git a/docs/search.json b/docs/search.json
index 2e9ab6fa..69fecfe4 100644
--- a/docs/search.json
+++ b/docs/search.json
@@ -170,7 +170,7 @@
"href": "01-introduction.html#exercises",
"title": "1 Telling stories with data",
"section": "1.6 Exercises",
- "text": "1.6 Exercises\n\nQuiz\n\nWhat is data science (in your own words)?\nBased on Register (2020), data decisions impact (pick one)?\n\nReal people.\nNo one.\nThose in the training set.\nThose in the test set.\n\nBased on Keyes (2019), what is data science (pick one)?\n\nThe inhumane reduction of humanity down to what can be counted.\nThe quantitative analysis of large amounts of data for the purpose of decision-making.\nData science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from many structured and unstructured data.\n\nBased on Keyes (2019), what is one consequence of data systems that require standardized categories?\n\nImproved user experience\nEnhanced security measures\nErasure of individual identities and experiences\nIncreased innovation in technology\n\nBased on Healy (2020), what criticism about working with quantitative data is addressed?\n\nThat it is too time-consuming and inefficient.\nThat it requires expensive equipment.\nThat it distances one from the reality of human lives behind the numbers.\nThat it is only suitable for scientists.\n\nBased on Healy (2020), what response to the criticism that quantitative data inures people to human realities is made?\n\nWe should stop analyzing data.\nWorking with data forces a confrontation with questions of meaning.\nData should only be analyzed by automated processes.\nQualitative approaches should be the predominate approach.\n\nHow can you reconcile Keyes (2019) and Healy (2020)?\nWhy is ethics a key element of Telling Stories with Data (pick one)?\n\nBecause data science always involves sensitive personal information.\nBecause datasets likely concern humans and require careful consideration of their context.\nBecause ethical considerations make the analysis more complex.\nBecause regulations require ethics approval for any data analysis.\n\nAccording to Crawford (2021), as described in this chapter, which of the following forces shape our world, and hence our data (select all that apply)?\n\nPolitical.\nHistorical.\nCultural.\nSocial.\n\nConsider the results of a survey that asked about gender. It finds the following counts: “man: 879”, “woman: 912”, “non-binary: 10” “prefer not to say: 3”, and “other: 1”. What is the appropriate way to consider “prefer not to say” (pick one)?\n\nDrop them.\nMerge it into “other”.\nInclude them.\nIt depends.\n\nImagine that you have a job in which including race and/or sexuality as predictors improves the performance of your model. 
When deciding whether to include these in your analysis, what factors would you consider (in your own words)?\nIn Telling Stories with Data what is meant by reproducibility in data science (pick one)?\n\nBeing able to produce similar results with different datasets.\nEnsuring that all steps of the analysis can be independently redone by others.\nPublishing results in peer-reviewed journals.\nUsing proprietary software to protect data.\n\nWhat challenge is associated with the measurement and data collection stage (pick one)?\n\nIt is usually straightforward and requires little attention.\nMeasurements are always accurate and consistent over time.\nData collection is entirely objective and free from bias.\nDeciding what to measure and how to measure it is complex and context-dependent.\n\nIn the analogy to the sculptor, what does the act of sculpting represent in the data workflow (pick one)?\n\nCreating complex models to fit the data.\nAcquiring raw data.\nCleaning and preparing the data to reveal the needed dataset.\nVisualizing the final results.\n\nWhy is exploratory data analysis (EDA) considered an open-ended process (pick one)?\n\nBecause it has a fixed set of steps to follow.\nBecause it involves testing hypotheses in a structured way.\nBecause it requires ongoing iteration to understand the data’s shape and patterns.\nBecause it can be automated with modern software.\n\nWhy should statistical models be used carefully (pick one)?\n\nBecause they always provide definitive results.\nBecause they can reflect the decisions made in earlier stages.\nBecause they are too complicated for most audiences.\nBecause they are unnecessary if the data are well-presented.\n\nWhat is one of the key messages from measuring height (pick one)?\n\nHeight is a straightforward measurement with little variability.\nAll measurements are accurate if done with modern tools.\nEven simple measurements can have complexities that affect data quality.\nHeight is not a useful variable in data analysis.\n\nWhat is the danger of not considering who is missing from a dataset (pick one)?\n\nIt has no significant impact on the analysis.\nIt can lead to conclusions that do not represent the full context.\nIt simplifies the analysis by reducing the amount of data.\n\nWhat is the primary purpose of statistical modeling (pick one)?\n\nTo prove hypotheses.\nAs a tool to help explore and understand the data.\nTo replace exploratory data analysis.\n\nWhat is meant by “our data are a simplification of the messy, complex world” (pick one)?\n\nData perfectly capture all aspects of reality.\nData are always inaccurate and useless.\nData simplify reality to make analysis possible, but they cannot capture every detail.\n\n\n\n\nActivity\nThe purpose of this activity is to clarify in your mind the difficulty of measurement, even of seemingly simple things, and hence the likelihood of measurement issues in more complicated areas.\nPlease obtain some seeds for a fast-growing plant such as radishes, mustard greens, or arugula. Plant the seeds and measure how much soil you used. Water them and measure the water you used. Each day take a note of any changes. More generally, measure and record as much as you can. Note your thoughts about the difficulty of measurement. Eventually your seeds will sprout, and you should measure how they grow.\nWe will return to use the data that you gathered.\n\n\n\n\nBandy, John, and Nicholas Vincent. 2021. 
“Addressing ‘Documentation Debt’ in Machine Learning: A Retrospective Datasheet for BookCorpus.” In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, edited by J. Vanschoren and S. Yeung. Vol. 1. https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/54229abfcfa5649e7003b83dd4755294-Paper-round1.pdf.\n\n\nBaumer, Benjamin, Daniel Kaplan, and Nicholas Horton. 2021. Modern Data Science With R. 2nd ed. Chapman; Hall/CRC. https://mdsr-book.github.io/mdsr2e/.\n\n\nBlei, David, Andrew Ng, and Michael Jordan. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3 (Jan): 993–1022. https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf.\n\n\nBoland, Philip. 1984. “A Biographical Glimpse of William Sealy Gosset.” The American Statistician 38 (3): 179–83. https://doi.org/10.2307/2683648.\n\n\nBowley, Arthur Lyon. 1901. Elements of Statistics. London: P. S. King.\n\n\nBradley, Valerie, Shiro Kuriwaki, Michael Isakov, Dino Sejdinovic, Xiao-Li Meng, and Seth Flaxman. 2021. “Unrepresentative Big Surveys Significantly Overestimated US Vaccine Uptake.” Nature 600 (7890): 695–700. https://doi.org/10.1038/s41586-021-04198-4.\n\n\nCook, Dianne, Nancy Reid, and Emi Tanaka. 2021. “The Foundation Is Available for Thinking about Data Visualization Inferentially.” Harvard Data Science Review 3 (3). https://doi.org/10.1162/99608f92.8453435d.\n\n\nCraiu, Radu. 2019. “The Hiring Gambit: In Search of the Twofer Data Scientist.” Harvard Data Science Review 1 (1). https://doi.org/10.1162/99608f92.440445cb.\n\n\nCrawford, Kate. 2021. Atlas of AI. 1st ed. New Haven: Yale University Press.\n\n\nD’Ignazio, Catherine, and Lauren Klein. 2020. Data Feminism. Massachusetts: The MIT Press. https://data-feminism.mitpress.mit.edu.\n\n\nDolatsara, Hamidreza Ahady, Ying-Ju Chen, Robert Leonard, Fadel Megahed, and Allison Jones-Farmer. 2021. “Explaining Predictive Model Performance: An Experimental Study of Data Preparation and Model Choice.” Big Data, October. https://doi.org/10.1089/big.2021.0067.\n\n\nEdgeworth, Francis Ysidro. 1885. “Methods of Statistics.” Journal of the Statistical Society of London, 181–217.\n\n\nFord, Paul. 2015. “What Is Code?” Bloomberg Businessweek, June. https://www.bloomberg.com/graphics/2015-paul-ford-what-is-code/.\n\n\nForster, Edward Morgan. 1927. Aspects of the Novel. London: Edward Arnold.\n\n\nFoster, Gordon. 1968. “Computers, Statistics and Planning: Systems or Chaos?” Geary Lecture. https://www.esri.ie/system/files/publications/GLS2.pdf.\n\n\nFranklin, Laura. 2005. “Exploratory Experiments.” Philosophy of Science 72 (5): 888–99. https://doi.org/10.1086/508117.\n\n\nGould, Elliot, Hannah Fraser, Timothy Parker, Shinichi Nakagawa, Simon Griffith, Peter Vesk, and Fiona Fidler. 2023. “Same Data, Different Analysts: Variation in Effect Sizes Due to Analytical Decisions in Ecology and Evolutionary Biology,” October. https://doi.org/10.32942/x2gg62.\n\n\nHamming, Richard. (1997) 2020. The Art of Doing Science and Engineering. 2nd ed. Stripe Press.\n\n\nHealy, Kieran. 2020. “The Kitchen Counter Observatory,” May. https://kieranhealy.org/blog/archives/2020/05/21/the-kitchen-counter-observatory/.\n\n\nHeil, Benjamin, Michael Hoffman, Florian Markowetz, Su-In Lee, Casey Greene, and Stephanie Hicks. 2021. “Reproducibility Standards for Machine Learning in the Life Sciences.” Nature Methods 18 (10): 1132–35. https://doi.org/10.1038/s41592-021-01256-7.\n\n\nHullman, Jessica, and Andrew Gelman. 2021. 
“Designing for Interactive Exploratory Data Analysis Requires Theories of Graphical Inference.” Harvard Data Science Review 3 (3). https://doi.org/10.1162/99608f92.3ab8a587.\n\n\nHuntington-Klein, Nick, Andreu Arenas, Emily Beam, Marco Bertoni, Jeffrey Bloem, Pralhad Burli, Naibin Chen, et al. 2021. “The Influence of Hidden Researcher Decisions in Applied Microeconomics.” Economic Inquiry 59: 944–60. https://doi.org/10.1111/ecin.12992.\n\n\nJames, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. (2013) 2021. An Introduction to Statistical Learning with Applications in R. 2nd ed. Springer. https://www.statlearning.com.\n\n\nJordan, Michael. 2019. “Artificial Intelligence–The Revolution Hasn’t Happened Yet.” Harvard Data Science Review 1 (1). https://doi.org/10.1162/99608f92.f06c6e61.\n\n\nKennedy, Lauren, Katharine Khanna, Daniel Simpson, Andrew Gelman, Yajun Jia, and Julien Teitler. 2022. “He, She, They: Using Sex and Gender in Survey Adjustment.” https://arxiv.org/abs/2009.14401.\n\n\nKent, William. 1993. “My Height: A Model for Numeric Information.” https://www.bkent.net/Doc/myheight.htm.\n\n\nKeyes, Os. 2019. “Counting the Countless.” Real Life. https://reallifemag.com/counting-the-countless/.\n\n\nLeek, Jeff, and Roger Peng. 2020. “Advanced Data Science 2020.” http://jtleek.com/ads2020/index.html.\n\n\nMacDorman, Marian, and Eugene Declercq. 2018. “The Failure of United States Maternal Mortality Reporting and Its Impact on Women’s Lives.” Birth 45 (2): 105–8. https://doi.org/1111/birt.12333.\n\n\nMcElreath, Richard. (2015) 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. 2nd ed. Chapman; Hall/CRC.\n\n\nMeng, Xiao-Li. 2018. “Statistical Paradises and Paradoxes in Big Data (i): Law of Large Populations, Big Data Paradox, and the 2016 US Presidential Election.” The Annals of Applied Statistics 12 (2): 685–726. https://doi.org/10.1214/18-AOAS1161SF.\n\n\nRegister, Yim. 2020. “Data Science Ethics in 6 Minutes.” YouTube, December. https://youtu.be/mA4gypAiRYU.\n\n\nRilke, Rainer Maria. (1929) 2014. Letters to a Young Poet. Penguin Classics.\n\n\nThornhill, John. 2021. “Lunch with the FT: Mathematician Hannah Fry.” Financial Times, July. https://www.ft.com/content/a5e33e5a-99b9-4bbc-948f-8a527c7675c3.\n\n\nTimbers, Tiffany, Trevor Campbell, and Melissa Lee. 2022. Data Science: A First Introduction. Chapman; Hall/CRC. https://datasciencebook.ca.\n\n\nWasserman, Larry. 2005. All of Statistics. Springer.\n\n\nWickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. (2016) 2023. R for Data Science. 2nd ed. O’Reilly Media. https://r4ds.hadley.nz.\n\n\nWorld Health Organization. 2019. “Trends in Maternal Mortality 2000 to 2017: Estimates by WHO, UNICEF, UNFPA, World Bank Group and the United Nations Population Division.” https://apps.who.int/iris/handle/10665/327596.",
+ "text": "1.6 Exercises\n\nQuiz\n\nWhat is data science (in your own words)?\nFrom Register (2020), data decisions impact (pick one)?\n\nReal people.\nNo one.\nThose in the training set.\nThose in the test set.\n\nFrom Keyes (2019), what is data science (pick one)?\n\nData science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from many structured and unstructured data.\nThe quantitative analysis of large amounts of data for the purpose of decision-making.\nThe inhumane reduction of humanity down to what can be counted.\n\nFrom Keyes (2019), what is one consequence of data systems that require standardized categories?\n\nWorse user experience.\nCompromised security measures.\nIncreased innovation in technology.\nErasure of individual identities and experiences.\n\nFrom Healy (2020), what is a common criticism about working with data?\n\nThat it is too time-consuming and inefficient.\nThat it distances one from the reality of human lives behind the numbers.\nThat it requires expensive software and extensive training to analyse.\n\nFrom Healy (2020), what is a response that criticism?\n\nWorking with data forces a confrontation with questions of meaning.\nData analysis should not be done.\nData should only be analyzed by automated processes.\nQualitative approaches should be the predominate approach.\n\nHow can you reconcile Keyes (2019) and Healy (2020)?\nWhy is ethics a key element of data science (pick one)?\n\nBecause data science always involves sensitive personal information.\nBecause ethical considerations make the analysis easier to do.\nBecause datasets likely concern humans and require consideration of context.\nBecause regulations require ethics approval for any data analysis.\n\nAccording to Crawford (2021), as described in this chapter, which of the following forces shape our world, and hence our data (select all that apply)?\n\nPolitical.\nPhysical.\nHistorical.\nCultural.\nSocial.\n\nFrom Ford (2015), what is a compiler (pick one)?\n\nSoftware that takes the symbols you typed into a file and transforms them into lower-level instructions.\nA sequence of symbols (using typical keyboard characters, saved to a file of some kind) that someone typed in, or copied, or pasted from elsewhere.\nA clock with benefits.\nPutting holes in punch cards, then into a box, then loading them, then the computer flips through the cards, identify where the holes were, and update parts of its memory.\n\nConsider the results of a survey that asked about gender. It finds the following counts: “man: 879”, “woman: 912”, “non-binary: 10” “prefer not to say: 3”, and “other: 1”. What is the appropriate way to consider “prefer not to say” (pick one)?\n\nDrop them.\nIt depends.\nInclude them.\nMerge it into “other”.\n\nImagine that you have a job in which including race and/or sexuality as predictors improves the performance of your model. 
When deciding whether to include these in your analysis, what factors would you consider (in your own words)?\nWhat is meant by reproducibility in data science (pick one)?\n\nBeing able to produce similar results with different datasets.\nEnsuring that all steps of the analysis can be independently redone by others.\nPublishing results in peer-reviewed journals.\nUsing proprietary software to protect data.\n\nWhat is a challenge associated with measurement (pick one)?\n\nIt is usually straightforward and requires little attention.\nDeciding what and how to measure is complex and context-dependent.\nData collection is objective and free from bias.\nMeasurements are always accurate and consistent over time.\n\nIn the analogy to the sculptor, what does the act of sculpting represent in the data workflow (pick one)?\n\nCreating complex models to fit the data.\nAcquiring raw data.\nCleaning and preparing the data to reveal the needed dataset.\nVisualizing the results.\n\nWhy is exploratory data analysis (EDA) an open-ended process (pick one)?\n\nBecause it has a fixed set of steps to follow.\nBecause it requires ongoing iteration to understand the data’s shape and patterns.\nBecause it involves testing hypotheses in a structured way.\nBecause it can be automated.\n\nWhy should statistical models be used carefully (pick one)?\n\nBecause they always provide definitive results.\nBecause they can reflect the decisions made in earlier stages.\nBecause they are too complicated for most audiences.\nBecause they are unnecessary if the data are well-presented.\n\nWhat is one lesson from thinking about the challenges of measuring height (pick one)?\n\nHeight is a straightforward measurement with little variability.\nAll measurements are accurate if done with the right instrument.\nEven simple measurements can have complexities that affect data quality.\nHeight is not a useful variable in data analysis.\n\nWhat is the danger of not considering who is missing from a dataset (pick one)?\n\nIt has no significant impact on the analysis.\nIt simplifies the analysis by reducing the amount of data.\nIt can lead to conclusions that do not represent the full context.\n\nWhat is a purpose of statistical modeling (pick one)?\n\nAs a tool to help explore and understand the data.\nTo prove hypotheses.\nTo replace exploratory data analysis.\n\nWhat is meant by “our data are a simplification of the messy, complex world” (pick one)?\n\nData perfectly capture all aspects of reality.\nData simplify reality to make analysis possible, but they cannot capture every detail.\nData are always inaccurate and useless.\n\n\n\n\nActivity\nThe purpose of this activity is to clarify in your mind the difficulty of measurement, even of seemingly simple things, and hence the likelihood of measurement issues in more complicated areas.\nPlease obtain some seeds for a fast-growing plant such as radishes, mustard greens, or arugula. Plant the seeds and measure how much soil you used. Water them and measure the water you used. Each day take a note of any changes. More generally, measure and record as much as you can. Note your thoughts about the difficulty of measurement. Eventually your seeds will sprout, and you should measure how they grow.\nWe will return to use the data that you gathered.\n\n\n\n\nBandy, John, and Nicholas Vincent. 2021. “Addressing ‘Documentation Debt’ in Machine Learning: A Retrospective Datasheet for BookCorpus.” In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, edited by J. 
Vanschoren and S. Yeung. Vol. 1. https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/54229abfcfa5649e7003b83dd4755294-Paper-round1.pdf.\n\n\nBaumer, Benjamin, Daniel Kaplan, and Nicholas Horton. 2021. Modern Data Science With R. 2nd ed. Chapman; Hall/CRC. https://mdsr-book.github.io/mdsr2e/.\n\n\nBlei, David, Andrew Ng, and Michael Jordan. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3 (Jan): 993–1022. https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf.\n\n\nBoland, Philip. 1984. “A Biographical Glimpse of William Sealy Gosset.” The American Statistician 38 (3): 179–83. https://doi.org/10.2307/2683648.\n\n\nBowley, Arthur Lyon. 1901. Elements of Statistics. London: P. S. King.\n\n\nBradley, Valerie, Shiro Kuriwaki, Michael Isakov, Dino Sejdinovic, Xiao-Li Meng, and Seth Flaxman. 2021. “Unrepresentative Big Surveys Significantly Overestimated US Vaccine Uptake.” Nature 600 (7890): 695–700. https://doi.org/10.1038/s41586-021-04198-4.\n\n\nCook, Dianne, Nancy Reid, and Emi Tanaka. 2021. “The Foundation Is Available for Thinking about Data Visualization Inferentially.” Harvard Data Science Review 3 (3). https://doi.org/10.1162/99608f92.8453435d.\n\n\nCraiu, Radu. 2019. “The Hiring Gambit: In Search of the Twofer Data Scientist.” Harvard Data Science Review 1 (1). https://doi.org/10.1162/99608f92.440445cb.\n\n\nCrawford, Kate. 2021. Atlas of AI. 1st ed. New Haven: Yale University Press.\n\n\nD’Ignazio, Catherine, and Lauren Klein. 2020. Data Feminism. Massachusetts: The MIT Press. https://data-feminism.mitpress.mit.edu.\n\n\nDolatsara, Hamidreza Ahady, Ying-Ju Chen, Robert Leonard, Fadel Megahed, and Allison Jones-Farmer. 2021. “Explaining Predictive Model Performance: An Experimental Study of Data Preparation and Model Choice.” Big Data, October. https://doi.org/10.1089/big.2021.0067.\n\n\nEdgeworth, Francis Ysidro. 1885. “Methods of Statistics.” Journal of the Statistical Society of London, 181–217.\n\n\nFord, Paul. 2015. “What Is Code?” Bloomberg Businessweek, June. https://www.bloomberg.com/graphics/2015-paul-ford-what-is-code/.\n\n\nForster, Edward Morgan. 1927. Aspects of the Novel. London: Edward Arnold.\n\n\nFoster, Gordon. 1968. “Computers, Statistics and Planning: Systems or Chaos?” Geary Lecture. https://www.esri.ie/system/files/publications/GLS2.pdf.\n\n\nFranklin, Laura. 2005. “Exploratory Experiments.” Philosophy of Science 72 (5): 888–99. https://doi.org/10.1086/508117.\n\n\nGould, Elliot, Hannah Fraser, Timothy Parker, Shinichi Nakagawa, Simon Griffith, Peter Vesk, and Fiona Fidler. 2023. “Same Data, Different Analysts: Variation in Effect Sizes Due to Analytical Decisions in Ecology and Evolutionary Biology,” October. https://doi.org/10.32942/x2gg62.\n\n\nHamming, Richard. (1997) 2020. The Art of Doing Science and Engineering. 2nd ed. Stripe Press.\n\n\nHealy, Kieran. 2020. “The Kitchen Counter Observatory,” May. https://kieranhealy.org/blog/archives/2020/05/21/the-kitchen-counter-observatory/.\n\n\nHeil, Benjamin, Michael Hoffman, Florian Markowetz, Su-In Lee, Casey Greene, and Stephanie Hicks. 2021. “Reproducibility Standards for Machine Learning in the Life Sciences.” Nature Methods 18 (10): 1132–35. https://doi.org/10.1038/s41592-021-01256-7.\n\n\nHullman, Jessica, and Andrew Gelman. 2021. “Designing for Interactive Exploratory Data Analysis Requires Theories of Graphical Inference.” Harvard Data Science Review 3 (3). 
https://doi.org/10.1162/99608f92.3ab8a587.\n\n\nHuntington-Klein, Nick, Andreu Arenas, Emily Beam, Marco Bertoni, Jeffrey Bloem, Pralhad Burli, Naibin Chen, et al. 2021. “The Influence of Hidden Researcher Decisions in Applied Microeconomics.” Economic Inquiry 59: 944–60. https://doi.org/10.1111/ecin.12992.\n\n\nJames, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. (2013) 2021. An Introduction to Statistical Learning with Applications in R. 2nd ed. Springer. https://www.statlearning.com.\n\n\nJordan, Michael. 2019. “Artificial Intelligence–The Revolution Hasn’t Happened Yet.” Harvard Data Science Review 1 (1). https://doi.org/10.1162/99608f92.f06c6e61.\n\n\nKennedy, Lauren, Katharine Khanna, Daniel Simpson, Andrew Gelman, Yajun Jia, and Julien Teitler. 2022. “He, She, They: Using Sex and Gender in Survey Adjustment.” https://arxiv.org/abs/2009.14401.\n\n\nKent, William. 1993. “My Height: A Model for Numeric Information.” https://www.bkent.net/Doc/myheight.htm.\n\n\nKeyes, Os. 2019. “Counting the Countless.” Real Life. https://reallifemag.com/counting-the-countless/.\n\n\nLeek, Jeff, and Roger Peng. 2020. “Advanced Data Science 2020.” http://jtleek.com/ads2020/index.html.\n\n\nMacDorman, Marian, and Eugene Declercq. 2018. “The Failure of United States Maternal Mortality Reporting and Its Impact on Women’s Lives.” Birth 45 (2): 105–8. https://doi.org/1111/birt.12333.\n\n\nMcElreath, Richard. (2015) 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. 2nd ed. Chapman; Hall/CRC.\n\n\nMeng, Xiao-Li. 2018. “Statistical Paradises and Paradoxes in Big Data (i): Law of Large Populations, Big Data Paradox, and the 2016 US Presidential Election.” The Annals of Applied Statistics 12 (2): 685–726. https://doi.org/10.1214/18-AOAS1161SF.\n\n\nRegister, Yim. 2020. “Data Science Ethics in 6 Minutes.” YouTube, December. https://youtu.be/mA4gypAiRYU.\n\n\nRilke, Rainer Maria. (1929) 2014. Letters to a Young Poet. Penguin Classics.\n\n\nThornhill, John. 2021. “Lunch with the FT: Mathematician Hannah Fry.” Financial Times, July. https://www.ft.com/content/a5e33e5a-99b9-4bbc-948f-8a527c7675c3.\n\n\nTimbers, Tiffany, Trevor Campbell, and Melissa Lee. 2022. Data Science: A First Introduction. Chapman; Hall/CRC. https://datasciencebook.ca.\n\n\nWasserman, Larry. 2005. All of Statistics. Springer.\n\n\nWickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. (2016) 2023. R for Data Science. 2nd ed. O’Reilly Media. https://r4ds.hadley.nz.\n\n\nWorld Health Organization. 2019. “Trends in Maternal Mortality 2000 to 2017: Estimates by WHO, UNICEF, UNFPA, World Bank Group and the United Nations Population Division.” https://apps.who.int/iris/handle/10665/327596.",
"crumbs": [
"Foundations",
"1 Telling stories with data "
@@ -203,7 +203,7 @@
"href": "02-drinking_from_a_fire_hose.html#australian-elections",
"title": "2 Drinking from a fire hose",
"section": "2.2 Australian elections",
- "text": "2.2 Australian elections\nAustralia is a parliamentary democracy with 151 seats in the House of Representatives, which is the lower house and that from which government is formed. There are two major parties—“Liberal” and “Labor”—two minor parties—“Nationals” and “Greens”—and many smaller parties and independents. In this example we will create a graph of the number of seats that each party won in the 2022 Federal Election.\n\n2.2.1 Plan\nFor this example, we need to plan two aspects. The first is what the dataset that we need will look like, and the second is what the final graph will look like.\nThe basic requirement for the dataset is that it has the name of the seat (sometimes called a “division” in Australia) and the party of the person elected. A quick sketch of the dataset that we would need is Figure 2.2 (a).\n\n\n\n\n\n\n\n\n\n\n\n(a) Quick sketch of a dataset that could be useful for analyzing Australian elections\n\n\n\n\n\n\n\n\n\n\n\n(b) Quick sketch of a possible graph of the number of seats won by each party\n\n\n\n\n\n\n\nFigure 2.2: Sketches of a potential dataset and graph related to an Australian election\n\n\n\nWe also need to plan the graph that we are interested in. Given we want to display the number of seats that each party won, a quick sketch of what we might aim for is Figure 2.2 (b).\n\n\n2.2.2 Simulate\nWe now simulate some data, to bring some specificity to our sketches.\nTo get started, within Posit Cloud, make a new Quarto document: “File” \\(\\rightarrow\\) “New File” \\(\\rightarrow\\) “Quarto document\\(\\dots\\)”. Give it a title, such as “Exploring the 2022 Australian Election”, add your name as author, and unclick “Use visual markdown editor” (Figure 2.3 (a)). Leave the other options as their default, and then click “Create”.\n\n\n\n\n\n\n\n\n\n\n\n(a) Creating a new Quarto document\n\n\n\n\n\n\n\n\n\n\n\n(b) Installing rmarkdown if necessary\n\n\n\n\n\n\n\n\n\n\n\n\n\n(c) After initial setup and with a preamble\n\n\n\n\n\n\n\n\n\n\n\n(d) Highlighting the green arrow to run the chunk\n\n\n\n\n\n\n\n\n\n\n\n\n\n(e) Highlighting the cross to remove the messages\n\n\n\n\n\n\n\n\n\n\n\n(f) Highlighting the render button\n\n\n\n\n\n\n\nFigure 2.3: Getting started with a Quarto document\n\n\n\nYou may get a notification along the lines of “Package rmarkdown required\\(\\dots\\).” (Figure 2.3 (b)). If that happens, click “Install”. For this example, we will put everything into this one Quarto document. You should save it as “australian_elections.qmd”: “File” \\(\\rightarrow\\) “Save As\\(\\dots\\)”.\nRemove almost all the default content, and then beneath the heading material create a new R code chunk: “Code” \\(\\rightarrow\\) “Insert Chunk”. Then add preamble documentation that explains:\n\nthe purpose of the document;\nthe author and contact details;\nwhen the file was written or last updated; and\nprerequisites that the file relies on.\n\n\n#### Preamble ####\n# Purpose: Read in data from the 2022 Australian Election and make\n# a graph of the number of seats each party won.\n# Author: Rohan Alexander\n# Email: rohan.alexander@utoronto.ca\n# Date: 1 January 2023\n# Prerequisites: Know where to get Australian elections data.\n\nIn R, lines that start with “#” are comments. This means that they are not run as code by R, but are instead designed to be read by humans. Each line of this preamble should start with a “#”. Also make it clear that this is the preamble section by surrounding that with “####”. 
The result should look like Figure 2.3 (c).\nAfter this we need to setup the workspace. This involves installing and loading any packages that will be needed. A package only needs to be installed once for each computer, but needs to be loaded each time it is to be used. In this case we are going to use the tidyverse and janitor packages. They will need to be installed because this is the first time they are being used, and then each will need to be loaded.\n\n\n\n\n\n\nShoulders of giants\n\n\n\nHadley Wickham is Chief Scientist at RStudio. After earning a PhD in Statistics from Iowa State University in 2008 he was appointed as an assistant professor at Rice University, and became Chief Scientist at RStudio, now Posit, in 2013. He developed the tidyverse collection of packages, and has published many books including R for Data Science (Wickham, Çetinkaya-Rundel, and Grolemund [2016] 2023) and Advanced R (Wickham 2019). He was awarded the COPSS Presidents’ Award in 2019.\n\n\nAn example of installing the packages follows. Run this code by clicking the small green arrow associated with the R code chunk (Figure 2.3 (d)).\n\n#### Workspace setup ####\ninstall.packages(\"tidyverse\")\ninstall.packages(\"janitor\")\n\nNow that the packages are installed, they need to be loaded. As that package installation step only needs to be done once per computer, that code can be commented out so that it is not accidentally run, or even just removed. Additionally, we can remove the message that printed when we installed the packages (Figure 2.3 (e)).\n\n#### Workspace setup ####\n# install.packages(\"tidyverse\")\n# install.packages(\"janitor\")\n\nlibrary(tidyverse)\nlibrary(janitor)\n\nWe can render the entire document by clicking “Render” (Figure 2.3 (f)). When you do this, you may be asked to install some packages. If that happens, then you should agree to this. This will result in a HTML document.\nFor an introduction to the packages that were just installed, each package contains a help file that provides information about them and their functions. It can be accessed by prepending a question mark to the package name and then running that code in the console. For instance ?tidyverse.\nTo simulate our data, we need to create a dataset with two variables: “Division” and “Party”, and some values for each. In the case of “Division” reasonable values would be a name of one of the 151 Australian divisions. In the case of “Party” reasonable values would be one of the following five: “Liberal”, “Labor”, “National”, “Green”, or “Other”. Again, this code can be run by clicking the small green arrow associated with the R code chunk.\n\nsimulated_data <-\n tibble(\n # Use 1 through to 151 to represent each division\n \"Division\" = 1:151,\n # Randomly pick an option, with replacement, 151 times\n \"Party\" = sample(\n x = c(\"Liberal\", \"Labor\", \"National\", \"Green\", \"Other\"),\n size = 151,\n replace = TRUE\n )\n )\n\nsimulated_data\n\n# A tibble: 151 × 2\n Division Party \n <int> <chr> \n 1 1 Green \n 2 2 Green \n 3 3 Labor \n 4 4 Other \n 5 5 Labor \n 6 6 Liberal\n 7 7 Green \n 8 8 Other \n 9 9 Labor \n10 10 Labor \n# ℹ 141 more rows\n\n\nAt a certain point, your code will not run and you will want to ask others for help. Do not take a screenshot of a small snippet of the code and expect that someone will be able to help based on that. They, almost surely, cannot. Instead, you need to provide them with your whole script in a way that they can run. 
We will explain what GitHub is more completely in Chapter 3, but for now, if you need help, then you should naively create a GitHub Gist which will enable you to share your code in a way that is more helpful than taking a screenshot. The first step is to create a free account on GitHub (Figure 2.4 (a)). Thinking about an appropriate username is important because this will become part of your professional profile. It would make sense to have a username that is professional, independent of any course, and ideally related to your real name. Then look for a “+” in the top right, and select “New gist” (Figure 2.4 (b)).\n\n\n\n\n\n\n\n\n\n\n\n(a) GitHub sign-up screen\n\n\n\n\n\n\n\n\n\n\n\n(b) New GitHub Gist\n\n\n\n\n\n\n\n\n\n\n\n\n\n(c) Create a public GitHub Gist to share code\n\n\n\n\n\n\n\nFigure 2.4: Creating a Gist to share code when asking for help\n\n\n\nFrom here you should add all the code to that Gist, not just the final bit that is giving an error. And give it a meaningful filename that includes “.R” at the end, for instance, “australian_elections.R”. In Figure 2.4 (c) it will turn out that we have incorrect capitalization, library(Tidyverse) instead of library(tidyverse).\nClick “Create public gist”. We can then share the URL to this Gist with whoever we are asking to help, explain what the problem is, and what we are trying to achieve. It will be easier for them to help, because all the code is available.\n\n\n2.2.3 Acquire\nNow we want to get the actual data. The data we need is from the Australian Electoral Commission (AEC), which is the non-partisan agency that organizes Australian federal elections. We can pass a page of their website to read_csv() from readr. We do not need to explicitly load readr because it is part of the tidyverse. The <- or “assignment operator” allocates the output of read_csv() to an object called “raw_elections_data”.\n\n#### Read in the data ####\nraw_elections_data <-\n read_csv(\n file = \n \"https://results.aec.gov.au/27966/website/Downloads/HouseMembersElectedDownload-27966.csv\",\n show_col_types = FALSE,\n skip = 1\n )\n\n# We have read the data from the AEC website. We may like to save\n# it in case something happens or they move it.\nwrite_csv(\n x = raw_elections_data,\n file = \"australian_voting.csv\"\n)\n\nWe can take a quick look at the dataset using head() which will show the first six rows, and tail() which will show the last six rows.\n\nhead(raw_elections_data)\n\n# A tibble: 6 × 8\n DivisionID DivisionNm StateAb CandidateID GivenNm Surname PartyNm PartyAb\n <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> \n1 179 Adelaide SA 36973 Steve GEORGANAS Austral… ALP \n2 197 Aston VIC 36704 Alan TUDGE Liberal LP \n3 198 Ballarat VIC 36409 Catherine KING Austral… ALP \n4 103 Banks NSW 37018 David COLEMAN Liberal LP \n5 180 Barker SA 37083 Tony PASIN Liberal LP \n6 104 Barton NSW 36820 Linda BURNEY Austral… ALP \n\ntail(raw_elections_data)\n\n# A tibble: 6 × 8\n DivisionID DivisionNm StateAb CandidateID GivenNm Surname PartyNm PartyAb\n <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> \n1 152 Wentworth NSW 37451 Allegra SPENDER Indepen… IND \n2 153 Werriwa NSW 36810 Anne Maree STANLEY Austral… ALP \n3 150 Whitlam NSW 36811 Stephen JONES Austral… ALP \n4 178 Wide Bay QLD 37506 Llew O'BRIEN Liberal… LNP \n5 234 Wills VIC 36452 Peter KHALIL Austral… ALP \n6 316 Wright QLD 37500 Scott BUCHHOLZ Liberal… LNP \n\n\nWe need to clean the data so that we can use it. 
We are trying to make it similar to the dataset that we thought we wanted in the planning stage. While it is fine to move away from the plan, this needs to be a deliberate, reasoned decision. After reading in the dataset that we saved, the first thing that we will do is adjust the names of the variables. We will do this using clean_names() from janitor.\n\n#### Basic cleaning ####\nraw_elections_data <-\n read_csv(\n file = \"australian_voting.csv\",\n show_col_types = FALSE\n )\n\n\n# Make the names easier to type\ncleaned_elections_data <-\n clean_names(raw_elections_data)\n\n# Have a look at the first six rows\nhead(cleaned_elections_data)\n\n# A tibble: 6 × 8\n division_id division_nm state_ab candidate_id given_nm surname party_nm \n <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> \n1 179 Adelaide SA 36973 Steve GEORGANAS Australian …\n2 197 Aston VIC 36704 Alan TUDGE Liberal \n3 198 Ballarat VIC 36409 Catherine KING Australian …\n4 103 Banks NSW 37018 David COLEMAN Liberal \n5 180 Barker SA 37083 Tony PASIN Liberal \n6 104 Barton NSW 36820 Linda BURNEY Australian …\n# ℹ 1 more variable: party_ab <chr>\n\n\nThe names are faster to type because RStudio will auto-complete them. To do this, we begin typing the name of a variable and then use the “tab” key to complete it.\nThere are many variables in the dataset, and we are primarily interested in two: “division_nm” and “party_nm”. We can choose certain variables of interest with select() from dplyr which we loaded as part of the tidyverse. The “pipe operator”, |>, pushes the output of one line to be the first input of the function on the next line.\n\ncleaned_elections_data <-\n cleaned_elections_data |>\n select(\n division_nm,\n party_nm\n )\n\nhead(cleaned_elections_data)\n\n# A tibble: 6 × 2\n division_nm party_nm \n <chr> <chr> \n1 Adelaide Australian Labor Party\n2 Aston Liberal \n3 Ballarat Australian Labor Party\n4 Banks Liberal \n5 Barker Liberal \n6 Barton Australian Labor Party\n\n\nSome of the variable names are still not obvious because they are abbreviated. We can look at the names of the columns in this dataset with names(). 
And we can change the names using rename() from dplyr.\n\nnames(cleaned_elections_data)\n\n[1] \"division_nm\" \"party_nm\" \n\n\n\ncleaned_elections_data <-\n cleaned_elections_data |>\n rename(\n division = division_nm,\n elected_party = party_nm\n )\n\nhead(cleaned_elections_data)\n\n# A tibble: 6 × 2\n division elected_party \n <chr> <chr> \n1 Adelaide Australian Labor Party\n2 Aston Liberal \n3 Ballarat Australian Labor Party\n4 Banks Liberal \n5 Barker Liberal \n6 Barton Australian Labor Party\n\n\nWe could now look at the unique values in the “elected_party” column using unique().\n\ncleaned_elections_data$elected_party |>\n unique()\n\n[1] \"Australian Labor Party\" \n[2] \"Liberal\" \n[3] \"Liberal National Party of Queensland\"\n[4] \"The Greens\" \n[5] \"The Nationals\" \n[6] \"Independent\" \n[7] \"Katter's Australian Party (KAP)\" \n[8] \"Centre Alliance\" \n\n\nAs there is more detail in this than we wanted, we may want to simplify the party names to match what we simulated, using case_match() from dplyr.\n\ncleaned_elections_data <-\n cleaned_elections_data |>\n mutate(\n elected_party =\n case_match(\n elected_party,\n \"Australian Labor Party\" ~ \"Labor\",\n \"Liberal National Party of Queensland\" ~ \"Liberal\",\n \"Liberal\" ~ \"Liberal\",\n \"The Nationals\" ~ \"Nationals\",\n \"The Greens\" ~ \"Greens\",\n \"Independent\" ~ \"Other\",\n \"Katter's Australian Party (KAP)\" ~ \"Other\",\n \"Centre Alliance\" ~ \"Other\"\n )\n )\n\nhead(cleaned_elections_data)\n\n# A tibble: 6 × 2\n division elected_party\n <chr> <chr> \n1 Adelaide Labor \n2 Aston Liberal \n3 Ballarat Labor \n4 Banks Liberal \n5 Barker Liberal \n6 Barton Labor \n\n\nOur data now matches our plan (Figure 2.2 (a)). For every electoral division we have the party of the person that won it.\nHaving now nicely cleaned the dataset, we should save it, so that we can start with that cleaned dataset in the next stage. We should make sure to save it under a new file name so we are not replacing the raw data, and so that it is easy to identify the cleaned dataset later.\n\nwrite_csv(\n x = cleaned_elections_data,\n file = \"cleaned_elections_data.csv\"\n)\n\n\n\n2.2.4 Explore\nWe may like to explore the dataset that we created. One way to better understand a dataset is to make a graph. In particular, here we would like to build the graph that we planned in Figure 2.2 (b).\nFirst, we read in the dataset that we just created.\n\n#### Read in the data ####\ncleaned_elections_data <-\n read_csv(\n file = \"cleaned_elections_data.csv\",\n show_col_types = FALSE\n )\n\nWe can get a quick count of how many seats each party won using count() from dplyr.\n\ncleaned_elections_data |>\n count(elected_party)\n\n# A tibble: 5 × 2\n elected_party n\n <chr> <int>\n1 Greens 4\n2 Labor 77\n3 Liberal 48\n4 Nationals 10\n5 Other 12\n\n\nTo build the graph that we are interested in, we use ggplot2 which is part of the tidyverse. The key aspect of this package is that we build graphs by adding layers using “+”, which we call the “add operator”. 
In particular we will create a bar chart using geom_bar() from ggplot2 (Figure 2.5 (a)).\ncleaned_elections_data |>\n ggplot(aes(x = elected_party)) + # aes abbreviates \"aesthetics\" \n geom_bar()\n\ncleaned_elections_data |>\n ggplot(aes(x = elected_party)) +\n geom_bar() +\n theme_minimal() + # Make the theme neater\n labs(x = \"Party\", y = \"Number of seats\") # Make labels more meaningful\n\n\n\n\n\n\n\n\n\n\n\n(a) Default options\n\n\n\n\n\n\n\n\n\n\n\n(b) Improved theme and labels\n\n\n\n\n\n\n\nFigure 2.5: Number of seats won, by political party, at the 2022 Australian Federal Election\n\n\n\nFigure 2.5 (a) accomplishes what we set out to do. But we can make it look a bit nicer by modifying the default options and improving the labels (Figure 2.5 (b)).\n\n\n2.2.5 Share\nTo this point we have downloaded some data, cleaned it, and made a graph. We would typically need to communicate what we have done at some length. In this case, we can write a few paragraphs about what we did, why we did it, and what we found to conclude our workflow. An example follows.\n\nAustralia is a parliamentary democracy with 151 seats in the House of Representatives, which is the house from which government is formed. There are two major parties—“Liberal” and “Labor”—two minor parties—“Nationals” and “Greens”—and many smaller parties. The 2022 Federal Election occurred on 21 May, and around 15 million votes were cast. We were interested in the number of seats that were won by each party.\nWe downloaded the results, on a seat-specific basis, from the Australian Electoral Commission website. We cleaned and tidied the dataset using the statistical programming language R (R Core Team 2023) including the tidyverse (Wickham et al. 2019) and janitor (Firke 2023). We then created a graph of the number of seats that each political party won (Figure 2.5).\nWe found that the Labor Party won 77 seats, followed by the Liberal Party with 48 seats. The minor parties won the following number of seats: the Nationals won 10 seats and the Greens won 4 seats. Finally, there were 10 Independents elected as well as candidates from smaller parties.\nThe distribution of seats is skewed toward the two major parties which could reflect relatively stable preferences on the part of Australian voters, or possibly inertia due to the benefits of already being a major party such a national network or funding. A better understanding of the reasons for this distribution are of interest in future work. While the dataset consists of everyone who voted, it worth noting that in Australia some are systematically excluded from voting, and it is much more difficult for some to vote than others.\n\nOne aspect to be especially concerned with is making sure that this communication is focused on the needs of the audience and telling a story. Data journalism provides some excellent examples of how analysis needs to be tailored to the audience, for instance, Cardoso (2020) and Bronner (2020).",
+ "text": "2.2 Australian elections\nAustralia is a parliamentary democracy with 151 seats in the House of Representatives, which is the lower house and that from which government is formed. There are two major parties—“Liberal” and “Labor”—two minor parties—“Nationals” and “Greens”—and many smaller parties and independents. In this example we will create a graph of the number of seats that each party won in the 2022 Federal Election.\n\n2.2.1 Plan\nFor this example, we need to plan two aspects. The first is what the dataset that we need will look like, and the second is what the final graph will look like.\nThe basic requirement for the dataset is that it has the name of the seat (sometimes called a “division” in Australia) and the party of the person elected. A quick sketch of the dataset that we would need is Figure 2.2 (a).\n\n\n\n\n\n\n\n\n\n\n\n(a) Quick sketch of a dataset that could be useful for analyzing Australian elections\n\n\n\n\n\n\n\n\n\n\n\n(b) Quick sketch of a possible graph of the number of seats won by each party\n\n\n\n\n\n\n\nFigure 2.2: Sketches of a potential dataset and graph related to an Australian election\n\n\n\nWe also need to plan the graph that we are interested in. Given we want to display the number of seats that each party won, a quick sketch of what we might aim for is Figure 2.2 (b).\n\n\n2.2.2 Simulate\nWe now simulate some data, to bring some specificity to our sketches.\nTo get started, within Posit Cloud, make a new Quarto document: “File” \\(\\rightarrow\\) “New File” \\(\\rightarrow\\) “Quarto document\\(\\dots\\)”. Give it a title, such as “Exploring the 2022 Australian Election”, add your name as author, and unclick “Use visual markdown editor” (Figure 2.3 (a)). Leave the other options as their default, and then click “Create”.\n\n\n\n\n\n\n\n\n\n\n\n(a) Creating a new Quarto document\n\n\n\n\n\n\n\n\n\n\n\n(b) Installing rmarkdown if necessary\n\n\n\n\n\n\n\n\n\n\n\n\n\n(c) After initial setup and with a preamble\n\n\n\n\n\n\n\n\n\n\n\n(d) Highlighting the green arrow to run the chunk\n\n\n\n\n\n\n\n\n\n\n\n\n\n(e) Highlighting the cross to remove the messages\n\n\n\n\n\n\n\n\n\n\n\n(f) Highlighting the render button\n\n\n\n\n\n\n\nFigure 2.3: Getting started with a Quarto document\n\n\n\nYou may get a notification along the lines of “Package rmarkdown required\\(\\dots\\).” (Figure 2.3 (b)). If that happens, click “Install”. For this example, we will put everything into this one Quarto document. You should save it as “australian_elections.qmd”: “File” \\(\\rightarrow\\) “Save As\\(\\dots\\)”.\nRemove almost all the default content, and then beneath the heading material create a new R code chunk: “Code” \\(\\rightarrow\\) “Insert Chunk”. Then add preamble documentation that explains:\n\nthe purpose of the document;\nthe author and contact details;\nwhen the file was written or last updated; and\nprerequisites that the file relies on.\n\n\n#### Preamble ####\n# Purpose: Read in data from the 2022 Australian Election and make\n# a graph of the number of seats each party won.\n# Author: Rohan Alexander\n# Email: rohan.alexander@utoronto.ca\n# Date: 1 January 2023\n# Prerequisites: Know where to get Australian elections data.\n\nIn R, lines that start with “#” are comments. This means that they are not run as code by R, but are instead designed to be read by humans. Each line of this preamble should start with a “#”. Also make it clear that this is the preamble section by surrounding that with “####”. 
The preamble should look like Figure 2.3 (c).\nAfter this we need to set up the workspace. This involves installing and loading any packages that will be needed. A package only needs to be installed once for each computer, but needs to be loaded each time it is to be used. In this case we are going to use the tidyverse and janitor packages. They will need to be installed because this is the first time they are being used, and then each will need to be loaded.\n\n\n\n\n\n\nShoulders of giants\n\n\n\nHadley Wickham is Chief Scientist at RStudio. After earning a PhD in Statistics from Iowa State University in 2008 he was appointed as an assistant professor at Rice University, and became Chief Scientist at RStudio, now Posit, in 2013. He developed the tidyverse collection of packages, and has published many books including R for Data Science (Wickham, Çetinkaya-Rundel, and Grolemund [2016] 2023) and Advanced R (Wickham 2019). He was awarded the COPSS Presidents’ Award in 2019.\n\n\nAn example of installing the packages follows. Run this code by clicking the small green arrow associated with the R code chunk (Figure 2.3 (d)).\n\n#### Workspace setup ####\ninstall.packages(\"tidyverse\")\ninstall.packages(\"janitor\")\n\nNow that the packages are installed, they need to be loaded. As that package installation step only needs to be done once per computer, that code can be commented out so that it is not accidentally run, or even just removed. Additionally, we can remove the message that printed when we installed the packages (Figure 2.3 (e)).\n\n#### Workspace setup ####\n# install.packages(\"tidyverse\")\n# install.packages(\"janitor\")\n\nlibrary(tidyverse)\nlibrary(janitor)\n\nWe can render the entire document by clicking “Render” (Figure 2.3 (f)). When you do this, you may be asked to install some packages. If that happens, then you should agree to this. This will result in an HTML document.\nFor an introduction to the packages that were just installed, each package contains a help file that provides information about it and its functions. It can be accessed by prepending a question mark to the package name and then running that code in the console. For instance, ?tidyverse.\nTo simulate our data, we need to create a dataset with two variables: “Division” and “Party”, and some values for each. In the case of “Division”, reasonable values would be the name of one of the 151 Australian divisions. In the case of “Party”, reasonable values would be one of the following five: “Liberal”, “Labor”, “National”, “Green”, or “Other”. Again, this code can be run by clicking the small green arrow associated with the R code chunk. Because sample() draws randomly, the exact values will differ each time the chunk is run unless a seed is first set with set.seed().\n\nsimulated_data <-\n tibble(\n # Use 1 through to 151 to represent each division\n \"Division\" = 1:151,\n # Randomly pick an option, with replacement, 151 times\n \"Party\" = sample(\n x = c(\"Liberal\", \"Labor\", \"National\", \"Green\", \"Other\"),\n size = 151,\n replace = TRUE\n )\n )\n\nsimulated_data\n\n# A tibble: 151 × 2\n Division Party \n <int> <chr> \n 1 1 Other \n 2 2 Labor \n 3 3 Green \n 4 4 Liberal\n 5 5 Labor \n 6 6 Green \n 7 7 Liberal\n 8 8 Labor \n 9 9 Green \n10 10 Other \n# ℹ 141 more rows\n\n\nAt a certain point, your code will not run and you will want to ask others for help. Do not take a screenshot of a small snippet of the code and expect that someone will be able to help based on that. They, almost surely, cannot. Instead, you need to provide them with your whole script in a way that they can run. 
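For instance, a minimal sketch of a self-contained script that someone else could run follows; the example_data object and its column names are made up for illustration, with the final line deliberately producing an error.\n\n#### Workspace setup ####\nlibrary(tidyverse)\n\n#### Code that produces the error ####\nexample_data <- tibble(height = c(150, 160, 170))\n\nexample_data |>\n summarise(average = mean(weight)) # Errors because there is no \"weight\" column\n\n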
We will explain what GitHub is more completely in Chapter 3, but for now, if you need help, then you should naively create a GitHub Gist which will enable you to share your code in a way that is more helpful than taking a screenshot. The first step is to create a free account on GitHub (Figure 2.4 (a)). Thinking about an appropriate username is important because this will become part of your professional profile. It would make sense to have a username that is professional, independent of any course, and ideally related to your real name. Then look for a “+” in the top right, and select “New gist” (Figure 2.4 (b)).\n\n\n\n\n\n\n\n\n\n\n\n(a) GitHub sign-up screen\n\n\n\n\n\n\n\n\n\n\n\n(b) New GitHub Gist\n\n\n\n\n\n\n\n\n\n\n\n\n\n(c) Create a public GitHub Gist to share code\n\n\n\n\n\n\n\nFigure 2.4: Creating a Gist to share code when asking for help\n\n\n\nFrom here you should add all the code to that Gist, not just the final bit that is giving an error. And give it a meaningful filename that includes “.R” at the end, for instance, “australian_elections.R”. In Figure 2.4 (c) it turns out that we have used incorrect capitalization: library(Tidyverse) instead of library(tidyverse).\nClick “Create public gist”. We can then share the URL to this Gist with whoever we are asking to help, explain what the problem is, and what we are trying to achieve. It will be easier for them to help, because all the code is available.\n\n\n2.2.3 Acquire\nNow we want to get the actual data. The data we need is from the Australian Electoral Commission (AEC), which is the non-partisan agency that organizes Australian federal elections. We can pass a page of their website to read_csv() from readr. We do not need to explicitly load readr because it is part of the tidyverse. The <- or “assignment operator” allocates the output of read_csv() to an object called “raw_elections_data”.\n\n#### Read in the data ####\nraw_elections_data <-\n read_csv(\n file = \n \"https://results.aec.gov.au/27966/website/Downloads/HouseMembersElectedDownload-27966.csv\",\n show_col_types = FALSE,\n skip = 1\n )\n\n# We have read the data from the AEC website. We may like to save\n# it in case something happens or they move it.\nwrite_csv(\n x = raw_elections_data,\n file = \"australian_voting.csv\"\n)\n\nWe can take a quick look at the dataset using head() which will show the first six rows, and tail() which will show the last six rows.\n\nhead(raw_elections_data)\n\n# A tibble: 6 × 8\n DivisionID DivisionNm StateAb CandidateID GivenNm Surname PartyNm PartyAb\n <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> \n1 179 Adelaide SA 36973 Steve GEORGANAS Austral… ALP \n2 197 Aston VIC 36704 Alan TUDGE Liberal LP \n3 198 Ballarat VIC 36409 Catherine KING Austral… ALP \n4 103 Banks NSW 37018 David COLEMAN Liberal LP \n5 180 Barker SA 37083 Tony PASIN Liberal LP \n6 104 Barton NSW 36820 Linda BURNEY Austral… ALP \n\ntail(raw_elections_data)\n\n# A tibble: 6 × 8\n DivisionID DivisionNm StateAb CandidateID GivenNm Surname PartyNm PartyAb\n <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> \n1 152 Wentworth NSW 37451 Allegra SPENDER Indepen… IND \n2 153 Werriwa NSW 36810 Anne Maree STANLEY Austral… ALP \n3 150 Whitlam NSW 36811 Stephen JONES Austral… ALP \n4 178 Wide Bay QLD 37506 Llew O'BRIEN Liberal… LNP \n5 234 Wills VIC 36452 Peter KHALIL Austral… ALP \n6 316 Wright QLD 37500 Scott BUCHHOLZ Liberal… LNP \n\n\nWe need to clean the data so that we can use it. 
We are trying to make it similar to the dataset that we thought we wanted in the planning stage. While it is fine to move away from the plan, this needs to be a deliberate, reasoned decision. After reading in the dataset that we saved, the first thing that we will do is adjust the names of the variables. We will do this using clean_names() from janitor.\n\n#### Basic cleaning ####\nraw_elections_data <-\n read_csv(\n file = \"australian_voting.csv\",\n show_col_types = FALSE\n )\n\n\n# Make the names easier to type\ncleaned_elections_data <-\n clean_names(raw_elections_data)\n\n# Have a look at the first six rows\nhead(cleaned_elections_data)\n\n# A tibble: 6 × 8\n division_id division_nm state_ab candidate_id given_nm surname party_nm \n <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> \n1 179 Adelaide SA 36973 Steve GEORGANAS Australian …\n2 197 Aston VIC 36704 Alan TUDGE Liberal \n3 198 Ballarat VIC 36409 Catherine KING Australian …\n4 103 Banks NSW 37018 David COLEMAN Liberal \n5 180 Barker SA 37083 Tony PASIN Liberal \n6 104 Barton NSW 36820 Linda BURNEY Australian …\n# ℹ 1 more variable: party_ab <chr>\n\n\nThe names are faster to type because RStudio will auto-complete them. To do this, we begin typing the name of a variable and then use the “tab” key to complete it.\nThere are many variables in the dataset, and we are primarily interested in two: “division_nm” and “party_nm”. We can choose certain variables of interest with select() from dplyr which we loaded as part of the tidyverse. The “pipe operator”, |>, pushes the output of one line to be the first input of the function on the next line.\n\ncleaned_elections_data <-\n cleaned_elections_data |>\n select(\n division_nm,\n party_nm\n )\n\nhead(cleaned_elections_data)\n\n# A tibble: 6 × 2\n division_nm party_nm \n <chr> <chr> \n1 Adelaide Australian Labor Party\n2 Aston Liberal \n3 Ballarat Australian Labor Party\n4 Banks Liberal \n5 Barker Liberal \n6 Barton Australian Labor Party\n\n\nSome of the variable names are still not obvious because they are abbreviated. We can look at the names of the columns in this dataset with names(). 
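Because of the pipe, any such call can be written in two equivalent ways; a minimal sketch using only base R:\n\n# These two lines do the same thing\nsort(c(2, 3, 1))\nc(2, 3, 1) |> sort()\n\nSo names(cleaned_elections_data) could equally be written as cleaned_elections_data |> names(). 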
And we can change the names using rename() from dplyr.\n\nnames(cleaned_elections_data)\n\n[1] \"division_nm\" \"party_nm\" \n\n\n\ncleaned_elections_data <-\n cleaned_elections_data |>\n rename(\n division = division_nm,\n elected_party = party_nm\n )\n\nhead(cleaned_elections_data)\n\n# A tibble: 6 × 2\n division elected_party \n <chr> <chr> \n1 Adelaide Australian Labor Party\n2 Aston Liberal \n3 Ballarat Australian Labor Party\n4 Banks Liberal \n5 Barker Liberal \n6 Barton Australian Labor Party\n\n\nWe could now look at the unique values in the “elected_party” column using unique().\n\ncleaned_elections_data$elected_party |>\n unique()\n\n[1] \"Australian Labor Party\" \n[2] \"Liberal\" \n[3] \"Liberal National Party of Queensland\"\n[4] \"The Greens\" \n[5] \"The Nationals\" \n[6] \"Independent\" \n[7] \"Katter's Australian Party (KAP)\" \n[8] \"Centre Alliance\" \n\n\nAs there is more detail in this than we wanted, we may want to simplify the party names to match what we simulated, using case_match() from dplyr.\n\ncleaned_elections_data <-\n cleaned_elections_data |>\n mutate(\n elected_party =\n case_match(\n elected_party,\n \"Australian Labor Party\" ~ \"Labor\",\n \"Liberal National Party of Queensland\" ~ \"Liberal\",\n \"Liberal\" ~ \"Liberal\",\n \"The Nationals\" ~ \"Nationals\",\n \"The Greens\" ~ \"Greens\",\n \"Independent\" ~ \"Other\",\n \"Katter's Australian Party (KAP)\" ~ \"Other\",\n \"Centre Alliance\" ~ \"Other\"\n )\n )\n\nhead(cleaned_elections_data)\n\n# A tibble: 6 × 2\n division elected_party\n <chr> <chr> \n1 Adelaide Labor \n2 Aston Liberal \n3 Ballarat Labor \n4 Banks Liberal \n5 Barker Liberal \n6 Barton Labor \n\n\nOur data now matches our plan (Figure 2.2 (a)). For every electoral division we have the party of the person that won it.\nHaving now nicely cleaned the dataset, we should save it, so that we can start with that cleaned dataset in the next stage. We should make sure to save it under a new file name so we are not replacing the raw data, and so that it is easy to identify the cleaned dataset later.\n\nwrite_csv(\n x = cleaned_elections_data,\n file = \"cleaned_elections_data.csv\"\n)\n\n\n\n2.2.4 Explore\nWe may like to explore the dataset that we created. One way to better understand a dataset is to make a graph. In particular, here we would like to build the graph that we planned in Figure 2.2 (b).\nFirst, we read in the dataset that we just created.\n\n#### Read in the data ####\ncleaned_elections_data <-\n read_csv(\n file = \"cleaned_elections_data.csv\",\n show_col_types = FALSE\n )\n\nWe can get a quick count of how many seats each party won using count() from dplyr.\n\ncleaned_elections_data |>\n count(elected_party)\n\n# A tibble: 5 × 2\n elected_party n\n <chr> <int>\n1 Greens 4\n2 Labor 77\n3 Liberal 48\n4 Nationals 10\n5 Other 12\n\n\nTo build the graph that we are interested in, we use ggplot2 which is part of the tidyverse. The key aspect of this package is that we build graphs by adding layers using “+”, which we call the “add operator”. 
In particular we will create a bar chart using geom_bar() from ggplot2 (Figure 2.5 (a)).\n\ncleaned_elections_data |>\n ggplot(aes(x = elected_party)) + # aes abbreviates \"aesthetics\" \n geom_bar()\n\ncleaned_elections_data |>\n ggplot(aes(x = elected_party)) +\n geom_bar() +\n theme_minimal() + # Make the theme neater\n labs(x = \"Party\", y = \"Number of seats\") # Make labels more meaningful\n\n\n\n\n\n\n\n\n\n\n\n(a) Default options\n\n\n\n\n\n\n\n\n\n\n\n(b) Improved theme and labels\n\n\n\n\n\n\n\nFigure 2.5: Number of seats won, by political party, at the 2022 Australian Federal Election\n\n\n\nFigure 2.5 (a) accomplishes what we set out to do. But we can make it look a bit nicer by modifying the default options and improving the labels (Figure 2.5 (b)).\n\n\n2.2.5 Share\nTo this point we have downloaded some data, cleaned it, and made a graph. We would typically need to communicate what we have done at some length. In this case, we can write a few paragraphs about what we did, why we did it, and what we found to conclude our workflow. An example follows.\n\nAustralia is a parliamentary democracy with 151 seats in the House of Representatives, which is the house from which government is formed. There are two major parties—“Liberal” and “Labor”—two minor parties—“Nationals” and “Greens”—and many smaller parties. The 2022 Federal Election occurred on 21 May, and around 15 million votes were cast. We were interested in the number of seats that were won by each party.\nWe downloaded the results, on a seat-specific basis, from the Australian Electoral Commission website. We cleaned and tidied the dataset using the statistical programming language R (R Core Team 2023) including the tidyverse (Wickham et al. 2019) and janitor (Firke 2023). We then created a graph of the number of seats that each political party won (Figure 2.5).\nWe found that the Labor Party won 77 seats, followed by the Liberal Party with 48 seats. The minor parties won the following number of seats: the Nationals won 10 seats and the Greens won 4 seats. Finally, there were 10 Independents elected as well as candidates from smaller parties.\nThe distribution of seats is skewed toward the two major parties, which could reflect relatively stable preferences on the part of Australian voters, or possibly inertia due to the benefits of already being a major party, such as a national network or funding. A better understanding of the reasons for this distribution is of interest in future work. While the dataset consists of everyone who voted, it is worth noting that in Australia some are systematically excluded from voting, and it is much more difficult for some to vote than others.\n\nOne aspect to be especially concerned with is making sure that this communication is focused on the needs of the audience and telling a story. Data journalism provides some excellent examples of how analysis needs to be tailored to the audience, for instance, Cardoso (2020) and Bronner (2020).",
"crumbs": [
"Foundations",
"2 Drinking from a fire hose "
@@ -500,7 +500,7 @@
"href": "05-static_communication.html#exercises",
"title": "5 Static communication",
"section": "5.6 Exercises",
- "text": "5.6 Exercises\n\nPractice\n\n(Plan) Consider the following scenario: Three friends—Edward, Hugo, and Lucy—each measure the height of 20 of their friends. Each of the three use a slightly different approach to measurement and so make slightly different errors. Please sketch what that dataset could look like and then sketch a graph that you could build to show all observations.\n(Simulate) Please further consider the scenario described and simulate the situation with every variable independent of each other. Please include three tests based on the simulated data.\n(Acquire) Please describe a possible source of such a dataset.\n(Explore) Please use ggplot2 to build the graph that you sketched using the data that you simulated.\n(Communicate) Please write two paragraphs about what you did.\n\n\n\nQuiz\n\nAssume the tidyverse and datasauRus are installed and loaded. What would be the outcome of the following code?\n\nFour vertical lines.\nFive vertical lines.\nThree vertical lines.\nTwo vertical lines.\n\n\n\ndatasaurus_dozen |> \n filter(dataset == \"v_lines\") |> \n ggplot(aes(x=x, y=y)) + \n geom_point()\n\n\nWhich theme does not have solid lines along the x and y axes (pick one)?\n\ntheme_minimal()\ntheme_classic()\ntheme_bw()\n\nAssume the tidyverse and the beps dataset as generated in this chapter have been installed and loaded. Which argument should be added to geom_bar() in the following code to make the bars for the different parties be next to each other rather than on top of each other?\n\nposition = \"side_by_side\"\nposition = \"dodge2\"\nposition = \"adjacent\"\nposition = \"closest\"\n\n\n\nbeps |> \n ggplot(mapping = aes(x = age, fill = vote)) + \n geom_bar()\n\n\nIn the code below, what should be added to labs() to change the text of the legend?\n\ncolor = \"Voted for\"\nlegend = \"Voted for\"\nscale = \"Voted for\"\nfill = \"Voted for\"\n\n\n\nbeps |>\n ggplot(mapping = aes(x = age, fill = vote)) +\n geom_bar() +\n theme_minimal() +\n labs(x = \"Age of respondent\", y = \"Number of respondents\")\n\n\nBased on the help file for scale_colour_brewer() which palette diverges?\n\n“Accent”\n“RdBu”\n“GnBu”\n“Set1”\n\nBased on Vanderplas, Cook, and Hofmann (2020), which cognitive principle should be considered when creating graphs (pick one)?\n\nProximity.\nVolume estimation.\nAxial positioning.\nRelative motion.\n\nBased on Vanderplas, Cook, and Hofmann (2020), color can be used to (pick one)?\n\nIdentify magnitude.\nImprove chart design aesthetics.\nEncode categorical and continuous variables and group plot elements.\n\nWhich geom should be used to make a scatter plot?\n\ngeom_smooth()\ngeom_point()\ngeom_bar()\ngeom_dotplot()\n\nWhich of these would result in the largest number of bins?\n\ngeom_histogram(binwidth = 5)\ngeom_histogram(binwidth = 2)\n\nSuppose there is a dataset that contains the heights of 100 birds, each from one of three different species. 
If we are interested in understanding the distribution of these heights, then in a paragraph or two, please explain which type of graph should be used and why.\nWould this code data |> ggplot(aes(x = col_one)) |> geom_point() work if we assume the dataset and columns exist (pick one)?\n\nYes.\nNo.\n\nWhich of the following, if any, are elements of the layered grammar of graphics of Wickham (2010) (select all that apply)?\n\nA default dataset and set of mappings from variables to aesthetics.\nOne or more layers, with each layer having one geometric object, one statistical transformation, one position adjustment, and optionally, one dataset and set of aesthetic mappings.\nColors that enable the reader to understand the main point.\nA coordinate system.\nThe facet specification.\nOne scale for each aesthetic mapping used.\n\nWhich function from modelsummary could we use to create a table of descriptive statistics?\n\ndatasummary_descriptive()\ndatasummary_skim()\ndatasummary_crosstab()\ndatasummary_balance()\n\nWhat is the primary reason for always plotting data (pick one)?\n\nTo check for missing values.\nTo ensure the data are normal.\nTo reveal underlying patterns and structures.\n\nWhich ggplot2 geom is primarily used to create bar charts when you have already computed counts or frequencies (pick one)?\n\ngeom_bar()\ngeom_col()\ngeom_histogram()\ngeom_line()\n\nIn ggplot2, what is the purpose of using facets in a plot (pick one)?\n\nTo change the color scheme of the plot.\nTo add labels to data points.\nTo create multiple plots split by the values of one or more variables.\nTo adjust the transparency of the points.\n\nWhen creating a bar chart with ggplot2, which aesthetic is typically mapped to a categorical variable to fill bars with different colors (pick one)?\n\nx\ny\nfill\nsize\n\nWhat is the effect of adding position = \"dodge\" or position = \"dodge2\" to a geom_bar() in ggplot2 (pick one)?\n\nIt stacks the bars on top of each other.\nIt places the bars side by side for each group.\nIt changes the bar colors to grayscale.\nIt adds transparency to the bars.\n\nIn the context of ggplot2, what is the primary difference between geom_point() and geom_jitter() (pick one)?\n\ngeom_jitter() adds random noise to points to reduce overplotting.\ngeom_point() plots points, geom_jitter() plots lines.\ngeom_point() adds transparency, geom_jitter() does not.\ngeom_jitter() is used for continuous data, geom_point() for categorical data.\n\nWhich ggplot2 geom would you use to add a line of best fit to a scatterplot (pick one)?\n\ngeom_line()\ngeom_bar()\ngeom_histogram()\ngeom_smooth()\n\nWhat argument would you use in geom_smooth() to specify a linear model without standard errors (pick one)?\n\nmethod = lm, se = FALSE\nmodel = linear, error = FALSE\ntype = \"linear\", ci = FALSE\nfit = lm, show_se = FALSE\n\nIn ggplot2, which geom is designed for creating histograms (pick one)?\n\ngeom_histogram()\ngeom_bar()\ngeom_col()\ngeom_density()\n\nWhen creating a histogram, adjusting the number of bins or binwidth affects:\n\nThe overall shape of the underlying data distribution.\nThe size of the data points.\nThe colors used in the plot.\nThe labels on the x-axis.\n\nWhat is one disadvantage of using boxplots (pick one)?\n\nThey are too colorful.\nThey hide the underlying distribution of the data.\nThey take too long to compute.\nThey cannot show outliers.\n\nWhich of the following is a recommended way to enhance a boxplot to show more information about the data (pick one)?\n\nIncrease the box 
width.\nOverlay the actual data points using geom_jitter().\nChange the box color to gradient.\nRemove the whiskers from the boxplot.\n\nIn the context of ggplot2, what does stat_ecdf() compute (pick one)?\n\nA histogram.\nA scatterplot with error bars.\nA boxplot.\nA cumulative distribution function.\n\nWhat is geocoding in the context of mapping data (pick one)?\n\nThe process of converting latitude and longitude into place names.\nThe process of selecting a map projection.\nThe process of drawing map boundaries.\nThe process of converting place names into latitude and longitude coordinates.\n\n\n\n\nActivity\nPlease create a graph using ggplot2 and a map using ggmap and add explanatory text to accompany both. Be sure to include cross-references and captions, etc. Each of these should take about pages.\nThen, with regard the graph you created, please reflect on Vanderplas, Cook, and Hofmann (2020). Add a few paragraphs about the different options that you considered to make the graph more effective.\nAnd finally, with regard to the map that you created, please reflect on the following quote from Heather Krause, founder of We All Count: “maps only show people who aren’t invisible to the makers” as well as Chapter 3 from D’Ignazio and Klein (2020) and add a few paragraphs related to this.\nUse Quarto, and include an appropriate title, author, date, and citations. Submit a PDF.\n\n\nPaper\nAt about this point the Mawson Paper from Online Appendix E would be appropriate.\n\n\n\n\nAndersen, Robert, and David Armstrong. 2021. Presenting Statistical Results Effectively. London: Sage.\n\n\nArel-Bundock, Vincent. 2021. WDI: World Development Indicators and Other World Bank Data. https://CRAN.R-project.org/package=WDI.\n\n\n———. 2022. “modelsummary: Data and Model Summaries in R.” Journal of Statistical Software 103 (1): 1–23. https://doi.org/10.18637/jss.v103.i01.\n\n\nArmstrong, Zan. 2022. “Stop Aggregating Away the Signal in Your Data.” The Overflow, March. https://stackoverflow.blog/2022/03/03/stop-aggregating-away-the-signal-in-your-data/.\n\n\nArnold, Jeffrey. 2021. ggthemes: Extra Themes, Scales and Geoms for “ggplot2”. https://CRAN.R-project.org/package=ggthemes.\n\n\nBecker, Richard, Allan Wilks, Ray Brownrigg, Thomas Minka, and Alex Deckmyn. 2022. maps: Draw Geographical Maps. https://CRAN.R-project.org/package=maps.\n\n\nBethlehem, R. A. I., J. Seidlitz, S. R. White, J. W. Vogel, K. M. Anderson, C. Adamson, S. Adler, et al. 2022. “Brain Charts for the Human Lifespan.” Nature 604 (7906): 525–33. https://doi.org/10.1038/s41586-022-04554-y.\n\n\nBrewer, Cynthia. 2015. Designing Better Maps: A Guide for GIS Users. 2nd ed.\n\n\nBronner, Laura. 2021. “Quantitative Editing.” YouTube, June. https://youtu.be/LI5m9RzJgWc.\n\n\nCambon, Jesse, and Christopher Belanger. 2021. “tidygeocoder: Geocoding Made Easy.” Zenodo. https://doi.org/10.5281/zenodo.3981510.\n\n\nChase, William. 2020. “The Glamour of Graphics.” RStudio Conference, January. https://posit.co/resources/videos/the-glamour-of-graphics/.\n\n\nCleveland, William. (1985) 1994. The Elements of Graphing Data. 2nd ed. New Jersey: Hobart Press.\n\n\nD’Ignazio, Catherine, and Lauren Klein. 2020. Data Feminism. Massachusetts: The MIT Press. https://data-feminism.mitpress.mit.edu.\n\n\nDavies, Rhian, Steph Locke, and Lucy D’Agostino McGowan. 2022. datasauRus: Datasets from the Datasaurus Dozen. https://CRAN.R-project.org/package=datasauRus.\n\n\nDenby, Lorraine, and Colin Mallows. 2009. 
“Variations on the Histogram.” Journal of Computational and Graphical Statistics 18 (1): 21–31. https://doi.org/10.1198/jcgs.2009.0002.\n\n\nFirke, Sam. 2023. janitor: Simple Tools for Examining and Cleaning Dirty Data. https://CRAN.R-project.org/package=janitor.\n\n\nFlynn, Michael. 2022. troopdata: Tools for Analyzing Cross-National Military Deployment and Basing Data. https://CRAN.R-project.org/package=troopdata.\n\n\nFox, John, and Robert Andersen. 2006. “Effect Displays for Multinomial and Proportional-Odds Logit Models.” Sociological Methodology 36 (1): 225–55. https://doi.org/10.1111/j.1467-9531.2006.00180.\n\n\nFox, John, Sanford Weisberg, and Brad Price. 2022. carData: Companion to Applied Regression Data Sets. https://CRAN.R-project.org/package=carData.\n\n\nFranconeri, Steven, Lace Padilla, Priti Shah, Jeffrey Zacks, and Jessica Hullman. 2021. “The Science of Visual Data Communication: What Works.” Psychological Science in the Public Interest 22 (3): 110–61. https://doi.org/10.1177/15291006211051956.\n\n\nFriendly, Michael, and Howard Wainer. 2021. A History of Data Visualization and Graphic Communication. 1st ed. Massachusetts: Harvard University Press.\n\n\nFunkhouser, Gray. 1937. “Historical Development of the Graphical Representation of Statistical Data.” Osiris 3: 269–404. https://doi.org/10.1086/368480.\n\n\nGarnier, Simon, Noam Ross, Robert Rudis, Antônio Camargo, Marco Sciaini, and Cédric Scherer. 2021. viridis – Colorblind-Friendly Color Maps for R. https://doi.org/10.5281/zenodo.4679424.\n\n\nGelfand, Sharla. 2022. opendatatoronto: Access the City of Toronto Open Data Portal. https://CRAN.R-project.org/package=opendatatoronto.\n\n\nHealy, Kieran. 2018. Data Visualization. New Jersey: Princeton University Press. https://socviz.co.\n\n\nHowes, Adam. 2022. “Representing Uncertainty Using Significant Figures,” April. https://athowes.github.io/posts/2022-04-24-representing-uncertainty-using-significant-figures/.\n\n\nKahle, David, and Hadley Wickham. 2013. “ggmap: Spatial Visualization with ggplot2.” The R Journal 5 (1): 144–61. http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf.\n\n\nKarsten, Karl. 1923. Charts and Graphs. New York: Prentice-Hall.\n\n\nKuznets, Simon, Lillian Epstein, and Elizabeth Jenks. 1941. National Income and Its Composition, 1919-1938. National Bureau of Economic Research.\n\n\nMcIlroy, Doug, Ray Brownrigg, Thomas Minka, and Roger Bivand. 2023. mapproj: Map Projections. https://CRAN.R-project.org/package=mapproj.\n\n\nMiller, Greg. 2014. “The Cartographer Who’s Transforming Map Design.” Wired, October. https://www.wired.com/2014/10/cindy-brewer-map-design/.\n\n\nMoyer, Brian, and Abe Dunn. 2020. “Measuring the Gross Domestic Product (GDP): The Ultimate Data Science Project.” Harvard Data Science Review 2 (1). https://doi.org/10.1162/99608f92.414caadb.\n\n\nNeuwirth, Erich. 2022. RColorBrewer: ColorBrewer Palettes. https://CRAN.R-project.org/package=RColorBrewer.\n\n\nOECD. 2014. “The Essential Macroeconomic Aggregates.” In Understanding National Accounts, 13–46. OECD. https://doi.org/10.1787/9789264214637-2-en.\n\n\nPedersen, Thomas Lin. 2022. patchwork: The Composer of Plots. https://CRAN.R-project.org/package=patchwork.\n\n\nPhillips, Alban. 1958. “The Relation Between Unemployment and the Rate of Change of Money Wage Rates in the United Kingdom, 1861-1957.” Economica 25 (100): 283–99. https://doi.org/10.1111/j.1468-0335.1958.tb00003.x.\n\n\nR Core Team. 2023. R: A Language and Environment for Statistical Computing. 
Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.\n\n\nRudis, Bob. 2020. hrbrthemes: Additional Themes, Theme Components and Utilities for “ggplot2”. https://CRAN.R-project.org/package=hrbrthemes.\n\n\nSpear, Mary Eleanor. 1952. Charting Statistics. https://archive.org/details/ChartingStatistics_201801/.\n\n\nTukey, John. 1977. Exploratory Data Analysis.\n\n\nVanderplas, Susan, Dianne Cook, and Heike Hofmann. 2020. “Testing Statistical Charts: What Makes a Good Graph?” Annual Review of Statistics and Its Application 7: 61–88. https://doi.org/10.1146/annurev-statistics-031219-041252.\n\n\nWasserman, Larry. 2005. All of Statistics. Springer.\n\n\nWeissgerber, Tracey, Natasa Milic, Stacey Winham, and Vesna Garovic. 2015. “Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm.” PLoS Biology 13 (4): e1002128. https://doi.org/10.1371/journal.pbio.1002128.\n\n\nWickham, Hadley. 2010. “A Layered Grammar of Graphics.” Journal of Computational and Graphical Statistics 19 (1): 3–28. https://doi.org/10.1198/jcgs.2009.07098.\n\n\nWickham, Hadley, Mara Averick, Jenny Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.\n\n\nWickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. (2016) 2023. R for Data Science. 2nd ed. O’Reilly Media. https://r4ds.hadley.nz.\n\n\nWickham, Hadley, and Lisa Stryjewski. 2011. “40 Years of Boxplots,” November. https://vita.had.co.nz/papers/boxplots.pdf.\n\n\nWilkinson, Leland. 2005. The Grammar of Graphics. 2nd ed. Springer.\n\n\nXie, Yihui. 2023. knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.",
+ "text": "5.6 Exercises\n\nPractice\n\n(Plan) Consider the following scenario: Three friends—Edward, Hugo, and Lucy—each measure the height of 20 of their friends. Each of the three use a slightly different approach to measurement and so make slightly different errors. Please sketch what that dataset could look like and then sketch a graph that you could build to show all observations.\n(Simulate) Please further consider the scenario described and simulate the situation with every variable independent of each other. Please include three tests based on the simulated data.\n(Acquire) Please describe a possible source of such a dataset.\n(Explore) Please use ggplot2 to build the graph that you sketched using the data that you simulated.\n(Communicate) Please write two paragraphs about what you did.\n\n\n\nQuiz\n\nAssume the tidyverse and datasauRus are installed and loaded. What would be the outcome of the following code?\n\nFour vertical lines.\nFive vertical lines.\nThree vertical lines.\nTwo vertical lines.\n\n\n\ndatasaurus_dozen |> \n filter(dataset == \"v_lines\") |> \n ggplot(aes(x=x, y=y)) + \n geom_point()\n\n\nWhich theme does not have solid lines along the x and y axes (pick one)?\n\ntheme_minimal()\ntheme_classic()\ntheme_bw()\n\nAssume the tidyverse and the beps dataset as generated in this chapter have been installed and loaded. Which argument should be added to geom_bar() in the following code to make the bars for the different parties be next to each other rather than on top of each other?\n\nposition = \"side_by_side\"\nposition = \"dodge2\"\nposition = \"adjacent\"\nposition = \"closest\"\n\n\n\nbeps |> \n ggplot(mapping = aes(x = age, fill = vote)) + \n geom_bar()\n\n\nIn the code below, what should be added to labs() to change the text of the legend?\n\ncolor = \"Voted for\"\nlegend = \"Voted for\"\nscale = \"Voted for\"\nfill = \"Voted for\"\n\n\n\nbeps |>\n ggplot(mapping = aes(x = age, fill = vote)) +\n geom_bar() +\n theme_minimal() +\n labs(x = \"Age of respondent\", y = \"Number of respondents\")\n\n\nBased on the help file for scale_colour_brewer() which palette diverges?\n\n“Accent”\n“RdBu”\n“GnBu”\n“Set1”\n\nBased on Vanderplas, Cook, and Hofmann (2020), which cognitive principle should be considered when creating graphs (pick one)?\n\nProximity.\nVolume estimation.\nAxial positioning.\nRelative motion.\n\nBased on Vanderplas, Cook, and Hofmann (2020), color can be used to (pick one)?\n\nIdentify magnitude.\nImprove chart design aesthetics.\nEncode categorical and continuous variables and group plot elements.\n\nWhich geom should be used to make a scatter plot?\n\ngeom_smooth()\ngeom_point()\ngeom_bar()\ngeom_dotplot()\n\nWhich of these would result in the largest number of bins?\n\ngeom_histogram(binwidth = 5)\ngeom_histogram(binwidth = 2)\n\nSuppose there is a dataset that contains the heights of 100 birds, each from one of three different species. 
If we are interested in understanding the distribution of these heights, then in a paragraph or two, please explain which type of graph should be used and why.\nWould this code data |> ggplot(aes(x = col_one)) |> geom_point() work if we assume the dataset and columns exist (pick one)?\n\nYes.\nNo.\n\nWhich of the following, if any, are elements of the layered grammar of graphics of Wickham (2010) (select all that apply)?\n\nA default dataset and set of mappings from variables to aesthetics.\nOne or more layers, with each layer having one geometric object, one statistical transformation, one position adjustment, and optionally, one dataset and set of aesthetic mappings.\nColors that enable the reader to understand the main point.\nA coordinate system.\nThe facet specification.\nOne scale for each aesthetic mapping used.\n\nWhich function from modelsummary could we use to create a table of descriptive statistics?\n\ndatasummary_descriptive()\ndatasummary_skim()\ndatasummary_crosstab()\ndatasummary_balance()\n\nWhat is the primary reason for always plotting data (pick one)?\n\nTo check for missing values.\nTo ensure the data are normal.\nTo reveal underlying patterns and structures.\n\nWhich ggplot2 geom is primarily used to create bar charts when you have already computed counts or frequencies (pick one)?\n\ngeom_bar()\ngeom_col()\ngeom_histogram()\ngeom_line()\n\nIn ggplot2, what is the purpose of using facets in a plot (pick one)?\n\nTo change the color scheme of the plot.\nTo add labels to data points.\nTo create multiple plots split by the values of one or more variables.\nTo adjust the transparency of the points.\n\nWhen creating a bar chart with ggplot2, which aesthetic is typically mapped to a categorical variable to fill bars with different colors (pick one)?\n\nx\ny\nfill\nsize\n\nWhat is the effect of adding position = \"dodge\" or position = \"dodge2\" to a geom_bar() in ggplot2 (pick one)?\n\nIt stacks the bars on top of each other.\nIt places the bars side by side for each group.\nIt changes the bar colors to grayscale.\nIt adds transparency to the bars.\n\nIn the context of ggplot2, what is the primary difference between geom_point() and geom_jitter() (pick one)?\n\ngeom_jitter() adds random noise to points to reduce overplotting.\ngeom_point() plots points, geom_jitter() plots lines.\ngeom_point() adds transparency, geom_jitter() does not.\ngeom_jitter() is used for continuous data, geom_point() for categorical data.\n\nWhich ggplot2 geom would you use to add a line of best fit to a scatterplot (pick one)?\n\ngeom_line()\ngeom_bar()\ngeom_histogram()\ngeom_smooth()\n\nWhat argument would you use in geom_smooth() to specify a linear model without standard errors (pick one)?\n\nmethod = lm, se = FALSE\nmodel = linear, error = FALSE\ntype = \"linear\", ci = FALSE\nfit = lm, show_se = FALSE\n\nIn ggplot2, which geom is designed for creating histograms (pick one)?\n\ngeom_histogram()\ngeom_bar()\ngeom_col()\ngeom_density()\n\nWhat does adjusting the number of bins, or changing the binwidth, affect for a histogram (pick one)?\n\nThe overall shape of the underlying data distribution.\nThe size of the data points.\nThe colors used in the plot.\nThe labels on the x-axis.\n\nWhat is one disadvantage of using boxplots (pick one)?\n\nThey are too colorful.\nThey hide the underlying distribution of the data.\nThey take too long to compute.\nThey cannot show outliers.\n\nHow can you deal with that disadvantage (pick one)?\n\nIncrease the box width.\nOverlay the actual data points using 
geom_jitter().\nAdd colors for each category.\nRemove the whiskers from the boxplot.\n\nWhat does stat_ecdf() compute (pick one)?\n\nA histogram.\nA scatterplot with error bars.\nA boxplot.\nA cumulative distribution function.\n\nWhat is geocoding (pick one)?\n\nConverting latitude and longitude into place names.\nPicking a map projection.\nDrawing map boundaries.\nConverting place names into latitude and longitude.\n\n\n\n\nActivity\nPlease create a graph using ggplot2 and a map using ggmap and add explanatory text to accompany both. Be sure to include cross-references and captions, etc. Each of these should take about pages.\nThen, with regard to the graph you created, please reflect on Vanderplas, Cook, and Hofmann (2020). Add a few paragraphs about the different options that you considered to make the graph more effective.\nAnd finally, with regard to the map that you created, please reflect on the following quote from Heather Krause, founder of We All Count: “maps only show people who aren’t invisible to the makers” as well as Chapter 3 from D’Ignazio and Klein (2020) and add a few paragraphs related to this.\nUse Quarto, and include an appropriate title, author, date, and citations. Submit a PDF.\n\n\nPaper\nAt about this point the Mawson Paper from Online Appendix E would be appropriate.\n\n\n\n\nAndersen, Robert, and David Armstrong. 2021. Presenting Statistical Results Effectively. London: Sage.\n\n\nArel-Bundock, Vincent. 2021. WDI: World Development Indicators and Other World Bank Data. https://CRAN.R-project.org/package=WDI.\n\n\n———. 2022. “modelsummary: Data and Model Summaries in R.” Journal of Statistical Software 103 (1): 1–23. https://doi.org/10.18637/jss.v103.i01.\n\n\nArmstrong, Zan. 2022. “Stop Aggregating Away the Signal in Your Data.” The Overflow, March. https://stackoverflow.blog/2022/03/03/stop-aggregating-away-the-signal-in-your-data/.\n\n\nArnold, Jeffrey. 2021. ggthemes: Extra Themes, Scales and Geoms for “ggplot2”. https://CRAN.R-project.org/package=ggthemes.\n\n\nBecker, Richard, Allan Wilks, Ray Brownrigg, Thomas Minka, and Alex Deckmyn. 2022. maps: Draw Geographical Maps. https://CRAN.R-project.org/package=maps.\n\n\nBethlehem, R. A. I., J. Seidlitz, S. R. White, J. W. Vogel, K. M. Anderson, C. Adamson, S. Adler, et al. 2022. “Brain Charts for the Human Lifespan.” Nature 604 (7906): 525–33. https://doi.org/10.1038/s41586-022-04554-y.\n\n\nBrewer, Cynthia. 2015. Designing Better Maps: A Guide for GIS Users. 2nd ed.\n\n\nBronner, Laura. 2021. “Quantitative Editing.” YouTube, June. https://youtu.be/LI5m9RzJgWc.\n\n\nCambon, Jesse, and Christopher Belanger. 2021. “tidygeocoder: Geocoding Made Easy.” Zenodo. https://doi.org/10.5281/zenodo.3981510.\n\n\nChase, William. 2020. “The Glamour of Graphics.” RStudio Conference, January. https://posit.co/resources/videos/the-glamour-of-graphics/.\n\n\nCleveland, William. (1985) 1994. The Elements of Graphing Data. 2nd ed. New Jersey: Hobart Press.\n\n\nD’Ignazio, Catherine, and Lauren Klein. 2020. Data Feminism. Massachusetts: The MIT Press. https://data-feminism.mitpress.mit.edu.\n\n\nDavies, Rhian, Steph Locke, and Lucy D’Agostino McGowan. 2022. datasauRus: Datasets from the Datasaurus Dozen. https://CRAN.R-project.org/package=datasauRus.\n\n\nDenby, Lorraine, and Colin Mallows. 2009. “Variations on the Histogram.” Journal of Computational and Graphical Statistics 18 (1): 21–31. https://doi.org/10.1198/jcgs.2009.0002.\n\n\nFirke, Sam. 2023. janitor: Simple Tools for Examining and Cleaning Dirty Data. 
https://CRAN.R-project.org/package=janitor.\n\n\nFlynn, Michael. 2022. troopdata: Tools for Analyzing Cross-National Military Deployment and Basing Data. https://CRAN.R-project.org/package=troopdata.\n\n\nFox, John, and Robert Andersen. 2006. “Effect Displays for Multinomial and Proportional-Odds Logit Models.” Sociological Methodology 36 (1): 225–55. https://doi.org/10.1111/j.1467-9531.2006.00180.\n\n\nFox, John, Sanford Weisberg, and Brad Price. 2022. carData: Companion to Applied Regression Data Sets. https://CRAN.R-project.org/package=carData.\n\n\nFranconeri, Steven, Lace Padilla, Priti Shah, Jeffrey Zacks, and Jessica Hullman. 2021. “The Science of Visual Data Communication: What Works.” Psychological Science in the Public Interest 22 (3): 110–61. https://doi.org/10.1177/15291006211051956.\n\n\nFriendly, Michael, and Howard Wainer. 2021. A History of Data Visualization and Graphic Communication. 1st ed. Massachusetts: Harvard University Press.\n\n\nFunkhouser, Gray. 1937. “Historical Development of the Graphical Representation of Statistical Data.” Osiris 3: 269–404. https://doi.org/10.1086/368480.\n\n\nGarnier, Simon, Noam Ross, Robert Rudis, Antônio Camargo, Marco Sciaini, and Cédric Scherer. 2021. viridis – Colorblind-Friendly Color Maps for R. https://doi.org/10.5281/zenodo.4679424.\n\n\nGelfand, Sharla. 2022. opendatatoronto: Access the City of Toronto Open Data Portal. https://CRAN.R-project.org/package=opendatatoronto.\n\n\nHealy, Kieran. 2018. Data Visualization. New Jersey: Princeton University Press. https://socviz.co.\n\n\nHowes, Adam. 2022. “Representing Uncertainty Using Significant Figures,” April. https://athowes.github.io/posts/2022-04-24-representing-uncertainty-using-significant-figures/.\n\n\nKahle, David, and Hadley Wickham. 2013. “ggmap: Spatial Visualization with ggplot2.” The R Journal 5 (1): 144–61. http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf.\n\n\nKarsten, Karl. 1923. Charts and Graphs. New York: Prentice-Hall.\n\n\nKuznets, Simon, Lillian Epstein, and Elizabeth Jenks. 1941. National Income and Its Composition, 1919-1938. National Bureau of Economic Research.\n\n\nMcIlroy, Doug, Ray Brownrigg, Thomas Minka, and Roger Bivand. 2023. mapproj: Map Projections. https://CRAN.R-project.org/package=mapproj.\n\n\nMiller, Greg. 2014. “The Cartographer Who’s Transforming Map Design.” Wired, October. https://www.wired.com/2014/10/cindy-brewer-map-design/.\n\n\nMoyer, Brian, and Abe Dunn. 2020. “Measuring the Gross Domestic Product (GDP): The Ultimate Data Science Project.” Harvard Data Science Review 2 (1). https://doi.org/10.1162/99608f92.414caadb.\n\n\nNeuwirth, Erich. 2022. RColorBrewer: ColorBrewer Palettes. https://CRAN.R-project.org/package=RColorBrewer.\n\n\nOECD. 2014. “The Essential Macroeconomic Aggregates.” In Understanding National Accounts, 13–46. OECD. https://doi.org/10.1787/9789264214637-2-en.\n\n\nPedersen, Thomas Lin. 2022. patchwork: The Composer of Plots. https://CRAN.R-project.org/package=patchwork.\n\n\nPhillips, Alban. 1958. “The Relation Between Unemployment and the Rate of Change of Money Wage Rates in the United Kingdom, 1861-1957.” Economica 25 (100): 283–99. https://doi.org/10.1111/j.1468-0335.1958.tb00003.x.\n\n\nR Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.\n\n\nRudis, Bob. 2020. hrbrthemes: Additional Themes, Theme Components and Utilities for “ggplot2”. 
https://CRAN.R-project.org/package=hrbrthemes.\n\n\nSpear, Mary Eleanor. 1952. Charting Statistics. https://archive.org/details/ChartingStatistics_201801/.\n\n\nTukey, John. 1977. Exploratory Data Analysis.\n\n\nVanderplas, Susan, Dianne Cook, and Heike Hofmann. 2020. “Testing Statistical Charts: What Makes a Good Graph?” Annual Review of Statistics and Its Application 7: 61–88. https://doi.org/10.1146/annurev-statistics-031219-041252.\n\n\nWasserman, Larry. 2005. All of Statistics. Springer.\n\n\nWeissgerber, Tracey, Natasa Milic, Stacey Winham, and Vesna Garovic. 2015. “Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm.” PLoS Biology 13 (4): e1002128. https://doi.org/10.1371/journal.pbio.1002128.\n\n\nWickham, Hadley. 2010. “A Layered Grammar of Graphics.” Journal of Computational and Graphical Statistics 19 (1): 3–28. https://doi.org/10.1198/jcgs.2009.07098.\n\n\nWickham, Hadley, Mara Averick, Jenny Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.\n\n\nWickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. (2016) 2023. R for Data Science. 2nd ed. O’Reilly Media. https://r4ds.hadley.nz.\n\n\nWickham, Hadley, and Lisa Stryjewski. 2011. “40 Years of Boxplots,” November. https://vita.had.co.nz/papers/boxplots.pdf.\n\n\nWilkinson, Leland. 2005. The Grammar of Graphics. 2nd ed. Springer.\n\n\nXie, Yihui. 2023. knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.",
"crumbs": [
"Communication",
"5 Static communication "
@@ -797,7 +797,7 @@
"href": "09-clean_and_prepare.html#kenyan-census",
"title": "9 Clean and prepare",
"section": "9.7 2019 Kenyan census",
- "text": "9.7 2019 Kenyan census\nAs a final example, let us consider a more extensive situation and gather, clean, and prepare some data from the 2019 Kenyan census. We will focus on creating a dataset of single-year counts, by gender, for Nairobi.\nThe distribution of population by age, sex, and administrative unit from the 2019 Kenyan census can be downloaded here. While this format as a PDF makes it easy to look up a particular result, it is not overly useful if we want to model the data. In order to be able to do that, we need to convert this PDF into a tidy dataset that can be analyzed.\n\n9.7.1 Gather and clean\nWe first need to download and read in the PDF of the 2019 Kenyan census.4\n\ncensus_url <-\n paste0(\n \"https://www.knbs.or.ke/download/2019-kenya-population-and-\",\n \"housing-census-volume-iii-distribution-of-population-by-age-\",\n \"sex-and-administrative-units/?wpdmdl=5729&refresh=\",\n \"620561f1ce3ad1644519921\"\n )\n\ndownload.file(\n url = census_url,\n destfile = \"2019_Kenya_census.pdf\",\n mode = \"wb\"\n)\n\nWe can use pdf_text() from pdftools again here.\n\nkenya_census <-\n pdf_text(\n pdf = \"2019_Kenya_census.pdf\"\n )\n\nIn this example we will focus on the page of the PDF about Nairobi (Figure 9.8).\n\n\n\n\n\n\nFigure 9.8: Page from the 2019 Kenyan census about Nairobi\n\n\n\n\n9.7.1.1 Make rectangular\nThe first challenge is to get the dataset into a format that we can more easily manipulate. We will extract the relevant parts of the page. In this case, data about Nairobi is on page 410.\n\n# Focus on the page of interest\njust_nairobi <- stri_split_lines(kenya_census[[410]])[[1]]\n\n# Remove blank lines\njust_nairobi <- just_nairobi[just_nairobi != \"\"]\n\n# Remove titles, headings and other content at the top of the page\njust_nairobi <- just_nairobi[5:length(just_nairobi)]\n\n# Remove page numbers and other content at the bottom of the page\njust_nairobi <- just_nairobi[1:62]\n\n# Convert into a tibble\ndemography_data <- tibble(all = just_nairobi)\n\nAt this point the data are in a tibble. This allows us to use our familiar dplyr verbs. In particular we want to separate the columns.\n\ndemography_data <-\n demography_data |>\n mutate(all = str_squish(all)) |>\n mutate(all = str_replace(all, \"10 -14\", \"10-14\")) |>\n mutate(all = str_replace(all, \"Not Stated\", \"NotStated\")) |>\n # Deal with the two column set-up\n separate(\n col = all,\n into = c(\n \"age\", \"male\", \"female\", \"total\",\n \"age_2\", \"male_2\", \"female_2\", \"total_2\"\n ),\n sep = \" \",\n remove = TRUE,\n fill = \"right\",\n extra = \"drop\"\n )\n\nThey are side by side at the moment. 
We need to instead append to the bottom.\n\ndemography_data_long <-\n rbind(\n demography_data |> select(age, male, female, total),\n demography_data |>\n select(age_2, male_2, female_2, total_2) |>\n rename(\n age = age_2,\n male = male_2,\n female = female_2,\n total = total_2\n )\n )\n\n\n# There is one row of NAs, so remove it\ndemography_data_long <-\n demography_data_long |>\n remove_empty(which = c(\"rows\"))\n\ndemography_data_long\n\n# A tibble: 123 × 4\n age male female total \n <chr> <chr> <chr> <chr> \n 1 Total 2,192,452 2,204,376 4,396,828\n 2 0 57,265 56,523 113,788 \n 3 1 56,019 54,601 110,620 \n 4 2 52,518 51,848 104,366 \n 5 3 51,115 51,027 102,142 \n 6 4 47,182 46,889 94,071 \n 7 0-4 264,099 260,888 524,987 \n 8 5 45,203 44,711 89,914 \n 9 6 43,635 44,226 87,861 \n10 7 43,507 43,655 87,162 \n# ℹ 113 more rows\n\n\nHaving got it into a rectangular format, we now need to clean the dataset to make it useful.\n\n\n9.7.1.2 Validity\nTo attain validity requires a number of steps. The first step is to make the numbers into actual numbers, rather than characters. Before we can convert the type, we need to remove anything that is not a number otherwise that cell will be converted into an NA. We first identify any values that are not numbers so that we can remove them, and distinct() is especially useful.\n\ndemography_data_long |>\n select(male, female, total) |>\n mutate(across(everything(), ~ str_remove_all(., \"[:digit:]\"))) |>\n distinct()\n\n# A tibble: 5 × 3\n male female total\n <chr> <chr> <chr>\n1 \",,\" \",,\" \",,\" \n2 \",\" \",\" \",\" \n3 \"\" \",\" \",\" \n4 \"\" \"\" \",\" \n5 \"\" \"\" \"\" \n\n\nWe need to remove commas. While we could use janitor here, it is worthwhile to at least first look at what is going on because sometimes there is odd stuff that janitor (and other packages) will not deal with in a way that we want. Nonetheless, having identified everything that needs to be removed, we can do the actual removal and convert our character column of numbers to integers.\n\ndemography_data_long <-\n demography_data_long |>\n mutate(across(c(male, female, total), ~ str_remove_all(., \",\"))) |>\n mutate(across(c(male, female, total), ~ as.integer(.)))\n\ndemography_data_long\n\n# A tibble: 123 × 4\n age male female total\n <chr> <int> <int> <int>\n 1 Total 2192452 2204376 4396828\n 2 0 57265 56523 113788\n 3 1 56019 54601 110620\n 4 2 52518 51848 104366\n 5 3 51115 51027 102142\n 6 4 47182 46889 94071\n 7 0-4 264099 260888 524987\n 8 5 45203 44711 89914\n 9 6 43635 44226 87861\n10 7 43507 43655 87162\n# ℹ 113 more rows\n\n\n\n\n9.7.1.3 Internal consistency\n\nThe census has done some of the work of putting together age-groups for us, but we want to make it easy to just focus on the counts by single-year age. As such we will add a flag as to the type of age it is: an age-group, such as “ages 0 to 5”, or a single age, such as “1”.\n\ndemography_data_long <-\n demography_data_long |>\n mutate(\n age_type = if_else(str_detect(age, \"-\"), \n \"age-group\", \n \"single-year\"),\n age_type = if_else(str_detect(age, \"Total\"), \n \"age-group\", \n age_type)\n )\n\nAt the moment, age is a character variable. We have a decision to make here. We do not want it to be a character variable (because it will not graph properly), but we do not want it to be numeric, because there is total and 100+ in there. 
For now, we will just make it into a factor, and at least that will be able to be nicely graphed.\n\ndemography_data_long <-\n demography_data_long |>\n mutate(\n age = as_factor(age)\n )\n\n\n\n\n9.7.2 Check and test\nHaving gathered and cleaned the data, we would like to run a few checks. Given the format of the data, we can check that “total” is the sum of “male” and “female”, which are the only two gender categories available.\n\ndemography_data_long |>\n mutate(\n check_sum = male + female,\n totals_match = if_else(total == check_sum, 1, 0)\n ) |>\n filter(totals_match == 0)\n\n# A tibble: 0 × 7\n# ℹ 7 variables: age <fct>, male <int>, female <int>, total <int>,\n# age_type <chr>, check_sum <int>, totals_match <dbl>\n\n\nFinally, we want to check that the single-age counts sum to the age-groups.\n\ndemography_data_long |>\n mutate(age_groups = if_else(age_type == \"age-group\", \n age, \n NA_character_)) |>\n fill(age_groups, .direction = \"up\") |>\n mutate(\n group_sum = sum(total),\n group_sum = group_sum / 2,\n difference = total - group_sum,\n .by = c(age_groups)\n ) |>\n filter(age_type == \"age-group\" & age_groups != \"Total\") |> \n head()\n\n# A tibble: 6 × 8\n age male female total age_type age_groups group_sum difference\n <fct> <int> <int> <int> <chr> <chr> <dbl> <dbl>\n1 0-4 264099 260888 524987 age-group 0-4 524987 0\n2 5-9 215230 217482 432712 age-group 5-9 432712 0\n3 10-14 185008 193542 378550 age-group 10-14 378550 0\n4 15-19 159098 192755 351853 age-group 15-19 351853 0\n5 20-24 249534 313485 563019 age-group 20-24 563019 0\n6 25-29 282703 300845 583548 age-group 25-29 583548 0\n\n\n\n\n9.7.3 Tidy-up\nNow that we are reasonably confident that everything is looking good, we can convert it to tidy format. This will make it easier to work with.\n\ndemography_data_tidy <-\n demography_data_long |>\n rename_with(~paste0(., \"_total\"), male:total) |>\n pivot_longer(cols = contains(\"_total\"), \n names_to = \"type\", \n values_to = \"number\") |>\n separate(\n col = type,\n into = c(\"gender\", \"part_of_area\"),\n sep = \"_\"\n ) |>\n select(age, age_type, gender, number)\n\nThe original purpose of cleaning this dataset was to make a table that is used by Alexander and Alkema (2022). We will return to this dataset, but just to bring this all together, we may like to make a graph of single-year counts, by gender, for Nairobi (Figure 9.9).\n\ndemography_data_tidy |>\n filter(age_type == \"single-year\") |>\n select(age, gender, number) |>\n filter(gender != \"total\") |>\n ggplot(aes(x = age, y = number, fill = gender)) +\n geom_col(aes(x = age, y = number, fill = gender), \n position = \"dodge\") +\n scale_y_continuous(labels = comma) +\n scale_x_discrete(breaks = c(seq(from = 0, to = 99, by = 5), \"100+\")) +\n theme_classic() +\n scale_fill_brewer(palette = \"Set1\") +\n labs(\n y = \"Number\",\n x = \"Age\",\n fill = \"Gender\",\n caption = \"Data source: 2019 Kenya Census\"\n ) +\n theme(legend.position = \"bottom\") +\n coord_flip()\n\n\n\n\n\n\n\nFigure 9.9: Distribution of age and gender in Nairobi in 2019, based on Kenyan census\n\n\n\n\n\nA variety of features are clear from Figure 9.9, including age-heaping, a slight difference in the ratio of male-female birth, and a substantial difference between ages 15 and 25.\nFinally, we may wish to use more informative names. For instance, in the Kenyan data example earlier we have the following column names: “area”, “age”, “gender”, and “number”. 
If we were to use our column names as contracts, then these could be: “chr_area”, “fctr_group_age”, “chr_group_gender”, and “int_group_count”.\n\ncolumn_names_as_contracts <-\n demography_data_tidy |>\n filter(age_type == \"single-year\") |>\n select(age, gender, number) |>\n rename(\n \"fctr_group_age\" = \"age\",\n \"chr_group_gender\" = \"gender\",\n \"int_group_count\" = \"number\"\n )\n\nWe can then use pointblank to set-up tests for us.\n\nagent <-\n create_agent(tbl = column_names_as_contracts) |>\n col_is_character(columns = vars(chr_group_gender)) |>\n col_is_factor(columns = vars(fctr_group_age)) |>\n col_is_integer(columns = vars(int_group_count)) |>\n col_vals_in_set(\n columns = chr_group_gender,\n set = c(\"male\", \"female\", \"total\")\n ) |>\n interrogate()\n\nagent\n\n\n\n\n\n\n\n\nPointblank Validation\n\n\n\n\n[2024-10-12|21:02:03]\n\n\ntibble column_names_as_contracts\n\n\n\n\n\n\nSTEP\nCOLUMNS\nVALUES\nTBL\nEVAL\nUNITS\nPASS\nFAIL\nW\nS\nN\nEXT\n\n\n\n\n\n\n1\n\n\n\n\ncol_is_character\n\n \n\n\n col_is_character()\n\n\n▮chr_group_gender\n\n—\n\n\n \n\n\n✓\n1\n1\n1\n0\n0\n—\n—\n—\n—\n\n\n\n\n2\n\n\n\n\ncol_is_factor\n\n \n\n\n col_is_factor()\n\n\n▮fctr_group_age\n\n—\n\n\n \n\n\n✓\n1\n1\n1\n0\n0\n—\n—\n—\n—\n\n\n\n\n3\n\n\n\n\ncol_is_integer\n\n \n\n\n col_is_integer()\n\n\n▮int_group_count\n\n—\n\n\n \n\n\n✓\n1\n1\n1\n0\n0\n—\n—\n—\n—\n\n\n\n\n4\n\n\n\n\ncol_vals_in_set\n\n \n\n\n col_vals_in_set()\n\n\n▮chr_group_gender\n\n\nmale, female, total\n\n\n\n \n\n\n✓\n306\n306\n1\n0\n0\n—\n—\n—\n—\n\n\n\n2024-10-12 21:02:03 EDT < 1 s 2024-10-12 21:02:03 EDT",
+ "text": "9.7 2019 Kenyan census\nAs a final example, let us consider a more extensive situation and gather, clean, and prepare some data from the 2019 Kenyan census. We will focus on creating a dataset of single-year counts, by gender, for Nairobi.\nThe distribution of population by age, sex, and administrative unit from the 2019 Kenyan census can be downloaded here. While this format as a PDF makes it easy to look up a particular result, it is not overly useful if we want to model the data. In order to be able to do that, we need to convert this PDF into a tidy dataset that can be analyzed.\n\n9.7.1 Gather and clean\nWe first need to download and read in the PDF of the 2019 Kenyan census.4\n\ncensus_url <-\n paste0(\n \"https://www.knbs.or.ke/download/2019-kenya-population-and-\",\n \"housing-census-volume-iii-distribution-of-population-by-age-\",\n \"sex-and-administrative-units/?wpdmdl=5729&refresh=\",\n \"620561f1ce3ad1644519921\"\n )\n\ndownload.file(\n url = census_url,\n destfile = \"2019_Kenya_census.pdf\",\n mode = \"wb\"\n)\n\nWe can use pdf_text() from pdftools again here.\n\nkenya_census <-\n pdf_text(\n pdf = \"2019_Kenya_census.pdf\"\n )\n\nIn this example we will focus on the page of the PDF about Nairobi (Figure 9.8).\n\n\n\n\n\n\nFigure 9.8: Page from the 2019 Kenyan census about Nairobi\n\n\n\n\n9.7.1.1 Make rectangular\nThe first challenge is to get the dataset into a format that we can more easily manipulate. We will extract the relevant parts of the page. In this case, data about Nairobi is on page 410.\n\n# Focus on the page of interest\njust_nairobi <- stri_split_lines(kenya_census[[410]])[[1]]\n\n# Remove blank lines\njust_nairobi <- just_nairobi[just_nairobi != \"\"]\n\n# Remove titles, headings and other content at the top of the page\njust_nairobi <- just_nairobi[5:length(just_nairobi)]\n\n# Remove page numbers and other content at the bottom of the page\njust_nairobi <- just_nairobi[1:62]\n\n# Convert into a tibble\ndemography_data <- tibble(all = just_nairobi)\n\nAt this point the data are in a tibble. This allows us to use our familiar dplyr verbs. In particular we want to separate the columns.\n\ndemography_data <-\n demography_data |>\n mutate(all = str_squish(all)) |>\n mutate(all = str_replace(all, \"10 -14\", \"10-14\")) |>\n mutate(all = str_replace(all, \"Not Stated\", \"NotStated\")) |>\n # Deal with the two column set-up\n separate(\n col = all,\n into = c(\n \"age\", \"male\", \"female\", \"total\",\n \"age_2\", \"male_2\", \"female_2\", \"total_2\"\n ),\n sep = \" \",\n remove = TRUE,\n fill = \"right\",\n extra = \"drop\"\n )\n\nThey are side by side at the moment. 
We need to instead append to the bottom.\n\ndemography_data_long <-\n rbind(\n demography_data |> select(age, male, female, total),\n demography_data |>\n select(age_2, male_2, female_2, total_2) |>\n rename(\n age = age_2,\n male = male_2,\n female = female_2,\n total = total_2\n )\n )\n\n\n# There is one row of NAs, so remove it\ndemography_data_long <-\n demography_data_long |>\n remove_empty(which = c(\"rows\"))\n\ndemography_data_long\n\n# A tibble: 123 × 4\n age male female total \n <chr> <chr> <chr> <chr> \n 1 Total 2,192,452 2,204,376 4,396,828\n 2 0 57,265 56,523 113,788 \n 3 1 56,019 54,601 110,620 \n 4 2 52,518 51,848 104,366 \n 5 3 51,115 51,027 102,142 \n 6 4 47,182 46,889 94,071 \n 7 0-4 264,099 260,888 524,987 \n 8 5 45,203 44,711 89,914 \n 9 6 43,635 44,226 87,861 \n10 7 43,507 43,655 87,162 \n# ℹ 113 more rows\n\n\nHaving got it into a rectangular format, we now need to clean the dataset to make it useful.\n\n\n9.7.1.2 Validity\nTo attain validity requires a number of steps. The first step is to make the numbers into actual numbers, rather than characters. Before we can convert the type, we need to remove anything that is not a number otherwise that cell will be converted into an NA. We first identify any values that are not numbers so that we can remove them, and distinct() is especially useful.\n\ndemography_data_long |>\n select(male, female, total) |>\n mutate(across(everything(), ~ str_remove_all(., \"[:digit:]\"))) |>\n distinct()\n\n# A tibble: 5 × 3\n male female total\n <chr> <chr> <chr>\n1 \",,\" \",,\" \",,\" \n2 \",\" \",\" \",\" \n3 \"\" \",\" \",\" \n4 \"\" \"\" \",\" \n5 \"\" \"\" \"\" \n\n\nWe need to remove commas. While we could use janitor here, it is worthwhile to at least first look at what is going on because sometimes there is odd stuff that janitor (and other packages) will not deal with in a way that we want. Nonetheless, having identified everything that needs to be removed, we can do the actual removal and convert our character column of numbers to integers.\n\ndemography_data_long <-\n demography_data_long |>\n mutate(across(c(male, female, total), ~ str_remove_all(., \",\"))) |>\n mutate(across(c(male, female, total), ~ as.integer(.)))\n\ndemography_data_long\n\n# A tibble: 123 × 4\n age male female total\n <chr> <int> <int> <int>\n 1 Total 2192452 2204376 4396828\n 2 0 57265 56523 113788\n 3 1 56019 54601 110620\n 4 2 52518 51848 104366\n 5 3 51115 51027 102142\n 6 4 47182 46889 94071\n 7 0-4 264099 260888 524987\n 8 5 45203 44711 89914\n 9 6 43635 44226 87861\n10 7 43507 43655 87162\n# ℹ 113 more rows\n\n\n\n\n9.7.1.3 Internal consistency\n\nThe census has done some of the work of putting together age-groups for us, but we want to make it easy to just focus on the counts by single-year age. As such we will add a flag as to the type of age it is: an age-group, such as “ages 0 to 5”, or a single age, such as “1”.\n\ndemography_data_long <-\n demography_data_long |>\n mutate(\n age_type = if_else(str_detect(age, \"-\"), \n \"age-group\", \n \"single-year\"),\n age_type = if_else(str_detect(age, \"Total\"), \n \"age-group\", \n age_type)\n )\n\nAt the moment, age is a character variable. We have a decision to make here. We do not want it to be a character variable (because it will not graph properly), but we do not want it to be numeric, because there is total and 100+ in there. 
For now, we will just make it into a factor, and at least that will be able to be nicely graphed.\n\ndemography_data_long <-\n demography_data_long |>\n mutate(\n age = as_factor(age)\n )\n\n\n\n\n9.7.2 Check and test\nHaving gathered and cleaned the data, we would like to run a few checks. Given the format of the data, we can check that “total” is the sum of “male” and “female”, which are the only two gender categories available.\n\ndemography_data_long |>\n mutate(\n check_sum = male + female,\n totals_match = if_else(total == check_sum, 1, 0)\n ) |>\n filter(totals_match == 0)\n\n# A tibble: 0 × 7\n# ℹ 7 variables: age <fct>, male <int>, female <int>, total <int>,\n# age_type <chr>, check_sum <int>, totals_match <dbl>\n\n\nFinally, we want to check that the single-age counts sum to the age-groups.\n\ndemography_data_long |>\n mutate(age_groups = if_else(age_type == \"age-group\", \n age, \n NA_character_)) |>\n fill(age_groups, .direction = \"up\") |>\n mutate(\n group_sum = sum(total),\n group_sum = group_sum / 2,\n difference = total - group_sum,\n .by = c(age_groups)\n ) |>\n filter(age_type == \"age-group\" & age_groups != \"Total\") |> \n head()\n\n# A tibble: 6 × 8\n age male female total age_type age_groups group_sum difference\n <fct> <int> <int> <int> <chr> <chr> <dbl> <dbl>\n1 0-4 264099 260888 524987 age-group 0-4 524987 0\n2 5-9 215230 217482 432712 age-group 5-9 432712 0\n3 10-14 185008 193542 378550 age-group 10-14 378550 0\n4 15-19 159098 192755 351853 age-group 15-19 351853 0\n5 20-24 249534 313485 563019 age-group 20-24 563019 0\n6 25-29 282703 300845 583548 age-group 25-29 583548 0\n\n\n\n\n9.7.3 Tidy-up\nNow that we are reasonably confident that everything is looking good, we can convert it to tidy format. This will make it easier to work with.\n\ndemography_data_tidy <-\n demography_data_long |>\n rename_with(~paste0(., \"_total\"), male:total) |>\n pivot_longer(cols = contains(\"_total\"), \n names_to = \"type\", \n values_to = \"number\") |>\n separate(\n col = type,\n into = c(\"gender\", \"part_of_area\"),\n sep = \"_\"\n ) |>\n select(age, age_type, gender, number)\n\nThe original purpose of cleaning this dataset was to make a table that is used by Alexander and Alkema (2022). We will return to this dataset, but just to bring this all together, we may like to make a graph of single-year counts, by gender, for Nairobi (Figure 9.9).\n\ndemography_data_tidy |>\n filter(age_type == \"single-year\") |>\n select(age, gender, number) |>\n filter(gender != \"total\") |>\n ggplot(aes(x = age, y = number, fill = gender)) +\n geom_col(aes(x = age, y = number, fill = gender), \n position = \"dodge\") +\n scale_y_continuous(labels = comma) +\n scale_x_discrete(breaks = c(seq(from = 0, to = 99, by = 5), \"100+\")) +\n theme_classic() +\n scale_fill_brewer(palette = \"Set1\") +\n labs(\n y = \"Number\",\n x = \"Age\",\n fill = \"Gender\",\n caption = \"Data source: 2019 Kenya Census\"\n ) +\n theme(legend.position = \"bottom\") +\n coord_flip()\n\n\n\n\n\n\n\nFigure 9.9: Distribution of age and gender in Nairobi in 2019, based on Kenyan census\n\n\n\n\n\nA variety of features are clear from Figure 9.9, including age-heaping, a slight difference in the ratio of male-female birth, and a substantial difference between ages 15 and 25.\nFinally, we may wish to use more informative names. For instance, in the Kenyan data example earlier we have the following column names: “area”, “age”, “gender”, and “number”. 
If we were to use our column names as contracts, then these could be: “chr_area”, “fctr_group_age”, “chr_group_gender”, and “int_group_count”.\n\ncolumn_names_as_contracts <-\n demography_data_tidy |>\n filter(age_type == \"single-year\") |>\n select(age, gender, number) |>\n rename(\n \"fctr_group_age\" = \"age\",\n \"chr_group_gender\" = \"gender\",\n \"int_group_count\" = \"number\"\n )\n\nWe can then use pointblank to set-up tests for us.\n\nagent <-\n create_agent(tbl = column_names_as_contracts) |>\n col_is_character(columns = vars(chr_group_gender)) |>\n col_is_factor(columns = vars(fctr_group_age)) |>\n col_is_integer(columns = vars(int_group_count)) |>\n col_vals_in_set(\n columns = chr_group_gender,\n set = c(\"male\", \"female\", \"total\")\n ) |>\n interrogate()\n\nagent\n\n\n\n\n\n\n\n\nPointblank Validation\n\n\n\n\n[2024-10-13|09:48:08]\n\n\ntibble column_names_as_contracts\n\n\n\n\n\n\nSTEP\nCOLUMNS\nVALUES\nTBL\nEVAL\nUNITS\nPASS\nFAIL\nW\nS\nN\nEXT\n\n\n\n\n\n\n1\n\n\n\n\ncol_is_character\n\n \n\n\n col_is_character()\n\n\n▮chr_group_gender\n\n—\n\n\n \n\n\n✓\n1\n1\n1\n0\n0\n—\n—\n—\n—\n\n\n\n\n2\n\n\n\n\ncol_is_factor\n\n \n\n\n col_is_factor()\n\n\n▮fctr_group_age\n\n—\n\n\n \n\n\n✓\n1\n1\n1\n0\n0\n—\n—\n—\n—\n\n\n\n\n3\n\n\n\n\ncol_is_integer\n\n \n\n\n col_is_integer()\n\n\n▮int_group_count\n\n—\n\n\n \n\n\n✓\n1\n1\n1\n0\n0\n—\n—\n—\n—\n\n\n\n\n4\n\n\n\n\ncol_vals_in_set\n\n \n\n\n col_vals_in_set()\n\n\n▮chr_group_gender\n\n\nmale, female, total\n\n\n\n \n\n\n✓\n306\n306\n1\n0\n0\n—\n—\n—\n—\n\n\n\n2024-10-13 09:48:08 EDT < 1 s 2024-10-13 09:48:08 EDT",
"crumbs": [
"Preparation",
"9 Clean and prepare "
@@ -951,7 +951,7 @@
"href": "11-eda.html#united-states-population-and-income-data",
"title": "11 Exploratory data analysis",
"section": "11.2 1975 United States population and income data",
- "text": "11.2 1975 United States population and income data\nAs a first example we consider US state populations as of 1975. This dataset is built into R with state.x77. Here is what the dataset looks like:\n\nus_populations <-\n state.x77 |>\n as_tibble() |>\n clean_names() |>\n mutate(state = rownames(state.x77)) |>\n select(state, population, income)\n\nus_populations\n\n# A tibble: 50 × 3\n state population income\n <chr> <dbl> <dbl>\n 1 Alabama 3615 3624\n 2 Alaska 365 6315\n 3 Arizona 2212 4530\n 4 Arkansas 2110 3378\n 5 California 21198 5114\n 6 Colorado 2541 4884\n 7 Connecticut 3100 5348\n 8 Delaware 579 4809\n 9 Florida 8277 4815\n10 Georgia 4931 4091\n# ℹ 40 more rows\n\n\nWe want to get a quick sense of the data. The first step is to have a look at the top and bottom of it with head() and tail(), then a random selection, and finally to focus on the variables and their class with glimpse(). The random selection is an important aspect, and when you use head() you should also quickly consider a random selection.\n\nus_populations |>\n head()\n\n# A tibble: 6 × 3\n state population income\n <chr> <dbl> <dbl>\n1 Alabama 3615 3624\n2 Alaska 365 6315\n3 Arizona 2212 4530\n4 Arkansas 2110 3378\n5 California 21198 5114\n6 Colorado 2541 4884\n\nus_populations |>\n tail()\n\n# A tibble: 6 × 3\n state population income\n <chr> <dbl> <dbl>\n1 Vermont 472 3907\n2 Virginia 4981 4701\n3 Washington 3559 4864\n4 West Virginia 1799 3617\n5 Wisconsin 4589 4468\n6 Wyoming 376 4566\n\nus_populations |>\n slice_sample(n = 6)\n\n# A tibble: 6 × 3\n state population income\n <chr> <dbl> <dbl>\n1 North Dakota 637 5087\n2 Maryland 4122 5299\n3 Texas 12237 4188\n4 Louisiana 3806 3545\n5 Alaska 365 6315\n6 North Carolina 5441 3875\n\nus_populations |>\n glimpse()\n\nRows: 50\nColumns: 3\n$ state <chr> \"Alabama\", \"Alaska\", \"Arizona\", \"Arkansas\", \"California\", \"…\n$ population <dbl> 3615, 365, 2212, 2110, 21198, 2541, 3100, 579, 8277, 4931, …\n$ income <dbl> 3624, 6315, 4530, 3378, 5114, 4884, 5348, 4809, 4815, 4091,…\n\n\nWe are then interested in understanding key summary statistics, such as the minimum, median, and maximum values for numeric variables with summary() from base R and the number of observations.\n\nus_populations |>\n summary()\n\n state population income \n Length:50 Min. : 365 Min. :3098 \n Class :character 1st Qu.: 1080 1st Qu.:3993 \n Mode :character Median : 2838 Median :4519 \n Mean : 4246 Mean :4436 \n 3rd Qu.: 4968 3rd Qu.:4814 \n Max. :21198 Max. :6315 \n\n\nFinally, it is especially important to understand the behavior of these key summary statistics at the limits. In particular, one approach is to randomly remove some observations and compare what happens to them. For instance, we can randomly create five datasets that differ on the basis of which observations were removed. We can then compare the summary statistics. 
If any of them are especially different, then we would want to look at the observations that were removed as they may contain observations with high influence.\n\nsample_means <- tibble(seed = c(), mean = c(), states_ignored = c())\n\nfor (i in c(1:5)) {\n set.seed(i)\n dont_get <- c(sample(x = state.name, size = 5))\n sample_means <-\n sample_means |>\n rbind(tibble(\n seed = i,\n mean =\n us_populations |>\n filter(!state %in% dont_get) |>\n summarise(mean = mean(population)) |>\n pull(),\n states_ignored = str_c(dont_get, collapse = \", \")\n ))\n}\n\nsample_means |>\n kable(\n col.names = c(\"Seed\", \"Mean\", \"Ignored states\"),\n digits = 0,\n format.args = list(big.mark = \",\"),\n booktabs = TRUE\n )\n\n\n\nTable 11.1: Comparing the mean population when different states are randomly removed\n\n\n\n\n\n\n\n\n\n\n\nSeed\nMean\nIgnored states\n\n\n\n\n1\n4,469\nArkansas, Rhode Island, Alabama, North Dakota, Minnesota\n\n\n2\n4,027\nMassachusetts, Iowa, Colorado, West Virginia, New York\n\n\n3\n4,086\nCalifornia, Idaho, Rhode Island, Oklahoma, South Carolina\n\n\n4\n4,391\nHawaii, Arizona, Connecticut, Utah, New Jersey\n\n\n5\n4,340\nAlaska, Texas, Iowa, Hawaii, South Dakota\n\n\n\n\n\n\n\n\nIn the case of the populations of US states, we know that larger states, such as California and New York, will have an out sized effect on our estimate of the mean. Table 11.1 supports that, as we can see that when we use seeds 2 and 3, there is a lower mean.",
+ "text": "11.2 1975 United States population and income data\nAs a first example we consider US state populations as of 1975. This dataset is built into R with state.x77. Here is what the dataset looks like:\n\nus_populations <-\n state.x77 |>\n as_tibble() |>\n clean_names() |>\n mutate(state = rownames(state.x77)) |>\n select(state, population, income)\n\nus_populations\n\n# A tibble: 50 × 3\n state population income\n <chr> <dbl> <dbl>\n 1 Alabama 3615 3624\n 2 Alaska 365 6315\n 3 Arizona 2212 4530\n 4 Arkansas 2110 3378\n 5 California 21198 5114\n 6 Colorado 2541 4884\n 7 Connecticut 3100 5348\n 8 Delaware 579 4809\n 9 Florida 8277 4815\n10 Georgia 4931 4091\n# ℹ 40 more rows\n\n\nWe want to get a quick sense of the data. The first step is to have a look at the top and bottom of it with head() and tail(), then a random selection, and finally to focus on the variables and their class with glimpse(). The random selection is an important aspect, and when you use head() you should also quickly consider a random selection.\n\nus_populations |>\n head()\n\n# A tibble: 6 × 3\n state population income\n <chr> <dbl> <dbl>\n1 Alabama 3615 3624\n2 Alaska 365 6315\n3 Arizona 2212 4530\n4 Arkansas 2110 3378\n5 California 21198 5114\n6 Colorado 2541 4884\n\nus_populations |>\n tail()\n\n# A tibble: 6 × 3\n state population income\n <chr> <dbl> <dbl>\n1 Vermont 472 3907\n2 Virginia 4981 4701\n3 Washington 3559 4864\n4 West Virginia 1799 3617\n5 Wisconsin 4589 4468\n6 Wyoming 376 4566\n\nus_populations |>\n slice_sample(n = 6)\n\n# A tibble: 6 × 3\n state population income\n <chr> <dbl> <dbl>\n1 Mississippi 2341 3098\n2 Oklahoma 2715 3983\n3 New Jersey 7333 5237\n4 Wisconsin 4589 4468\n5 Utah 1203 4022\n6 Kentucky 3387 3712\n\nus_populations |>\n glimpse()\n\nRows: 50\nColumns: 3\n$ state <chr> \"Alabama\", \"Alaska\", \"Arizona\", \"Arkansas\", \"California\", \"…\n$ population <dbl> 3615, 365, 2212, 2110, 21198, 2541, 3100, 579, 8277, 4931, …\n$ income <dbl> 3624, 6315, 4530, 3378, 5114, 4884, 5348, 4809, 4815, 4091,…\n\n\nWe are then interested in understanding key summary statistics, such as the minimum, median, and maximum values for numeric variables with summary() from base R and the number of observations.\n\nus_populations |>\n summary()\n\n state population income \n Length:50 Min. : 365 Min. :3098 \n Class :character 1st Qu.: 1080 1st Qu.:3993 \n Mode :character Median : 2838 Median :4519 \n Mean : 4246 Mean :4436 \n 3rd Qu.: 4968 3rd Qu.:4814 \n Max. :21198 Max. :6315 \n\n\nFinally, it is especially important to understand the behavior of these key summary statistics at the limits. In particular, one approach is to randomly remove some observations and compare what happens to them. For instance, we can randomly create five datasets that differ on the basis of which observations were removed. We can then compare the summary statistics. 
If any of them are especially different, then we would want to look at the observations that were removed as they may contain observations with high influence.\n\nsample_means <- tibble(seed = c(), mean = c(), states_ignored = c())\n\nfor (i in c(1:5)) {\n set.seed(i)\n dont_get <- c(sample(x = state.name, size = 5))\n sample_means <-\n sample_means |>\n rbind(tibble(\n seed = i,\n mean =\n us_populations |>\n filter(!state %in% dont_get) |>\n summarise(mean = mean(population)) |>\n pull(),\n states_ignored = str_c(dont_get, collapse = \", \")\n ))\n}\n\nsample_means |>\n kable(\n col.names = c(\"Seed\", \"Mean\", \"Ignored states\"),\n digits = 0,\n format.args = list(big.mark = \",\"),\n booktabs = TRUE\n )\n\n\n\nTable 11.1: Comparing the mean population when different states are randomly removed\n\n\n\n\n\n\n\n\n\n\n\nSeed\nMean\nIgnored states\n\n\n\n\n1\n4,469\nArkansas, Rhode Island, Alabama, North Dakota, Minnesota\n\n\n2\n4,027\nMassachusetts, Iowa, Colorado, West Virginia, New York\n\n\n3\n4,086\nCalifornia, Idaho, Rhode Island, Oklahoma, South Carolina\n\n\n4\n4,391\nHawaii, Arizona, Connecticut, Utah, New Jersey\n\n\n5\n4,340\nAlaska, Texas, Iowa, Hawaii, South Dakota\n\n\n\n\n\n\n\n\nIn the case of the populations of US states, we know that larger states, such as California and New York, will have an out sized effect on our estimate of the mean. Table 11.1 supports that, as we can see that when we use seeds 2 and 3, there is a lower mean.",
"crumbs": [
"Modeling",
"11 Exploratory data analysis "
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 055d4668..f3f2fe01 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -6,11 +6,11 @@
     <loc>https://tellingstorieswithdata.com/00-errata.html</loc>
-    <lastmod>2024-10-12T12:47:44.703Z</lastmod>
+    <lastmod>2024-10-13T13:46:36.582Z</lastmod>
   </url>
   <url>
     <loc>https://tellingstorieswithdata.com/01-introduction.html</loc>
-    <lastmod>2024-10-12T14:04:13.892Z</lastmod>
+    <lastmod>2024-10-13T13:46:13.798Z</lastmod>
   </url>
   <url>
     <loc>https://tellingstorieswithdata.com/02-drinking_from_a_fire_hose.html</loc>
@@ -26,7 +26,7 @@
     <loc>https://tellingstorieswithdata.com/05-static_communication.html</loc>
-    <lastmod>2024-10-12T20:17:32.918Z</lastmod>
+    <lastmod>2024-10-13T12:01:29.677Z</lastmod>
   </url>
   <url>
     <loc>https://tellingstorieswithdata.com/06-farm.html</loc>
diff --git a/inputs/data/cancensus/CM_data_cd8148f91ef9529ea3c54ed1dc68fee4.rda.meta b/inputs/data/cancensus/CM_data_cd8148f91ef9529ea3c54ed1dc68fee4.rda.meta
index 7709c50e..3f6008cb 100644
Binary files a/inputs/data/cancensus/CM_data_cd8148f91ef9529ea3c54ed1dc68fee4.rda.meta and b/inputs/data/cancensus/CM_data_cd8148f91ef9529ea3c54ed1dc68fee4.rda.meta differ