-
Notifications
You must be signed in to change notification settings - Fork 75
/
Copy path02-drinking_from_a_fire_hose.qmd
1456 lines (1126 loc) · 66.8 KB
/
02-drinking_from_a_fire_hose.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
engine: knitr
---
# Drinking from a fire hose {#sec-fire-hose}
::: {.callout-note}
Chapman and Hall/CRC published this book in July 2023. You can purchase that [here](https://www.routledge.com/Telling-Stories-with-Data-With-Applications-in-R/Alexander/p/book/9781032134772). This online version has some updates to what was printed.
:::
**Prerequisites**
- Read *The mundanity of excellence: An ethnographic report on stratification and Olympic swimmers*, [@chambliss1989mundanity]
- This paper finds that excellence is not due to some special talent or gift, but instead due to technique, discipline, and attitude.
- Read *Data science as an atomic habit*, [@citeBarrett]
- This blog post describes an approach to learning data science that involves making small, consistent, actions.
- Read *This is how AI bias really happens---and why it's so hard to fix*, [@hao2019]
- This article highlights some of the ways in which models can perpetuate bias.
**Key concepts and skills**
- The statistical programming language R enables us to tell interesting stories using data. It is a language like any other, and the path to mastery can be slow.
- The workflow that we use to approach projects is: plan, simulate, acquire, explore, and share.
- The way to learn R is to start with a small project and break down what is required to achieve it into tiny steps, look at other people's code, and draw on that to achieve each step. Complete that project and move onto the next project. With each project you will get a little better.
**Software and packages**
- Base R [@citeR]
- Core `tidyverse` [@tidyverse]
- `dplyr` [@citedplyr]
- `ggplot2` [@citeggplot]
- `tidyr` [@citetidyr]
- `stringr` [@citestringr]
- `readr` [@citereadr]
- `janitor` [@janitor]
- `lubridate` [@GrolemundWickham2011]
- `opendatatoronto` [@citeSharla]
- `tinytable` [@tinytable]
```{r}
#| message: false
#| warning: false
library(janitor)
library(lubridate)
library(opendatatoronto)
library(tidyverse)
library(tinytable)
```
## Hello, World!
The way to start, is to start. In this chapter we go through three complete examples of the data science workflow advocated in this book.\index{workflow} This means we:
$$
\mbox{Plan} \rightarrow \mbox{Simulate} \rightarrow \mbox{Acquire} \rightarrow \mbox{Explore} \rightarrow \mbox{Share}
$$
If you are new to R, then some of the code may be a bit unfamiliar to you. If you are new to statistics, then some of the concepts may be unfamiliar. Do not worry. It will all soon become familiar.
The only way to learn how to tell stories, is to start telling stories yourself. This means that you should try to get these examples working. Do the sketches yourself, type everything out yourself (using Posit Cloud if you are new to R and do not have it locally installed), and execute it all. It is important to realize that it will be challenging at the start. This is normal.
> Whenever you're learning a new tool, for a long time, you're going to suck$\dots$ But the good news is that is typical; that's something that happens to everyone, and it's only temporary.
>
> Hadley Wickham as quoted by @citeBarrett.
You will be guided thoroughly here. Hopefully by experiencing the excitement of telling stories with data, you will feel empowered to stick with it.
The first step of the workflow is to plan. We do this because we need to establish an end-point, even if we later need to update it as we learn more about the situation. We then simulate because this forces us into the details of our plan. In some projects, data acquisition may be as straight-forward as downloading a dataset, but in others the data acquisition may be much of the focus, for instance, if we conduct a survey. We explore the data using various quantitative methods to come to understand it. And finally, we share our understanding, in a way that is focused on the needs of our audience.
To get started, go to [Posit Cloud](https://posit.cloud)\index{Posit Cloud} and create an account; the free version is fine for now. We use that initially, rather than the desktop, so that getting started is the same for everyone, but to avoid having to pay you should change to a local installation later. Once you have an account and log in, it should look something like @fig-02-rstudio_cloud-1.
::: {#fig-openrstudio layout-ncol="2" layout-valign="bottom"}
![Opening Posit Cloud for the first time](figures/02-posit_cloud-1.png){#fig-02-rstudio_cloud-1}
![Opening a new RStudio project](figures/02-posit_cloud-2.png){#fig-02-rstudio_cloud-2}
Getting started with Posit Cloud and a new project
:::
You will be in "Your Projects". From here you should start a new project: "New Project" $\rightarrow$ "New RStudio Project" (@fig-02-rstudio_cloud-2). You can give the project a name by clicking on "Untitled Project" and replacing it.
We will now go through three worked examples: Australian elections, Toronto shelter usage, and neonatal mortality. These examples build increasing complexity, but from the first one, we will be telling a story with data. While we briefly explain many aspects here, almost everything is explained in much more detail in the rest of the book.
## Australian elections
Australia\index{Australia} is a parliamentary democracy with 151 seats in the House of Representatives, which is the lower house and that from which government is formed.\index{elections!Australia 2022 Federal Election} There are two major parties---"Liberal" and "Labor"---two minor parties---"Nationals" and "Greens"---and many smaller parties and independents. In this example we will create a graph of the number of seats that each party won in the 2022 Federal Election.\index{elections}
### Plan
For this example, we need to plan two aspects. The first is what the dataset that we need will look like, and the second is what the final graph will look like.
The basic requirement for the dataset is that it has the name of the seat (sometimes called a "division" in Australia) and the party of the person elected. A quick sketch of the dataset that we would need is @fig-australiaexample-data.
::: {#fig-australiaexample layout-ncol="2" layout-valign="bottom"}
![Quick sketch of a dataset that could be useful for analyzing Australian elections](figures/02-divisiontable.png){#fig-australiaexample-data width="45%"}
![Quick sketch of a possible graph of the number of seats won by each party](figures/02-divisiongraph.png){#fig-australiaexample-graph width="45%"}
Sketches of a potential dataset and graph related to an Australian election
:::
We also need to plan the graph that we are interested in. Given we want to display the number of seats that each party won, a quick sketch of what we might aim for is @fig-australiaexample-graph.
### Simulate
We now simulate some data, to bring some specificity to our sketches.
To get started, within Posit Cloud\index{Posit Cloud}, make a new Quarto\index{Quarto} document: "File" $\rightarrow$ "New File" $\rightarrow$ "Quarto document$\dots$". Give it a title, such as "Exploring the 2022 Australian Election", add your name as author, and unclick "Use visual markdown editor" (@fig-quarto-australian-elections-clcik). Leave the other options as their default, and then click "Create".
::: {#fig-quarto-australian-elections layout-nrow=3}
![Creating a new Quarto document](figures/02-posit_cloud-3.png){#fig-quarto-australian-elections-clcik}
![Installing rmarkdown if necessary](figures/02-posit_cloud-4.png){#fig-quarto-australian-elections-wtfquarto}
![After initial setup and with a preamble](figures/02-posit_cloud-5.png){#fig-quarto-australian-elections-3}
![Highlighting the green arrow to run the chunk](figures/02-posit_cloud-6.png){#fig-quarto-australian-elections-4}
![Highlighting the cross to remove the messages](figures/02-posit_cloud-7.png){#fig-quarto-australian-elections-5}
![Highlighting the render button](figures/02-posit_cloud-8.png){#fig-quarto-australian-elections-6}
Getting started with a Quarto document
:::
You may get a notification along the lines of "Package rmarkdown required$\dots$." (@fig-quarto-australian-elections-wtfquarto). If that happens, click "Install". For this example, we will put everything into this one Quarto document. You should save it as "australian_elections.qmd": "File" $\rightarrow$ "Save As$\dots$".
Remove almost all the default content, and then beneath the heading material create a new R code chunk: "Code" $\rightarrow$ "Insert Chunk". Then add preamble documentation that explains:
- the purpose of the document;
- the author and contact details;
- when the file was written or last updated; and
- prerequisites that the file relies on.
```{r}
#| eval: false
#| echo: true
#### Preamble ####
# Purpose: Read in data from the 2022 Australian Election and make
# a graph of the number of seats each party won.
# Author: Rohan Alexander
# Email: [email protected]
# Date: 1 January 2023
# Prerequisites: Know where to get Australian elections data.
```
In R, lines that start with "\#" are comments. This means that they are not run as code by R, but are instead designed to be read by humans. Each line of this preamble should start with a "\#". Also make it clear that this is the preamble section by surrounding that with "\####". The result should look like @fig-quarto-australian-elections-3.
After this we need to setup the workspace. This involves installing\index{packages!install} and loading any packages that will be needed. A package only needs to be installed once for each computer, but needs to be loaded each time it is to be used. In this case we are going to use the `tidyverse` and `janitor` packages. They will need to be installed because this is the first time they are being used, and then each will need to be loaded.
:::{.callout-note}
## Shoulders of giants
Hadley Wickham\index{Wickham, Hadley} is Chief Scientist at RStudio. After earning a PhD in Statistics from Iowa State University in 2008 he was appointed as an assistant professor at Rice University, and became Chief Scientist at RStudio, now Posit, in 2013.\index{statistics} He developed the `tidyverse` collection of packages, and has published many books including *R for Data Science* [@r4ds] and *Advanced R* [@advancedr]. He was awarded the COPSS Presidents' Award in 2019.\index{COPSS Presidents' Award}
:::
An example of installing the packages follows.\index{packages!install} Run this code by clicking the small green arrow associated with the R code chunk (@fig-quarto-australian-elections-4).
```{r}
#| eval: false
#| echo: true
#### Workspace setup ####
install.packages("tidyverse")
install.packages("janitor")
```
Now that the packages are installed, they need to be loaded. As that package installation step only needs to be done once per computer, that code should be commented out so that it is not accidentally run, or even just removed. Additionally, we can remove the message that printed when we installed the packages (@fig-quarto-australian-elections-5).
```{r}
#| echo: true
#| eval: true
#| warning: false
#| message: false
#### Workspace setup ####
# install.packages("tidyverse")
# install.packages("janitor")
library(tidyverse)
library(janitor)
```
We can render the entire document by clicking "Render" (@fig-quarto-australian-elections-6). When you do this, you may be asked to install some packages. If that happens, then you should agree to this. This will result in a HTML document.\index{Quarto!render HTML}
For an introduction to the packages that were just installed, each package contains a help file that provides information about them and their functions. It can be accessed by prepending a question mark to the package name and then running that code in the console. For instance `?tidyverse`.
To simulate our data,\index{simulation} we need to create a dataset with two variables: "Division" and "Party", and some values for each. In the case of "Division" reasonable values would be a name of one of the 151 Australian divisions. In the case of "Party" reasonable values would be one of the following five: "Liberal", "Labor", "National", "Green", or "Other". Again, this code can be run by clicking the small green arrow associated with the R code chunk.
```{r}
#| echo: true
#| eval: true
#| warning: false
#| message: false
simulated_data <-
tibble(
# Use 1 through to 151 to represent each division
"Division" = 1:151,
# Randomly pick an option, with replacement, 151 times
"Party" = sample(
x = c("Liberal", "Labor", "National", "Green", "Other"),
size = 151,
replace = TRUE
)
)
simulated_data
```
At a certain point, your code will not run and you will want to ask others for help.\index{help!asking for} Do not take a screenshot of a small snippet of the code and expect that someone will be able to help based on that. They, almost surely, cannot. Instead, you need to provide them with your whole script in a way that they can run. We will explain what GitHub\index{GitHub} is more completely in @sec-reproducible-workflows, but for now, if you need help, then you should naively create a GitHub Gist which will enable you to share your code in a way that is more helpful than taking a screenshot. The first step is to create a free account on [GitHub](https://github.com) (@fig-githubone). Thinking about an appropriate username is important because this will become part of your professional profile. It would make sense to have a username that is professional, independent of any course, and ideally related to your real name. Then look for a "+" in the top right, and select "New gist" (@fig-githubgistone).
::: {#fig-githubgist layout-ncol="2" layout-nrow="2" layout-valign="bottom"}
![GitHub sign-up screen](figures/github_1.png){#fig-githubone}
![New GitHub Gist](figures/githubgistone.png){#fig-githubgistone}
![Create a public GitHub Gist to share code](figures/githubgisttwo.png){#fig-githubgisttwo}
Creating a Gist to share code when asking for help
:::
From here you should add all the code to that Gist, not just the final bit that is giving an error. And give it a meaningful filename that includes ".R" at the end, for instance, "australian_elections.R". In @fig-githubgisttwo it will turn out that we have incorrect capitalization, `library(Tidyverse)` instead of `library(tidyverse)`.
Click "Create public gist". We can then share the URL to this Gist with whoever we are asking to help, explain what the problem is, and what we are trying to achieve. It will be easier for them to help, because all the code is available.
### Acquire
Now we want to get the actual data. The data we need is from the Australian Electoral Commission (AEC)\index{Australia!Australian Electoral Commission}, which is the non-partisan agency that organizes Australian federal elections. We can pass a page of their website to `read_csv()` from `readr`. We do not need to explicitly load `readr` because it is part of the `tidyverse`. The `<-` or "assignment operator" allocates the output of `read_csv()` to an object called "raw_elections_data".
::: {.content-visible when-format="pdf"}
```{r}
#| eval: false
#| echo: true
#### Read in the data ####
raw_elections_data <-
read_csv(
file =
paste0("https://results.aec.gov.au/27966/website/Downloads/",
"HouseMembersElectedDownload-27966.csv"),
show_col_types = FALSE,
skip = 1
)
# We save the original data in case we lose access
write_csv(
x = raw_elections_data,
file = "australian_voting.csv"
)
```
:::
::: {.content-visible unless-format="pdf"}
```{r}
#| eval: false
#| echo: true
#### Read in the data ####
raw_elections_data <-
read_csv(
file =
"https://results.aec.gov.au/27966/website/Downloads/HouseMembersElectedDownload-27966.csv",
show_col_types = FALSE,
skip = 1
)
# We have read the data from the AEC website. We may like to save
# it in case something happens or they move it.
write_csv(
x = raw_elections_data,
file = "australian_voting.csv"
)
```
:::
```{r}
#| eval: false
#| echo: false
# INTERNAL
raw_elections_data <-
read_csv(
file =
"https://results.aec.gov.au/27966/website/Downloads/HouseMembersElectedDownload-27966.csv",
show_col_types = FALSE,
skip = 1
)
write_csv(
x = raw_elections_data,
file = here::here("inputs/data/australian_voting.csv")
)
```
```{r}
#| eval: true
#| echo: false
#| warning: false
raw_elections_data <-
read_csv(
here::here("inputs/data/australian_voting.csv"),
show_col_types = FALSE
)
```
We can take a quick look at the dataset using `head()` which will show the first six rows, and `tail()` which will show the last six rows.
```{r}
head(raw_elections_data)
tail(raw_elections_data)
```
We need to clean the data so that we can use it. We are trying to make it similar to the dataset that we thought we wanted in the planning stage. While it is fine to move away from the plan, this needs to be a deliberate, reasoned decision. After reading in the dataset that we saved, the first thing that we will do is adjust the names of the variables. We will do this using `clean_names()` from `janitor`.
```{r}
#| echo: false
#| eval: true
#### Basic cleaning ####
raw_elections_data <-
read_csv(
file = here::here("inputs/data/australian_voting.csv"),
show_col_types = FALSE
)
```
```{r}
#| echo: true
#| eval: false
#### Basic cleaning ####
raw_elections_data <-
read_csv(
file = "australian_voting.csv",
show_col_types = FALSE
)
```
```{r}
#| echo: true
#| eval: true
# Make the names easier to type
cleaned_elections_data <-
clean_names(raw_elections_data)
# Have a look at the first six rows
head(cleaned_elections_data)
```
The names are faster to type because RStudio\index{R!RStudio} will auto-complete them. To do this, we begin typing the name of a variable and then use the "tab" key to complete it.
There are many variables in the dataset, and we are primarily interested in two: "division_nm" and "party_nm". We can choose certain variables of interest with `select()` from `dplyr` which we loaded as part of the `tidyverse`. The "pipe operator"\index{pipe operator}, `|>`, pushes the output of one line to be the first input of the function on the next line.
```{r}
#| echo: true
#| eval: true
cleaned_elections_data <-
cleaned_elections_data |>
select(
division_nm,
party_nm
)
head(cleaned_elections_data)
```
Some of the variable names are still not obvious because they are abbreviated. We can look at the names of the columns in this dataset with `names()`. And we can change the names using `rename()` from `dplyr`.
```{r}
names(cleaned_elections_data)
```
```{r}
cleaned_elections_data <-
cleaned_elections_data |>
rename(
division = division_nm,
elected_party = party_nm
)
head(cleaned_elections_data)
```
We could now look at the unique values in the "elected_party" column using `unique()`.
```{r}
cleaned_elections_data$elected_party |>
unique()
```
As there is more detail in this than we wanted, we may want to simplify the party names to match what we simulated, using `case_match()` from `dplyr`.
```{r}
cleaned_elections_data <-
cleaned_elections_data |>
mutate(
elected_party =
case_match(
elected_party,
"Australian Labor Party" ~ "Labor",
"Liberal National Party of Queensland" ~ "Liberal",
"Liberal" ~ "Liberal",
"The Nationals" ~ "Nationals",
"The Greens" ~ "Greens",
"Independent" ~ "Other",
"Katter's Australian Party (KAP)" ~ "Other",
"Centre Alliance" ~ "Other"
)
)
head(cleaned_elections_data)
```
Our data now matches our plan (@fig-australiaexample-data). For every electoral division we have the party of the person that won it.
Having now nicely cleaned the dataset, we should save it, so that we can start with that cleaned dataset in the next stage. We should make sure to save it under a new file name so we are not replacing the raw data, and so that it is easy to identify the cleaned dataset later.
```{r}
#| echo: true
#| eval: false
write_csv(
x = cleaned_elections_data,
file = "cleaned_elections_data.csv"
)
```
```{r}
#| echo: false
#| eval: true
write_csv(
cleaned_elections_data,
here::here("outputs/data/cleaned_elections_data.csv")
)
```
### Explore
We may like to explore the dataset that we created. One way to better understand a dataset is to make a graph. In particular, here we would like to build the graph that we planned in @fig-australiaexample-graph.
First, we read in the dataset that we just created.
```{r}
#| echo: false
#| eval: true
# Internal
cleaned_elections_data <-
read_csv(
file = "outputs/data/cleaned_elections_data.csv",
show_col_types = FALSE
)
```
```{r}
#| echo: true
#| eval: false
#### Read in the data ####
cleaned_elections_data <-
read_csv(
file = "cleaned_elections_data.csv",
show_col_types = FALSE
)
```
We can get a quick count of how many seats each party won using `count()` from `dplyr`.
```{r}
#| echo: true
#| eval: true
#| warning: false
cleaned_elections_data |>
count(elected_party)
```
To build the graph that we are interested in, we use `ggplot2` which is part of the `tidyverse`. The key aspect of this package is that we build graphs by adding layers using "+", which we call the "add operator". In particular we will create a bar chart using `geom_bar()` from `ggplot2` (@fig-canadanice-1).
```{r}
#| echo: true
#| eval: true
#| warning: false
#| label: fig-canadanice
#| fig-cap: "Number of seats won, by political party, at the 2022 Australian Federal Election"
#| fig-subcap: ["Default options", "Improved theme and labels"]
#| layout-ncol: 2
cleaned_elections_data |>
ggplot(aes(x = elected_party)) + # aes abbreviates "aesthetics"
geom_bar()
cleaned_elections_data |>
ggplot(aes(x = elected_party)) +
geom_bar() +
theme_minimal() + # Make the theme neater
labs(x = "Party", y = "Number of seats") # Make labels more meaningful
```
@fig-canadanice-1 accomplishes what we set out to do. But we can make it look a bit nicer by modifying the default options and improving the labels (@fig-canadanice-2).
### Share
To this point we have downloaded some data, cleaned it, and made a graph. We would typically need to communicate what we have done at some length. In this case, we can write a few paragraphs about what we did, why we did it, and what we found to conclude our workflow. An example follows.
> Australia is a parliamentary democracy with 151 seats in the House of Representatives, which is the house from which government is formed. There are two major parties---"Liberal" and "Labor"---two minor parties---"Nationals" and "Greens"---and many smaller parties. The 2022 Federal Election occurred on 21 May, and around 15 million votes were cast. We were interested in the number of seats that were won by each party.
>
> We downloaded the results, on a seat-specific basis, from the Australian Electoral Commission website. We cleaned and tidied the dataset using the statistical programming language R [@citeR] including the `tidyverse` [@tidyverse] and `janitor` [@janitor]. We then created a graph of the number of seats that each political party won (@fig-canadanice).
>
> We found that the Labor Party won 77 seats, followed by the Liberal Party with 48 seats. The minor parties won the following number of seats: the Nationals won 10 seats and the Greens won 4 seats. Finally, there were 10 Independents elected as well as candidates from smaller parties.
>
> The distribution of seats is skewed toward the two major parties which could reflect relatively stable preferences on the part of Australian voters, or possibly inertia due to the benefits of already being a major party such a national network or funding. A better understanding of the reasons for this distribution are of interest in future work. While the dataset consists of everyone who voted, it worth noting that in Australia some are systematically excluded from voting, and it is much more difficult for some to vote than others.
One aspect to be especially concerned with is making sure that this communication is focused on the needs of the audience and telling a story. Data journalism provides some excellent examples of how analysis needs to be tailored to the audience, for instance, @biasbehindbars and @bronnerftw.
## Toronto's unhoused population
Toronto\index{Canada!Toronto unhoused population} has a large unhoused population [@torontohomeless]. Freezing winters mean it is important there are enough places in shelters. In this example we will make a table of shelter usage in 2021 to compare average use in each month. Our expectation is that there is greater usage in the colder months, for instance, December, compared with warmer months, for instance, July.
### Plan
The dataset that we are interested in would need to have the date, the shelter, and the number of beds that were occupied that night. A quick sketch of a dataset that would work is @fig-torontohomeless-data. We are interested in creating a table that has the monthly average number of beds occupied each night. The table would probably look something like @fig-torontohomeless-table.
::: {#fig-torontohomeless layout-ncol="2"}
![Quick sketch of a dataset](figures/02-shelter_sketch.png){#fig-torontohomeless-data width="50%"}
![Quick sketch of a table of the average number of beds occupied each month](figures/02-homeless_sketch.png){#fig-torontohomeless-table width="50%"}
Sketches of a dataset and table related shelter usage in Toronto
:::
### Simulate
The next step is to simulate some data that could resemble our dataset.\index{simulation} Simulation provides us with an opportunity to think deeply about our data generating process. When we turn to analysis, it will provide us with a guide. Conducting analysis without first using simulation can be thought of as shooting arrows without a target---while you are certainly doing something, it is not clear whether you are doing it well.
In Posit Cloud\index{Posit Cloud} make a new Quarto\index{Quarto} document, save it, and make a new R code chunk and add preamble documentation. Then install and/or load the packages that are needed. We will again use the `tidyverse` and `janitor`. As those were installed earlier, they do not need to be installed again. We will also use `lubridate`. That is part of the `tidyverse` and so does not need to be installed independently, but it does need to be loaded. We will also use `opendatatoronto`, and `knitr` and these will need to be installed and loaded.
```{r}
#| eval: false
#| echo: true
#### Preamble ####
# Purpose: Get data on 2021 shelter usage and make table
# Author: Rohan Alexander
# Email: [email protected]
# Date: 1 July 2022
# Prerequisites: -
#### Workspace setup ####
install.packages("opendatatoronto")
install.packages("knitr")
library(knitr)
library(janitor)
library(lubridate)
library(opendatatoronto)
library(tidyverse)
```
```{r}
#| echo: false
#| eval: true
#| warning: false
#| message: false
library(knitr)
library(janitor)
library(lubridate)
library(opendatatoronto)
library(tidyverse)
```
To add a bit more detail to the earlier example, packages contain code that other people have written. There are a few common ones that you will see regularly in this book, especially the `tidyverse`. To use a package, we must first install it and then we need to load it. A package only needs to be installed once per computer but must be loaded every time. This means the packages that we installed earlier do not need to be reinstalled here.
::: callout-note
## Shoulders of giants
Dr Robert Gentleman\index{Gentleman, Robert} is a co-creator of R. After earning a PhD in Statistics from the University of Washington in 1988, he moved to the University of Auckland.\index{statistics} He then went onto various roles including at 23andMe and is now the Executive Director of the Center for Computational Biomedicine at Harvard Medical School.
:::
::: callout-note
## Shoulders of giants
Dr Ross Ihaka\index{Ihaka, Ross} is a co-creator of R. He earned a PhD in Statistics from the University of California, Berkeley, in 1985. He wrote a dissertation titled "Ruaumoko", which is the Māori god of earthquakes.\index{statistics}\index{Berkeley} He then moved to the University of Auckland where he remained for his entire career. He was awarded the Pickering Medal in 2008 by the Royal Society of New Zealand Te Apārangi.
:::
Given that people donate their time to make R and the packages that we use, it is important to cite them. To get the information that is needed, we use `citation()`.\index{citation!R} When run without any arguments, that provides the citation information for R itself, and when run with an argument that is the name of a package, it provides the citation information for that package.\index{packages!cite}\index{citation!R packages}
```{r}
citation() # Get the citation information for R
citation("ggplot2") # Get citation information for a package
```
Turning to the simulation, we need three variables: "date", "shelter", and "occupancy". This example will build on the earlier one by adding a seed using `set.seed()`.\index{random seed} A seed enables us to always generate the same random data whenever we run the same code. Any integer can be used as the seed. In this case the seed will be 853. If you use that as your seed, then you should get the same random numbers as in this example. If you use a different seed, then you should expect different random numbers. Finally, we use `rep()` to repeat something a certain number of times. For instance, we repeat "Shelter 1" 365 times which accounts for about a year.\index{distribution!Poisson}
```{r}
#| echo: true
#| eval: true
#| warning: false
#| message: false
#### Simulate ####
set.seed(853)
simulated_occupancy_data <-
tibble(
date = rep(x = as.Date("2021-01-01") + c(0:364), times = 3),
# Based on Eddelbuettel: https://stackoverflow.com/a/21502386
shelter = c(
rep(x = "Shelter 1", times = 365),
rep(x = "Shelter 2", times = 365),
rep(x = "Shelter 3", times = 365)
),
number_occupied =
rpois(
n = 365 * 3,
lambda = 30
) # Draw 1,095 times from the Poisson distribution
)
head(simulated_occupancy_data)
```
In this simulation we first create a list of all the dates in 2021. We repeat that list three times. We assume data for three shelters for every day of the year. To simulate the number of beds that are occupied each night, we draw from a Poisson distribution\index{distribution!Poisson}, assuming a mean number of 30 beds occupied per shelter, although this is just an arbitrary choice. By way of background, a Poisson distribution is often used when we have count data, and we return to it in @sec-its-just-a-generalized-linear-model.
### Acquire
We use data made available about Toronto shelter usage by the City of Toronto. Shelter usage is measured by a count made each night at 4 a.m. of the number of occupied beds. To access the data, we use `opendatatoronto` and then save our own copy.
::: {.content-visible when-format="pdf"}
```{r}
#| eval: false
#| echo: true
#### Acquire ####
toronto_shelters <-
# Each package is associated with a unique id found in the "For
# Developers" tab of the relevant page from Open Data Toronto
list_package_resources("21c83b32-d5a8-4106-a54f-010dbe49f6f2") |>
# Within that package, we are interested in the 2021 dataset
filter(name ==
"daily-shelter-overnight-service-occupancy-capacity-2021.csv") |>
# Having reduced the dataset to one row we can get the resource
get_resource()
write_csv(
x = toronto_shelters,
file = "toronto_shelters.csv"
)
head(toronto_shelters)
```
:::
::: {.content-visible unless-format="pdf"}
```{r}
#| eval: false
#| echo: true
#### Acquire ####
toronto_shelters <-
# Each package is associated with a unique id found in the "For
# Developers" tab of the relevant page from Open Data Toronto
# https://open.toronto.ca/dataset/daily-shelter-overnight-service-occupancy-capacity/
list_package_resources("21c83b32-d5a8-4106-a54f-010dbe49f6f2") |>
# Within that package, we are interested in the 2021 dataset
filter(name ==
"daily-shelter-overnight-service-occupancy-capacity-2021.csv") |>
# Having reduced the dataset to one row we can get the resource
get_resource()
write_csv(
x = toronto_shelters,
file = "toronto_shelters.csv"
)
head(toronto_shelters)
```
:::
```{r}
#| eval: false
#| echo: false
#| warning: false
write_csv(
x = toronto_shelters,
file = here::here("inputs/data/toronto_shelters.csv")
)
```
```{r}
#| eval: true
#| echo: false
#| warning: false
toronto_shelters <-
read_csv(
here::here("inputs/data/toronto_shelters.csv"),
show_col_types = FALSE
)
```
::: {.content-visible unless-format="pdf"}
```{r}
#| eval: true
#| echo: true
#| warning: false
head(toronto_shelters)
```
:::
Not much needs to be done to this to make it similar to the dataset that we were interested in (@fig-torontohomeless-data). We need to change the names to make them easier to type using `clean_names()`, and reduce the columns to only those that are relevant using `select()`.
```{r}
toronto_shelters_clean <-
clean_names(toronto_shelters) |>
mutate(occupancy_date = ymd(occupancy_date)) |>
select(occupancy_date, occupied_beds)
head(toronto_shelters_clean)
```
All that remains is to save the cleaned dataset.
```{r}
#| eval: false
#| echo: true
#| warning: false
write_csv(
x = toronto_shelters_clean,
file = "cleaned_toronto_shelters.csv"
)
```
```{r}
#| eval: true
#| echo: false
#| warning: false
write_csv(
x = toronto_shelters_clean,
file = here::here("outputs/data/cleaned_toronto_shelters.csv")
)
```
### Explore
First, we load the dataset that we just created.
```{r}
#| echo: false
#| eval: true
toronto_shelters_clean <-
read_csv(
"outputs/data/cleaned_toronto_shelters.csv",
show_col_types = FALSE
)
```
```{r}
#| echo: true
#| eval: false
#### Explore ####
toronto_shelters_clean <-
read_csv(
"cleaned_toronto_shelters.csv",
show_col_types = FALSE
)
```
The dataset contains daily records for each shelter. We are interested in understanding average usage for each month. To do this, we need to add a month column using `month()` from `lubridate`. By default, `month()` provides the number of the month, and so we include two arguments---"label" and "abbr"---to get the full name of the month. We remove rows that do not have any data for the number of beds using `drop_na()` from `tidyr`, which is part of the `tidyverse`. We will do this here unthinkingly because our focus is on getting started, but this is an important decision and we talk more about missing data in @sec-farm-data and @sec-exploratory-data-analysis. We then create a summary statistic on the basis of monthly groups, using `summarise()` from `dplyr`. We use `tt()` from `tinytable` to create @tbl-homelessoccupancyd.
```{r}
#| label: tbl-homelessoccupancyd
#| tbl-cap: "Shelter usage in Toronto in 2021"
toronto_shelters_clean |>
mutate(occupancy_month = month(
occupancy_date,
label = TRUE,
abbr = FALSE
)) |>
arrange(month(occupancy_date)) |>
drop_na(occupied_beds) |>
summarise(number_occupied = mean(occupied_beds),
.by = occupancy_month) |>
tt()
```
As with before, this looks fine, and achieves what we set out to do. But we can make some tweaks to the defaults to make it look even better (@tbl-homelessoccupancy). In particular we make the column names easier to read, only show an appropriate number of decimal places, and change the alignment (`j` is used to specify the column number of interest and `r` is the alignment type i.e. right).
```{r}
#| label: tbl-homelessoccupancy
#| tbl-cap: "Shelter usage in Toronto in 2021"
toronto_shelters_clean |>
mutate(occupancy_month = month(
occupancy_date,
label = TRUE,
abbr = FALSE
)) |>
arrange(month(occupancy_date)) |>
drop_na(occupied_beds) |>
summarise(number_occupied = mean(occupied_beds),
.by = occupancy_month) |>
tt(
digits = 1
) |>
style_tt(j = 2, align = "r") |>
setNames(c("Month", "Average daily number of occupied beds"))
```
### Share
We need to write a few brief paragraphs about what we did, why we did it, and what we found to sum up our work. An example follows.
> Toronto has a large unhoused population. Freezing winters mean it is critical there are enough places in shelters. We are interested to understand how usage of shelters changes in colder months, compared with warmer months.
>
> We use data provided by the City of Toronto about Toronto shelter bed occupancy. Specifically, at 4 a.m. each night a count is made of the occupied beds. We are interested in averaging this over the month. We cleaned, tidied, and analyzed the dataset using the statistical programming language R [@citeR] as well as the `tidyverse` [@Wickham2017], `janitor` [@janitor], `opendatatoronto` [@citeSharla], `lubridate` [@GrolemundWickham2011], and `knitr` [@citeknitr]. We then made a table of the average number of occupied beds each night for each month (@tbl-homelessoccupancy).
>
> We found that the daily average number of occupied beds was higher in December 2021 than July 2021, with 34 occupied beds in December, compared with 30 in July (@tbl-homelessoccupancy). More generally, there was a steady increase in the daily average number of occupied beds between July and December, with a slight overall increase each month.
>
> The dataset is on the basis of shelters, and so our results may be skewed by changes that are specific to especially large or small shelters. It may be that specific shelters are particularly attractive in colder months. Additionally, we were concerned with counts of the number of occupied beds, but if the supply of beds changes over the season, then an additional statistic of interest would be the proportion occupied.
Although this example is only a few paragraphs, it could be reduced to form an abstract, or increased to form a full report, for instance, by expanding each paragraph into a section. The first paragraph is a general overview, the second focuses on the data, the third on the results, and the fourth is a discussion. Following the example of @hao2019, that fourth paragraph is a good place to consider areas in which bias may have crept in.
## Neonatal mortality
Neonatal mortality\index{neonatal mortality} refers to a death that occurs within the first month of life. The neonatal mortality rate (NMR) is the number of neonatal deaths per 1,000 live births [@unigme]. The Third Sustainable Development Goal (SDG) calls for a reduction in NMR to 12. In this example we will create a graph of the estimated NMR for the past 50 years for: Argentina\index{Argentina!neonatal mortality}, Australia\index{Australia!neonatal mortality}, Canada\index{Canada!neonatal mortality}, and Kenya\index{Kenya!neonatal mortality}.
### Plan
For this example, we need to think about what our dataset should look like, and what the graph should look like.
The dataset needs to have variables that specify the country and the year. It also needs to have a variable with the NMR estimate for that year for that country. Roughly, it should look like @fig-nmrexample-data.
::: {#fig-nmrexample layout-ncol="2"}
![Quick sketch of a potentially useful NMR dataset](figures/02-nmr_dataset_sketch.png){#fig-nmrexample-data}
![Quick sketch of a graph of NMR by country over time](figures/02-NMRgraph.png){#fig-nmrexample-graph}
Sketches of a dataset and graph about the neonatal mortality rate (NMR)
:::
We are interested to make a graph with year on the x-axis and estimated NMR on the y-axis. Each country should have its own series. A quick sketch of what we are looking for is @fig-nmrexample-graph.
### Simulate
We would like to simulate some data that aligns with our plan. In this case we will need three columns: country, year, and NMR.\index{simulation}
Within Posit Cloud\index{Posit Cloud}, make a new Quarto document\index{Quarto} and save it. Add preamble documentation and set up the workspace. We will use the `tidyverse`, `janitor`, and `lubridate`.
```{r}
#| eval: false
#| echo: true
#### Preamble ####
# Purpose: Obtain and prepare data about neonatal mortality for
# four countries for the past fifty years and create a graph.
# Author: Rohan Alexander
# Email: [email protected]
# Date: 1 July 2022
# Prerequisites: -
#### Workspace setup ####
library(janitor)
library(lubridate)
library(tidyverse)
```
```{r}
#| echo: false
#| eval: true
#| warning: false
#| message: false
library(janitor)
library(lubridate)
library(tidyverse)
```
The code contained in packages can change from time to time as the authors update it and release new versions. We can see which version of a package we are using with `packageVersion()`. For instance, we are using version 2.0.0 of the `tidyverse` and version 2.2.0 of `janitor`.
```{r}
packageVersion("tidyverse")
packageVersion("janitor")
```
To update the version of all of the packages that we have installed, we use `update.packages()`.\index{packages!update} We can use `tidyverse_update()` to just install the `tidyverse` packages. This does not need to be run, say, every day, but from time to time it is worth updating packages. While many packages take care to ensure backward compatibility, at a certain point this does not become possible. Updating packages could result in old code needing to be rewritten. This is not a big deal when you are getting started and in any case there are tools aimed at loading particular versions that we cover in @sec-reproducible-workflows.
Returning to the simulation, we repeat the name of each country 50 times with `rep()`, and enable the passing of 50 years. Finally, we draw from the uniform distribution with `runif()` to simulate an estimated NMR value for that year for that country.\index{distribution!uniform}
```{r}
#| echo: true
#| eval: true
#| warning: false
#| message: false
#### Simulate data ####
set.seed(853)
simulated_nmr_data <-
tibble(
country =
c(rep("Argentina", 50), rep("Australia", 50),
rep("Canada", 50), rep("Kenya", 50)),
year =
rep(c(1971:2020), 4),
nmr =
runif(n = 200, min = 0, max = 100)
)
head(simulated_nmr_data)
```
While this simulation works, it would be time consuming and error prone if we decided that instead of 50 years, we were interested in simulating, say, 60 years. One way to improve this code is to replace all instances of 50 with a variable.\index{simulation}\index{distribution!uniform}
```{r}
#| echo: true
#| eval: true
#| warning: false
#| message: false
#### Simulate data ####
set.seed(853)
number_of_years <- 50
simulated_nmr_data <-
tibble(
country =
c(rep("Argentina", number_of_years), rep("Australia", number_of_years),
rep("Canada", number_of_years), rep("Kenya", number_of_years)),
year =
rep(c(1:number_of_years + 1970), 4),
nmr =
runif(n = number_of_years * 4, min = 0, max = 100)
)
head(simulated_nmr_data)
```
The result will be the same, but now if we want to change from 50 to 60 years, we only have to make the change in one place.
We can have confidence in this simulated dataset because it is relatively straight forward, and we wrote the code for it.\index{confidence!establishing} But when we turn to the real dataset, it is more difficult to be sure that it is what it claims to be. Even if we trust the data, we need to be able to share that confidence with others. One way forward is to establish some tests of whether our data are as they should be. For instance, we expect:\index{testing}
1. That "country" is exclusively one of these four: "Argentina", "Australia", "Canada", or "Kenya".
2. Conversely, "country" contains all those four countries.
3. That "year" is no smaller than 1971 and no larger than 2020, and is an integer, not a letter or a number with decimal places.
4. That "nmr" is a value somewhere between 0 and 1,000, and is a number.
We can write a series of tests based on these features, that we expect the dataset to pass.
```{r}
#| echo: true
#| eval: false
simulated_nmr_data$country |>
unique() == c("Argentina", "Australia", "Canada", "Kenya")
simulated_nmr_data$country |>
unique() |>
length() == 4
simulated_nmr_data$year |> min() == 1971
simulated_nmr_data$year |> max() == 2020
simulated_nmr_data$nmr |> min() >= 0
simulated_nmr_data$nmr |> max() <= 1000
simulated_nmr_data$nmr |> class() == "numeric"
```
Having passed these tests, we can have confidence in the simulated dataset.\index{confidence!establishing} More importantly, we can apply these tests to the real dataset. This enables us to have greater confidence in that dataset and to share that confidence with others.
### Acquire
The UN Inter-agency Group for Child Mortality Estimation (IGME)\index{UN Inter-agency Group for Child Mortality Estimation} [provides](https://childmortality.org/) NMR estimates that we can download and save.
```{r}
#| eval: false
#| echo: true
#| warning: false
#### Acquire data ####
raw_igme_data <-