Use incidence2 accessor functions or subset columns directly? #79

Bisaloo · 2023-04-11T18:51:00Z

incidence() lets you specify the name of the resulting count column as an argument. You could therefore define the relevant column names in variables at the start and then treat the result as a normal data frame, for consistency of access. This is less suitable for programmatic use but potentially better for interactive use?
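A minimal sketch of that idea, under some assumptions: the toy linelist and the names `date_col` and `count_col` are invented for illustration, and `date_names_to` / `count_values_to` are, as far as I can tell, the relevant arguments in incidence2 >= 2.0.0 (worth checking against ?incidence for the version you have pinned).

```r
library(incidence2)

# Hypothetical toy linelist, invented for illustration only.
dat <- data.frame(
    onset = as.Date("2023-01-01") + c(0, 0, 1, 3, 3, 3),
    sex   = c("f", "m", "f", "f", "m", "m")
)

# Column names defined once, at the start of the script.
date_col  <- "onset_date"
count_col <- "n_cases"

inci <- incidence(
    dat,
    date_index      = "onset",
    groups          = "sex",
    date_names_to   = date_col,   # assumed argument name, see lead-in
    count_values_to = count_col   # assumed argument name, see lead-in
)

# From here on, plain data frame access using the variables defined above.
inci[[count_col]]
inci[[date_col]]
```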

We then have two ways to extract count data, groups or dates from an incidence2 object:

use the dedicated accessor functions
use standard data.frame subsetting. This is possible because we rename columns to a stable name when we pass the input dataset through incidence().
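For concreteness, the two options might look roughly like this. The toy data is invented, and the sketch assumes that `get_count_value_name()` is available (incidence2 >= 2.0.0) and that the default count column is called `count`, as in the printed output later in this thread.

```r
library(incidence2)

# Hypothetical toy linelist, invented for illustration only.
dat <- data.frame(
    onset = as.Date("2023-01-01") + c(0, 0, 1, 3, 3, 3),
    sex   = c("f", "m", "f", "f", "m", "m")
)
inci <- incidence(dat, date_index = "onset", groups = "sex")

# Option 1: ask the object which column holds the counts, then extract it.
count_col    <- get_count_value_name(inci)
via_accessor <- inci[[count_col]]

# Option 2: rely on the stable default column name directly.
via_subset <- inci$count

# Both routes should give the same vector as long as the defaults hold.
identical(via_accessor, via_subset)
```

Option 1 keeps working if the column is renamed via the arguments of incidence(); option 2 silently depends on the default name.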

Benefits of accessor functions

The extra level of abstraction likely makes it more robust to possible future breaking changes. For example, even if a future version of incidence2 chose not to rename the columns but instead to use a purely tag-based system, as in linelist, the accessors would likely deal with the breaking change under the hood and provide a stable interface.

On the other hand, we are already supposed to deal with breaking changes by pinning specific versions of our dependencies, as discussed in #69.

Benefits of direct subsetting

Users are more likely to be familiar with the syntax of standard data.frame subsetting
It works on all objects, including those that drop their incidence2 class
It doesn't tie us so strongly to incidence2
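As a rough illustration of the second point, here is a base R sketch on a plain data frame standing in for an object that has lost its incidence2 class. The column names and the four rows are copied from the onset output shown later in this thread; the object name `plain` is made up.

```r
# Plain data frame with the default incidence2 column names; no incidence2
# class is attached, so the accessor functions would not apply here.
plain <- data.frame(
    date_index     = c("2014-W15", "2014-W16", "2014-W17", "2014-W17"),
    gender         = c("f", "m", "f", "m"),
    count_variable = "onset",
    count          = c(1L, 1L, 4L, 1L)
)

inherits(plain, "incidence2")

# Standard subsetting still works.
onset_counts <- plain$count[plain$count_variable == "onset"]

# Weekly totals across genders, using base R only.
weekly_total <- tapply(plain$count, plain$date_index, sum)
weekly_total
```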


TimTaylor · 2023-04-11T20:15:31Z

You've convinced me I need to write a "design" vignette (it's been planned for a while but ... ... time). I'll address accessors here and leave a separate comment on general use of {incidence2} below (so you can hide it as a little off-topic for the issue).

On accessors (get_xxx()). These were mainly aimed at those wanting to provide methods for <incidence2> objects. I hadn't really thought of them being used in pipelines, but the {episoap} templates (and templates in general) are an interesting case and I can see why they could be useful. I think you have captured well the pros and cons above.

TimTaylor · 2023-04-11T21:30:30Z

Following from above, some thoughts on {incidence2} and where it is best used (and not used). I'll use a crude dichotomy of "interactive" to mean any sort of analysis pipeline and "programmatic" to mean in a package.

In interactive settings, the benefit of {incidence2} is most apparent for complex aggregations of linelist with multiple date indices, or pre-aggregated data with multiple count variables, e.g.

library(incidence2)
library(outbreaks)
library(dplyr)

# linelist example
ebola <- ebola_sim_clean$linelist
(grouped_inci <- incidence(
    ebola,
    date_index = c(
        onset = "date_of_onset",
        infection = "date_of_infection"
    ), 
    interval = "isoweek",
    groups = "gender"
))
#> # incidence:  218 x 4
#> # count vars: infection, onset
#> # groups:     gender
#>    date_index gender count_variable count
#>  * <isowk>    <fct>  <chr>          <int>
#>  1 2014-W12   f      infection          1
#>  2 2014-W15   f      onset              1
#>  3 2014-W15   m      infection          1
#>  4 2014-W16   f      infection          1
#>  5 2014-W16   m      onset              1
#>  6 2014-W17   f      infection          4
#>  7 2014-W17   f      onset              4
#>  8 2014-W17   m      onset              1
#>  9 2014-W18   f      infection          7
#> 10 2014-W18   f      onset              4
#> # ℹ 208 more rows

plot(grouped_inci, angle = 45, border_colour = "white")

# pre-aggregated example
covid <- covidregionaldataUK
(monthly_covid <- 
    covid |> 
    filter(!region %in% c("England", "Scotland", "Northern Ireland", "Wales")) |> 
    incidence(
        date_index = "date",
        groups = "region",
        counts = c("cases_new", "deaths_new"),
        interval = "yearmonth"
    ))
#> # incidence:  324 x 4
#> # count vars: cases_new, deaths_new
#> # groups:     region
#>    date_index region          count_variable count
#>  * <yrmon>    <chr>           <fct>          <dbl>
#>  1 2020-Jan   East Midlands   cases_new         NA
#>  2 2020-Jan   East Midlands   deaths_new        NA
#>  3 2020-Jan   East of England cases_new         NA
#>  4 2020-Jan   East of England deaths_new        NA
#>  5 2020-Jan   London          cases_new         NA
#>  6 2020-Jan   London          deaths_new        NA
#>  7 2020-Jan   North East      cases_new         NA
#>  8 2020-Jan   North East      deaths_new        NA
#>  9 2020-Jan   North West      cases_new         NA
#> 10 2020-Jan   North West      deaths_new        NA
#> # ℹ 314 more rows


# exclude deaths from plot due to scale
monthly_covid |> 
    subset(count_variable == "cases_new") |> 
    plot(nrow = 3, angle = 45, border_colour = "white")
#> Warning: Removed 26 rows containing missing values (`position_stack()`).

Where it may be preferable to use {grates} directly is for simpler aggregations of a single date_index, where you are not worried about the additional formatting of output and the default print methods:

# e.g. For some this may be sufficient
ebola |> 
    mutate(isoweek = as_isoweek(date_of_onset)) |> 
    count(isoweek, gender) |> 
    head(n = 10L)
#>     isoweek gender  n
#> 1  2014-W15      f  1
#> 2  2014-W16      m  1
#> 3  2014-W17      f  4
#> 4  2014-W17      m  1
#> 5  2014-W18      f  4
#> 6  2014-W19      f  9
#> 7  2014-W19      m  3
#> 8  2014-W20      f  7
#> 9  2014-W20      m 10
#> 10 2014-W21      f  8

# as opposed to
incidence(
    ebola,
    date_index = c(onset = "date_of_onset"),
    interval = "isoweek",
    groups = "gender"
)
#> # incidence:  109 x 4
#> # count vars: onset
#> # groups:     gender
#>    date_index gender count_variable count
#>  * <isowk>    <fct>  <chr>          <int>
#>  1 2014-W15   f      onset              1
#>  2 2014-W16   m      onset              1
#>  3 2014-W17   f      onset              4
#>  4 2014-W17   m      onset              1
#>  5 2014-W18   f      onset              4
#>  6 2014-W19   f      onset              9
#>  7 2014-W19   m      onset              3
#>  8 2014-W20   f      onset              7
#>  9 2014-W20   m      onset             10
#> 10 2014-W21   f      onset              8
#> # ℹ 99 more rows

For programmatic use the benefits are more apparent: knowledge of the object's invariants and structure makes it simple for developers to enable nice workflows such as

library(i2extras)

out <- 
    ebola |> 
    incidence(date_index = "date_of_onset", interval = "week", groups = "hospital") |> 
    slice_head(n = 120L) |> 
    fit_curve(model = "poisson", alpha = 0.05)

# plot with a prediction interval but not a confidence interval
plot(out, ci = FALSE, pi = TRUE, angle = 45, border_colour = "white")

# estimate growth rate
growth_rate(out)
#> # A tibble: 6 × 10
#>   count_variable hospital      model     r r_lower r_upper growth_or_decay  time
#>   <chr>          <fct>         <lis> <dbl>   <dbl>   <dbl> <chr>           <dbl>
#> 1 date_of_onset  Connaught Ho… <glm> 0.197   0.177   0.217 doubling         3.53
#> 2 date_of_onset  Military Hos… <glm> 0.173   0.147   0.200 doubling         4.00
#> 3 date_of_onset  other         <glm> 0.170   0.141   0.200 doubling         4.09
#> 4 date_of_onset  Princess Chr… <glm> 0.142   0.101   0.188 doubling         4.87
#> 5 date_of_onset  Rokupa Hospi… <glm> 0.178   0.133   0.228 doubling         3.89
#> 6 date_of_onset  <NA>          <glm> 0.184   0.164   0.205 doubling         3.77
#> # ℹ 2 more variables: time_lower <dbl>, time_upper <dbl>

Created on 2023-04-11 with reprex v2.0.2

Bisaloo added the discussion label Apr 11, 2023

Bisaloo mentioned this issue Apr 11, 2023

Updates transmissibility report lockfiles and deal with incidence2 breaking changes #77

Merged
