Enable glue syntax in label to access current column/segment (and possibly others) #495

Merged
merged 23 commits into from
Oct 28, 2023

yjunechoe
Collaborator

@yjunechoe yjunechoe commented Oct 25, 2023

Summary

Many validation functions internally iterate over a user-specified set of columns and segments, where each combination is materialized as individual steps.

pointblank/R/col_vals_lt.R

Lines 377 to 378 in dc1b917

for (i in seq_along(columns)) {
  for (j in seq_along(segments_list)) {

Currently, these steps simply inherit the same value for label, but given that the columns/segments can be selected dynamically (especially with #493), it'd be nice if users can access the context for the current step.

The PR implements this by exposing {.col} (using dplyr::across() terminology) and {.segment} from the internal create_validation_step() function:

agent1 <- create_agent(small_table) %>% 
  col_vals_lt(
    columns = matches("^[ac]$"),
    value = 8,
    segments = vars(f),
    label = "column: {.col}, segment: {.segment}"
  )
cat(agent1$validation_set$label, sep = "\n")
#> column: a, segment: high
#> column: a, segment: low
#> column: a, segment: mid
#> column: c, segment: high
#> column: c, segment: low
#> column: c, segment: mid

This next point is minor, but I personally think that exposing more information from the internals is better (as long as it's a bare string, which is risk-free). So just for fun, this PR also exposes {.step} as the name of the validation step (I don't actually have a strong feeling about this one, so {.step} can be removed for simplicity).

This allows the following reprex for the suggestion raised in #451:

agent2 <- small_table |> 
  create_agent() |> 
  col_vals_lt(
    c, 8,
    segments = vars(f),
    label = "The `{.step}()` step for group '{.segment}'"
  ) |> 
  interrogate()
cat(agent2$validation_set$label, sep = "\n")
#> The `col_vals_lt()` step for group 'high'
#> The `col_vals_lt()` step for group 'low'
#> The `col_vals_lt()` step for group 'mid'

The difficult part of this implementation was getting the scoping right: the {} context should only search for vars/fns in the caller environment of the validation function (ex: the environment where col_vals_lt() was called by the user) and above. Otherwise, glue will "leak" internal variables. In this PR, I use the heuristic of caller_env(n=2L), given the following invariant in how the internal create_validation_step() function gets called:

<user environment>
└── col_vals_lt()
    └── create_validation_step()

So caller_env(n=2L) climbs twice from create_validation_step() where it's called, ensuring that users can't access any arbitrary local variables via the glue context. For extensibility, create_validation_step() gains a .call = caller_env(n=2L) argument in case we need to break the above assumption about where in the stack trace create_validation_step() is called, relative to the "user environment".
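The scoping rule above can be illustrated with a minimal sketch. The functions `make_step_label()` and `validate_col()` are hypothetical stand-ins for `create_validation_step()` and a validation function such as `col_vals_lt()`. Only `glue::glue_data()` and `rlang::caller_env()` are real calls, and it is an assumption that the PR evaluates the label in this exact way.

```r
library(glue)
library(rlang)

# Hypothetical stand-in for the internal create_validation_step()
make_step_label <- function(label, column, .call = caller_env(n = 2L)) {
  force(.call)                          # resolve the user's environment up front
  internal_only <- "internal detail"    # must not be reachable from `label`
  # Look up {.col} in the data mask first, then fall back to the user's
  # environment (never to this function's own frame)
  as.character(glue_data(list(.col = column), label, .envir = .call))
}

# Hypothetical stand-in for a validation function like col_vals_lt()
validate_col <- function(column, label) {
  also_internal <- "validation function local"
  make_step_label(label, column)
}

prefix <- "checked"
ok <- validate_col("a", "{prefix}: {.col}")

# Internal variables are not visible to the glue expression, so this errors
leaked <- tryCatch(
  validate_col("a", "{internal_only}"),
  error = function(e) "not found"
)
```

Here `ok` picks up `prefix` from the calling environment, while the attempt to read `internal_only` fails because glue only searches the mask and the environments at or above the user's call.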

Misc. considerations

The current step's column and segment are the most important information because those may be dynamic, while others are mostly known and hardcoded. But if we were to expose some other information from the "inside", some other candidates are:

  • step ID i
  • value (though value doesn't always resolve to a string, which is tricky)
  • seg_col

Happy to adopt any suggestions to the set of glue variables we expose in label and their names (ex: {.col} could be {.column}).

Lastly, while this complements the enhancement in #493, the two PRs touch different files/behaviors, so they can be merged in either order.

Related GitHub Issues and PRs

Checklist

@yjunechoe
Collaborator Author

yjunechoe commented Oct 25, 2023

Oops, I spoke too soon: there's one sizeable challenge for the yaml round trip. The original glue expression is lost after create_validation_step() runs, so the yaml*() functions that read the contents of agent$validation_set do not know about glue.

This gives us a behavior where only the first materialized label is picked to represent the label for the entire step:

pointblank/R/yaml_write.R

Lines 1037 to 1038 in dc1b917

dplyr::group_by(i_o) %>%
dplyr::filter(dplyr::row_number() == 1) %>%

agent2 <- create_agent(~ small_table) |> 
  col_vals_lt(
    c, 8,
    segments = vars(f),
    label = "The `{.step}()` step for group '{.segment}'"
  ) |> 
  interrogate()

yaml_agent_string(agent2, expanded = FALSE)
#> type: agent
#> tbl: ~small_table
#> tbl_name: ~small_table
#> label: '[2023-10-25|08:33:26]'
#> lang: en
#> locale: en
#> steps:
#> - col_vals_lt:
#>     columns: vars(c)
#>     value: 8.0
#>     segments: list(vars(f))
#>     label: The `col_vals_lt()` step for group 'high'

There's not a great way around this, since we'd prefer the label to be materialized (especially given that currently this PR allows more complicated stuff like {my_fun(.col)}). One thing that could help is if validation functions also let label accept a vector whose length is the number of segments times the number of columns. This would make validation_set$label a list-column of character vectors, but switching over to that shouldn't be too challenging(?).

Essentially, it'd be nice for label in the above example to be represented in the yaml as something like:

#>     label:
#>     - The `col_vals_lt()` step for group 'high'
#>     - The `col_vals_lt()` step for group 'low'
#>     - The `col_vals_lt()` step for group 'mid'

And only collapse if label is identical across the expanded steps

@yjunechoe yjunechoe marked this pull request as draft October 25, 2023 20:13
@yjunechoe
Collaborator Author

yjunechoe commented Oct 26, 2023

I toyed around with this idea a bit further. In addition to supporting glue syntax in label, this PR now also supports a multi-length character vector for label (mostly to ensure the yaml round-trip for materialized glue labels, but this now also allows completely custom labels for the expanded steps, as long as the length matches).

agent_pre <- create_agent(~ small_table) |> 
  col_vals_lt(
    c, 8,
    segments = vars(f),
    label = "The `col_vals_lt()` step for group '{.segment}'"
  )

# Yaml representation (multi-length label)
yaml_agent_string(agent_pre, expanded = FALSE)
#> type: agent
#> tbl: ~small_table
#> tbl_name: ~small_table
#> label: '[2023-10-26|10:39:24]'
#> lang: en
#> locale: en
#> steps:
#> - col_vals_lt:
#>     columns: vars(c)
#>     value: 8.0
#>     segments: list(vars(f))
#>     label:
#>     - The `col_vals_lt()` step for group 'high'
#>     - The `col_vals_lt()` step for group 'low'
#>     - The `col_vals_lt()` step for group 'mid'

agent_yaml <- tempfile()
yaml_write(agent_pre, expanded = FALSE, filename = agent_yaml)

# Multi-length label makes the round-trip
agent_post <- yaml_read_agent(agent_yaml)
yaml_agent_string(agent_post, expanded = FALSE)
#> type: agent
#> tbl: ~small_table
#> tbl_name: ~small_table
#> label: '[2023-10-26|10:39:24]'
#> lang: en
#> locale: en
#> steps:
#> - col_vals_lt:
#>     columns: vars(c)
#>     value: 8.0
#>     segments: list(vars(f))
#>     label:
#>     - The `col_vals_lt()` step for group 'high'
#>     - The `col_vals_lt()` step for group 'low'
#>     - The `col_vals_lt()` step for group 'mid'

identical(
  as_agent_yaml_list(agent_pre, expanded = FALSE),
  as_agent_yaml_list(agent_post, expanded = FALSE)
)
#> [1] TRUE

The old yaml-writing behavior is preserved by collapsing label when it doesn't vary:

agent_one_label <- create_agent(~ small_table) |> 
  col_vals_lt(
    c, 8,
    segments = vars(f),
    label = "I'm the same label for all segments!"
  )
yaml_agent_string(agent_one_label)
#> type: agent
#> tbl: ~small_table
#> tbl_name: ~small_table
#> label: '[2023-10-26|10:48:33]'
#> lang: en
#> locale: en
#> steps:
#> - col_vals_lt:
#>     columns: vars(c)
#>     value: 8.0
#>     segments: list(vars(f))
#>     label: I'm the same label for all segments!
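A minimal sketch of the collapsing rule, in base R. The helper name `collapse_label()` is hypothetical and is not claimed to match the code in the PR; it only shows the idea of writing a scalar when every expanded step shares one label and keeping the full vector otherwise.

```r
# Hypothetical helper: collapse per-step labels to a scalar when they all agree
collapse_label <- function(labels) {
  if (length(unique(labels)) == 1L) labels[[1]] else labels
}

# Identical labels across segments collapse to a single string
same <- collapse_label(rep("I'm the same label for all segments!", 3))

# Labels that vary by segment are kept as a character vector
varied <- collapse_label(c("group 'high'", "group 'low'", "group 'mid'"))
```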

Implementational details

I introduce a new resolve_label() function that intercepts label right before it's sent down to create_validation_step(). For example inside col_vals_lt():

  label <- resolve_label(label, columns, segments_list)
  for (i in seq_along(columns)) {
    for (j in seq_along(segments_list)) {

All resolve_label() does is recycle label to length(columns) * length(segments_list) elements so that it can be iterated over in the for loop(s). It returns a matrix of labels so that it can be conveniently subset with i and j. This allows us to iterate over columns (rows) and segments (columns) of the label matrix, like this:

        create_validation_step(
          <...>
          label = label[[i,j]],
          <...>
        )

This handles label accepting an n>1 character vector.
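The recycling step can be sketched in base R as follows. The helper `recycle_label_matrix()` is a hypothetical illustration, not the actual resolve_label(); the row-wise fill order and the length check are assumptions chosen so that `label[[i, j]]` lines up with the order in which the nested loops create steps.

```r
# Hypothetical helper mirroring the described behaviour of resolve_label()
recycle_label_matrix <- function(label, columns, segments) {
  n_col <- length(columns)
  n_seg <- length(segments)
  n <- n_col * n_seg
  if (is.null(label)) label <- NA_character_
  # Accept either one label for all steps or exactly one label per step
  if (!length(label) %in% c(1L, n)) {
    stop("`label` must have length 1 or ", n, ", not ", length(label), ".")
  }
  # Fill by row so that label[[i, j]] follows step order:
  # columns in the outer loop, segments in the inner loop
  matrix(rep_len(label, n), nrow = n_col, ncol = n_seg, byrow = TRUE)
}

columns <- c("a", "c")
segments <- c("high", "low", "mid")
labels <- recycle_label_matrix(paste("step", 1:6), columns, segments)

for (i in seq_along(columns)) {
  for (j in seq_along(segments)) {
    cat(columns[i], segments[j], "->", labels[[i, j]], "\n")
  }
}
```

A length-1 label is simply repeated into every cell, which keeps the existing single-label behavior unchanged.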

Separately, writing out an n>1 character vector to yaml is handled inside yaml_agent_string(). I take a minimally intrusive approach where if !expanded, the $label column of the validation set temporarily becomes a list-column of character vectors. The individual elements of $label are then unlisted to a character vector/scalar at the very end, right before they're written out to yaml, which gives us yaml representations like:

label:
  - first
  - second
  - third

as opposed to:

label:
  -  - first
     - second
     - third

TODO

Proof of concept is complete and the PR passes existing tests, but the new features (dynamic glue syntax & accepting multi-length label input) could use more tests.

Let me know if you think this is a direction worth pursuing!

@yjunechoe yjunechoe marked this pull request as ready for review October 26, 2023 15:21
@rich-iannone
Member

This is great! And we could definitely merge this before the other big PR you worked on. So long as the YAML round trips, this is good stuff!

@yjunechoe
Collaborator Author

Sounds good - I'll add a couple more tests here and ping you again for a final review!

@yjunechoe
Collaborator Author

yjunechoe commented Oct 27, 2023

I've covered some expected behaviors for the new features in test-label_glue.R. As far as I can tell, glue syntax and multi-length label work as expected!

Currently, the following string variables from inside create_validation_step() are exposed to users via glue:

      data = list(
        .step = assertion_type,
        .col = column,
        .seg_col = seg_col,
        .seg_val = seg_val
      ),

... where .col, .seg_col, and .seg_val could all be dynamic, so those would be the most useful bits of information to have access to "from the outside". I was hoping to also expose value here, but that requires some more thinking and can be reconsidered at a later point if users say they want it.

  • (ex: if value is left+right, should those be exposed separately or as length-2 vector? how do we disambiguate between string constant vs. column name?, etc.)

Let me know how this looks! I'll then edit NEWS accordingly

@rich-iannone
Member

Looks really good! We can follow up later with some documentation (in each of the validation functions) that describes the new glue-powered naming (and which variables you can use). This is a super-great addition!

@rich-iannone
Member

I'll re-run the tests once CRAN is back up (that's causing the failure with the setup-r action).

@rich-iannone
Member

Everything looks good wrt tests. Would you add that NEWS entry? Then we’re good to merge!

@yjunechoe
Collaborator Author

yjunechoe commented Oct 28, 2023

Done! While I'm at it, would you like me to add that blurb to the function docs as well? I think there's just one place where label for validation functions is documented and inherited from, so it should be simple:

pointblank/R/col_vals_gt.R

Lines 134 to 140 in dc1b917

#' @param label *An optional label for the validation step*
#'
#' `scalar<character>` // *default:* `NULL` (`optional`)
#'
#' An optional label for the validation step. This label appears in the
#' *agent* report and, for the best appearance, it should be kept quite short.
#'

I can add another short paragraph below this explaining the new glue syntax. And we could also subtly document multi-length label support here by changing the argument signature from scalar<character> to vector<character>.

(Oops! I'm just seeing your earlier comment that we could do this later - I'm happy to merge this as-is and work on documentation separately!)

@rich-iannone
Member

I was thinking about deferring only because it would be a whole lot more work for you (and you already did quite a bit). But, the changes you proposed (both great!) would be great to have in here. It's a lot of copy/paste + devtools::document() but it would round out the whole thing. Then finally we can merge this, promise!

@yjunechoe
Collaborator Author

Ah! Ok I'm just now realizing the complexity of the documentation setup beyond just inheriting params 😅. Yes, let's merge this now first! I'll then merge this into the tidyselect PR.

And maybe we can hold off on documentation updates until after both are merged?

Member

@rich-iannone rich-iannone left a comment

Lgtm!

@rich-iannone rich-iannone merged commit 03917d7 into rstudio:main Oct 28, 2023
12 of 13 checks passed
@rich-iannone
Member

@yjunechoe it's now in! I will get to the other PR in short order.

@yjunechoe yjunechoe deleted the label-glue branch October 28, 2023 17:35
yjunechoe added a commit that referenced this pull request Oct 31, 2023