[Breaking Changes] attribute-based node protection #107

Draft: wants to merge 40 commits into base: main

@zkamvar zkamvar commented Apr 29, 2024

As discussed in #105 (comment), since I have some time on my hands, I wanted to give this a go.

This shifts the paradigm from splitting nodes that need protection to one where attributes tell us where the protection needs to be applied.

Not Ready to Merge

I'm not quite ready to merge this yet because it has ripple effects for both {babeldown}, which explicitly relies on asis and curly nodes being separated out so that they stay out of the translation fields

  ## protect content inside curly braces and math ----
  woolish$body <- tinkr::protect_math(woolish$body)
  woolish$body <- tinkr::protect_curly(woolish$body)
  curlies <- xml2::xml_find_all(woolish$body, "//*[@curly]")
  purrr::walk(curlies, protect_curly)
  maths <- xml2::xml_find_all(woolish$body, "//*[@asis='true']")
  purrr::walk(maths, protect_math)

and {pegboard}, whose link transformation routines (from Jekyll -> pandoc) explicitly assume that the asis nodes exist, as shown in the documentation at fix_links.R#L38-L48:

#' However, if a link uses liquid templating for a variable such as: 
#' `[Home]({{ page.root }}/index.html) and other text`, it will appear in XML as
#'
#' ```xml
#' ...
#' <text asis="true">[</text>
#' <text>Home</text>
#' <text asis="true">]</text>
#' <text>({{ page.root }}/index.html) and other text</text>
#' ...
#' ```

zkamvar added 30 commits April 18, 2024 12:38
I've modified the escape-text function to escape text based on whether or
not it falls within an escapable range.

This commit implements a proof of concept that protects the first
escapable character and will not pass check.
In this version, we no longer need to split nodes in order to protect
them if we also want them to be continuous. I've taken the XSL template
"escape-text" and modified it so that it takes in three new parameters:

1. `pos`..........the position of the current character
2. `protect.pos`..a space-separated list of starting positions for
   protection
3. `protect.end`..a space-separated list of ending positions for protection

I've also added three new helper templates to handle list contents:

`peek` returns the top of the list, `trim` trims off the first element
of the list (or returns the value if it's not a list), and `adjust-range`
trims a list depending on if the current value is within range.
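
To make the position-based logic above concrete, here is a minimal base R sketch of the same idea. It is not the XSL template itself: `peek()`, `trim()`, and `escape_text()` are hypothetical R analogues written for illustration, the set of escapable characters is an assumption (only square brackets), and positions are assumed to be 1-based and inclusive.

```r
# Hypothetical R analogues of the `peek` and `trim` helper templates.
# Lists are space-separated strings such as "3 10".
peek <- function(lst) sub(" .*$", "", lst)  # first element of the list

# Drop the first element. Unlike the template described above, this sketch
# returns "" once the list is exhausted, which keeps the loop below simple.
trim <- function(lst) sub("^[^ ]+ ?", "", lst)

escape_text <- function(text, protect_pos = "", protect_end = "",
                        special = c("[", "]")) {
  chars <- strsplit(text, "")[[1]]
  out <- character(length(chars))
  for (pos in seq_along(chars)) {
    # Analogue of `adjust-range`: discard ranges that ended before `pos`.
    while (nzchar(protect_end) && pos > as.integer(peek(protect_end))) {
      protect_pos <- trim(protect_pos)
      protect_end <- trim(protect_end)
    }
    # A character is protected when it sits inside the current range.
    protected <- nzchar(protect_pos) && pos >= as.integer(peek(protect_pos))
    ch <- chars[pos]
    out[pos] <- if (!protected && ch %in% special) paste0("\\", ch) else ch
  }
  paste(out, collapse = "")
}

# Characters 5 to 7 ("[b]") are protected, so only "[a]" gets escaped.
escaped <- escape_text("[a] [b]", protect_pos = "5", protect_end = "7")
```

The node text stays in one piece here; only the two position lists decide which characters are left alone.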

There's a lot of printing here because I wasn't too confident with
debugging, but based on my test in inst/extdata/xml_protect.xml, it
produces results correctly.
I had initially found a tokenize template and had contacted the author
about license information (she gave permission):
<https://exslt.github.io/str/functions/tokenize/str.tokenize.template.xsl.html>

When I was working with it, I found that the function already exists as
part of libxml because it bundles EXSLT functions, which allows me to do
this more easily and efficiently by tracking and modifying a single index
instead of a pair of strings.
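
The single-index idea can be sketched in base R as follows. This is an illustration rather than the XSL code in this PR: `escape_text_indexed()` is a hypothetical name, `strsplit()` stands in for the EXSLT tokenize function, and the escapable characters (square brackets only) and 1-based inclusive positions are assumptions.

```r
escape_text_indexed <- function(text, protect_pos = "", protect_end = "",
                                special = c("[", "]")) {
  # Tokenize both lists once, up front (stand-in for EXSLT tokenize).
  starts <- as.integer(strsplit(protect_pos, " ", fixed = TRUE)[[1]])
  ends <- as.integer(strsplit(protect_end, " ", fixed = TRUE)[[1]])
  stopifnot(length(starts) == length(ends))

  i <- 1L  # the single index into both token vectors
  chars <- strsplit(text, "")[[1]]
  out <- character(length(chars))
  for (pos in seq_along(chars)) {
    # Move to the next range once the current one has ended.
    while (i <= length(ends) && pos > ends[i]) i <- i + 1L
    protected <- i <= length(starts) && pos >= starts[i]
    ch <- chars[pos]
    out[pos] <- if (!protected && ch %in% special) paste0("\\", ch) else ch
  }
  paste(out, collapse = "")
}

# Two protected ranges: characters 1 to 3 and 9 to 11.
escaped_indexed <- escape_text_indexed("[a] [b] [c]", "1 9", "3 11")
```

Compared with repeatedly trimming two strings, the lists are parsed once and only `i` changes as the position advances.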
The square bracket _should_ be escaped since it's outside of the
protected range.
This begins to address limitations of the attribute-based protection
by providing a way to separate and rejoin nodes that were previously
split.
zkamvar added 9 commits May 2, 2024 09:31
The previous iteration was not quite correct because it had assumed that
the sourcepos would match up exactly with the protection ranges, but
these were two separate numbers.

This does the following:

1. when a protected range spans the entire node, then it is labeled "asis"
2. `split_sourcepos()` now reflects the actual end of the sourcepos
   instead of the computed end
3. an awkward catch for single nodes in `join_split_nodes()` is now
   eliminated
4. `join_split_nodes()` no longer recomputes the protected ranges from
   the sourcepos
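
As a rough illustration of item 1 above, the labeling rule could look like the following base R sketch. `label_protection()` is a hypothetical helper, not a function in this PR, and using `protect.pos` and `protect.end` as attribute names is an assumption borrowed from the XSL parameter names; positions are assumed to be 1-based and inclusive.

```r
label_protection <- function(text, starts, ends) {
  n <- nchar(text)
  if (length(starts) == 1L && starts == 1L && ends == n) {
    # One range covering the whole node: label the node "asis".
    list(asis = "true")
  } else if (length(starts) > 0L) {
    # Partial protection: record the ranges (attribute names are assumed).
    list(protect.pos = paste(starts, collapse = " "),
         protect.end = paste(ends, collapse = " "))
  } else {
    # Nothing to protect.
    list()
  }
}

whole <- label_protection("$x$", 1L, 3L)      # range spans the entire node
partial <- label_protection("a $x$", 3L, 5L)  # range covers only "$x$"
```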
This allows us to search for internal nodes using their identities
@zkamvar zkamvar mentioned this pull request May 9, 2024