Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add unicode support to the saw-script lexer #2187

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

sauclovian-g
Copy link
Contributor

Closes #2042.

@sauclovian-g
Copy link
Contributor Author

Note that there was some talk about banning non-8-bit characters from SAW strings until the question of how to deal with them in Cryptol strings gets resolved. I don't think there's any real need to do that, so I haven't, because (a) you can do very little with SAW strings anyway and (b) there doesn't seem to be any way to convert between SAW and Cryptol strings. (There's no eval_string or string_to_term functions to match eval_int and int_to_term.)

Copy link
Contributor

@RyanGlScott RyanGlScott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Worth an addition to the changelog, though.

Note that there was some talk about banning non-8-bit characters from SAW strings until the question of how to deal with them in Cryptol strings gets resolved. I don't think there's any real need to do that, so I haven't

I agree, no need to do so here.

@@ -0,0 +1,2 @@
π : [4]
π = 3 // or so
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sauclovian-g
Copy link
Contributor Author

LGTM! Worth an addition to the changelog, though.

Definitely. Completely forgot.

Also, I appear to have accidentally corrected the position spanning behavior for Cryptol fragments and quoted strings, which is why a bunch of the column numbers in test_type_errors changed. (The position for the whole of such objects now includes the end characters as well as the start characters; previously it left off the end characters.) I've checked all the diffs and they're all correct.

(not all the columns are still entirely correct, e.g. the start position for (x : Int) <- "abc" is column 2, not column 1, but that's for some other time)

 - Turn off the Alex wrapper. We're already doing our own version of
   half of it, and things we want to do cause type conflicts in the
   generated code we aren't using.

 - Provide our own AlexPos. AlexPos holds the current position (like
   the old lexer position type) but we don't need to carry the total
   character count in it. Stop exporting it; nothing uses it.

 - Provide our own AlexInput and alexGetByte. For the moment, panic on
   characters that don't fit in a byte; that's not quite the status
   quo but it's close enough.

 - Convert to saw-script positions (which are spans/ranges) when we
   generate tokens instead of transforming them on a second pass
   afterwards.

 - Panic on alexInputPrevChar; it is not called by the generated code.

 - Tidy various things a bit.

This commit wasn't intended to have any functional change, except
possibly for behavior on non-ASCII input that isn't currently
supported.

However, it seems that by changing the position logic I accidentally
fixed the position tracking for strings and Cryptol fragments. This
is a good thing, so run with it...

Previously the end delimiter (and for Cryptol fragments, any
whitespace before it), but not the start delimiter, was left out of
the span of the position, so e.g. {{ 3 }} would show as 4 columns wide
and the position of "abc" would end at the position of the c.

This is no longer true; now it correctly includes the whole thing.

This changes some of the reported column numbers in test_type_errors,
so update the reference outputs.
No functional change intended.
Add a couple tests and a CHANGES entry.

Closes #2042.
@sauclovian-g
Copy link
Contributor Author

Hrm, it doesn't work in the CI (the test that fails is the one that makes sure that non-ascii identifiers work)... I expect this is some kind of locale settings issue.

@sauclovian-g sauclovian-g changed the title Add unicode to support to the saw-script lexer Add unicode support to the saw-script lexer Jan 17, 2025
@sauclovian-g
Copy link
Contributor Author

sauclovian-g commented Jan 17, 2025

Ok, the conclusion from that is that it thinks the default input encoding is ascii and (thus unsurprisingly) forcing read as utf-8 makes the unicode test work. (The extra print I added made assorted other things fail, which isn't important.)

(Edit for clarity: the GitHub UI has dropped the fact that I pushed a test commit at this point and reverted it later.)

@sauclovian-g
Copy link
Contributor Author

So... given #825, why is anything working this way?

@RyanGlScott
Copy link
Contributor

That is quite baffling, as (AFAIU) GitHub Actions' Ubuntu runners should default to UTF-8. I've confirmed that this is the case when I try printing the locale information in a minimal GitHub Actions workflow, as seen in this CI job:

Locale-derived encoding is UTF-8
Just UTF-8
Just UTF-8
Just UTF-8
Just UTF-8
import "Test.cry";

// we should be able to refer to non-ascii identifiers in imported Cryptol
let pi = {{ π }};

// and we should be able to have our own non-ascii identifiers too
let lá = true;

UTF-8 as far as the eye can see. So why is SAW's integration test CI workflow using ASCII instead? Honestly, I'm really not quite sure... perhaps an earlier step in the job is changing the locale from UTF-8 to ASCII for some reason? (We could try printing out the values of the LANG/LC_ALL environment variables at various steps in the integration_tests job to see if it changes at some point.)

Alternatively, we might consider adopting a policy of always using UTF-8 here. We could do so by defining LANG/LC_ALL to be UTF-8 in the workflow itself (although I worry something might be overriding this somewhere). We could also set the locale from within SAW itself, as suggested in #825 (comment). I don't have a strong opinion on this, so I'm open to your thoughts.

@weaversa
Copy link
Contributor

weaversa commented Jan 17, 2025

That is quite baffling, as (AFAIU) GitHub Actions' Ubuntu runners should default to UTF-8. I've confirmed that this is the case when I try printing the locale information in a minimal GitHub Actions workflow, as seen in this CI job:

Locale-derived encoding is UTF-8
Just UTF-8
Just UTF-8
Just UTF-8
Just UTF-8
import "Test.cry";

// we should be able to refer to non-ascii identifiers in imported Cryptol
let pi = {{ π }};

// and we should be able to have our own non-ascii identifiers too
let lá = true;

UTF-8 as far as the eye can see. So why is SAW's integration test CI workflow using ASCII instead? Honestly, I'm really not quite sure... perhaps an earlier step in the job is changing the locale from UTF-8 to ASCII for some reason? (We could try printing out the values of the LANG/LC_ALL environment variables at various steps in the integration_tests job to see if it changes at some point.)

Alternatively, we might consider adopting a policy of always using UTF-8 here. We could do so by defining LANG/LC_ALL to be UTF-8 in the workflow itself (although I worry something might be overriding this somewhere). We could also set the locale from within SAW itself, as suggested in #825 (comment). I don't have a strong opinion on this, so I'm open to your thoughts.

Would the latin-1 work here explain the issue? -- https://github.com/GaloisInc/cryptol/pull/1541/files

@RyanGlScott
Copy link
Contributor

I don't think so. GaloisInc/cryptol#1541 is about ensuring that string values in the Cryptol remote API are encoded as Latin-1 so that they can be represented as valid Cryptol strings (see GaloisInc/cryptol#863). That is an entirely separate issue from what is observed here, which is about the locale that is inherited from the CI environment (and therefore affects how SAW parses things). This part of SAW is entirely independent of Cryptol.

@sauclovian-g
Copy link
Contributor Author

We still don't know what's going on, but it seems that Cryptol sets the encoding explicitly to UTF-8 as the first thing in its main: https://github.com/GaloisInc/cryptol/blob/9e50047c1c13a1d09bb6d294820e4eb82d902a9b/cryptol/Main.hs#L235 and the conclusion so far is that we should probably do the same in SAW for behavioral consistency. That should also make the CI behave itself.

Meanwhile re #825 it seems that the probable trigger there is printing the SAW banner logo, which the CI doesn't ever do (maybe it should... that's for some other day) so it doesn't hit the problem.

So I'm going to revert my test-push (and force-push it out of existence), then push a patch to set the encoding, and with luck that'll pass the CI and we can move on.

This is not necessarily the right thing to do, but for the moment
consistency with Cryptol seems like the best approach.

I've added this in every main I can find, and in a couple cases
updated the minimum base version. (According to the docs, you need at
least base 4.5.)

This uses an internal GHC interface (apparently the only available way
to do this) but it seems like the risk of it changing gratuitously is
low, and the risk of it going away without an alternative appearing is
negligible.

The docs for the whole module suggest restricting to specific versions
of base, but that seems overkill for this particular function and
likely to cause more headaches than it would prevent.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parse error on embedded Cryptol with π (or other extended characters)
3 participants