-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add unicode support to the saw-script lexer #2187
base: master
Are you sure you want to change the base?
Conversation
Note that there was some talk about banning non-8-bit characters from SAW strings until the question of how to deal with them in Cryptol strings gets resolved. I don't think there's any real need to do that, so I haven't, because (a) you can do very little with SAW strings anyway and (b) there doesn't seem to be any way to convert between SAW and Cryptol strings. (There's no |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Worth an addition to the changelog, though.
Note that there was some talk about banning non-8-bit characters from SAW strings until the question of how to deal with them in Cryptol strings gets resolved. I don't think there's any real need to do that, so I haven't
I agree, no need to do so here.
@@ -0,0 +1,2 @@ | |||
π : [4] | |||
π = 3 // or so |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely. Completely forgot. Also, I appear to have accidentally corrected the position spanning behavior for Cryptol fragments and quoted strings, which is why a bunch of the column numbers in test_type_errors changed. (The position for the whole of such objects now includes the end characters as well as the start characters; previously it left off the end characters.) I've checked all the diffs and they're all correct. (not all the columns are still entirely correct, e.g. the start position for |
- Turn off the Alex wrapper. We're already doing our own version of half of it, and things we want to do cause type conflicts in the generated code we aren't using. - Provide our own AlexPos. AlexPos holds the current position (like the old lexer position type) but we don't need to carry the total character count in it. Stop exporting it; nothing uses it. - Provide our own AlexInput and alexGetByte. For the moment, panic on characters that don't fit in a byte; that's not quite the status quo but it's close enough. - Convert to saw-script positions (which are spans/ranges) when we generate tokens instead of transforming them on a second pass afterwards. - Panic on alexInputPrevChar; it is not called by the generated code. - Tidy various things a bit. This commit wasn't intended to have any functional change, except possibly for behavior on non-ASCII input that isn't currently supported. However, it seems that by changing the position logic I accidentally fixed the position tracking for strings and Cryptol fragments. This is a good thing, so run with it... Previously the end delimiter (and for Cryptol fragments, any whitespace before it), but not the start delimiter, was left out of the span of the position, so e.g. {{ 3 }} would show as 4 columns wide and the position of "abc" would end at the position of the c. This is no longer true; now it correctly includes the whole thing. This changes some of the reported column numbers in test_type_errors, so update the reference outputs.
No functional change intended.
Add a couple tests and a CHANGES entry. Closes #2042.
8fc3603
to
3505880
Compare
Hrm, it doesn't work in the CI (the test that fails is the one that makes sure that non-ascii identifiers work)... I expect this is some kind of locale settings issue. |
Ok, the conclusion from that is that it thinks the default input encoding is ascii and (thus unsurprisingly) forcing read as utf-8 makes the unicode test work. (The extra print I added made assorted other things fail, which isn't important.) (Edit for clarity: the GitHub UI has dropped the fact that I pushed a test commit at this point and reverted it later.) |
So... given #825, why is anything working this way? |
That is quite baffling, as (AFAIU) GitHub Actions' Ubuntu runners should default to UTF-8. I've confirmed that this is the case when I try printing the locale information in a minimal GitHub Actions workflow, as seen in this CI job:
UTF-8 as far as the eye can see. So why is SAW's integration test CI workflow using ASCII instead? Honestly, I'm really not quite sure... perhaps an earlier step in the job is changing the locale from UTF-8 to ASCII for some reason? (We could try printing out the values of the Alternatively, we might consider adopting a policy of always using UTF-8 here. We could do so by defining |
Would the |
I don't think so. GaloisInc/cryptol#1541 is about ensuring that string values in the Cryptol remote API are encoded as Latin-1 so that they can be represented as valid Cryptol strings (see GaloisInc/cryptol#863). That is an entirely separate issue from what is observed here, which is about the locale that is inherited from the CI environment (and therefore affects how SAW parses things). This part of SAW is entirely independent of Cryptol. |
We still don't know what's going on, but it seems that Cryptol sets the encoding explicitly to UTF-8 as the first thing in its Meanwhile re #825 it seems that the probable trigger there is printing the SAW banner logo, which the CI doesn't ever do (maybe it should... that's for some other day) so it doesn't hit the problem. So I'm going to revert my test-push (and force-push it out of existence), then push a patch to set the encoding, and with luck that'll pass the CI and we can move on. |
f479994
to
3505880
Compare
This is not necessarily the right thing to do, but for the moment consistency with Cryptol seems like the best approach. I've added this in every main I can find, and in a couple cases updated the minimum base version. (According to the docs, you need at least base 4.5.) This uses an internal GHC interface (apparently the only available way to do this) but it seems like the risk of it changing gratuitously is low, and the risk of it going away without an alternative appearing is negligible. The docs for the whole module suggest restricting to specific versions of base, but that seems overkill for this particular function and likely to cause more headaches than it would prevent.
Closes #2042.