Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

utils/astring: add expected control characters #134

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from
Draft
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions aexpect/utils/astring.py
Original file line number Diff line number Diff line change
Expand Up @@ -37,6 +37,7 @@ def strip_console_codes(output, custom_codes=None):
output = f"\x1b[m{output}"
console_codes = "%[G@8]|\\[[@A-HJ-MPXa-hl-nqrsu\\`]"
console_codes += "|\\[[\\d;]+[HJKgqnrm]|#8|\\([B0UK]|\\)|\\[\\?2004[lh]"
console_codes += "|[c]|[!p]|[\\]104]"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well the problem with "c" (and actually with all the existing and added codes) is that we only assume they are matched from the beginning, but we don't force that. This means something like \x1bthis-is-a-broken-text-c would now be accepted and result in output his-is-a-broken-text-c.

Note this is not introduced by your change, it just makes it more visible (as matching a single "c" is quite probable).

The fix for this should be IMO to match from the beginning (especially since we erase characters from the beginning). So your fix should add |^c|^!p|^]104 to ensure only \x1bc/\x1b!p and \x1b]104 is matched.

As for the brackets, it's a regexp so [c] matches any character out of the brackets. At this point we don't expect any other standalone characters so I'd suggest just the c without the brackets.

Similarly the [!p] which actually matches only ! out of the !pfoo string as it matches one of the characters inside the brackets. So that one really has to be without the brackets.

The last one is actually generic and vendors are free to specify any ]\d+ strings so here I'd probably consider ^]\d+ rather than one of the variants, what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it seems even worse then I though. Astring blindly ignores many issues, I have fixed some of them and added a selftest here: #135 but IMO it'd be even better to simply replace it with a regexp to substitute all console codes with "" instead of this custom handling.

Anyway I understand that this code is here and the behaviour was odd for a long time. So the question is whether we really want to do something about this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case my comments were not clear, I'm ready to ack this patch once your part of the regexp works properly, therefore something like console_codes += "|^c|^!p|^]104 or console_codes += "|^c|^!p|^]\d+. @clebergnu you had some concerns, would this work for you?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try this, thanks for the review, @ldoktor AFAICR @clebergnu 's concern was that we first confirm those are actually console codes. I did it in all cases IMO, just not sure what the OS operation 104 is supposed to mean. As you might have seen I have reached out to many people in the OS, kernel space and no answer so far. But if the test passes, it passes, and the control characters all start with \x1b so I think it's safe to assume we're not breaking 'normal strings'.

Copy link
Author

@smitterl smitterl Aug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case my comments were not clear, I'm ready to ack this patch once your part of the regexp works properly, therefore something like console_codes += "|^c|^!p|^]104 or console_codes += "|^c|^!p|^]\d+.

Either I'm misunderstanding something or your proposal doesn't work. With the indicated regex the algorithm raises at "!p".

The problematic line that I reported starts with "ESC c" but no line starts with "ESC !p". The control character doesn't imply that the lines start with these characters. We see the soft reset command isn't the first, so for the same right "c" doesn't need to be the first. Also, in this xterm reference there's no indicated it must be the first https://gist.github.com/justinmk/a5102f9a0c1810437885a04a07ef0a91 Maybe the algorithm works a bit different from what we read in the code? I read out the tmp_word. Its value changes as listed below.

However, my original regex works and it doesn't filter out 'c' if there's no ESC in front of it as we can see in terms such as "Connected" or "avocado" right in the first line.

I think we should merge this as-is and not accept #135 for the outlined reasons.

Please let me know if you disagree. Otherwise I hope we can merge this because there's no workaround besides this patch for my tests.

tmp_word: [mConnected to domain 'avocado-vt-vm1'
Escape character is ^] (Ctrl + ])
LOADPARM=[        ]
Using virtio-blk.
Using SCSI scheme.
s390-ccw Enumerated Boot Menu.

 [0] default
...
[    0.558508] Checked W+X mappings: passed, no W+X pages found
[    0.558514] Run /init as init process
[    0.563984] systemd[1]: Successfully made /usr/ read-only.

2024-08-30 09:09:55,831 astring          L0055 DEBUG| tmp_word: c
2024-08-30 09:09:55,831 astring          L0055 DEBUG| tmp_word: [!p
2024-08-30 09:09:55,831 astring          L0055 DEBUG| tmp_word: ]104^G
2024-08-30 09:09:55,831 astring          L0055 DEBUG| tmp_word: [?7h[    0.564148] systemd[1]: systemd 256-13.el10 running in system mode (+PAM +AUDIT +SELINU
X -APPARMOR +IMA +SMACK +SECCOMP -GCRYPT -GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN -IPTC +KMOD +LIBCRYPTSETUP +LIBCRYPTSETUP_PLUGINS +LIB
FDISK +PCRE2 +PWQUALITY +P11KIT -QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +BPF_FRAMEWORK +XKBCOMMON +UTMP +SYSVINIT +LIBARCHIVE)
[    0.564153] systemd[1]: Detected virtualization kvm.
[    0.564157] systemd[1]: Detected architecture s390x.
[    0.564159] systemd[1]: Running in initrd.

Welcome to 

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clebergnu IIUC it should be OSC command: https://en.wikipedia.org/wiki/ANSI_escape_code#OSC which are not strictly defined (which is why I suggested the \d to catch them all).

As for this version I mentioned the brackets are unnecessary as you only define one or a few characters (it's regexp so the [] brackets means one of the characters.

And the c itself sounds a bit too vague without the ^ because it simply matches 1 character whenever c is present after the initial text. Take this as an example: aexpect.utils.astring.strip_console_codes("0\x1b[0m1\x1b2345\x1b67890123456789") fails but aexpect.utils.astring.strip_console_codes("0\x1b[0m1\x1b23c45\x1b67890123456789") succeeds and reports the c as part of the output, stripping the 2 simply because it cuts of the first character.

So yes, the #135 should rapidly change the situation but even before that one (and I think your commit should go first) shouldn't introduce more of the odd behaviours.

As for the !p, thanks for sharing the snippet (would be nice to get the full original output). From there it looks like your getting \x1b[!p and not just \x1b!p, which is why my regexp does not work. In that case it should be |^c|^\\[!p|^]104 or (IMO better) |^c|^\\[!p|^]\d+.

Copy link
Contributor

@clebergnu clebergnu Sep 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think I understand @ldoktor concern. @smitterl, would you be able to add #135 to your test with your reproducer?

Copy link
Author

@smitterl smitterl Sep 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clebergnu with #135 as-is the error just reproduces. But I ran another test that uses the console and that one passes. I'm trying to figure out now what the right expression should be according to #135, |^c|^\\[!p|^]\d+ doesn't work. Python itself complains about the invalid escape sequence \d but |^c|^\\[!p|^]\\d+ doesn't work either.
I am still not sure I understand why the "^", I have not understood the algorithm yet. I hope to have a moment for this soon.

if custom_codes is not None and custom_codes not in console_codes:
console_codes += f"|{custom_codes}"
while index < len(output):
Expand Down