-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Question about Python representation of matcher #15
Comments
Hi, |
I was thinking by analogy with a prefix tree. You can create a data
structure that takes a bunch of strings and will match anything that starts
with any of those strings. But given a fixed set of strings, you can also
convert that datastructure to Python (or C++) code which would do the
matching as if you had written the Python code yourself (laboriously). I
don't know if something similar could be done in this case.
…On Thu, 2 Mar 2023 at 16:30, Frederik Petersen ***@***.***> wrote:
Hi,
not sure what you mean by generating a Python representation. Can you
provide some more detailed thoughts on that.
It's been a while since I did most of the core implementation. The
finalized keyword tree is already a pretty optimized python representation
of the matcher data structure. If you have any specific ideas on how to
further optimize that, let me know and maybe even fork the repo, then we
can compare performance. :) Thanks for your input
—
Reply to this email directly, view it on GitHub
<#15 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAALCGOCV73SOCZ3TCAXSGLW2DDLTANCNFSM6AAAAAAVA46H6M>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
Okay, got it. Mhm, I would be curious if there could be any input-specific generated Python code that would be much faster than the code that is actually running as of now, as it's already pretty optimized as stated above. I def don't see a big speed up potential through that. The generated data structure in its finalized form is already set up in a way that there isn't a lot of overhead compared to the baseline of the algorithm. Obviously generated C(++) code would be faster, but then why not use pyahocorasick? Overall I probably just don't grasp where the big speedup should come from when comparing the current implementation and datastructure vs. a static generated-code version of a specific result graph. The only place that might be a bit faster would maybe be the lookups of transitions, but then again it might be outweighed by the additional complexity of the code. When it comes to pypy: Yeah the generated code might improve performance there. I could see that. But I'm not an expert on how the JIT component of it works and if it would actually be worth it. Overall I'm also not a big fan of generating code. Either it can be error prone (when done poorly) or it takes quite some effort to get right (using ASTs, etc.). If someone would come up with a working version of that I would be very interested to see it. I don't have time to look into it myself though, at this moment. |
I'm curious whether it would be possible to generate a Python representation of the matcher data structure, and if so, whether that might be even faster at matching (e.g. fewer lookups as things would be hardcoded, more scope for Pypy to optimize?). Just tossing that out there as a potential idea, but in any case curious to know what you think.
The text was updated successfully, but these errors were encountered: