Question about Python representation of matcher #15

Open
baoilleach opened this issue Feb 19, 2023 · 3 comments

Comments

@baoilleach

I'm curious whether it would be possible to generate a Python representation of the matcher data structure, and if so, whether that might match even faster (e.g. fewer lookups because the transitions would be hardcoded, and more scope for PyPy to optimize). Just tossing that out there as a potential idea; in any case, I'm curious to know what you think.

@FrederikP
Collaborator

Hi,
I'm not sure what you mean by generating a Python representation. Can you provide some more detailed thoughts on that?
It's been a while since I did most of the core implementation. The finalized keyword tree is already a pretty optimized Python representation of the matcher data structure. If you have any specific ideas on how to optimize it further, let me know, and maybe even fork the repo so we can compare performance. :) Thanks for your input.

@baoilleach
Author

baoilleach commented Mar 2, 2023 via email

@FrederikP
Collaborator

Okay, got it.

Hmm, I'd be curious whether any input-specific generated Python code could be much faster than the code that runs now, since it's already pretty optimized, as stated above. I definitely don't see big speedup potential there. The generated data structure in its finalized form is already set up so that there isn't a lot of overhead compared to the baseline of the algorithm.
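To make "input-specific generated Python code" concrete, here is a minimal sketch of the idea. It is not this project's implementation or API: `build_automaton`, `search_table`, `generate_source`, and `compile_matcher` are hypothetical names, and the automaton is a textbook Aho-Corasick construction assumed purely for illustration. The sketch builds the automaton as data, then emits the same automaton as Python source with every transition hardcoded.

```python
from collections import deque


def build_automaton(keywords):
    """Build goto edges, failure links, and outputs (textbook Aho-Corasick)."""
    goto = [{}]    # goto[state][char] -> next state
    output = [[]]  # output[state] -> keywords that end in this state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(word)

    # Breadth-first pass: set failure links and inherit outputs from them.
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def search_table(text, goto, fail, output):
    """Data-driven search: every transition is a dict lookup."""
    state, found = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in output[state]:
            found.append((i - len(word) + 1, word))
    return found


def generate_source(goto, fail, output):
    """Emit source for `search(text)` with all transitions hardcoded."""
    src = [
        "def search(text):",
        "    state = 0",
        "    found = []",
        "    for i, ch in enumerate(text):",
        "        while True:",
    ]
    for state, edges in enumerate(goto):
        branch = "if" if state == 0 else "elif"
        src.append(f"            {branch} state == {state}:")
        for ch, nxt in edges.items():
            src.append(f"                if ch == {ch!r}:")
            src.append(f"                    state = {nxt}")
            src.append("                    break")
        if state == 0:
            # The root absorbs characters that match no edge.
            src.append("                break")
        else:
            # Other states follow their failure link and retry.
            src.append(f"                state = {fail[state]}")
    branch = "if"
    for state, words in enumerate(output):
        if not words:
            continue
        src.append(f"        {branch} state == {state}:")
        for word in words:
            src.append(f"            found.append((i - {len(word) - 1}, {word!r}))")
        branch = "elif"
    src.append("    return found")
    return "\n".join(src)


def compile_matcher(keywords):
    """Generate the source, execute it, and return the resulting function."""
    namespace = {}
    exec(generate_source(*build_automaton(keywords)), namespace)
    return namespace["search"]


keywords = ["he", "she", "his", "hers"]
generated_search = compile_matcher(keywords)
# Both variants report (start index, keyword) pairs and should agree.
assert generated_search("ushers") == search_table("ushers", *build_automaton(keywords))
```

Note that the generated function trades one dict lookup per character for a chain of comparisons over states and characters, so whether it runs faster depends on the automaton's size and on the interpreter; the sketch only shows that the two forms are equivalent.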

Obviously generated C(++) code would be faster, but then why not use pyahocorasick?

Overall, I probably just don't grasp where the big speedup would come from when comparing the current implementation and data structure with a static, generated-code version of a specific result graph. The only place that might be a bit faster is the transition lookups, but then again that gain might be outweighed by the additional complexity of the code.
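The question about transition lookups can be checked empirically. The following is a rough micro-benchmark sketch, assuming a single hypothetical state with four outgoing edges; `TRANSITIONS`, `next_state_dict`, `next_state_hardcoded`, and `benchmark` are illustrative names, not part of the library. It makes no claim about which variant wins, since that depends on the interpreter and on how many edges a state has.

```python
import timeit

# Hypothetical outgoing edges of one automaton state; 0 stands for "no match".
TRANSITIONS = {"a": 1, "b": 2, "c": 3, "d": 4}


def next_state_dict(ch):
    """Data-driven transition: a single dict lookup."""
    return TRANSITIONS.get(ch, 0)


def next_state_hardcoded(ch):
    """Generated-code style transition: a chain of comparisons."""
    if ch == "a":
        return 1
    if ch == "b":
        return 2
    if ch == "c":
        return 3
    if ch == "d":
        return 4
    return 0


def benchmark(func, text, repeats=3):
    """Return the best of several timings for applying func to every character."""
    timings = timeit.repeat(lambda: [func(ch) for ch in text], number=20, repeat=repeats)
    return min(timings)


if __name__ == "__main__":
    sample = "abcdxyz" * 1000
    print("dict lookup:", benchmark(next_state_dict, sample))
    print("hardcoded:  ", benchmark(next_state_hardcoded, sample))
```

A dict lookup takes roughly constant time on average, while the comparison chain grows with the number of outgoing edges, so a state with many edges is likely to favor the dict. Running the same script under CPython and PyPy would show how much the JIT changes the picture.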

When it comes to PyPy: yes, the generated code might improve performance there. I could see that. But I'm not an expert on how its JIT component works, or whether it would actually be worth it. Overall, I'm also not a big fan of generating code: either it's error-prone (when done poorly) or it takes quite some effort to get right (using ASTs, etc.).

If someone came up with a working version of that, I would be very interested to see it. I don't have time to look into it myself at the moment, though.
