new sx structure parser that tolerates unexpected items #1253
base: main
Conversation
ok, after having added all the comments I finally understood what it does, but wouldn't it be easier to parse the entire file once and use whatever is needed? Alternatively, I would suggest doing it based solely on regex. I feel that it's very difficult to keep an eye on the consistency of `line`, `lineview`, `lineno`, etc.
```python
self.line = filehandle.__iter__()
self.lineview = ''
self.filename = filename
self.lineno = 0
```
I'm very uneasy with defining this kind of stuff outside of `__init__`, basically because if this class is used without calling `parse`, people would only see `AttributeError`.
I think the whole `while` could be replaced by

```python
with open(file) as filehandle:
    for i, line in enumerate(filehandle):
        self.lineno = i
        ...
```
> I think the whole `while` could be replaced by `with open(file) as filehandle: for i, line in enumerate(filehandle): self.lineno = i ...`
No, because sometimes I need multiple lines at the same time. More importantly, the reading of additional lines can happen outside the main parser loop.
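For illustration, a minimal sketch of what such out-of-loop reading could look like, assuming the `self.line` iterator and the `lineno`/`lineview` attributes from the hunk above (the method name `read_until` is taken from the later discussion; the body is a guess, not the PR's actual code):

```python
def read_until(self, marker):
    """Advance the shared line iterator until `marker` is found.

    Collects the current line plus every consumed line, so a handler
    can grab a multi-line block outside of the main parser loop.
    """
    collected = [self.lineview]
    for line in self.line:          # self.line is the file's line iterator
        self.lineno += 1
        collected.append(line)
        if marker in line:
            break
    self.lineview = collected[-1]
    return collected
```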
> I'm very uneasy with defining this kind of stuff outside of `__init__`, basically because if this class is used without calling `parse`, people would only see `AttributeError`.
Anyone who implements a parser with such an engine and never calls `parse` didn't get the idea of the engine. Initializing in the `parse` function would allow parsing multiple times with the same object (if that makes sense...).
Ah, I missed the `read_until` parts. Well, I'd want at least the `with` statement, because currently I don't see the file ever being closed at all.
The filehandle was implicitly closed by Python when the local variable went out of scope on leaving the routine. I have made that more explicit now, also removing the parse-time-only properties.
This is what you guys currently do almost everywhere, but it becomes annoying and memory-wasteful when you need to parse very large files. It also makes certain parsing tasks more difficult than stream reading. Typical log files, for instance, have "sections" in their output with a rather well-defined beginning, but often no unique end marker. Instead, you will find a new output section, and it may depend on the input which other output section appears next. The current parsing concept is adapted to this; see the sketch below. A sequential parser can parse GBs of data with a memory demand of a few KB (or whatever the maximum chunk size is that is needed to extract a piece of information).
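As a hedged illustration of this style (the `===` section marker and all names are made up, not the PR's code): a single sequential pass where a new section start implicitly terminates the previous one, so no end marker is needed and memory stays bounded by the largest section.

```python
import re

SECTION_START = re.compile(r'^=== (\w+)')   # assumed section marker

def iter_sections(path):
    """Yield (section_name, lines) pairs from one sequential pass."""
    name, buffer = None, []
    with open(path) as fh:
        for line in fh:
            match = SECTION_START.match(line)
            if match:                # a new section ends the previous one
                if name is not None:
                    yield name, buffer
                name, buffer = match.group(1), []
            elif name is not None:
                buffer.append(line)
    if name is not None:             # flush the last section at EOF
        yield name, buffer
```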
The `KeywordTreeParser` is supposed to encapsulate the line reading; the actual parser derived from it only uses the `lineview`. If you can parse from a single line alone, you need not worry at all. If you need more lines, currently there is only the `read_until` auxiliary, but one could easily think of additional auxiliaries such as `append_line` or `next_line`.
I like the new architecture and agree that going line by line only once is important for both speed and memory consumption. But I do have some Python nits that I'll add later.
```python
class my_parser(keyword_tree_parser):
    def __init__(self, file):
        super().__init__({
            "key1": self.parse_key1,
            "key2": self.parse_key2})
        self.parse(file)

"""
def __init__(self, keylevels=[]):
    if isinstance(keylevels, dict):
        keylevels = [keylevels]
    elif not isinstance(keylevels, list):
        raise TypeError
    self.keylevels = keylevels
```
Expecting subclasses and base classes to have different signatures for their `__init__` is imo problematic because it violates the substitution principle. If subclasses should call `parse` in their init anyway, can the base class init not be something like

```python
def __init__(self, file, keylevels=[]):
    if isinstance(keylevels, dict):
        keylevels = [keylevels]
    elif not isinstance(keylevels, list):
        raise TypeError
    self.keylevels = keylevels
    self.parse(file)
```

then subclasses can just do `super().__init__(file=file, keylevels=...)`.
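A subclass under the suggested signature might then look like this (sketch adapted from the docstring example above; the `key1`/`key2` handlers are placeholders):

```python
class my_parser(keyword_tree_parser):
    def __init__(self, file):
        super().__init__(file=file,
                         keylevels={"key1": self.parse_key1,
                                    "key2": self.parse_key2})
```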
I thought about your suggestion, but I am not convinced. If calling the base class initializer means doing the parsing, the base class initializer must be called at the end of the derived class's `__init__`. This is contrary to what most people do elsewhere.
Fair, but then I would also keep it out of the subclasses' `__init__`. An alternative would be to have the actual parsers not be subclasses of the keyword tree parser at all, but to implement a has-a relationship instead.

```python
class StructureParser:
    def __init__(self, file):
        ...  # setup
        self._parser = KeywordTreeParser(keylevels=...)
        self._parser.parse(file)
```
```python
self._cleanup(keymap)
res = func()
if isinstance(res, GeneratorType):
    res.send(None)                    # prime the generator up to its yield
    keymap['%finalize!'] = res
break
```
I like the use of generators, but I wonder if this cannot be streamlined a bit.

If I understand correctly, `_cleanup` is called here first because we cannot know whether the `func` from a previous iteration installed some new `keylevels`. Can we not avoid that by providing a method on the base parser that allows temporary modification of `keylevels` but also does the cleanup, as a context manager? The parser functions/generators `func` can then use this manager around their `yield` statement.

```python
from contextlib import contextmanager

class keyword_tree_parser:
    ...
    @contextmanager
    def install_keylevels(self, new_keylevels):
        self.keylevels.append(new_keylevels)
        yield
        self.keylevels.pop(-1)
```

and subclasses would use this

```python
class sub(keyword_tree_parser):
    ...
    def my_parse_method(self):
        ...  # some setup
        with self.install_keylevels({'cell': self.parse_cell}):
            yield
        ...  # some processing
```

If we also require all keylevel functions to be generators, the hot part of the loop can then just be

```python
res = func()
res.send(None)
```

The back and forth between the parser generator functions and the base class is not quite clear to me, so maybe I'm missing some steps.
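A self-contained toy version of the proposed flow (all names hypothetical, just to check that the generator/context-manager handshake works as described):

```python
from contextlib import contextmanager

class toy_parser:
    def __init__(self):
        self.keylevels = [{}]

    @contextmanager
    def install_keylevels(self, new_keylevels):
        self.keylevels.append(new_keylevels)
        try:
            yield
        finally:
            self.keylevels.pop()

    def parse_cell(self):
        pass

    def my_parse_method(self):
        with self.install_keylevels({'cell': self.parse_cell}):
            yield                    # main loop resumes us when the block ends

p = toy_parser()
gen = p.my_parse_method()
gen.send(None)                       # run up to the yield: sub-table installed
assert 'cell' in p.keylevels[-1]
next(gen, None)                      # finish the generator: sub-table popped
assert p.keylevels == [{}]
```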
```python
self.positions = []
self.species = []
self.indices = []
self.ispecies = -1
```
Do these need to be attributes or can they be just local variables? They are only used to instantiate `Atoms` on L52 and below, or not?
…disappear within parse() routine
I am wondering if we can combine this with the concepts we developed for the interactive LAMMPS parser. Writing the LAMMPS parser is easier, as LAMMPS already provides Python bindings; still, the general principle is the same: this approach combines both the flexibility to parse only for specific properties and the option to parse each line only once.
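A hedged sketch of what that combination could look like (the class, the `energy` log line, and all names are made up for illustration): the file is read in a single sequential pass, but only on first access to a property, so nothing is parsed unless the user asks for it.

```python
from functools import cached_property

class LazyLogParser:
    def __init__(self, path):
        self._path = path

    @cached_property
    def _parsed(self):
        """One sequential pass over the file, run on first property access."""
        energies = []
        with open(self._path) as fh:
            for line in fh:
                if line.startswith("energy"):     # assumed log line format
                    energies.append(float(line.split()[-1]))
        return {"energies": energies}

    @property
    def energies(self):
        return self._parsed["energies"]
```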
One thing that makes it generally a bit hard to read is that the file status (current line, file object, etc.) is passed around implicitly via attributes. It would be more readable, imo, if that was passed around explicitly via arguments to the parser generator functions, but I can't judge how much rewriting that would be.
@freyso Are you available to join the pyiron meeting today at 3pm? https://github.com/orgs/pyiron/discussions/207 Then we could start with this topic; that might be faster than the asynchronous discussion and separate developments at different locations.
Annoying, I don't know, but do we really have such large files in SPHInX?

But the end of a section is currently recognised by a marker, right? We can do the same with regex.

This part is anyway quite horrible right now in pyiron, because pyiron currently loads the whole input and output anyway. SPHInX allows @pmrv's lazy loading, but it's not supported elsewhere and is quite buggy, so I don't think we should rely on it.

I understand the idea, and if Python were a human, I would definitely think the way you describe it is how it should work. The reality, however, is that regex is so blazingly fast that anything we do with a for loop and string checking becomes totally obsolete. As a matter of fact, reading line by line is what we used to do for SPHInX, but the parsing was slow beyond imagination at that time. In addition, regex makes @jan-janssen's idea easier to implement, because the output parsers become more modular: parsing takes place only if the user asks for it, whereas in your case the parsing must take place again if there's something additional the user wants.
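For comparison, a sketch of the regex approach being advocated (the pattern and the `atom { coords = [...] }` syntax are illustrative, not a validated SPHInX grammar): slurp the file once and let `re` do the scanning.

```python
import re

ATOM = re.compile(r'atom\s*{[^}]*coords\s*=\s*\[([^\]]*)\]', re.S)

def read_coords(path):
    with open(path) as fh:
        text = fh.read()                 # whole file in memory
    return [[float(x) for x in m.group(1).split(',')]
            for m in ATOM.finditer(text)]
```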
In part, this is an experiment on parsing.

Idea 1 is to parse while reading the file, instead of reading the entire file first.

Idea 2 is to use a more tabular way of programming, by creating a table of keywords with associated parsing routines. Whenever a keyword is encountered, the corresponding parsing routine is called. This is organized in a multi-level way, so certain keywords will simply add additional keywords and their associated parsing routines, which are removed again from the table when the next higher-level keyword appears. This allows parsing larger sections that have multiple subitems; see the sketch below. The hope is to produce more readable (and thus maintainable) parsing routines.

I have implemented this for the rarely used `sphinx.structure.read_atoms` routines, which we need to parse some old data I created outside of pyiron. The new version supports multiple structures in the same file and ignores unexpected stuff (such as labels) much more reliably than the old version.
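To make Idea 2 concrete, here is a toy sketch of such a multi-level keyword table (illustrative names and file snippet, not the PR's actual implementation): a handler may return a sub-table that stays installed until a key from a higher level matches again.

```python
import io

def parse(fh, toplevel):
    levels = [toplevel]                        # stack of {keyword: handler}
    for line in fh:
        for depth in range(len(levels) - 1, -1, -1):
            key = next((k for k in levels[depth] if k in line), None)
            if key is None:
                continue
            del levels[depth + 1:]             # higher-level match drops sub-tables
            sub = levels[depth][key](line)     # handler may return a sub-table
            if sub:
                levels.append(sub)
            break

def on_structure(line):
    print("structure:", line.strip())
    return {"atom": lambda l: print("  atom:", l.strip())}

parse(io.StringIO("structure {\natom {coords=[0,0,0];}\nstructure {\n"),
      {"structure": on_structure})
```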