
new sx structure parser that tolerates unexpected items #1253

Draft · wants to merge 5 commits into main

Conversation

@freyso (Contributor) commented Dec 8, 2023

In part, this is an experiment on parsing.

Idea 1 is to parse while reading the file, instead of reading the entire file first.
Idea 2 is to use a more tabular style of programming, by creating a table of keywords with associated parsing routines.
Whenever a keyword is encountered, the corresponding parsing routine is called.
This is organized in a multi-level way: certain keywords simply add further keywords and their associated parsing routines, which are removed again when the next higher-level keyword appears. This allows larger sections with multiple subitems to be parsed. The hope is to produce more readable (and thus more maintainable) parsing routines.
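
For illustration, here is a minimal, self-contained sketch of the keyword-table idea (the sketch and all names in it, such as ToyKeywordParser, enter_structure and parse_cell, are hypothetical and not taken from the PR):

class ToyKeywordParser:
    """Sketch of a multi-level keyword-table parser (hypothetical names)."""

    def __init__(self):
        # Stack of keyword tables; entering a section pushes a table of
        # subsection keywords, and a match on an outer level pops it again.
        self.keylevels = [{"structure": self.enter_structure}]

    def parse(self, filename):
        # Idea 1: parse while reading, one line at a time.
        with open(filename) as filehandle:
            for line in filehandle:
                self.dispatch(line)

    def dispatch(self, line):
        # Search the tables from the innermost level outwards; a match on
        # an outer level first discards all inner (subsection) levels.
        for depth in reversed(range(len(self.keylevels))):
            for keyword, handler in self.keylevels[depth].items():
                if keyword in line:
                    del self.keylevels[depth + 1:]
                    handler(line)
                    return

    def enter_structure(self, line):
        # Entering "structure" installs the keywords valid inside it.
        self.keylevels.append({"cell": self.parse_cell})

    def parse_cell(self, line):
        print("cell:", line.strip())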

I have implemented this for the rarely used sphinx.structure.read_atoms routines, which we need in order to parse some old data I created outside of pyiron. The new version supports multiple structures in the same file and ignores unexpected content (such as labels) much more reliably than the old version.

@samwaseda (Member) left a comment:

ok after having added all the comments I finally understood what it does, but wouldn't it be easier to parse the entire file once and use whatever is needed? Alternatively, I would suggest doing it based solely on regex. I feel that it's very difficult to keep an eye on the consistency of line, lineview, lineno, etc.

(Resolved, outdated review threads on pyiron_atomistics/sphinx/parser_base.py and pyiron_atomistics/sphinx/structure.py.)
Comment on lines 63 to 66
self.line = filehandle.__iter__ ()
self.lineview = ''
self.filename = filename
self.lineno = 0
Member:

I'm very uneasy with defining this kind of stuff outside of __init__, basically because if this class is used without calling parse, people would only see an AttributeError.

Contributor:

I think the whole while loop could be replaced by

with open(file) as filehandle:
  for i, line in enumerate(filehandle):
    self.lineno = i
    ...

Contributor (author):

> I think the whole while loop could be replaced by
>
> with open(file) as filehandle:
>   for i, line in enumerate(filehandle):
>     self.lineno = i
>     ...

No, because sometimes I need multiple lines at the same time. More importantly, the reading of additional lines can happen in places other than the main parser loop.

Contributor (author):

> I'm very uneasy with defining this kind of stuff outside of __init__, basically because if this class is used without calling parse, people would only see an AttributeError.

Anyone who implements a parser with such an engine and never calls "parse" didn't get the idea of the engine. Initializing in the parse function would allow parsing multiple times with the same object (if that makes sense...).

Contributor:

Ah, I missed the read_until parts. Well, I'd want at least the with statement, because currently I don't see the file ever being closed at all.

Contributor (author):

filehandle was implicitly closed by Python when the local variable went out of scope on leaving the routine. I have made that more explicit now, and also removed the parse-time-only properties.

@freyso (Contributor, author) commented Dec 11, 2023

> ok after having added all the comments I finally understood what it does, but wouldn't it be easier to parse the entire file once and use whatever is needed?

This is what you guys currently do almost everywhere, but it becomes annoying and memory-wasteful when you need to parse very large files. It also makes certain parsing tasks more difficult than stream-reading does. Typical log files, for instance, have "sections" in their output with a rather well-defined beginning, but often no unique end marker. Instead, you will find a new output section, and it may depend on the input which output section appears next. The current parsing concept is adapted to this.

A sequential parser can parse GBs of data with a memory demand of a few KB (or whatever the maximum amount of input is that is needed to extract a piece of information).

> Alternatively, I would suggest doing it based solely on regex. I feel that it's very difficult to keep an eye on the consistency of line, lineview, lineno, etc.

The KeywordTreeParser is supposed to encapsulate the line reading; the actual parser derived from it only uses the lineview. If you can parse from a single line alone, you need not worry at all. If you need more lines, currently there is only the read_until auxiliary, but one could easily think of additional auxiliaries such as "append_line" or "next_line".
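
For illustration, such an auxiliary might look roughly like this as a method of the parser (my sketch, assuming the self.line, self.lineview and self.lineno attributes from the snippet quoted above; the actual PR code may differ):

    def read_until(self, marker):
        # Accumulate lines into lineview until the marker appears, so a
        # keyword routine can parse a multi-line item in one go.
        while marker not in self.lineview:
            self.lineview += next(self.line)  # raises StopIteration at EOF
            self.lineno += 1
        return self.lineview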

@pmrv (Contributor) commented Dec 11, 2023

I like the new architecture and agree that going line by line only once is important for both speed and memory consumption. But I do have some Python nits that I'll add later.

Comment on lines 36 to 49
class my_parser(keyword_tree_parser):
    def __init__(self, file):
        super().__init__({
            "key1": self.parse_key1,
            "key2": self.parse_key2})
        self.parse(file)

"""
def __init__ (self,keylevels=[]):
if isinstance(keylevels,dict):
keylevels = [ keylevels ]
elif not isinstance(keylevels,list):
raise TypeError
self.keylevels = keylevels
Contributor:

Expecting subclasses and base classes to have different signatures for their __init__ is imo problematic, because it violates the substitution principle. If subclasses should call parse in their __init__ anyway, can the base class __init__ not be something like

def __init__(self, file, keylevels=[]):
  if isinstance(keylevels,dict):
    keylevels = [ keylevels ]
  elif not isinstance(keylevels,list):
    raise TypeError
  self.keylevels = keylevels
  self.parse(file)

then subclasses can just do

super().__init__(file=file, keylevels=...)

Contributor (author):

I thought about your suggestion, but I am not convinced. If calling the base class initializer means doing the parsing, the base class initializer must be called at the end of the derived class's __init__. This is contrary to what most people do elsewhere.

Contributor:

Fair, but then I would also keep it out of the subclasses' __init__. An alternative would be to have the actual parsers not be subclasses of the keyword tree parser at all, but to implement a has-a relationship instead.

class StructureParser:
  def __init__(self, file):
    ... # setup
    self._parser = KeywordTreeParser(keylevels=...)
    self._parser.parse(file)

Comment on lines 71 to 76
self._cleanup(keymap)
res = func ()
if isinstance(res,GeneratorType):
    res.send (None)
    keymap['%finalize!'] = res
break
Contributor:

I like the use of generators, but I wonder if this cannot be streamlined a bit.

If I understand correctly, _cleanup is called here first because we cannot know whether the func from a previous iteration installed some new keylevels. Could we not avoid that by providing a method on the base parser that temporarily modifies the keylevels and also does the cleanup, as a context manager? The parser functions/generators func can then use this manager around their yield statement.

from contextlib import contextmanager
class keyword_tree_parser:
  ...
  @contextmanager
  def install_keylevels(self, new_keylevels):
    self.keylevels.append(new_keylevels)
    yield
    self.keylevels.pop(-1)

and subclasses would use it like this

class sub(keyword_tree_parser):
  ...
  def my_parse_method(self):
    ... # some setup
    with self.install_keylevels({'cell': self.parse_cell}):
      yield
    ... # some processing

If we also require all keylevel functions to be generators, the hot part of the loop can then just be

res = func()
res.send(None)

The back and forth between the parser generator functions and the base class is not quite clear to me, so maybe I'm missing some steps.

Comment on lines 44 to 47
self.positions = []
self.species = []
self.indices = []
self.ispecies=-1
Contributor:

Do these need to be attributes, or can they be just local variables? They are only used to instantiate Atoms on L52 and below, or not?

@jan-janssen (Member):

I am wondering if we can combine this with the concepts we developed for the interactive LAMMPS parser. Here is a current example for the LAMMPS parser inside the atomistics package. https://github.com/pyiron/atomistics/blob/main/atomistics/calculators/lammps/calculator.py#L111

Writing the LAMMPS parser is easier, as LAMMPS already provides Python bindings; still, the general principle is:

def parse_lammps(
    file,
    quantities=("energy", "forces", "stress", ...),
    **kwargs,
):
    interactive_getter_dict = {
        "forces": function_to_parse_force_from_line,
        "energy": function_to_parse_energy_from_line,
        "stress": function_to_parse_stress_from_line,
        ...
    }
    result_dict = {q: [] for q in quantities}
    for line in file:
        for q in quantities:
            interactive_getter_dict[q](previous_result=result_dict[q], line=line)
    return result_dict

This approach combines both the flexibility to parse only specific properties and the option to parse each line only once.

@pmrv (Contributor) commented Dec 11, 2023

One thing that makes it generally a bit hard to read is that the file status (current line, file object, etc.) is passed around implicitly via attributes. It would be more readable, imo, if that were passed around explicitly via arguments to the parser generator functions, but I can't judge how much rewriting that would be.
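
For illustration, explicit state passing could look roughly like this (my sketch; FileState and its fields are hypothetical, not PR code):

from dataclasses import dataclass
from typing import TextIO

@dataclass
class FileState:
    # Bundles the file status instead of spreading it over parser attributes.
    handle: TextIO
    lineview: str = ""
    lineno: int = 0

    def next_line(self):
        self.lineview = next(self.handle)
        self.lineno += 1
        return self.lineview

def parse_cell(state: FileState):
    # Each parser routine receives the file status explicitly as an argument.
    print(f"line {state.lineno}: {state.lineview.strip()}")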

@jan-janssen (Member):

@freyso Are you available to join the pyiron meeting today at 3pm? https://github.com/orgs/pyiron/discussions/207 Then we could start with this topic; that might be faster than the asynchronous discussion and separate developments at different locations.

@samwaseda (Member):

> This is what you guys currently do almost everywhere, but it becomes annoying and memory-wasteful when you need to parse very large files.

Annoying, I don't know, but do we really have such large files in sphinx?

> It also makes certain parsing tasks more difficult than stream-reading does. Typical log files, for instance, have "sections" in their output with a rather well-defined beginning, but often no unique end marker. Instead, you will find a new output section, and it may depend on the input which output section appears next. The current parsing concept is adapted to this.

But the end of a section is currently recognised by a marker, right? We can do the same with regex.

> A sequential parser can parse GBs of data with a memory demand of a few KB (or whatever the maximum amount of input is that is needed to extract a piece of information).

This part is quite horrible right now in pyiron anyway, because pyiron currently loads the whole input and output regardless. SPHInX allows @pmrv's lazy loading, but it's not supported elsewhere and is so buggy that I don't think we should rely on it.

> The KeywordTreeParser is supposed to encapsulate the line reading; the actual parser derived from it only uses the lineview. If you can parse from a single line alone, you need not worry at all. If you need more lines, currently there is only the read_until auxiliary, but one could easily think of additional auxiliaries such as "append_line" or "next_line".

I understand the idea, and if Python were a human, I would definitely think the way you describe it is how it should work. The reality, however, is that regex is so blazingly fast that anything we do with for loops and string checking becomes totally obsolete. As a matter of fact, reading line by line is what we used to do for SPHInX, and the parsing was slow beyond imagination at that time.

In addition to this, regex also makes @jan-janssen's idea easier to implement: the output parsers become more modular, because parsing takes place only if the user asks for it, whereas in your case the parsing must be done again if the user wants something additional.
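
For illustration, the regex-based, on-demand style could look roughly like this (my sketch; the eTot pattern is invented and not the real SPHInX log format):

import re

class LazyLogParser:
    def __init__(self, filename):
        # Read the file once; quantities are extracted only when asked for.
        with open(filename) as f:
            self._text = f.read()

    def get_energies(self):
        # A single pass over the text inside the regex engine, no Python loop.
        matches = re.findall(r"^eTot\s*=\s*(\S+)", self._text, re.MULTILINE)
        return [float(m) for m in matches]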

@jan-janssen marked this pull request as draft on February 14, 2024.