
new sx structure parser that tolerates unexpected items #1253

Draft · wants to merge 5 commits into main

Conversation

@freyso (Contributor) commented Dec 8, 2023

In part, this is an experiment on parsing.

Idea 1 is to parse while reading the file, instead of reading the entire file first.
Idea 2 is to use a more tabular style of programming, by creating a table of keywords with associated parsing routines.
Whenever a keyword is encountered, the corresponding parsing routine is called.
This is organized in a multi-level way: certain keywords simply add further keywords and their associated parsing routines, which are removed again when the next higher-level keyword appears. This allows larger sections with multiple subitems to be parsed. The hope is to produce more readable (and thus more maintainable) parsing routines.
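
For illustration, here is a minimal, self-contained sketch of the keyword-table idea (the sketch and all names in it, such as ToyKeywordParser, enter_structure and parse_cell, are hypothetical and not taken from the PR):

class ToyKeywordParser:
    """Sketch of a multi-level keyword-table parser (hypothetical names)."""

    def __init__(self):
        # Stack of keyword tables; entering a section pushes a table of
        # subsection keywords, and a match on an outer level pops it again.
        self.keylevels = [{"structure": self.enter_structure}]

    def parse(self, filename):
        # Idea 1: parse while reading, one line at a time.
        with open(filename) as filehandle:
            for line in filehandle:
                self.dispatch(line)

    def dispatch(self, line):
        # Search the tables from the innermost level outwards; a match on
        # an outer level first discards all inner (subsection) levels.
        for depth in reversed(range(len(self.keylevels))):
            for keyword, handler in self.keylevels[depth].items():
                if keyword in line:
                    del self.keylevels[depth + 1:]
                    handler(line)
                    return

    def enter_structure(self, line):
        # Entering "structure" installs the keywords valid inside it.
        self.keylevels.append({"cell": self.parse_cell})

    def parse_cell(self, line):
        print("cell:", line.strip())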

I have implemented this for the rarely used sphinx.structure.read_atoms routines, which we need in order to parse some old data I created outside of pyiron. The new version supports multiple structures in the same file and ignores unexpected content (such as labels) much more reliably than the old version.

@samwaseda (Member) left a comment:

ok after having added all the comments I finally understood what it does, but wouldn't it be easier to parse the entire file once and use whatever is needed? Alternatively, I would suggest doing it based solely on regex. I feel that it's very difficult to keep an eye on the consistency of line, lineview, lineno, etc.

(Resolved, outdated review threads on pyiron_atomistics/sphinx/parser_base.py and pyiron_atomistics/sphinx/structure.py.)
Comment on lines 63 to 66
self.line = filehandle.__iter__ ()
self.lineview = ''
self.filename = filename
self.lineno = 0
Member:

I'm very uneasy with defining this kind of stuff outside of __init__, basically because if this class is used without calling parse, people would only see an AttributeError.

Contributor:

I think the whole while loop could be replaced by

with open(file) as filehandle:
  for i, line in enumerate(filehandle):
    self.lineno = i
    ...

Contributor (author):

> I think the whole while loop could be replaced by
>
> with open(file) as filehandle:
>   for i, line in enumerate(filehandle):
>     self.lineno = i
>     ...

No, because sometimes I need multiple lines at the same time. More importantly, the reading of additional lines can happen in places other than the main parser loop.

Contributor (author):

> I'm very uneasy with defining this kind of stuff outside of __init__, basically because if this class is used without calling parse, people would only see an AttributeError.

Anyone who implements a parser with such an engine and never calls "parse" didn't get the idea of the engine. Initializing in the parse function would allow parsing multiple times with the same object (if that makes sense...).

Contributor:

Ah, I missed the read_until parts. Well, I'd want at least the with statement, because currently I don't see the file ever being closed at all.

Contributor (author):

filehandle was implicitly closed by Python when the local variable went out of scope on leaving the routine. I have made that more explicit now, and also removed the parse-time-only properties.

@freyso (Contributor, author) commented Dec 11, 2023

> ok after having added all the comments I finally understood what it does, but wouldn't it be easier to parse the entire file once and use whatever is needed?

This is what you guys currently do almost everywhere, but it becomes annoying and memory-wasteful when you need to parse very large files. It also makes certain parsing tasks more difficult than stream-reading does. Typical log files, for instance, have "sections" in their output with a rather well-defined beginning, but often no unique end marker. Instead, you will find a new output section, and it may depend on the input which output section appears next. The current parsing concept is adapted to this.

A sequential parser can parse GBs of data with a memory demand of a few KB (or whatever the maximum amount of input is that is needed to extract a piece of information).

> Alternatively, I would suggest doing it based solely on regex. I feel that it's very difficult to keep an eye on the consistency of line, lineview, lineno, etc.

The KeywordTreeParser is supposed to encapsulate the line reading; the actual parser derived from it only uses the lineview. If you can parse from a single line alone, you need not worry at all. If you need more lines, currently there is only the read_until auxiliary, but one could easily think of additional auxiliaries such as "append_line" or "next_line".
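
For illustration, such an auxiliary might look roughly like this as a method of the parser (my sketch, assuming the self.line, self.lineview and self.lineno attributes from the snippet quoted above; the actual PR code may differ):

    def read_until(self, marker):
        # Accumulate lines into lineview until the marker appears, so a
        # keyword routine can parse a multi-line item in one go.
        while marker not in self.lineview:
            self.lineview += next(self.line)  # raises StopIteration at EOF
            self.lineno += 1
        return self.lineview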

@pmrv (Contributor) commented Dec 11, 2023

I like the new architecture and agree that going line by line only once is important for both speed and memory consumption. But I do have some Python nits that I'll add later.

Comment on lines 36 to 49
class my_parser(keyword_tree_parser):
    def __init__(self, file):
        super().__init__({
            "key1": self.parse_key1,
            "key2": self.parse_key2})
        self.parse(file)

"""
def __init__ (self,keylevels=[]):
if isinstance(keylevels,dict):
keylevels = [ keylevels ]
elif not isinstance(keylevels,list):
raise TypeError
self.keylevels = keylevels
Contributor:

Expecting subclasses and base classes to have different signatures for their __init__ is imo problematic, because it violates the substitution principle. If subclasses should call parse in their __init__ anyway, can the base class __init__ not be something like

def __init__(self, file, keylevels=[]):
  if isinstance(keylevels,dict):
    keylevels = [ keylevels ]
  elif not isinstance(keylevels,list):
    raise TypeError
  self.keylevels = keylevels
  self.parse(file)

then subclasses can just do

super().__init__(file=file, keylevels=...)

Contributor (author):

I thought about your suggestion, but I am not convinced. If calling the base class initializer means doing the parsing, the base class initializer must be called at the end of the derived class's __init__. This is contrary to what most people do elsewhere.

Contributor:

Fair, but then I would also keep it out of the subclasses' __init__. An alternative would be to have the actual parsers not be subclasses of the keyword tree parser at all, but to implement a has-a relationship instead.

class StructureParser:
  def __init__(self, file):
    ... # setup
    self._parser = KeywordTreeParser(keylevels=...)
    self._parser.parse(file)

Comment on lines 71 to 76
self._cleanup(keymap)
res = func ()
if isinstance(res,GeneratorType):
    res.send (None)
    keymap['%finalize!'] = res
break
Contributor:

I like the use of generators, but I wonder if this cannot be streamlined a bit.

If I understand correctly, _cleanup is called here first because we cannot know whether the func from a previous iteration installed some new keylevels. Could we not avoid that by providing a method on the base parser that temporarily modifies the keylevels and also does the cleanup, as a context manager? The parser functions/generators func can then use this manager around their yield statement.

from contextlib import contextmanager
class keyword_tree_parser:
  ...
  @contextmanager
  def install_keylevels(self, new_keylevels):
    self.keylevels.append(new_keylevels)
    yield
    self.keylevels.pop(-1)

and subclasses would use it like this

class sub(keyword_tree_parser):
  ...
  def my_parse_method(self):
    ... # some setup
    with self.install_keylevels({'cell': self.parse_cell}):
      yield
    ... # some processing

If we also require all keylevel functions to be generators, the hot part of the loop can then just be

res = func()
res.send(None)

The back and forth between the parser generator functions and the base class is not quite clear to me, so maybe I'm missing some steps.

Comment on lines 44 to 47
self.positions = []
self.species = []
self.indices = []
self.ispecies=-1
Contributor:

Do these need to be attributes, or can they be just local variables? They are only used to instantiate Atoms on L52 and below, or not?

@jan-janssen (Member):

I am wondering if we can combine this with the concepts we developed for the interactive LAMMPS parser. Here is a current example for the LAMMPS parser inside the atomistics package. https://github.com/pyiron/atomistics/blob/main/atomistics/calculators/lammps/calculator.py#L111

Writing the LAMMPS parser is easier, as LAMMPS already provides Python bindings; still, the general principle is:

def parse_lammps(
    file,
    quantities=("energy", "forces", "stress", ...),
    **kwargs,
):
    interactive_getter_dict = {
        "forces": function_to_parse_force_from_line,
        "energy": function_to_parse_energy_from_line,
        "stress": function_to_parse_stress_from_line,
        ...
    }
    result_dict = {q: [] for q in quantities}
    for line in file:
        for q in quantities:
            interactive_getter_dict[q](previous_result=result_dict[q], line=line)
    return result_dict

This approach combines both the flexibility to parse only specific properties and the option to parse each line only once.

@pmrv (Contributor) commented Dec 11, 2023

One thing that makes it generally a bit hard to read is that the file status (current line, file object, etc.) is passed around implicitly via attributes. It would be more readable, imo, if that were passed around explicitly via arguments to the parser generator functions, but I can't judge how much rewriting that would be.
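
For illustration, explicit state passing could look roughly like this (my sketch; FileState and its fields are hypothetical, not PR code):

from dataclasses import dataclass
from typing import TextIO

@dataclass
class FileState:
    # Bundles the file status instead of spreading it over parser attributes.
    handle: TextIO
    lineview: str = ""
    lineno: int = 0

    def next_line(self):
        self.lineview = next(self.handle)
        self.lineno += 1
        return self.lineview

def parse_cell(state: FileState):
    # Each parser routine receives the file status explicitly as an argument.
    print(f"line {state.lineno}: {state.lineview.strip()}")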

@jan-janssen (Member):

@freyso Are you available to join the pyiron meeting today at 3pm? https://github.com/orgs/pyiron/discussions/207 Then we could start with this topic; that might be faster than the asynchronous discussion and separate developments at different locations.

@samwaseda (Member):

> This is what you guys currently do almost everywhere, but it becomes annoying and memory-wasteful when you need to parse very large files.

Annoying, I don't know, but do we really have such large files in sphinx?

> It also makes certain parsing tasks more difficult than stream-reading does. Typical log files, for instance, have "sections" in their output with a rather well-defined beginning, but often no unique end marker. Instead, you will find a new output section, and it may depend on the input which output section appears next. The current parsing concept is adapted to this.

But the end of a section is currently recognised by a marker, right? We can do the same with regex.

> A sequential parser can parse GBs of data with a memory demand of a few KB (or whatever the maximum amount of input is that is needed to extract a piece of information).

This part is quite horrible right now in pyiron anyway, because pyiron currently loads the whole input and output regardless. SPHInX allows @pmrv's lazy loading, but it's not supported elsewhere and is so buggy that I don't think we should rely on it.

> The KeywordTreeParser is supposed to encapsulate the line reading; the actual parser derived from it only uses the lineview. If you can parse from a single line alone, you need not worry at all. If you need more lines, currently there is only the read_until auxiliary, but one could easily think of additional auxiliaries such as "append_line" or "next_line".

I understand the idea, and if Python were a human, I would definitely think the way you describe it is how it should work. The reality, however, is that regex is so blazingly fast that anything we do with for loops and string checking becomes totally obsolete. As a matter of fact, reading line by line is what we used to do for SPHInX, and the parsing was slow beyond imagination at that time.

In addition to this, regex also makes @jan-janssen's idea easier to implement: the output parsers become more modular, because parsing takes place only if the user asks for it, whereas in your case the parsing must be done again if the user wants something additional.
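
For illustration, the regex-based, on-demand style could look roughly like this (my sketch; the eTot pattern is invented and not the real SPHInX log format):

import re

class LazyLogParser:
    def __init__(self, filename):
        # Read the file once; quantities are extracted only when asked for.
        with open(filename) as f:
            self._text = f.read()

    def get_energies(self):
        # A single pass over the text inside the regex engine, no Python loop.
        matches = re.findall(r"^eTot\s*=\s*(\S+)", self._text, re.MULTILINE)
        return [float(m) for m in matches]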

@jan-janssen marked this pull request as draft on February 14, 2024.