The library `fastparse-memoize` adds the `memoize` method to `fastparse`'s parsers.
When `fastparse` processes grammars that require backtracking, parsing rules may be applied repeatedly at the same place in the input.
In most cases, the parsing results will also be the same.
Memoizing means that a parser will not be tried again at the same place; instead, the previous parsing result is fetched from the cache and is immediately available.
- Import the `Memoize` symbol and use `Memoize.parse` instead of `fastparse.parse`. Use `Memoize.parseInputRaw` instead of `fastparse.parseInputRaw`.
- Import the `Memoize.MemoizeParser` symbol and use `.memoize` on selected parsing rules.
Consider a simple language consisting of the digit `1`, the `+` operation, and parentheses. Example strings in that language are `1+1` and `1+(1+1+(1))+(1+1)`.

This language is parsed by the following straightforwardly written grammar:
```scala
import fastparse._

def program1[$: P]: P[String] = P(expr1 ~ End).!
def expr1[$: P] = P(plus1 | other1)
def plus1[$: P]: P[_] = P(other1 ~ "+" ~ expr1)
def other1[$: P]: P[_] = P("1" | ("(" ~ expr1 ~ ")"))

val n = 30
val input = "(" * n + "1" + ")" * n // The string ((( ... (((1))) ... ))) is the input for the parser.

fastparse.parse(input, program1(_)) // Very slow.
```
This program works but is exponentially slow on certain valid expressions, such as `((((1))))` with many parentheses.
The reason for the slowness is that the parser tries `plus1` before `other1`. For the string `((((1))))`, the `plus1` rule will first try applying `other1` but will eventually fail because it will not find the symbol `+`. When `plus1` fails, the parser backtracks by one symbol and tries `other1` again. So, the amount of work for `other1` is doubled on every backtracking attempt.
This leads to exponentially many invocations of `other1`. The exponential slowness is already apparent at moderate nesting depths, such as `n = 30` in the example above.
JVM warmup does not lead to any speedup of the parsing, because the slowness is algorithmic.
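To see the slowdown in practice, one can time the unmemoized parser on inputs of increasing nesting depth. This is a rough illustration rather than a proper benchmark, and it assumes `program1` is defined as above:

```scala
import fastparse._

// Time fastparse.parse on nested inputs of increasing depth n.
// The elapsed time is expected to grow roughly exponentially with n.
for (n <- Seq(10, 15, 20, 25)) {
  val input = "(" * n + "1" + ")" * n
  val start = System.nanoTime()
  fastparse.parse(input, program1(_))
  val elapsedMs = (System.nanoTime() - start) / 1e6
  println(f"n = $n%2d: $elapsedMs%.1f ms")
}
```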
The repeated parsing with the rule `other1` can be avoided if the results of the parsing are memoized. The revised code attaches a `.memoize` call to that rule (renamed `other2` below) but leaves all other parsing rules unchanged:
```scala
import fastparse._
import io.chymyst.fastparse.Memoize
import io.chymyst.fastparse.Memoize.MemoizeParser

def program2[$: P]: P[String] = P(expr2 ~ End).!
def expr2[$: P] = P(plus2 | other2)
def plus2[$: P]: P[_] = P(other2 ~ "+" ~ expr2)
def other2[$: P]: P[_] = P("1" | ("(" ~ expr2 ~ ")")).memoize

val n = 30
val input = "(" * n + "1" + ")" * n // The string ((( ... (((1))) ... ))) is the input for the parser.

Memoize.parse(input, program2(_)) // Very fast.
```
A parsing rule such as `other1` is a function of type `P[_] => P[_]`.
When that parsing rule is tried, its argument of type P[_]
contains the current parsing context, including the current position in the input text.
The result is an updated parsing context, including the information about success or failure.
The entire updated parsing context is cached.
Whenever the same parsing rule is tried again at the same position in the input text, the cached result is returned immediately.
This works for most rules that do not have user-visible side effects.
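The caching idea can be illustrated with a toy model that is independent of `fastparse`'s actual internals (the names `ToyParser` and `memoizeToy` below are made up for illustration). A hypothetical parser maps an input position to an optional result, and memoization stores that result keyed by the position:

```scala
import scala.collection.mutable

// A toy "parser" maps a position in the input to an optional (value, next position).
type ToyParser[A] = Int => Option[(A, Int)]

// Wrap a toy parser so that the result at each position is computed only once.
def memoizeToy[A](p: ToyParser[A]): ToyParser[A] = {
  val cache = mutable.Map.empty[Int, Option[(A, Int)]]
  pos => cache.getOrElseUpdate(pos, p(pos))
}
```

In the real library, it is the entire updated parsing context that is cached, as described above; this sketch only shows the position-keyed lookup that avoids re-running a rule at the same place.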
- For some `fastparse` grammars, memoization of certain rules will lead to incorrect parsing results (while memoizing other rules is perfectly fine). To maintain correctness of parsing, memoization should not be applied indiscriminately to all rules. After adding some `memoize` calls, the resulting grammar should be tested against a comprehensive set of positive and negative parsing examples (see the sketch at the end of this section).
- It will not always be obvious which parsing rules should be selected for memoization. Performance may improve after memoizing some rules but not others. Memoization should be stress-tested on realistic parsing examples to verify that parsing performance actually improves.
- Currently, `fastparse` has no protection against stack overflow. Each memoization call introduces an extra function call between rule invocations, which makes stack overflow occur earlier when parsing deeply nested input with memoized rules. For this reason, a `fastparse` grammar with many memoized rules may produce a stack overflow error on much smaller inputs than the same grammar without memoization.
The conclusion is that memoization should be applied only to a carefully selected, small subset of parsing rules and tested comprehensively (both for parsing correctness and for performance). The memoized rules should be selected as the smallest set of rules such that the parsing performance improves after memoization while correctness is maintained.
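As a basic correctness check, the memoized grammar can be run against positive and negative examples and compared with the expected outcome. The sketch below uses the `program2` grammar from above and assumes that `Memoize.parse` returns the same `Parsed` result type as `fastparse.parse`:

```scala
import fastparse._
import io.chymyst.fastparse.Memoize

// Strings that the grammar should accept and reject, respectively.
val positiveExamples = Seq("1", "1+1", "1+(1+1+(1))+(1+1)", "((((1))))")
val negativeExamples = Seq("", "+", "1+", "(1", "2+2")

positiveExamples.foreach { s =>
  assert(Memoize.parse(s, program2(_)).isSuccess, s"should parse: $s")
}
negativeExamples.foreach { s =>
  assert(!Memoize.parse(s, program2(_)).isSuccess, s"should not parse: $s")
}
```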