Parser rewriting #4313
Thank you @Nikita-str for putting the time and effort into writing this up. I support moving to trampoline-style parsing with a loop and state machine. Recursive descent parsing doesn’t suit Rust that well (no guaranteed tail call optimisation) and causes multiple issues. I understand that statically it’s not as easy to find sub parse nodes, but I hope we can improve debugging by having a debug_log which prints which parser is currently running when a flag is passed (we don’t have this today). Do you know if this can be worked on incrementally? Or do you plan to just keep updating this PR? Maybe maintainers can help if they have access; we could have a checklist at the top and keep rebasing as we go along (the parser doesn’t get touched as much these days, so I don’t think it will conflict that often). I’m interested to hear from the other maintainers.
I think doing it incrementally is not an option, because if any node A calls sub-parse node B outside of
That would be good, actually. We can do it in parallel. It's a very easily separable task, and with a few extra hands this can be done in 1-2 days.
I feel like this may be over-engineering something that should have a very simple solution. Looking at the original issue, the parser works fine in release mode, so there must be some set of optimizations that the compiler applies to reduce the amount of recursion. My suggestion would be to look at the assembly of a release build, then carefully try to force the compiler to apply the same optimizations in debug mode.
@jedel1043 Even to solve it temporarily, you really need to split the functions (after reducing the size of some structures, which seems good to me; reducing structure sizes alone is not enough). For more details see this PR: the first part of the commits seems OK to me, and the second part is ugly (because you need to split the function to solve the problem with But maybe in the current PR you need to do something like the splitting too (you need to continue execution from
If the problem is the size of AST nodes, I think we should just stop having AST nodes on the stack, and use something like
I don’t think having heap allocated nodes would help with performance of the parser as we need to visit them a lot so it would create a new problem with back and forth trips to the heap (this would need to be benchmarked). The idea of specialized allocation for nodes on the AST also seems odd if we drop the whole tree anyway once we’ve moved onto the VM stage, unless I’m misunderstanding something.
I’m guessing the compiler is aggressively inlining in release mode so we don’t end up with deep stacks. So if we wanted to recreate this in debug mode we would need to add I’d be open to seeing how a loop & state machine fares in performance compared to what we have today.
It's not for performance though. The problem that we have right now is that our AST nodes use too much space on the stack, which leaves little room for function calls. Moving every AST node to the heap allows more recursive calls to be made. We would obviously still need to rewrite the parser to be smarter about recursive operators (
Oh yeah, I know the original issue isn’t about performance. I was saying that moving nodes to the heap could end up harming parser performance despite fixing this specific issue. The trade-off wouldn’t be worth it. Instead, as you say, we would be best off finding ways to make the parser smarter around recursive operations, or having a parser that isn’t subject to these problems. The answer may be a bit of both.
It could also improve performance! My thought is that right now any inner expression needs to be allocated on the heap, which distributes expressions throughout the heap, causing a lot of cache misses when you try to access related expressions. If we use bumpalo, related expressions will live near each other, making multiple accesses very cheap if the expressions fit in the same cache line.
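The stack-size argument made earlier in this thread can be checked with a small experiment. The sketch below is only an illustration: `BigPayload`, `InlineNode` and `BoxedNode` are hypothetical stand-ins, not Boa's AST types. It compares the size of a node that stores a large payload inline with one that stores it behind a `Box`.

```rust
use std::mem::size_of;

// Hypothetical payload standing in for a large AST node body (256 bytes).
#[allow(dead_code)]
struct BigPayload([u64; 32]);

// Node that keeps the payload inline: every local of this type, in every
// recursive call frame, is at least as large as the payload itself.
#[allow(dead_code)]
enum InlineNode {
    Leaf(u8),
    Big(BigPayload),
}

// Node that keeps the payload on the heap: a local of this type only
// holds a pointer (plus whatever the enum needs for its discriminant).
#[allow(dead_code)]
enum BoxedNode {
    Leaf(u8),
    Big(Box<BigPayload>),
}

fn inline_node_size() -> usize {
    size_of::<InlineNode>()
}

fn boxed_node_size() -> usize {
    size_of::<BoxedNode>()
}
```

Printing `inline_node_size()` and `boxed_node_size()` shows how much smaller the boxed variant is on a given target. Whether the extra heap traffic is worth it is the open question in the comments above and would need benchmarking.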
This Pull Request potentially (if it gets completed) closes #4301, #4089 and #1402, and lets us forget about stack overflows in the parser.
The main idea
The main idea is as follows:
`ControlFlow` with the following variants: `Done` or `SubParse{parse_node, continuation_point}`
ControlFlow
sControlFlow
So parsing will look like:
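As a rough illustration of the loop-driven approach (this is not the PR's actual code), the sketch below drives parse nodes from an explicit stack instead of the call stack. The names `ControlFlow`, `Done`, `SubParse`, `parse_node`, `continuation_point`, `parse_loop` and `ParseLoop` come from the description; everything else (the `ParseNode` trait shape, `Parens`, `run_parse_loop`, using a nesting depth as the result) is an assumption made so the example is self-contained.

```rust
/// What a parse node reports back to the loop after one step.
enum ControlFlow {
    /// The node finished; here the "AST" is just a nesting depth.
    Done(usize),
    /// The node needs a child parsed first and wants to be resumed
    /// at `continuation_point` once the child is done.
    SubParse {
        parse_node: Box<dyn ParseNode>,
        continuation_point: usize,
    },
}

/// Hypothetical trait shape; the real trait in the PR may differ.
trait ParseNode {
    /// `continue_point` is 0 on the first call. `sub_result` carries the
    /// child's result when the node is resumed after a `SubParse`.
    fn parse_loop(
        &mut self,
        input: &[u8],
        pos: &mut usize,
        continue_point: usize,
        sub_result: Option<usize>,
    ) -> Result<ControlFlow, String>;
}

/// Toy grammar: Parens := '(' Parens ')' | empty
struct Parens;

impl ParseNode for Parens {
    fn parse_loop(
        &mut self,
        input: &[u8],
        pos: &mut usize,
        continue_point: usize,
        sub_result: Option<usize>,
    ) -> Result<ControlFlow, String> {
        match continue_point {
            0 => {
                if input.get(*pos) == Some(&b'(') {
                    *pos += 1;
                    // Instead of recursing, hand the child to the loop.
                    Ok(ControlFlow::SubParse {
                        parse_node: Box::new(Parens),
                        continuation_point: 1,
                    })
                } else {
                    Ok(ControlFlow::Done(0))
                }
            }
            1 => {
                // Resumed after the child: expect the closing parenthesis.
                if input.get(*pos) != Some(&b')') {
                    return Err(format!("expected ')' at byte {}", *pos));
                }
                *pos += 1;
                Ok(ControlFlow::Done(sub_result.unwrap_or(0) + 1))
            }
            other => Err(format!("invalid continue point {other}")),
        }
    }
}

/// The loop: pending nodes live in a heap-allocated Vec, so nesting depth
/// is bounded by available memory rather than by the call stack.
fn run_parse_loop(root: Box<dyn ParseNode>, input: &[u8]) -> Result<usize, String> {
    let mut stack: Vec<(Box<dyn ParseNode>, usize)> = vec![(root, 0)];
    let mut pos = 0;
    let mut last_result: Option<usize> = None;
    while let Some((mut node, point)) = stack.pop() {
        match node.parse_loop(input, &mut pos, point, last_result.take())? {
            ControlFlow::Done(value) => last_result = Some(value),
            ControlFlow::SubParse { parse_node, continuation_point } => {
                stack.push((node, continuation_point)); // resume the parent later
                stack.push((parse_node, 0)); // run the child next
            }
        }
    }
    if pos != input.len() {
        return Err(format!("unexpected trailing input at byte {pos}"));
    }
    last_result.ok_or_else(|| "no parse node produced a result".to_string())
}
```

Calling `run_parse_loop(Box::new(Parens), b"(())")` yields the nesting depth of the input. Because no call in this sketch recurses, an input nested a hundred thousand levels deep is handled by growing the `Vec` rather than the call stack.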
Some more details & explanation
Simple rewrite case (actually, this case can be simplified by adding a `Pass` command to `ControlFlow`).
Next one:
boa/core/parser/src/parser/statement/declaration/mod.rs
Lines 64 to 90 in 27bdda5
Transforms to this:
boa/core/parser/src/parser/statement/declaration/mod.rs
Lines 94 to 127 in 27bdda5
On the first call of a parse node, `continue_point` is equal to `0`. So we skip the first `if` and do all the actions as before, except that instead of going deeper we return from the parse call into `ParseLoop` (by calling `parse_cmd![[SUB PARSE]: node]`, which changes `state` and returns a valid `ControlFlow`), and `ParseLoop` then calls the sub-parsing node.
I'm not sure whether I should name the `continue_point`s in the function or leave them as magic numbers (0, 1, 2, and so on; mostly only 0 and 1 are used, but each time they have a different meaning depending on which sub-calls were made).
Some small pieces of code are repeated often, but they cannot be moved into a function because they involve different enum variants, so I moved them into the `macro_rules!` macro `parse_cmd!`. I'm not sure whether such syntax is acceptable, so that's worth discussing too.
Right now the rewritten functions look like:
Maybe it's better to write it as follows to make the code easier to understand; I'm not sure:
`parse` is called (it will always be called from `ParseLoop`); to find it, you have to look at where the node is created.
Why is it a draft PR?
`ParseLoop` then it should be parsed in the old way. So far only 3 parse nodes (out of ~100) have been rewritten, to check whether there will be any problems with this approach (it seems there will not be any).
When all nodes have been rewritten with `parse_loop`, the `parse` function & trait impl will be removed.
I'm expecting some discussion around this approach, and I would prefer to wait for approval before proceeding with rewriting the rest of the nodes, since this looks like a task for several days.
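On the question of naming `continue_point`s versus keeping them as magic numbers, one possible option is a small per-node enum that converts to and from the integer the loop stores. The sketch below is purely hypothetical (`BinaryPoint`, `Step` and `next_step` are not part of the PR) and models a node that runs two sub-parses in sequence.

```rust
/// Named continuation points for a hypothetical binary-expression node.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BinaryPoint {
    Start,      // 0: nothing parsed yet
    AfterLeft,  // 1: left operand has been parsed
    AfterRight, // 2: right operand has been parsed
}

impl BinaryPoint {
    /// Convert the raw index kept by the loop back into a named point.
    fn from_index(index: usize) -> Option<Self> {
        match index {
            0 => Some(Self::Start),
            1 => Some(Self::AfterLeft),
            2 => Some(Self::AfterRight),
            _ => None,
        }
    }

    /// Raw index to hand back to the loop.
    fn index(self) -> usize {
        self as usize
    }
}

/// Simplified stand-in for the node's answer to the loop.
#[derive(Debug, PartialEq, Eq)]
enum Step {
    SubParse { resume_at: BinaryPoint },
    Done,
}

/// The node's dispatch, written against names instead of 0/1/2.
fn next_step(continue_point: usize) -> Option<Step> {
    match BinaryPoint::from_index(continue_point)? {
        BinaryPoint::Start => Some(Step::SubParse { resume_at: BinaryPoint::AfterLeft }),
        BinaryPoint::AfterLeft => Some(Step::SubParse { resume_at: BinaryPoint::AfterRight }),
        BinaryPoint::AfterRight => Some(Step::Done),
    }
}
```

The loop can keep storing a plain `usize`, while each node's `match` reads as a list of named stages, and an out-of-range index is rejected in one place.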