
Improve archive node synchronization by adding configurable timeout flag #13852

pengin7384 opened this issue Jan 20, 2025 · 1 comment

pengin7384 commented Jan 20, 2025

Is your feature request related to a problem? Please describe.
I'm operating multiple Base and Optimism archive nodes, and I am encountering persistent issues where block synchronization slows down significantly.

Describe the solution you'd like
I would like to improve the synchronization speed of both the Base Mainnet and Optimism Mainnet archive nodes.

Describe alternatives you've considered
While observing the logs of the op-node during periods of slow synchronization, I noticed the following warning message appearing multiple times:

lvl=warn msg="Engine temporary error" err="temporarily cannot insert new safe block: failed to create new block via forkchoice: context deadline exceeded"

This warning appears when the engine_forkchoiceUpdatedV3 RPC call exceeds the configured timeout (5 seconds). The timeout itself is intentional and expected, but when a request times out, the op-node retries the same RPC call, which introduces additional delay.
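To make the failure mode concrete, below is a minimal Go sketch of this pattern, assuming a hypothetical callForkchoiceUpdated helper in place of the real engine client: the call is bounded by a context deadline, and on timeout the caller retries the same request from scratch, so a single slow block can burn through several full timeout windows.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// callForkchoiceUpdated is a hypothetical stand-in for op-node's
// engine_forkchoiceUpdatedV3 call to the execution client. Here it
// simulates a heavy block that needs ~20s of processing.
func callForkchoiceUpdated(ctx context.Context) error {
	select {
	case <-time.After(20 * time.Second):
		return nil
	case <-ctx.Done():
		return ctx.Err() // surfaces as "context deadline exceeded"
	}
}

func main() {
	const engineTimeout = 5 * time.Second // the fixed timeout described above

	for attempt := 1; attempt <= 3; attempt++ {
		ctx, cancel := context.WithTimeout(context.Background(), engineTimeout)
		err := callForkchoiceUpdated(ctx)
		cancel()
		if err == nil {
			return
		}
		if errors.Is(err, context.DeadlineExceeded) {
			// Each retry restarts the same expensive request, so the node
			// makes no forward progress while the block stays this slow.
			fmt.Printf("attempt %d: %v, retrying\n", attempt, err)
			continue
		}
		return
	}
}
```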

To address this, I experimented by increasing the timeout from 5 seconds to 30 seconds while synchronizing the Base Mainnet archive node. This change proved effective. The experiment involved three Base Mainnet archive nodes:

Blue Node: a node that was already synchronized, used as a reference for the latest block number (no other special settings applied).
Yellow Node: The node with synchronization issues (timeout increased from 5 seconds to 30 seconds).
Green Node: The node using the official image without any modifications.
After one week of monitoring synchronization, as shown in the attached graph:

[Graph: block height of the three nodes over one week of synchronization]

The Yellow Node completed synchronization up to the latest block.
The Green Node had still not caught up, because its synchronization speed only matched the block generation rate, so the gap never closed.
Further investigation through trace logs revealed that for blocks containing transactions with hundreds of logs, the engine_forkchoiceUpdatedV3 RPC call could take more than 10 seconds. (DEBUG[01-19|05:13:53.027] Served engine_forkchoiceUpdatedV3 reqid=4941 duration=20.337986076s)

Based on these findings, I suggest the following improvement:
Add a flag to make the timeout adjustable, allowing users to set it according to their needs (a sketch of what such a flag could look like follows this list).
For full node operators, this change should have little impact, since they are already synchronizing well.
For archive node operators experiencing slow synchronization, however, the flexible timeout could be beneficial.
Importantly, since the default stays the same, this change should not negatively impact any operators.
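As a sketch of what the flag could look like, here is a minimal, hypothetical example using urfave/cli, which op-node's CLI is built on. The flag name l2.engine-rpc-timeout is made up for illustration; the real name and wiring are in the PR linked below. Keeping the default at 5 seconds means nothing changes for operators who don't set it.

```go
package main

import (
	"fmt"
	"os"
	"time"

	"github.com/urfave/cli/v2"
)

// Hypothetical flag; the actual flag name and plumbing live in the PR.
var engineTimeoutFlag = &cli.DurationFlag{
	Name:  "l2.engine-rpc-timeout",
	Usage: "Timeout for engine API calls (e.g. engine_forkchoiceUpdatedV3) to the execution client",
	Value: 5 * time.Second, // current hardcoded value, so defaults are unchanged
}

func main() {
	app := &cli.App{
		Name:  "op-node-sketch",
		Flags: []cli.Flag{engineTimeoutFlag},
		Action: func(ctx *cli.Context) error {
			timeout := ctx.Duration(engineTimeoutFlag.Name)
			// This value would replace the fixed 5s constant passed to
			// context.WithTimeout around each engine RPC.
			fmt.Println("engine RPC timeout:", timeout)
			return nil
		},
	}
	if err := app.Run(os.Args); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

An archive node operator could then run, e.g., op-node-sketch --l2.engine-rpc-timeout 30s to reproduce the 30-second setting from the experiment above.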

Additional context
The versions used for testing were op-node v1.10.2 and op-geth v1.101411.4, with all other specs and flags remaining the same except for the timeout. The experiment was conducted on an in-house server.

I am running nodes both on my own server and in an AWS EC2 environment, and both environments experience similar synchronization slowdowns. One unusual observation: once a node catches up to the latest block, the degradation rarely occurs, but after a restart, the node starts to fall behind again. The proposed change was helpful in this scenario.

I also attempted to run archive nodes with reth, but because it syncs multiple blocks at once, the gap would grow by about 10 blocks and then close again, repeating this cycle. As a result, I'm currently using op-geth.


pengin7384 commented Jan 20, 2025

I’ve created PR #13853 implementing the proposed changes. It adds a configurable timeout flag for the L2 engine to improve synchronization flexibility. Please take a look and let me know your thoughts!
