Lessons Learned From Recent Outage of Polygon PoS
Earlier this month, the Polygon PoS chain experienced an outage that taught the team a number of valuable lessons. After a thorough examination, the core team is ready to share its findings and explain what is being done to prevent a similar outage in the future.
But first, the team would like to acknowledge the disruption this outage has caused to all the users and developers who rely on Polygon. At the same time, it was the support and cooperation from the broader community including ecosystem contributors, validators and infrastructure providers that allowed the team to fix the issue in the shortest time possible. We are building this together, making both the network and the ecosystem supporting it more resilient with each iteration.
The Heimdall chain had an issue with the state-sync mechanism. During the attempt to fix it, the sequence of software releases, combined with a second bug, took both Heimdall and Bor offline. Over the course of several days, the team released multiple updates and worked closely with all the stakeholders on a coordinated rollout of the changes. There was no loss of user funds or data.
For a more technical review of the events, see this forum post. Here is a simplified timeline of the events:
- March 10, 4:49 (UTC): An alert arrives indicating there is an issue with the state-sync mechanism on the Polygon PoS Mainnet. The team discovers that the Heimdall layer, which handles state-sync communications between bridge contracts on the Ethereum mainnet and Polygon PoS, has a data size check error which allows a large transaction to clog the state-sync mechanism.
- March 10, 10:10 (UTC): Heimdall v0.2.6 is pushed out, fixing the size check by reducing the limit from 100kb to 50kb. The incremental release caused a state-mismatch between different Heimdall versions which caused the chain to halt, something that could have been avoided with a hard fork.
- March 10, 16:20 (UTC): As soon as the team discovered there was an issue that would result in upcoming downtime, the team shared a forum post advising of an impending downtime, alerting users and giving them an hour and a half heads-up. Bor, the block-producing layer of Polygon PoS, depends on Heimdall to select a committee of producers out of all validators and a set of blocks for which they will do the work, collectively known as spans. Bor also halts as a result.
- March 10, 17:50 (UTC): Polygon PoS goes offline. The team decides to work on a new Heimdall release while also pushing a hotfix for Bor that includes hardcoded spans.
- March 10, 22:53 (UTC): Bor v.0.2.14 hotfix is released. Block production resumes, but bridges and PolygonScan remain offline.
- March 11, 1:30 (UTC): A forum update about the Bor hotfix taking effect.
- March 11, 09:20 (UTC): PolygonScan functionality restored.
- March 12, 08:07 (UTC): Validators are instructed to upgrade to Heimdall v0.2.7, which includes an overwrite of the 50 hard coded spans and introduces a rollback of Heimdall to a previously working block. Internal and external RPC nodes come back online.
- March 13, 6:58 (UTC): Polygon PoS is stable. Binance is upgrading its nodes to resume deposits and withdrawals.
- March 13, 18:15 (UTC): Forum post on bridge operations returning to normal.
- March 17, 4:22 (UTC): Validators are instructed to update to Heimdall v0.2.8, a hard fork which will allow the chain to create new spans normally. Heimdall is updated at block number 8664000.
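To make the v0.2.6 state mismatch more concrete, here is a minimal sketch of why tightening a validation rule in an incremental release can split a network, and why gating the same change behind a hard-fork height avoids it. This is not Heimdall's actual code: the function names, the byte values chosen for "100kb" and "50kb", the payload size and the block heights are all assumptions made for illustration.

```python
from typing import Iterable, Optional

# Illustrative values only; the post gives the limits as "100kb" and "50kb".
OLD_LIMIT_BYTES = 100_000
NEW_LIMIT_BYTES = 50_000


def limit_at(height: int, activation_height: Optional[int]) -> int:
    """Size limit a node enforces at a given block height.

    activation_height is None for a node still running the old software,
    0 for a node that applies the new rule as soon as it is upgraded, and
    a positive number for a node that switches at an agreed fork height.
    """
    if activation_height is not None and height >= activation_height:
        return NEW_LIMIT_BYTES
    return OLD_LIMIT_BYTES


def accepts(payload_size: int, height: int, activation_height: Optional[int]) -> bool:
    """Whether a node considers a state-sync payload of this size valid."""
    return payload_size <= limit_at(height, activation_height)


def nodes_agree(payload_size: int, height: int,
                activation_heights: Iterable[Optional[int]]) -> bool:
    """True if every node reaches the same verdict on the payload."""
    verdicts = {accepts(payload_size, height, a) for a in activation_heights}
    return len(verdicts) == 1


if __name__ == "__main__":
    payload = 70_000  # hypothetical payload between the two limits

    # Incremental rollout: one node on old software, one already upgraded.
    print("incremental rollout agrees:", nodes_agree(payload, 100, [None, 0]))

    # Hard fork: every node ships the new rule but enables it at height 1000.
    print("before fork height agrees:", nodes_agree(payload, 100, [1000, 1000]))
    print("at fork height agrees:", nodes_agree(payload, 1000, [1000, 1000]))
```

In the incremental case the two nodes disagree about the same payload at the same height, so their states diverge. With a shared activation height, every node applies the old limit before the fork and the new limit after it, so verdicts stay identical throughout the upgrade window.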
This incident provided an opportunity for the team to reflect and find new approaches to challenges, paving the way for Polygon to become a truly battle-tested network. Both the network and the community come out stronger from the experience. Here are the lessons learned and what we are working on to improve:
- The team is making Heimdall's state-sync mechanism more robust by adding a dynamic transaction gas limit, increasing the block gas limit and adding bulk state-sync transactions.
- In the long term, a redesign of the Heimdall/Bor architecture is in the works that will loosen the tight coupling between the bridge mechanism and the chain's consensus and core systems. This will be implemented in the next version of the chain, tentatively codenamed v3, which will merge the Heimdall and Bor nodes and chains and remove the span mechanism.
- The team plans to perform more rigorous testing of all code changes going forward, institute internal audits for all releases, and introduce more peer reviews for complex changes and upgrades. There will also be changes to Mumbai testnet to make it a more suitable testing ground.
- On the communications side, a special team has been assembled to focus on streamlining communication and building a framework to respond to unforeseen incidents more effectively.
- The same team is also exploring a status page and social presence that allow people to quickly get a sense of the status of the network. A number of seasoned new hires are also bringing expertise in technical contingency management. This will allow us to continuously improve our release and communication processes and help mitigate similar issues in the future.
Last but not least, the team appreciates our community, who always do their part to get the word out and share our updates. Special thanks go out to Pete Kim at Coinbase and the teams at VitWit and Informal Systems.
Let’s bring the world to Ethereum!