Empowering the Blockchain Community: Automated Snapshot Optimization and Ecosystem Advancements

A deep dive into architecture design, strategic decision-making, and leveraging tools like AWX, Terraform, and Cloudflare R2

Polygon Labs
January 4, 2024
Technical Blog

Chaindata snapshots are pivotal checkpoints that capture the current state of a blockchain network and are fundamental for efficiently syncing new nodes to the network. Polygon Labs transitioned from manual snapshotting to an automated system to streamline accessibility and enable faster onboarding for aspiring network participants. This post encapsulates the team’s process and learnings.

Snapshots allow node operators to quickly catch up to the latest validated transactions and consensus state. By packaging the chain's state up to a recent block into a single downloadable artifact, snapshots significantly reduce the time and resources required for new nodes to sync with the network, enabling quicker node participation and fostering further decentralization. They play a vital role in accessibility and scalability, benefiting the broader blockchain ecosystem by easing the onboarding of new participants and contributing to the network's overall resilience and robustness.

Historically, DevOps engineers could spend up to 20 hours/week managing periodic snapshots. Moreover, hosting large snapshots (about 1.5 TB for Bor mainnet) on S3 led to exponentially growing data egress charges, since every public download incurs transfer fees (>$100,000 per month).

By leveraging an internal AWX (open-source Ansible Tower) service, Polygon Labs saw an opportunity to set up a scheduled ansible-playbook job that automatically pauses the Bor/Heimdall clients, prunes Bor chaindata, compresses the data with zstd, uploads the final output files to Cloudflare R2 (which charges no data egress fees for public downloads), and restarts all blockchain clients. Since these snapshots are public-facing and essential for bootstrapping new nodes that join the Polygon PoS network, clear metrics and monitors had to be created to ensure snapshot-system health.
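At a high level, the scheduled job performs steps equivalent to the shell sketch below. This is a minimal illustration only: the service names, paths, bucket name, and the exact pruning command are assumptions, not the production playbook.

    #!/usr/bin/env bash
    set -euo pipefail

    # Illustrative values; the real playbook parameterizes these per network.
    BOR_HOME=/var/lib/bor                                  # bor --datadir
    SNAPSHOT="bor-mainnet-snapshot-$(date +%F).tar.zst"

    # 1. Pause the clients so chaindata is consistent on disk
    sudo systemctl stop bor heimdalld

    # 2. Prune Bor chaindata to shrink the snapshot (assumed geth-style offline pruning)
    bor snapshot prune-state --datadir "$BOR_HOME"

    # 3. Compress with tar + zstd, streaming to avoid an intermediate copy
    tar -cf - -C "$BOR_HOME" chaindata | zstd -T0 > "$SNAPSHOT"

    # 4. Upload the final artifact to Cloudflare R2 (no egress fees on public download)
    rclone copy "$SNAPSHOT" r2:snapshots

    # 5. Restart the clients and let them catch back up
    sudo systemctl start heimdalld bor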

Architecture Design for Automated Snapshotting

Polygon Labs leveraged an internal AWX deployment platform for its scheduled job capabilities and ability to easily trigger ansible-playbook jobs on several remote blockchain nodes. Scheduled jobs take monthly pruned snapshots of Bor/Heimdall chaindata and upload the final compressed files to Cloudflare R2. Find the most recent snapshot download links and technical docs here.

E2E automatic snapshotting system design flow

For redundancy, Polygon Labs runs two types of nodes under the hood: a) nodes dedicated to monthly pruned snapshots (mainnet and Mumbai) and b) nodes dedicated to EBS volume-optimized snapshots. Optimized for fast recovery and minimal sync time in internal use cases, EBS volume snapshots go from zero resources to a fully synced Polygon PoS fullnode in less than 2 hours (i.e., no download or decompression steps). To remain cloud-agnostic, Polygon Labs offers tar.zst-compressed chaindata for public download.

Technical Stack

  • Ansible
      • Handles initial node configuration, pausing nodes, pruning Bor data, compressing chaindata, uploading final files to Cloudflare R2, and restarting nodes (via SSM)
  • AWX (Ansible Tower)
      • Serves as the deployment controller node
      • Offers the scheduled-job service that orchestrates the snapshot creation process on all remote blockchain nodes
      • Provides a powerful developer UI and API for launching new fullnodes, devnets, etc.
  • Terraform Cloud
      • Handles initial node-resource creation and resource state management, triggered from within ansible-playbook runs via AWX
      • Automatically mounts the optimized chaindata EBS volume on new node creation
      • Able to go from zero resources to a fully synced mainnet Polygon PoS fullnode in < 2 hours
  • Cloudflare R2
      • Charges $0/month in data egress fees for public downloads of compressed chaindata files
      • Yields significant cost savings vs. S3, where every public snapshot download incurs transfer fees
      • Uses the rclone package for auth and rapid data transfer to R2 (see the sketch after this list)
  • tar and zstd compression
      • Fast, tightly packed compression that replaced tar+gzip for snapshotting after a deep-dive analysis
  • Datadog monitoring
      • A systemd service on each node periodically emits Heimdall/Bor sync-status metrics
      • Emitting all metrics to Datadog allows real-time calculation of each node's head-block rate of increase and helps ensure the uptime SLA
      • Dashboards and monitors are set up to alert immediately if a node falls behind or misbehaves
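As referenced above, rclone handles both auth and transfer to R2. A minimal setup might look like the following; the remote and bucket names are hypothetical, and R2 is addressed through rclone's S3-compatible backend:

    # One-time remote config: R2 exposes an S3-compatible API.
    # <account-id> and the credentials are placeholders.
    rclone config create r2 s3 \
      provider Cloudflare \
      access_key_id "$R2_ACCESS_KEY_ID" \
      secret_access_key "$R2_SECRET_ACCESS_KEY" \
      endpoint "https://<account-id>.r2.cloudflarestorage.com"

    # Parallel multipart upload of a compressed snapshot to the public bucket
    rclone copy bor-mainnet-snapshot.tar.zst r2:snapshots \
      --s3-upload-concurrency 16 --s3-chunk-size 64M --progress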

Deploying With AWX: A Powerful Controller

Also known as open-source Ansible Tower, AWX offers a highly flexible developer UI that lets engineers deploy any GitHub commit to a variety of cloud environments (AWS, GCP, Azure). Devs can easily configure or trigger ansible-playbook jobs via GitHub webhooks, the UI, or the AWX REST API.

Within Ansible playbooks, Terraform Cloud is triggered for resource creation and state locking. Through AWX, Polygon Labs can fully configure and start nodes automatically over an SSM connection that runs a series of Ansible role tasks. Ansible's integration with EC2 dynamic inventory allows quick discovery of, and access to, the nodes created and tagged by Terraform.
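For example, launching a preconfigured job template through the AWX REST API is a single authenticated POST; the host, template ID, and extra_vars below are placeholders:

    # Launch an AWX job template via the REST API (ID 42 is hypothetical)
    curl -s -X POST \
      -H "Authorization: Bearer $AWX_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"extra_vars": {"network": "mainnet", "client": "bor"}}' \
      https://awx.example.internal/api/v2/job_templates/42/launch/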

Polygon Labs runs AWX on ECS (Elastic Container Service) with Fargate, managed by our DevTools team. The deployment is highly available and sits behind an application load balancer for traffic routing. In terms of security, AWX offers highly granular RBAC, which safeguards user actions in the AWX UI and API. It supports GitHub SSO login at the organizational level, allowing simple developer access and a clear permissioning system.

Ultimately, AWX has proved powerful for deploying not only individual Polygon PoS fullnodes but also entire blockchain devnet environments that closely emulate real-world, server-based blockchain architectures, as depicted below:

AWX controller node high-level system design.

Case Study: Rapid Fullnode Creation

EBS-optimized snapshots give Polygon Labs the ability to stand up, start, and fully sync a Polygon PoS fullnode in under 2 hours. The process is entirely one-click via AWX. Devs can easily configure the machine type, Bor/Heimdall versions, and much more via AWX custom deployment params at launch time.

AWX scheduled ansible-playbook jobs take new EBS volume chaindata snapshots every 8 hours, keeping data staleness to a minimum. These snapshots are 'fast snapshot restore' optimized, so all chaindata is fully initialized for any derived EBS volumes. There's potential to share these optimized snapshots with the community in the future so the entire blockchain community can stand up and fully sync Polygon PoS nodes in just a few hours.
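Concretely, 'fast snapshot restore' is the EBS feature that pre-initializes a snapshot's blocks so derived volumes deliver full performance immediately. Enabling it after each scheduled snapshot looks roughly like this (the snapshot ID and availability zones are placeholders):

    # Enable EBS fast snapshot restore so new volumes need no block initialization
    aws ec2 enable-fast-snapshot-restores \
      --availability-zones us-east-1a us-east-1b \
      --source-snapshot-ids snap-0123456789abcdef0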

Rapid PoS V1 Fullnode Creation, Sync Time Benchmarking:

Optimizing Compression for Enhanced User Experience

Polygon Labs compared the performance of several compression algorithms, including tar + gzip (via pigz), tar + zstd, and tar + lz4, to determine which was most suitable for our use case. The focus was on minimizing data transfer fees and reducing download and extraction times for end users.

Algorithm Performance

The tests were conducted on an Ubuntu Linux machine (m5d.4xlarge). The test data was Polygon PoS V1 testnet Bor data with a raw size of 221 GB.
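Each run boiled down to timing commands of the following shape against that same dataset. This is a sketch; the thread counts are assumptions rather than the exact test parameters:

    # tar + pigz (parallel gzip) baseline
    time tar -cf - -C /data/bor chaindata | pigz -p 16 > bor.tar.gz

    # tar + zstd, multithreaded
    time tar -cf - -C /data/bor chaindata | zstd -T16 > bor.tar.zst

    # tar + lz4 for comparison
    time tar -cf - -C /data/bor chaindata | lz4 > bor.tar.lz4

    # Decompression timing, e.g. for the zstd artifact
    time zstd -d --stdout bor.tar.zst | tar -xf - -C /restore

Here's a summary of the findings: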

Compression Takeaways

While the compression ratios were not drastically different, tar + zstd showed a consistent 8% decrease in total compressed file size. This becomes meaningful for Polygon PoS data, which is roughly ~1.3 TB compressed: an 8% reduction amounts to ~100 GB saved on Bor data alone.

Interestingly, while tar + zstd takes 40% longer to compress, it decompresses 11% faster than tar + gzip. This matters because decompression happens on the end user's machine: for Polygon PoS Bor data, tar + zstd ultimately saves end users >1 hr of total decompression time. For all these reasons, the pros outweigh the cons, and tar + zstd was selected as the snapshotting compression system.

Re-evaluating Data Hosting Providers

Initially, Polygon Labs relied on S3 for data hosting, where Data Transfer - Internet (Out) contributed 72.16% of the monthly costs. As we began exploring Cloudflare R2, we discovered it charges no data transfer fees for public downloads, a feature with the potential to dramatically reduce our data egress bill, especially given our heavy Bor mainnet chaindata sitting at ~1.5 TB compressed.

Exponentially growing data egress costs on S3 from public download of chaindata snapshots.

Monitoring and Alerting With Datadog

To get real-time metrics on the health of all snapshot nodes (going beyond system specs to track the rate of head-block increase and whether Bor/Heimdall report 'fully synced'), a lightweight systemd service was created that periodically emits vital client metrics to Datadog. Once the metrics are ingested, real-time node-health dashboards and monitors can be built. Monitors are connected to an internal channel and are easily configurable with on-call paging services, so engineers can act immediately on any irregularities.
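The emitter itself can be as simple as a timer-driven script that polls each client's local RPC endpoint and forwards gauges to the Datadog agent's DogStatsD port. The metric names, ports, and endpoints below are illustrative assumptions:

    #!/usr/bin/env bash
    # Illustrative sync-status emitter, run periodically by a systemd timer.
    set -euo pipefail

    # Bor head block via the standard Ethereum JSON-RPC
    BOR_HEX=$(curl -s -H 'Content-Type: application/json' \
      --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
      http://127.0.0.1:8545 | jq -r .result)
    BOR_HEAD=$((BOR_HEX))   # bash converts the 0x-prefixed hex to decimal

    # Heimdall sync status from its Tendermint RPC
    CATCHING_UP=$(curl -s http://127.0.0.1:26657/status | jq -r .result.sync_info.catching_up)
    SYNCED=$([ "$CATCHING_UP" = "false" ] && echo 1 || echo 0)

    # Ship gauges to the local Datadog agent over DogStatsD (UDP 8125)
    TAG="host:$(hostname)"
    printf 'bor.head_block:%s|g|#%s\n' "$BOR_HEAD" "$TAG" | nc -u -w1 127.0.0.1 8125
    printf 'heimdall.synced:%s|g|#%s\n' "$SYNCED" "$TAG" | nc -u -w1 127.0.0.1 8125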

Balancing Performance and Cost: Optimization Results

Saving engineering time

An engineer's time is incredibly valuable. The time required to manually start and oversee the snapshotting process has dropped from ~20 hours/week to a mere 0.5 hours/week (assuming minimal Datadog dashboard review and occasional on-call maintenance). This 97.5% decrease in time allocation lets engineers focus on more pressing tasks while maintaining confidence in the system's efficiency. Moreover, for end users, downloading and extracting chaindata is now a simple one-line terminal command, for example:

    curl -L https://snapshot-download.polygon.technology/snapdown.sh | bash -s -- --network mainnet --client bor --extract-dir chaindata --validate-checksum true

Smart cost-saving decisions

Polygon Labs implemented two major cost-optimization changes: 1) transitioning from S3 to Cloudflare R2 for snapshot storage, and 2) switching from gzip to zstd for the chaindata compression algorithm. Moving snapshots to Cloudflare R2 eliminates data egress fees for public downloads of blockchain data sets. Upon analysis, we found that current S3 data egress fees for snapshot downloads are roughly ~$120,000/month. With Cloudflare R2 and the new automatic snapshotting system, Polygon Labs expects to pay ~$18,000/month: a reduction of ~$102,000/month, an approximately 85% decrease in monthly data egress charges, and annual savings totaling >$1.1MM.

Projected cost savings switching to Cloudflare R2

In addition, adopting zstd compression produced more tightly compressed files, offering compounded benefits since snapshots are uploaded to R2 regularly. While the improvement in compression ratio is not drastic, it is expected to save end users >1 hr in total download and extraction time, providing a more seamless and efficient user experience.

Furthermore, to enhance the user experience and reduce costs, Polygon Labs introduced a snapshotting update that splits large chaindata files into 25 GB parts. This enables Cloudflare caching on the smaller file parts and continues to drive down fees related to public request traffic.
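Under the hood this can be done with standard tooling. A sketch of producing the parts and reassembling them on the user's side (file names are illustrative):

    # Split the compressed snapshot into 25 GB parts for cacheable downloads
    split -b 25G -d bor-mainnet-snapshot.tar.zst bor-mainnet-snapshot.tar.zst.part-

    # On the user's side: stream the parts back together, decompress, and extract
    cat bor-mainnet-snapshot.tar.zst.part-* | zstd -d --stdout | tar -xf - -C chaindata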

Next Steps for Further Optimization

Open-Sourcing Bootstrap Playbooks

There is potential to make the ansible-playbooks and Terraform IaC discussed in this post public, for provisioning and configuring new Polygon PoS fullnodes from scratch.

Community Snapshot Aggregator Model

As announced on our forum in Q4 2023, Polygon Labs will pursue a more community-driven snapshotting experience moving forward. This opens the door for third-party data providers and validators to offer their own snapshots to the Web3 community.

An aggregator page will display all valid snapshots (Heimdall/Bor/Erigon; Mumbai/mainnet) across all opted-in providers, and the community can decide which snapshots are most reliable for their purposes. External snapshot providers are free to add features such as Ethereum mainnet snapshots, daily incremental snaps, and beyond. Overall, this strategic shift will significantly increase the reliability and redundancy of the Polygon PoS chaindata snapshotting system.

Conclusion

The journey to automated blockchain snapshots was both challenging and rewarding. Through innovative architecture design, strategic decision-making, and leveraging powerful tools like AWX, Terraform, and Cloudflare R2, Polygon Labs successfully optimized the snapshotting process, making it more efficient, cost-effective, and timely.

Special acknowledgements to John Hilliard (head of DevTools), Vince Reed (head of DevOps), and Piyush Maloo (former DevOps engineer) for the feedback and collaboration necessary to pull this off. Polygon Labs' focus on finding an optimal balance between performance and cost led to significant improvements: reduced snapshot staleness, valuable engineer time saved, and data egress costs slashed. Ultimately, this process created a system that not only benefits our internal operations but also provides enhanced value to the entire blockchain developer community.

Get Involved!

Check out the technical blog for more in-depth content on the workings of Polygon protocols. Tune into the blog and our social channels to keep up with updates about the Polygon ecosystem.

Together, we can build an equitable future for all through the mass adoption of Web3.

Website | Twitter | Developer Twitter | Forum | Telegram | Reddit | Discord | Instagram | Facebook | LinkedIn
